A MeSH-based text mining method for identifying novel prebiotics

December 2016
Medicine 95(49):e5585

DOI:10.1097/MD.0000000000005585

License
CC BY-NC-ND 4.0

Authors:

Guangyu Shan

Yiming Lu

National Center for Protein Sciences (Beijing)

Wubin Qu

Figure 3. Feature enrichment analysis. The top 20 qualifier names were extracted from the 800 features. The categories in the figure roughly indicate the high-level concepts behind the 800 features. These concepts are highly correlated with the real-world chemical properties of prebiotics.

Figure 4. Hierarchical clustering of carbohydrates. Carbohydrates were clustered hierarchically using the selected features. Many branch structures are highly correlated with the MeSH tree in NLM, from which we infer that the features capture a large portion of prebiotic properties and can be employed to predict potential prebiotics in the prediction step. MeSH = medical subject headings, NLM = National Library of Medicine.

A MeSH-based text mining method for identifying novel prebiotics

Guangyu Shan, MS, Yiming Lu, PhD, Bo Min, PhD, Wubin Qu, MS, Chenggang Zhang, PhD∗

Abstract

Prebiotics contribute to the well-being of their host by altering the composition of the gut microbiota. Discovering new prebiotics is a challenging and arduous task due to strict inclusion criteria; thus, highly limited numbers of prebiotic candidates have been identified. Notably, the large numbers of published studies may contain substantial information attached to various features of known prebiotics that can be used to predict new candidates. In this paper, we propose a medical subject headings (MeSH)-based text mining method for identifying new prebiotics with structured texts obtained from PubMed. We defined an optimal feature set for prebiotics prediction using a systematic feature-ranking algorithm with which a variety of carbohydrates can be accurately classified into different clusters in accordance with their chemical and biological attributes. The optimal feature set was used to separate positive prebiotics from other carbohydrates, and a cross-validation procedure was employed to assess the prediction accuracy of the model. Our method achieved a specificity of 0.876 and a sensitivity of 0.838. Finally, we identified a high-confidence list of candidates of prebiotics that are strongly supported by the literature. Our study demonstrates that text mining from high-volume biomedical literature is a promising approach in searching for new prebiotics.

Abbreviations: AUC = area under the curve, MeSH = medical subject headings, NLM = National Library of Medicine, RF = random forest, ROC = receiver operating characteristic curve, XML = extensible markup language.

Keywords: Carbohydrates, MeSH-term, Prebiotics, Prebiotics prediction, Text mining

1. Introduction

The health benefits of prebiotics, such as cancer risk reduction, immune system enhancement, and constipation relief, have been widely accepted. A food ingredient can be considered a prebiotic only when it satisfies 3 criteria: (1) resistance to gastric acidity and mammalian enzymes, (2) susceptibility to fermentation by the intestinal microbiota, and (3) selective stimulation of the growth and/or activity of beneficial intestinal microbiota.[1] Identifying new prebiotics in accordance with these 3 criteria by screening various chemical compounds is a very laborious and challenging task. Scientists have been performing related work since 1995, when the criteria were first proposed. However, only 2 carbohydrates had been reported by 2007: inulin and fructooligosaccharides.[1]

Several researchers began to develop other approaches by reviewing published literature and searching for keywords in PubMed, and 3 carbohydrates were shown to alter the microbiota balance of the large bowel by increasing the number of bifidobacteria and lactobacilli. The success of these studies suggested the possibility of using a text mining-based method to identify prebiotics by transforming the inclusion criteria into a collection of literal features. Text mining efforts have developed a variety of approaches to obtain information from structured biomedical text using techniques such as machine learning, natural language processing, biostatistics, information technology, and pattern recognition.[2]

In the rapidly growing fields of knowledge discovery and text mining, relevant literature can be used to obtain implicit and unrevealed information. Swanson[3] began to mine information from biomedical literature for Raynaud disease treatment in 1986. He found in one biomedical paper that Raynaud disease is a peripheral circulatory disorder associated with and exacerbated by high platelet aggregation, high blood viscosity, and vasoconstriction; in other biomedical literature, he found that fish oil could reduce these symptoms. Accordingly, he proposed the hypothesis that fish oil may be helpful for people suffering from Raynaud disease, which had not previously been reported. Three years later, this hypothesis was clinically confirmed by DiGiacomo et al.[4] Following this method, Ramadan et al[5] traced 11 indirect connections between migraines and magnesium using summaries of published papers, and the effect of magnesium was later experimentally validated.[6] Text mining has thus become an indispensable tool for extracting knowledge from biomedical literature.

Observational Study

Editor: Giovanni Tarantino.

GS and YL have contributed equally to this work.

Author Contributions: Conceived and designed the experiments: GS, YL, and BM. Performed the experiments: GS. Analyzed the data: GS, YL, and BM. Contributed reagents/materials/analysis tools: GS. Wrote the paper: GS, YL, WQ, and CZ.

Funding provided by the National Basic Research Project (973 program) (2012CB518200), the General Program (31401141, 81573251, 30900830) of the Natural Science Foundation of China, the State Key Laboratory of Proteomics of China (SKLP-Y201303, SKLP-O201104, and SKLP-K201004), and the Special Key Programs for Science and Technology of China (2012ZX09102301–016).

The authors have no conflicts of interest to disclose.

Supplemental Digital Content is available for this article.

Beijing Institute of Radiation Medicine, State Key Laboratory of Proteomics, Cognitive and Mental Health Research Center, Beijing, PR China.

∗ Correspondence: Chenggang Zhang, Academy of Military Medical Sciences, Beijing, PR China (e-mail: zhangcg@bmi.ac.cn).

This is an open access article distributed under the terms of the Creative Commons Attribution-Non Commercial-No Derivatives License 4.0 (CCBY-NC-ND), where it is permissible to download and share the work provided it is properly cited. The work cannot be changed in any way or used commercially.

Medicine (2016) 95:49(e5585)

Received: 7 August 2016 / Received in final form: 2 November 2016 / Accepted: 7 November 2016

http://dx.doi.org/10.1097/MD.0000000000005585

Feature selection is a critical procedure in text mining for teasing out valuable features from large amounts of data.[7] Many techniques, such as support vector machine (SVM),[8] genetic programming (GP),[9,10] logistic regression (LR),[11] and probabilistic neural network (PNN),[12] can perform this process only in a general and cursory manner. The MedMeSH summarizer can assess very large amounts of biomedical data in a short period and is generally used for genome-wide expression profiles.[13] It achieves decent performance in specific, as opposed to general, assessments.

Inspired by MedMeSH and the philosophy of mining tacit knowledge from biomedical literature, we herein developed a novel medical subject headings (MeSH)-based text mining method for identifying new prebiotics utilizing the PubMed database. PubMed comprises more than 24 million citations for biomedical literature from MEDLINE, life science journals, and online books.[14] MeSH is the National Library of Medicine (NLM) controlled vocabulary thesaurus used for indexing articles for PubMed. We extracted features from MeSH because it is easily available through the PubMed service of the National Library of Medicine, whereas full texts of research studies are often only accessible by subscription.[15] Additionally, utilizing MeSH rather than the full text not only reduces computation time but also enables higher dataset throughput.[16] Bhattacharya et al demonstrated that MeSH terms could represent the whole text accurately if screened appropriately; that is, representative features can be extracted from massive amounts of literature using these high-quality terms.[16]

We hypothesized that carbohydrates with the properties of prebiotics share similar literal features. To better extract the features of known prebiotics, we first used an exhaustive text mining approach to mine prebiotic-related topical MeSH terms from structured documents downloaded from PubMed. We then selected a list of optimal MeSH terms that are closely related to known prebiotics[17] and ranked a large set of carbohydrates according to the scores calculated from their MeSH frequency profiles. Finally, we used a cross-validation technique to assess the prediction accuracy of our model.

2. Methods

2.1. Data preparation

First, 2 kinds of data were prepared: a positive prebiotics set and a carbohydrates set. We used a list of positive prebiotics summarized by Al-Sheraji et al.[14] The list, shown in Table 1, contains 15 prebiotics, which we denote as the positive prebiotics set. Nearly all positive prebiotics are non-digestible carbohydrates. Thus, we constructed the carbohydrates set using the official names of all available carbohydrates from the NLM MeSH tree structures. To ensure the specificity of the prediction, only carbohydrates at the lowest level of the tree were selected, except where the lowest-level carbohydrates could not cover the carbohydrates represented by their parent node (in this case, the parent node was also included). Positive prebiotics were removed from the carbohydrates set. The final carbohydrates set contains 112 carbohydrates (Supporting Information, S1 Table. The official names of carbohydrates set. (XLSX), http://links.lww.com/MD/B447; S2 Table. The official names of 50 positives for method validation. (XLSX), http://links.lww.com/MD/B448). Each of the names of the 15 positive prebiotics and 112 carbohydrates was used as a query to search for relevant literature in PubMed, and the hit documents were downloaded in extensible markup language (XML) format. MeSH terms in the XML documents were extracted using the ElementTree Python package. Each substance is therefore associated with a MeSH term list extracted from its relevant literature. Each list contains thousands of features, which provides a robust foundation for the final model. This study did not require ethical approval or informed consent because all analyses were based on data extracted from previously published literature.
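The extraction step can be illustrated with a minimal sketch that counts MeSH descriptor names in a PubMed XML export using Python's standard `xml.etree.ElementTree` module. The embedded sample records, the function name `mesh_term_counts`, and the choice to count one occurrence per MeSH heading are assumptions made for illustration; the paper does not describe the authors' actual script.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny hand-written sample in the PubMed XML layout (illustrative, not real records).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><MeshHeadingList>
    <MeshHeading>
      <DescriptorName>Inulin</DescriptorName>
      <QualifierName>metabolism</QualifierName>
    </MeshHeading>
    <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
  </MeshHeadingList></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><MeshHeadingList>
    <MeshHeading>
      <DescriptorName>Inulin</DescriptorName>
      <QualifierName>chemistry</QualifierName>
    </MeshHeading>
    <MeshHeading><DescriptorName>Bifidobacterium</DescriptorName></MeshHeading>
  </MeshHeadingList></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""


def mesh_term_counts(xml_text):
    """Count how often each MeSH descriptor name occurs in a PubMed XML string."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    # Every <MeshHeading> carries one <DescriptorName> and optional <QualifierName>s.
    for heading in root.iter("MeshHeading"):
        descriptor = heading.find("DescriptorName")
        if descriptor is not None and descriptor.text:
            counts[descriptor.text.strip()] += 1
    return counts


counts = mesh_term_counts(SAMPLE_XML)
print(counts.most_common())
```

For files on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`; very large downloads may be better served by `ET.iterparse`.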

2.2. Stop words filtering

Stop words, which can undermine the efficacy and effectiveness of the mining task because of their high frequency, usually need to be removed first. MeSH curators have already removed traditional stop words such as “a,” “the,” and “for”; however, some MeSH terms with extremely high frequency remain, which significantly reduces model performance. These MeSH terms were filtered according to Zipf's law, which states that the frequency of a word is inversely proportional to its frequency rank among all words in a given natural language corpus. Thus, the purity of the corpus can be optimized by removing MeSH terms with particularly high frequency using the following filter procedure.

1. Initiate a query list containing all carbohydrates in the positive prebiotics set and the carbohydrates set;

2. Rank their MeSH terms in descending order of total frequency. We considered the first region (the top 20 high-frequency terms) of the Zipf curve. Four colleagues in our lab who specialize in prebiotics examined this candidate list and removed from it the terms that are biologically important;

3. The remaining MeSH terms from this region constituted the MeSH stop words list.
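The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the toy counts, and the `biologically_important` set (standing in for the manual expert review) are hypothetical and not taken from the paper.

```python
from collections import Counter


def stop_word_candidates(term_counts_by_substance, top_n=20):
    """Pool term counts over all substances and return the top_n most frequent terms."""
    total = Counter()
    for counts in term_counts_by_substance.values():
        total.update(counts)
    return [term for term, _ in total.most_common(top_n)]


def build_stop_list(candidates, biologically_important):
    """Keep only candidates that the reviewers did not flag as biologically important."""
    return [term for term in candidates if term not in biologically_important]


# Toy counts for two substances (illustrative numbers only).
TOY_COUNTS = {
    "Inulin": Counter({"Humans": 50, "Animals": 40, "Dietary Fiber": 30, "Bifidobacterium": 5}),
    "Lactulose": Counter({"Humans": 60, "Animals": 10, "Dietary Fiber": 5}),
}

candidates = stop_word_candidates(TOY_COUNTS, top_n=3)
stop_list = build_stop_list(candidates, biologically_important={"Dietary Fiber"})
print(stop_list)
```

In this sketch the expert review is reduced to a set lookup; in the study it was a manual judgment by four colleagues.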

2.3. Data normalization

The normalization of MeSH term frequency is necessary because well-studied prebiotics retrieve much more literature than other prebiotics, which would introduce bias into the ultimate feature set of the cluster. To avoid this, the frequency matrix is normalized according to Eq. (1), where α (0 ≤ α ≤ 1) is a normalization parameter controlling the degree of correlation with the corpus volume: α = 0 implies no normalization and α = 1 implies complete normalization. We first build a positive prebiotics MeSH frequency matrix $f_{ij}$ with numerical values, where each row represents a prebiotic and each column refers to a MeSH term occurring in the positive prebiotics set. M denotes prebiotics (rows). Thus, $F_{ij}$ is the absolute MeSH term frequency, while $f_{ij}$ is the relative MeSH term frequency of each positive prebiotic.

$$f_{ij} = \frac{F_{ij}}{\left(\sum_{i=1}^{M} F_{ij}\right)^{\alpha}} \quad (0 \le \alpha \le 1) \tag{1}$$

Table 1

Types and sources of known prebiotics.

| Type of prebiotic | Sources of prebiotics | References |
| --- | --- | --- |
| Inulin | Wheat, onion, bananas | [1] |
| Fructooligosaccharides | Asparagus, sugar beet, garlic, etc. | [18] |
| Isomaltulose | Honey, sugarcane juice | [19] |
| Xylooligosaccharides | Bamboo shoots, fruits, vegetables, etc. | [20] |
| Galactooligosaccharides | Human milk and cow's milk | [21] |
| Cyclodextrins | Water-soluble glucans | [22] |
| Raffinose oligosaccharides | Seeds of legumes, lentils, peas, etc. | [23] |
| Soybean oligosaccharides | Soybean | [24] |
| Lactulose | Lactose (milk) | [25] |
| Lactosucrose | Lactose | [26] |
| Palatinose | Sucrose | [19] |
| Maltooligosaccharides | Starch | [27] |
| Isomaltooligosaccharides | Starch | [27] |
| Arabinoxylooligosaccharides | Wheat bran | [28] |
| Enzyme-resistant dextrin | Potato starch | [29] |

2.3.1. Feature selection. To select features from the matrix described above, we utilized the MedMeSH summarizer algorithm, which has been applied to assign pertinent MeSH terms to describe the functionality of a group of genes.[30] The MedMeSH summarizer summarizes a group of genes by filtering biomedical literature and assigning relevant keywords describing the functionality of the genes. This system constructs a P × Q co-occurrence matrix where P denotes the genes in the cluster and Q denotes the MeSH terms extracted from the retrieved literature. Each cell value of the matrix is the frequency of the corresponding MeSH term. With this matrix, an overall score for each MeSH term can be calculated, and the most influential terms are selected to describe the functionality of the cluster. Here, we utilized this matrix to classify all the MeSH terms into two fields: Major topics and Particular topics.

2.3.2. Major Topics. Terms occurring in most prebiotics with high frequency. N denotes MeSH terms (columns). Criterion $R_1$: rank the MeSH terms in decreasing order of the means $\mu_i$.

$$\mu_i = \frac{\sum_{j=1}^{N} f_{ij}}{N} \quad (i = 1, \ldots, M) \tag{2}$$

2.3.3. Particular Topics. Terms occurring in a subset of prebiotics with high frequency. $\sigma_i$ in Eq. (3) is the standard deviation of the MeSH feature vector, which is compared with the mean $\mu_i$ as a ratio. Criterion $R_2$: rank the MeSH terms in decreasing order of the ratios $\sigma_i^2/\mu_i$.

$$\sigma_i = \sqrt{\frac{\sum_{j=1}^{N} (f_{ij} - \mu_i)^2}{N}} \quad (i = 1, \ldots, M) \tag{3}$$

All MeSH terms in the matrix are ranked in accordance with the 2 criteria described above and assigned an overall rank $R$ by Eq. (4). The weight parameter $w$ provides a summary of the cluster by balancing the major and particular topics. MeSH terms are arranged by their overall relevance ranks $R$ in ascending order. The top $k$ MeSH terms are then retained as the prebiotics summary feature set, which is used to construct the normalized matrix for subsequent prediction.

$$R = wR_1 + (1 - w)R_2 \quad (0 \le w \le 1) \tag{4}$$
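The ranking procedure of Eqs. (1) to (4) can be sketched in plain Python. Several points here are assumptions rather than details given in the paper: each substance's counts are divided by its own total count raised to the normalization parameter (the reading that matches the stated goal of correcting for corpus volume), rank positions start at 1, ties are broken alphabetically, and all function names and toy counts are hypothetical.

```python
from statistics import fmean, pvariance


def normalize(counts_by_substance, alpha):
    """Divide each substance's term counts by (its total count) ** alpha."""
    normalized = {}
    for substance, counts in counts_by_substance.items():
        volume = sum(counts.values()) ** alpha
        normalized[substance] = {term: c / volume for term, c in counts.items()}
    return normalized


def rank_terms(normalized, w):
    """Return all terms ordered by the combined rank w*R1 + (1-w)*R2, best first."""
    substances = list(normalized)
    terms = sorted({term for row in normalized.values() for term in row})
    mean, ratio = {}, {}
    for term in terms:
        values = [normalized[s].get(term, 0.0) for s in substances]
        mean[term] = fmean(values)
        # Variance-to-mean ratio; guarded in case a term has only zero counts.
        ratio[term] = pvariance(values) / mean[term] if mean[term] > 0 else 0.0
    # Position 1 goes to the largest mean (R1) or the largest ratio (R2).
    r1 = {t: pos for pos, t in enumerate(sorted(terms, key=lambda t: -mean[t]), 1)}
    r2 = {t: pos for pos, t in enumerate(sorted(terms, key=lambda t: -ratio[t]), 1)}
    overall = {t: w * r1[t] + (1 - w) * r2[t] for t in terms}
    return sorted(terms, key=lambda t: (overall[t], t))


def select_features(counts_by_substance, alpha, w, k):
    """Keep the k best-ranked terms as the feature set."""
    return rank_terms(normalize(counts_by_substance, alpha), w)[:k]


# Toy counts: term "x" is common everywhere, "z" occurs in one substance only.
COUNTS = {
    "A": {"x": 8, "y": 2},
    "B": {"x": 6, "z": 4},
    "C": {"x": 5, "y": 5},
}

print(select_features(COUNTS, alpha=1.0, w=1.0, k=2))  # major topics only
print(select_features(COUNTS, alpha=1.0, w=0.0, k=2))  # particular topics only
```

With w = 1 the ubiquitous term leads the list, while with w = 0 the term concentrated in a single substance leads, which is the trade-off the weight parameter is meant to balance.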

2.4. Parameter optimization

Three key parameters, α, w, and k, were screened for feature selection. α ranges from 0 to 1; 0 implies no normalization and 1 implies complete normalization. w also ranges from 0 to 1; 1 implies that the major topic terms dominate the feature set and 0 implies that the particular topics dominate the model. The last parameter k is the number of features retained in the final feature set.

An exhaustive global grid search is implemented to screen for the optimal parameter set. All possible combinations of the parameter values are evaluated, and the best combination is retained. Each parameter is assigned a suitable range: α ∈ [0, 1], step = 0.2; w ∈ [0, 1], step = 0.1; k ∈ [200, 1000], step = 200. To evaluate the performance of the parameter sets, we employed a 5-fold cross-validation method. After repeating the simulation 100 times, the average rank of the 3 held-out positive prebiotics is used to assess the performance of each parameter set. A more accurate model is expected to rank positive prebiotics at the top of the predicted list; thus, a smaller average rank value means higher rank positions for them, which indicates a better parameter set.
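The grid described above contains 6 × 11 × 5 = 330 parameter combinations. A minimal sketch of the enumeration follows; the scoring function `toy_average_rank` is a hypothetical stand-in for the repeated 5-fold cross-validation, which is not reproduced here, and the variable names are illustrative.

```python
from itertools import product

# Parameter ranges as stated in the text; rounding avoids floating-point drift.
ALPHAS = [round(0.2 * i, 1) for i in range(6)]    # 0.0, 0.2, ..., 1.0
WEIGHTS = [round(0.1 * i, 1) for i in range(11)]  # 0.0, 0.1, ..., 1.0
TOP_KS = list(range(200, 1001, 200))              # 200, 400, ..., 1000


def grid_search(average_rank):
    """Return the (alpha, w, k) triple with the smallest average rank."""
    return min(product(ALPHAS, WEIGHTS, TOP_KS), key=lambda p: average_rank(*p))


def toy_average_rank(alpha, w, k):
    # Hypothetical stand-in: a real version would run repeated 5-fold
    # cross-validation and return the mean rank of the held-out positives.
    return abs(alpha - 1.0) + abs(w - 0.6) + abs(k - 800) / 1000


print(len(ALPHAS) * len(WEIGHTS) * len(TOP_KS))
print(grid_search(toy_average_rank))
```

Because every combination is evaluated independently, the loop could be parallelized if the cross-validation is expensive.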

2.5. Feature enrichment analysis

In the XML document, each MeSH term has two attributes curated by experts: “Descriptor Name” and “Qualifier Name”. “Descriptor Name” refers to the official name of the MeSH term, and “Qualifier Name” refers to the specific related fields. For example, the MeSH term Inositol has the Descriptor Name Inositol and 2 Qualifier Names, Chemistry and Pharmacology. The enrichment analysis therefore extracts all “Qualifier Name” entries under each MeSH “Descriptor Name” and calculates their frequencies. The principal groups in the frequency distribution bar plot denote the properties of the MeSH group.
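A minimal sketch of such a qualifier frequency calculation is given below. It assumes that each qualifier is counted once per selected feature and reported as a fraction of the selected features; the paper does not spell out the exact counting rule, and the function name and toy mapping are hypothetical.

```python
from collections import Counter


def qualifier_enrichment(qualifiers_by_descriptor, selected_features, top_n=20):
    """Return (qualifier, fraction of selected features carrying it), most common first."""
    if not selected_features:
        return []
    counts = Counter()
    for descriptor in selected_features:
        # A set ensures each qualifier is counted once per feature.
        counts.update(set(qualifiers_by_descriptor.get(descriptor, ())))
    total = len(selected_features)
    return [(name, n / total) for name, n in counts.most_common(top_n)]


# Toy mapping from descriptor names to their qualifier names (illustrative only).
QUALIFIERS = {
    "Inositol": ["chemistry", "pharmacology"],
    "Inulin": ["metabolism", "chemistry"],
    "Dextrins": ["metabolism", "chemistry"],
}

for name, fraction in qualifier_enrichment(QUALIFIERS, ["Inositol", "Inulin", "Dextrins"]):
    print(f"{name}: {fraction:.0%}")
```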

2.6. Random forest model training for comparison

Random forest is an outstanding machine learning algorithm that can handle sparse matrices and large numbers of variables. Using the MeSH term frequencies of positive and negative carbohydrates as features, random forest models were trained and tested with 100 repeats of 5-fold cross-validation, and the averaged area under the receiver operating characteristic (ROC) curve (area under the curve [AUC]) was used for performance comparison across datasets. The training and testing procedures were implemented using the “randomForest” package in the R programming language.

2.7. Model evaluation and predicting novel prebiotics

We build a carbohydrate prediction matrix $f_{ij}$ according to Eq. (1) with numerical values, where each row represents a carbohydrate and each column refers to a feature. This matrix can be used to predict novel prebiotics by Eq. (5). Each carbohydrate obtains its own score $R_B$, which denotes its potential to be a prebiotic.

$$R_B = \sum_{i=1}^{k} f_{ij} \tag{5}$$

Then, we carried out 5-fold cross-validation to evaluate the predictive performance of the model. In each round, 4 randomly generated folds were used for feature selection, and the fifth fold was reserved for prediction together with the carbohydrates set. Each round thus yields 2 columns for the prediction set: an $R_B$ score column and a binary state column (1 denotes prebiotics, 0 denotes not prebiotics). These 2 columns produce one AUC score. The prediction procedure was repeated 100 times, and the average AUC was used as the measure of prediction performance.
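The scoring and evaluation of one round can be sketched as follows. The score sums a substance's normalized frequencies over the selected features, and the AUC is computed with the standard pairwise (Mann-Whitney) formulation, which is an assumption here since the authors computed it in R; the function names and toy values are illustrative.

```python
def prebiotic_score(normalized_row, features):
    """Sum the normalized frequencies of the selected features for one substance."""
    return sum(normalized_row.get(term, 0.0) for term in features)


def auc(scores, labels):
    """Area under the ROC curve: the share of (positive, negative) pairs ranked
    correctly, with ties counted as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# One toy round: a score column and a binary state column.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(auc(scores, labels))
```

Averaging this value over the 100 repetitions would give the reported performance measure.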

The model returns a vector of scores between 0 and 1 for a combined prediction profile. These scores are then mapped to a binary state indicating “prebiotics” or “non-prebiotics” by choosing a cut-off. For each combination of profiles, the existence of a prebiotic is considered positive (P) or negative (N). True (T) means that the predicted and observed categories are identical, and false (F) implies otherwise. The notations TP, FP, TN, and FN combine these labels to give the number of data points (combined prediction profiles) in each category. These values depend on the cut-off at which carbohydrate prediction ranks are mapped onto binary predictions. Sensitivity and specificity are computed for the resulting binary predictions over the entire score range. The specificity is defined as TN/(FP + TN) and the sensitivity as TP/(TP + FN). Lastly, we calculate the average specificity and average sensitivity over the rounds (repeated 100 times). The best cut-off point for balancing the average sensitivity and average specificity of our model is the point on the ROC curve closest to the (0, 1) point. We use the corresponding cut-off, calculated via the R package ROCR,[43] to indicate potential prebiotics.
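The cut-off selection can be illustrated with a short Python sketch. It is a stand-in for the ROCR-based calculation rather than a reproduction of it, and it assumes that a substance is predicted positive when its score is greater than or equal to the cut-off and that both classes are present; the function names and toy values are hypothetical.

```python
from math import hypot


def sensitivity_specificity(scores, labels, cutoff):
    """Sensitivity TP/(TP+FN) and specificity TN/(FP+TN) at a given cut-off."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (fp + tn)


def best_cutoff(scores, labels):
    """Pick the observed score whose ROC point lies closest to (0, 1)."""
    best = None
    for cutoff in sorted(set(scores)):
        sens, spec = sensitivity_specificity(scores, labels, cutoff)
        # ROC coordinates are (1 - specificity, sensitivity).
        distance = hypot(1 - spec, 1 - sens)
        if best is None or distance < best[0]:
            best = (distance, cutoff, sens, spec)
    return best[1:]


print(best_cutoff([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```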

3. Results

3.1. Text mining framework for novel prebiotics prediction

We developed a systematic MeSH-based text mining approach to robustly predict new prebiotics. The feature selection part of our method is inspired by the MedMeSH summarizer, a text mining algorithm for describing the functionality of a group of genes. Our method goes further: it not only summarizes a cluster using MeSH terms but also predicts novel concepts with the same property from the cluster. In addition, the MedMeSH summarizer uses a fixed parameter set for gene cluster summarization. However, we found that a fixed parameter set usually causes many unrelated terms to emerge as topic terms in our dataset, which undermines the subsequent prediction result. To overcome this problem, we developed an exhaustive global search method to determine the optimal parameter set for our dataset of prebiotics. High-profile features were screened out and validated by feature enrichment analysis and the ROC plot.

The workflow of prebiotics prediction is shown in Fig. 1. We first collected the known prebiotics from Table 1 and the carbohydrates set from the NLM MeSH tree structure, and used them as queries to retrieve MeSH-related documents from PubMed. To construct the profile of each substance (prebiotics or carbohydrates), MeSH terms were extracted from the retrieved literature and their frequency was calculated by Eq. (1). After that, we identified 10 MeSH terms as stop words: Animals, Humans, Male, Female, Rats, Adult, Mice, Aged, Middle Aged, and Child. These terms were removed from the corpus prior to the following analysis.

Figure 1. The framework of prediction. 1. Download PubMed XML documents of 127 carbohydrates, including 15 positive prebiotics and 112 carbohydrates. 2. Compute the optimal parameter set (α, w, and k) for the model by exhaustive grid search and assign the top k features as the model feature set. 3. Use the ROC curve to evaluate the performance of the model. 4. Perform the prediction procedure to mine novel prebiotics. ROC = receiver operating characteristic curve, XML = extensible markup language.

Our model primarily aims to predict new prebiotics on the basis of MeSH frequency by extracting highly representative features, an approach originally employed by Kankar et al[30] to investigate the functionality of a gene group. We adopted their philosophy and adapted it to a more concrete task: novel prebiotics prediction. Unlike the previous one-size-fits-all solution for the gene set, we refined the feature discovery pattern by considering the unbalanced data across the feature selection procedure.

We calculated two parameters ($R_1$ and $R_2$) to identify different types of MeSH terms. $R_1$, calculated by Eq. (2), takes major topics into account, whereas $R_2$, produced by Eq. (3), considers particular topics. To improve the feature selection step, we used an exhaustive grid search method to determine an optimal parameter set with 5-fold cross-validation. Each parameter in the model is traversed with a fixed step over its value range. We then selected 800 features from the 15 positive prebiotics using the optimal parameter set (α = 1, w = 0.6, k = 800). Next, we used feature enrichment analysis and carbohydrate clustering to evaluate the performance of the feature set. The feature set represented prebiotic properties very well, which in turn reflects the performance of the optimal parameter set. After that, we evaluated the final model and used the ROC curve to select the threshold that denotes the boundary between carbohydrates with and without prebiotic properties. According to this threshold, the top 11 carbohydrates were identified as novel prebiotics. Finally, we conducted a thorough literature investigation of these new prebiotics.

3.2. Optimal parameter set for prebiotics prediction

The corpus volume associated with a carbohydrate often varies substantially between positive prebiotics and carbohydrates. Well-studied prebiotics, such as inulin and fructooligosaccharides, are substantially more common in research than other carbohydrates, which introduces strong bias into the model. To balance the effect of the corpus volume, we introduced the parameter α to control the extent of normalization of MeSH frequency. To balance the generic topics and particular topics, a weight parameter w is introduced to ensure that the final feature set takes these 2 diverse topics into full consideration. The last parameter k is the number of features retained in the final feature set. An optimal set of parameters is crucial for the precise prediction of prebiotics, and we used an exhaustive global grid search method to determine the optimal parameter set (see Section 2).

Performance analyses of each parameter are shown in Fig. 2. α = 1 achieves the best average rank regardless of the change in w, indicating that full normalization is necessary for the applied datasets, as shown in Fig. 2A. w = 0.6 (k = 800, α = 1.0) achieves the best average rank in Fig. 2B, suggesting that generic topics are assigned a larger contribution than particular topics in known prebiotics summaries under full normalization. Beyond that, w under weak normalization (α = 0.2) was also investigated to further understand the impact of normalization (results shown in Fig. 2C). w = 1.0 achieves the best average rank under weak normalization, suggesting that generic topics alone are used to represent the entire known prebiotics summary, which indicates that full normalization is necessary when encountering unbalanced data (otherwise, the system will automatically abandon particular instances to maintain performance). Notably, when screening the optimal parameter α, the average rank is represented by an integration over w. Finally, the optimal parameters α = 1, w = 0.6, and k = 800 were chosen for further analyses. With the optimal parameter set determined, the two divergent topics (generic and particular) are balanced by the parameter w to generate a feature summary of positive prebiotics.

Figure 2. Exhaustive grid search for the optimal parameter set via 5-fold cross-validation. The figure describes the contribution of the 3 parameters (α, w, and k) to the model. Each column adopts a fixed k. (A) shows that the optimal α was 1, while the optimal w = 0.6 and k = 800 were screened in (B). Weak normalization was also investigated in (C).

3.3. Feature enrichment analysis and carbohydrates clustering

To investigate the major topics of the selected features, an enrichment analysis was performed (see Section 2). The result is shown in Fig. 3. Interestingly, >95%, >70%, and >70% of the selected features correspond to metabolism, chemistry, and pharmacology, respectively, coinciding with our prior knowledge that prebiotics usually play major roles in the metabolism of the human body owing to their various chemical structures and pharmacological properties. In other words, these vital properties are concealed in the feature summary. Our method extracts them and puts them to use for prediction.

To examine the quality of the 800 selected features, we further applied hierarchical clustering to determine whether these features can cluster related carbohydrates adjacent to each other. Hierarchical clustering is a widely used data analysis tool that summarizes a dataset by grouping similar observations into 1 cluster.[31] In the real-world case presented in Fig. 4, the clustered carbohydrates notably shared a similar structure with the MeSH tree in NLM. For instance, cyclodextrins, cyclic oligosaccharides consisting of 6 (α-cyclodextrins), 7 (β-cyclodextrins), 8 (γ-cyclodextrins), or more glucopyranose units linked by α-(1,4) bonds, form a child node of dextrins in the MeSH tree (green block at 9 o’clock).[37] In addition to this dextrins branch, other branches, such as the agar branch (red block at 8 o’clock), the oligosaccharides branch (green block at 4 o’clock), and the fructans branch (green block at 1 o’clock), also achieve high similarity with the MeSH tree. These observations indicate that the features we selected may be effective in further prebiotics prediction.

3.4. Model evaluation and prebiotics prediction

The ROC curve was employed for model evaluation. Because the positive set is small (only 15 prebiotics), we first enlarged it to 50 to validate our method. The 50 positives comprise the 15 known positive prebiotics and 35 carbohydrates under the polysaccharides node of the NLM MeSH tree; their names are listed in S2 Table, http://links.lww.com/MD/B448. Using the 50 positives and the remaining 77 carbohydrates, we obtained the optimal parameters a=1, w=0.3, and k=800, with an average rank of 11.905. These parameters were then used to evaluate the model with a 5-fold cross-validation ROC curve. In addition, we compared our method with a machine learning method. The frequency matrix for machine learning is extremely sparse and contains more than 20,000 variables. Because the random forest algorithm copes well with large numbers of variables and with overfitting, we chose it for the comparison (see Section 2).

Figure 5A shows the 5-fold cross-validation ROC curve for the model with 50 positives. With the enlarged positive set, our model performed well, with an AUC of 0.891, which is better than that of the random forest algorithm (AUC of 0.846). After this validation step with 50 positives, we returned to the 15 positive prebiotics and performed the real-world ROC evaluation.
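The evaluation protocol can be illustrated with a minimal sketch of fold construction and AUC computation. The class sizes (15 positives, 112 other carbohydrates) come from the text; the stratified assignment, the rank-based AUC formula, the function names, and the toy scores are assumptions made for illustration and may differ from the implementation used in the study.

```python
import random

def stratified_folds(labels, n_folds=5, seed=0):
    """Split sample indices into n_folds folds, spreading each class evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in (1, 0):
        members = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(members)
        for position, index in enumerate(members):
            folds[position % n_folds].append(index)
    return folds

def auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 15 known prebiotics (label 1) and 112 other carbohydrates (label 0).
labels = [1] * 15 + [0] * 112
folds = stratified_folds(labels)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]

# Hypothetical scores for four samples, only to exercise auc().
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

With this assignment every fold receives 3 of the 15 positives, so each held-out fold contains both classes and a per-fold AUC is always defined.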

Figure 5B shows the 5-fold cross-validation ROC curve for the model with 15 positives. Surprisingly, our model performed far better than the random forest algorithm, which suggests that our method is a good choice for highly imbalanced data (112 negatives vs. 15 positives). We obtained an AUC of 0.911, and a cut-off of 0.013 maintains the optimal balance between average specificity and average sensitivity. This cut-off selects the top 11 ranked carbohydrates in the prediction list, which may have prebiotic properties. These predicted novel prebiotics are presented in Table 2, and some of them have been investigated by prebiotics experts. The average specificity and sensitivity were 0.876 and 0.838, respectively.
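How a cut-off can balance sensitivity against specificity is sketched below. The text does not name the balancing criterion, so this example assumes Youden's J statistic (sensitivity + specificity - 1) and assumes that a score at or above the cut-off is called a prebiotic; the function names and the toy scores are hypothetical.

```python
def sensitivity_specificity(labels, scores, cutoff):
    """Sensitivity and specificity when scores >= cutoff are called positive."""
    tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= cutoff)
    fn = sum(1 for l, s in zip(labels, scores) if l == 1 and s < cutoff)
    tn = sum(1 for l, s in zip(labels, scores) if l == 0 and s < cutoff)
    fp = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def best_cutoff(labels, scores):
    """Return (cutoff, sensitivity, specificity) maximising Youden's J."""
    best = None
    for cutoff in sorted(set(scores)):
        sens, spec = sensitivity_specificity(labels, scores, cutoff)
        j = sens + spec - 1.0
        if best is None or j > best[0]:
            best = (j, cutoff, sens, spec)
    return best[1:]

# Hypothetical scores: three known prebiotics and five other carbohydrates.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_scores = [0.030, 0.020, 0.008, 0.015, 0.010, 0.005, 0.004, 0.001]
cutoff, sens, spec = best_cutoff(toy_labels, toy_scores)
print(cutoff, round(sens, 3), round(spec, 3))
```

Other balancing rules, such as choosing the point closest to the top-left corner of the ROC plot, would fit the same loop by replacing the expression for `j`.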

In addition to evaluating the model and predicting potential prebiotics, we also investigated the literature evidence for the 11 potential prebiotics based on the original definition of prebiotics: “a prebiotic is a selectively fermented ingredient that allows specific changes, both in the composition and/or activity in the gastrointestinal microbiota, that confer benefits upon host well-being and health.” Most of the predicted prebiotics are supported by the literature analysis for 2 of the 3 criteria of prebiotics (non-digestibility, fermentation, and selectivity), and there are no obvious conflicts with these criteria. Even for the most rigorous criterion (selectivity), there are many items with promising clues. For example, isomaltose has been shown to be a prebiotic with digestion-resistant properties, raffinose is a complex carbohydrate that can promote the growth of beneficial microorganisms, and acarbose is usually administered in diabetes treatment and has promising potential as a prebiotic.[40] Additionally, cyclodextrin is a saccharide that can reduce the digestion of carbohydrates and lipids. The derivative α-cyclodextrin is a soluble dietary fiber that can feed one of the Lactococcus sp. strains in the gastrointestinal tract,[42] whereas the other derivative (β-cyclodextrin) has been shown to be an important component of low-fat foods.[43] In summary, this promising list not only shows prospective prebiotics but also demonstrates the efficacy of our model.

Figure 3. Feature enrichment analysis. The top 20 qualifier names were extracted from the 800 features. The categories in the figure roughly indicate the high-level concepts of the 800 features. These concepts are highly correlated with the real-world chemical properties of prebiotics.

Shan et al. Medicine (2016) 95:49 Medicine

4. Discussion

It should be noted that our method depends on MeSH terms. Curators typically assign 10 to 12 MeSH terms to describe most indexed papers in PubMed, but a small portion of papers has not yet been curated. For these papers, we suggest that keywords be extracted manually from their abstracts and titles to keep the information complete. In addition, almost all text mining methods, including ours, are partly limited by the size and type of the data set, and the predictive power of our method in other data-intensive fields has not been tested.

Prebiotics can provide substantial health benefits to both healthy and unhealthy people. Despite their demonstrated medical effects, the discovery and application of prebiotics cannot meet the growing needs of the prebiotic market if candidates are simply matched to the criteria by hand. To improve the efficiency of prebiotics mining, we present a methodology that uses text mining techniques to broaden the range of potential prebiotics identified from the related literature.

Figure 4. Hierarchical clustering of carbohydrates. Carbohydrates were clustered in hierarchical mode. Many branch structures are highly correlated with the MeSH tree in NLM; we could therefore infer that the features capture a large portion of prebiotic properties and can be employed to predict potential prebiotics in the prediction step. MeSH = medical subject headings, NLM = National Library of Medicine.


We explored the optimal parameter set with an exhaustive grid search: each important parameter (a, w, and k) was evaluated over a range of candidate values. In the parameter selection process, the parameter a is effective in trading off corpus volume, even when the volumes of individual corpora differ widely (10 –10 ). The parameters w and k also substantially affect the predictive performance. To determine more accurately how performance varies with corpus volume, we performed additional analyses, plotting the average rank score against each w and k at a specific lower a (a=0.2) after determining the optimum a (1.0). Corpus volumes in our experiment vary substantially; thus, a is intended to narrow the search toward reasonable parameters. Likewise, our parameter selection process may provide a solution for other corpora, especially those with volume-unbalanced data.
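The exhaustive grid search over (a, w, k) can be sketched as follows. The scoring model itself is defined in Section 2 and is not reproduced here: `toy_average_rank` is a hypothetical stand-in objective, the candidate grids are hypothetical, and the sketch assumes that a lower average rank is better.

```python
from itertools import product

def grid_search(objective, a_values, w_values, k_values):
    """Return (best_score, (a, w, k)) minimising objective over the full grid."""
    best = None
    for a, w, k in product(a_values, w_values, k_values):
        score = objective(a, w, k)
        if best is None or score < best[0]:
            best = (score, (a, w, k))
    return best

def toy_average_rank(a, w, k):
    # Hypothetical stand-in for the average rank of known prebiotics;
    # a real objective would rebuild the feature set and re-rank candidates.
    return 10.0 + (a - 1.0) ** 2 + (w - 0.6) ** 2 + ((k - 800) / 1000.0) ** 2

# Hypothetical candidate values for each parameter.
a_grid = [0.2, 0.4, 0.6, 0.8, 1.0]
w_grid = [0.0, 0.3, 0.6, 0.9]
k_grid = [200, 400, 800, 1600]
best_score, best_params = grid_search(toy_average_rank, a_grid, w_grid, k_grid)
print(best_params, best_score)
```

Every combination is scored once, so the cost grows with the product of the grid sizes (80 evaluations here); this is affordable only when the grids are as coarse as in this example.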

Notwithstanding inevitable practical constraints, we believe that our work is an important step toward identifying more prebiotics, yielding meaningful results and providing a basis for future development and experimentation. We identified critical factors affecting the mining work and developed methods for feature selection from volume-unbalanced data to assess predictive performance. We also performed clustering to evaluate the selected features for known prebiotics. The ROC curve, which evaluates the model fit for the optimal parameter set, showed that the candidates we identified are sufficiently consistent to create a list of potential prebiotics for further research. In the list of 11 potential prebiotics, apart from the promising specific carbohydrates, some relatively broad categories, such as xylans, fructans, and dextrins, also appear, indicating a promising field of potential prebiotics.

Overall, the MeSH-based text mining method provides a bridge between the availability of tens of thousands of studies with curated MeSH terms and the emerging field of prebiotics research, which has found few prebiotics over many years. For the former, our algorithm dramatically enhances the power to discover potential prebiotics hidden in countless studies. For the latter, new candidate prebiotics that are useful in prebiotics research come to light. As a future direction, the thousands of studies in an entire literature corpus, taken together rather than as individual studies, can assist us in other fields, such as finding bacteria that can perform certain functions or obtaining food for soldiers, which may represent a niche need in future studies.

In this integrated analysis, we present new ideas and instructions that are helpful to researchers. Our results indicate that there are currently no universal parameters for the mining task and that a parameter set reported to work for one specific corpus may not be an appropriate choice for another. As we noted, an exhaustive grid search is recommended to customize the parameter set, not only to determine the best parameter settings for given corpora but also to assess their potential prediction performance. Taken together, the algorithm developed in our study is meaningful in a wide range of biological scenarios, and the set of potential prebiotics obtained in this study may provide novel text mining-based clues for the prebiotics field. Follow-up studies are warranted to validate the findings herein; moreover, additional defined prebiotic substances and related documents will improve the model. Our text mining-based study lays the foundation for efficient mining of potential prebiotics and may represent a promising method in the difficult field of prebiotics research.

Figure 5. Cross-validation ROC analyses were used to evaluate model performance and determine the ranking threshold. (A) The ROC plot indicates that our method (red) performs better than random forest (green) with 50 positives; that is, our method discriminates well between known prebiotics and other carbohydrates. The 45° diagonal line (dashed) indicates the theoretical plot of a test with no discrimination between known prebiotics and carbohydrates. (B) The ROC plot indicates that our method (red) performs far better than random forest (green) with 15 positives. The cut-off is the threshold beyond which we deem a carbohydrate to possess prebiotic properties. ROC = receiver operating characteristic curve.

Table 2
Summary and conclusion on the prebiotic effect of 11 potential prebiotics.

Rank  Carbohydrates                     Non-digestibility  Fermentation  Selectivity  References
1     Isomaltose                        Yes                n.c.          Yes          [32]
2     Xylans                            Yes                Yes           n.c.         [33]
3     Fructans                          Yes                Yes           n.c.         [34]
4     β-Cyclodextrins                   Yes                Yes           n.c.         [35]
5     Raffinose                         Yes                n.c.          Yes          [36]
6     Dextrins                          Yes                n.c.          n.c.         [37]
7     α-Cyclodextrins                   Yes                Yes           n.c.         [38]
8     Mitobronitol                      Probable           n.c.          n.c.         [39]
9     Oligosaccharides, branched-chain  Yes                Yes           n.c.         [32]
10    Acarbose                          Yes                Yes           n.c.         [40]
11    Xylose                            Yes                n.c.          Yes          [41]

n.c. = not clear.

Acknowledgments

We thank Miss Xin Song for critical discussion and suggestions. We would also like to acknowledge the generous funding provided by the National Basic Research Project (973 program) (2012CB518200), the General Program (31401141, 81573251, 30900830) of the Natural Science Foundation of China, the State Key Laboratory of Proteomics of China (SKLP-Y201303, SKLP-O201104 and SKLP-K201004), and the Special Key Programs for Science and Technology of China (2012ZX09102301–016).

References

[1] Roberfroid M. Prebiotics: the concept revisited. J Nutr 2007;137(Suppl 2):830S–7S.

[2] Gupta V, Lehal GS. A survey of text mining techniques and applications. J Emerg Technol Web Intell 2009;1:60–76.

[3] Swanson DR. Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspect Biol Med 1986;30:7–18.

[4] DiGiacomo RA, Kremer JM, Shah DM. Fish-oil dietary supplementation in patients with Raynaud’s phenomenon: a double-blind, controlled, prospective study. Am J Med 1989;86:158–64.

[5] Ramadan NM, Halvorson H, Vande-Linde A, et al. Low brain magnesium in migraine. Headache 1989;29:416–9.

[6] Ferrari MD. Biochemistry of migraine. Pathol Biol 1992;40:287–92.

[7] Tsuruoka Y, Tateishi Y, Kim JD, et al. Developing a robust part-of-speech tagger for biomedical text. Lect Notes Comput Sci 2005;3746:382–92.

[8] Tong S, Koller D. Support vector machine active learning with applications to text classification. J Mach Learn Res 2002;2:45–66.

[9] Escalante HJ, Garcia-Limon MA, Morales-Reyes A, et al. Term-weighting learning via genetic programming for text classification. Knowl-Based Syst 2015;83:176–89.

[10] Hirsch L, Saeedi M, Hirsch R. Evolving text classification rules with genetic programming. Appl Artif Intell 2005;19:659–76.

[11] Jurka TP. Maxent: an R package for low-memory multinomial logistic regression with support for semi-automated text classification. R J 2012;4:56–9.

[12] Ciarelli PM, Oliveira E. An Enhanced Probabilistic Neural Network Approach Applied to Text Classification. Prog Pattern Recog Image Anal Comput Vis Appl Proc 2009;5856:661–8.

[13] Lu ZY. PubMed and Beyond: A Survey of Web Tools for Searching Biomedical Literature. Oxford: Database; 2011.

[14] Al-Sheraji SH, Ismail A, Manap MY, et al. Prebiotics as functional foods: a review. J Funct Foods 2013;5:1542–53.

[15] Agarwala R, Barrett T, Beck J, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2015;43:D6–17.

[16] Bhattacharya S, Viet HT, Srinivasan P. MeSH: a window into full text for document summarization. Bioinformatics 2011;27:I120–8.

[17] Dhammi IK, Kumar S. Medical subject headings (MeSH) terms. Indian J Orthop 2014;48:443–4.

[18] Sangeetha PT, Ramesh MN, Prapulla SG. Recent trends in the microbial production, analysis and application of Fructooligosaccharides. Trends Food Sci Tech 2005;16:442–57.

[19] Lina BAR, Jonker D, Kozianowski G. Isomaltulose (Palatinose (R)): a review of biological and toxicological studies. Food Chem Toxicol 2002;40:1375–81.

[20] Vazquez MJ, Alonso JL, Dominguez H, et al. Xylooligosaccharides: manufacture and applications. Trends Food Sci Tech 2000;11:387–93.

[21] Alander M, Matto J, Kneifel W, et al. Effect of galacto-oligosaccharide supplementation on human faecal microflora and on survival and persistence of Bifidobacterium lactis Bb-12 in the gastrointestinal tract. Int Dairy J 2001;11:817–25.

[22] Singh M, Sharma R, Banerjee UC. Biotechnological applications of cyclodextrins. Biotechnol Adv 2002;20:341–59.

[23] Johansen HN, Glitso V, Knudsen KEB. Influence of extraction solvent and temperature on the quantitative determination of oligosaccharides from plant materials by high-performance liquid chromatography. J Agr Food Chem 1996;44:1470–4.

[24] Mussatto SI, Mancilha IM. Non-digestible oligosaccharides: a review. Carbohyd Polym 2007;68:587–97.

[25] Villamiel M, Corzo N, Foda MI, et al. Lactulose formation catalysed by alkaline-substituted sepiolites in milk permeate. Food Chem 2002;76:7–11.

[26] Kawase M, Pilgrim A, Araki T, et al. Lactosucrose production using a simulated moving bed reactor. Chem Eng Sci 2001;56:453–8.

[27] Kaneko T, Kohmoto T, Kikuchi H, et al. Effects of Isomaltooligosaccharides with different degrees of polymerization on human fecal bifidobacteria. Biosci Biotechnol Biochem 1994;58:2288–90.

[28] Eeckhaut V, Van Immerseel F, Dewulf J, et al. Arabinoxylooligosaccharides from wheat bran inhibit Salmonella colonization in broiler chickens. Poultry Sci 2008;87:2329–34.

[29] Barczynska R, Slizewska K, Jochym K, et al. The tartaric acid-modified enzyme-resistant dextrin from potato starch as potential prebiotic. J Funct Foods 2012;4:954–62.

[30] Kankar P, Adak S, Sarkar A, et al. MedMeSH summarizer: text mining for gene clusters. Siam Proc S 2002;548–565.

[31] Langfelder P, Zhang B, Horvath S. Defining clusters from a hierarchical cluster tree: the dynamic tree cut package for R. Bioinformatics 2008;24:719–20.

[32] Gibson GR, Probert HM, Van Loo J, et al. Dietary modulation of the human colonic microbiota: updating the concept of prebiotics. Nutr Res Rev 2004;17:259–75.

[33] da Silva AE, Oliveira EE, Egito EST, et al. Xylan, A Promising Hemicellulose for Pharmaceutical Use. INTECH Open Access Publisher; 2012.

[34] Bosscher D. Fructan prebiotics derived from inulin. Prebiotics and Probiotics Science and Technology. Springer; 2009;163–205.

[35] Slavin JL. Dietary fiber and body weight. Nutrition 2005;21:411–8.

[36] Su P, Henriksson A, Mitchell H. Selected prebiotics support the growth of probiotic mono-cultures in vitro. Anaerobe 2007;13:134–9.

[37] Binns N. Probiotics, prebiotics and the gut microbiota. Probiotics, Prebiotics Gut Microbiota 2013;1–32.

[38] Delzenne NM, Cani PD. Nutritional modulation of gut microbiota in the context of obesity and insulin resistance: Potential interest of prebiotics. Int Dairy J 2010;20:277–80.

[39] Kelemen E, Jakab K, Váradi G, et al. Non-supralethal mitobronitol/cytarabine/cyclophosphamide conditioning without irradiation before bone marrow transplantation for accelerated chronic granulocytic leukemia: apparent absence of acute graft-versus-host disease. Leukemia 1993;7:939–45.

[40] Evenepoel P, Bammens B, Verbeke K, et al. Acarbose treatment lowers generation and serum concentrations of the protein-bound solute p-cresol: a pilot study. Kidney Int 2006;70:192–8.

[41] Boler BMV, Fahey GC Jr. Prebiotics of plant and microbial origin. Direct-Fed Microbials and Prebiotics for Animals. Springer; 2012;13–26.

[42] Pranckute R, Kaunietis A, Kuisiene N, et al. Development of synbiotics with inulin, palatinose, α-cyclodextrin and probiotic bacteria. Pol J Microbiol 2014;63:33–41.

[43] Marcolino VA, Zanin GM, Durrant LR, et al. Interaction of curcumin and bixin with β-cyclodextrin: complexation methods, stability, and applications in food. J Agr Food Chem 2011;59:3348–57.

Shan et al. Medicine (2016) 95:49 www.md-journal.com

Supplemental Digital Content

Data

December 2016

Guangyu Shan · Yiming Lu · Bo Min · Wubin Qu · Chenggang Zhang

Supplemental Digital Content

Data

December 2016

Guangyu Shan · Yiming Lu · Bo Min · Wubin Qu · Chenggang Zhang

Using text mining and forest plots to identify similarities and differences between two spine-related journals based on medical subject headings (MeSH terms) and author-specified keywords in 100 top-cited articles

Article

Full-text available

Nov 2022

Literature research requires an understanding of the similarities and differences between different types of journals. It has not yet been possible to use text-mining to demonstrate the differences between the topics of articles by presenting features of article keywords using forest plots. It is important for authors to make a quick assessment of the similarities and differences between research types when submitting an article for publication in a journal. Our study uses text mining and forest plotting techniques to extract article features and compare the similarities and differences between the two journals' research types. There were a total of 100 top-cited articles selected from Spine (Phila Pa 1976) and The Spine Journal: official journals of the North American Spine Society with impact factors of 3.19 and 3.22 respectively, as reported by Journal Citation Reports (JCR) for 2018. XLSTAT software was used to extract features from author-made keywords and medical subject headings (e.g., MeSH terms in PubMed). These 200 top-cited articles were analyzed and clustered by performing factor analysis and social network analysis (SNA). The study presented three types of results: (1) descriptive statistics, (2) classification analysis, and (3) inferential statistics. The chi-square test was used to examine the frequency of clusters and journals, and forest plots were used to analyze differences between journals in terms of research topics. 
It was observed that (1) the United States dominated publications, accounting for 54% of 200 articles; the MeSH term of surgery was simultaneously highlighted in both journals using a word cloud generator; (2) five-term clusters were identified, namely, (i) Pain & Prognosis, (ii) Statistics & Data, (iii) Spine & Surgery, (iv) physiopathology, and (v) physiology; (4) there were no differences in distribution counts among categories between journals (Chi Square = 1.64, df = 4, p = 0.82), but differences in category(factor) scores between journals were found(Q-statistic = 484.94, df = 4, p < 0.001). Using text mining and a forest plot, we are able to understand the relationships between the types of research in different journals. Readers can use this research as a reference for future journal submissions based on the study results.

Hotness prediction of scientific topics based on a bibliographic knowledge graph

Article

Jul 2022
INFORM PROCESS MANAG

As a part of innovation in forecasting, scientific topic hotness prediction plays an essential role in dynamic scientific topic assessment and domain knowledge transformation modeling. To improve the topic hotness prediction performance, we propose an innovative model to estimate the co-evolution of scientific topic and bibliographic entities, which leverages a novel dynamic Bibliographic Knowledge Graph (BKG). Then, one can predict the topic hotness by using various kinds of topological entity information, i.e., TopicRank, PaperRank, AuthorRank, and VenueRank, along with pre-trained node embedding, i.e., node2vec embedding, and different pooling techniques. To validate the proposed method, we constructed a new BKG by using 4.5 million PubMed Central publications plus MeSH (Medical Subject Heading) thesaurus and witnessed the essential prediction improvement with extensive experiment outcomes over 10 years observations.

The Intake of Extremely Low Minerals and Bacteria Consumable Saccharides Secured Safety and Persistent of 7-14 Days Prolonged Total Dietary Deprivation Regimen

Chapter

Full-text available

Jan 2021

A Novel 7-Days Prolonged Dietary Deprivation Regimen Improves ALT and UA After 3–6 Months Refeeding, Indicating Therapeutic Potential

Article

Full-text available

May 2020

Objectives: The aim of this study was to evaluate a total fasting regimen assisted by a novel prebiotic, Flexible Abrosia (FA), in more than 7 days of continual dietary deprivation (7D-CDD). Our analysis included basic physical examinations, bioelectrical impedance analysis, and clinical lab and ELISA analysis in normal volunteers. Methods: Seven healthy subjects with normal body weight participated in 7D-CDD with the assistance of a specially designed probiotic. Individuals were assigned to take FA (113.4 KJ/10 g) at each mealtime to avoid possible injuries to intestinal flora and smooth the hunger sensation. During 7D-CDD, the subjects were advised to avoid any food intake, especially carbohydrates, except for drinking plentiful amounts of water. The examination samples were collected before CDD as self-control, at 7 days fasting, and after 7~14 days of refeeding. Three subjects were also tested after 6-m refeeding. Results: The FA-CDD regimen significantly decreased suffering from starvation, with tolerable hunger sensations during the treatment. With the addition of daily mineral electrolytes, the subjects not only passed through the entire 7D-CDD regimen but also succeed in 12~13 days total fasting in two subjects. There was a significant reduction in blood glucose, insulin, and high-density lipoprotein levels during fasting, and the blood concentrations of uric acid (UA), alanine aminotransferase (ALT), and creatine kinase (CK) were increased. However, after more than 2 months of refeeding, the disease markers ALT, GOT, and CK either remained stable or were slightly downregulated compared to their initial D0 control level. Conclusion: Our experiment has supplied the first positive evidence that, with the assistance of a daily nutritional supply of around 100 kcal total calories to their intestinal flora, human subjects were able to tolerate hunger sensations. 
We have found that, although 7D-CDD induced increases in UA, CK, and transferases during fasting, refeeding led the markers to become either down-regulated or unchanged compared to their initial levels. This phenomenon was further confirmed in longer-term (6 m) recovery. Our results failed to support the hypothesis that fasting induced liver damage, since ALT, GOT, and CK remained low after longer-term refeeding. Our findings indicate that the 7D-CDD regimen might be practical and that it might be valuable to design larger clinical fasting trials for improvement of health strategy-targeting in metabolic disorders.

Research Progress of Gut Flora in Improving Human Wellness

Article

Full-text available

Mar 2019

Human wellness is the ultimate goal of our efforts in improving the human life. Special foods are undoubtedly important in achieving human wellness. However, overeating significantly leads to obesity and diabetes. These chronic diseases will in turn affect the human wellness. Therefore, “dietary restriction and proper exercise” were introduced in the human daily life. Different foods cause various effects on the human health. The diversification of diet is a priority for nutritionists to keep our body healthy. To avoid diabetes mellitus, special foods for ketogenic diet, low-carbon diet, and low-calorie intake are also gradually attracting attention. In addition, the hypothesis that “hunger sensation comes from gut flora” brings new light to the research on the biological motivation for humans to eat food. This hypothesis has been gradually demonstrated using the flexible fasting technology by providing special foods, such as plant polysaccharides and dietary fibers. The response to food-needing signals from the gut flora to these foods demonstrates the importance of the gut flora in improving human wellness. The gut flora is probably an essential factor for translating the food-eating signals and converting the nutrition to our body. Therefore, “gut flora priority principle” is developed to guarantee human wellness. The 16S rRNA sequencing and mass spectrometric techniques can be used to identify the gut flora, which may guide us to a new era of human wellness based on gut flora wellness. Keywords: Hunger sensation comes from gut flora, Gut flora-centric theory, Flexible fasting, Gut flora priority principle, Universal reproducing power of the microbiota, Gut flora wellness, Human wellness

A Tool for Suggesting Ayurvedic Remedies from Curated and Classified Clinical Trial Reports

Article

Full-text available

Sep 2018

It requires great effort to search through huge number of published articles that provide information we need. Therefore it is necessary to find a solution that helps researchers in gaining accurate and deep understanding about diseases. Thus drug discovery and drug repurposing are gaining significance with the current onics tools. Traditional Medical practices like Ayurveda needs to be more visible to practitioners with evidence based approach. The clinical trials conducted have to be shared with the world for attaining the very philosophy of Ayurveda.. This paper presents a survey on various text mining technologies developed to classify theories and literature pertaining to the clinical observations of practitioners and suggests a possible solution to match a patient's symptoms.

Knowledge extraction from literature and enzyme sequences complements FBA analysis in metabolic engineering

Article

Sep 2021

Flux balance analysis (FBA) using genome-scale metabolic model (GSM) is a useful method for improving the bio-production of useful compounds. However, FBA often does not impose important constraints such as nutrients uptakes, by-products excretions and gases (oxygen and carbon dioxide) transfers. Furthermore, important information on metabolic engineering such as enzyme amounts, activities, and characteristics caused by gene expression and enzyme sequences is basically not included in GSM. Therefore, simple FBA is often not sufficient to search for metabolic manipulation strategies that are useful for improving the production of target compounds. In this study, we proposed a method using literature and enzyme search to complement the FBA-based metabolic manipulation strategies. As a case study, this method was applied to shikimic acid production by Corynebacterium glutamicum to verify its usefulness. As unique strategies in literature-mining, overexpression of the transcriptional regulator SugR and gene disruption related to by-products productions were complemented. In the search for alternative enzyme sequences, it was suggested that those candidates are searched for from various species based on features captured by deep learning, which are not simply homologous to amino acid sequences of the base enzymes. This article is protected by copyright. All rights reserved

Dual Triggered Correspondence Topic (DTCT) model for MeSH annotation

Article

Aug 2020

Accurate Medical Subject Headings (MeSH)annotation is an important issue for researchers in terms of effective information retrieval and knowledge discovery in the biomedical literature. We have developed a powerful dual triggered correspondence topic (DTCT)model for MeSH annotated articles. In our model, two types of data are assumed to be generated by the same latent topic factors and words in abstracts and titles serve as descriptions of the other type, MeSH terms. Our model allows the generation of MeSHs in abstracts to be triggered either by general document topics or by document-specific “special” word distributions in a probabilistic manner, allowing for a trade-off between the benefits of topic-based abstraction and specific word matching. In order to relax the topic influences of non-topical words or domain-frequent words in text description, we integrated the discriminative feature of Okapi BM25 into word sampling probability. This allows the model to choose keywords, which stand out from others, in order to generate MeSH terms. We further incorporate prior knowledge about relations between word and MeSH in DTCT with phi -coefficient to improve topic coherence. We demonstrated the model's usefulness in automatic MeSH annotation. Our model obtained 0.62 F-score 150,00 MEDLINE test set and showed a strength in recall rate. Specially, it yielded competitive performances in an integrated probabilistic environment without additional post-processing for filtering MeSHs.

Database resources of the National Center for Biotechnology Information

Article

Full-text available

Jan 2015
NUCLEIC ACIDS RES

The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and Abstracts for published life science journals. Additional NCBI resources focus on literature (Bookshelf, PubMed Central (PMC) and PubReader); medical genetics (ClinVar, dbMHC, the Genetic Testing Registry, HIV-1/Human Protein Interaction Database and MedGen); genes and genomics (BioProject, BioSample, dbSNP, dbVar, Epigenomics, Gene, Gene Expression Omnibus (GEO), Genome, HomoloGene, the Map Viewer, Nucleotide, PopSet, Probe, RefSeq, Sequence Read Archive, the Taxonomy Browser, Trace Archive and UniGene); and proteins and chemicals (Biosystems, COBALT, the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), the Molecular Modeling Database (MMDB), Protein Clusters, Protein and the PubChem suite of small molecule databases). The Entrez system provides search and retrieval operations for many of these databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at http://www.ncbi.nlm.nih.gov.

Medical subject headings (MeSH) terms

Article

Sep 2014
Indian J Orthop

Development of Synbiotics with Inulin, Palatinose, α-Cyclodextrin and Probiotic Bacteria

Article

Jul 2014

Success in creating a synbiotic depends on compatibility between the chosen components: the prebiotic and the probiotic. In this work, the interactions between Lactobacillus sp. strains isolated from yogurts and type strains of Lactobacillus sp. and Lactococcus sp., and the dependence of their growth and antibacterial activity on three oligosaccharides (OS), palatinose, inulin and alpha-cyclodextrin, were investigated. All isolated lactobacilli produce antibacterial compounds, which possibly are the bacteriocins of Lactobacillus casei ATCC334 strain. Results of growth analysis with different OS revealed that some of the lactobacilli isolated from yogurts can effectively ferment inulin and may be used for the development of synbiotics. Palatinose and Lactobacillus acidophilus could be used as a synbiotic with effective antibacterial activity. One of the types of Lactococcus sp. strains can assimilate palatinose and alpha-cyclodextrin, so both can be used as components of synbiotics with the investigated lactococci. Results of this analysis suggest that the investigated isolated and type strains of Lactobacillus sp. and Lactococcus sp. can be useful as probiotics in the development of synbiotics. Together with the prebiotics palatinose, inulin and alpha-cyclodextrin, they can be used to create synbiotics that could regulate not only the growth of beneficial bacteria in the gastrointestinal tract but also their antibacterial activity.

Prebiotics as functional foods: A review

Article

Oct 2013

Prebiotics are short-chain carbohydrates that are not digested by human digestive enzymes and that selectively enhance the activity of some groups of beneficial bacteria. In the intestine, prebiotics are fermented by beneficial bacteria to produce short-chain fatty acids. Prebiotics also confer many other health benefits in the large intestine, such as reduced cancer risk and increased calcium and magnesium absorption. Prebiotics are found in several vegetables and fruits and are considered functional food components that present significant technological advantages. Their addition improves sensory characteristics such as taste and texture, and enhances the stability of foams, emulsions and mouthfeel in a large range of food applications like dairy products and bread. This contribution reviews bioactives from food sources with prebiotic properties. Additionally, food applications of bioactive prebiotics, stimulation of the viability of probiotics, health benefits, epidemiological studies, and safety concerns of prebiotics are reviewed.

Low Brain Magnesium in Migraine

Article

Oct 1989

Dietary modulation of the human colonic microbiology: introducing the concept of prebiotics

Article

Jan 1998

Prebiotics of plant and microbial origin

Article

Jan 2012

The tartaric acid-modified enzyme-resistant dextrin from potato starch as potential prebiotic

Article

Oct 2012

Term-Weighting Learning via Genetic Programming for Text Classification

Article

Oct 2014
KNOWL-BASED SYST

This paper describes a novel approach to learning term-weighting schemes (TWSs) in the context of text classification. In text mining, a TWS determines how documents are represented in a vector space model before a classifier is applied. Whereas acceptable performance has been obtained with standard TWSs (e.g., Boolean and term-frequency schemes), the definition of TWSs has traditionally been an art. Further, it is still difficult to determine the best TWS for a particular problem, and it is not yet clear whether schemes better than those currently available can be generated by combining known TWSs. We propose in this article a genetic program that aims at learning effective TWSs that can improve the performance of current schemes in text classification. The genetic program learns how to combine a set of basic units to give rise to discriminative TWSs. We report an extensive experimental study comprising data sets from thematic and non-thematic text classification as well as from image classification. Our study shows the validity of the proposed method; in fact, we show that TWSs learned with the genetic program outperform traditional schemes and other TWSs proposed in recent works. Further, we show that TWSs learned from a specific domain can be effectively used for other tasks.

maxent: An R Package for Low-memory Multinomial Logistic Regression with Support for Semi-automated Text Classification

Article

Jun 2012

Timothy P. Jurka

maxent is a package with tools for data classification using multinomial logistic regression, also known as maximum entropy. The focus of this maximum entropy classifier is to minimize memory consumption on very large datasets, particularly sparse document-term matrices represented by the tm text mining package.
