
A MeSH-based text mining method for identifying novel prebiotics

Authors:
  • National Center for Protein Sciences (Beijing)

Abstract and Figures

Prebiotics contribute to the well-being of their host by altering the composition of the gut microbiota. Discovering new prebiotics is a challenging and arduous task due to strict inclusion criteria; thus, highly limited numbers of prebiotic candidates have been identified. Notably, the large numbers of published studies may contain substantial information attached to various features of known prebiotics that can be used to predict new candidates. In this paper, we propose a medical subject headings (MeSH)-based text mining method for identifying new prebiotics with structured texts obtained from PubMed. We defined an optimal feature set for prebiotics prediction using a systematic feature-ranking algorithm with which a variety of carbohydrates can be accurately classified into different clusters in accordance with their chemical and biological attributes. The optimal feature set was used to separate positive prebiotics from other carbohydrates, and a cross-validation procedure was employed to assess the prediction accuracy of the model. Our method achieved a specificity of 0.876 and a sensitivity of 0.838. Finally, we identified a high-confidence list of candidates of prebiotics that are strongly supported by the literature. Our study demonstrates that text mining from high-volume biomedical literature is a promising approach in searching for new prebiotics.
Guangyu Shan, MS, Yiming Lu, PhD, Bo Min, PhD, Wubin Qu, MS, Chenggang Zhang, PhD
Abbreviations: AUC = area under the curve, MeSH = medical subject headings, NLM = National Library of Medicine, RF = random forest, ROC = receiver operating characteristic curve, XML = extensible markup language.
Keywords: Carbohydrates, MeSH-term, Prebiotics, Prebiotics prediction, Text mining
1. Introduction
The health benefits of prebiotics, such as cancer risk reduction, immune system enhancement, and constipation relief, have been widely accepted. A food ingredient can be considered a prebiotic only when it satisfies 3 criteria: (1) resistance to gastric acidity and mammalian enzymes, (2) susceptibility to fermentation by intestinal microbiota, and (3) selective stimulation of the growth and/or activity of beneficial intestinal microbiota.[1] Identifying new prebiotics in accordance with these 3 criteria by screening various chemical compounds is a laborious and challenging task. Scientists have been performing related work since 1995, when the criteria were first proposed. However, only 2 carbohydrates had been reported by 2007: inulin and fructooligosaccharides.[1]
Several researchers began to develop other approaches by reviewing published literature and searching for keywords in PubMed, and 3 carbohydrates were shown to alter the microbiota balance of the large bowel by increasing the number of bifidobacteria and lactobacilli. The success of these studies suggested the possibility of using a text mining-based method to identify prebiotics by transforming the inclusion criteria into a collection of literal features. Text mining efforts have developed a variety of approaches to obtain information from structured biomedical text using techniques such as machine learning, natural language processing, biostatistics, information technology, and pattern recognition.[2]
In the rapidly growing fields of knowledge discovery and text mining, relevant literature can be used to obtain implicit and unrevealed information. Swanson[3] began to mine information from biomedical literature for Raynaud disease treatment in 1986. He found from a biomedical paper that Raynaud disease is a peripheral circulatory disorder associated with and exacerbated by high platelet aggregation, high blood viscosity, and vasoconstriction; in other biomedical literature, he found that fish oil could reduce these symptoms. Accordingly, he proposed the hypothesis that fish oil may be helpful for people suffering from Raynaud disease, which had not previously been reported. Three years later, this hypothesis was clinically confirmed by DiGiacomo et al.[4]
Following the same approach, Ramadan et al[5] traced 11 indirect connections between migraines and magnesium using summaries of published papers, and the effect of magnesium was later experimentally validated.[6] Text mining has thus become an indispensable tool for extracting knowledge from biomedical literature.

Editor: Giovanni Tarantino.
GS and YL have contributed equally to this work.
Author Contributions: Conceived and designed the experiments: GS, YL, and BM. Performed the experiments: GS. Analyzed the data: GS, YL, and BM. Contributed reagents/materials/analysis tools: GS. Wrote the paper: GS, LY, WQ, and CZ.
Funding provided by the National Basic Research Project (973 program) (2012CB518200), the General Program (31401141, 81573251, 30900830) of the Natural Science Foundation of China, the State Key Laboratory of Proteomics of China (SKLP-Y201303, SKLP-O201104, and SKLP-K201004), and the Special Key Programs for Science and Technology of China (2012ZX09102301016).
The authors have no conflicts of interest to disclose.
Supplemental Digital Content is available for this article.
Beijing Institute of Radiation Medicine, State Key Laboratory of Proteomics, Cognitive and Mental Health Research Center, Beijing, PR China.
Correspondence: Chenggang Zhang, Academy of Military Medical Sciences, Beijing, PR China (e-mail: zhangcg@bmi.ac.cn).
Copyright © 2016 the Author(s). Published by Wolters Kluwer Health, Inc. All rights reserved.
This is an open access article distributed under the terms of the Creative Commons Attribution-Non Commercial-No Derivatives License 4.0 (CCBY-NC-ND), where it is permissible to download and share the work provided it is properly cited. The work cannot be changed in any way or used commercially.
Medicine (2016) 95:49(e5585)
Received: 7 August 2016 / Received in final form: 2 November 2016 / Accepted: 7 November 2016
http://dx.doi.org/10.1097/MD.0000000000005585
Observational Study
Feature selection is a critical procedure in text mining for teasing out valuable features from large amounts of data.[7] Many techniques, such as support vector machine (SVM),[8] genetic programming (GP),[9,10] logistic regression (LR),[11] and probabilistic neural network (PNN),[12] can perform this process only in a general and cursory manner. MedMeSH summarizer can assess very large amounts of biomedical data in a short period and is generally used for genome-wide expression profiles.[13] MedMeSH summarizer can achieve decent performance in specific, as opposed to general, assessments.
Inspired by MedMeSH and the philosophy of mining tacit knowledge from biomedical literature, we herein developed a novel medical subject headings (MeSH)-based text mining method for identifying new prebiotics utilizing the PubMed database. PubMed comprises more than 24 million citations for biomedical literature from MEDLINE, life science journals, and online books.[14] MeSH is the National Library of Medicine (NLM)-controlled vocabulary thesaurus used for indexing articles in PubMed. We extracted features from MeSH terms because they are easily available through the PubMed service of the National Library of Medicine, whereas full texts of research studies are often only accessible by subscription.[15] Additionally, utilizing MeSH terms rather than the full text not only reduces computation time but also enables higher dataset throughput.[16] Bhattacharya et al demonstrated that MeSH terms could represent the whole text accurately if screened appropriately; that is, we can extract representative features from massive amounts of literature using these high-quality terms.[16]
We hypothesized that carbohydrates with the properties of prebiotics share similar literal features. To better extract the features of known prebiotics, we first used an exhaustive text mining approach to mine prebiotic-related topical MeSH terms from structured documents downloaded from PubMed. We then selected a list of optimal MeSH terms that are closely related to known prebiotics[17] and ranked a large set of carbohydrates according to the scores calculated from their MeSH frequency profiles. Finally, we used a cross-validation technique to assess the prediction accuracy of our model.
2. Methods
2.1. Data preparation
First, 2 kinds of data were prepared: the positive prebiotics set and the carbohydrates set. We used a list of positive prebiotics summarized by Al-Sheraji et al.[14] The list, given in Table 1, contains 15 prebiotics that we denote as the positive prebiotics set. Nearly all positive prebiotics are non-digestible carbohydrates. Thus, we constructed the carbohydrates set using the official names of all available carbohydrates from the NLM MeSH tree structures. To ensure the specificity of the prediction, only carbohydrates that belong to the lowest level of the tree were selected, except where the lowest-level carbohydrates could not cover the carbohydrates represented by their parent node (in this case, the parent node was also included). Positive prebiotics were removed from the carbohydrates set. The final carbohydrates set contains 112 carbohydrates (Supporting Information, S1 Table. The official names of carbohydrates set. (XLSX), http://links.lww.com/MD/B447; S2 Table. The official names of 50 positives for method validation. (XLSX), http://links.lww.com/MD/B448). The name of each of the 15 positive prebiotics and 112 carbohydrates was used as a query to search for relevant literature in PubMed, and the hit documents were downloaded in extensible markup language (XML) format. MeSH terms in the XML documents were extracted using the ElementTree Python package. Each substance therefore has a MeSH term list extracted from its relevant literature. Each list contains thousands of features, which provides a robust foundation for the final model. This study did not require ethical approval or informed consent because all analyses were based on data extracted from previously published literature.
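The extraction step can be illustrated with a minimal sketch based on Python's standard xml.etree.ElementTree module. The sample document below is a hand-written stand-in that imitates the MeshHeading/DescriptorName layout of PubMed XML exports, and the function name mesh_term_counts is hypothetical; the authors' actual script is not given in the paper.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written stand-in for a PubMed XML export (assumption: real exports
# contain many PubmedArticle records with the same MeshHeading layout).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><MeshHeadingList>
    <MeshHeading><DescriptorName>Inulin</DescriptorName>
      <QualifierName>metabolism</QualifierName></MeshHeading>
    <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
  </MeshHeadingList></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><MeshHeadingList>
    <MeshHeading><DescriptorName>Inulin</DescriptorName>
      <QualifierName>pharmacology</QualifierName></MeshHeading>
  </MeshHeadingList></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""


def mesh_term_counts(xml_text):
    """Count how often each MeSH descriptor occurs across all records."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for descriptor in root.iter("DescriptorName"):
        if descriptor.text:
            counts[descriptor.text.strip()] += 1
    return counts


counts = mesh_term_counts(SAMPLE_XML)
print(counts.most_common())
```

Running the function once per query substance would give one MeSH term list per prebiotic or carbohydrate, which is the input assumed by the later steps.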
2.2. Stop words filtering
Stop words, which can undermine the efficacy and effectiveness of the mining task because of their high frequency, usually need to be removed first. MeSH curators have removed traditional stop words such as "a," "the," and "for"; however, some MeSH terms with extremely high frequency remain, which significantly reduces model performance. These MeSH terms were filtered according to Zipf's law, which states that the frequency of a word is inversely proportional to its frequency rank among all words in a given natural language corpus. Thus, the purity of the corpus can be optimized by removing MeSH terms with particularly high frequency using the following filter procedure.
1. Initiate a query list containing all carbohydrates in the positive prebiotics set and the carbohydrates set;
2. Rank their MeSH terms in descending order of total frequency. We considered the first region (the top 20 high-frequency terms) of the Zipf curve. Four colleagues in our lab specializing in prebiotics helped to examine the candidate list and to remove from it the terms that are biologically important;
3. The remaining MeSH terms from this region constituted the MeSH stop words list.
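The filter procedure above can be sketched as follows. This is only an illustration under stated assumptions: the function name stop_word_candidates, the toy counts, and the keep argument (standing in for the manual review by domain experts) are all hypothetical.

```python
from collections import Counter


def stop_word_candidates(term_counts_by_substance, top_n=20, keep=()):
    """Return the top_n most frequent MeSH terms, minus manually kept terms.

    term_counts_by_substance: substance name -> {MeSH term: raw count}
    keep: terms that reviewers judged biologically important (assumption:
          this argument stands in for the manual expert review).
    """
    total = Counter()
    for counts in term_counts_by_substance.values():
        total.update(counts)  # adds the counts term by term
    kept = set(keep)
    ranked = [term for term, _ in total.most_common(top_n)]
    return [term for term in ranked if term not in kept]


# Toy counts, invented purely for illustration.
profiles = {
    "Inulin": {"Humans": 90, "Animals": 60, "Fermentation": 40, "Bifidobacterium": 5},
    "Lactulose": {"Humans": 80, "Animals": 30, "Fermentation": 10},
}
stop_words = stop_word_candidates(profiles, top_n=3, keep={"Fermentation"})
print(stop_words)
```

With top_n set to 20 and an expert-curated keep list, the returned terms would play the role of the MeSH stop words list.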
Table 1
Types and sources of known prebiotics.
Type of prebiotic             Sources of prebiotics                     References
Inulin                        Wheat, onion, bananas                     [1]
Fructooligosaccharides        Asparagus, sugar beet, garlic, etc.       [18]
Isomaltulose                  Honey, sugarcane juice                    [19]
Xylooligosaccharides          Bamboo shoots, fruits, vegetables, etc.   [20]
Galactooligosaccharides       Human's milk and cow's milk               [21]
Cyclodextrins                 Water-soluble glucans                     [22]
Raffinose oligosaccharides    Seeds of legumes, lentils, peas, etc.     [23]
Soybean oligosaccharides      Soybean                                   [24]
Lactulose                     Lactose (milk)                            [25]
Lactosucrose                  Lactose                                   [26]
Palatinose                    Sucrose                                   [19]
Maltooligosaccharides         Starch                                    [27]
Isomaltooligosaccharides      Starch                                    [27]
Arabinoxylooligosaccharides   Wheat bran                                [28]
Enzyme-resistant dextrin      Potato starch                             [29]

2.3. Data normalization
Normalization of MeSH term frequencies is necessary because well-studied prebiotics retrieve much more literature than other prebiotics, which would introduce bias into the ultimate feature set of the cluster. To avoid this situation, the frequency matrix is normalized according to Eq. (1), where α (0 ≤ α ≤ 1) is a normalization parameter controlling the degree of correlation with the corpus volume: α = 0 implies no normalization and α = 1 implies complete normalization. We first build a positive prebiotics MeSH frequency matrix $f_{ij}$ of numerical values, where each row represents a prebiotic and each column refers to a MeSH term occurring in the positive prebiotics set. M denotes the number of prebiotics (rows). Thus, $F_{ij}$ is the absolute MeSH term frequency, whereas $f_{ij}$ is the relative MeSH term frequency of each positive prebiotic.
$$f_{ij} = \frac{F_{ij}}{\left(\sum_{i=1}^{M} F_{ij}\right)^{\alpha}}, \quad 0 \le \alpha \le 1 \tag{1}$$
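A small sketch may help to show how the normalization parameter in Eq. (1) behaves at its two extremes. It rests on one explicit assumption: the sum in the denominator is read as the total MeSH term count of one substance, which matches the stated goal of correcting for corpus volume (the printed indices leave this open). The function name normalize, the argument name alpha, and the toy counts are hypothetical.

```python
def normalize(freq_rows, alpha):
    """Divide every raw count in a row by (row total) ** alpha.

    freq_rows: one list of raw MeSH term counts per substance.
    Assumption: the denominator of Eq. (1) is the substance's total count.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    normalized = []
    for row in freq_rows:
        total = sum(row)
        scale = total ** alpha if total > 0 else 1.0  # avoid dividing by zero
        normalized.append([count / scale for count in row])
    return normalized


# Toy counts: a well-studied substance (100 hits) and a rare one (8 hits).
raw = [[80, 20, 0], [4, 0, 4]]
unscaled = normalize(raw, 0.0)  # alpha = 0: counts are left unchanged
relative = normalize(raw, 1.0)  # alpha = 1: each row becomes proportions
print(unscaled)
print(relative)
```

Intermediate values of alpha shrink, but do not remove, the advantage of substances with a large literature.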
2.3.1. Feature selection. To select features from the matrix described above, we utilized the MedMeSH summarizer's algorithm, which has been applied to assign pertinent MeSH terms to describe the functionality of a group of genes.[30] MedMeSH summarizer summarizes a group of genes by filtering biomedical literature and assigning relevant keywords describing the functionality of the genes. This system constructs a P × Q co-occurrence matrix, where P denotes the genes in the cluster and Q denotes the MeSH terms extracted from the retrieved literature. Each cell value of the matrix is the frequency of a MeSH term. With this matrix, an overall score of each MeSH term can be calculated, and the most influential terms are screened to describe the functionality of the cluster. Here, we utilized this matrix to classify all the MeSH terms into two fields: major topics and particular topics.
2.3.2. Major Topics. Terms occurring in most prebiotics with high frequency. N denotes the number of MeSH terms (columns). Criterion $R_1$: rank the MeSH terms in decreasing order of the means $\mu_i$.
$$\mu_i = \frac{\sum_{j=1}^{N} f_{ij}}{N}, \quad i = 1, \ldots, M \tag{2}$$
2.3.3. Particular Topics. Terms occurring in a subset of prebiotics with high frequency. $\sigma_i$ in Eq. (3) is the standard deviation of the MeSH feature vector. Criterion $R_2$: rank the MeSH terms in decreasing order of the ratios $\sigma_i^2/\mu_i$.
$$\sigma_i = \sqrt{\frac{\sum_{j=1}^{N} (f_{ij} - \mu_i)^2}{N}}, \quad i = 1, \ldots, M \tag{3}$$
All MeSH terms in the matrix are ranked in accordance with the 2 criteria described previously and assigned an overall rank R by Eq. (4). The weight parameter w provides a summary of the cluster by balancing the major and particular topics. MeSH terms are arranged by their overall relevance ranks R in ascending order. The top k MeSH terms are then retained as the prebiotics summary feature set, which is used to construct the normalized matrix for subsequent prediction.
$$R = wR_1 + (1 - w)R_2, \quad 0 \le w \le 1 \tag{4}$$
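The ranking defined by Eqs. (2) to (4) can be sketched in a few lines. The sketch assumes that each MeSH term carries one normalized frequency per positive prebiotic, so that the mean and the population standard deviation are taken per term; the function names, the alphabetical tie-breaking rule, and the toy profiles are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def rank_positions(scores):
    """Map term -> 1-based rank, highest score first (ties broken by name)."""
    ordered = sorted(scores, key=lambda term: (-scores[term], term))
    return {term: position for position, term in enumerate(ordered, start=1)}


def overall_ranks(profiles, w):
    """profiles: term -> normalized frequencies, one value per prebiotic."""
    means = {t: mean(v) for t, v in profiles.items()}
    # Variance-to-mean ratio (sigma^2 / mu); pstdev divides by N as in Eq. (3).
    ratios = {
        t: (pstdev(v) ** 2 / means[t] if means[t] > 0 else 0.0)
        for t, v in profiles.items()
    }
    r1 = rank_positions(means)   # criterion R1: major topics
    r2 = rank_positions(ratios)  # criterion R2: particular topics
    return {t: w * r1[t] + (1 - w) * r2[t] for t in profiles}


def top_k_terms(profiles, w, k):
    """Keep the k terms with the smallest overall rank R."""
    combined = overall_ranks(profiles, w)
    return sorted(combined, key=lambda t: (combined[t], t))[:k]


# Toy profiles for 3 prebiotics, invented for illustration.
profiles = {
    "Fermentation": [0.4, 0.4, 0.4],     # frequent everywhere: major topic
    "Bifidobacterium": [0.9, 0.0, 0.0],  # frequent in one: particular topic
    "Solubility": [0.1, 0.1, 0.1],
}
major_first = top_k_terms(profiles, 1.0, 2)
particular_first = top_k_terms(profiles, 0.0, 1)
print(major_first, particular_first)
```

Setting w to 1 reproduces the pure major-topic ordering and w to 0 the pure particular-topic ordering, which is why an intermediate weight balances the two.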
2.4. Parameter optimization
Three key parameters, α, w, and k, were screened for feature selection. α ranges from 0 to 1; 0 implies no normalization and 1 implies complete normalization. w also ranges from 0 to 1; 1 implies that the major topic terms dominate the feature set and 0 implies that the particular topics dominate. The last parameter, k, is the number of features retained in the final feature set.
An exhaustive global grid search was implemented to screen for the optimal parameter set. All possible combinations of the parameter values are evaluated, and the best combination is retained. Each parameter is assigned a suitable range: α ∈ [0, 1], step = 0.2; w ∈ [0, 1], step = 0.1; k ∈ [200, 1000], step = 200. To evaluate the performance of the parameter sets, we employed a 5-fold cross-validation method. After repeating the simulation 100 times, the average rank of the 3 held-out positive prebiotics is used to assess the performance of each parameter set. A more accurate model is expected to rank positive prebiotics at the top of the predicted list; thus, a smaller average rank value means higher rank positions for them, which indicates a better parameter set.
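The size of the search space follows directly from the stated ranges: 6 values of the normalization parameter, 11 weights, and 5 feature-set sizes give 330 combinations. The sketch below enumerates them; the objective toy_average_rank is a placeholder with an arbitrary minimum, standing in for the cross-validated average rank, and every name in the block is hypothetical.

```python
from itertools import product


def frange(start, stop, step):
    """Inclusive grid of decimals, rounded to avoid floating-point drift."""
    count = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(count + 1)]


alphas = frange(0.0, 1.0, 0.2)        # 0.0, 0.2, ..., 1.0
weights = frange(0.0, 1.0, 0.1)       # 0.0, 0.1, ..., 1.0
sizes = list(range(200, 1001, 200))   # 200, 400, ..., 1000


def grid_search(evaluate):
    """Return the (alpha, w, k) triple with the smallest average rank."""
    return min(product(alphas, weights, sizes), key=lambda p: evaluate(*p))


def toy_average_rank(alpha, w, k):
    """Placeholder objective; a real run would use repeated 5-fold CV."""
    return abs(alpha - 0.4) + abs(w - 0.3) + abs(k - 600) / 1000


n_combinations = len(alphas) * len(weights) * len(sizes)
best = grid_search(toy_average_rank)
print(n_combinations, best)
```

Because every combination is scored with 100 repeats of 5-fold cross-validation in the paper's setup, the grid is deliberately coarse.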
2.5. Feature enrichment analysis
In the XML document, each MeSH term has two attributes that were curated by an expert: "Descriptor Name" and "Qualifier Name." "Descriptor Name" refers to the official name of the MeSH term, and "Qualifier Name" refers to the specific related fields. For example, the MeSH term Inositol possesses the "Descriptor Name" Inositol and 2 "Qualifier Names," Chemistry and Pharmacology. Thus, the enrichment analysis extracts all "Qualifier Names" under each MeSH "Descriptor Name" and calculates their frequencies. The principal groups in the frequency distribution bar plot denote the properties of the MeSH group.
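The qualifier counting behind the enrichment analysis can be sketched with the standard library. The XML snippet is a hand-written stand-in that imitates the MeshHeading layout of PubMed exports, and the function name qualifier_counts and the feature set passed to it are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written stand-in for one PubMed record (assumption: real records
# list zero or more QualifierName elements after each DescriptorName).
RECORD_XML = """<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><MeshHeadingList>
    <MeshHeading><DescriptorName>Inositol</DescriptorName>
      <QualifierName>chemistry</QualifierName>
      <QualifierName>pharmacology</QualifierName></MeshHeading>
    <MeshHeading><DescriptorName>Liver</DescriptorName>
      <QualifierName>metabolism</QualifierName></MeshHeading>
  </MeshHeadingList></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""


def qualifier_counts(xml_text, feature_terms):
    """Count qualifiers attached to descriptors that are selected features."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for heading in root.iter("MeshHeading"):
        descriptor = heading.find("DescriptorName")
        if descriptor is None or descriptor.text not in feature_terms:
            continue  # skip headings that are not in the feature set
        for qualifier in heading.findall("QualifierName"):
            counts[qualifier.text] += 1
    return counts


enrichment = qualifier_counts(RECORD_XML, {"Inositol"})
print(enrichment.most_common())
```

Dividing each count by the number of selected features would give the proportions that a bar plot of principal qualifier groups displays.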
2.6. Random forest model training for comparison
Random forest is a machine learning algorithm that can handle sparse matrices and large numbers of variables. Using the MeSH term frequencies of positive and negative carbohydrates as features, random forest models were trained and tested with 100 repeats of 5-fold cross-validation, and the averaged areas under the receiver operating characteristic (ROC) curve (area under the curve [AUC]) were used for performance comparison across datasets. The training and testing procedures of the random forest model were implemented using the "randomForest" package in the R programming language.
2.7. Model evaluation and predicting novel prebiotics
We build the carbohydrate prediction matrix $f_{ij}$ of numerical values according to Eq. (1), where each row represents a carbohydrate and each column refers to a feature. This matrix can be used to predict novel prebiotics by Eq. (5). Each carbohydrate obtains its own score $R_B$, which denotes its potential to be a prebiotic.
$$R_B = \sum_{i=1}^{M} \frac{f_{ij}}{R_i} \tag{5}$$
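Eq. (5) amounts to a rank-weighted sum: each selected feature contributes its normalized frequency divided by its overall rank, so highly ranked features count most. The following sketch illustrates this with invented numbers; the function name prebiotic_score and both toy candidates are hypothetical.

```python
def prebiotic_score(term_frequencies, feature_ranks):
    """Sum f / R over the selected features; absent terms contribute 0.

    term_frequencies: MeSH term -> normalized frequency for one carbohydrate
    feature_ranks:    selected MeSH term -> overall rank R (smaller is better)
    """
    return sum(
        term_frequencies.get(term, 0.0) / rank
        for term, rank in feature_ranks.items()
    )


# Toy feature set with overall ranks, invented for illustration.
feature_ranks = {"Fermentation": 1, "Bifidobacterium": 2, "Dietary Fiber": 4}

candidate_a = {"Fermentation": 0.5, "Bifidobacterium": 0.2}
candidate_b = {"Dietary Fiber": 0.4, "Solubility": 0.6}  # Solubility is not a feature

score_a = prebiotic_score(candidate_a, feature_ranks)  # 0.5/1 + 0.2/2
score_b = prebiotic_score(candidate_b, feature_ranks)  # 0.4/4
print(score_a, score_b)
```

Sorting all carbohydrates by this score in descending order gives the prediction list that the cut-off is later applied to.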
Then, we carried out 5-fold cross-validation to evaluate the predictive performance of the model. In each round, 4 randomly generated folds were used for feature selection, and the fifth fold was reserved for prediction together with the carbohydrates set. That is, each round yields 2 columns for the prediction set: an $R_B$ score column and a binary state column (1 denotes prebiotics, 0 denotes not prebiotics). These 2 columns produce one AUC score. The prediction procedure was repeated 100 times, and the average AUC was used as the measure of prediction performance.
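One way to obtain an AUC from a score column and a binary state column is the pairwise (rank-based) formulation, sketched below. This is a generic textbook computation offered as an illustration, not the routine used by the authors; the function name auc and the toy data are hypothetical.

```python
def auc(scores, labels):
    """Share of (positive, negative) pairs in which the positive scores higher.

    Ties count as one half. Assumes both classes are present.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy round: score column and binary state column (1 = prebiotic).
toy_scores = [0.9, 0.8, 0.3, 0.6, 0.1]
toy_labels = [1, 0, 1, 0, 0]
round_auc = auc(toy_scores, toy_labels)
print(round_auc)
```

Averaging this value over the repeated rounds gives the summary figure used to compare models.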
A model returns a vector of scores between 0 and 1 for a combined prediction profile. These scores are then mapped to a binary state indicating "prebiotics" or "non-prebiotics" by choosing a cut-off. For each combination of profiles, the existence of a prebiotic is considered positive (P) or negative (N). True (T) means that the predicted and observed categories are identical, and false (F) implies otherwise. The notations TP, FP, TN, and FN combine these labels to give the number of data points (combined prediction profiles) in each category. These values correspond to a cut-off at which carbohydrate prediction ranks are mapped onto binary predictions. The predicted scores are transformed into binary predictions, and sensitivity and specificity are evaluated over the entire score range. The specificity is defined as TN/(FP + TN) and the sensitivity as TP/(TP + FN). Lastly, we calculate the average specificity and average sensitivity over the rounds (100 repeats). The best cut-off point for balancing the average sensitivity and average specificity of our model is the point on the ROC curve closest to the (0, 1) point. We use the corresponding cut-off, calculated via the R package ROCR,[43] to indicate potential prebiotics.
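The cut-off rule described above can be made concrete with a short sketch: for each candidate cut-off, compute sensitivity and specificity, then keep the cut-off whose ROC point (1 - specificity, sensitivity) lies closest to (0, 1). This is a plain Python illustration rather than the ROCR routine, and the function names and toy data are hypothetical.

```python
import math


def confusion(scores, labels, cutoff):
    """Return (TP, FP, TN, FN) when scores >= cutoff are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
    return tp, fp, tn, fn


def best_cutoff(scores, labels):
    """Pick the cut-off whose ROC point is closest to (0, 1).

    Assumes both classes are present, so neither denominator is zero.
    """
    best = None
    for cutoff in sorted(set(scores)):
        tp, fp, tn, fn = confusion(scores, labels, cutoff)
        sensitivity = tp / (tp + fn)
        specificity = tn / (fp + tn)
        distance = math.hypot(1 - specificity, 1 - sensitivity)
        if best is None or distance < best[0]:
            best = (distance, cutoff, sensitivity, specificity)
    return best[1:]


# Toy scores and labels (1 = prebiotic), invented for illustration.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0, 0]
cutoff, sensitivity, specificity = best_cutoff(toy_scores, toy_labels)
print(cutoff, sensitivity, specificity)
```

In this toy case the chosen cut-off captures all positives at the cost of one false positive, which illustrates the kind of balance the closest-to-(0, 1) rule aims for.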
3. Results
3.1. Text mining framework for novel prebiotics prediction
We developed a systematic MeSH-based text mining approach to robustly predict new prebiotics. The feature selection part of our method is inspired by the MedMeSH summarizer, a text mining algorithm that describes the functionality of a group of genes. Our method goes further: it not only summarizes a cluster by using MeSH terms but also predicts novel concepts with the same property from the cluster. In addition, MedMeSH summarizer uses a fixed parameter set for gene cluster summarizing. However, we found that a fixed parameter set usually causes many unrelated terms to emerge as topic terms in our dataset, which undermines the subsequent prediction result. To overcome this problem, we developed an exhaustive global search method to determine the optimal parameter set for our dataset of prebiotics. High-profile features were screened out and validated by feature enrichment analysis and the ROC plot.
The workflow of prebiotics prediction is shown in Fig. 1. We first collected known prebiotics from Table 1 and the carbohydrates set from the NLM MeSH tree structure, and used them as queries to retrieve MeSH-related documents from PubMed. To construct the profile of each substance (prebiotics or carbohydrates), MeSH terms were extracted from the retrieved literature and their frequency was calculated by Eq. (1). After that, we identified 10 MeSH terms as stop words: Animals, Humans, Male, Female, Rats, Adult, Mice, Aged, Middle Aged, and Child. Those terms were removed from the corpus prior to the following analysis.
Figure 1. The framework of prediction. 1. Download PubMed XML documents of 127 carbohydrates, including 15 positive prebiotics and 112 carbohydrates. 2. Compute the optimal parameter set (α, w, and k) for the model by exhaustive grid search and assign the top k features as the model feature set. 3. Use the ROC curve to evaluate the performance of the model. 4. Perform the prediction procedure to mine novel prebiotics. ROC = receiver operating characteristic curve, XML = extensible markup language.
Our model primarily aims to predict new prebiotics on the basis of MeSH frequency by extracting highly representative features, an approach originally employed by Kankar et al[30] in investigating the functionality of a gene group. We learned from their philosophy and adapted it to a more concrete task: novel prebiotics prediction. Unlike the previous one-size-fits-all solution for the gene set, we refined the feature discovery pattern by considering the unbalanced data across the feature selection procedure.
We calculated two parameters ($R_1$ and $R_2$) to identify different types of MeSH terms. $R_1$ is calculated from Eq. (2) and takes major topics into account, whereas $R_2$ is produced from Eq. (3) and considers particular topics. To improve the feature selection step, we specified an exhaustive grid search method to determine an optimal parameter set with 5-fold cross-validation. Each parameter in the model is traversed with a certain step over its value range. We then selected 800 features from the 15 positive prebiotics, as determined by the optimal parameter set (α = 1, w = 0.6, k = 800). Then, we used feature enrichment analysis and carbohydrates clustering to evaluate the performance of the feature set. The feature set represented the properties of prebiotics very well, which in turn reflects the performance of the optimal parameter set. After that, we evaluated the final model and used the ROC curve to select the threshold that denotes the boundary between carbohydrates with and without prebiotic properties. According to the threshold, the top 11 carbohydrates were identified as novel prebiotics. Finally, we conducted a thorough literature investigation of those new prebiotics.
3.2. Optimal parameter set for prebiotics prediction
The corpus volume associated with a carbohydrate often varies substantially between positive prebiotics and carbohydrates. Well-studied prebiotics, such as inulin and fructooligosaccharides, are substantially more common in research than other carbohydrates, which introduces strong bias into the model. To balance the effect of the corpus volume, we introduced the parameter α to control the extent of normalization of MeSH frequency. To balance the generic topics and particular topics, a weight parameter w is introduced to ensure that the final feature set takes these 2 diverse topics into full consideration. The last parameter, k, is the number of features retained in the final feature set. An optimal set of parameters is crucial for the precise prediction of prebiotics, and we used an exhaustive global grid search method to determine the optimal parameter set (see Section 2).
Performance analyses of each parameter are shown in Fig. 2. α = 1 achieves the best average rank regardless of the change in w, indicating that full normalization is necessary for the applied datasets, as shown in Fig. 2A. w = 0.6 (k = 800, α = 1.0) achieves the best average rank in Fig. 2B, suggesting that generic topics contribute more than particular topics to the known prebiotics summaries under full normalization. Beyond that, w under weak normalization (α = 0.2) has also been investigated to further understand the impact of normalization (results shown in Fig. 2C). w = 1.0 achieves the best average rank under weak normalization, suggesting that generic topics are used to represent the entire known prebiotics summary, which indicates that full normalization is necessary when encountering unbalanced data (otherwise, the system will automatically abandon a particular instance to maintain performance). Notably, when screening the optimal parameter α, the average rank is represented by an integration over w. Finally, the optimal parameters α = 1, w = 0.6, and k = 800 are chosen for further analyses. After determining the optimal parameter set, the two divergent topics (generic and particular) are balanced by the parameter w to generate a feature summary of positive prebiotics.
Figure 2. Exhaustive grid search for the optimal parameter set via 5-fold cross-validation. The figure describes the contribution of 3 parameters (α, w, and k) in the model. Each column adopts a fixed k. (A) shows that the optimal α was 1, whereas the optimal w = 0.6 and k = 800 were screened in (B). Weak normalization was also investigated in (C).
3.3. Feature enrichment analysis and carbohydrates clustering
To investigate the major topics of the selected features, an enrichment analysis was performed (see Section 2). The result is shown in Fig. 3. Interestingly, >95%, >70%, and >70% correspond to metabolism, chemistry, and pharmacology, respectively, coinciding with our prior knowledge that prebiotics usually play major roles in the metabolism of the human body owing to their various chemical structures and pharmacological properties. In other words, these vital properties are concealed in the feature summary, and our method extracts them and uses them for prediction.
To examine the quality of the 800 selected features, we further applied hierarchical clustering to determine whether these features can cluster related carbohydrates adjacent to each other. Hierarchical clustering is a widely used data analysis tool that provides dataset summaries by grouping similar observations into 1 cluster.[31] In the real-world case presented in Fig. 4, the clustered carbohydrates notably shared a similar structure with the MeSH tree in NLM. For instance, cyclodextrins are cyclic oligosaccharides consisting of 6 (α-cyclodextrins), 7 (β-cyclodextrins), 8 (γ-cyclodextrins), or more glucopyranose units linked by α-(1,4) bonds, and they form a child node of dextrins in the MeSH tree (green block at 9 o'clock).[37] In addition to this dextrins branch, other branches, such as the agar branch (red block at 8 o'clock), oligosaccharides branch (green block at 4 o'clock), and fructans branch (green block at 1 o'clock), also achieve high similarity with the MeSH tree. These results indicate that the features we selected may be effective in further prebiotics prediction.
3.4. Model evaluation and prebiotics prediction
The ROC curve is employed for model evaluation. Because of the limited size (only 15) of the positive set, we first enlarged the positive set to 50 to validate our method. The 50 positives comprise the previous 15 positive prebiotics and 35 carbohydrates under the polysaccharides node in the NLM MeSH tree; their names are in S2 Table, http://links.lww.com/MD/B448. Using the 50 positives and the remaining 77 carbohydrates, we obtained the optimal parameters α = 1, w = 0.3, and k = 800 with an average rank of 11.905. The optimal parameters were used for model evaluation by a 5-fold cross-validation ROC curve. In addition, we compared our method with a machine learning method. The frequency matrix for machine learning is extremely sparse and contains more than 20,000 variables. The random forest algorithm handles large numbers of variables and overfitting very well, so we chose it for the comparison (see Section 2).
Figure 5A shows a 5-fold cross-validation ROC curve for the model with 50 positives. With the enlarged positive set, our model performs well, with an AUC of 0.891, which is better than that of the random forest algorithm (AUC of 0.846). After the method validation step with 50 positives, we turned to the 15 positive prebiotics and performed the real-world ROC evaluation.
Figure 5B shows a 5-fold cross-validation ROC curve for the model with 15 positives. Surprisingly, the performance of our model is far better than that of the random forest algorithm. This suggests that our method can be a good choice for highly imbalanced data (112 negatives vs. 15 positives). We obtained an AUC of 0.911, and a cut-off of 0.013 maintains the optimal balance between average specificity and average sensitivity. This cut-off selects the top 11 ranked carbohydrates in the prediction list, which may have prebiotic properties. Those predicted novel prebiotics are presented in Table 2, and some of them have been investigated by prebiotics experts. The average specificity and sensitivity were 0.876 and 0.838, respectively.
In addition to evaluating the model and predicting potential prebiotics, we also investigated related literature evidence for the 11 potential prebiotics based on the original definition of prebiotics: a prebiotic is a selectively fermented ingredient that allows specific changes, both in the composition and/or activity in the gastrointestinal microbiota, that confer benefits upon host well-being and health. Most of the predicted prebiotics are supported by the literature analysis for 2 of the 3 criteria of prebiotics (non-digestibility, fermentation, and selectivity), and there are no obvious conflicts with these criteria. Even for the most rigorous criterion (selectivity), there are also many items with promising clues. For example, isomaltose has been shown to represent a prebiotic with digestion-resistant properties, raffinose is a complex carbohydrate that can promote the growth of beneficial microorganisms, and acarbose is usually administered in diabetes treatment and has promising potential as a prebiotic.[40]
Additionally, cyclodextrin is a saccharide that can reduce the digestion of carbohydrates and lipids. The derivative α-cyclodextrin is a soluble dietary fiber that possesses the ability to feed one of the Lactococcus sp. strains in the gastrointestinal tract,[42] whereas the other derivative (β-cyclodextrin) has been shown to be an important component of low-fat foods.[43] In summary, this promising list not only shows prospective prebiotics but also demonstrates the efficacy of our model.
Figure 3. Feature enrichment analysis. The top 20 qualifier names were extracted from the 800 features. The categories in the figure roughly indicate the high-level concepts of the 800 features. These concepts are highly correlated with the chemical properties of real-world prebiotics.
Shan et al. Medicine (2016) 95:49 Medicine
4. Discussion
It should be noted that our method depends on MeSH terms. Curators typically assign 10 to 12 MeSH terms to describe most indexed papers in PubMed, but a small portion of papers have not yet been curated. For these overlooked papers, we suggest that keywords be extracted manually from their abstracts and titles for information integrity. In addition, almost all text mining methods, including ours, are partly limited by the size and type of the data set, and the predictive power of our method in other data-intensive fields has not been tested.
Prebiotics can supply vast health benefits to healthy or unhealthy people. Despite their significant demonstrated medical effects, the discovery and application of prebiotics cannot meet the growing needs of the prebiotic market simply by manually matching candidates to criteria. In an effort to improve the efficiency of prebiotics mining, we herein present a methodology that uses text mining techniques to expand the variety of potential prebiotics identified from the related literature.
Figure 4. Hierarchical clustering of carbohydrates. Carbohydrates were clustered hierarchically. Many branch structures are highly correlated with the MeSH tree in NLM; we could therefore infer that the features capture a large portion of prebiotic properties and can be employed to predict potential prebiotics in the prediction step. MeSH = medical subject headings, NLM = National Library of Medicine.
We explored the optimal parameter set in an exhaustive grid search: each important parameter (a, w, and k) was evaluated over a spectrum of potential values. In the parameter selection process, the parameter a is effective in trading off corpus volume, even when the volumes of certain corpuses differ by a wide gulf (10^3 to 10^4). The parameters w and k also substantially impact the predictive performance. To more accurately determine the variation tendency with corpus volume, we performed additional analyses plotting the average rank score against each w and k at a specific lower a (a = 0.2) after determining the optimum a (1.0). Corpus volumes in our experiment vary substantially; thus, a is intended to narrow the focus on yielding reasonable parameters. Likewise, our parameter selection process may provide a solution for other corpuses, especially those with volume-unbalanced data.
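The exhaustive grid search described above can be sketched as follows. The objective `toy_average_rank` is a hypothetical placeholder, not the model's real scoring function: it is a smooth bowl whose minimum is placed at a = 1.0, w = 0.3, k = 800 only so that the demonstration has a known answer. The candidate grids are likewise assumptions. In practice the objective would be the average rank of the known positives obtained by running the ranking model with each parameter combination.

```python
from itertools import product

def grid_search(objective, grid):
    """Evaluate objective on every parameter combination and return (best_params, best_value).

    Lower values are better, matching an average-rank criterion for known positives.
    """
    names = list(grid)
    best_params, best_value = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        value = objective(**params)
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value

def toy_average_rank(a, w, k):
    # Hypothetical stand-in for "average rank of known prebiotics under (a, w, k)".
    # It does not reproduce the paper's model or its reported average rank.
    return (a - 1.0) ** 2 + (w - 0.3) ** 2 + ((k - 800) / 1000) ** 2

# Assumed candidate values; the paper's actual search ranges are not given in this section.
grid = {
    "a": [0.2, 0.4, 0.6, 0.8, 1.0],
    "w": [0.1, 0.2, 0.3, 0.4, 0.5],
    "k": [200, 400, 600, 800, 1000],
}

best_params, best_value = grid_search(toy_average_rank, grid)
print(best_params, best_value)
```

Fixing one parameter (for example a = 0.2) and sweeping the other two over the same grid yields the kind of average-rank curves against w and k mentioned in the text.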
Notwithstanding inevitable practical constraints, we believe that our work is an important step in identifying more prebiotics, thereby yielding meaningful results and providing a basis for future development and experimentation. We identified critical factors affecting the mining work and developed methods for characteristic selection on volume-unbalanced data to assess predictive performance. We also performed clustering measurements to evaluate the selected characteristics for known prebiotics. The ROC curve, which evaluates the model fit for the optimal parameter set, showed that the possibilities we identified are sufficiently consistent to create a list of potential prebiotics for further research. In the list of 11 potential prebiotics, apart from the promising specific carbohydrates, some relatively broad categories, such as xylans, fructans, and dextrins, also appear, indicating a promising field of potential prebiotics.
Overall, the MeSH-based text mining method provides a bridge between the availability of tens of thousands of studies with curated MeSH terms and the emerging functionality of prebiotics studies, which have found few prebiotics over many years. For the former, our algorithm dramatically enhances the power to discover potential prebiotics hidden in countless studies. For the latter, new candidates for potential prebiotics that are useful in prebiotics research come to light. Regarding future directions, the thousands of studies at hand in an entire literature corpus (rather than individual studies) can assist us in other fields, such as finding bacteria that can perform certain functions or obtaining food for soldiers, which may represent a niche need in future studies.
In this integrated analysis, we present new ideas and instructions that are helpful to researchers. Our results indicate that there are currently no universal parameters for the mining task and that a parameter set reported to work for a specific corpus may not be an appropriate choice for other research. As we noted, an exhaustive grid search is recommended to customize the parameter set, not only to determine the best parameter settings for given corpuses but also to assess their potential prediction performance. Taken together, the algorithm development in our study is meaningful in a wide range of biological scenarios, and the prebiotics set obtained in this study may provide novel text mining-based insights and clues in the prebiotics field. Follow-up studies are warranted to validate the findings herein; moreover, additional defined prebiotic substances and related documents will improve the model. Our text mining-based study lays the foundation for efficient mining of potential prebiotics, which may be a promising method in the difficult field of prebiotics research.
Figure 5. Cross-validation ROC analyses were used to evaluate model performance and determine the ranking threshold. (A) The ROC plot indicates that our method (red) performs better than random forest (green) with 50 positives; that is, our method can discriminate well between known prebiotics and other carbohydrates. The 45° diagonal line (dashed) indicates the theoretical plot of a test with no discrimination between known prebiotics and carbohydrates. (B) The ROC plot indicates that our method (red) performs far better than random forest (green) with 15 positives. The cut-off is the threshold beyond which we deem carbohydrates to possess prebiotic properties. ROC = receiver operating characteristic curve.
Table 2
Summary and conclusion on the prebiotic effect of 11 potential prebiotics.
| Rank | Carbohydrates | Non-digestibility | Fermentation | Selectivity | References |
|---|---|---|---|---|---|
| 1 | Isomaltose | Yes | n.c. | Yes | [32] |
| 2 | Xylans | Yes | Yes | n.c. | [33] |
| 3 | Fructans | Yes | Yes | n.c. | [34] |
| 4 | β-Cyclodextrins | Yes | Yes | n.c. | [35] |
| 5 | Raffinose | Yes | n.c. | Yes | [36] |
| 6 | Dextrins | Yes | n.c. | n.c. | [37] |
| 7 | α-Cyclodextrins | Yes | Yes | n.c. | [38] |
| 8 | Mitobronitol | Probable | n.c. | n.c. | [39] |
| 9 | Oligosaccharides, branched-chain | Yes | Yes | n.c. | [32] |
| 10 | Acarbose | Yes | Yes | n.c. | [40] |
| 11 | Xylose | Yes | n.c. | Yes | [41] |
n.c. = not clear.
Acknowledgments
We thank Miss Xin Song for critical discussion and suggestions.
We would also like to acknowledge the generous funding
provided by the National Basic Research Project (973 program)
(2012CB518200), the General Program (31401141, 81573251,
30900830) of the Natural Science Foundation of China, the State
Key Laboratory of Proteomics of China (SKLP-Y201303, SKLP-
O201104 and SKLP-K201004), and the Special Key Programs
for Science and Technology of China (2012ZX09102301016).
References
[1] Roberfroid M. Prebiotics: the concept revisited. J Nutr 2007;137(Suppl 2):830S–7S.
[2] Gupta V, Lehal GS. A survey of text mining techniques and applications. J Emerg Technol Web Intell 2009;1:60–76.
[3] Swanson DR. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect Biol Med 1986;30:7–18.
[4] DiGiacomo RA, Kremer JM, Shah DM. Fish-oil dietary supplementation in patients with Raynaud's phenomenon: a double-blind, controlled, prospective study. Am J Med 1989;86:158–64.
[5] Ramadan NM, Halvorson H, Vande-Linde A, et al. Low brain magnesium in migraine. Headache 1989;29:416–9.
[6] Ferrari MD. Biochemistry of migraine. Pathol Biol 1992;40:287–92.
[7] Tsuruoka Y, Tateishi Y, Kim JD, et al. Developing a robust part-of-speech tagger for biomedical text. Lect Notes Comput Sci 2005;3746:382–92.
[8] Tong S, Koller D. Support vector machine active learning with applications to text classification. J Mach Learn Res 2002;2:45–66.
[9] Escalante HJ, Garcia-Limon MA, Morales-Reyes A, et al. Term-weighting learning via genetic programming for text classification. Knowl-Based Syst 2015;83:176–89.
[10] Hirsch L, Saeedi M, Hirsch R. Evolving text classification rules with genetic programming. Appl Artif Intell 2005;19:659–76.
[11] Jurka TP. Maxent: an R package for low-memory multinomial logistic regression with support for semi-automated text classification. R J 2012;4:56–9.
[12] Ciarelli PM, Oliveira E. An Enhanced Probabilistic Neural Network Approach Applied to Text Classification. Prog Pattern Recog Image Anal Comput Vis Appl Proc 2009;5856:661–8.
[13] Lu ZY. PubMed and Beyond: A Survey of Web Tools for Searching Biomedical Literature. Oxford: Database; 2011.
[14] Al-Sheraji SH, Ismail A, Manap MY, et al. Prebiotics as functional foods: a review. J Funct Foods 2013;5:1542–53.
[15] Agarwala R, Barrett T, Beck J, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2015;43:D6–17.
[16] Bhattacharya S, Viet HT, Srinivasan P. MeSH: a window into full text for document summarization. Bioinformatics 2011;27:I120–8.
[17] Dhammi IK, Kumar S. Medical subject headings (MeSH) terms. Indian J Orthop 2014;48:443–4.
[18] Sangeetha PT, Ramesh MN, Prapulla SG. Recent trends in the microbial production, analysis and application of Fructooligosaccharides. Trends Food Sci Tech 2005;16:442–57.
[19] Lina BAR, Jonker D, Kozianowski G. Isomaltulose (Palatinose (R)): a review of biological and toxicological studies. Food Chem Toxicol 2002;40:1375–81.
[20] Vazquez MJ, Alonso JL, Dominguez H, et al. Xylooligosaccharides: manufacture and applications. Trends Food Sci Tech 2000;11:387–93.
[21] Alander M, Matto J, Kneifel W, et al. Effect of galacto-oligosaccharide supplementation on human faecal microflora and on survival and persistence of Bifidobacterium lactis Bb-12 in the gastrointestinal tract. Int Dairy J 2001;11:817–25.
[22] Singh M, Sharma R, Banerjee UC. Biotechnological applications of cyclodextrins. Biotechnol Adv 2002;20:341–59.
[23] Johansen HN, Glitso V, Knudsen KEB. Influence of extraction solvent and temperature on the quantitative determination of oligosaccharides from plant materials by high-performance liquid chromatography. J Agr Food Chem 1996;44:1470–4.
[24] Mussatto SI, Mancilha IM. Non-digestible oligosaccharides: a review. Carbohyd Polym 2007;68:587–97.
[25] Villamiel M, Corzo N, Foda MI, et al. Lactulose formation catalysed by alkaline-substituted sepiolites in milk permeate. Food Chem 2002;76:7–11.
[26] Kawase M, Pilgrim A, Araki T, et al. Lactosucrose production using a simulated moving bed reactor. Chem Eng Sci 2001;56:453–8.
[27] Kaneko T, Kohmoto T, Kikuchi H, et al. Effects of Isomaltooligosaccharides with different degrees of polymerization on human fecal bifidobacteria. Biosci Biotechnol Biochem 1994;58:2288–90.
[28] Eeckhaut V, Van Immerseel F, Dewulf J, et al. Arabinoxylooligosaccharides from wheat bran inhibit Salmonella colonization in broiler chickens. Poultry Sci 2008;87:2329–34.
[29] Barczynska R, Slizewska K, Jochym K, et al. The tartaric acid-modified enzyme-resistant dextrin from potato starch as potential prebiotic. J Funct Foods 2012;4:954–62.
[30] Kankar P, Adak S, Sarkar A, et al. MedMeSH summarizer: text mining for gene clusters. Siam Proc S 2002;548–565.
[31] Langfelder P, Zhang B, Horvath S. Defining clusters from a hierarchical cluster tree: the dynamic tree cut package for R. Bioinformatics 2008;24:719–20.
[32] Gibson GR, Probert HM, Van Loo J, et al. Dietary modulation of the human colonic microbiota: updating the concept of prebiotics. Nutr Res Rev 2004;17:259–75.
[33] INTECH Open Access Publisher, da Silva AE, Oliveira EE, Egito EST, et al. Xylan, A Promising Hemicellulose for Pharmaceutical Use. 2012.
[34] Springer, Bosscher D. Fructan prebiotics derived from inulin. Prebiotics and Probiotics Science and Technology 2009;163–205.
[35] Slavin JL. Dietary fiber and body weight. Nutrition 2005;21:411–8.
[36] Su P, Henriksson A, Mitchell H. Selected prebiotics support the growth of probiotic mono-cultures in vitro. Anaerobe 2007;13:134–9.
[37] Binns N. Probiotics, prebiotics and the gut microbiota. Probiotics, Prebiotics Gut Microbiota 2013. 1–32.
[38] Delzenne NM, Cani PD. Nutritional modulation of gut microbiota in the context of obesity and insulin resistance: Potential interest of prebiotics. Int Dairy J 2010;20:277–80.
[39] Kelemen E, Jakab K, Váradi G, et al. Non-supralethal mitobronitol/cytarabine/cyclophosphamide conditioning without irradiation before bone marrow transplantation for accelerated chronic granulocytic leukemia: apparent absence of acute graft-versus-host disease. Leukemia 1993;7:939–45.
[40] Evenepoel P, Bammens B, Verbeke K, et al. Acarbose treatment lowers generation and serum concentrations of the protein-bound solute p-cresol: a pilot study. Kidney Int 2006;70:192–8.
[41] Springer, Boler BMV, Fahey GCJr. Prebiotics of plant and microbial origin. Direct-Fed Microbials and Prebiotics for Animals 2012;13–26.
[42] Pranckute R, Kaunietis A, Kuisiene N, et al. Development of synbiotics with inulin, palatinose, α-cyclodextrin and probiotic bacteria. Pol J Microbiol 2014;63:33–41.
[43] Marcolino VA, Zanin GM, Durrant LR, et al. Interaction of curcumin and bixin with β-cyclodextrin: complexation methods, stability, and applications in food. J Agr Food Chem 2011;59:3348–57.
Supplementary resources (2)

... Several studies have used text mining (i.e., extracting useful information from texts) Wiki, 2020;XLSTAT, 2022) to explore domain knowledge in articles (Huang et al., 2020;Kostoff et al., 2005;Wu et al., 2019;Zhou et al., 2019). The bibliometrics and visualization perspectives were applied to 1575 documents on medical data mining (MDM) methods (Shan et al., 2016). MetaMap can be used to extract concepts from the titles of articles (Aronson & Lang, 2010;Hu et al., 2020). ...
Article
Full-text available
Literature research requires an understanding of the similarities and differences between different types of journals. It has not yet been possible to use text-mining to demonstrate the differences between the topics of articles by presenting features of article keywords using forest plots. It is important for authors to make a quick assessment of the similarities and differences between research types when submitting an article for publication in a journal. Our study uses text mining and forest plotting techniques to extract article features and compare the similarities and differences between the two journals' research types. There were a total of 100 top-cited articles selected from Spine (Phila Pa 1976) and The Spine Journal: official journals of the North American Spine Society with impact factors of 3.19 and 3.22 respectively, as reported by Journal Citation Reports (JCR) for 2018. XLSTAT software was used to extract features from author-made keywords and medical subject headings (e.g., MeSH terms in PubMed). These 200 top-cited articles were analyzed and clustered by performing factor analysis and social network analysis (SNA). The study presented three types of results: (1) descriptive statistics, (2) classification analysis, and (3) inferential statistics. The chi-square test was used to examine the frequency of clusters and journals, and forest plots were used to analyze differences between journals in terms of research topics. 
It was observed that (1) the United States dominated publications, accounting for 54% of 200 articles; the MeSH term of surgery was simultaneously highlighted in both journals using a word cloud generator; (2) five-term clusters were identified, namely, (i) Pain & Prognosis, (ii) Statistics & Data, (iii) Spine & Surgery, (iv) physiopathology, and (v) physiology; (4) there were no differences in distribution counts among categories between journals (Chi Square = 1.64, df = 4, p = 0.82), but differences in category(factor) scores between journals were found(Q-statistic = 484.94, df = 4, p < 0.001). Using text mining and a forest plot, we are able to understand the relationships between the types of research in different journals. Readers can use this research as a reference for future journal submissions based on the study results.
... Zhang et al. (2018) proposed a hierarchical vector space model for computing the semantic similarity between different genes based on a gene ontology. The text mining based on MeSH is a promising approach in searching for new prebiotics (Shan, 2016). Relevant tests have examined the improvement of the MeSH concept toward information retrieval, compared with PubMed's ATM (Automatic Term Mapping) and CISMeF (Catalog and Index of French-language Health Internet) ATM (Darmoni et al., 2012). ...
Article
As a part of innovation in forecasting, scientific topic hotness prediction plays an essential role in dynamic scientific topic assessment and domain knowledge transformation modeling. To improve the topic hotness prediction performance, we propose an innovative model to estimate the co-evolution of scientific topic and bibliographic entities, which leverages a novel dynamic Bibliographic Knowledge Graph (BKG). Then, one can predict the topic hotness by using various kinds of topological entity information, i.e., TopicRank, PaperRank, AuthorRank, and VenueRank, along with pre-trained node embedding, i.e., node2vec embedding, and different pooling techniques. To validate the proposed method, we constructed a new BKG by using 4.5 million PubMed Central publications plus MeSH (Medical Subject Heading) thesaurus and witnessed the essential prediction improvement with extensive experiment outcomes over 10 years observations.
... FA-CDD involved daily oral application of a solid beverage of Flexible Abrosia (FA, Beijing Cloud Medical International Technology, Inc. China) 10 g/bag/person per treatment at three mealtimes every day on an outpatient basis during the fasting period. The ingredients of FA were designed to include dietary fiber and cordyceps polysaccharide, ganoderma lucidum polysaccharide, and hericium erinaceus polysaccharide [19], which were regarded as bacteria-but not human-consumed saccharides [23]. The National Food Inspection Center of China has reported the analyzed energy of 10 g FA as 113.4 KJ (27 kcal), which indicated that even if the calories from each treatment were completely absorbed by the human being, it would be less than 100 kcal daily in total, significantly less than recently reported low-calorie (500 kcal per day) intake in the treatment of cancer [1]. ...
... FA-CDD involved daily oral application of a solid beverage of Flexible Abrosia (FA, Beijing Cloud Medical International Technology, Inc. China) 10 g/bag/person per treatment at three mealtimes every day on an outpatient basis during the fasting period. The ingredients of FA were designed to include dietary fiber and cordyceps polysaccharide, ganoderma lucidum polysaccharide, and hericium erinaceus polysaccharide (18), which were regarded as bacteriabut not human-consumed saccharides (22). The National Food Inspection Center of China has reported the analyzed energy of 10 g FA as 113.4 KJ (27 kcal), which indicated that even if the calories from each treatment were completely absorbed by the human being, it would be <100 kcal daily in total, significantly less than recently reported low-calorie (500 kcal per day) intake in the treatment of cancer (1). ...
Article
Full-text available
Objectives: The aim of this study was to evaluate a total fasting regimen assisted by a novel prebiotic, Flexible Abrosia (FA), in more than 7 days of continual dietary deprivation (7D-CDD). Our analysis included basic physical examinations, bioelectrical impedance analysis, and clinical lab and ELISA analysis in normal volunteers. Methods: Seven healthy subjects with normal body weight participated in 7D-CDD with the assistance of a specially designed probiotic. Individuals were assigned to take FA (113.4 KJ/10 g) at each mealtime to avoid possible injuries to intestinal flora and smooth the hunger sensation. During 7D-CDD, the subjects were advised to avoid any food intake, especially carbohydrates, except for drinking plentiful amounts of water. The examination samples were collected before CDD as self-control, at 7 days fasting, and after 7~14 days of refeeding. Three subjects were also tested after 6-m refeeding. Results: The FA-CDD regimen significantly decreased suffering from starvation, with tolerable hunger sensations during the treatment. With the addition of daily mineral electrolytes, the subjects not only passed through the entire 7D-CDD regimen but also succeed in 12~13 days total fasting in two subjects. There was a significant reduction in blood glucose, insulin, and high-density lipoprotein levels during fasting, and the blood concentrations of uric acid (UA), alanine aminotransferase (ALT), and creatine kinase (CK) were increased. However, after more than 2 months of refeeding, the disease markers ALT, GOT, and CK either remained stable or were slightly downregulated compared to their initial D0 control level. Conclusion: Our experiment has supplied the first positive evidence that, with the assistance of a daily nutritional supply of around 100 kcal total calories to their intestinal flora, human subjects were able to tolerate hunger sensations. 
We have found that, although 7D-CDD induced increases in UA, CK, and transferases during fasting, refeeding led the markers to become either down-regulated or unchanged compared to their initial levels. This phenomenon was further confirmed in longer-term (6 m) recovery. Our results failed to support the hypothesis that fasting induced liver damage, since ALT, GOT, and CK remained low after longer-term refeeding. Our findings indicate that the 7D-CDD regimen might be practical and that it might be valuable to design larger clinical fasting trials for improvement of health strategy-targeting in metabolic disorders.
... The gut flora can decompose dietary fibers into short-chain fatty acids to provide nutrition for the intestinal epithelial cells, which will in turn support the wellness of the gut flora [69][70][71]. We have used the text mining approach to identify the novel prebiotics for further studies [72]. Normally, both the nutrition for the human body and gut flora, such as starch, rice, noodles, fruits, vegetables, fish, shrimps, meat, eggs, and milk, are combined together. ...
Article
Full-text available
Human wellness is the ultimate goal of our efforts in improving the human life. Special foods are undoubtedly important in achieving human wellness. However, overeating significantly leads to obesity and diabetes. These chronic diseases will in turn affect the human wellness. Therefore, “dietary restriction and proper exercise” were introduced in the human daily life. Different foods cause various effects on the human health. The diversification of diet is a priority for nutritionists to keep our body healthy. To avoid diabetes mellitus, special foods for ketogenic diet, low-carbon diet, and low-calorie intake are also gradually attracting attention. In addition, the hypothesis that “hunger sensation comes from gut flora” brings new light to the research on the biological motivation for humans to eat food. This hypothesis has been gradually demonstrated using the flexible fasting technology by providing special foods, such as plant polysaccharides and dietary fibers. The response to food-needing signals from the gut flora to these foods demonstrates the importance of the gut flora in improving human wellness. The gut flora is probably an essential factor for translating the food-eating signals and converting the nutrition to our body. Therefore, “gut flora priority principle” is developed to guarantee human wellness. The 16S rRNA sequencing and mass spectrometric techniques can be used to identify the gut flora, which may guide us to a new era of human wellness based on gut flora wellness. Keywords: Hunger sensation comes from gut flora, Gut flora-centric theory, Flexible fasting, Gut flora priority principle, Universal reproducing power of the microbiota, Gut flora wellness, Human wellness
... Guangyu Shan et al. [25] presents a MeSH based text mining method to predict new prebiotics. The system use a systematic feature-ranking algorithm which classifies a variety of carbohydrates into different clusters according to their chemical and biological attributes. ...
Article
Full-text available
It requires great effort to search through huge number of published articles that provide information we need. Therefore it is necessary to find a solution that helps researchers in gaining accurate and deep understanding about diseases. Thus drug discovery and drug repurposing are gaining significance with the current onics tools. Traditional Medical practices like Ayurveda needs to be more visible to practitioners with evidence based approach. The clinical trials conducted have to be shared with the world for attaining the very philosophy of Ayurveda.. This paper presents a survey on various text mining technologies developed to classify theories and literature pertaining to the clinical observations of practitioners and suggests a possible solution to match a patient's symptoms.
Article
Flux balance analysis (FBA) using genome-scale metabolic model (GSM) is a useful method for improving the bio-production of useful compounds. However, FBA often does not impose important constraints such as nutrients uptakes, by-products excretions and gases (oxygen and carbon dioxide) transfers. Furthermore, important information on metabolic engineering such as enzyme amounts, activities, and characteristics caused by gene expression and enzyme sequences is basically not included in GSM. Therefore, simple FBA is often not sufficient to search for metabolic manipulation strategies that are useful for improving the production of target compounds. In this study, we proposed a method using literature and enzyme search to complement the FBA-based metabolic manipulation strategies. As a case study, this method was applied to shikimic acid production by Corynebacterium glutamicum to verify its usefulness. As unique strategies in literature-mining, overexpression of the transcriptional regulator SugR and gene disruption related to by-products productions were complemented. In the search for alternative enzyme sequences, it was suggested that those candidates are searched for from various species based on features captured by deep learning, which are not simply homologous to amino acid sequences of the base enzymes. This article is protected by copyright. All rights reserved
Article
Accurate Medical Subject Headings (MeSH)annotation is an important issue for researchers in terms of effective information retrieval and knowledge discovery in the biomedical literature. We have developed a powerful dual triggered correspondence topic (DTCT)model for MeSH annotated articles. In our model, two types of data are assumed to be generated by the same latent topic factors and words in abstracts and titles serve as descriptions of the other type, MeSH terms. Our model allows the generation of MeSHs in abstracts to be triggered either by general document topics or by document-specific “special” word distributions in a probabilistic manner, allowing for a trade-off between the benefits of topic-based abstraction and specific word matching. In order to relax the topic influences of non-topical words or domain-frequent words in text description, we integrated the discriminative feature of Okapi BM25 into word sampling probability. This allows the model to choose keywords, which stand out from others, in order to generate MeSH terms. We further incorporate prior knowledge about relations between word and MeSH in DTCT with phi -coefficient to improve topic coherence. We demonstrated the model's usefulness in automatic MeSH annotation. Our model obtained 0.62 F-score 150,00 MEDLINE test set and showed a strength in recall rate. Specially, it yielded competitive performances in an integrated probabilistic environment without additional post-processing for filtering MeSHs.
Article
Full-text available
The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and Abstracts for published life science journals. Additional NCBI resources focus on literature (Bookshelf, PubMed Central (PMC) and PubReader); medical genetics (ClinVar, dbMHC, the Genetic Testing Registry, HIV-1/Human Protein Interaction Database and MedGen); genes and genomics (BioProject, BioSample, dbSNP, dbVar, Epigenomics, Gene, Gene Expression Omnibus (GEO), Genome, HomoloGene, the Map Viewer, Nucleotide, PopSet, Probe, RefSeq, Sequence Read Archive, the Taxonomy Browser, Trace Archive and UniGene); and proteins and chemicals (Biosystems, COBALT, the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), the Molecular Modeling Database (MMDB), Protein Clusters, Protein and the PubChem suite of small molecule databases). The Entrez system provides search and retrieval operations for many of these databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at http://www.ncbi.nlm.nih.gov.
Article
Full-text available
Article
Full-text available
Success in creating a synbiotic depends on compatibility between the chosen components--prebiotic and probiotic. In this work the interactions between Lactobacillus sp. strains isolated from yogurts and type strains of Lactobacillus sp. and Lactococcus sp., and the dependence of their growth and antibacterial activity on three oligosaccharides (OS)--palatinose, inulin and alpha-cyclodextrin were investigated. All isolated lactobacilli produce antibacterial compounds, which possibly are the bacteriocins of Lactobacillus casei ATCC334 strain. Results of growth analysis with different OS revealed that part of lactobacilli isolated from yogurts can effectively ferment inulin and may be used for the development of synbiotics. Palatinose and Lactobacillus acidophilus could be used as symbiotics with effective antibacterial activity. One of the types of Lactococcus sp. strains can assimilate palatinose and alpha-cyclodextrin, so they both can be used as components of synbiotics with the investigated lactococci. Results of this analysis suggest that the investigated isolated and type strains of Lactobacillus sp. and Lactoccocus sp. can be useful as probiotics in the development of synbiotics. Together with prebiotics--palatinose, inulin and alpha-cyclodextrin, the synbiotics, which could regulate not only the growth of beneficial bacteria in the gastrointestinal tract, but also their antibacterial activity, can be created.
Article
Full-text available
Prebiotics are short-chain carbohydrates that are not digested by human digestive enzymes and that selectively enhance the activity of some groups of beneficial bacteria. In the intestine, prebiotics are fermented by beneficial bacteria to produce short-chain fatty acids. Prebiotics also confer many other health benefits in the large intestine, such as reduced cancer risk and increased calcium and magnesium absorption. Prebiotics are found in several vegetables and fruits and are considered functional food components that present significant technological advantages. Their addition improves sensory characteristics such as taste and texture, and enhances the stability of foams, emulsions and mouthfeel in a large range of food applications such as dairy products and bread. This contribution reviews bioactives from food sources with prebiotic properties. Food applications of bioactive prebiotics, stimulation of the viability of probiotics, health benefits, epidemiological studies, and safety concerns of prebiotics are also reviewed.
Article
This paper describes a novel approach to learning term-weighting schemes (TWSs) in the context of text classification. In text mining, a TWS determines how documents are represented in a vector space model before a classifier is applied. Whereas acceptable performance has been obtained with standard TWSs (e.g., Boolean and term-frequency schemes), the definition of TWSs has traditionally been an art. Further, it is still difficult to determine which TWS is best for a particular problem, and it is not yet clear whether schemes better than those currently available can be generated by combining known TWSs. We propose in this article a genetic program that aims at learning effective TWSs that can improve the performance of current schemes in text classification. The genetic program learns how to combine a set of basic units to give rise to discriminative TWSs. We report an extensive experimental study comprising data sets from thematic and non-thematic text classification as well as from image classification. Our study shows the validity of the proposed method; in fact, we show that TWSs learned with the genetic program outperform traditional schemes and other TWSs proposed in recent works. Further, we show that TWSs learned from a specific domain can be effectively used for other tasks.
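To make the notion of a term-weighting scheme concrete, the following minimal Python sketch computes three standard schemes (Boolean, term frequency, and TF-IDF) for tokenized documents. It is illustrative only and is not the genetic program described in the abstract; the function name `term_weights`, the `scheme` argument, and the toy documents are assumptions introduced here.

```python
import math
from collections import Counter

def term_weights(docs, scheme="tf"):
    """Return one {term: weight} dict per tokenized document.

    Hypothetical helper for illustration; `scheme` is one of
    "boolean", "tf", or "tfidf".
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        if scheme == "boolean":
            vec = {t: 1.0 for t in tf}
        elif scheme == "tf":
            vec = {t: float(c) for t, c in tf.items()}
        elif scheme == "tfidf":
            # Terms present in every document get weight 0 (log(1) == 0).
            vec = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        vectors.append(vec)
    return vectors

# Toy corpus (invented for illustration).
docs = [["inulin", "prebiotic", "inulin"], ["lactose", "prebiotic"]]
for scheme in ("boolean", "tf", "tfidf"):
    print(scheme, term_weights(docs, scheme))
```

The same documents yield different vectors under each scheme, which is why the choice of TWS can change downstream classifier behavior.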
Article
maxent is a package with tools for data classification using multinomial logistic regression, also known as maximum entropy. The focus of this maximum entropy classifier is to minimize memory consumption on very large datasets, particularly sparse document-term matrices represented by the tm text mining package.
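As a rough illustration of the technique named above, the sketch below trains a multinomial logistic regression (maximum entropy) classifier on sparse document-term vectors using plain stochastic gradient ascent in Python. It does not use or reproduce the maxent package's interface; the function names, hyperparameters, and toy data are all assumptions made for this example.

```python
import math
from collections import defaultdict

def predict_proba(weights, x):
    """Softmax over per-class linear scores for a sparse vector x."""
    scores = {c: sum(wc.get(t, 0.0) * v for t, v in x.items())
              for c, wc in weights.items()}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def train_maxent(docs, labels, epochs=200, lr=0.5):
    """Fit class weights by stochastic gradient ascent on the log-likelihood.

    docs: list of sparse {term: value} dicts; labels: list of class names.
    Only terms that actually occur are stored, which keeps memory use low.
    """
    classes = sorted(set(labels))
    weights = {c: defaultdict(float) for c in classes}
    for _ in range(epochs):
        for x, y in zip(docs, labels):
            p = predict_proba(weights, x)
            for c in classes:
                # Gradient of the log-likelihood: indicator minus probability.
                g = (1.0 if c == y else 0.0) - p[c]
                for t, v in x.items():
                    weights[c][t] += lr * g * v
    return weights

# Toy document-term data (invented for illustration).
train_docs = [
    {"inulin": 1, "fermented": 1},
    {"glucose": 1, "digested": 1},
    {"inulin": 1, "bifidobacteria": 1},
    {"glucose": 1, "absorbed": 1},
]
train_labels = ["prebiotic", "other", "prebiotic", "other"]
model = train_maxent(train_docs, train_labels)
print(predict_proba(model, {"inulin": 1}))
```

Storing each document as a dictionary of nonzero terms mirrors the sparse document-term representation the abstract mentions, though a production implementation would use a dedicated sparse matrix format.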