A MeSH-based text mining method for identifying novel prebiotics

  • National Center for Protein Sciences (Beijing)

Abstract

Prebiotics contribute to the well-being of their host by altering the composition of the gut microbiota. Discovering new prebiotics is a challenging and arduous task due to strict inclusion criteria; thus, highly limited numbers of prebiotic candidates have been identified. Notably, the large numbers of published studies may contain substantial information attached to various features of known prebiotics that can be used to predict new candidates. In this paper, we propose a medical subject headings (MeSH)-based text mining method for identifying new prebiotics with structured texts obtained from PubMed. We defined an optimal feature set for prebiotics prediction using a systematic feature-ranking algorithm with which a variety of carbohydrates can be accurately classified into different clusters in accordance with their chemical and biological attributes. The optimal feature set was used to separate positive prebiotics from other carbohydrates, and a cross-validation procedure was employed to assess the prediction accuracy of the model. Our method achieved a specificity of 0.876 and a sensitivity of 0.838. Finally, we identified a high-confidence list of candidates of prebiotics that are strongly supported by the literature. Our study demonstrates that text mining from high-volume biomedical literature is a promising approach in searching for new prebiotics.
Guangyu Shan, MS, Yiming Lu, PhD, Bo Min, PhD, Wubin Qu, MS, Chenggang Zhang, PhD
Abbreviations: AUC = area under the curve, MeSH = medical subject headings, NLM = National Library of Medicine, RF = random forest, ROC = receiver operating characteristic curve, XML = extensible markup language.
Keywords: Carbohydrates, MeSH-term, Prebiotics, Prebiotics prediction, Text mining
1. Introduction
The health benefits of prebiotics, such as cancer risk reduction, immune system enhancement, and constipation relief, have been widely accepted. A food ingredient can be considered a prebiotic only when it satisfies 3 criteria: (1) resistance to gastric acidity and mammalian enzymes, (2) susceptibility to fermentation by intestinal microbiota, and (3) selective stimulation of the growth and/or activity of beneficial intestinal microbiota. Identifying new prebiotics in accordance with these 3 criteria via the screening of various chemical compounds is a very laborious and challenging task. Scientists have been performing related work since 1995, when the criteria were first proposed. However, only two
carbohydrates have been reported until 2007: Inulin and
Several researchers began to develop other approaches by reviewing published literature and searching for keywords in PubMed, and 3 carbohydrates were shown to alter the microbiota balance of the large bowel by increasing the numbers of bifidobacteria and lactobacilli. The success of these studies suggested the possibility of using a text mining-based method to identify prebiotics by transforming the inclusion criteria into a collection of literal features. Text mining efforts have developed a variety of approaches to obtain information from structured biomedical text using techniques such as machine learning, natural language processing, biostatistics, information technology, and pattern recognition.
In the rapidly growing fields of knowledge discovery and text mining, relevant literature can be used to obtain implicit and unrevealed information. Swanson began to mine information from biomedical literature for Raynaud disease treatment in 1986. He found from a biomedical paper that Raynaud disease is a peripheral circulatory disorder associated with and exacerbated by high platelet aggregation, high blood viscosity, and vasoconstriction; in other biomedical literature, he found that fish oil could reduce these symptoms. Accordingly, he proposed the hypothesis that fish oil may be helpful for people suffering from Raynaud disease, which had not previously been reported. Three years later, this hypothesis was clinically confirmed by DiGiacomo et al. Following the same approach, Ramadan et al traced 11 indirect connections between migraines and magnesium using summaries of published papers, and the effect of magnesium was later experimentally validated. Text mining has since become an indispensable tool for extracting knowledge from biomedical literature.
Feature selection is a critical procedure in text mining for teasing out valuable features from large amounts of data. Techniques such as support vector machine (SVM), genetic programming (GP), logistic regression (LR), and probabilistic neural network (PNN) can perform this process only in a general and cursory manner. The MedMeSH summarizer can assess very large amounts of biomedical data in a short period and is generally used for genome-wide expression profiles. It can achieve decent performance in specific, as opposed to general, assessments.
Inspired by MedMeSH and the philosophy of mining tacit knowledge from biomedical literature, we herein developed a novel medical subject headings (MeSH)-based text mining method for identifying new prebiotics utilizing the PubMed database. PubMed comprises more than 24 million citations for biomedical literature from MEDLINE, life science journals, and online books. MeSH is the National Library of Medicine (NLM) controlled vocabulary thesaurus used for indexing articles in PubMed. We extracted features from MeSH terms because they are easily available through the PubMed service of the NLM, whereas full texts of research studies are often only accessible by subscription. Additionally, utilizing MeSH rather than the full text not only reduces computation time but also enables higher dataset throughput. Bhattacharya et al demonstrated that MeSH terms can represent the whole text accurately if screened appropriately; that is, we can extract representative features from massive amounts of literature using these high-quality terms.
We hypothesized that carbohydrates with the properties of prebiotics share similar literal features. To better extract the features of known prebiotics, we first used an exhaustive text mining approach to mine prebiotic-related topical MeSH terms from structured documents downloaded from PubMed. We then selected a list of optimal MeSH terms that are closely related to known prebiotics and ranked a large set of carbohydrates according to the scores calculated from their MeSH frequency profiles. Finally, we used a cross-validation technique to assess the prediction accuracy of our model.
2. Methods
2.1. Data preparation
First, 2 datasets were prepared: a positive prebiotics set and a carbohydrates set. We used a list of positive prebiotics summarized by Al-Sheraji et al. The list, shown in Table 1, contains 15 prebiotics, which we denote as the positive prebiotics set. Nearly all positive prebiotics are non-digestible carbohydrates. Thus, we constructed the carbohydrates set using the official names of all available carbohydrates from the NLM MeSH tree structures. To ensure the specificity of the prediction, only carbohydrates at the lowest level of the tree were selected; where the lowest-level carbohydrates could not cover the carbohydrates represented by their parent node, the parent node was also included. Positive prebiotics were removed from the carbohydrates set. The final carbohydrates set contains 112 carbohydrates (Supporting Information: S1 Table, the official names of the carbohydrates set [XLSX]; S2 Table, the official names of 50 positives for method validation [XLSX], http://links.). The name of each of the 15 positive prebiotics and 112 carbohydrates was used as a query to search for relevant literature in PubMed, and the hit documents were downloaded in extensible markup language (XML) format. MeSH terms in the XML documents were extracted using the ElementTree Python package. Each substance therefore has a MeSH term list extracted from its relevant literature. Each list contains thousands of features, which provides a robust foundation for the final model. This study did not require ethical approval or informed consent because all analyses were based on data extracted from previously published literature.
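The extraction step can be sketched with Python's standard-library ElementTree module. This is a minimal illustration rather than the authors' script: the sample document is a hand-written stand-in for a PubMed download, the function name mesh_term_counts is hypothetical, and the code assumes the usual PubMed layout in which each MeshHeading element holds one DescriptorName.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written stand-in for a PubMed XML download (assumption: real files
# use the same MeshHeadingList / MeshHeading / DescriptorName nesting).
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><MeshHeadingList>
    <MeshHeading><DescriptorName>Inulin</DescriptorName></MeshHeading>
    <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
  </MeshHeadingList></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><MeshHeadingList>
    <MeshHeading><DescriptorName>Inulin</DescriptorName></MeshHeading>
    <MeshHeading><DescriptorName>Bifidobacterium</DescriptorName></MeshHeading>
  </MeshHeadingList></MedlineCitation></PubmedArticle>
</PubmedArticleSet>
"""


def mesh_term_counts(xml_text):
    """Count how often each MeSH descriptor occurs across all articles."""
    root = ET.fromstring(xml_text.strip())
    counts = Counter()
    for descriptor in root.iter("DescriptorName"):
        if descriptor.text:
            counts[descriptor.text.strip()] += 1
    return counts


counts = mesh_term_counts(SAMPLE_XML)
print(counts.most_common())
```

Running this once per query substance yields one term-frequency list per prebiotic or carbohydrate, which is the input assumed by the later steps.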
2.2. Stop words filtering
Stop words, which can undermine the efficacy and effectiveness of the mining task because of their high frequency, usually need to be removed first. MeSH curators have already removed traditional stop words such as "a," "the," and "for"; however, some MeSH terms with extremely high frequency remain, which significantly reduces model performance. These MeSH terms were filtered according to Zipf's law, which states that the frequency of a word is inversely proportional to its frequency rank among all words in a given natural language corpus. Thus, the purity of the corpus can be improved by removing MeSH terms with particularly high frequency using the following filter procedure.
1. Initiate a query list containing all carbohydrates in the positive prebiotics set and the carbohydrates set;
2. Rank their MeSH terms in descending order of total frequency. We considered the first region of the Zipf curve (the top 20 terms with the highest frequency). Four colleagues in our lab who specialize in prebiotics examined the candidate list and removed the terms that are biologically important;
3. The remaining MeSH terms from this region constituted the MeSH stop words list.
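The filter procedure above can be sketched as follows. This is an illustrative outline under stated assumptions: the function names are hypothetical, the per-substance counts are toy values, and the expert review in step 2 is represented simply by a hand-supplied set of terms to keep.

```python
from collections import Counter


def stop_word_candidates(counts_by_substance, top_n=20):
    """Steps 1-2: pool term counts over all substances, take the top_n terms."""
    total = Counter()
    for counts in counts_by_substance.values():
        total.update(counts)  # adds the counts term by term
    return [term for term, _ in total.most_common(top_n)]


def build_stop_list(candidates, biologically_important):
    """Step 3: drop the terms that experts flagged as biologically important."""
    return [term for term in candidates if term not in biologically_important]


# Toy counts (invented for illustration only).
toy_counts = {
    "Inulin": {"Humans": 50, "Animals": 40, "Bifidobacterium": 30, "Feces": 5},
    "Lactulose": {"Humans": 60, "Animals": 20, "Bifidobacterium": 10},
}
candidates = stop_word_candidates(toy_counts, top_n=3)
stop_words = build_stop_list(candidates, {"Bifidobacterium"})
print(stop_words)
```

Keeping the expert decision as an explicit input makes it easy to rerun the filter with a different cut-off than the top 20 terms.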
2.3. Data normalization
Table 1
Types and sources of known prebiotics.
Type of prebiotic              Sources of prebiotics
Inulin                         Wheat, onion, bananas
Fructooligosaccharides         Asparagus, sugar beet, garlic, etc.
Isomaltulose                   Honey, sugarcane juice
Xylooligosaccharides           Bamboo shoots, fruits, vegetables, etc.
Galactooligosaccharides        Human's milk and cow's milk
Cyclodextrins                  Water-soluble glucans
Raffinose oligosaccharides     Seeds of legumes, lentils, peas, etc.
Soybean oligosaccharides       Soybean
Lactulose                      Lactose (milk)
Lactosucrose                   Lactose
Palatinose                     Sucrose
Maltooligosaccharides          Starch
Isomaltooligosaccharides       Starch
Arabinoxylooligosaccharides    Wheat bran
Enzyme-resistant dextrin       Potato starch
Normalization of MeSH term frequencies is necessary because well-studied prebiotics retrieve much more literature than other prebiotics, which would introduce bias into the final feature set of the cluster. To avoid this, the frequency matrix is normalized according to Eq. (1), where α (0 ≤ α ≤ 1) is a normalization parameter controlling the degree of correlation with the corpus volume. α = 0 implies no normalization and α = 1 implies complete normalization. We first build a positive prebiotics MeSH frequency matrix f with numerical values, where each row represents a prebiotic and each column refers to a MeSH term occurring in the positive prebiotics set. M denotes the number of prebiotics (rows). Thus, F_{ij} is the absolute MeSH term frequency, while f_{ij} is the relative MeSH term frequency of each positive prebiotic, with the sum taken over all MeSH terms (columns) j' of row i:
$$ f_{ij} = \frac{F_{ij}}{\left(\sum_{j'} F_{ij'}\right)^{\alpha}} \qquad (1) $$
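A small sketch can make the role of the normalization parameter concrete. It assumes that Eq. (1) divides each absolute frequency by the row total raised to the power of the normalization parameter, which matches the stated limits (0 gives no normalization, 1 gives relative frequencies); the function name normalize_rows and the numbers are hypothetical.

```python
def normalize_rows(F, alpha):
    """Scale each row of absolute counts by (row total) ** alpha.

    Assumed form of Eq. (1): alpha = 0 leaves counts unchanged,
    alpha = 1 turns each row into relative frequencies.
    """
    normalized = []
    for row in F:
        total = sum(row)
        scale = total ** alpha if total > 0 else 1.0
        normalized.append([value / scale for value in row])
    return normalized


# Two toy substances: the second has ten times the corpus volume.
F = [[2, 2], [10, 30]]
print(normalize_rows(F, 0.0))  # raw counts, dominated by the larger corpus
print(normalize_rows(F, 1.0))  # each row sums to 1
```

Intermediate values of the parameter only partly remove the advantage of well-studied substances, which is why it is treated as a tunable quantity.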
2.3.1. Feature selection. To select features from the matrix described above, we utilized the MedMeSH summarizer's algorithm, which has been applied to assign pertinent MeSH terms describing the functionality of a group of genes. The MedMeSH summarizer summarizes a group of genes by filtering biomedical literature and assigning relevant keywords describing the functionality of the genes. This system constructs a P × Q co-occurrence matrix, where P denotes the genes in the cluster and Q reflects the MeSH terms extracted from the retrieved literature. Each cell value of the matrix is the frequency of a MeSH term. With this matrix, an overall score for each MeSH term can be calculated, and the most influential terms are selected to describe the functionality of the cluster. Here, we utilized this matrix to classify all the MeSH terms into two fields: Major topics and Particular topics.
2.3.2. Major Topics. Terms occurring in most prebiotics with high frequency. N denotes the number of MeSH terms (columns). Criterion R_1: rank the MeSH terms in decreasing order of the means μ.
2.3.3. Particular Topics. Terms occurring in a subset of prebiotics with high frequency. σ in Eq. (3) is the ratio of the mean to the standard deviation of their MeSH feature vectors. Criterion R_2: rank the MeSH terms in decreasing order of the ratios σ^2.
All MeSH terms in the matrix are ranked in accordance with the 2 criteria described above and assigned an overall rank R in Eq. (4). The weight parameter w aims to provide a summary of the cluster by balancing the major and particular topics. MeSH terms are arranged by their overall relevance rank R in ascending order. The top k MeSH terms were retained as the prebiotics summary feature set, which was used to construct the normalized matrix for subsequent prediction.
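The two-criteria ranking can be sketched as below. Several details are assumptions made for illustration only: the particular-topic score is taken to be a variance-to-mean dispersion ratio, the overall rank is taken to be the linear blend w·R1 + (1 − w)·R2 (so that w = 1 lets major topics dominate), and all function names are hypothetical.

```python
from statistics import mean, pvariance


def rank_positions(values):
    """Return ranks where rank 1 is the largest value (ties keep input order)."""
    order = sorted(range(len(values)), key=lambda j: -values[j])
    ranks = [0] * len(values)
    for position, j in enumerate(order, start=1):
        ranks[j] = position
    return ranks


def select_features(f, terms, w, k):
    """Blend major-topic and particular-topic ranks, keep the best k terms."""
    columns = list(zip(*f))  # one tuple of frequencies per MeSH term
    means = [mean(col) for col in columns]
    # Assumed particular-topic score: variance relative to the mean.
    dispersion = [pvariance(col) / m if m > 0 else 0.0
                  for col, m in zip(columns, means)]
    r1 = rank_positions(means)        # major topics
    r2 = rank_positions(dispersion)   # particular topics
    overall = [w * a + (1 - w) * b for a, b in zip(r1, r2)]  # assumed blend
    order = sorted(range(len(terms)), key=lambda j: overall[j])
    return [terms[j] for j in order[:k]]


# Toy matrix: term A is common everywhere, term B is frequent in one row only.
f = [[0.5, 0.0, 0.1], [0.5, 0.9, 0.1], [0.5, 0.0, 0.1]]
terms = ["A", "B", "C"]
print(select_features(f, terms, w=1.0, k=1))
print(select_features(f, terms, w=0.0, k=1))
```

With w = 1 the uniformly frequent term is selected first, and with w = 0 the term concentrated in a single substance is selected first, which is the trade-off the weight parameter is meant to control.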
2.4. Parameter optimization
Three key parameters, α, w, and k, were screened for feature selection. α ranges from 0 to 1; 0 implies no normalization and 1 implies complete normalization. w also ranges from 0 to 1; 1 implies that the major topic terms dominate the feature set and 0 implies that the particular topics dominate the model. The last parameter, k, is the number of features retained in the final feature set.
An exhaustive global grid search was implemented to screen for the optimal parameter set. All possible combinations of the parameter values were evaluated, and the best combination was retained. Each parameter was assigned a suitable range: α ∈ [0, 1], step = 0.2; w ∈ [0, 1], step = 0.1; k ∈ [200, 1000], step = 200. To evaluate the performance of the parameter sets, we employed a 5-fold cross-validation method. After repeating the simulation 100 times, the average rank of the 3 positive prebiotics reserved in each fold was used to assess the performance of each parameter set. A more accurate model is expected to rank positive prebiotics at the top of the predicted list; thus, a smaller average rank value means higher rank positions for them, which indicates a better parameter set.
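The grid search over the three parameters can be outlined as follows. This is a sketch, not the authors' code: the evaluate callback stands in for the repeated 5-fold cross-validation that returns an average rank, and the toy callback used here is invented purely so the example runs.

```python
from itertools import product


def grid(start, stop, step):
    """Evenly spaced values from start to stop inclusive, rounded for stability."""
    n = round((stop - start) / step)
    return [round(start + i * step, 10) for i in range(n + 1)]


def grid_search(evaluate):
    """Try every (alpha, w, k) combination; a smaller average rank is better."""
    alphas = grid(0.0, 1.0, 0.2)
    weights = grid(0.0, 1.0, 0.1)
    ks = [200, 400, 600, 800, 1000]
    best = None
    for alpha, w, k in product(alphas, weights, ks):
        average_rank = evaluate(alpha, w, k)
        if best is None or average_rank < best[0]:
            best = (average_rank, alpha, w, k)
    return best


# Hypothetical stand-in for cross-validated average rank (not real results).
def toy_evaluate(alpha, w, k):
    return abs(alpha - 1.0) + abs(w - 0.6) + abs(k - 800) / 1000


print(grid_search(toy_evaluate))
```

The grid contains 6 × 11 × 5 = 330 combinations, so the cost is dominated by how expensive each cross-validated evaluation is.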
2.5. Feature enrichment analysis
In the XML document, each MeSH term has two attributes that were curated by an expert: "Descriptor Name" and "Qualifier Name." "Descriptor Name" refers to the official name of the MeSH term, and "Qualifier Name" refers to the specific related fields. For example, the MeSH term Inositol possesses the "Descriptor Name" Inositol and 2 "Qualifier Names," Chemistry and Pharmacology. Thus, the enrichment analysis extracts all "Qualifier Names" under each MeSH "Descriptor Name" for frequency calculation. The principal groups in the frequency distribution bar plot denote the properties of the MeSH group.
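The qualifier counting behind the enrichment analysis can be sketched with ElementTree. The example assumes the PubMed layout in which QualifierName elements sit beside the DescriptorName inside each MeshHeading; the sample XML, the function name qualifier_shares, and the choice to report the share of selected descriptors carrying each qualifier are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Hand-written sample (not real PubMed output).
SAMPLE_HEADINGS = """
<MeshHeadingList>
  <MeshHeading>
    <DescriptorName>Inositol</DescriptorName>
    <QualifierName>chemistry</QualifierName>
    <QualifierName>pharmacology</QualifierName>
  </MeshHeading>
  <MeshHeading>
    <DescriptorName>Inulin</DescriptorName>
    <QualifierName>metabolism</QualifierName>
    <QualifierName>chemistry</QualifierName>
  </MeshHeading>
  <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
</MeshHeadingList>
"""


def qualifier_shares(xml_text, selected):
    """Fraction of selected descriptors that carry each qualifier name."""
    root = ET.fromstring(xml_text.strip())
    qualifiers_by_descriptor = defaultdict(set)
    for heading in root.iter("MeshHeading"):
        descriptor = heading.findtext("DescriptorName")
        if descriptor in selected:
            for qualifier in heading.findall("QualifierName"):
                qualifiers_by_descriptor[descriptor].add(qualifier.text)
    counts = Counter(q for quals in qualifiers_by_descriptor.values() for q in quals)
    return {q: c / len(selected) for q, c in counts.items()}


shares = qualifier_shares(SAMPLE_HEADINGS, {"Inositol", "Inulin"})
print(sorted(shares.items(), key=lambda item: -item[1]))
```

Sorting the resulting shares in descending order gives the kind of ranking that a frequency distribution bar plot would display.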
2.6. Random forest model training for comparison
Random forest is a machine learning algorithm that can handle sparse matrices and large numbers of variables. Using the MeSH term frequencies of positive and negative carbohydrates as features, the random forest models were trained and tested with 100 repeats of 5-fold cross-validation, and the averaged areas under the receiver operating characteristic (ROC) curve (area under the curve [AUC]) were used for performance comparison across datasets. The training and testing procedures of the random forest model were implemented using the "randomForest" package in the R programming language.
2.7. Model evaluation and predicting novel prebiotics
We build a carbohydrate prediction matrix f according to Eq. (1) with numerical values, where each row represents a carbohydrate and each column refers to a feature. This matrix can be used to predict novel prebiotics by Eq. (5). Each carbohydrate obtains its own score, which denotes its potential to be a prebiotic.
Then, we carried out 5-fold cross-validation to evaluate the predictive performance of the model. In each round, 4 randomly generated folds were used for feature selection, and the fifth fold was reserved for prediction together with the carbohydrates set. That is, each round yields 2 columns for the prediction set: an R score column and a binary state column (1 denotes prebiotics, 0 denotes not prebiotics). These two columns produce one AUC score, and the prediction procedure was repeated 100 times. The average AUC was used as a measure to evaluate the prediction performance.
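How one AUC value follows from a score column and a binary state column can be shown with a short sketch. It uses the standard pairwise definition of AUC (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half); the function names and the toy scores are hypothetical.

```python
from statistics import mean


def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def average_auc(rounds):
    """Average the AUC over repeated cross-validation rounds."""
    return mean(auc(scores, labels) for scores, labels in rounds)


# Toy round: one positive is outranked by one negative.
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
print(auc(toy_scores, toy_labels))
```

Averaging over many randomized rounds reduces the dependence of the estimate on any single split, which matters when only a handful of positives fall into each fold.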
A model returns a vector of scores between 0 and 1 for a combined prediction profile. These scores are then mapped to a binary state indicating "prebiotics" or "non-prebiotics" by choosing a cut-off. For each combination of profiles, the existence of a prebiotic is considered positive (P) or negative (N). True (T) means that the predicted and observed categories are identical, and false (F) implies otherwise. The notations TP, FP, TN, and FN combine these labels to give the number of data points (combined prediction profiles) in each category. These values depend on the cut-off at which carbohydrate prediction scores are mapped onto binary predictions. Sensitivity and specificity are evaluated for the binary predictions over the entire score range. The specificity is defined as TN/(FP + TN) and the sensitivity as TP/(TP + FN). Lastly, we calculate the average specificity and average sensitivity over all rounds (100 repeats). The best cut-off point for balancing the average sensitivity and average specificity of our model is the point on the ROC curve closest to the (0, 1) point. We use the corresponding cut-off, calculated with the R package ROCR, to indicate potential prebiotics.
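The cut-off selection can be illustrated in a few lines. This sketch is not the ROCR implementation: it simply evaluates every observed score as a candidate cut-off, computes sensitivity and specificity from the definitions above, and keeps the cut-off whose ROC point (1 − specificity, sensitivity) lies closest to (0, 1). The function names and the toy data are hypothetical, and both classes are assumed to be present.

```python
import math


def sens_spec(scores, labels, cutoff):
    """Sensitivity and specificity when score >= cutoff is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (fp + tn)


def best_cutoff(scores, labels):
    """Cut-off whose ROC point is nearest to the ideal corner (0, 1)."""
    best = None
    for cutoff in sorted(set(scores)):
        sensitivity, specificity = sens_spec(scores, labels, cutoff)
        distance = math.hypot(1 - specificity, 1 - sensitivity)
        if best is None or distance < best[0]:
            best = (distance, cutoff)
    return best[1]


toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 1, 0, 0]
print(best_cutoff(toy_scores, toy_labels))
```

Using the distance to (0, 1) treats a loss of sensitivity and a loss of specificity as equally costly, which is a reasonable default when neither error type is preferred.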
3. Results
3.1. Text mining framework for novel prebiotics prediction
We developed a systematic MeSH-based text mining approach to robustly predict new prebiotics. The feature selection part of our method is inspired by the MedMeSH summarizer, a text mining algorithm that describes the functionality of a group of genes. Our method goes further: it not only summarizes a cluster using MeSH terms but also predicts novel concepts with the same property from the cluster. In addition, the MedMeSH summarizer uses a fixed parameter set for gene cluster summarization. However, we found that a fixed parameter set usually causes many unrelated terms to emerge as topic terms in our dataset, which undermines the subsequent prediction result. To overcome this problem, we developed an exhaustive global search method to determine the optimal parameter set for our prebiotics dataset. High-profile features were selected and validated by feature enrichment analysis and the ROC plot.
The workflow of prebiotics prediction is shown in Fig. 1. We first used the known prebiotics from Table 1 and the carbohydrates set from the NLM MeSH tree structure as queries to retrieve MeSH-annotated documents from PubMed. To construct the profile of each substance (prebiotics or carbohydrates), MeSH terms were extracted from the retrieved literature and their frequency was calculated by Eq. (1). After that, we identified 10 MeSH terms as stop words: Animals, Humans, Male, Female, Rats, Adult, Mice, Aged, Middle Aged, and Child. Those terms were removed from the corpus prior to the following steps.
Figure 1. The framework of prediction. 1. Download PubMed XML documents of 127 carbohydrates, including 15 positive prebiotics and 112 carbohydrates. 2. Compute the optimal parameter set (α, w, and k) for the model by exhaustive grid search and assign the top k features as the model feature set. 3. Use the ROC curve to evaluate the performance of the model. 4. Perform the prediction procedure to mine novel prebiotics. ROC = receiver operating characteristic curve, XML = extensible markup language.
Our model primarily aims to predict new prebiotics on the basis of MeSH frequency by extracting highly representative features, an approach originally employed by Kankar et al in investigating the functionality of a gene group. We drew on their approach and adapted it to a more concrete task: novel prebiotics prediction. Unlike the previous one-size-fits-all solution for gene sets, we refined the feature discovery pattern by accounting for unbalanced data during feature selection.
We calculated two parameters (R_1 and R_2) to identify different types of MeSH terms. R_1 is calculated by Eq. (2), which takes major topics into account, whereas R_2 is produced by Eq. (3), which considers particular topics. To improve the feature selection step, we used an exhaustive grid search to determine an optimal parameter set with 5-fold cross-validation. Each parameter in the model was traversed over its value range with a fixed step. We then selected 800 features from the 15 positive prebiotics using the optimal parameter set (α = 1, w = 0.6, k = 800). Next, we used feature enrichment analysis and carbohydrate clustering to evaluate the performance of the feature set. The feature set represented the properties of prebiotics very well, which in turn reflects the performance of the optimal parameter set. After that, we evaluated the final model and used the ROC curve to select the threshold that marks the boundary between carbohydrates with and without prebiotic properties. According to the threshold, the top 11 carbohydrates were identified as novel prebiotics. Finally, we carried out a thorough literature investigation of those new prebiotics.
3.2. Optimal parameter set for prebiotics prediction
The corpus volume associated with a carbohydrate often varies substantially between positive prebiotics and carbohydrates. Well-studied prebiotics, such as inulin and fructooligosaccharides, are substantially more common in research than other carbohydrates, which introduces strong bias into the model. To balance the effect of the corpus volume, we introduced the parameter α to control the extent of normalization of MeSH frequency. To balance the generic topics and particular topics, a weight parameter w was introduced to ensure that the final feature set takes these 2 diverse topics into full consideration. The last parameter, k, is the number of features retained in the final feature set. An optimal set of parameters is crucial for precise prediction of prebiotics, and we used an exhaustive global grid search method to determine the optimal parameter set (see Section 2).
Performance analyses of each parameter are shown in Fig. 2. α = 1 achieves the best average rank regardless of the change in w, indicating that full normalization is necessary for the applied datasets, as shown in Fig. 2A. w = 0.6 (k = 800, α = 1.0) achieves the best average rank in Fig. 2B, suggesting that generic topics are assigned more weight than particular topics in the summaries of known prebiotics under full normalization. Beyond that, w under weak normalization (α = 0.2) was also investigated to further understand the impact of normalization (results shown in Fig. 2C). w = 1.0 achieves the best average rank under weak normalization, suggesting that generic topics alone are used to represent the entire known prebiotics summary; this indicates that full normalization is necessary when encountering unbalanced data (otherwise, the system will automatically abandon particular instances to maintain performance). Notably, when screening the optimal parameter α, the average rank is integrated over w. Finally, the optimal parameters of α = 1, w = 0.6, and k = 800 were chosen for further analyses. After determining the optimal parameter set, the two divergent topics (generic and particular) are balanced by the parameter w to generate a feature summary of positive prebiotics.
Figure 2. Exhaustive grid search for the optimal parameter set via 5-fold cross-validation. The figure describes the contribution of the 3 parameters (α, w, and k) in the model. Each column adopts a fixed k. (A) shows that the optimal α was 1, while the optimal w = 0.6 and k = 800 were identified in (B). Weak normalization was also investigated in (C).
3.3. Feature enrichment analysis and carbohydrates clustering
To investigate the major topics of the selected features, an enrichment analysis was performed (see Section 2). The result is shown in Fig. 3. Interestingly, >95%, >70%, and >70% of the features correspond to metabolism, chemistry, and pharmacology, respectively, coinciding with our prior knowledge that prebiotics usually play major roles in the metabolism of the human body owing to their various chemical structures and pharmacological properties. In other words, these vital properties are concealed in the feature summary, and our method extracts them and uses them for prediction.
To examine the quality of the 800 selected features, we further applied hierarchical clustering to determine whether these features can place related carbohydrates adjacent to each other. Hierarchical clustering is a widely used data analysis tool that provides dataset summaries by grouping similar observations into 1 cluster. In the real-world case presented in Fig. 4, the clustered carbohydrates notably shared a similar structure with the MeSH tree in NLM. For instance, cyclodextrins are cyclic oligosaccharides consisting of 6 (α-cyclodextrins), 7 (β-cyclodextrins), 8 (γ-cyclodextrins), or more glucopyranose units linked by α-(1,4) bonds, and they form a child node of dextrins in the MeSH tree (green block at 9 o'clock). In addition to this dextrins branch, other branches, such as the agar branch (red block at 8 o'clock), the oligosaccharides branch (green block at 4 o'clock), and the fructans branch (green block at 1 o'clock), also achieve high similarity with the MeSH tree. These results indicate that the features we selected may be effective in further prebiotics prediction.
3.4. Model evaluation and prebiotics prediction
The ROC curve was employed for model evaluation. Because of the limited size of the positive set (only 15), we first enlarged the positive set to 50 to validate our method. The 50 positives comprise the previous 15 positive prebiotics and 35 carbohydrates under the polysaccharides node in the NLM MeSH tree; their names are listed in S2 Table, MD/B448. Using the 50 positives and the remaining 77 carbohydrates, we obtained the optimal parameters α = 1, w = 0.3, and k = 800, with an average rank of 11.905. The optimal parameters were used for model evaluation by a 5-fold cross-validation ROC curve. In addition, we compared our method with a machine learning method. The frequency matrix for machine learning is extremely sparse and contains more than 20,000 variables. The random forest algorithm can handle large numbers of variables and is robust to overfitting, so we chose to compare our method with the random forest algorithm (see Section 2).
Figure 5A shows a 5-fold cross-validation ROC curve for the model with 50 positives. With the enlarged positive set, our model performs well, with an AUC of 0.891, which is better than that of the random forest algorithm (AUC of 0.846). After the method validation step with 50 positives, we turned to the 15 positive prebiotics and performed the real-world ROC evaluation.
Figure 5B shows a 5-fold cross-validation ROC curve for the model with 15 positives. Surprisingly, the performance of our model is far better than that of the random forest algorithm. This suggests that our method can be a good choice for highly imbalanced data (112 negatives vs. 15 positives). We achieved an AUC of 0.911, and a cut-off of 0.013 maintains the optimal balance between average specificity and average sensitivity. This cut-off corresponds to rank 11 in the prediction list; the carbohydrates ranked above it may have prebiotic properties. The predicted novel prebiotics are presented in Table 2, and some of them have been investigated by prebiotics experts. The average specificity and sensitivity were 0.876 and 0.838, respectively.
Figure 3. Feature enrichment analysis. The top 20 qualifier names were extracted from the 800 features. The categories in the figure roughly indicate the high-level concepts of the 800 features. Those concepts are highly correlated with the real-world chemical properties of prebiotics.
In addition to evaluating the model and predicting potential prebiotics, we also investigated related literature evidence for the 11 potential prebiotics based on the original definition of prebiotics: "a prebiotic is a selectively fermented ingredient that allows specific changes, both in the composition and/or activity in the gastrointestinal microbiota, that confer benefits upon host well-being and health." Most of the predicted prebiotics are supported by the literature analysis for 2 of the 3 criteria of prebiotics (non-digestibility, fermentation, and selectivity), and there are no obvious conflicts with these criteria. Even for the most rigorous criterion (selectivity), there are also many candidates with promising evidence. For example, isomaltose has been shown to be a prebiotic with digestion-resistant properties, raffinose is a complex carbohydrate that can promote the growth of beneficial microorganisms, and acarbose is usually administered in diabetes treatment and has promising potential as a prebiotic. Additionally, cyclodextrin is a saccharide that can reduce the digestion of carbohydrates and lipids. The derivative α-cyclodextrin is a soluble dietary fiber that possesses
the ability to feed one of the Lactococcus sp. strains in the gastrointestinal tract, whereas the other derivative (β-cyclodextrin) has been shown to be an important component of low-fat
In summary, this promising list not only shows prospective prebiotics but also demonstrates the efficacy of our method.
4. Discussion
It should be noted that our method depends on MeSH terms. Curators typically assign 10 to 12 MeSH terms to describe most indexed papers in PubMed, but a small portion of papers have not yet been curated. For these overlooked papers, we suggest that keywords be extracted manually from their abstracts and titles for information integrity. In addition, almost all text mining methods, including ours, are partly limited by the size and type of the dataset, and the predictive power of our method in other data-intensive fields has not been tested.
Prebiotics can provide substantial health benefits to both healthy and unhealthy people. Despite their significant demonstrated medical effects, the discovery and application of prebiotics cannot meet the growing needs of the prebiotic market simply by manually matching candidates to criteria. In an effort to improve prebiotics mining efficiency, we herein present a methodology that uses text mining techniques to expand the variety of potential prebiotics identified from related literature.
Figure 4. Hierarchical clustering of carbohydrates. Carbohydrates were clustered hierarchically. Many branch structures are highly correlated with the MeSH tree in
NLM; we could therefore infer that the features capture a large portion of the properties of prebiotics and can be employed to predict potential prebiotics in the prediction
step. MeSH = medical subject headings, NLM = National Library of Medicine.
We explored the optimal parameter set in an exhaustive grid
search: each important parameter (a, w, and k) was evaluated
over a spectrum of potential values. In the parameter
selection process, the parameter a effectively trades off corpus
volume, even when the gulf between the volumes of certain corpora
is large (10 ). The parameters w and k also substantially impact
the predictive performance. To more accurately determine the
variation tendency for the corpus volume, we performed
additional analyses to plot the average rank score against each
w and k at a specific lower a (a = 0.2) after determining the
optimum a (1.0). Corpus volumes in our experiment vary
substantially; thus, a is intended to narrow the focus to
reasonable parameters. Likewise, our parameter selection process
may provide a solution for other corpora, especially those with
volume-unbalanced data.
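The exhaustive grid search described above can be sketched as follows. This is a minimal illustration, not our implementation: `toy_score` is a hypothetical stand-in for the average rank score, the candidate values are invented, and treating a lower score as better is an assumption that can be flipped with the `minimize` flag.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Optional, Tuple

Params = Tuple[float, int, int]


def grid_search(
    score_fn: Callable[[float, int, int], float],
    a_values: Iterable[float],
    w_values: Iterable[int],
    k_values: Iterable[int],
    minimize: bool = True,
) -> Tuple[Optional[Params], Dict[Params, float]]:
    """Evaluate every (a, w, k) combination and return the best one with all scores."""
    scores: Dict[Params, float] = {}
    best: Optional[Params] = None
    for a, w, k in product(a_values, w_values, k_values):
        s = score_fn(a, w, k)
        scores[(a, w, k)] = s
        if best is None:
            best = (a, w, k)
        elif (s < scores[best]) if minimize else (s > scores[best]):
            best = (a, w, k)
    return best, scores


def slice_at_a(scores: Dict[Params, float], a: float) -> Dict[Tuple[int, int], float]:
    """Scores for every (w, k) pair at one fixed a, e.g. for plotting."""
    return {(w, k): s for (aa, w, k), s in scores.items() if aa == a}


def toy_score(a: float, w: int, k: int) -> float:
    """Hypothetical score with its minimum at a=1.0, w=2, k=20 (illustration only)."""
    return abs(a - 1.0) + abs(w - 2) + abs(k - 20) / 10.0


if __name__ == "__main__":
    best, scores = grid_search(toy_score, [0.2, 0.5, 1.0], [1, 2, 3], [10, 20, 30])
    print(best)
    print(slice_at_a(scores, 0.2))
```

Keeping the full score table, rather than only the best combination, makes it possible to inspect how the score varies with w and k at a fixed a, as in the additional analysis at a = 0.2.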
Notwithstanding inevitable practical constraints, we believe
that our work is an important step in identifying more prebiotics,
thereby yielding meaningful results and providing a basis for
future development and experimentation. We identified critical
factors affecting the mining work and developed methods for
feature selection on volume-unbalanced data to assess
predictive performance. We also performed clustering
measurements to evaluate the selected features for known
prebiotics. The ROC curve, which evaluates the model fit for
an optimal parameter set, showed that the candidates we
identified are sufficiently consistent to form a list of potential
prebiotics for further research. In the list of 11 potential prebiotics,
apart from the promising specific carbohydrates, some
relatively broad categories, such as xylans,
fructans, and dextrins, also appear and indicate a promising field of potential
prebiotics.
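For readers unfamiliar with ROC analysis, the following minimal sketch shows how ROC points, the area under the curve, and the sensitivity and specificity at a chosen cut-off can be computed from prediction scores. It is an illustration under stated assumptions, not our evaluation code: the scores and labels are invented, and a higher score is assumed to mean a more prebiotic-like carbohydrate.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points, sweeping the cut-off high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        cutoff = pairs[i][0]
        # Process tied scores together so they yield a single point.
        while i < len(pairs) and pairs[i][0] == cutoff:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def sensitivity_specificity(scores: Sequence[float], labels: Sequence[int],
                            cutoff: float) -> Tuple[float, float]:
    """Sensitivity and specificity when score >= cutoff is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if not y and s < cutoff)
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    return tp / pos, tn / neg


if __name__ == "__main__":
    # Invented scores and labels (1 = known prebiotic), for illustration only.
    demo_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
    demo_labels = [1, 1, 0, 1, 0, 0]
    print(auc(roc_points(demo_scores, demo_labels)))
    print(sensitivity_specificity(demo_scores, demo_labels, 0.6))
```

A curve that follows the diagonal gives an area of 0.5 and corresponds to a test with no discrimination.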
Overall, the MeSH-based text mining method provides a
bridge between the availability of tens of thousands of studies
with curated MeSH terms and the emerging functionality of
prebiotics studies, which have found few prebiotics over many
years. For the former, our algorithm enhances the
power of discovering potential prebiotics underlying countless
studies. For the latter, new candidates for potential prebiotics that
are useful in prebiotics research come to light. Regarding future
directions, the thousands of studies at hand in an
entire literature corpus (rather than individual studies) can,
taken together, assist us in other fields, such as finding bacteria
that can perform certain functions or obtaining food for soldiers,
which may represent a niche need in future studies.
In this integrated analysis, we present new ideas and
instructions that are helpful to researchers. Our results indicate
that there are currently no universal parameters for the mining
task and that the parameter set reported to work for a specific
corpus may not be an appropriate choice for other research. As we
noted, an exhaustive grid search is recommended to customize
the parameter set, not only to determine the best parameter settings
for given corpora but also to assess their potential prediction
performance. Taken together, algorithm development as a part of
our study is meaningful in a wide range of biological scenarios, and
the prebiotics set obtained in this study
may provide novel text mining-based insights and clues in the
prebiotics field. Follow-up studies are warranted to validate the
findings herein; moreover, additional defined prebiotic substances
and related documents will improve the model. Our text mining-
based study lays the foundation for efficient mining of
potential prebiotics, which may indicate a promising
method in the difficult field of prebiotics research.
Figure 5. Cross-validation ROC analyses were used to evaluate model performance and determine the ranking threshold. (A) The ROC plot indicates that our method
(red) performs better than random forest (green) with 50 positives; that is, our method discriminates well between known prebiotics and other carbohydrates.
The 45° diagonal line (dashed) indicates the theoretical plot of a test with no discrimination between known prebiotics and carbohydrates. (B) The ROC plot
indicates that our method (red) performs far better than random forest (green) with 15 positives. The cut-off is the threshold beyond which carbohydrates are deemed
to possess prebiotic properties. ROC = receiver operating characteristic.
Table 2
Summary and conclusion on the prebiotic effect of 11 potential prebiotics.
Rank  Carbohydrates                     Non-digestibility  Fermentation  Selectivity  References
1     Isomaltose                        Yes                n.c.          Yes
2     Xylans                            Yes                Yes           n.c.
3     Fructans                          Yes                Yes           n.c.
4     β-Cyclodextrins                   Yes                Yes           n.c.
5     Raffinose                         Yes                n.c.          Yes
6     Dextrins                          Yes                n.c.          n.c.
7     α-Cyclodextrins                   Yes                Yes           n.c.
8     Mitobronitol                      Probable           n.c.          n.c.
9     Oligosaccharides, branched-chain  Yes                Yes           n.c.
10    Acarbose                          Yes                Yes           n.c.
11    Xylose                            Yes                n.c.          Yes
n.c. = not clear.
We thank Miss Xin Song for critical discussion and suggestions.
We would also like to acknowledge the generous funding
provided by the National Basic Research Project (973 program)
(2012CB518200), the General Program (31401141, 81573251,
30900830) of the Natural Science Foundation of China, the State
Key Laboratory of Proteomics of China (SKLP-Y201303, SKLP-
O201104 and SKLP-K201004), and the Special Key Programs
for Science and Technology of China (2012ZX09102301016).
