RESEARCH ARTICLE Open Access
PubMed-supported clinical term weighting
approach for improving inter-patient similarity
measure in diagnosis prediction
Lawrence WC Chan¹*, Ying Liu², Tao Chan³, Helen KW Law¹, SC Cesar Wong¹, Andy PH Yeung¹, KF Lo¹, SW Yeung¹, KY Kwok¹, William YL Chan¹, Thomas YH Lau¹ and Chi-Ren Shyu⁴
Abstract
Background: Similarity-based retrieval of Electronic Health Records (EHRs) from large clinical information systems provides physicians with evidence to support diagnoses or examination referrals for suspected cases. Clinical terms in EHRs represent high-level conceptual information, and a similarity measure based on these terms reflects the chance of inter-patient disease co-occurrence. The assumption that clinical terms are equally relevant to a disease is unrealistic and reduces the prediction accuracy. Here we propose a term weighting approach supported by the PubMed search engine to address this issue.
Methods: We collected and studied 112 abdominal computed tomography imaging examination reports from four hospitals in Hong Kong. Clinical terms, which are the image findings related to hepatocellular carcinoma (HCC), were extracted from the reports. Through two systematic PubMed search methods, the generic and specific term weightings were established by estimating the conditional probabilities of clinical terms given HCC. Each report was characterized by an ontological feature vector, and there were 6216 vector pairs in total. We optimized the modified direction cosine (mDC) with respect to a regularization constant embedded into the feature vector. Equal, generic and specific term weighting approaches were applied to measure the similarity of each pair, and their performances in predicting inter-patient co-occurrence of HCC diagnoses were compared by using Receiver Operating Characteristics (ROC) analysis.
Results: The areas under the ROC curves (AUROCs) of similarity scores based on equal, generic and specific term weighting approaches were 0.735, 0.728 and 0.743 respectively (p < 0.01). In comparison with equal term weighting, the performance was significantly improved by specific term weighting (p < 0.01) but not by generic term weighting. The clinical terms “Dysplastic nodule”, “nodule of liver” and “equal density (isodense) lesion” were found to be the top three image findings associated with HCC in PubMed.
Conclusions: Our findings suggest that the optimized similarity measure with specific term weighting applied to EHRs can significantly improve the accuracy of predicting the inter-patient co-occurrence of diagnosis when compared with equal and generic term weighting approaches.
* Correspondence: wing.chi.chan@polyu.edu.hk
¹ Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
Full list of author information is available at the end of the article
© 2015 Chan et al.; licensee BioMed Central. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain
Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article,
unless otherwise stated.
Chan et al. BMC Medical Informatics and Decision Making (2015) 15:43
DOI 10.1186/s12911-015-0166-2
Background
The huge amount of clinical data managed by the electronic health record (EHR) system potentiates case-based decision support, in which reference cases are retrieved based on their similarity to the current case of interest [1, 2]. To measure inter-patient similarity consistently, the feature vector model has been established by systematically transforming the clinical information in EHRs, including laboratory test findings, medical images and diagnostic reports, into vector elements [3–6].
The transformation of textual information, such as image findings, into a feature vector requires the support of a medical ontology [5, 6]. Systematized Nomenclature of Medicine (SNOMED) Clinical Terms (CT) is a collection of clinical terms that are organized as concepts and linked in a hierarchy with “is-a” or inverse “is-a” relationships [7–10]. Concepts at a particular level of the hierarchical structure are selected as the feature concepts. The edge count along the path connecting a term in an EHR and a feature concept in the “is-a” hierarchy represents their semantic distance [3–5, 11, 12]. The ontological feature vector contains numerical elements, each of which is inferred by integrating the semantic distances from all the EHR terms to a feature concept. It has been shown that the ontological vector model significantly outperforms simple string matching in predicting inter-patient co-occurrence of subclinical disorder [12].
Euclidean distance and direction cosine are two commonly used similarity measures with different properties. Direction cosine measures similarity according to the angle between two feature vectors only, whereas Euclidean distance considers the magnitudes of the two vectors in addition to the angle. Euclidean distance is therefore more sensitive than direction cosine to the absolute difference between two EHRs. For high-dimensional vector models, the two measures achieve similar accuracy in nearest neighbour queries. However, the direction cosine is more computationally efficient than Euclidean distance because, in information retrieval applications, the ontological vectors usually have a large number of zero elements, which expedites the computation of direction cosine. Identifying similar examination reports for diagnosis prediction requires an exhaustive search of the imaging examination database. As the database is assumed to host a huge number of eligible reports, the efficiency of computing the similarity score between an eligible report and the query report becomes crucial.
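The contrast between the two measures can be illustrated with a minimal sketch. It assumes, purely for illustration, that each feature vector is stored as a sparse dictionary of its non-zero elements; the function names and the example values are hypothetical and are not taken from the study.

```python
import math

def sparse_dot(q, d):
    """Inner product of two sparse vectors stored as {feature: value} dicts."""
    if len(q) > len(d):
        q, d = d, q  # iterate over the shorter vector only
    return sum(value * d[key] for key, value in q.items() if key in d)

def direction_cosine(q, d):
    """Cosine of the angle between q and d; it depends on direction only."""
    return sparse_dot(q, d) / math.sqrt(sparse_dot(q, q) * sparse_dot(d, d))

def euclidean_distance(q, d):
    """Straight-line distance; it also reflects the magnitudes of q and d."""
    keys = set(q) | set(d)
    return math.sqrt(sum((q.get(key, 0.0) - d.get(key, 0.0)) ** 2 for key in keys))

# Hypothetical vectors that point in the same direction but differ in length.
query = {"liver finding": 0.5}
report = {"liver finding": 1.0}
print(direction_cosine(query, report))    # angle-based: the vectors look identical
print(euclidean_distance(query, report))  # magnitude-aware: a difference remains
```

Zero elements never enter the inner product, which is one way to read the efficiency argument made above for sparse ontological vectors.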
The modified direction cosine (mDC) was developed by Chan et al. (2011) to preserve the advantageous properties of both Euclidean distance and direction cosine and to extend the applications to low-dimensional vector models [12]. In mDC, the feature vector is augmented by a regularization constant of unity to acquire the property of Euclidean distance while maintaining the computational efficiency of direction cosine [12]. The numerical overflow that can happen for direction cosine is avoided because, with the regularization constant included, the length of the feature vector is never close to zero. However, it remains an open question whether the performance of mDC can be optimized over different values of this regularization constant.
The feature concepts of the above-mentioned vector model were equally weighted. In fact, clinical terms are unequally associated with a particular disease. For example, hepatic necrosis and cirrhosis are common image findings in the computed tomographic scan of hepatocellular carcinoma (HCC) patients. However, “hepatic necrosis” is more spatially associated with the cell death phenomenon in the simultaneous growth of HCC than “cirrhosis”, which reveals a fibrotic condition following cell death in HCC. Thus, term weighting, which has been well established in bioinformatics, should be applied to improve the accuracy of the semantic measure or to remove unrelated terms [13, 14].
The disease of interest in this work is HCC, one of the ten most common cancers in the world [15]. Abdominal tomographic scans play an important role in the diagnosis of HCC because the images can show patterns characterizing the pathophysiology of HCC [16]. Such patterns, once observed by radiologists, are recorded as findings in the image examination report.
In this work, two novel weighting approaches, namely generic term weighting and specific term weighting, are proposed to improve the performance of the ontological vector model in predicting inter-patient HCC co-occurrence. The performances of these two approaches were compared with the baseline approach of equal term weighting, in which all feature concepts were equally weighted and the independent constant had already been optimized.
The generic and specific term weighting approaches were implemented based on a systematic search of PubMed, a huge database indexing biomedical journal articles. We assumed that a term is highly related to a particular disease if the chance of co-mentioning the term, the disease of interest and their synonyms in the abstracts of the articles is high. The highly weighted terms identified by this work can also be used to index the reports, reminding clinicians to follow up with other clinical tests.
Methods
Clinical data collection
Under the criterion that the liver is the region of interest for HCC, 112 image reports of abdominal computed tomography examinations were collected retrospectively from the Radiology Departments of four local hospitals in Hong Kong. HCC or liver metastases were reported in 59 cases and no abnormality detected (NAD) in the other 53 reports. The patients were aged from 4 to 88 at the time of data retrieval. The patients were de-identified by using a randomly generated unique ID. Personal information, including name, identity card number, telephone number and address, was removed from the reports by third-party clinical personnel before the data were collected by the research team. Human Subject Ethics Approval was obtained from the Hong Kong Polytechnic University (HSEARS20140710002).
Image finding term extraction
The clinical terms of image findings related to HCC were identified and extracted manually from the reports by five practicing radiographers (Authors: APHY, KFL, SWY, KYK, WYLC). They learnt the structure, content and use of SNOMED CT from the Unified Medical Language System (UMLS). The extraction of clinical terms was supervised and validated by a radiologist (Author: TC) and two professorial staff with anatomy and radiography backgrounds (Authors: HKWL, TYHL).
The definitions of clinical terms and their synonyms are standardized by SNOMED CT and unified to concepts by UMLS Terminology Services (license code: NLM-0315126310), where a unique concept ID is assigned to each concept. The concepts for all the extracted image finding terms were identified. For the equal and generic term weighting approaches, the identified concepts were mapped to the corresponding feature concepts according to SNOMED CT, and the mapped feature concepts and their synonyms were used for weighting. For the specific term weighting approach, the extracted terms and their synonyms were used for weighting.
Ontological vector model
The relationship between concepts is defined by the “is-a” hierarchical tree of SNOMED CT, which consists of levels of concepts. As the concepts of the extracted terms exist at different levels, the reports can be compared consistently if the extracted terms are projected to the concepts at a particular level, which are referred to as feature concepts. The level-4 concepts were chosen as feature concepts in this work because level 4 provides an optimal classification granularity for accurate patient matching [12].
Let $f_i$, $m$, $d_j$ and $n$ be the $i$-th feature concept, the number of feature concepts, the $j$-th concept extracted from a report and the number of concepts extracted from the report, respectively. The semantic distance between $f_i$ and $d_j$ is defined as $s_{ij} \in [0, \infty]$. The value of $s_{ij}$ is determined subject to three rules (see Fig. 1).
1. If $d_j$ is a descendant of $f_i$, then $s_{ij}$ is the number of “is-a” links from $d_j$ to $f_i$.
2. If $d_j$ is not a descendant of $f_i$, then $s_{ij} = \infty$.
3. If $d_j$ is the same as $f_i$, then $s_{ij} = 0$.
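The three rules can be sketched as an upward search through the “is-a” links. This is only an illustration under stated assumptions: the hierarchy fragment is hypothetical, with only the named findings and the distances of Fig. 1 taken from the text, while the `placeholder_*` nodes stand in for intermediate SNOMED CT concepts that are not listed here. Where a term has several ancestors, the sketch assumes the shortest chain of links is counted.

```python
import math
from collections import deque

# Hypothetical child -> parents fragment of the "is-a" hierarchy.
# placeholder_* nodes are invented stand-ins for unnamed intermediate concepts.
IS_A = {
    "cirrhosis": ["placeholder_c2"],
    "placeholder_c2": ["placeholder_c1"],
    "placeholder_c1": ["liver finding"],
    "hepatic fibrosis": ["placeholder_h1"],
    "placeholder_h1": ["liver finding"],
    "splenomegaly": ["placeholder_s1"],
    "placeholder_s1": ["abdominal organ finding"],
}

def semantic_distance(term, feature, is_a=IS_A):
    """Number of "is-a" links from term up to feature (inf if not a descendant)."""
    if term == feature:
        return 0  # rule 3: the term is itself the feature concept
    seen = {term}
    queue = deque([(term, 0)])
    while queue:  # breadth-first search upwards finds the shortest chain
        node, dist = queue.popleft()
        for parent in is_a.get(node, []):
            if parent == feature:
                return dist + 1  # rule 1: count the links to the ancestor
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, dist + 1))
    return math.inf  # rule 2: feature is not an ancestor of term

print(semantic_distance("cirrhosis", "liver finding"))
print(semantic_distance("splenomegaly", "liver finding"))
```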
For each report, a feature vector, given by $[a_1, a_2, a_3, \dots, a_m, \delta]$, was generated. $\delta$ represents a regularization constant whose value is equal to $10^{-k}$, where $k$ is a non-negative integer; $a_i \in [0, 1]$ represents a vector element associated with the $i$-th feature concept and is obtained by the following formula.
$$a_i = \frac{\sqrt{p_i}}{1 + \min_{j = 1 \dots n} s_{ij}} \tag{1}$$
where $p_i$ is the conditional probability of the $i$-th feature concept given the occurrence of HCC. The value of $a_i$ indicates the relatedness of the $i$-th feature concept to the image finding terms of a report. The ability of the feature vector to characterize a report can be modulated by $p_i$. When $p_i$ is zero, the effect of the $i$-th feature concept on the similarity score is fully repressed. When $p_i$ is one, the effect of the $i$-th feature concept on the similarity score is fully promoted.
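A minimal sketch of equation (1) follows. The function name and the input layout (one list of semantic distances per feature concept) are assumptions made for illustration; the example distances are those described for the four terms of Fig. 1, with equal weights of one.

```python
import math

def feature_vector(distances, weights, k):
    """Build [a_1, ..., a_m, delta] following equation (1).

    distances[i] holds s_ij for feature concept i over all extracted concepts j,
    weights[i] is p_i, and delta = 10**-k is the regularization constant.
    """
    vector = []
    for s_i, p_i in zip(distances, weights):
        # A feature concept with no extracted concepts gets an infinite distance.
        s_min = min(s_i, default=math.inf)
        vector.append(math.sqrt(p_i) / (1.0 + s_min))  # evaluates to 0.0 when s_min is inf
    vector.append(10.0 ** -k)
    return vector

inf = math.inf
# Rows: "liver finding", "abdominal organ finding", "fatty liver".
# Columns: "cirrhosis", "hepatic fibrosis", "splenomegaly", "fatty liver".
distances = [
    [3, 2, inf, inf],
    [inf, inf, 2, inf],
    [inf, inf, inf, 0],
]
print(feature_vector(distances, weights=[1.0, 1.0, 1.0], k=10))
```

Taking the square root of $p_i$ means that the inner product of two such vectors carries each weight $p_i$ once when both reports share the same weight.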
Similarity measure
The similarity score between two reports was calculated by using the direction cosine of their feature vectors, $Q$ and $D$.
$$\mathrm{sim}(Q, D) = \frac{Q \cdot D}{|Q|\,|D|} \tag{2}$$
where “$\cdot$” is the inner product of two vectors and $|x|$ is the length of a vector $x$. The similarity score ranges from 0 to 1. As the similarity score tends to 0, the vectors $Q$ and $D$ are more dissimilar to each other; as it tends to 1, they are more similar. To improve the inter-patient similarity measure for HCC co-occurrence prediction, this work aims to establish a PubMed-supported approach for estimating the conditional probability $p_i$ more precisely. The implementation of the inter-patient HCC co-occurrence prediction is illustrated in Fig. 2.
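Equation (2), applied to vectors augmented with the regularization constant, can be sketched as below. The function names `direction_cosine` and `mdc` and the example element values are illustrative assumptions rather than the study's implementation.

```python
import math

def direction_cosine(q, d):
    """Equation (2): inner product divided by the product of vector lengths."""
    dot = sum(x * y for x, y in zip(q, d))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_d = math.sqrt(sum(y * y for y in d))
    return dot / (norm_q * norm_d)

def mdc(q_elements, d_elements, k):
    """Direction cosine after appending delta = 10**-k to both vectors."""
    delta = 10.0 ** -k
    return direction_cosine(list(q_elements) + [delta], list(d_elements) + [delta])

empty = [0.0, 0.0, 0.0]          # a report with no extracted terms
findings = [1 / 3, 1 / 3, 1.0]   # a report with several findings

print(mdc(empty, empty, k=10))     # defined, whereas the plain cosine would divide by zero
print(mdc(empty, findings, k=10))  # close to 0: the two reports look dissimilar
print(mdc(empty, findings, k=0))   # delta = 1 pulls the score up regardless of content
```

Without the appended constant, `direction_cosine(empty, empty)` raises a `ZeroDivisionError`, which is the situation the regularization constant is meant to avoid.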
Optimization of similarity measure
The similarity measure was optimized by determining its maximum performance in predicting HCC co-occurrence among different values of $k$. Equal term weighting (i.e. $p_i = 1$) is considered as the baseline for the optimization. The choice of $k$ is of crucial importance when the extracted terms are particularly few or absent. If $k$ tends to infinity ($\delta \approx 0$), the similarity score will be unstable and probably undefined due to the tiny magnitude of the feature vector. If $k$ is equal to 0, the similarity score will be dominated by the value of $\delta$ irrespective of the reports’ content. The accuracy of the similarity measure in predicting HCC co-occurrence was plotted against the value of $k$, and we determined the value of $k$ at which the accuracy attains its maximum according to the trend of the plot. Besides equal term weighting, the optimal value of $k$ was applied to the feature vector for establishing the generic and specific term weighting approaches.
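The sweep over $k$ can be sketched as a simple loop. Everything below is illustrative: the six toy reports are made up and are not the study data, and the AUROC is computed with the rank-based (Mann-Whitney) formulation as an assumed stand-in for the ROC analysis.

```python
import math
from itertools import combinations

def mdc(q, d, k):
    """Direction cosine of two element lists, each augmented with delta = 10**-k."""
    delta = 10.0 ** -k
    qa, da = q + [delta], d + [delta]
    dot = sum(x * y for x, y in zip(qa, da))
    return dot / math.sqrt(sum(x * x for x in qa) * sum(y * y for y in da))

def auroc(match_scores, mismatch_scores):
    """Probability that a matched pair scores higher than a mismatched pair (ties count 1/2)."""
    wins = sum((m > n) + 0.5 * (m == n) for m in match_scores for n in mismatch_scores)
    return wins / (len(match_scores) * len(mismatch_scores))

# Made-up toy reports: (vector elements under equal weighting, diagnosis).
REPORTS = [
    ([1.0, 0.5, 0.0], "HCC"),
    ([0.5, 1.0, 0.0], "HCC"),
    ([1.0, 0.0, 0.5], "HCC"),
    ([0.0, 0.0, 0.0], "NAD"),
    ([0.0, 0.0, 0.0], "NAD"),
    ([0.0, 0.0, 0.5], "NAD"),
]

def auroc_for_k(k, reports=REPORTS):
    """Score every report pair with mDC and summarize the separation of matches from mismatches."""
    match, mismatch = [], []
    for (q, diag_q), (d, diag_d) in combinations(reports, 2):
        (match if diag_q == diag_d else mismatch).append(mdc(q, d, k))
    return auroc(match, mismatch)

for k in range(0, 6):
    print(k, round(auroc_for_k(k), 3))
```

The value of $k$ would then be read off the resulting trend, as described above for the plot of accuracy against $k$.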
Generic term weighting
According to equation (1), the feature concepts are weighted by a panel of $p_i$, each defined as the conditional probability of the $i$-th feature concept given HCC. A literature search was performed by using PubMed, and the numbers of abstracts listed in the search results were used for the estimation of $p_i$. Generic term weighting was implemented by directly applying the following formula.
$$p_{i,\mathrm{generic}} = \frac{\#\,\text{of abstracts containing } [(A \text{ OR } A'_1 \text{ OR } A'_2 \text{ OR} \dots) \text{ AND } (B \text{ OR } B'_1 \text{ OR } B'_2 \text{ OR} \dots)]}{\#\,\text{of abstracts containing } (B \text{ OR } B'_1 \text{ OR } B'_2 \text{ OR} \dots)} \tag{3}$$
where $A$ and $A'_n$ are the $i$-th feature concept and its $n$-th synonym respectively; $B$ and $B'_n$ represent HCC and its $n$-th synonym respectively. Using this approach, the calculated weights of feature concepts were the same among different reports although their descendant terms extracted from the reports are different.
Fig. 1 - Projection of image finding terms to feature concepts in the SNOMED CT “is-a” hierarchy. Part of the “is-a” hierarchical relationships is illustrated with three examples demonstrating the rules to determine the semantic distances. Four image finding terms, “cirrhosis”, “hepatic fibrosis”, “splenomegaly” and “fatty liver”, are considered. The level-4 concepts are regarded as feature concepts. In this case, the feature concepts “liver finding”, “abdominal organ finding” and “fatty liver” are involved. (a) The term “cirrhosis” at level 7 is a descendant of “liver finding”. Their semantic distance is 3 because there are three “is-a” links between them. (b) The semantic distance between “hepatic fibrosis” and “liver finding” is 2. (c) The term “splenomegaly” is not a descendant of “liver finding” but a descendant of “abdominal organ finding”. Thus, the semantic distance between “splenomegaly” and “liver finding” is infinity and that with “abdominal organ finding” is 2. Finally, the term “fatty liver” at level 4 is also a feature concept and the semantic distance is 0
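The ratio in equation (3) can be sketched as two Boolean queries and a division. This is a sketch under stated assumptions: `count_abstracts` is a hypothetical callable that returns the number of abstracts matching a query string (in practice it might wrap a PubMed search), the quoting of phrases is an assumed convention, and the counts in the usage example are invented so that the block runs offline.

```python
def or_group(terms):
    """Join a term and its synonyms into one parenthesized OR clause."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"

def build_queries(term_synonyms, disease_synonyms):
    """Return (joint query, disease-only query) for the numerator and denominator."""
    disease_query = or_group(disease_synonyms)
    joint_query = f"{or_group(term_synonyms)} AND {disease_query}"
    return joint_query, disease_query

def conditional_probability(term_synonyms, disease_synonyms, count_abstracts):
    """Estimate P(term | disease) as a ratio of abstract counts."""
    joint_query, disease_query = build_queries(term_synonyms, disease_synonyms)
    denominator = count_abstracts(disease_query)
    if denominator == 0:
        return 0.0  # assumption: no disease abstracts means no evidence for the term
    return count_abstracts(joint_query) / denominator

# Usage with invented counts; a real run would query PubMed instead.
disease = ["hepatocellular carcinoma", "HCC"]
concept = ["liver finding"]
joint_q, disease_q = build_queries(concept, disease)
toy_counts = {joint_q: 250, disease_q: 1000}
print(joint_q)
print(conditional_probability(concept, disease, toy_counts.get))
```

Passing the counting function in as a parameter keeps the arithmetic testable without network access and leaves the choice of search interface open.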
Specific term weighting
In the specific term weighting approach, we searched PubMed for the abstracts containing the extracted terms and HCC. The conditional probability of the $m$-th extracted term given HCC, $q_m$, was estimated by the following formula.
$$q_m = \frac{\#\,\text{of abstracts containing } [(C \text{ OR } C'_1 \text{ OR } C'_2 \text{ OR} \dots) \text{ AND } (B \text{ OR } B'_1 \text{ OR } B'_2 \text{ OR} \dots)]}{\#\,\text{of abstracts containing } (B \text{ OR } B'_1 \text{ OR } B'_2 \text{ OR} \dots)} \tag{4}$$
where $C$ and $C'_n$ are the $m$-th extracted term and its $n$-th synonym respectively; $B$ and $B'_n$ represent HCC and its $n$-th synonym respectively. We assumed that the conditional probability of the $i$-th feature concept given HCC is equal to the average of the conditional probabilities of its $N$ descendant terms extracted from a report given HCC. The value of $p_i$ was calculated by the following formula.
$$p_{i,\mathrm{specific}} = \frac{q_1 + q_2 + q_3 + \dots}{N} \tag{5}$$
Note that the weighting of feature vector elements depends on the report content. In contrast to generic term weighting, where the weights do not change across reports, the weights of the same feature concept estimated by the specific term weighting approach may differ from patient to patient.
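Equation (5) and its report dependence can be sketched in a few lines. The probabilities are the Table 2 values for “Cirrhosis of liver”, “Hepatic fibrosis” and “Nodule of liver”; grouping all three under one feature concept, and returning zero when a report contains no descendant term, are assumptions made only for this illustration.

```python
import math

def specific_weight(q_values):
    """Equation (5): mean of q_m over the N descendant terms found in one report."""
    if not q_values:
        return 0.0  # assumption: no descendant term in the report, so no weight
    return sum(q_values) / len(q_values)

# Two hypothetical reports whose terms are assumed to fall under the same feature concept.
report_a = [0.170, 0.082]  # "Cirrhosis of liver" and "Hepatic fibrosis"
report_b = [0.513]         # "Nodule of liver" only

print(specific_weight(report_a))
print(specific_weight(report_b))
```

The same feature concept thus receives a different weight in each report, which is the behaviour that distinguishes specific from generic term weighting.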
Statistical analysis
Receiver operating characteristic (ROC) analysis was performed on the results of inter-patient HCC co-occurrence predicted by the equal, generic and specific term weighting approaches. For each approach, the ROC curve was plotted and the area under the ROC curve (AUROC) indicated the accuracy of the prediction, i.e. the probability of correctly classifying a pair of reports into same diagnosis (both are HCC; both are NAD) or different diagnosis (one is HCC and the other is NAD). In addition to the comparison with the area under the chance diagonal, AUROCs were compared with each other to determine the approach with the best performance, and the statistical significance of the observed differences was also indicated [17, 18].
Fig. 2 - A schematic view of the method. Step 1: Manual extraction of the image finding terms and their corresponding synonyms from the reports. Step 2: The concepts of the image finding terms defined in SNOMED CT were identified by using UMLS Terminology Services. Step 3: Edge counting of the semantic distances between the extracted terms and the level-4 feature concepts. Step 4: The feature concepts are weighted by (Step 4a) generic term weighting approach and (Step 4b) specific term weighting approach. Step 5: The feature vectors are generated. Step 6: Similarity scores between feature vectors are calculated by modified direction cosine
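The pair-level ROC analysis described above can be sketched as follows. The helper names, the four toy diagnoses and the six similarity scores are invented for illustration; the curve is traced by sweeping a threshold over the similarity scores and the area is taken with the trapezoidal rule, which is one common construction and not necessarily the one used in the study.

```python
from itertools import combinations

def label_pairs(diagnoses):
    """Return (i, j, is_match) for every unordered pair of reports."""
    return [(i, j, diagnoses[i] == diagnoses[j])
            for i, j in combinations(range(len(diagnoses)), 2)]

def roc_points(scores, is_match):
    """(false positive rate, true positive rate) as the threshold is lowered."""
    n_pos = sum(is_match)
    n_neg = len(is_match) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, m in zip(scores, is_match) if s >= threshold and m)
        fp = sum(1 for s, m in zip(scores, is_match) if s >= threshold and not m)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_area(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

diagnoses = ["HCC", "HCC", "NAD", "NAD"]
pairs = label_pairs(diagnoses)
is_match = [match for _, _, match in pairs]
scores = [0.9, 0.2, 0.3, 0.4, 0.1, 0.8]  # invented similarity score per pair

print(trapezoid_area(roc_points(scores, is_match)))
```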
Results
Feature extraction and report pair formation
We extracted 38 image finding terms from 112 examin-
ation reports (59 HCC and 53 NAD cases). These terms
are uniquely defined by 38 concepts in UMLS and were
projected to 36 feature concepts at level-4 of SNOMED
CT “is-a” hierarchy. The reports were paired up to form
6216 non-redundant pairs, in which 3089 pairs are
matches, i.e. (HCC,HCC) or (NAD,NAD), and 3127 pairs
are mismatches, i.e. (HCC,NAD) or (NAD,HCC).
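The pair counts follow directly from the 59 HCC and 53 NAD reports, as this short check shows (the variable names are illustrative).

```python
from math import comb

n_hcc, n_nad = 59, 53

total_pairs = comb(n_hcc + n_nad, 2)             # all unordered report pairs
match_pairs = comb(n_hcc, 2) + comb(n_nad, 2)    # (HCC,HCC) plus (NAD,NAD)
mismatch_pairs = n_hcc * n_nad                   # one HCC report and one NAD report

print(total_pairs, match_pairs, mismatch_pairs)  # 6216 3089 3127
```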
Optimization of similarity measure
Equal term weighting was considered as baseline for op-
timizing the similarity measure. ROC analysis of inter-
patient HCC co-occurrence prediction was performed
for different values of k. Fig. 3 shows the plot of AUROC
against k. It was found that the accuracy increases for k
between 0 and 2. For k > 2, the AUROC reaches a con-
stant level. Thus, we chose k = 10 for all the term weight-
ing approaches.
Estimation of conditional probabilities
In generic term weighting, abstracts were retrieved by
PubMed search for each feature concept and its syno-
nyms. The count of abstracts containing a feature concept
or its synonyms ranges from 1 to 427154. By incorporat-
ing HCC and its synonyms to the search criteria, the
abstract count was further reduced. The conditional prob-
ability of a feature concept for generic term weighting is
defined as the ratio of these two counts.
In specific term weighting, abstracts were retrieved by PubMed search for each extracted term and its synonyms. The count of abstracts containing an extracted term or its synonyms ranges from 1 to 195708. The abstract count is further reduced by adding HCC and its synonyms to the search criteria. The ratio of these two counts was projected to the corresponding feature concepts. The conditional probability of a feature concept for specific term weighting is defined as the average of the ratios across all of its descendant terms extracted from a report. The values of conditional probabilities were computed and saved in Excel files (see Additional files 1, 2, 3, 4).
Comparison of term weighting approaches
The AUROCs and the 95 % confidence intervals (95 % CIs) of the equal, generic and specific term weighting approaches are shown in Table 1. All three approaches significantly outperformed the random rater (p < 0.01). When compared to the equal term weighting approach (AUROC = 0.735), the performance was significantly improved by the specific term weighting approach (AUROC = 0.743, p < 0.01) but was significantly worsened by the generic term weighting approach (AUROC = 0.728, p < 0.01). The conditional probabilities of the extracted image finding terms given HCC, derived by the specific term weighting approach, were sorted in descending order. The top ten image finding terms are listed together with their conditional probabilities in Table 2.
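The text does not spell out how the confidence intervals were obtained. As a rough sketch only, the block below assumes the standard error approximation of Hanley and McNeil [17] and treats the 3089 matched and 3127 mismatched pairs as independent samples, which is a simplification because pairs share reports. It is not expected to reproduce Table 1 exactly.

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an AUROC (Hanley and McNeil, 1982)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc * auc / (1.0 + auc)
    variance = (auc * (1.0 - auc)
                + (n_pos - 1) * (q1 - auc * auc)
                + (n_neg - 1) * (q2 - auc * auc)) / (n_pos * n_neg)
    return math.sqrt(variance)

def confidence_interval(auc, n_pos, n_neg, z=1.96):
    """Normal-approximation interval; z = 1.96 corresponds to 95 % coverage."""
    se = hanley_mcneil_se(auc, n_pos, n_neg)
    return auc - z * se, auc + z * se

# Equal term weighting: AUROC = 0.735 over 3089 matches and 3127 mismatches.
print(hanley_mcneil_se(0.735, 3089, 3127))
print(confidence_interval(0.735, 3089, 3127))
```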
Discussion
Health records that are ontologically similar to a new suspected case support clinical decisions with evidence of the disease. The reliability of such an ontology-similarity-based case retrieval algorithm depends on the choices of inter-patient similarity measure and ontological vector model. It has been shown that the modified direction cosine (mDC) avoids the problem of numerical overflow and preserves the same properties as Euclidean distance [12]. However, weighting of the ontological vector was not considered in the previous studies, and it remained unknown whether the performance of mDC could be improved by adjusting the weights associated with the feature concepts. It was shown that the performance of the similarity measure was substantially improved by setting an extremely small regularization constant, $10^{-10}$. Such a setting helps keep the similarity scores discriminative when comparing health records that have very few or even no extracted clinical terms.
Fig. 3 - Plot of AUROC against the value of k. The accuracy of inter-patient HCC co-occurrence prediction increases when k is between 0 and 2 and saturates at the level of 0.735 when k further increases
Table 1 - Comparison of term weighting approaches. AUROCs and the 95 % CIs of the equal, generic and specific term weighting approaches are summarized here

Term weighting approach    AUROC    95 % CI
Equal term weighting       0.735    (0.724, 0.746)
Generic term weighting     0.728    (0.717, 0.739)
Specific term weighting    0.743    (0.732, 0.754)
In generic term weighting, the median and the 10th percentile of the counts of retrieved abstracts containing the feature concepts or their synonyms are 4204 and 82.9 respectively. In specific term weighting, the median and the 10th percentile of the counts of retrieved abstracts containing the extracted terms or their synonyms are 2689.5 and 46.5 respectively. The sample sizes of the retrieved abstracts are large enough to support the estimation of conditional probabilities.
In comparison to equal term weighting, the performance was improved by the specific term weighting approach but worsened by the generic term weighting approach. This implies that weighting the feature vector elements does not necessarily give better performance; the way in which the weights are derived is crucial for improving performance. In generic term weighting, the feature concepts at level 4, instead of the clinical terms extracted from the reports, were used for the PubMed search. The weights are associated with the level-4 concepts only and remain unchanged across different reports. Moreover, the level-4 concepts are not specific enough to provide reliable PubMed search results for estimating the conditional probabilities. Specific term weighting used the extracted terms directly for the PubMed search. The search results are more reliable for estimating the conditional probabilities due to the higher granularity of concepts provided by the extracted terms. Although the conditional probabilities of the descendant extracted terms are averaged to generate the weights of feature vector elements, the keywords for the PubMed search depend dynamically on the report contents and the weights become more specific.
The high weights of the feature concepts that dominate the similarity score are attributable to their descendant terms extracted from the reports. In Table 2, the top three image finding terms (conditional probabilities) are “dysplastic nodule” (0.934), “nodule of liver” (0.513) and “equal density (isodense) lesion” (0.438). The association of these image findings with HCC is supported by Sakamoto [19], who states that small equivocal lesions, i.e. dysplastic nodules, detected by imaging examination of the liver are regarded as a precursor of HCC. For cases with such image findings but no abnormality detected, we suggest indexing them as “high risk” so that close follow-up can be recommended to those patients.
As the conditional probability of the most relevant image finding, “Dysplastic nodule” (0.934), is more than ten times that of the ninth image finding, “Hepatic fibrosis” (0.082), only the top eight features contribute significantly to the diagnosis prediction performance. The remaining features are associated with negligible weights and have a negligible effect on the prediction. Therefore, the generic and specific term weighting approaches are analogous to feature selection, which makes the vector model more parsimonious with respect to the number of available cases.
As the numbers of PubMed abstracts are dynamic, the term weighting results may change from time to time. In future studies, we suggest enhancing the ontological vector model by incorporating more algorithmic elements from the information content model, which spans an essential dimension of assessing semantic similarity [20].
Besides the image examination report, laboratory test findings, such as alpha-fetoprotein (AFP) level and Child-Pugh score, are important features for the diagnosis of HCC. As the EHR integrates the image examination report and the laboratory test report, the feature vector can be augmented to cover laboratory test findings [21]. The same weighting approach and similarity measure can also be applied to such an augmented feature vector.
Our clinical term weighting approach not only improved the inter-patient similarity measure for diagnosis prediction; it may also be used to identify large cohorts of patients with similar disease presentation for retrospective treatment efficacy analysis, and to facilitate the identification of targeted patient cohorts for prospective interventional studies.
Conclusions
The performance of the inter-patient similarity measure was significantly improved by specifically weighting the elements of the ontological feature vector. PubMed search was applied to estimate the weights. Early HCC markers, including dysplastic nodule, nodule of liver and equal density lesion, were identified by PubMed search as image findings that are strongly associated with HCC.
Table 2 - Top ten image finding terms. The PubMed search results indicated that some image finding terms were co-mentioned with HCC very frequently in the abstracts of biomedical journal articles. The conditional probability of “Dysplastic nodule” (0.934) is the highest among all the extracted terms

Rank    Image finding                      Conditional probability
1       Dysplastic nodule                  0.934
2       Nodule of liver                    0.513
3       Equal density (isodense) lesion    0.438
4       Nodular hyperplasia of liver       0.329
5       Solitary necrotic liver nodule     0.259
6       Portal vein thrombosis             0.209
7       Space occupying lesion of liver    0.175
8       Cirrhosis of liver                 0.170
9       Hepatic fibrosis                   0.082
10      Nontraumatic hemoperitoneum        0.064
Additional files
Additional file 1: Generic Term Weighting Computation.
Additional file 2: Specific Term Weighting Computation.
Additional file 3: Similarity Scores with Generic Term Weighting.
Additional file 4: Similarity Scores with Specific Term Weighting.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
LWCC and YL have made substantial contributions to conception and design
of the study and have prepared and revised the manuscript. APHY, KFL,
SWY, KYK and WYLC have made substantial contributions to the acquisition,
image finding extraction, analysis and interpretation of data. TC, HKWL,
SCCW and TYHL have contributed the domain knowledge to approving the
image findings and their relationship to liver diseases. CRS has contributed
to verifying the ontological method and revising critically the manuscript.
All authors read and approved the final manuscript.
Acknowledgements
This research was supported by the RGC General Research Fund “PolyU
5118/11E: Clinical Decision Support using Biomedical Ontology and
Literature Supported Patient Similarity for Diagnostic and Prognostic
Pattern Discovery from Electronic Health Records”.
Author details
¹ Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong.
² Institute of Mechanical and Manufacturing Engineering, School of Engineering, Cardiff University, Cardiff CF24 3AA, UK.
³ Department of Diagnostic Radiology, University of Hong Kong, Pokfulam, Hong Kong.
⁴ Informatics Institute and Department of Computer Science, University of Missouri, Columbia, MO, USA.
Received: 24 December 2014 Accepted: 22 May 2015
References
1. Jensen PB, Jensen LJ, Brunak S. Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet. 2012;13(6):395–405.
2. Ceusters W, Smith B. Strategies for referent tracking in electronic health records. J Biomed Inform. 2006;39:362–78.
3. Chan LWC, Benzie IFF, Liu Y, et al.: Is the inter-patient coincidence of a
subclinical disorder related to EHR similarity? 2011 IEEE 13th International
Conference on e-Health Networking, Applications and Services 2011:177–180
doi:10.1109/HEALTH.2011.6026738.
4. Sánchez D, Batet M, Isern D, Valls A. Ontology-based semantic similarity: a new feature-based approach. Expert Systems With Applications. 2012;39(9):7718–28.
5. Batet M, Sánchez D, Valls A. An ontology-based measure to compute semantic similarity in biomedicine. J Biomed Inform. 2011;44:118–25.
6. Richesson RL, Andrew JE, Krischer JP. Use of SNOMED CT to represent clinical research data: a semantic characterization of data items on case report forms in vasculitis research. J Am Med Inform Assoc. 2006;13(5):536–46.
7. Melton GB, Parsons S, Morrison FP, Rothschild AS, Markatou M, Hripcsak G.
Inter-patient distance metrics using SNOMED CT defining relationships.
J Biomed Inform. 2006;39(6):697–705.
8. Pedersen T, Pakhomov SVS, Patwardhan S, Chute CG. Measures of semantic
similarity and relatedness in the biomedical domain. J Biomed Inform.
2007;40(3):288–99.
9. Wasserman H, Wang J. An applied evaluation of SNOMED CT as a clinical
vocabulary for the computerized diagnosis and problem list. AMIA
Symposium. 2003;699–703.
10. Lieberman MI, Ricciardi TN, Masarie FE, Spackman KA. The use of SNOMED
CT simplifies querying of a clinical data warehouse. AMIA Symposium.
2003;910.
11. Lord PW, Stevens RD, Brass A, Goble CA. Investigating semantic similarity
measures across the gene ontology: the relationship between sequence
and annotation. Bioinformatics. 2003;19(10):1275–83.
12. Chan LWC, Liu Y, Shyu CR, Benzie IFF. A SNOMED supported ontological
vector model for subclinical disorder detection using EHR similarity. Eng
Appl Artif Intell. 2011;24:1398–409.
13. Falda M, Toppo S, Pescarolo A, Lavezzo E, Camillo BD, Facchinetti A, et al.
Argot2: a large scale function prediction tool relying on semantic similarity
of weighted Gene Ontology terms. BMC Bioinformatics. 2012;13:1–9.
14. Pesquita C, Faria D, Falcao AO, Lord P, Couto FM. Semantic similarity in biomedical ontologies. PLoS Comput Biol. 2009;5(7):e1000443.
15. Page AJ, Cosgrove DC, Philosophe B, Pawlik TM. Hepatocellular carcinoma:
diagnosis, management, and prognosis. Surg Oncol Clin N Am.
2014;23(2):289–311.
16. Kamel IR, Liapi E, Fishman EK. Multidetector CT of hepatocellular carcinoma.
Best Pract Res Clin Gastroenterol. 2005;19(1):63–89.
17. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36.
18. Hanley JA, McNeil BJ. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology. 1983;148:839–43.
19. Sakamoto M. Early HCC: diagnosis and molecular markers. J Gastroenterol.
2009;44:108–11.
20. Zhou Z, Wang Y, Gu J. A new model of information content for semantic
similarity in WordNet. Second International Conference on Future
Generation Communication and Networking Symposia. 2008;2008:85–9.
21. Gottlieb A, Stein GY, Ruppin E, Altman RB, Sharan R. A method for inferring
medical diagnoses from patient similarities. BMC Medicine. 2013;11:194.