
PubMed-supported clinical term weighting approach for improving inter-patient similarity measure in diagnosis prediction

RESEARCH ARTICLE Open Access
Lawrence WC Chan¹*, Ying Liu², Tao Chan³, Helen KW Law¹, SC Cesar Wong¹, Andy PH Yeung¹, KF Lo¹, SW Yeung¹,
KY Kwok¹, William YL Chan¹, Thomas YH Lau¹ and Chi-Ren Shyu⁴
Abstract
Background: Similarity-based retrieval of Electronic Health Records (EHRs) from large clinical information systems
gives physicians evidence to support making diagnoses or referring suspected cases for examination.
Clinical terms in EHRs represent high-level conceptual information, and a similarity measure based
on these terms reflects the chance of inter-patient disease co-occurrence. The assumption that clinical terms are
equally relevant to a disease is unrealistic and reduces prediction accuracy. Here we propose a term weighting
approach supported by the PubMed search engine to address this issue.
Methods: We collected and studied 112 abdominal computed tomography imaging examination reports from four
hospitals in Hong Kong. Clinical terms, which are the image findings related to hepatocellular carcinoma (HCC),
were extracted from the reports. Through two systematic PubMed search methods, the generic and specific term
weightings were established by estimating the conditional probabilities of clinical terms given HCC. Each report
was characterized by an ontological feature vector, giving 6216 vector pairs in total. We optimized the
modified direction cosine (mDC) with respect to a regularization constant embedded into the feature vector.
Equal, generic and specific term weighting approaches were applied to measure the similarity of each pair and
their performances for predicting inter-patient co-occurrence of HCC diagnoses were compared by using Receiver
Operating Characteristics (ROC) analysis.
Results: The areas under the ROC curves (AUROCs) of similarity scores based on equal, generic and specific term weighting
approaches were 0.735, 0.728 and 0.743, respectively (p < 0.01). In comparison with equal term weighting, the
performance was significantly improved by specific term weighting (p < 0.01) but not by generic term weighting. The
clinical terms "Dysplastic nodule", "nodule of liver" and "equal density (isodense) lesion" were found to be the top three
image findings associated with HCC in PubMed.
Conclusions: Our findings suggest that the optimized similarity measure with specific term weighting applied to EHRs can
significantly improve the accuracy of predicting the inter-patient co-occurrence of diagnosis when compared with
equal and generic term weighting approaches.
* Correspondence: wing.chi.chan@polyu.edu.hk
¹Department of Health Technology and Informatics, Hong Kong Polytechnic
University, Hung Hom, Kowloon, Hong Kong
Full list of author information is available at the end of the article
© 2015 Chan et al.; licensee BioMed Central. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain
Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article,
unless otherwise stated.
Chan et al. BMC Medical Informatics and Decision Making (2015) 15:43
DOI 10.1186/s12911-015-0166-2
Background
The huge amount of clinical data managed by electronic health record (EHR) systems
enables case-based decision support, in which reference cases are retrieved
based on their similarity to the current case of interest [1, 2]. To measure
inter-patient similarity consistently, the feature vector model has been established
by systematically transforming the clinical information of EHRs, including
laboratory test findings, medical images and diagnostic reports, into vector
elements [3–6].
The transformation of textual information, such as image findings, into a feature
vector requires the support of a medical ontology [5, 6]. Systematized Nomenclature
of Medicine (SNOMED) Clinical Terms (CT) is a collection of clinical terms that are
organized as concepts and linked in a hierarchy with "is-a" or "inverse is-a"
relationships [7–10]. Concepts at a particular level of the hierarchical structure
are selected as the feature concepts. The edge count along the path connecting
a term in an EHR and a feature concept in the "is-a" hierarchy represents their
semantic distance [3–5, 11, 12]. The ontological feature vector contains numerical
elements, each of which is inferred by integrating the semantic distances from all
the EHR terms to a feature concept. It has been shown that the ontological vector
model significantly outperforms simple string matching in predicting inter-patient
co-occurrence of subclinical disorder [12].
Euclidean distance and direction cosine are two commonly used similarity measures,
but they have different properties. Direction cosine measures similarity according
to the angle between two feature vectors only, whereas Euclidean distance considers
the magnitudes of the two vectors in addition to the angle. Euclidean distance is
therefore more sensitive than direction cosine to the absolute difference between
two EHRs. For high-dimensional vector models, the two measures achieve similar
accuracy in nearest neighbour queries. However, direction cosine is more
computationally efficient than Euclidean distance because, in information retrieval
applications, the ontological vectors usually have a large number of zero elements,
which expedites the computation of direction cosine. Identifying similar
examination reports for diagnosis prediction requires an exhaustive search of the
imaging examination database. As the database is assumed to host a huge number
of eligible reports, the efficiency of computing the similarity score between an
eligible report and the query report becomes crucial.
The modified direction cosine (mDC) was developed by Chan et al. (2011) to preserve
the advantageous properties of both Euclidean distance and direction cosine and to
extend the applications to low-dimensional vector models [12]. In mDC, the feature
vector is augmented by a regularization constant of unity to acquire the property
of Euclidean distance while maintaining the computational efficiency of direction
cosine [12]. The numerical overflow that can occur for direction cosine is avoided
because, with the regularization constant included, the length of the feature
vector is never close to zero. However, it remains an open question whether the
performance of mDC can be optimized over different values of this regularization
constant.
The feature concepts of the above-mentioned vector model were equally weighted.
In fact, clinical terms are unequally associated with a particular disease. For
example, hepatic necrosis and cirrhosis are common image findings in the computed
tomographic scan of HCC patients. However, hepatic necrosis is more spatially
associated with the cell death phenomenon in the simultaneous growth of HCC than
cirrhosis, which reveals a fibrotic condition following cell death in HCC.
Thus, term weighting, which has been well established in bioinformatics, should be
applied to improve the accuracy of the semantic measure or to remove unrelated
terms [13, 14].
The disease of interest in this work is HCC, one of the ten most common cancers
in the world [15]. Abdominal tomographic scanning plays an important role in
the diagnosis of HCC because the images can show patterns characterizing the
pathophysiology of HCC [16]. Such patterns, once observed by radiologists, are
recorded as findings in the image examination report. In this work, two novel
weighting approaches, namely generic term weighting and specific term weighting,
are proposed to improve the performance of the ontological vector model in
predicting inter-patient HCC co-occurrence. The performances of these two
approaches were compared with the baseline approach of equal term weighting, in
which all feature concepts were equally weighted and the independent
(regularization) constant had already been optimized.
The generic and specific term weighting approaches were implemented based on a
systematic search of PubMed, a huge database indexing biomedical journal
articles. We assumed that a term is highly related to a particular disease if
the chance of co-mentioning the term, the disease of interest and their synonyms
in the abstracts of the articles is high. The highly weighted terms identified by
this work can also be used to index the reports, reminding clinicians to follow
up with other clinical tests.
Methods
Clinical data collection
Under the criterion that the liver is the region of interest for HCC, 112 image
reports of abdominal computed tomography examinations were collected
retrospectively from the Radiology Departments of four local hospitals in
Hong Kong. HCC or liver metastases were reported in 59 cases and no abnormality
detected (NAD) in the other 53 reports. The patients were aged from 4 to 88 at
the time of data retrieval. The patients were de-identified by using a randomly
generated unique ID. Personal information, including name, identity card number,
telephone number and address, was removed from the reports by third-party clinical
personnel before the data were collected by the research team. Human Subject
Ethics Approval was obtained from the Hong Kong Polytechnic University
(HSEARS20140710002).
Image finding term extraction
The clinical terms of image findings related to HCC were identified and extracted
manually from the reports by five practicing radiographers (Authors: APHY, KFL,
SWY, KYK, WYLC). They learnt the structure, content and use of SNOMED CT from the
Unified Medical Language System (UMLS). The extraction of clinical terms was
supervised and validated by a radiologist (Author: TC) and two professorial staff
with anatomy and radiography backgrounds (Authors: HKWL, TYHL).
The definitions of clinical terms and their synonyms are standardized by SNOMED CT
and unified to concepts by UMLS Terminology Services (license code:
NLM-0315126310), where a unique concept ID is assigned to each concept. The
concepts for all the extracted image finding terms were identified. For the equal
and generic term weighting approaches, the identified concepts were mapped to the
corresponding feature concepts according to SNOMED CT, and the mapped feature
concepts and their synonyms were used for weighting. For the specific term
weighting approach, the extracted terms and their synonyms were used for
weighting.
Ontological vector model
The relationship between concepts is defined by the is-a
hierarchical tree of SNOMED CT, which consists of levels
of concepts. As the concepts of the extracted terms exist at different levels,
the reports can be compared consistently if the extracted terms are projected to
the concepts at a particular level, which are referred to as feature concepts.
The level-4 concepts were chosen as feature concepts in
this work because level-4 provides an optimal classifica-
tion granularity for accurate patient matching [12].
Let $f_i$, $m$, $d_j$ and $n$ be the $i$th feature concept, the number of feature
concepts, the $j$th concept extracted from a report and the number of concepts
extracted from the report, respectively. The semantic distance between $f_i$ and
$d_j$ is defined as $s_{ij} \in [0, \infty]$. The value of $s_{ij}$ is determined
subject to three rules (see Fig. 1).
1. If $d_j$ is a descendant of $f_i$, then $s_{ij}$ is the number of is-a links from $d_j$ to $f_i$.
2. If $d_j$ is not a descendant of $f_i$, then $s_{ij} = \infty$.
3. If $d_j$ is the same as $f_i$, then $s_{ij} = 0$.
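The three rules can be sketched as a walk up the is-a hierarchy. The snippet below is only an illustration under simplifying assumptions: each concept is given a single parent (SNOMED CT allows several), the `IS_A_PARENT` map and the function name `semantic_distance` are hypothetical, and the intermediate `node_*` concepts are placeholders chosen so that the distances match the examples in Fig. 1.

```python
import math

# Hypothetical single-parent fragment of an is-a hierarchy (child -> parent).
# The named concepts follow Fig. 1; the "node_*" intermediates are placeholders.
IS_A_PARENT = {
    "cirrhosis": "node_a",
    "node_a": "node_b",
    "node_b": "liver finding",
    "hepatic fibrosis": "node_c",
    "node_c": "liver finding",
    "splenomegaly": "node_d",
    "node_d": "abdominal organ finding",
}


def semantic_distance(term, feature, parent=IS_A_PARENT):
    """Return s_ij: the number of is-a links from `term` up to `feature`."""
    links = 0
    node = term
    while node is not None:
        if node == feature:      # rule 3 when links == 0, rule 1 otherwise
            return links
        node = parent.get(node)  # climb one is-a link; None past the top
        links += 1
    return math.inf              # rule 2: `feature` is not an ancestor
```

With this toy hierarchy, `semantic_distance("cirrhosis", "liver finding")` counts three links, while a term that never reaches the feature concept yields `math.inf`.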
For each report, a feature vector, given by $[a_1, a_2, a_3, \dots, a_m, \delta]$,
was generated. $\delta$ represents a regularization constant whose value is equal
to $10^{-k}$, where $k$ is a non-negative integer; $a_i \in [0, 1]$ represents a
vector element associated with the $i$th feature concept and is obtained by the
following formula.
$$a_i = \frac{\sqrt{p_i}}{1 + \min_{j=1,\dots,n} s_{ij}} \qquad (1)$$
where $p_i$ is the conditional probability of the $i$th feature concept given the
occurrence of HCC. The value of $a_i$ indicates the relatedness of the $i$th
feature concept to the image finding terms of a report. The ability of the
feature vector to characterize a report can be modulated by $p_i$. When the value
of $p_i$ is zero, the effect of the $i$th feature concept on the similarity score
is fully repressed. When the value of $p_i$ is one, the effect of the $i$th
feature concept on the similarity score is fully promoted.
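As a minimal sketch of Eq. (1), the functions below build one feature vector from the weights and the semantic distances. The names `feature_element` and `feature_vector` are hypothetical, and treating a report with no extracted terms as giving an element of 0 is an assumption made here, since the source does not state how the minimum over an empty set is handled.

```python
import math


def feature_element(p_i, distances):
    """Eq. (1): sqrt(p_i) / (1 + min_j s_ij) for one feature concept."""
    if not distances:            # assumption: no extracted terms -> element 0
        return 0.0
    # An infinite minimum distance (no descendant term) gives exactly 0.0.
    return math.sqrt(p_i) / (1.0 + min(distances))


def feature_vector(p, s, k):
    """Build [a_1, ..., a_m, delta] with delta = 10**-k.

    p: weights p_i for the m feature concepts.
    s: m rows, row i holding the distances s_ij to the n extracted terms.
    """
    elements = [feature_element(p_i, row) for p_i, row in zip(p, s)]
    return elements + [10.0 ** -k]
```

For example, with equal weighting ($p_i = 1$), a feature concept whose nearest extracted term is three is-a links away receives the element 1 / (1 + 3) = 0.25.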
Similarity measure
The similarity score between two reports was calculated by using the direction
cosine of their feature vectors, $Q$ and $D$.
$$\mathrm{sim}(Q, D) = \frac{Q \cdot D}{|Q|\,|D|} \qquad (2)$$
where $\cdot$ is the inner product of two vectors and $|x|$ is the length of a
vector $x$. The similarity score ranges from 0 to 1. As the similarity score
tends to 0, the vectors $Q$ and $D$ are more dissimilar to each other. As it
tends to 1, they are more similar to each other. To improve the inter-patient
similarity measure for HCC co-occurrence prediction, this work aims to establish
a PubMed-supported approach for estimating the conditional probability $p_i$ more
precisely. The implementation of the inter-patient HCC co-occurrence prediction
is illustrated in Fig. 2.
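Eq. (2) can be written in a few lines of plain Python. This is a sketch rather than the authors' implementation: the function name `direction_cosine` is hypothetical, and it assumes both vectors already carry the regularization constant as their last element, so neither length is zero.

```python
import math


def direction_cosine(q, d):
    """Eq. (2): inner product of q and d divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(q, d))
    length_q = math.sqrt(sum(x * x for x in q))
    length_d = math.sqrt(sum(y * y for y in d))
    return dot / (length_q * length_d)


# Two reports sharing one feature element, each padded with delta = 0.01.
score = direction_cosine([0.25, 0.0, 0.01], [0.25, 0.5, 0.01])
```

Because every element is non-negative, the score stays within the range 0 to 1 stated above.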
Optimization of similarity measure
The similarity measure was optimized by determining its maximum performance in
predicting HCC co-occurrence among different values of $k$. Equal term weighting
(i.e. $p_i = 1$) is considered as the baseline for the optimization. The choice
of $k$ is of crucial importance when the extracted terms are particularly few or
even none. If $k$ tends to infinity ($\delta \to 0$), the similarity score will be
unstable and probably undefined due to the tiny magnitude of the feature vector.
If $k$ is equal to 0, the similarity score will be dominated by the value of
$\delta$ irrespective of the report's content. The accuracy of the similarity
measure in predicting HCC co-occurrence was plotted against the value of $k$. We
determined the value of $k$ at which the accuracy attains its maximum according
to the trend of the plot. Besides equal term weighting, the optimal value of $k$
was also applied to the feature vector for the generic and specific term
weighting approaches.
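A short worked case may help to see the two extremes. Assume, purely for illustration, a report $Q = [a_1, \dots, a_m, \delta]$ with at least one non-zero element and a report $D = [0, \dots, 0, \delta]$ with no extracted terms; the symbol $\|a\|$ for the length of the non-constant part of $Q$ is introduced here and is not in the original text.

$$\mathrm{sim}(Q, D) = \frac{\delta^2}{\sqrt{\|a\|^2 + \delta^2}\;\delta} = \frac{\delta}{\sqrt{\|a\|^2 + \delta^2}}$$

With $k = 0$ ($\delta = 1$) and, say, $\|a\| = 0.25$, the score is $1/\sqrt{1.0625} \approx 0.970$, so a report with findings and a report without any look almost identical. As $\delta \to 0$ the same score tends to 0, which separates the two reports. For two reports without extracted terms the score is $\delta^2/\delta^2 = 1$ for any $\delta > 0$, but becomes the undefined form $0/0$ if $\delta$ is exactly zero.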
Generic term weighting
According to equation (1), the feature concepts are weighted by a panel of $p_i$,
which is defined as the conditional probability of the $i$th feature concept
given HCC. A literature search was performed by using PubMed, and the numbers of
abstracts listed in the search results were used for the estimation of $p_i$.
Generic term weighting was implemented by directly applying the following
formula.
$$p_{i,\mathrm{generic}} = \frac{\#\text{ of abstracts containing } [(A \text{ OR } A'_1 \text{ OR } A'_2 \text{ OR } \dots) \text{ AND } (B \text{ OR } B'_1 \text{ OR } B'_2 \text{ OR } \dots)]}{\#\text{ of abstracts containing } (B \text{ OR } B'_1 \text{ OR } B'_2 \text{ OR } \dots)} \qquad (3)$$
Fig. 1 - Projection of image finding terms to feature concepts in SNOMED CT is-a hierarchy. Part of the is-a hierarchical relationships is illustrated with
three examples demonstrating the rules to determine the semantic distances. Four image finding terms: cirrhosis, hepatic fibrosis, splenomegaly and
fatty liver are considered. The level-4 concepts are regarded as feature concepts. In this case, feature concepts: liver finding, abdominal organ finding
and fatty liver are involved. (a) The term cirrhosis at level-7 is the descendant of liver finding. Their semantic distance is 3 because there are three is-a
links between them. (b) The semantic distance between hepatic fibrosis and liver finding is 2. (c) The term splenomegaly is not a descendant of liver
finding but the descendant of abdominal organ finding. Thus, the semantic distance between splenomegaly and liver finding is infinity and that
with abdominal organ finding is 2. Finally, the term fatty liver at level 4 is also a feature concept and the semantic distance is 0
where $A$ and $A'_n$ are the $i$th feature concept and its $n$th synonym,
respectively; $B$ and $B'_n$ represent HCC and its $n$th synonym, respectively.
Using this approach, the calculated weights of the feature concepts were the same
among different reports, although their descendent terms extracted from the
reports are different.
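The ratio in Eq. (3) needs two abstract counts: one for the combined query and one for the disease query alone. The sketch below only assembles the Boolean query text and divides two counts that are supplied by hand; it does not call PubMed. All function names, the synonym lists and the counts are hypothetical and serve only to show the shape of the calculation.

```python
def or_group(terms):
    """Join a term and its synonyms into one parenthesised, quoted OR group."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def generic_query(concept_terms, disease_terms):
    """Numerator query of Eq. (3): (A OR A'...) AND (B OR B'...)."""
    return or_group(concept_terms) + " AND " + or_group(disease_terms)


def conditional_probability(n_joint, n_disease):
    """Ratio of the joint abstract count to the disease-only abstract count."""
    if n_disease <= 0:
        raise ValueError("the disease query returned no abstracts")
    return n_joint / n_disease


# Hypothetical synonym lists and counts, for illustration only.
query = generic_query(["liver finding"], ["hepatocellular carcinoma", "HCC"])
p_generic = conditional_probability(n_joint=50, n_disease=200)
```

Keeping the query construction separate from the division makes it easy to log the exact search string alongside each count, which matters because the counts change as PubMed grows.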
Specific term weighting
In the specific term weighting approach, we searched PubMed for the abstracts
containing the extracted terms and HCC. The conditional probability of the $m$th
extracted term given HCC, $q_m$, was estimated by the following formula.
$$q_m = \frac{\#\text{ of abstracts containing } [(C \text{ OR } C'_1 \text{ OR } C'_2 \text{ OR } \dots) \text{ AND } (B \text{ OR } B'_1 \text{ OR } B'_2 \text{ OR } \dots)]}{\#\text{ of abstracts containing } (B \text{ OR } B'_1 \text{ OR } B'_2 \text{ OR } \dots)} \qquad (4)$$
where $C$ and $C'_n$ are the $m$th extracted term and its $n$th synonym,
respectively; $B$ and $B'_n$ represent HCC and its $n$th synonym, respectively.
We assumed that the conditional probability of the $i$th feature concept given
HCC is equal to the average of the conditional probabilities, given HCC, of its
$N$ descendent terms extracted from a report. The value of $p_i$ was calculated
by the following formula.
$$p_{i,\mathrm{specific}} = \frac{q_1 + q_2 + q_3 + \dots + q_N}{N} \qquad (5)$$
Note that the weighting of the feature vector elements depends on the report
content. In contrast to generic term weighting, where the weights do not change
across reports, the weights of the same feature concept estimated by the specific
term weighting approach may differ from patient to patient.
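Eq. (5) is a plain average, sketched below with the two Table 2 values for cirrhosis of liver (0.170) and hepatic fibrosis (0.082), both of which Fig. 1 places under the feature concept liver finding. The function name `specific_weight` is hypothetical, and returning 0 for a feature concept with no descendent term in the report is an assumption made here.

```python
def specific_weight(q_values):
    """Eq. (5): mean of q_m over the N descendent terms found in one report."""
    if not q_values:             # assumption: no descendent term -> weight 0
        return 0.0
    return sum(q_values) / len(q_values)


# A report mentioning both cirrhosis of liver and hepatic fibrosis.
p_both = specific_weight([0.170, 0.082])
# A report mentioning cirrhosis of liver only.
p_one = specific_weight([0.170])
```

The two reports give different weights to the same feature concept (about 0.126 versus 0.170), which is the report dependence described above.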
Statistical analysis
The receiver operating characteristic (ROC) analysis was performed on the results
of inter-patient HCC co-occurrence predicted by the equal, generic and specific
term weighting approaches. For each approach, the ROC curve was plotted and the
area under the ROC curve (AUROC) indicated the accuracy of the prediction, i.e.
the probability of correctly classifying a pair of reports into the same
diagnosis (both are HCC; both are NAD) or different diagnoses (one is HCC and the
other is NAD).
In addition to the comparison with the area under the
chance diagonal, AUROCs were compared with each
other to determine an approach with the best performance
Fig. 2 - A schematic view of the method. Step 1: Manual extraction of the image finding terms and their corresponding synonyms from the
reports. Step 2: The concepts of the image finding terms defined in SNOMED CT were identified by using UMLS Terminology Services. Step 3:
Edge counting of the semantic distances between the extracted terms and the level-4 feature concepts. Step 4: The feature concepts are
weighted by (Step 4a) generic term weighting approach and (Step 4b) specific term weighting approach. Step 5: The feature vectors are generated.
Step 6: Similarity scores between feature vectors are calculated by modified direction cosine
and the statistical significance of the observed differences was also indicated
[17, 18].
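One common way to obtain an AUROC from pairwise similarity scores is the rank-based (Mann-Whitney) estimate: the fraction of (match, mismatch) combinations in which the matching pair has the higher score, with ties counted as one half. The sketch below uses that estimate as a stand-in; it is not the comparison procedure of [17, 18], and the function name `auroc` and the toy scores are hypothetical.

```python
def auroc(scores, labels):
    """Rank-based AUROC: P(score of a match > score of a mismatch), ties = 0.5.

    scores: similarity score of each report pair.
    labels: 1 for a match (same diagnosis), 0 for a mismatch.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both matches and mismatches are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: one matching pair scores below one mismatching pair.
toy = auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

In the toy example three of the four (match, mismatch) comparisons are ordered correctly, giving 0.75. The double loop is quadratic, which is still manageable for a few thousand pairs per class.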
Results
Feature extraction and report pair formation
We extracted 38 image finding terms from 112 examin-
ation reports (59 HCC and 53 NAD cases). These terms
are uniquely defined by 38 concepts in UMLS and were
projected to 36 feature concepts at level-4 of SNOMED
CT is-a hierarchy. The reports were paired up to form
6216 non-redundant pairs, in which 3089 pairs are
matches, i.e. (HCC,HCC) or (NAD,NAD), and 3127 pairs
are mismatches, i.e. (HCC,NAD) or (NAD,HCC).
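The pair counts follow directly from the case numbers and can be checked with a few lines of arithmetic. The variable names below are illustrative only.

```python
from math import comb

n_hcc, n_nad = 59, 53                             # case counts from the reports

total_pairs = comb(n_hcc + n_nad, 2)              # all unordered pairs of 112 reports
matches = comb(n_hcc, 2) + comb(n_nad, 2)         # (HCC,HCC) plus (NAD,NAD)
mismatches = n_hcc * n_nad                        # one HCC report with one NAD report

print(total_pairs, matches, mismatches)
```

The matches (1711 + 1378 = 3089) and mismatches (59 × 53 = 3127) add up to the 6216 pairs reported above.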
Optimization of similarity measure
Equal term weighting was considered as baseline for op-
timizing the similarity measure. ROC analysis of inter-
patient HCC co-occurrence prediction was performed
for different values of k. Fig. 3 shows the plot of AUROC
against k. It was found that the accuracy increases for k
between 0 and 2. For k > 2, the AUROC reaches a con-
stant level. Thus, we chose k = 10 for all the term weight-
ing approaches.
Estimation of conditional probabilities
In generic term weighting, abstracts were retrieved by
PubMed search for each feature concept and its syno-
nyms. The count of abstracts containing a feature concept
or its synonyms ranges from 1 to 427154. By incorporat-
ing HCC and its synonyms to the search criteria, the
abstract count was further reduced. The conditional prob-
ability of a feature concept for generic term weighting is
defined as the ratio of these two counts.
In specific term weighting, abstracts were retrieved by
PubMed search for each extracted term and its synonyms.
The count of abstracts containing a feature concept
or its synonyms ranges from 1 to 195708. The
abstract count is further reduced by adding HCC and its
synonyms to the search criteria. The ratio of these two
counts was projected to the corresponding feature con-
cepts. The conditional probability of a feature concept
for specific term weighting is defined as the average of
the ratios across all of its descendent terms extracted
from a report. The values of conditional probabilities
were computed and saved in Excel files (See Additional
files 1, 2, 3, 4).
Comparison of term weighting approaches
The AUROCs and the 95 % confidence intervals (95 % CIs) of the equal, generic and
specific term weighting approaches are shown in Table 1. All three approaches
significantly outperformed the random rater (p < 0.01). When compared with the
equal term weighting approach (AUROC = 0.735), the performance was significantly
improved by the specific term weighting approach (AUROC = 0.743, p < 0.01) but
was significantly worsened by the generic term weighting approach
(AUROC = 0.728, p < 0.01). The conditional probabilities of the extracted image
finding terms given HCC, derived by the specific term weighting approach, were
sorted in descending order. The top ten image finding terms are listed together
with their conditional probabilities in Table 2.
Discussion
Health records that are ontologically similar to a new suspected case support
clinical decisions with evidence of the disease. The reliability of such an
ontology-similarity-based case retrieval algorithm depends on the choices of
inter-patient similarity measure and ontological vector model. It has been shown
that the modified direction cosine (mDC) avoids the problem of numerical overflow
and preserves the same properties as Euclidean distance [12]. However, weighting
of the ontological vector was not considered in the previous studies, and it
remained unknown whether the performance of mDC can be improved by adjusting the
weights associated with the feature concepts. It was shown that the performance
of the similarity measure was substantially improved by setting an extremely
small regularization constant, $10^{-10}$. Such a setting helps keep the
similarity scores discriminative when comparing health records that have very few
or even no extracted clinical terms.
Fig. 3 - Plot of AUROC against the value of k. The accuracy of
inter-patient HCC co-occurrence prediction increases when k is between
0 and 2 and saturates at the level of 0.735 when k further increases
Table 1 - Comparison of term weighting approaches. AUROCs
and the 95 % CIs of the equal, generic and specific term
weighting approaches are summarized here
Term weighting approach     AUROC   95 % CI
Equal term weighting        0.735   (0.724, 0.746)
Generic term weighting      0.728   (0.717, 0.739)
Specific term weighting     0.743   (0.732, 0.754)
In generic term weighting, the median and the 10th
percentile of the counts of retrieved abstracts containing
the feature concepts or their synonyms are 4204 and
82.9, respectively. In specific term weighting, the median
and the 10th percentile of the counts of retrieved abstracts
containing the extracted terms or their synonyms
are 2689.5 and 46.5, respectively. The sample sizes of the
retrieved abstracts are large enough to support the esti-
mation of conditional probabilities.
In comparison with equal term weighting, the performance was improved by the
specific term weighting approach but worsened by the generic term weighting
approach. This implies that weighting the feature vector elements does not
necessarily give better performance; the way in which the weights are derived is
crucial for improving performance. In generic term weighting, the feature
concepts at level 4, instead of the clinical terms extracted from the reports,
were used for the PubMed search. The weights are associated with the level-4
concepts only and remain unchanged across different reports. Moreover, the
level-4 concepts are not specific enough to provide reliable PubMed search
results for estimating the conditional probabilities. Specific term weighting
used the extracted terms directly for the PubMed search. The search results are
more reliable for estimating the conditional probabilities due to the finer
granularity of concepts provided by the extracted terms. Although the conditional
probabilities of the descendent extracted terms are averaged to generate the
weights of the feature vector elements, the keywords for the PubMed search depend
dynamically on the report contents, and the weights become more specific.
The high weights of feature concepts dominating the similarity score are
attributable to their descendent terms extracted from the reports. In Table 2,
the top three image finding terms (conditional probabilities) are "dysplastic
nodule" (0.934), "nodule of liver" (0.513) and "equal density (isodense) lesion"
(0.438). The association of these image findings with HCC is supported by
Sakamoto [19], who states that small equivocal lesions, i.e. dysplastic nodules,
detected by imaging examination of the liver are regarded as a precursor of HCC.
For cases with such an image finding but no abnormality detected, we suggest
indexing them as high risk so that close follow-up can be recommended to those
patients.
As the conditional probability of the most relevant image finding, "Dysplastic
nodule" (0.934), is more than ten times that of the eighth image finding,
"Hepatic fibrosis" (0.082), only the top seven features contribute significantly
to the diagnosis prediction performance. The features other than these top seven
are associated with negligible weights and have a negligible effect on the
prediction. Therefore, the generic and specific term weighting approaches are
analogous to feature selection, which makes the vector model more parsimonious
with respect to the number of available cases.
As the numbers of PubMed abstracts are dynamic, the term weighting results may
change over time. In our future studies, we suggest enhancing the ontological
vector model by incorporating more algorithmic elements from the information
content model, which spans an essential dimension of assessing semantic
similarity [20].
Besides the image examination report, laboratory test findings, such as
alpha-fetoprotein (AFP) level and Child-Pugh score, are important features for
the diagnosis of HCC. As the electronic health record (EHR) integrates the image
examination report and the laboratory test report, the feature vector can be
augmented to cover laboratory test findings [21]. The same weighting approach and
similarity measure can also be applied to such an augmented feature vector.
Our clinical term weighting approach does more than improve the inter-patient
similarity measure for diagnosis prediction. This method may also be used to
identify large cohorts of patients with similar disease presentation for
retrospective treatment efficacy analysis, and it may facilitate the
identification of targeted patient cohorts for prospective interventional
studies.
Conclusions
The performance of the inter-patient similarity measure was significantly
improved by specifically weighting the elements of the ontological feature
vector. PubMed search was applied to estimate the weights. Early HCC markers,
including dysplastic nodule, nodule of liver and equal density lesion, were
identified by PubMed search as image findings that are strongly associated with
HCC.
Table 2 - Top ten image finding terms. The PubMed search
results indicated that some image finding terms were
co-mentioned with HCC very frequently in the abstracts of
biomedical journal articles. The conditional probability of
Dysplastic nodule (0.934) is the highest among all the
extracted terms
Rank  Image finding                      Conditional probability
1     Dysplastic nodule                  0.934
2     Nodule of liver                    0.513
3     Equal density (isodense) lesion    0.438
4     Nodular hyperplasia of liver       0.329
5     Solitary necrotic liver nodule     0.259
6     Portal vein thrombosis             0.209
7     Space occupying lesion of liver    0.175
8     Cirrhosis of liver                 0.170
9     Hepatic fibrosis                   0.082
10    Nontraumatic hemoperitoneum        0.064
Additional files
Additional file 1: Generic Term Weighting Computation.
Additional file 2: Specific Term Weighting Computation.
Additional file 3: Similarity Scores with Generic Term Weighting.
Additional file 4: Similarity Scores with Specific Term Weighting.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
LWCC and YL made substantial contributions to the conception and design
of the study and prepared and revised the manuscript. APHY, KFL,
SWY, KYK and WYLC made substantial contributions to the acquisition,
image finding extraction, analysis and interpretation of data. TC, HKWL,
SCCW and TYHL contributed domain knowledge to approve the
image findings and their relationship to liver diseases. CRS contributed
to verifying the ontological method and critically revising the manuscript.
All authors read and approved the final manuscript.
Acknowledgements
This research was supported by the RGC General Research Fund PolyU
5118/11E: Clinical Decision Support using Biomedical Ontology and
Literature Supported Patient Similarity for Diagnostic and Prognostic
Pattern Discovery from Electronic Health Records.
Author details
1 Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong.
2 Institute of Mechanical and Manufacturing Engineering, School of Engineering, Cardiff University, Cardiff CF24 3AA, UK.
3 Department of Diagnostic Radiology, University of Hong Kong, Pokfulam, Hong Kong.
4 Informatics Institute and Department of Computer Science, University of Missouri, Columbia, MO, USA.
Received: 24 December 2014 Accepted: 22 May 2015
References
1. Peter BJ, Lars JJ, Søren B. Mining electronic health records: towards better
research applications and clinical care. Nat Rev Genet. 2012;13(6):395–405.
2. Ceuster W, Smith B. Strategies for referent tracking in electronic health
records. J Biomed Inform. 2006;39:362–78.
3. Chan LWC, Benzie IFF, Liu Y, et al. Is the inter-patient coincidence of a
subclinical disorder related to EHR similarity? 2011 IEEE 13th International
Conference on e-Health Networking, Applications and Services. 2011:177–180.
doi:10.1109/HEALTH.2011.6026738.
4. Sánchez D, Batet M, Isern D, Valls A. Ontology-based semantic similarity:
a new feature-based approach. Expert Systems With Applications.
2012;39(9):7718–28.
5. Batet M, Sánchez D, Aida V. An ontology-based measure to compute
semantic similarity in biomedicine. J Biomed Inform. 2011;44:118–25.
6. Richesson RL, Andrew JE, Krischer JP. Use of SNOMED CT to represent clinical
research data: a semantic characterization of data items on case report
forms in vasculitis research. J Am Med Inform Assoc. 2006;13(5):536–46.
7. Melton GB, Parsons S, Morrison FP, Rothschild AS, Markatou M, Hripcsak G.
Inter-patient distance metrics using SNOMED CT defining relationships.
J Biomed Inform. 2006;39(6):697–705.
8. Pedersen T, Pakhomov SVS, Patwardhan S, Chute CG. Measures of semantic
similarity and relatedness in the biomedical domain. J Biomed Inform.
2007;40(3):288–99.
9. Wasserman H, Wang J. An applied evaluation of SNOMED CT as a clinical
vocabulary for the computerized diagnosis and problem list. AMIA
Symposium. 2003;699–703.
10. Lieberman MI, Ricciardi TN, Masarie FE, Spackman KA. The use of SNOMED
CT simplifies querying of a clinical data warehouse. AMIA Symposium.
2003;910.
11. Lord PW, Stevens RD, Brass A, Goble CA. Investigating semantic similarity
measures across the gene ontology: the relationship between sequence
and annotation. Bioinformatics. 2003;19(10):1275–83.
12. Chan LWC, Liu Y, Shyu CR, Benzie IFF. A SNOMED supported ontological
vector model for subclinical disorder detection using EHR similarity. Eng
Appl Artif Intell. 2011;24:1398–409.
13. Falda M, Toppo S, Pescarolo A, Lavezzo E, Camillo BD, Facchinetti A, et al.
Argot2: a large scale function prediction tool relying on semantic similarity
of weighted Gene Ontology terms. BMC Bioinformatics. 2012;13:19.
14. Pesquita C, Faria D, Falcao AO, Lord P, Couto FM. Semantic similarity in
biomedical ontologies. PLoS Comput Biol. 2009;5(7):e1000443.
15. Page AJ, Cosgrove DC, Philosophe B, Pawlik TM. Hepatocellular carcinoma:
diagnosis, management, and prognosis. Surg Oncol Clin N Am.
2014;23(2):289–311.
16. Kamel IR, Liapi E, Fishman EK. Multidetector CT of hepatocellular carcinoma.
Best Pract Res Clin Gastroenterol. 2005;19(1):63–89.
17. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver
operating characteristic (ROC) curve. Radiology. 1982;143:29–36.
18. Hanley JA, McNeil BJ. A method of comparing the areas under receiver
operating characteristic curves derived from the same cases. Radiology.
1983;148:839–43.
19. Sakamoto M. Early HCC: diagnosis and molecular markers. J Gastroenterol.
2009;44:108–11.
20. Zhou Z, Wang Y, Gu J. A new model of information content for semantic
similarity in WordNet. Second International Conference on Future
Generation Communication and Networking Symposia. 2008;2008:85–9.
21. Gottlieb A, Stein GY, Ruppin E, Altman RB, Sharan R. A method for inferring
medical diagnoses from patient similarities. BMC Medicine. 2013;11:194.
... The diagnosis of HCC is frequently made using radiological modalities. This hepatic lesion is discovered during surveillance of a chronic liver disease (see Figure III.2.) 34 . 34. https://www.snfge.org/ ...
... Then, a comparison is carried out across all the reports to identify those that present HCC. This representation is adopted from the work of [34]. The proposed system ultimately generates 30 concepts representing liver cancer. ...
Thesis
Diagnosing hepatic lesions is a complex task, especially when the detected nodules are small. In such cases it becomes very difficult to determine their nature (benign or malignant tumor, type of lesion, etc.). In similar cases, clinical examinations must be repeated over several months to observe how the hepatic masses evolve. To better address these problems, computational solutions are needed to optimize the diagnosis of liver tumors. In the context of hepatic lesion classification, we developed a first ontological approach (OntHCC) to support the diagnosis, staging, and treatment selection of HCC (hepatocellular carcinoma) tumors. This approach is based on the analysis of MRI images of infected livers and on radiology reports. We then proposed a second ontological approach (MROnt) for modeling the medical information contained in radiology reports, in the context of diagnosing and monitoring liver tumors. Automatic detection of liver tumors requires a primary diagnostic process that necessarily uses medical images (for example MRI or CT). To this end, we integrated deep learning into the classification of contrast-enhanced MRI images. Later in the thesis, to increase the performance of the image classification process, we integrated semantic knowledge. The objective is to take advantage of the knowledge base offered by ontologies to describe medical images and provide information about the detected tumors (for example, type, size, and stage). In addition, our approach consists of developing a multi-label CNN to support the ontologies developed (OntHCC and MROnt).
We show the effectiveness of the approaches and prototypes proposed in this thesis through comparative numerical evaluations and case studies.
... For the new cases, similar cases retrieved from EHR database using the patterns provide clinicians with evidence of the feasible diagnostic and therapeutic options. The similarity search algorithm based on the ontological vector model has been successfully applied to similar radiological image report retrieval and similar radiotherapy treatment plan retrieval [22][23][24]. ...
... With the value between 0 and 1, $a_i$ indicates the relevance between the ith feature concept and a clinical term in a report. Such relevance can be modulated by the conditional probability, $p_i$, which is estimated by the specific term-weighting approach [22]. Indeed, a similarity measure derived from direction cosine represents the sum of the product of ontological features. ...
... Thus, edge count of "splenomegaly" with "abdominal organ finding" is 2 and that with "liver finding" is infinity. "Fatty liver" is a feature concept, and thus, the edge count with itself is 0. Diagram was extracted from [22]. ...
Article
Electronic Health Record (EHR) system enables clinical decision support. In this study, a set of 112 abdominal computed tomography imaging examination reports, consisting of 59 cases of hepatocellular carcinoma (HCC) or liver metastases (so-called HCC group for simplicity) and 53 cases with no abnormality detected (NAD group), were collected from four hospitals in Hong Kong. We extracted terms related to liver cancer from the reports and mapped them to ontological features using Systematized Nomenclature of Medicine (SNOMED) Clinical Terms (CT). The primary predictor panel was formed by these ontological features. Association levels between every two features in the HCC and NAD groups were quantified using Pearson’s correlation coefficient. The HCC group reveals a distinct association pattern that signifies liver cancer and provides clinical decision support for suspected cases, motivating the inclusion of new features to form the augmented predictor panel. Logistic regression analysis with stepwise forward procedure was applied to the primary and augmented predictor sets, respectively. The obtained model with the new features attained 84.7% sensitivity and 88.4% overall accuracy in distinguishing HCC from NAD cases, which were significantly improved when compared with that without the new features.
... In this way, patient similarity represents a paradigm shift that introduces disruptive innovation to optimize personalization of patient care. Some promising examples concern mental and behavioral disorders (Roque et al., 2011), infectious diseases, cancers (Wu et al., 2005; Teng et al., 2007; Chan et al., 2010, 2015; Klenk et al., 2010; Cho and Przytycka, 2013; Li et al., 2015; Wang, 2015; Bolouri et al., 2016; Wang et al., 2016), endocrine (Wang, 2015), and metabolic diseases (Zhang et al., 2014; Ng et al., 2015). Others involve diseases of the nervous system (Lieberman et al., 2005; Carreiro et al., 2013; Cho and Przytycka, 2013; Qian et al., 2014; Buske et al., 2015a; Li et al., 2015; Bolouri et al., 2016; Wang et al., 2016), eyes (Buske et al., 2015a; Li et al., 2015), skin (Buske et al., 2015a; Li et al., 2015), heart (Wu et al., 2005; Tsymbal et al., 2007; Syed and Guttag, 2011; Buske et al., 2015a; Li et al., 2015; Panahiazar et al., 2015a,b; Wang, 2015; Björnson et al., 2016), liver (Chan et al., 2015), intestines (Buske et al., 2015a), musculoskeletal system (Buske et al., 2015a), congenital malformations (Buske et al., 2015a), and various other conditions or factors influencing health status (Gotz et al., 2012; Subirats et al., 2012; Ng et al., 2015). ...
... 32,36 There are systems such as PSF that are evaluating novel metric learning approaches and those that are using external knowledge resources such as PubMed to improve patient similarity measures. 33,34 We also used Apache cTAKES 37 to extract more features from unstructured progress notes. However, since we are already achieving a very high average cross-validation F1 score and the unigram and bigram features that are also obtained from progress notes are not very predictive, we decided not to currently use them in further experiments. ...
... Accuracy Measures after feature selection. Discussion. This work is very relevant to an emerging direction in clinical informatics research focusing on developing patient similarity measures derived from EHR data for application in a variety of areas. [27][28][29][30][31][32][33][34][35][36] For example, Zhang et al used patient similarity and drug similarity for personalized medicine. 35 ...
Article
Rare diseases are very difficult to identify among large number of other possible diagnoses. Better availability of patient data and improvement in machine learning algorithms empower us to tackle this problem computationally. In this paper, we target one such rare disease - cardiac amyloidosis. We aim to automate the process of identifying potential cardiac amyloidosis patients with the help of machine learning algorithms and also learn most predictive factors. With the help of experienced cardiologists, we prepared a gold standard with 73 positive (cardiac amyloidosis) and 197 negative instances. We achieved high average cross-validation F1 score of 0.98 using an ensemble machine learning classifier. Some of the predictive variables were: Age and Diagnosis of cardiac arrest, chest pain, congestive heart failure, hypertension, prim open angle glaucoma, and shoulder arthritis. Further studies are needed to validate the accuracy of the system across an entire health system and its generalizability for other diseases.
... The multidimensional patient similarity [24] supervised approach reported an accuracy of 77%, followed by hierarchical and K-means with 73% and 71%, respectively. The optimized similarity measure [43] with specific term-weighting improved the accuracy (74.3%) associated with diagnosis prediction when compared with equal (73.5%) and generic term-weighting (72.8%) approaches. ...
Article
Precision medicine can be defined as the comparison of a new patient with existing patients that have similar characteristics and can be referred to as patient similarity. Several deep learning models have been used to build and apply patient similarity networks (PSNs). However, the challenges related to data heterogeneity and dimensionality make it difficult to use a single model to reduce data dimensionality and capture the features of diverse data types. In this paper, we propose a multi-model PSN that considers heterogeneous static and dynamic data. The combination of deep learning models and PSN allows ample clinical evidence and information extraction against which similar patients can be compared. We use the bidirectional encoder representations from transformers (BERT) to analyze the contextual data and generate word embedding, where semantic features are captured using a convolutional neural network (CNN). Dynamic data are analyzed using a long-short-term-memory (LSTM)-based autoencoder, which reduces data dimensionality and preserves the temporal features of the data. We propose a data fusion approach combining temporal and clinical narrative data to estimate patient similarity. The experiments we conducted proved that our model provides a higher classification accuracy in determining various patient health outcomes when compared with other traditional classification algorithms.
Article
A patient centric social network enables connecting patients suffering from the same disease or health conditions. The growth of such a network depends highly on the recommendations like ‘patient to patient’ and ‘caregivers to a patient’. From a patient’s point of view, discovering a person with similar conditions like him gives him some sort of solace thereby encouraging him to extend support to the other or lookout for support. In this paper, we have proposed a recommendation strategy for a group of patients in a social network, by deriving similarities in the unstructured clinical text found in their profiles. To carry out our task, we used physician notes of the MIMIC-III database, a publicly available large database comprising of de-identified health-related data as patient profiles. We computed the similarities between them and visualized possible social network graphs that resulted out of recommendations based on those similarities.
Article
Long planning time in volumetric-modulated arc stereotactic radiotherapy (VMA-SRT) cases can limit its clinical efficiency and use. A vector model could retrieve previously successful radiotherapy cases that share various common anatomic features with the current case. The present study aimed to develop a vector model that could reduce planning time by applying the optimization parameters from those retrieved reference cases. Thirty-six VMA-SRT cases of brain metastasis (gender, male [n = 23], female [n = 13]; age range, 32 to 81 years old) were collected and used as a reference database. Another 10 VMA-SRT cases were planned with both conventional optimization and vector-model-supported optimization, following the oncologists' clinical dose prescriptions. Planning time and plan quality measures were compared using the 2-sided paired Wilcoxon signed rank test with a significance level of 0.05, with positive false discovery rate (pFDR) of less than 0.05. With vector-model-supported optimization, there was a significant reduction in the median planning time, a 40% reduction from 3.7 to 2.2 hours (p = 0.002, pFDR = 0.032), and for the number of iterations, a 30% reduction from 8.5 to 6.0 (p = 0.006, pFDR = 0.047). The quality of plans from both approaches was comparable. From these preliminary results, vector-model-supported optimization can expedite the optimization of VMA-SRT for brain metastasis while maintaining plan quality.
Article
Clinical decision support systems assist physicians in interpreting complex patient data. However, they typically operate on a per-patient basis and do not exploit the extensive latent medical knowledge in electronic health records (EHRs). The emergence of large EHR systems offers the opportunity to integrate population information actively into these tools. Here, we assess the ability of a large corpus of electronic records to predict individual discharge diagnoses. We present a method that exploits similarities between patients along multiple dimensions to predict the eventual discharge diagnoses. Using demographic, initial blood and electrocardiography measurements, as well as medical history of hospitalized patients from two independent hospitals, we obtained high performance in cross-validation (area under the curve >0.88) and correctly predicted at least one diagnosis among the top ten predictions for more than 84% of the patients tested. Importantly, our method provides accurate predictions (>0.86 precision in cross validation) for major disease categories, including infectious and parasitic diseases, endocrine and metabolic diseases and diseases of the circulatory systems. Our performance applies to both chronic and acute diagnoses. Our results suggest that one can harness the wealth of population-based information embedded in electronic health records for patient-specific predictive tasks.
Article
Predicting protein function has become increasingly demanding in the era of next generation sequencing technology. The task to assign a curator-reviewed function to every single sequence is impracticable. Bioinformatics tools, easy to use and able to provide automatic and reliable annotations at a genomic scale, are necessary and urgent. In this scenario, the Gene Ontology has provided the means to standardize the annotation classification with a structured vocabulary which can be easily exploited by computational methods. Argot2 is a web-based function prediction tool able to annotate nucleic or protein sequences from small datasets up to entire genomes. It accepts as input a list of sequences in FASTA format, which are processed using BLAST and HMMER searches vs UniProtKB and Pfam databases respectively; these sequences are then annotated with GO terms retrieved from the UniProtKB-GOA database and the terms are weighted using the e-values from BLAST and HMMER. The weighted GO terms are processed according to both their semantic similarity relations described by the Gene Ontology and their associated score. The algorithm is based on the original idea developed in a previous tool called Argot. The entire engine has been completely rewritten to improve both accuracy and computational efficiency, thus allowing for the annotation of complete genomes. The revised algorithm has been already employed and successfully tested during in-house genome projects of grape and apple, and has proven to have a high precision and recall in all our benchmark conditions. It has also been successfully compared with Blast2GO, one of the methods most commonly employed for sequence annotation. The server is freely accessible at http://www.medcomp.medicina.unipd.it/Argot2.
Conference Paper
Electronic Health Records (EHRs) provide clinical evidence for identifying subclinical diseases and supporting decisions on early intervention. Simple string matching cannot link up the conceptually similar but verbally different clinical terms in patient records, limiting the usefulness of EHRs. A novel ontological similarity matching approach supported by the Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) is proposed in this paper. The disease terms of a patient record are transformed into a vector space so that each patient record can be characterized by a feature vector. The similarity between the new record and an existing database record was quantified by a kernel function of their feature vectors. The matches are ranked by their similarity scores. To evaluate the proposed matching approach, medical history and carotid ultrasonic imaging findings were collected from 47 subjects in Hong Kong. The dataset formed 1081 pairs of patient records and the ROC analysis was used to evaluate and compare the accuracy of the ontological similarity matching and the simple string matching against the presence or absence of carotid plaques identified in ultrasound examination. It was found that the simple string matching randomly rated the record pairs but the ontological similarity matching provided non-random rating.
Article
The successful management of hepatocellular carcinoma (HCC) requires a multidisciplinary approach, incorporating hepatologists, oncologists, surgical oncologists, transplant surgeons, and radiologists. With improvements in technology and better long-term outcomes data, management strategies for HCC have become more methodical and more successful. This article focuses on some of the most critical advances relating to carcinogenesis, surveillance, and management.
Article
Electronic Health Records (EHR) form a valuable resource in the healthcare enterprise because clinical evidence can be provided to identify potential complications and support decisions on early intervention. Simple string matching, the common search algorithm, is not able to map a query to the similar health records in the database with respect to the medical concepts. A novel ontological vector model supported by the Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) is proposed in this paper to project the disease terms of a health record to a feature space so that each health record can be characterized using a feature vector, giving a fingerprint of the record. The similarity between the query and database health records was measured by similarity measures of their feature vectors and string matching score respectively. Three types of similarity measures were considered in this study, namely, Euclidean distance (ED), direction cosine (DC) and modified direction cosine (mDC). Medical history and carotid ultrasonic imaging findings were collected from 47 subjects in Hong Kong. The dataset formed 1081 pairs of health records and ROC analysis was used to evaluate and compare the accuracy of the ontological vector model and simple string matching against the agreement of the presence or absence of carotid plaques identified by carotid ultrasound between two subjects. It was found that the score generated by simple string matching was a random rater but the ontological vector model was not. In other words, the degree of health record similarity based on the ontological vector model is associated with the agreement of atherosclerosis between two patients. The vector model using feature terms at the SNOMED-CT level 4 gave the best performance. The performance of mDC was very close to that of ED and DC but the properties of mDC make it more suitable for the retrieval of similar health records. 
It was also shown that the ontological vector model was enhanced by the support vector classifier approach.
Article
Clinical data describing the phenotypes and treatment of patients represents an underused data source that has much greater research potential than is currently realized. Mining of electronic health records (EHRs) has the potential for establishing new patient-stratification principles and for revealing unknown disease correlations. Integrating EHR data with genetic data will also give a finer understanding of genotype-phenotype relationships. However, a broad range of ethical, legal and technical reasons currently hinder the systematic deposition of these data in EHRs and their mining. Here, we consider the potential for furthering medical research and clinical care using EHR data and the challenges that must be overcome before this is a reality.
Conference Paper
Information Content (IC) is an important dimension of assessing the semantic similarity between two terms or word senses in word knowledge. The conventional method of obtaining IC of word senses is to combine knowledge of their hierarchical structure from an ontology like WordNet with actual usage in text as derived from a large corpus. In this paper, a new model of IC is presented, which relies on hierarchical structure alone. The model considers not only the hyponyms of each word sense but also its depth in the structure. The IC value is easier to calculate based on our model, and when used as the basis of a similarity approach it yields judgments that correlate more closely with human assessments than others, which using IC value obtained only considering the hyponyms and IC value got by employing corpus analysis.
Article
Hepatocellular carcinoma (HCC) is one of the most common malignant tumors. HCC occurs mainly in patients with chronic liver disease such as in hepatitis B and C infection. These high-risk patients are closely followed up, and increasing numbers of small equivocal lesions are detected by imaging diagnosis. They are now widely recognized as a precursor or early stage of HCC and are classified as dysplastic nodules or early HCC. It is considered that early HCC is a key step in the process of HCC development and progression. However, the molecular mechanisms of early hepatocarcinogenesis are far from clear. Specific mutations of classical oncogenes or tumor suppressor genes have not been identified in early HCC so far. Recent progress in comprehensive analysis of gene expression is shedding some light on this issue. It has been reported that HSP70, CAP2, glypican 3, and glutamine synthetase could serve as molecular markers for early HCC. Further analysis is expected to evaluate their usefulness in routine pathological diagnosis including biopsy diagnosis and also as serum markers for early detection of HCC.