Performance Analysis of Machine Learning Classifiers on
Improved Concept Vector Space Models
Zenun Kastrati, Ali Shariq Imran
Norwegian University of Science and Technology
Norway
Pre-Print Version - Published version available at https://doi.org/10.1016/j.future.2019.02.006
Abstract
This paper provides a comprehensive performance analysis of parametric and non-parametric machine learning classifiers, including a deep feed-forward multi-layer perceptron (MLP) network, on two variants of the improved Concept Vector Space (iCVS) model. In the first variant, a weighting scheme enhanced with the notion of concept importance is used to assess the weight of ontology concepts. Concept importance shows how important a concept is in an ontology, and it is computed automatically by converting the ontology into a graph and then applying one of the Markov-based algorithms. In the second variant of iCVS, concepts provided by the ontology and their semantically related terms are used to construct concept vectors in order to represent the document in a semantic vector space.
We conducted various experiments using a variety of machine learning classifiers for three different models of document representation. The first model is a baseline concept vector space (CVS) model that relies on an exact/partial match technique to represent a document in a vector space. The second and third models are iCVS models that employ an enhanced concept weighting scheme for assessing the weights of concepts (variant 1), and the acquisition of terms that are semantically related to concepts of the ontology for semantic document representation (variant 2), respectively. Additionally, a comparison between seven different classifiers is performed for all three models using precision, recall, and F1 score. Results for multiple configurations of the deep learning architecture are obtained by varying the number of hidden layers and nodes in each layer, and are compared to those obtained with conventional classifiers. The obtained results show that the classification performance is highly dependent upon the choice of a classifier, and that Random Forest, Gradient Boosting, and the Multilayer Perceptron are among the classifiers that performed rather well for all three models.
Keywords: document representation, CVS, iCVS, document classification, deep learning,
ontology
1. Introduction
The global Internet population reached 3.8 billion in 2017, up from 3.4 billion the year before, which is 47% of the world's population [1]. According to IBM [2], in 2013 the amount of data produced every day was 2.5 quintillion bytes, when there were only around 2.7 billion Internet users. The number is expected to grow in the coming years, which means that the amount of data produced will be tremendous. By 2020, it is estimated that around 1.7 MB of data will be created every second for every person on earth.
The penetration of the Internet of Things (IoT) and smart gadgets into households, and the huge amount of data produced every minute as a result, has created a need for better organization and structuring of the data, which according to [3] is mostly unstructured. Despite the computational resources available nowadays, organizing and structuring a tremendous amount of data is not a trivial task, and without it, finding and extracting useful information from massive Internet resources is a challenge [4]. Nearly 3.87 million Google searches are conducted every minute of the day [1]. Finding relevant information for every query from a plethora of resources is a challenging task. For text-based documents, ontologies can play a vital role in this regard [5].
An ontology is a data representation technique that not only helps better organize data but also helps categorize and classify data objects for easy search and retrieval. Many text document classification approaches widely employ ontologies to classify and organize text-based documents. A text document is generally represented by a vector space model [6]. A vector space model is a feature vector representation constructed from the terms/words occurring in a document and their corresponding weights. Each term denotes a dimension in the vector space and is independent of the other terms in the same document. This representation technique is based on string literals and fails to consider the order of words and the semantic relationships between them, i.e., taxonomic and non-taxonomic relations.
relations. In order to overcome these issues, a conceptual space document representation
emerged as a means that takes advantages of using wide coverage of concepts and rela-
tions provided by ontologies. In a conceptual space representation, a document is repre-
sented as a vector comprised of concepts (rather than words) and their weights. Concepts
are identified and located in a document through a matching technique which links the
terms appearing in that document with the concepts in the ontology. In fact, the link be-
tween a term tand a concept cis a mapping denoted by ht, ciin which textual description
defined in label of tis replaced with textual description defined in label of c. The weights
of concepts are defined by counting the occurrences of the concepts within a document
i.e. concept relevance. Researchers in [7, 8, 9, 10, 11, 12] have widely used concept vector
space model for document classification. Even though this approach has proven useful
for document classification of many domains, it however has some limitations. Two ma-
jor limitations of this approach are: 1) it relies on the exact technique in which a document
is represented into vector space using concept vectors built by mapping terms occurring
in a document to concepts appearing in an ontology, and 2) a weighting technique that treats all concepts as equally important regardless of where the concepts are depicted in the hierarchy of the ontology [13]. The importance is not equal for all concepts; it depends on the relations of concepts with other concepts in the ontology hierarchy. Concepts which have more relations with other concepts are more important than concepts which have fewer relations [14].
These limitations are addressed in this paper by proposing an improved concept vector space model in which:
1. a weighting technique enhanced with the new concept importance parameter is used to assess the weight of ontology concepts. The concept importance in our case is computed automatically by first converting the ontology into an ontology graph and then applying one of the Markov-based algorithms, called PageRank. The obtained importance is then aggregated with the concept relevance in order to obtain the final weight of that particular concept.
2. concept vectors used to represent the document in a semantic vector space are constructed by using concepts provided by the ontology through the exact match technique and by acquiring terms that are related and can be attached to concepts of that ontology.
The rest of the paper is structured as follows. Section 2 describes related work. Section 3 gives an overview of the proposed architecture and presents a detailed description of our proposed concept vector space model. Section 4 describes the concept importance calculation procedure and presents the performance of conventional and deep machine learning classifiers on the INFUSE dataset for classifying funding documents into five distinct categories. Lastly, Section 5 concludes the paper and gives an insight into future work.
2. Related Work
The field of document classification has attracted a lot of attention in recent years, resulting in a wide variety of approaches. Depending on the vector space document representation model employed, there are two main categories of approaches relevant to the classification task: 1) the keyword-based vector space approach, and 2) the concept (ontology) based vector space approach.
The first approach relies on a set of terms (words) extracted from the documents in the dataset. This approach has some limitations, as it does not consider the dependency between terms, and it ignores the order and the syntactic structure of the terms in the documents. To overcome these limitations, the concept-based vector space approach comes into play. This approach relies on a set of concepts taken from an ontology to derive the semantic representation of documents. There is some research work in which concepts provided by ontologies are used for semantic document representation. One example is presented in [15], in which the authors introduced a classification approach that
relies on a document representation model constructed using concepts gathered from a domain ontology. In particular, a domain ontology for Health, Safety, and Environment in oil and gas application contexts is used to classify documents dealing with accidents in the oil and gas industry. An extended version of the classification approach given in [15] is presented later in [9]. This extended work proposed a classification approach that employs a semantic document representation model that, besides concepts derived from the ontology, uses a list of semantically related terms. Although the approach presented in that paper is similar to our work, we differ in how we acquire semantically related terms. An extraction technique that relies on the semantic and contextual information of terms is used in our approach to find and extract the most semantically related terms, instead of the n-gram extraction technique used in [9].
The concept vector space approach employs a weighting technique for assessing the weight of concepts that relies on concept relevance as a discriminatory feature for document classification. A drawback of this weighting technique is that it considers all concepts equally, regardless of where in the hierarchy the concepts occur. There have been some efforts to determine concept importance depending on the position of the concepts in the hierarchy. For instance, researchers in [16] used three different weights for concepts depending on the position where they occur in the ontology hierarchy. The first weight was assigned to concepts occurring as classes, the second to concepts occurring as subclasses, and the third to concepts occurring as instances. The values of these weights were set empirically through trial and error by conducting experiments: 0.2 for concepts which occur as classes, 0.5 for concepts occurring as subclasses, and 0.8 when concepts occur as instances.
A slightly different approach to computing weights is implemented in [17, 18], where layers of the ontology tree are used to represent the position of concepts in the ontology. The weight of each concept is then computed by counting the length of the path from the root node to the given concept. The same approach of using layers for calculating concept weights is used in [19]. Path length is also used to compute the weight of concepts, but rather than considering all ontology concepts, only the leaf concepts are used. The idea behind this approach is that more general concepts, such as superclasses, are implicitly taken into account through the use of leaf concepts, by distributing their weights to all of their subclasses down to the leaf concepts in equal proportion.
The drawback of the approaches presented above is that they compute concept weights either empirically through trial and error, thus keeping these weights fixed, or by using the path length. Furthermore, the approach presented in [19] uses only the top-level ontology for computing weights. Our approach uses the Markov-based PageRank algorithm to compute concept importance. The algorithm uses all concepts of the ontology, and the importance of a concept is computed relative to all other concepts in the ontology.
From a classification perspective, the studies presented above have not established well the representation of documents, which is one of the main aspects that influences the performance of ontology-based classification models. Documents are represented as vectors containing the relevance of concepts that are gathered from an ontology by searching only for the presence of their lexicalizations (concept labels) in the documents. As a result, classification models are limited in capturing the whole conceptualization involved in documents.
Another strand of research covers work related to the use of machine learning approaches for document classification. For instance, the authors in [8] proposed a machine learning based classification approach for understanding sentiment by differentiating good news from bad news. This is achieved using vector space document representations learned by deep learning and convolutional neural networks, with a test accuracy of 85%. Another example, using a convolutional recurrent deep learning model for classification, is proposed in [7]. This approach is similar to our work, but our focus is on the classification of documents instead of sentences, and we use feature vectors constructed from concepts derived from an ontology.
3. Architecture of the Proposed Model
The main goal of the proposed model, shown in Figure 1, is the classification of image and textual documents using an improved concept vector space which relies on semantically rich document representations and an enhanced concept weighting scheme. An image document in our case is a movie frame containing handwritten lecture notes on a chalkboard, extracted from a lecture video employing image processing techniques, while the textual documents are financial documents stored in PDF format. The model consists of seven main modules, which are described in the following subsections.
3.1. Text Analysis Module - TAM
The input to the proposed classification model is a collection of documents that can be stored either as unstructured textual data or as images. If the input is a document image, it initially goes through a text analysis module called TAM to extract text from that image. The TAM module itself consists of three steps, and preprocessing is the very first one, which ensures that the image has readable text. The readable quality of text in a document image is mostly affected by blocking and blurring artifacts resulting from compression and denoising. These readability issues are avoided by using a metric designed for evaluating text quality called the reference free perceptual quality metric (RF-PQM) [20]. The image is then converted into binary format using the Otsu technique [21], and text regions are localized using a 4-connected component based labelling approach, as illustrated in Figure 2.
The next step of the TAM module is the segmentation and extraction of text lines from the connected components obtained as blobs after localization, followed by the extraction of words using a vertical 1-D projection histogram. We assume that the text documents obtained at this stage are correct, since the evaluation of TAM itself is beyond the scope of this paper. Readers are therefore advised to refer to [22] and [23] for full details on the TAM module.
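As an illustration of the binarization and localization steps just described, the following is a minimal sketch using OpenCV. It is not the authors' TAM implementation: the RF-PQM quality check and the line/word segmentation are omitted, and the file name and minimum-area filter are illustrative assumptions.

```python
import cv2

# Load a chalkboard lecture frame in grayscale (file name is illustrative).
img = cv2.imread("lecture_frame.png", cv2.IMREAD_GRAYSCALE)

# Otsu's method selects the binarization threshold automatically.
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Label candidate text regions as blobs using 4-connectivity.
num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(
    binary, connectivity=4)

# Keep bounding boxes (x, y, w, h) of non-background components; the
# minimum-area filter (an assumed value) drops tiny noise blobs.
text_regions = [tuple(stats[i][:4]) for i in range(1, num_labels)
                if stats[i][cv2.CC_STAT_AREA] > 10]
```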
Figure 1: Architecture of the proposed classification model
Figure 2: Labelling approach using 4 connected-components

3.2. Preprocessing
This module takes as input the text documents extracted from image documents in the TAM module and/or a collection of documents stored in unstructured textual formats, e.g., Word, PDF, PowerPoint slides, etc. These text documents undergo preprocessing steps including morpho-syntactic analysis. The first preprocessing step is tokenization, in which the text is split into small pieces known as tokens. Next, stop words and duplicate words are removed, and finally stemming is performed to normalize the retrieved words.
The output of this module is a collection of documents composed of plain text with no semantics associated with them, and it is linked directly to the concept extraction module in order to embed semantics into those text documents.
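A minimal sketch of this preprocessing pipeline is given below, using NLTK; the paper does not name a specific toolkit or stemmer, so NLTK and the Porter stemmer are our assumptions.

```python
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def preprocess(text):
    # Tokenization: split the text into small pieces known as tokens.
    tokens = word_tokenize(text.lower())
    # Remove stop words (and, for simplicity, non-alphabetic tokens).
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t.isalpha() and t not in stops]
    # Stem to normalize the retrieved words, dropping duplicates.
    stemmer = PorterStemmer()
    seen, normalized = set(), []
    for t in tokens:
        s = stemmer.stem(t)
        if s not in seen:
            seen.add(s)
            normalized.append(s)
    return normalized
```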
3.3. Concept Extraction
The concept extraction module is concerned with the construction of feature vectors. A feature vector is an n-dimensional vector comprised of concepts provided by domain ontologies, so as to move from a keyword-based vector representation towards a semantic-based vector representation. To achieve this step towards semantic representation, we primarily need to associate terms extracted from documents with concepts of the ontology. Terms are located and extracted from documents using a Lucene inverted indexing technique, which generates a list of all unique terms that occur in any document and the set of documents in which these terms occur. The extracted terms are stemmed using a stemming method. Further, noisy terms, i.e., terms with a single character, are removed from the list of extracted terms. The extracted terms are associated with the concepts of the ontology using: 1) a matching method in which terms appearing in a document are mapped to the relevant concepts from the domain ontology, and 2) the acquisition of relevant terminology that is semantically related and can be attached to concepts of that domain ontology.
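The sketch below illustrates the kind of inverted index described above with a plain Python dictionary, standing in for the Lucene index the paper actually uses; it also applies the single-character noise filter.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping of document id -> list of (stemmed) terms."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            if len(term) > 1:  # drop noisy single-character terms
                index[term].add(doc_id)
    # term -> set of documents in which the term occurs
    return index
```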
The matching method [12] follows the idea of searching for concepts in the domain ontology that have labels matching either partially or exactly/fully with a term occurring in a document. To put it simply, each term identified and located in a document is searched for in the domain ontology, and if an instance term matches a concept label, then the term is replaced with the concept. Concept labels are considered to be all lexical entries and lexical variations contained in a concept. The obtained concepts are used to construct concept vectors. An exact match is the case where a concept label is identical to an instance term occurring in the document. A partial match is the case where a concept label contains a term occurring in the document. The exact and partial match are formally defined as follows.
Definition 1. Let $Ont$ be the domain ontology and let $D$ be the dataset composed of documents of this given domain. Let $Doc \in D$ be a document defined by a finite set of terms $Doc = \{t_1, t_2, ..., t_i\}$. The mapping of a term $t_i \in Doc$ into a concept $c_j \in Ont$ is defined as:

$$EM(t_i, c_j) = \begin{cases} 1, & \text{if } label(c_j) = t_i \\ 0, & \text{if } label(c_j) \neq t_i \end{cases}$$

$$PM(t_i, c_j) = \begin{cases} 1, & \text{if } label(c_j) \text{ contains } t_i \\ 0, & \text{if } label(c_j) \text{ does not contain } t_i \end{cases}$$

where EM and PM denote exact match and partial match, respectively.
If $EM(t_i, c_j) = 1$, the term $t_i$ and the concept label $c_j$ are identical, and the term $t_i$ is replaced with the concept $c_j$. For example, for a concept in the ontology such as Organization or Call, as shown in Figure 3, there exists an identical term that appears in the document. If $PM(t_i, c_j) = 1$, the term $t_i$ is part of the concept label $c_j$, and the term $t_i$ is replaced with the concept $c_j$. For example, the compound ontology concept ProjectFunding shown in Figure 3 contains terms that appear in the document, such as Project and/or Funding.
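A small sketch of the exact/partial match rules of Definition 1 follows; the case-insensitive comparison and the fallback of keeping unmatched terms are our assumptions, not details given in the paper.

```python
def exact_match(term, concept_label):
    # EM(t_i, c_j) = 1 when the concept label is identical to the term.
    return term.lower() == concept_label.lower()

def partial_match(term, concept_label):
    # PM(t_i, c_j) = 1 when the concept label contains the term,
    # e.g. "ProjectFunding" contains "Funding".
    return term.lower() in concept_label.lower()

def map_terms_to_concepts(terms, concept_labels):
    """Replace each document term with a matching ontology concept label."""
    mapped = []
    for t in terms:
        for c in concept_labels:
            if exact_match(t, c) or partial_match(t, c):
                mapped.append(c)
                break
        else:
            mapped.append(t)  # no matching concept; keep the term
    return mapped

# "call" matches "Call" exactly; "funding" partially matches "ProjectFunding".
print(map_terms_to_concepts(["funding", "call"], ["ProjectFunding", "Call"]))
```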
The extraction of concepts through the acquisition of relevant terminology that is related and can be attached to ontology concepts is a more complex task, which is achieved through the exploitation of both the contextual and semantic information of terms occurring in a document.
The contextual information of a term is defined by its surrounding words and is computed using Equation 1:

$$Context(t_i, t_j) = \frac{t_i \cdot t_j}{\|t_i\| \, \|t_j\|} \qquad (1)$$
The vectors $t_i$ and $t_j$ are composed of values derived from three statistical features, namely term frequency, term font type, and term font size. Different font types, i.e., bold, italic, underline, and font sizes, i.e., title, level 1, level 2, are introduced to derive the context. In our case, the values of these statistical features are extracted from the input PDF documents using the Apache PDFBox library, an open source Java library which allows the creation of new PDF documents, the manipulation of existing documents, and the extraction of content from documents.
The semantic information of a term is calculated using a semantic similarity measure based on the English lexical database WordNet. The Wu&Palmer similarity measure [24] is employed to compute a semantic score (Eq. 2) for all possible pairs of terms $t_i$ and $t_j$ occurring in a document:

$$Semantic(t_i, t_j) = \frac{2 \times depth(lcs)}{depth(t_i) + depth(t_j)} \qquad (2)$$

where $depth(lcs)$ is the depth of the least common subsumer of the terms $t_i$ and $t_j$, and $depth(t_i)$ and $depth(t_j)$ are the path depths of the terms $t_i$ and $t_j$ in WordNet.
Combining the contextual and semantic information gives an aggregated score, as shown in Equation 3:

$$AggregatedScore(t_i, t_j) = \lambda \, Context(t_i, t_j) + (1 - \lambda) \, Semantic(t_i, t_j) \qquad (3)$$

where $\lambda$ is set to 0.5, indicating an equal contribution of the context and semantic components to the aggregated score.
The aggregated score, through a rank cut-off method, is used to acquire terms that are related to concepts of the ontology. More concretely, terms that score above the specified threshold (top-N) are considered to be relevant terms.
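Equations 1-3 can be sketched as follows. NLTK's `wup_similarity` already implements the Wu&Palmer score of Equation 2; taking the first WordNet synset of each term is a simplifying assumption, as the paper does not say how synsets are selected.

```python
import numpy as np
from nltk.corpus import wordnet as wn

def context_score(v_i, v_j):
    # Equation 1: cosine similarity between the statistical feature vectors
    # (term frequency, font type, font size) of the two terms.
    return float(np.dot(v_i, v_j) / (np.linalg.norm(v_i) * np.linalg.norm(v_j)))

def semantic_score(t_i, t_j):
    # Equation 2: Wu&Palmer similarity over WordNet.
    s_i, s_j = wn.synsets(t_i), wn.synsets(t_j)
    if not s_i or not s_j:
        return 0.0
    return s_i[0].wup_similarity(s_j[0]) or 0.0

def aggregated_score(t_i, t_j, v_i, v_j, lam=0.5):
    # Equation 3: lam = 0.5 gives equal weight to context and semantics.
    return lam * context_score(v_i, v_j) + (1 - lam) * semantic_score(t_i, t_j)
```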
3.4. Domain Ontology
This module covers the domain ontology, which interfaces with the Term-to-Concept mapper component in the concept extraction module and with the weighting scheme module. A domain ontology is a data model which represents concepts and the relations between them in a given domain. An ontology structure is formally represented by a 5-tuple [25], as shown in Equation 4:
$$Ont := (C, R, H^C, rel, A) \qquad (4)$$

where:
- $C$ is a set of concepts, e.g. Funding, Call;
- $R$ is a set of relations, e.g. announces, promotes;
- $H^C$ is a hierarchy or taxonomy of concepts with multiple inheritance, e.g. ProgrammeFunding isa Funding and FinanceProgramme isa ProgrammeFunding;
- $rel$ is a set of non-taxonomic relations which are described by their domain and range restrictions, e.g. isReceivedBy, appliesFor;
- $A$ is a set of ontology axioms, expressed in an appropriate logical language, which describe additional constraints.

The ontology definition shown in Equation 4 can be made domain specific by defining a lexicon, which is a 3-tuple $Lex := (L, F, G)$ consisting of a set of lexical entries $L$ for concepts and relations, and two sets $F$ and $G$ that link concepts and relations with their lexical entries.
3.5. Weighting Scheme
The weight of a concept is a numeric value assigned to each concept in order to assess its power in distinguishing a particular document from others. A technique used to compute the weights of concepts is known as a concept weighting scheme. There exist various weighting schemes that typically rely on the relevance of concepts, reflected by the frequency of occurrences of a concept's lexicalizations within a document. In this module we present an enhanced concept weighting scheme which, besides concept relevance, introduces a new parameter called concept importance that reflects the contribution of a concept to the ontology. Concept importance is processed offline and involves the following steps: 1) mapping the domain ontology into an ontology graph, 2) applying Markov-based algorithms, and 3) calculating concept importance and aggregating it with concept relevance.
The first and foremost step of this module is to convert the domain ontology described in subsection 3.4 into an ontology graph for calculating concept importance. To achieve this, we adopt a model where the ontology is represented as a directed acyclic graph. The modelling is an equivalent mapping, which means that an ontology concept is mapped to a graph vertex and an ontology relation to a graph edge which connects two vertices. The formal definition of this graph, known as an ontology graph, is given as follows.

Definition 2. Given a domain ontology $Ont$, the ontology graph $G = \{V, E, f\}$ of $Ont$ is a directed acyclic graph, where $V$ is a finite set of vertices mapped from concepts in $Ont$, $E$ is a finite set of edge labels mapped from relations in $Ont$, and $f$ is a function from $E$ to $V \times V$.
In Figure 3, we present part of the INFUSE ontology graph which consists of a subset
of concepts and relations from the funding domain. The details of the INFUSE ontology
are given in Section 4.
In the semantic web, formal syntaxes for defining ontologies are the Web Ontology Language (OWL) and the Resource Description Framework (RDF) Schema. These languages represent the ontology as a set of Subject-Predicate-Object (SPO) expressions known as RDF triples. The set of RDF triples is known as an RDF graph, where the subject is the source vertex, the object is the destination vertex, and the predicate is a directed edge label which links those two vertices. The formal definition of an RDF graph is given as follows.
Figure 3: A part of INFUSE ontology graph
Definition 3. Given a set of RDF triples $T$, the RDF graph $G = \{V, E, f\}$ of $T$ is a directed acyclic graph, where $V$ is a finite set of vertices (subjects and objects) in $G$, defined as $V = \{v_u : u \in (S(T) \cup O(T))\}$; $E$ is a finite set of edge labels (predicates) in $G$, defined as $E = \{e_{SPO} : SPO \in T\}$; and $f$ is a function linking a subject $S$ to an object $O$ by an edge $E$, defined as $f = \{f_P : f_P = V_S \rightarrow V_O,\; V_S, V_O \in T\}$.
The ontology graph and the RDF graph are not the same for a given ontology. The difference is that a relation in an ontology graph is defined as a vertex in the RDF graph. For example, the relation isReceived in the ontology graph shown in Figure 3 is represented as a vertex in the RDF graph, as shown in Figure 4. In other words, a relation in an RDF graph is a link between a subject denoted by the rdfs:domain property and an object denoted by the rdfs:range property, as given in Definition 3.
Figure 4: An example RDF graph representation
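To make the graph construction concrete, the following sketch builds a directed graph from a few (subject, predicate, object) triples in the spirit of Figures 3 and 4, using networkx; the triple list is illustrative, not the full INFUSE ontology.

```python
import networkx as nx

# A few relations from the funding domain, written as SPO triples.
triples = [
    ("Funding", "isReceivedBy", "Applicant"),
    ("Applicant", "appliesFor", "Funding"),
    ("Organisation", "announces", "Call"),
]

G = nx.DiGraph()
for s, p, o in triples:
    # Concepts become vertices; each relation becomes a labelled edge.
    G.add_edge(s, o, label=p)
```

For such a graph, networkx also ships a ready-made `nx.pagerank` routine, which could be applied directly as an alternative to the manual iteration sketched later in this section.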
The next step is the computation of the importance of the vertices of the graph, using an adaptation of the Markov-based algorithms. The graph can be either an ontology graph or an RDF graph, as defined above. The idea behind Markov-based algorithms is to represent the graph as a stochastic process, more concretely as a first-order Markov chain, where the importance of a given vertex is defined as the fraction of time spent traversing that vertex in an infinitely long random walk over the vertices. The probability of transitioning from a vertex $i$ to a vertex $j$ depends only on the vertex $i$ and not on the path taken to arrive at vertex $i$. This property, known as the Markov property, enables the transition probabilities to be represented as a stochastic matrix with non-negative entries and a maximum probability of 1.
In this paper, we use the PageRank [26] algorithm, one of the most well-known and successful examples of Markov-based algorithms [27].
A simplified description of how the PageRank algorithm works is as follows. It initially defines the importance of a vertex $i$ as given in Equation 5:

$$PR(i) = \sum_{j \in V_i} \frac{PR(j)}{Outdegree(j)} \qquad (5)$$

where $PR(j)$ is the importance of vertex $j$, $V_i$ is the set of vertices that link to vertex $i$, and $Outdegree(j)$ is the number of outlinks from vertex $j$.
As can be seen from Equation 5, PageRank is an iterative algorithm. It assigns an initial importance to a vertex $i$ as shown in Equation 6:

$$PR^{(0)}(i) = \frac{1}{N} \qquad (6)$$
where $N$ is the total number of vertices in the graph. PageRank then iterates as per Equation 7 until a convergence criterion is satisfied:

$$PR^{(k+1)}(i) = \sum_{j \in V_i} \frac{PR^{(k)}(j)}{Outdegree(j)} \qquad (7)$$
The process can also be defined using matrix notation. Let $M$ be the square, stochastic transition probability matrix corresponding to the directed graph $G$, and let $PR^{(k)}$ be the importance vector at the $k$-th iteration. Then the computation of one iteration corresponds to the matrix-vector multiplication shown in Equation 8:

$$PR^{(k+1)} = M \cdot PR^{(k)} \qquad (8)$$
The entry of the transition probability matrix $M$ for a vertex $j$ which links to vertex $i$ is defined using Equation 9:

$$p_{i,j} = \begin{cases} \frac{1}{Outdegree(j)}, & \text{if there is a link from } j \text{ to } i \\ 0, & \text{otherwise} \end{cases} \qquad (9)$$
Two properties are necessary for a Markov-based algorithm to converge: it must be aperiodic and irreducible [28]. The transition probability matrix $M$ is a stochastic matrix with total probability 1, and this makes the PageRank algorithm aperiodic. The PageRank algorithm is not irreducible, however, due to the definition given in Equation 9, where some of the transition probabilities in matrix $M$ may be 0. This does not meet the criteria of the irreducibility property, which requires the transition probabilities to be greater than 0.
To make the PageRank algorithm irreducible so that it converges, a damping factor $1 - \alpha$ is introduced. As a result, a new transition probability matrix $\overline{M}$ is defined, in which a complete set of outgoing edges with probability $\alpha/N$ is added to all vertices in the graph.
The definition of matrix Mis given in Equation 10.
M= (1 α)M+α1
NN×N
(10)
Besides enabling the PageRank algorithm to converge, the damping factor also overcomes the problem of rank sinks [28].
Replacing $M$ with $\overline{M}$ in Equation 8, the PageRank algorithm is defined as given in Equation 11:

$$PR^{(k+1)} = (1 - \alpha)M \times PR^{(k)} + \alpha \left[\frac{1}{N}\right]_{N \times N} \qquad (11)$$
Finally, the concept importance is defined as given in Equation 12:

$$Imp(c_i) = PR^{(k+1)} \qquad (12)$$
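A minimal power-iteration sketch of Equations 6-11 is shown below. In practice the authors compute importance with GraphDB's RDF rank extension (see Section 4.1), so this is only an illustration, and the tolerance and the value of alpha are assumed.

```python
import numpy as np

def pagerank(M, alpha=0.15, tol=1e-9, max_iter=100):
    """M: column-stochastic matrix with p[i, j] = 1/Outdegree(j) (Eq. 9)."""
    n = M.shape[0]
    pr = np.full(n, 1.0 / n)           # initial importance, Equation 6
    teleport = np.full(n, alpha / n)   # damping term, Equations 10-11
    for _ in range(max_iter):
        nxt = (1 - alpha) * (M @ pr) + teleport
        if np.abs(nxt - pr).sum() < tol:   # convergence criterion
            return nxt
        pr = nxt
    return pr
```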
The final step of this module is the aggregation of concept importance and concept relevance to compute the weight of concepts (Equation 13). The value of a concept weight is in the range [0, 1], because both concept importance and concept relevance are normalized.

$$w(c_i) = Imp(c_i) \times Rel(c_i) \qquad (13)$$

Concept importance $Imp$ is computed using Equation 12 described above, while concept relevance $Rel$ is computed using Equation 14:

$$Rel(c_i) = \sum_{i=1}^{m} Freq(c_i) \qquad (14)$$

where $Freq(c_i)$ is the frequency of occurrences of lexicalizations of concept $c_i$ in the document to be classified.
3.6. Document Representation
The outputs of both modules, concept extraction and weighting scheme, serve as input to the semantic document representation module. More concretely, the concepts obtained from the concept extraction module and their weights computed through the weighting scheme module are used to represent a document in a vector space, as defined in Equation 15:

$$Doc = \{(c_1, w_1), (c_2, w_2), (c_3, w_3), ..., (c_i, w_i)\} \qquad (15)$$

where $c_i$ is the $i$-th concept obtained from the concept extraction module and $w_i$ is its weight computed by the weighting scheme module.
Table 1 illustrates an example of a semantic document representation in a vector space constructed using the concepts GeographicalArea and Applicant and their weights, composed of the two components Importance (Imp) and Relevance (Rel), as described in subsection 3.5.

Table 1: An example of building a concept vector space

        GeographicalArea        Applicant
Doc     Imp    Rel    w         Imp    Rel    w
d1      0.130  0.797  0.104     0.020  0.797  0.016
d2      0.130  0.624  0.081     0.020  0.624  0.012
d3      0.130  0.000  0.000     0.020  0.860  0.017
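The aggregation of Equation 13 can be reproduced directly from Table 1; for example, row d1 follows from the given importance and relevance values (rounding to three decimals is assumed):

```python
# (Imp, Rel) pairs for document d1, taken from Table 1.
concepts = {"GeographicalArea": (0.130, 0.797), "Applicant": (0.020, 0.797)}

# Equation 13: w(c_i) = Imp(c_i) * Rel(c_i)
doc_vector = {c: round(imp * rel, 3) for c, (imp, rel) in concepts.items()}
print(doc_vector)  # {'GeographicalArea': 0.104, 'Applicant': 0.016}
```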
3.7. Document Classification
The last module of the proposed model deals with the classification of documents into appropriate categories using conventional machine learning classifiers and deep learning. In essence, a document represented via the concept vector space is fed into a classifier to build a prediction model that can be used to classify new unseen documents.
4. Results and Analysis
This section describes the calculation of the concept importance of a real-world ontology. It also gives a description of the dataset used in the experiments demonstrating the applicability of our proposed document representation models. Finally, it provides a thorough comparison of the document classification results achieved using both conventional machine learning techniques and deep networks.
4.1. Concept Importance Calculation
A real-world domain ontology called the INFUSE ontology is used for computing concept importance. This ontology was developed as part of the INFUSE project¹ and comes from the funding domain. It is composed of 85 concepts, e.g. Funding, GrantSeeker, and 18 object properties, e.g. isGivenBy, appliesFor, that connect these concepts. A part of the INFUSE domain ontology represented as an ontology graph is shown in Figure 3.
To convert the ontology into an ontology graph and compute the concept importance, we used the RDF rank algorithm. This algorithm is part of the extensions module of GraphDB [29], and it computes the importance of every vertex in the entire RDF graph. Table 2 shows the concept importance values of the top ten concepts of the INFUSE ontology. The concept importance is a floating point number with values varying between 0 and 1.
Table 2: Concept importance for the top ten concepts of the INFUSE ontology

No  Concept           Concept Importance
1   Coverage          0.20
2   GeographicalArea  0.13
3   Topic             0.11
4   County            0.07
5   Participant       0.06
6   Programme         0.05
7   Organisation      0.05
8   Funding           0.05
9   Applicant         0.04
10  Candidate         0.04
Figure 5 shows the concept importance values in ranking order, computed for all the concepts of the INFUSE ontology. As can be seen from the chart, the concept importance differs between concepts, varying from 0.2 to 0.02 for almost half of the concept set, while for the rest of the concepts it is 0.01. These findings confirm the idea that the contribution of ontology concepts, in terms of their discriminating power, is different, and thus some concepts are more important than others with respect to document classification.
¹ https://www.eurostars-eureka.eu/project/id/7141
Figure 5: Concept importance for all concepts of the INFUSE ontology
4.2. Performance Evaluation of Baseline CVS and iCVS
In order to demonstrate the general applicability of our proposed classification model and to validate its effectiveness, extensive experiments using various classifiers were conducted on the INFUSE dataset.
The INFUSE dataset consists of 467 grant documents that were collected and classified into 5 categories by field experts as part of the INFUSE project. The dataset is split randomly: 70% of the documents are used to build the classifier and the remaining 30% to test the performance of the model. The number of documents in each category varies widely, ranging from the Society category, which contains 165 documents, to the Music category, which contains only 14 documents. Table 3 shows the five categories along with the number of training and testing documents in each category.
Table 3: Dataset size

No  Category       # Train  # Test  Total
1   Culture        102      44      146
2   Health         73       32      105
3   Music          10       4       14
4   Society        115      50      165
5   Sportssociety  26       11      37
    Total          326      141     467
Parametric and nonparametric machine learning techniques are used for the experiments. A parametric machine learning technique assumes that the data can be parameterized by a fixed number of parameters. In essence, the statistical model of parametric techniques is specified by a simplified function through two types of distributions, namely the class prior probability and the class conditional probability density function (posterior) for each dimension. On the contrary, a nonparametric machine learning technique assumes no prior parameterized knowledge about the underlying probability density function, and classification uses the information provided by the training samples alone.
Naive Bayes is the parametric machine learning technique applied for classification in this paper, while the nonparametric techniques applied include Decision Tree and Random Forest. We have also chosen to use the Support Vector Machine (SVM) for classification, which can be either a parametric or a non-parametric technique. The linear Support Vector Machine contains a fixed number of parameters, represented by the weight coefficients, and thus belongs to the parametric techniques. On the other hand, the non-linear Support Vector Machine is a non-parametric technique, and the Radial Basis Function kernel Support Vector Machine, known as RBF kernel SVM, is a typical example of this family. In addition, we have applied two boosting techniques, namely Gradient Boosting and Ada Boosting, which harness the power of ensemble classifiers by generating multiple predictions and performing majority voting among the individual classifiers.
Additionally, a Multilayer Perceptron (MLP) is used in this study. An MLP is a feed-forward Artificial Neural Network (ANN). Each artificial neuron in the network computes a weighted sum of its inputs $x_i$, adds a bias $b$, and applies an activation function. A simple ANN is represented as $y = f(wx_i + b)$, where $w$ is the weight and $f$ is the activation function. The most commonly used activation functions are the sigmoid, $\sigma(z) = 1/(1 + e^{-z})$, and the rectified linear unit, $ReLU(z) = max(0, z)$. The weight and bias terms are estimated by training the network on the observable data to minimize the loss, using cross-entropy or mean square error. In an MLP, the neurons are structured into layers. These layers are fully connected, which implies that every neuron in one layer is connected to every neuron in the adjacent layer. The input and output layers are the visible layers of the network, while a network may contain multiple hidden layers. Normally, a network containing more than one hidden layer is known as a deep neural network.
The standard information retrieval measures, namely precision, recall, and F1 measure, are used to evaluate the performance of the document classification. Precision is the number of documents which are classified correctly with respect to all classified documents; it is given as $tp/(tp + fp)$. Recall is the number of correctly classified documents with respect to the total number of documents belonging to the class; it is defined as $tp/(tp + fn)$, where $tp$, $fp$, and $fn$ are the true positive, false positive, and false negative samples, respectively. The F1 measure is the harmonic mean of precision and recall, defined as $2 \times (precision \times recall)/(precision + recall)$.
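A minimal computation of these weighted scores with scikit-learn (assumed here as the toolkit; the label arrays are placeholders) looks as follows:

```python
from sklearn.metrics import precision_recall_fscore_support

# Placeholder ground-truth and predicted labels for the five categories.
y_true = ["Society", "Health", "Culture", "Society", "Music"]
y_pred = ["Society", "Health", "Society", "Society", "Music"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"P={precision:.4f}  R={recall:.4f}  F1={f1:.4f}")
```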
The best results for the conventional machine learning techniques are obtained with the following configurations. For the Bayesian classifier, a Gaussian NB is used, whereas for SVM, a radial basis function (RBF) kernel SVM is used, with a value of 0.001 for gamma, which describes how much influence a single training sample has, and a maximum value for the regularization parameter c. The depth of the tree for the RF classifier is set to 10, which gave the best results. For all other parameters of the classifiers, default configurations are used.
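A sketch of these configurations in scikit-learn is given below; the concrete value chosen for the "maximum" regularization parameter c is our assumption, as the paper only says a maximum value is used.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier, AdaBoostClassifier)
from sklearn.svm import SVC

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(max_depth=10),  # depth as stated
    "SVM": SVC(kernel="rbf", gamma=0.001, C=1e6),  # large C as a stand-in
    "Gradient Boosting": GradientBoostingClassifier(),
    "Ada Boosting": AdaBoostClassifier(),
}
```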
For the deep learning based MLP architecture, multiple simulations consisting of $L \times N$ configurations are carried out by varying the number of hidden layers $L$ and the number of neurons $N$ in each layer, where $L = \{3, 5, 7\}$ and $N = \{64, 128, 256, 512, 1024\}$. Figure 6 shows the total number of trainable parameters for a 5-hidden-layer MLP containing 1024 neurons in each layer. The input to the network shown is a concept vector of size 323 for iCVS variant 2. ReLU is applied as the activation function, Adam is used as the optimizer, and the learning rate $\alpha$ is set to $1e{-3}$. A softmax function is applied at the last layer to produce the likelihood of a test sample belonging to one of the 5 classes.
Figure 6: Model summary for a 5-hidden-layer MLP architecture with a 323-dimensional concept input vector and 1024 neurons per layer.
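A sketch of one such configuration follows, using Keras (the paper does not name the framework, so this choice is an assumption); varying hidden_layers over {3, 5, 7} and neurons over {64, ..., 1024} reproduces the L x N simulation grid described above.

```python
from tensorflow import keras

def build_mlp(input_dim=323, hidden_layers=5, neurons=1024, n_classes=5):
    # Input size 323 matches the iCVS variant 2 concept vectors.
    model = keras.Sequential([keras.Input(shape=(input_dim,))])
    for _ in range(hidden_layers):
        model.add(keras.layers.Dense(neurons, activation="relu"))
    # Softmax output over the five document categories.
    model.add(keras.layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```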
Three different models of vector space document representation are used to test the classifiers. In the first model, called baseline CVS, we conducted a document classification experiment on the INFUSE dataset in which an exact/partial match technique is employed to match terms occurring in a document with the relevant concepts of the ontology, building concept vectors to represent the documents in vector space. Precision, recall, and F1 results obtained from six conventional machine learning techniques and a deep MLP with different numbers of hidden layers and neurons are shown in Table 4 and Table 5, respectively. As can be seen from the results, the Gradient Boosting classifier shows the best performance compared to the other conventional classifiers, achieving a weighted F1 score of 82.58%. On the other hand, the MLP with 3 hidden layers and 1024 neurons in each layer outperforms the other deep networks, achieving an F1 score of 80.02%.
Table 4: Performance of conventional ML techniques using baseline CVS
Technique Precision (%) Recall (%) F1 (%)
Naive Bayes 67.24 60.99 61.90
Decision Tree 66.10 66.40 65.50
Random Forest 77.69 77.30 77.25
SVM 81.73 77.30 78.85
Gradient Boosting 82.99 82.26 82.58
Ada Boosting 58.61 53.90 54.69
In the second experiment, we performed document classification using the same classifiers on the same corpus of documents from the INFUSE dataset, but employing the second model of document representation.
Table 5: Performance of MLP using baseline CVS

Hidden layers  Neurons  Precision (%)  Recall (%)  F1 (%)
3              64       79.32          78.72       78.47
3              128      77.80          78.01       77.89
3              256      77.05          77.30       77.08
3              512      79.75          79.43       79.07
3              1024     80.13          80.14       80.02
5              64       78.11          78.01       77.50
5              128      78.29          78.01       77.96
5              256      75.21          74.46       74.36
5              512      77.21          76.59       76.64
5              1024     77.87          77.30       77.24
7              64       77.99          78.01       77.77
7              128      77.93          77.30       77.40
7              256      76.53          75.58       75.89
7              512      75.00          73.75       73.92
7              1024     78.73          76.59       76.90
The second model, called iCVS variant 1, uses an enhanced concept weighting scheme for assessing the weights of the concepts of the ontology. Six different conventional machine learning techniques, and a Multilayer Perceptron with different numbers of hidden layers and different numbers of neurons per layer, are used for classification, and the obtained results are shown in Table 6 and Table 7, respectively. As with the baseline CVS model, the results obtained using iCVS variant 1 show that the Gradient Boosting classifier achieved the highest score compared to the other conventional machine learning and deep learning techniques. In the context of deep networks, the best performance is achieved by an MLP architecture with 7 hidden layers and 256 neurons per layer, with an F1 score of 76.64%.
Table 6: Performance of conventional ML techniques using iCVS variant 1
Technique Precision (%) Recall (%) F1 (%)
Naive Bayes 66.63 53.90 57.73
Decision Tree 69.10 70.00 68.80
Random Forest 84.54 80.85 82.07
SVM 66.65 53.19 56.64
Gradient Boosting 83.06 81.56 82.14
Ada Boosting 61.72 60.28 60.33
The iCVS variant 2 model is also evaluated in a similar fashion. In this model, concept vectors for representing documents in vector space are built through the acquisition of new terms that are semantically related and can be attached to concepts of the ontology.
Table 7: Performance of MLP using iCVS variant 1

Hidden layers  Neurons  Precision (%)  Recall (%)  F1 (%)
3              64       72.84          73.04       72.77
3              128      67.40          69.50       68.22
3              256      71.86          70.92       71.29
3              512      73.69          73.75       73.55
3              1024     73.35          73.04       72.81
5              64       70.33          69.50       69.53
5              128      72.65          73.04       72.77
5              256      72.16          72.34       72.16
5              512      68.30          68.79       68.23
5              1024     73.14          73.04       72.82
7              64       66.55          68.08       66.46
7              128      67.79          69.50       68.30
7              256      76.82          76.59       76.64
7              512      77.10          75.17       75.87
7              1024     73.48          73.75       73.47
In our case, for each concept of the INFUSE ontology we used only the top-5 terms found to be relevant in terms of relatedness. For example, the terms fund, amount, part, subsistence, and grant are the top-5 terms found to be the most semantically related to the ontology concept funding. The performance of document classification, in terms of precision, recall, and F1 measure, achieved by the six conventional machine learning techniques and a Multilayer Perceptron with different numbers of hidden layers and neurons, is given in Table 8 and Table 9, respectively. As can be seen from the results shown in Table 8 and Table 9, the best performing classifier is an MLP with three hidden layers and 64 neurons in each layer, with an F1 score of 84.98%, which is slightly better than SVM with an F1 score of 84.11%.
Table 8: Performance of conventional ML techniques using iCVS variant 2
Technique Precision (%) Recall (%) F1 (%)
Naive Bayes 67.02 65.95 65.28
Decision Tree 79.20 77.90 76.70
Random Forest 77.04 74.46 75.06
SVM 85.66 83.68 84.11
Gradient Boosting 84.35 83.68 83.96
Ada Boosting 69.79 60.99 62.56
A side-by-side comparison of the three models is illustrated in Figure 7. The figure presents a complete picture of the performance of the conventional machine learning and
Table 9: Performance of MLP using iCVS variant 2

Hidden layers  Neurons  Precision (%)  Recall (%)  F1 (%)
3              64       85.05          85.10       84.98
3              128      80.12          80.14       79.55
3              256      79.04          79.43       78.79
3              512      81.47          81.56       81.29
3              1024     81.68          82.26       81.80
5              64       80.11          80.85       80.11
5              128      78.07          79.43       78.06
5              256      80.76          80.85       80.34
5              512      78.79          78.72       77.82
5              1024     78.42          79.43       78.58
7              64       77.68          78.01       77.21
7              128      80.76          80.85       80.47
7              256      77.50          77.30       77.17
7              512      82.99          83.68       83.07
7              1024     81.69          82.26       81.57
deep learning techniques on the INFUSE dataset for the proposed models. The bar chart shows the weighted F1 scores obtained by the conventional machine learning classifiers, namely Naive Bayes (NB), Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), Gradient Boosting (GB), and Ada Boosting (AB), and a Multilayer Perceptron (MLP) with 3 hidden layers and 64 neurons per layer, tested on the three different models of document representation.
As can be seen from the results shown in Figure 7, a higher weighted classification F1 score is achieved by all classifiers using iCVS variant 2. An exception is Random Forest, which gives slightly worse classification performance with this model than with the other models. Random Forest is an ensemble method that employs the same decision tree classifier on different training sets generated using the bootstrap sampling method. In bootstrap sampling, a new training set is created by drawing data from the original training set, so some data may be used several times to construct the forest and other data not at all. This may be one of the reasons why this classifier performs worse.
It is also interesting to note from Figure 7 that, in general, the MLP classifier outperforms all conventional machine learning classifiers, achieving a classification F1 score of 84.98%. On the other hand, the worst performance is shown by the Naive Bayes classifier, which may be due to the imbalanced classes of the INFUSE dataset. Imbalanced classes may bias the classifier towards the majority class, and thus the performance of the Naive Bayes classifier can quickly deteriorate.
An interesting fact that can also be observed from the bar chart in Figure 7 is that the iCVS variant 1 model has a different impact on the performance of different classifiers. While
Figure 7: F1 measure of different classifiers using exact/partial match (baseline CVS), enhanced
weighting scheme (iCVS variant 1), and acquisition of related terms (iCVS variant 2)
the nonparametric and boosting machine learning techniques demonstrate a positive impact on document classification using iCVS variant 1, the parametric techniques and the MLP show a negative impact on classification performance, giving worse accuracy.
5. Conclusion and Future Work
In this paper, we have investigated and analysed document classification performance using a concept vector space model improved with a new concept weighting scheme and a semantic document representation. The concept weighting scheme is enhanced with a new parameter that takes into account the importance of ontology concepts. Concept importance is computed automatically by converting the ontology into a graph and then employing the PageRank algorithm on it. The importance of an ontology concept is then aggregated with the concept relevance, which is computed using the frequency of appearances of the concept in the document. A semantic representation of a document is achieved using concepts derived from the ontology through a matching technique and the acquisition of new terms that are semantically related to ontology concepts.
We conducted various document classification experiments on three models of document representation, i.e., the baseline CVS model and the iCVS model with two variants. Additionally, a comparison between seven different classifiers is performed for all three models using precision, recall, and F1 score. For all three models, Random Forest, Gradient Boosting, and the Multilayer Perceptron performed rather well. Furthermore, a thorough investigation is carried out to evaluate the performance of the MLP by varying the number of hidden layers and the number of neurons in each layer. A three-hidden-layer MLP with 64 neurons per layer achieves higher classification performance compared to the other architecture configurations.
Generally, iCVS variant 1, employing an enhanced weighting scheme for assessing the weights of concepts, did not add much to the overall performance, except for Random Forest, which gave better results than with the baseline CVS and iCVS variant 2, with an F1 score of just over 82% (Table 6). Our findings showed that adding more concepts to the ontology improves the classification performance by 4.78 percentage points on average across all cases; however, it is computationally expensive due to the large number of feature vectors. The classification performance is also highly dependent upon the choice of classifier, and we can achieve the same performance on the iCVS model (variant 1 and variant 2) with the Random Forest and Gradient Boosting classifiers.
The investigation and analysis of classification performance was done on a real-world ontology and a dataset consisting of a small number of documents, so in future work we plan to conduct a performance analysis on a large-scale dataset. We also plan to implement and test other Markov-based algorithms for computing concept importance, as a fundamental part of the concept weighting scheme, and compare those techniques with the PageRank algorithm.
Furthermore, the primary focus of our study was addressing two major limitations of concept vectors, namely exact matching and the weighting scheme, by proposing an improved concept vector space model. However, our proposed approach does not handle another limitation of concept vectors, namely ontological relationships. Future studies on the current topic are therefore suggested in order to establish a representation of documents in which concept vectors are redefined to consider the various relationships that exist in an ontology.
Acknowledgment
The authors would like to thank Cristina Marco from the INFUSE project for providing the domain ontology and the dataset used in this paper.
References
[1] DOMO, Data never sleeps 6.0: How much data is generated every minute?, accessed: 2018-06-18
(2018).
URL https://www.domo.com/learn/data-never-sleeps-6
[2] R. Jacobson, 2.5 quintillion bytes of data created every day. How does CPG & retail manage it?, accessed: 2018-07-20 (2018).
URL https://www.ibm.com/blogs/insights-on-business/consumer-products/2-5-quintillion-bytes-of-data-created-every-day-how-does-cpg-retail-manage-it/
[3] P. Raghavan, Extracting and Exploiting Structure in Text Search, in: SIGMOD Conference, ACM,
2003, p. 635.
[4] A.-A. R. Al-Azmi, Data, Text, and Web Mining for Business Intelligence: A Survey, International
journal of Data Mining and Knowledge Management Process 3 (2) (2013) 1–26.
[5] S. Khan, M. Safyan, Semantic matching in hierarchical ontologies, Journal of King Saud University - Computer and Information Sciences 26 (3) (2014) 247–257.
[6] M. Keikha, A. Khonsari, F. Oroumchian, Rich Document Representation and Classification: An Analysis, Knowledge-Based Systems 22 (1) (2009) 67–71.
[7] A. Hassan, A. Mahmood, Convolutional recurrent deep learning model for sentence classification,
IEEE Access 6 (2018) 13949–13957.
[8] U. Reshma, B. Ganesh, M. Kale, P. Mankame, G. Kulkarni, Deep learning for digital text analytics:
Sentiment analysis, CoRR abs/1804.03673.
[9] N. Sanchez-Pi, L. Marti, A. C. B. Garcia, Improving Ontology-based Text Classification: An Occupa-
tional Health and Security Application, Journal of Applied Logic 17 (2016) 48–58.
[10] C. Bratsas, V. Koutkias, E. Kaimakamis, P. Bamidis, N. Maglaveras, Ontology Based Vector Space
Model and Fuzzy Query Expansion to Retrieve Knowledge on Medical Computational Problem So-
lutions, in: Proceedings of the 29th Annual International Conference of the IEEE Engineering in
Medicine and Biology Society, IEEE, 2007, pp. 3794–3797.
[11] P. Castells, M. Fernandez, D. Vallet, An Adaptation of the Vector Space Model for Ontology Based
Information Retrieval, IEEE Transactions on Knowledge and data engineering 19 (2) (2007) 261–272.
[12] S. Deng, H. Peng, Document Classification Based on Support Vector Machine Using A Concept Vector
Model, in: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, IEEE,
2006, pp. 473–476.
[13] Z. Kastrati, A. S. Imran, S. Y. Yayilgan, An improved concept vector space model for ontology based
classification, in: 2015 11th International Conference on Signal-Image Technology & Internet-Based
Systems (SITIS), IEEE, 2015, pp. 240–245.
[14] G. Wu, J. Li, L. Feng, K. Wang, Identifying Potentially Important Concepts and Relations in an On-
tology, in: Proceedings of the 7th International Conference on The Semantic Web, Springer-Verlag,
Berlin, Heidelberg, 2008, pp. 33–49.
[15] N. Sanchez-Pi, L. Marti, A. C. B. Garcia, Text Classification Techniques in Oil Industry Applications,
in: Proceedings of the International Joint Conference SOCO’13-CISIS’13-ICEUTE’13, Springer Inter-
national Publishing, 2014, pp. 211–220.
[16] X.-q. Yang, N. Sun, Y. Zhang, D.-r. Kong, General Framework for Text Classification Based on Domain Ontology, in: Proceedings of the 3rd International Workshop on Semantic Media Adaptation and Personalization, IEEE, 2008, pp. 147–152.
[17] J. Fang, L. Guo, X. Wang, N. Yang, Ontology-Based Automatic Classification and Ranking for Web Documents, in: Proceedings of the 4th International Conference on Fuzzy Systems and Knowledge Discovery, IEEE, 2007, pp. 627–631.
[18] H. Gu, Z. Kuanjiu, Text Classification Based on Domain Ontology, Journal of Communication and
Computer 3 (5) (2006) 261–272.
[19] C. d. C. Pereira, A. G. B. Tettamanzi, An Evolutionary Approach to Ontology-Based User Model
Acquisition, in: Proceedings of the 5th International Workshop on Fuzzy Logic and Applications,
Springer Berlin Heidelberg, Berlin, Heidelberg, 2006, pp. 25–32.
[20] A. S. Imran, F. A. Cheikh, Blind image quality metric for blackboard lecture images, in: Proceedings
of the 18th European Signal Processing Conference, IEEE, 2010, pp. 333–337.
[21] L. Jianzhuang, L. Wenqing, T. Yupeng, Automatic thresholding of gray-level pictures using two-dimensional Otsu method, in: Proceedings of the International Conference on Circuits and Systems, IEEE, 1991, pp. 325–327 vol. 1.
[22] Z. Kastrati, A. S. Imran, Document image classification using semcon, in: Proceedings of the 20th
Symposium on Signal Processing, Images and Computer Vision (STSIVA), 2015, pp. 1–6.
[23] A. S. Imran, S. Chanda, F. A. Cheikh, K. Franke, U. Pal, Cursive handwritten segmentation and recognition for instructional videos, in: Proceedings of the 8th International Conference on Signal Image Technology and Internet Based Systems, 2012, pp. 155–160.
[24] Z. Wu, M. Palmer, Verb Semantics and Lexical Selection, in: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 1994, pp. 133–138.
[25] A. Maedche, Ontology Learning for the Semantic Web, Springer US, 2002.
[26] S. Brin, L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, in: Proceedings of the 7th International Conference on World Wide Web, Elsevier Science Publishers B. V., Amsterdam, The Netherlands, 1998, pp. 107–117.
[27] S. White, P. Smyth, Algorithms for Estimating Relative Importance in Networks, in: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 2003, pp. 266–275.
[28] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Technical Report, Stanford InfoLab (1998).
[29] Ontotext, GraphDB Workbench User's Guide, accessed: 2015-09-20 (2014).
URL http://owlim.ontotext.com/display/GraphDB6/GraphDBWorkbench