Performance Analysis of Machine Learning Classifiers on
Improved Concept Vector Space Models
Zenun Kastrati∗, Ali Shariq Imran
Norwegian University of Science and Technology
Norway
Pre-Print Version - Published version available at https://doi.org/10.1016/j.future.2019.02.006
Abstract
This paper provides a comprehensive performance analysis of parametric and non-parametric machine learning classifiers, including a deep feed-forward multi-layer perceptron (MLP) network, on two variants of an improved Concept Vector Space (iCVS) model. In the first variant, a weighting scheme enhanced with the notion of concept importance is used to assess the weight of ontology concepts. Concept importance reflects how important a concept is in an ontology, and it is computed automatically by converting the ontology into a graph and then applying one of the Markov-based algorithms. In the second variant of iCVS, the concepts provided by the ontology and their semantically related terms are used to construct concept vectors in order to represent documents in a semantic vector space.
We conducted various experiments using a variety of machine learning classifiers on three different models of document representation. The first model is a baseline concept vector space (CVS) model that relies on an exact/partial match technique to represent a document in a vector space. The second and third models are iCVS models that employ, respectively, an enhanced concept weighting scheme for assessing the weights of concepts (variant 1), and the acquisition of terms that are semantically related to concepts of the ontology for semantic document representation (variant 2). Additionally, a comparison between seven different classifiers is performed for all three models using precision, recall, and F1 score. Results for multiple configurations of the deep learning architecture are obtained by varying the number of hidden layers and the number of nodes in each layer, and are compared to those obtained with conventional classifiers. The results show that classification performance is highly dependent on the choice of classifier, and that Random Forest, Gradient Boosting, and the Multilayer Perceptron are among the classifiers that performed rather well for all three models.
Keywords: document representation, CVS, iCVS, document classification, deep learning,
ontology
1. Introduction
The global Internet population reached 3.8 billion in 2017, up from 3.4 billion the year before, which amounts to 47% of the world's population [1]. According to IBM [2], 2.5 quintillion bytes of data were produced daily in 2013, when there were only around 2.7 billion Internet users. This number is expected to grow in the coming years, and with it the amount of data produced. By 2020, it is estimated that around 1.7 MB of data will be created every second for every person on earth.
The penetration of the Internet of Things (IoT) and smart gadgets into households, and the huge amount of data produced every minute as a result, have created a need for better organization and structuring of data, which according to [3] is mostly unstructured. Despite the computational resources available nowadays, organizing and structuring such a tremendous amount of data is not a trivial task, and without it, finding and extracting useful information from massive Internet resources is a challenge [4]. Nearly 3.87 million Google searches are conducted every minute of the day [1]. Finding relevant information for every query from a plethora of resources is a challenging task. For text-based documents, ontologies can play a vital role in this regard [5].
An ontology is a data representation technique that not only helps better organize data but also helps categorize and classify data objects for easy search and retrieval. Many text document classification approaches widely employ ontologies to classify and organize text-based documents. A text document is generally represented by a vector space model [6]. A vector space model is a feature vector representation constructed from the terms/words occurring in a document and their corresponding weights. Each term denotes a dimension in the vector space and is independent of the other terms in the same document. This representation technique is based on string literals and fails to consider the order of words and the semantic relationships between them, i.e. taxonomic and non-taxonomic relations. To overcome these issues, conceptual space document representation emerged as a means to take advantage of the wide coverage of concepts and relations provided by ontologies. In a conceptual space representation, a document is represented as a vector comprised of concepts (rather than words) and their weights. Concepts are identified and located in a document through a matching technique which links the terms appearing in that document with the concepts in the ontology. In fact, the link between a term t and a concept c is a mapping denoted by ⟨t, c⟩ in which the textual description defined in the label of t is replaced with the textual description defined in the label of c. The weights of concepts are defined by counting the occurrences of the concepts within a document,
i.e. concept relevance. Researchers in [7, 8, 9, 10, 11, 12] have widely used the concept vector space model for document classification. Even though this approach has proven useful for document classification in many domains, it has some limitations. Two major limitations of this approach are: 1) it relies on an exact match technique in which a document is represented in a vector space using concept vectors built by mapping terms occurring in a document with concepts appearing in an ontology, and 2) its weighting technique treats all concepts as equally important, regardless of where the concepts are depicted in the hierarchy of the ontology [13]. The importance is not equal for all concepts; it depends on the relations of a concept with other concepts in the ontology hierarchy. Concepts which have more relations with other concepts are more important than concepts which have fewer relations [14].
∗Corresponding author.
Email address: zenun.kastrati@ntnu.no (Zenun Kastrati)
These limitations are addressed in this paper by proposing an improved concept vec-
tor space model in which
1. a weighting technique enhanced with a new concept importance parameter is used to assess the weight of ontology concepts. Concept importance in our case is computed automatically by first converting the ontology into an ontology graph and then applying one of the Markov-based algorithms, namely PageRank. The obtained importance is then aggregated with the concept relevance to obtain the final weight of that particular concept.
2. concept vectors used to represent the document in a semantic vector space are constructed by using concepts provided by the ontology through the exact match technique and by acquiring terms that are related and can be attached to concepts of that ontology.
The rest of the paper is structured as follows. Section 2 describes related work. Section
3 gives an overview of the proposed architecture and presents a detailed description of
our proposed concept vector space model. Section 4 describes the concept importance
calculation procedure and presents the performance of conventional and deep machine
learning classifiers on the INFUSE dataset for classifying funding documents into five distinct categories. Lastly, Section 5 concludes the paper and gives an insight into future work.
2. Related Work
The field of document classification has attracted a lot of attention in recent years,
thereby resulting in a wide variety of approaches. Depending on the vector space document representation model employed, these approaches fall into two main categories relevant to the classification task: 1) the keyword-based vector space approach, and 2) the concept (ontology) based vector space approach.
The first approach relies on a set of terms (words) extracted from the documents in the dataset. This approach has some limitations, as it does not consider the dependency between terms and it also ignores the order and the syntactic structure of the terms in the documents. To overcome these limitations, the concept-based vector space approach comes into play. This approach relies on a set of concepts taken from an ontology to derive the semantic representation of documents. There is some research work in which concepts provided by ontologies are used for semantic document representation. One example is presented in [15], in which the authors introduced a classification approach that
relies on a document representation model constructed using concepts gathered from a domain ontology. In particular, a domain ontology for Health, Safety, and Environment in oil and gas application contexts is used for classifying documents dealing with accidents in the oil and gas industry. An extended version of the classification approach given in [15] was presented later in [9]. This extended work proposed a classification approach that employs a semantic document representation model which, besides concepts derived from the ontology, uses a list of semantically related terms. Although the approach presented in that paper is similar to our work, we differ in how we acquire semantically related terms. An extraction technique that relies on the semantic and contextual information of terms is used in our approach to find and extract the most semantically related terms, instead of the n-gram extraction technique used in [9].
The concept vector space approach employs a weighting technique for assessing the weight of concepts that relies on concept relevance as a discriminatory feature for document classification. A drawback of this weighting technique is that it considers all concepts equally, regardless of where in the hierarchy the concepts occur. There have been some efforts to determine concept importance depending on the position of concepts in the hierarchy. For instance, researchers in [16] used three different weights for concepts depending on the position where they occur in the ontology hierarchy. The first weight was assigned to concepts occurring as classes, the second to concepts occurring as subclasses, and the third to concepts occurring as instances. The values of these weights were set empirically through trial and error by conducting experiments: 0.2 for concepts which occur as classes, 0.5 for concepts occurring as subclasses, and 0.8 when concepts occur as instances.
A slightly different approach to computing weights is implemented in [17, 18], where layers of the ontology tree are used to represent the position of concepts in the ontology. The weight of each concept is then computed by counting the length of the path from the root node to the given concept. The same approach of using layers for calculating the weight values of concepts is used in [19]. Path length is also used there to compute the weight of concepts, but rather than considering all ontology concepts, only the leaf concepts are used. The idea behind this approach is that more general concepts, such as superclasses, are implicitly taken into account through the use of leaf concepts by distributing their weights to all of their subclasses down to the leaf concepts in equal proportion.
The drawback of the approaches presented above is that they compute concept weights either empirically through trial and error, thus keeping the weights fixed, or by using path length. Furthermore, the approach presented in [19] uses only the top-level ontology for computing weights. Our approach uses the Markov-based PageRank algorithm to compute concept importance. The algorithm uses all concepts of the ontology, and the importance of a concept is computed relative to all other concepts in the ontology.
From a classification perspective, the studies presented above have not established well the representation of documents, which is one of the main aspects that influences the performance of ontology-based classification models. Documents are represented as vectors containing the relevance of concepts gathered from an ontology by searching only for the presence of their lexicalizations (concept labels) in the documents. As a result, such classification models cannot capture the whole conceptualization involved in documents.
Another strand of research covers work related to the use of machine learning approaches for document classification. For instance, the authors in [8] proposed a machine learning based classification approach for understanding sentiment by differentiating good news from bad news. This is achieved using vector space document representations learned by deep learning and convolutional neural networks, with a test accuracy of 85%. Another example, using a convolutional recurrent deep learning model for classification, is proposed in [7]. This approach is similar to our work, but our focus is on the classification of documents instead of sentences, and we use feature vectors constructed from concepts derived from an ontology.
3. Architecture of the Proposed Model
The main goal of the proposed model, shown in Figure 1, is the classification of image and textual documents using an improved concept vector space which relies on semantically rich document representations and an enhanced concept weighting scheme. An image document in our case is a video frame containing handwritten lecture notes on the chalkboard, extracted from a lecture video using image processing techniques, while the textual documents are financial documents stored in PDF format. The model consists of seven main modules that are described in the following subsections.
3.1. Text Analysis Module - TAM
The input of the proposed classification model is a collection of documents that can be stored either as unstructured textual data or as images. If the input is a document image, it initially goes through a text analysis module called TAM to extract the text from that image. The TAM module itself consists of three steps, of which preprocessing is the very first; it ensures that the image has readable text. The readable quality of text in a document image is mostly affected by blocking and blurring artifacts resulting from compression and denoising. These text readability issues are avoided by using a metric designed for evaluating text quality, called a reference-free perceptual quality metric (RF-PQM) [20]. The image is then converted into binary format using the Otsu technique [21], and text regions are localized using a 4-connected-component based labelling approach, as illustrated in Figure 2.
The next step of the TAM module is the segmentation and extraction of text lines from the connected components obtained as blobs after localization, followed by the extraction of words using a vertical 1-D projection histogram. We assume that the text documents obtained at this stage are correct, since the evaluation of TAM itself is beyond the scope of this paper. Readers are therefore advised to refer to [22] and [23] for full details on the TAM module.
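The binarization step can be illustrated with a from-scratch sketch of Otsu's method. This is not the paper's implementation (which operates on real video frames and also applies RF-PQM and connected-component labelling); the toy pixel list below is invented to show only the thresholding idea.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold that maximizes between-class variance (Otsu)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(levels))
    sum_bg, w_bg, best_t, best_var = 0.0, 0, 0, -1.0
    for t in range(levels):
        w_bg += hist[t]                 # background (<= t) pixel count
        if w_bg == 0:
            continue
        w_fg = total - w_bg             # foreground (> t) pixel count
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# A bimodal toy "image": dark chalkboard background vs. bright chalk strokes.
pixels = [20] * 50 + [30] * 40 + [200] * 20 + [220] * 15
t = otsu_threshold(pixels)
binary = [1 if p > t else 0 for p in pixels]
```

The returned threshold falls in the valley between the two intensity clusters, so the chalk strokes map to foreground and the board to background.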
Figure 1: Architecture of the proposed classification model
3.2. Preprocessing
This module takes as input text documents extracted from image documents by the TAM module and/or a collection of documents stored in unstructured textual formats, e.g. Word, PDF, PowerPoint slides, etc. These text documents undergo preprocessing steps including morpho-syntactic analysis. The first preprocessing step is tokenization, in which the text is split into small pieces known as tokens. Next, stop words and duplicate words are removed, and finally stemming is performed to normalize the retrieved words.

Figure 2: Labelling approach using 4 connected-components

The output of this module is a collection of documents composed of plain text with no semantics associated with them; it is linked directly to the concept extraction module in order to embed semantics into those text documents.
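The preprocessing steps above can be sketched as follows. The stop-word list and the crude suffix stripper are illustrative stand-ins; the paper does not specify which stop-word list or stemming algorithm it uses.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to", "for"}  # illustrative subset

def simple_stem(word):
    # Crude suffix stripping for illustration only; not the paper's stemmer.
    for suffix in ("ings", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    seen, unique = set(), []
    for t in tokens:                                      # duplicate removal
        if t not in seen:
            seen.add(t)
            unique.append(t)
    return [simple_stem(t) for t in unique]               # stemming

words = preprocess("The funding programmes are announcing new calls for funding.")
```

The output is a deduplicated list of normalized tokens, ready for the concept extraction module.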
3.3. Concept Extraction
The concept extraction module is concerned with the construction of feature vectors. A feature vector is an n-dimensional vector comprised of concepts provided by domain ontologies, so as to move from a keyword-based vector representation towards a semantic-based vector representation. To achieve this step toward semantic representation, we primarily need to associate terms extracted from documents with concepts of the ontology. Terms are located and extracted from documents using a Lucene inverted indexing technique, which generates a list of all unique terms that occur in any document together with the set of documents in which these terms occur. The extracted terms are stemmed using a stemming method. Further, noisy terms, i.e. single-character terms, are removed from the list of extracted terms. The extracted terms are associated with the concepts of the ontology using: 1) a matching method in which terms appearing in a document are mapped to the relevant concepts from the domain ontology, and 2) the acquisition of relevant terminology that is semantically related and can be attached to concepts of that domain ontology.
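The inverted index can be approximated with a few lines of plain Python; the actual system uses Lucene itself, so this toy index only illustrates the term-to-documents mapping and the single-character filter.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each unique term to the set of document ids it occurs in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            if len(term) > 1:          # drop single-character noisy terms
                index[term].add(doc_id)
    return dict(index)

# Invented mini-corpus for illustration.
docs = {
    "d1": "funding call for research projects",
    "d2": "project funding announced",
}
index = build_inverted_index(docs)
```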
The matching method [12] follows the idea of searching for concepts in the domain ontology that have labels matching either partially or exactly/fully with a term occurring in a document. Put simply, each term identified and located in a document is searched for in the domain ontology, and if the term matches a concept label, then the term is replaced with the concept. All lexical entries and lexical variations contained in a concept are considered concept labels. The obtained concepts are used to construct concept vectors. An exact match is the case where a concept label is identical to a term occurring in the document. A partial match is the case where a concept label contains a term occurring in the document. The exact and partial match are formally defined as follows.
Definition 1 Let Ont be the domain ontology and let D be the dataset composed of documents of this given domain. Let Doc ∈ D be a document defined by a finite set of terms Doc = {t1, t2, ..., ti}. The mapping of a term ti ∈ Doc into a concept cj ∈ Ont is defined as:

EM(ti, cj) = 1, if label(cj) = ti; 0, if label(cj) ≠ ti

PM(ti, cj) = 1, if label(cj) contains ti; 0, if label(cj) does not contain ti

where EM and PM denote exact match and partial match, respectively.
If EM(ti, cj) = 1, the term ti and the concept label cj are identical, and the term ti is replaced with the concept cj. For example, for concepts in the ontology such as Organization or Call, shown in Figure 3, there exist identical terms appearing in the document. If PM(ti, cj) = 1, the term ti is part of the concept label cj, and ti is replaced with cj. For example, the compound ontology concept ProjectFunding, shown in Figure 3, contains terms that appear in the document, such as Project and/or Funding.
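The exact and partial match of Definition 1 can be sketched directly. The concept labels below are taken from the Figure 3 examples; the fallback of keeping an unmatched term is an assumption of this sketch, not something the paper specifies.

```python
def exact_match(term, concept_label):
    # EM(t, c) = 1 iff the concept label is identical to the term.
    return concept_label.lower() == term.lower()

def partial_match(term, concept_label):
    # PM(t, c) = 1 iff the concept label contains the term.
    return term.lower() in concept_label.lower()

def map_terms_to_concepts(terms, concept_labels):
    """Replace each document term by a matching ontology concept, if any."""
    mapped = []
    for t in terms:
        for c in concept_labels:
            if exact_match(t, c) or partial_match(t, c):
                mapped.append(c)
                break
        else:
            mapped.append(t)  # no matching concept: keep the raw term
    return mapped

concepts = ["Organization", "Call", "ProjectFunding"]
result = map_terms_to_concepts(["call", "funding", "budget"], concepts)
```

Here "call" matches Call exactly, "funding" is contained in the compound label ProjectFunding, and "budget" has no match.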
Extraction of concepts through the acquisition of relevant terminology that is related and can be attached to ontology concepts is a more complex task, achieved by exploiting both the contextual and the semantic information of terms occurring in a document.
Contextual information of a term is defined by its surrounding words, and it is computed using Equation 1.

Context(ti, tj) = (ti · tj) / (||ti|| ||tj||)    (1)
The vectors ti and tj are composed of values derived from three statistical features, namely term frequency, term font type, and term font size. Different font types, i.e. bold, italic, underline, and font sizes, i.e. title, level 1, level 2, are introduced to derive the context. In our case, the values of these statistical features are extracted from the input PDF documents using the Apache PDFBox library, an open-source Java library which allows the creation of new PDF documents, the manipulation of existing documents, and the extraction of content from documents.
Semantic information of a term is calculated using a semantic similarity measure based on the English lexical database WordNet. The Wu&Palmer similarity measure [24] is employed to compute a semantic score (Eq. 2) for all possible pairs of terms ti and tj occurring in a document.

Semantic(ti, tj) = 2 · depth(lcs) / (depth(ti) + depth(tj))    (2)

The parameter depth(lcs) denotes the depth of the least common subsumer of the terms ti and tj, and the parameters depth(ti) and depth(tj) denote the depths of the paths of ti and tj in WordNet.
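Equation 2 can be illustrated on a toy is-a hierarchy standing in for WordNet; a production system would query WordNet itself rather than this hand-built parent map, and the node names here are invented.

```python
def depth(node, parent):
    """Depth of a node: the root has depth 1."""
    d = 1
    while node in parent:
        node = parent[node]
        d += 1
    return d

def lcs(a, b, parent):
    """Least common subsumer: the deepest shared ancestor of a and b."""
    ancestors = set()
    n = a
    while True:
        ancestors.add(n)
        if n not in parent:
            break
        n = parent[n]
    n = b
    while n not in ancestors:
        n = parent[n]
    return n

def wu_palmer(a, b, parent):
    """Eq. 2: 2 * depth(lcs) / (depth(a) + depth(b))."""
    l = lcs(a, b, parent)
    return 2 * depth(l, parent) / (depth(a, parent) + depth(b, parent))

# Toy is-a hierarchy (child -> parent) standing in for WordNet.
parent = {"company": "organization", "agency": "organization", "organization": "entity"}
score = wu_palmer("company", "agency", parent)
```

Here both terms sit at depth 3 and their least common subsumer at depth 2, giving 2·2/(3+3) = 2/3.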
Combining the contextual and semantic information gives an aggregated score, as shown in Equation 3.

AggregatedScore(ti, tj) = λ · Context(ti, tj) + (1 − λ) · Semantic(ti, tj)    (3)

where λ is set to 0.5, giving an equal contribution of the context and semantic components to the aggregated score.
The aggregated score, through a rank cut-off method, is used to acquire terms that are related to concepts of the ontology. More concretely, terms that are above a specified threshold (top-N) are considered relevant.
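Equation 3 and the rank cut-off can be sketched as follows; the candidate terms and their component scores are invented for illustration.

```python
def aggregated_score(context, semantic, lam=0.5):
    """Eq. 3: convex combination of contextual and semantic similarity."""
    return lam * context + (1 - lam) * semantic

# Hypothetical candidate terms scored against an ontology concept.
candidates = {
    "grant": aggregated_score(0.81, 0.74),
    "loan": aggregated_score(0.40, 0.55),
    "weather": aggregated_score(0.05, 0.10),
}

# Rank cut-off: keep only the top-N candidates by aggregated score.
top_n = 2
related = sorted(candidates, key=candidates.get, reverse=True)[:top_n]
```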
3.4. Domain Ontology
This module covers the domain ontology, which interfaces with the Term-to-Concept mapper component in the concept extraction module and with the weighting scheme module. A domain ontology is a data model which represents the concepts and the relations between them in a given domain. An ontology structure is formally represented by a 5-tuple [25], as shown in Equation 4.

Ont := (C, R, HC, rel, A)    (4)

where,
• C is a set of concepts, e.g. Funding, Call;
• R is a set of relations, e.g. announces, promotes;
• HC is a hierarchy or taxonomy of concepts with multiple inheritance, e.g. ProgrammeFunding isa Funding and FinanceProgramme isa ProgrammeFunding;
• rel is a set of non-taxonomic relations which are described by their domain and range restrictions, e.g. isReceivedBy, appliesFor;
• A is a set of ontology axioms, expressed in an appropriate logical language, which describe additional constraints.
The ontology definition shown in Equation 4 can be made domain specific by defining a lexicon, which is a 3-tuple Lex := (L, F, G) consisting of a set of lexical entries L for concepts and relations, and two sets F and G that link concepts and relations with their lexical entries.
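The 5-tuple of Equation 4 can be rendered as a small in-memory structure. The example entries echo the bullets above; the dictionary layout and the extra Applicant concept are illustrative assumptions, since the actual ontology is expressed in OWL/RDF rather than Python.

```python
# A minimal in-memory rendering of the 5-tuple Ont := (C, R, HC, rel, A).
ontology = {
    "C": {"Funding", "Call", "ProgrammeFunding", "FinanceProgramme", "Applicant"},
    "R": {"announces", "promotes"},
    # HC as (subclass, superclass) pairs: ProgrammeFunding isa Funding, etc.
    "HC": [("ProgrammeFunding", "Funding"),
           ("FinanceProgramme", "ProgrammeFunding")],
    # Non-taxonomic relations with (domain, range) restrictions.
    "rel": {"isReceivedBy": ("Funding", "Applicant"),
            "appliesFor": ("Applicant", "Call")},
    "A": [],  # axioms omitted in this sketch
}

def superclasses(concept, hc):
    """Walk the taxonomy HC upward from a concept to the root."""
    lookup = dict(hc)
    chain = []
    while concept in lookup:
        concept = lookup[concept]
        chain.append(concept)
    return chain
```

Walking HC upward from FinanceProgramme yields ProgrammeFunding and then Funding, mirroring the multiple-inheritance example above.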
3.5. Weighting Scheme
The weight of a concept is a numeric value assigned to each concept in order to assess its power in distinguishing a particular document from others. A technique used to compute the weight of concepts is known as a concept weighting scheme. Various weighting schemes exist that typically rely on the relevance of concepts, reflected by the frequency of occurrences of a concept's lexicalizations within a document. In this module we present an enhanced concept weighting scheme which, besides concept relevance, introduces a new parameter called concept importance that reflects the contribution of a concept in the ontology. Concept importance is computed offline and involves the following steps: 1) mapping the domain ontology into an ontology graph, 2) applying a Markov-based algorithm, and 3) calculating concept importance and aggregating it with concept relevance.
The first and foremost step of this module is to convert the domain ontology described in subsection 3.4 into an ontology graph for calculating concept importance. To achieve this, we adopt a model in which the ontology is represented as a directed acyclic graph. The modelling is an equivalent mapping, which means that an ontology concept is mapped into a graph vertex and an ontology relation into a graph edge connecting two vertices. The formal definition of this graph, known as the ontology graph, is given as follows.

Definition 2 Given a domain ontology Ont, the ontology graph G = {V, E, f} of Ont is a directed acyclic graph, where V is a finite set of vertices mapped from concepts in Ont, E is a finite set of edge labels mapped from relations in Ont, and f is a function from E to V × V. □
In Figure 3, we present part of the INFUSE ontology graph which consists of a subset
of concepts and relations from the funding domain. The details of the INFUSE ontology
are given in Section 4.
In the semantic web, the formal syntaxes for defining ontologies are the Web Ontology Language (OWL) and the Resource Description Framework (RDF) Schema. These languages represent an ontology as a set of Subject-Predicate-Object (SPO) expressions known as RDF triples. A set of RDF triples is known as an RDF graph, where the subject is the source vertex, the object is the destination vertex, and the predicate is a directed edge label which links those two vertices. The formal definition of an RDF graph is given as follows.
Figure 3: A part of INFUSE ontology graph
Definition 3 Given a set of RDF triples T, the RDF graph G = {V, E, f} of T is a directed acyclic graph, where V is a finite set of vertices (subjects and objects) in G defined as V = {vu : u ∈ (S(T) ∪ P(T))}, E is a finite set of edge labels (predicates) in G defined as E = {eSPO : SPO ∈ T}, and f is a function linking a subject S to an object O by an edge E, defined as f = {fP : fP = VS → VO, VS, VO ∈ T}. □
The ontology graph and the RDF graph are not the same for a given ontology. The difference is that a relation in an ontology graph is represented as a vertex in the RDF graph. For example, the relation isReceived in the ontology graph shown in Figure 3 is represented as a vertex in the RDF graph, as shown in Figure 4. In other words, a relation in an RDF graph is a link between a subject, denoted by the rdfs:domain property, and an object, denoted by the rdfs:range property, as given in Definition 3.
Figure 4: An example RDF graph representation
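The triple-to-graph reading of Definition 3 can be sketched as follows. The triples are invented from the funding-domain vocabulary; a real pipeline would parse OWL/RDF with a proper library rather than plain tuples.

```python
# SPO triples from the funding domain; edge labels are the predicates.
triples = [
    ("Applicant", "appliesFor", "Call"),
    ("Funding", "isReceivedBy", "Applicant"),
    ("Programme", "announces", "Call"),
]

def to_graph(triples):
    """Build subject -> [(predicate, object)] adjacency from RDF triples."""
    graph = {}
    for s, p, o in triples:
        graph.setdefault(s, []).append((p, o))
    return graph

graph = to_graph(triples)
```

Each subject becomes a source vertex, each object a destination vertex, and the predicate labels the directed edge between them.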
The next step is the computation of the importance of the vertices of the graph using an adaptation of Markov-based algorithms. The graph can be either an ontology graph or an RDF graph, as defined above. The idea behind Markov-based algorithms is to represent the graph as a stochastic process, more concretely as a first-order Markov chain, where the importance of a given vertex is defined as the fraction of time spent traversing that vertex in an infinitely long random walk over the vertices. The probability of transitioning from a vertex i to a vertex j depends only on the vertex i and not on the path taken to arrive at vertex j. This property, known as the Markov property, enables the transition probabilities to be represented as a stochastic matrix with non-negative entries and a maximum probability of 1.
In this paper, we use the PageRank algorithm [26], one of the most well-known and successful examples of Markov-based algorithms [27].
A simplified description of how the PageRank algorithm works is as follows. It initially defines the importance of a vertex i as given in Equation 5.

PR(i) = Σ_{j ∈ Vi} PR(j) / Outdegree(j)    (5)

where PR(j) is the importance of vertex j, Vi is the set of vertices that link to vertex i, and Outdegree(j) is the number of outgoing links from vertex j.
As can be seen from Equation 5, PageRank is an iterative algorithm. It assigns an initial importance to each vertex i as shown in Equation 6.

PR(0)(i) = 1/N    (6)
where N is the total number of vertices in the graph. PageRank then iterates as per Equation 7 and continues until a convergence criterion is satisfied.

PR(k+1)(i) = Σ_{j ∈ Vi} PR(k)(j) / Outdegree(j)    (7)
The process can also be defined using matrix notation. Let M be the square, stochastic transition probability matrix corresponding to the directed graph G, and let PR(k) be the importance vector at the kth iteration. The computation of one iteration then corresponds to the matrix-vector multiplication shown in Equation 8.

PR(k+1) = M · PR(k)    (8)

The entry of the transition probability matrix M, for a vertex j which links to vertex i, is defined using Equation 9.

p(i,j) = 1/Outdegree(j), if there is a link from j to i; 0, otherwise    (9)
Two properties are necessary for a Markov-based algorithm to converge: it should be aperiodic and irreducible [28]. The transition probability matrix M is a stochastic matrix with total probability 1, which makes the PageRank algorithm aperiodic. The PageRank algorithm is not irreducible, however, due to the definition given in Equation 9, where some of the transition probabilities in the matrix M may be 0. This violates the irreducibility property, which requires the transition probabilities to be greater than 0.
To make the PageRank algorithm irreducible so that it converges, a damping factor 1 − α is introduced. As a result, a new transition probability matrix M* is defined, in which a complete set of outgoing edges with probability α/N is added to all vertices in the graph. The definition of the matrix M* is given in Equation 10.

M* = (1 − α)M + α [1/N]_{N×N}    (10)

Besides enabling the PageRank algorithm to converge, the damping factor also overcomes the problem of rank sinks [28].
Replacing M with M* in Equation 8, the PageRank algorithm is defined as given in Equation 11.

PR(k+1) = (1 − α)M · PR(k) + α [1/N]_{N×N} · PR(k)    (11)
Finally, concept importance is defined as given in Equation 12.

Imp(ci) = PR(k+1)(i)    (12)
The final step of this module is the aggregation of concept importance and concept relevance to compute the weight of concepts. The value of a concept weight is in the range [0, 1] because both concept importance and concept relevance are normalized.

w(ci) = Imp(ci) × Rel(ci)    (13)

Concept importance Imp is computed using Equation 12 described above, while concept relevance Rel is computed using Equation 14.

Rel(ci) = Σ_{i=1}^{m} Freq(ci)    (14)

where Freq(ci) is the frequency of occurrences of lexicalizations of concept ci in the document to be classified.
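Equations 13 and 14 combine as a simple product. A minimal sketch using the Imp and Rel values of document d1 in Table 1:

```python
def concept_weight(importance, relevance):
    """Eq. 13: final weight as the product of importance and relevance."""
    return importance * relevance

# Values reproduced from Table 1, document d1.
w_geo = concept_weight(0.130, 0.797)   # GeographicalArea
w_app = concept_weight(0.020, 0.797)   # Applicant
```

Both terms occur equally often in d1 (Rel = 0.797), yet GeographicalArea receives a much larger weight because its ontology-level importance is higher, which is exactly the effect the enhanced scheme is designed to produce.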
3.6. Document Representation
The output of both modules, concept extraction and weighting scheme, will serve
as an input to semantic document representation module for representing a document.
More concretely, concepts obtained from concept extraction module and their weights
computed through weighting scheme module are used to represent a document in a vec-
tor space as defined in Equation 15.
Doc ={(c1, w1),(c2, w2),(c3, w3), ..., (ci, wi)}(15)
where ciis the ith concept obtained from concept extraction module and wnis its weight
computed from weighting scheme module.
Table 1 illustrates an example of semantic document representation through a vector space constructed using the concepts GeographicalArea and Applicant, whose weights are composed of two components, importance (Imp) and relevance (Rel), as described in subsection 3.5.

Table 1: An example of building a concept vector space

         GeographicalArea          Applicant
Doc    Imp    Rel    w           Imp    Rel    w
d1     0.130  0.797  0.104       0.020  0.797  0.016
d2     0.130  0.624  0.081       0.020  0.624  0.012
d3     0.130  0.000  0.000       0.020  0.860  0.017
3.7. Document Classification
The last module of the proposed model deals with the classification of documents into appropriate categories using conventional machine learning classifiers and deep learning. In essence, a document represented via the concept vector space is fed into a classifier to build a prediction model that can be used to classify new, unseen documents.
4. Results and Analysis
This section describes the calculation of concept importance for a real-world ontology. It also describes the dataset used in the experiments that demonstrate the applicability of our proposed document representation models. Finally, it provides a thorough comparison of document classification results achieved using both conventional machine learning techniques and deep networks.
4.1. Concept Importance Calculation
A real-world domain ontology called the INFUSE ontology is used for computing concept
importance. This ontology was developed as part of the INFUSE^1 project and comes
from the funding domain. It is composed of 85 concepts, e.g. Funding and GrantSeeker, and
18 object properties, e.g. isGivenBy and appliesFor, that connect these concepts. A part of
the INFUSE domain ontology represented as an ontology graph is shown in Figure 3.
To convert the ontology into an ontology graph and compute the concept importance,
we have used the RDF rank algorithm. This algorithm is part of the extensions module of
GraphDB [29] and it computes the importance for every vertex in the entire RDF graph.
Table 2 shows the concept importance values of the top ten concepts of the INFUSE on-
tology. The concept importance is a floating point number with values varying between
0 and 1.
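GraphDB's RDF rank implementation is not reproduced here, but the underlying PageRank-style importance computation can be sketched in a few lines of pure Python on a toy ontology graph (the concept names below are illustrative; the edge set is not the actual INFUSE ontology):

```python
# PageRank-style importance over a toy ontology graph. Edges follow
# object properties, e.g. Applicant --appliesFor--> Funding. The graph
# below is a hypothetical illustration, not the real INFUSE edge set.

edges = {
    "Applicant": ["Funding"],
    "Funding": ["GeographicalArea", "Topic"],
    "GeographicalArea": ["Coverage"],
    "Topic": ["Coverage"],
    "Coverage": ["GeographicalArea"],
}

def pagerank(edges, damping=0.85, iterations=50):
    nodes = list(edges)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node keeps the teleport share, then receives rank
        # distributed over the out-links of its predecessors.
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, targets in edges.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new[dst] += share
        rank = new
    return rank

ranks = pagerank(edges)
# Heavily linked-to concepts (here Coverage) end up with the highest rank.
```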
Table 2: Concept importance for the top ten concepts of the INFUSE ontology

No  Concept           Concept Importance
1   Coverage          0.20
2   GeographicalArea  0.13
3   Topic             0.11
4   County            0.07
5   Participant       0.06
6   Programme         0.05
7   Organisation      0.05
8   Funding           0.05
9   Applicant         0.04
10  Candidate         0.04
Figure 5 shows the concept importance values in ranking order after having computed
them for all the concepts of the INFUSE ontology. As can be seen from the chart, the
concept importance differs between concepts, varying from 0.20 down to 0.02 for almost
half of the concept set, while for the remaining concepts it is 0.01. These findings confirm
the idea that the contribution of ontology concepts, in terms of their discriminating power,
is different, and thus some concepts are more important than others with respect to
document classification.
^1 https://www.eurostars-eureka.eu/project/id/7141
Figure 5: Concept importance for all concepts of the INFUSE ontology
4.2. Performance Evaluation of Baseline CVS and iCVS
In order to demonstrate the general applicability of our proposed classification model
and to validate its effectiveness, extensive experiments using various classifiers are con-
ducted on the INFUSE dataset.
The INFUSE dataset consists of 467 grant documents that had been collected and clas-
sified into 5 categories by field experts as part of the INFUSE project. The dataset is split
randomly, in which 70% of the documents are used to build the classifier and the remain-
ing 30% to test the performance of the model. The number of documents in each category
varied widely, ranging from the Society category which contains 165 documents to the
Music category which contains only 14 documents. Table 3 shows five categories along
with the number of training and testing documents in each category.
Table 3: Dataset size

No  Category       # Train  # Test  Total
1   Culture        102      44      146
2   Health         73       32      105
3   Music          10       4       14
4   Society        115      50      165
5   Sportssociety  26       11      37
    Total          326      141     467
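The random 70/30 split described above can be reproduced with a few lines of Python; the document IDs below are placeholders, since the INFUSE grant documents themselves are not public:

```python
import random

# Randomly split a document collection 70/30 into train and test sets.
# The 467 placeholder IDs stand in for the actual INFUSE grant documents.
documents = [f"doc_{i}" for i in range(467)]

random.seed(42)          # fixed seed, so the split is reproducible
random.shuffle(documents)

n_train = int(0.7 * len(documents))   # 70% of 467 documents -> 326
train_docs = documents[:n_train]
test_docs = documents[n_train:]

print(len(train_docs), len(test_docs))  # 326 141
```

These sizes match the totals reported in Table 3.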
Parametric and nonparametric machine learning techniques are used in the experiments.
A parametric machine learning technique assumes that the data can be parameterized
by a fixed number of parameters. In essence, the statistical model of a parametric tech-
nique is specified by a simplified function through two types of distributions, namely
the class prior probability and the class conditional probability density function (poste-
rior) for each dimension. On the contrary, a nonparametric machine learning technique
assumes no prior parameterized knowledge about the underlying probability density
function and the classification uses the information provided by training samples alone.
Naive Bayes is the parametric machine learning technique applied for classification in
this paper, while the nonparametric techniques applied include Decision Tree and
Random Forest. We have also chosen Support Vector Machine (SVM) for classification,
which can be either a parametric or a non-parametric technique. A linear SVM contains
a fixed number of parameters, represented by the weight coefficients, and thus belongs
to the parametric techniques. On the other hand, a non-linear SVM is a non-parametric
technique; the Radial Basis Function kernel SVM, known as RBF kernel SVM, is a typical
example of this family. In addition, we have applied two boosting techniques, namely
Gradient Boosting and Ada Boosting, which harness the power of ensemble classifiers
by generating multiple predictions and applying majority voting among the individual
classifiers.
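The majority-voting step behind such ensembles can be illustrated with a toy sketch. Real boosting additionally weights the base learners; the base predictions below are stand-ins, not the output of actual boosted trees:

```python
from collections import Counter

# Toy majority-vote ensemble: each base classifier emits a label and the
# ensemble returns the most frequent one. Boosting methods also weight
# the individual learners; this sketch shows only the voting step.

def majority_vote(predictions):
    """Return the label predicted by most base classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# Three stand-in base classifiers disagree on a document's category:
votes = ["Society", "Culture", "Society"]
print(majority_vote(votes))  # Society
```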
Additionally, a Multilayer Perceptron (MLP) is used in this study. An MLP is a feed-
forward Artificial Neural Network (ANN). Each artificial neuron in the network computes
a weighted sum of its inputs xi, adds a bias b, and applies an activation function. A simple
ANN is represented as y = f(w xi + b), where w is the weight and f is the activation
function. The most commonly used activation functions are the sigmoid, σ(z) = 1/(1 + e^(−z)),
and the rectified linear unit, ReLU(z) = max(0, z). The weight and bias terms are estimated
by training the network on the observed data to minimize a loss such as cross-entropy or
mean square error. In an MLP, the neurons are structured into layers. These layers are
fully connected, which implies that every neuron in one layer is connected to every neuron
in the adjacent layer. The input and output layers are the visible layers of the network,
while a network may contain multiple hidden layers. A network containing more than
one hidden layer is commonly known as a deep neural network.
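The single-neuron computation y = f(w x + b) and the two activation functions above can be written directly in Python:

```python
import math

# A single artificial neuron: weighted sum of the inputs plus a bias,
# passed through an activation function.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def neuron(inputs, weights, bias, activation):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

print(sigmoid(0.0))   # 0.5
print(relu(-2.0))     # 0.0
print(neuron([1.0, 2.0], [0.5, -0.25], 0.1, relu))  # relu(0.5 - 0.5 + 0.1) = 0.1
```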
The standard information retrieval measures precision, recall, and F1 are used to evaluate
the performance of the document classification. Precision is the number of correctly
classified documents with respect to all classified documents, given as tp/(tp + fp). Recall
is the number of correctly classified documents with respect to the total number of
documents in the dataset, defined as tp/(tp + fn), where tp, fp, and fn are true positive,
false positive, and false negative samples, respectively. The F1 measure is the harmonic
mean of precision and recall, defined as 2 × (precision × recall)/(precision + recall).
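These three measures follow directly from the confusion counts; a small Python sketch with illustrative numbers (not the actual per-class counts of the INFUSE experiments):

```python
# Precision, recall and F1 from confusion counts. The counts below are
# illustrative only, not results from the INFUSE dataset.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    return 2 * (p * r) / (p + r)

tp, fp, fn = 40, 10, 10          # 40 correct, 10 spurious, 10 missed
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f1(p, r))            # 0.8 0.8 0.8 (up to float rounding)
```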
The best results with the conventional machine learning techniques are obtained using
the following configurations. For the Bayesian classifier, a Gaussian NB is used, whereas
for SVM, a radial basis function (RBF) kernel SVM is used. A value of 0.001 is used for
gamma, which controls how much influence a single training sample has, and a maximum
value is set for the regularization parameter C. The depth of the tree for the RF classifier
is set to 10, which gave the best results. For all other parameters of the classifiers, default
configurations are used. For the deep learning based MLP architecture, multiple
simulations consisting of L × N configurations are carried out by varying the number of
hidden layers L and the number of neurons N in each layer, where L = {3, 5, 7} and
N = {64, 128, 256, 512, 1024}. Figure 6 shows the total number of trainable parameters
for a 5-hidden-layer MLP containing 1024 neurons in each layer. The input to the network
shown is a concept vector of size 323 for iCVS variant 2. ReLU is applied as the activation
function, Adam is used as the optimizer, and the learning rate α is set to 1e−3. A softmax
function is applied at the last layer to convert the outputs into the likelihood of a test
sample belonging to one of the 5 classes.
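The trainable-parameter count of such a fully-connected network follows from the dense-layer formula (inputs + 1) × outputs per layer. The sketch below computes it for the 5-hidden-layer, 1024-neuron configuration with a 323-dimensional input, assuming standard dense layers with biases and no extra parameters from dropout or normalization:

```python
# Count trainable parameters of a fully-connected MLP: each dense layer
# contributes (n_in + 1) * n_out parameters (weights plus biases).

def mlp_param_count(input_size, hidden_layers, neurons, num_classes):
    sizes = [input_size] + [neurons] * hidden_layers + [num_classes]
    return sum((n_in + 1) * n_out for n_in, n_out in zip(sizes, sizes[1:]))

# 323-dim iCVS variant 2 input, 5 hidden layers of 1024 neurons, 5 classes.
print(mlp_param_count(323, 5, 1024, 5))  # 4535301
```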
Figure 6: Model summary for a 5-hidden layer MLP architecture for 323 concept input vector size
with 1024 neurons.
Three different models of vector space document representation are used to test the
classifiers. In the first model, called baseline CVS, we conducted a document classification
experiment on the INFUSE dataset in which an exact/partial match technique is
employed to match terms occurring in a document with relevant concepts of the ontology,
building concept vectors that represent documents in the vector space. Precision,
recall, and F1 results obtained from six conventional machine learning techniques and
a deep MLP with different numbers of hidden layers and neurons are shown in Table 4
and Table 5, respectively. As can be seen from the results, the Gradient Boosting classifier
shows the best performance compared to the other conventional classifiers, achieving a
weighted F1 score of 82.58%. On the other hand, an MLP with 3 hidden layers and 1024
neurons in each layer outperforms the other deep network configurations, achieving an
F1 score of 80.02%.
Table 4: Performance of conventional ML techniques using baseline CVS

Technique          Precision (%)  Recall (%)  F1 (%)
Naive Bayes        67.24          60.99       61.90
Decision Tree      66.10          66.40       65.50
Random Forest      77.69          77.30       77.25
SVM                81.73          77.30       78.85
Gradient Boosting  82.99          82.26       82.58
Ada Boosting       58.61          53.90       54.69
In the second experiment, we performed document classification using the same
classifiers on the same corpus of documents from the INFUSE dataset, but employing the
second model of document representation. The second model, called iCVS variant 1, is
Table 5: Performance of MLP using baseline CVS

# hidden layers  # neurons  Precision (%)  Recall (%)  F1 (%)
3                64         79.32          78.72       78.47
3                128        77.80          78.01       77.89
3                256        77.05          77.30       77.08
3                512        79.75          79.43       79.07
3                1024       80.13          80.14       80.02
5                64         78.11          78.01       77.50
5                128        78.29          78.01       77.96
5                256        75.21          74.46       74.36
5                512        77.21          76.59       76.64
5                1024       77.87          77.30       77.24
7                64         77.99          78.01       77.77
7                128        77.93          77.30       77.40
7                256        76.53          75.58       75.89
7                512        75.00          73.75       73.92
7                1024       78.73          76.59       76.90
an enhanced concept weighting scheme used for assessing the weights of ontology
concepts. Six different conventional machine learning techniques and a Multilayer
Perceptron with different numbers of hidden layers and neurons per layer are used for
classification, and the obtained results are shown in Table 6 and Table 7, respectively. As
with the baseline CVS model, the results obtained using iCVS variant 1 show that the
Gradient Boosting classifier achieved the best performance compared to the other
conventional machine learning and deep learning techniques. In the context of deep
networks, the best performance is achieved by an MLP architecture with 7 hidden layers
and 256 neurons per layer, with an F1 score of 76.64%.
Table 6: Performance of conventional ML techniques using iCVS variant 1

Technique          Precision (%)  Recall (%)  F1 (%)
Naive Bayes        66.63          53.90       57.73
Decision Tree      69.10          70.00       68.80
Random Forest      84.54          80.85       82.07
SVM                66.65          53.19       56.64
Gradient Boosting  83.06          81.56       82.14
Ada Boosting       61.72          60.28       60.33
The iCVS variant 2 model is also evaluated in a similar fashion. In this model, concept
vectors for representing documents in the vector space are built through the acquisition
of new terms that are semantically related and can be attached to concepts of the ontology. In
Table 7: Performance of MLP using iCVS variant 1

# hidden layers  # neurons  Precision (%)  Recall (%)  F1 (%)
3                64         72.84          73.04       72.77
3                128        67.40          69.50       68.22
3                256        71.86          70.92       71.29
3                512        73.69          73.75       73.55
3                1024       73.35          73.04       72.81
5                64         70.33          69.50       69.53
5                128        72.65          73.04       72.77
5                256        72.16          72.34       72.16
5                512        68.30          68.79       68.23
5                1024       73.14          73.04       72.82
7                64         66.55          68.08       66.46
7                128        67.79          69.50       68.30
7                256        76.82          76.59       76.64
7                512        77.10          75.17       75.87
7                1024       73.48          73.75       73.47
our case, for each concept of the INFUSE ontology we used only the top-5 terms found
to be relevant in terms of relatedness. For example, fund, amount, part, subsistence, and
grant are the top-5 terms found to be the most semantically related to the ontology
concept Funding. The performance of document classification, in terms of precision,
recall, and F1 measure, achieved by six conventional machine learning techniques and a
Multilayer Perceptron with different numbers of hidden layers and neurons, is given in
Table 8 and Table 9, respectively. As can be seen from the results shown in Table 8 and
Table 9, the best performing classifier is an MLP with three hidden layers and 64 neurons
in each layer, achieving an F1 score of 84.98%, which is slightly better than SVM with an
F1 score of 84.11%.
Table 8: Performance of conventional ML techniques using iCVS variant 2

Technique          Precision (%)  Recall (%)  F1 (%)
Naive Bayes        67.02          65.95       65.28
Decision Tree      79.20          77.90       76.70
Random Forest      77.04          74.46       75.06
SVM                85.66          83.68       84.11
Gradient Boosting  84.35          83.68       83.96
Ada Boosting       69.79          60.99       62.56
A side-by-side comparison of the three models is illustrated in Figure 7. The figure
presents a complete picture of the performance of conventional machine learning and
Table 9: Performance of MLP using iCVS variant 2

# hidden layers  # neurons  Precision (%)  Recall (%)  F1 (%)
3                64         85.05          85.10       84.98
3                128        80.12          80.14       79.55
3                256        79.04          79.43       78.79
3                512        81.47          81.56       81.29
3                1024       81.68          82.26       81.80
5                64         80.11          80.85       80.11
5                128        78.07          79.43       78.06
5                256        80.76          80.85       80.34
5                512        78.79          78.72       77.82
5                1024       78.42          79.43       78.58
7                64         77.68          78.01       77.21
7                128        80.76          80.85       80.47
7                256        77.50          77.30       77.17
7                512        82.99          83.68       83.07
7                1024       81.69          82.26       81.57
deep learning techniques on the INFUSE dataset for the proposed models. The bar chart
shows the weighted F1 score obtained by the conventional machine learning techniques,
namely Naive Bayes (NB), Decision Tree (DT), Random Forest (RF), Support Vector
Machine (SVM), Gradient Boosting (GB), and Ada Boosting (AB), and by a Multilayer
Perceptron (MLP) with 3 hidden layers and 64 neurons per layer, tested on three different
models of document representation.
As can be seen from the results shown in Figure 7, a higher weighted classification
F1 score is achieved by all classifiers using iCVS variant 2. An exception is Random
Forest, which gives slightly worse classification performance than the other classifiers.
Random Forest is an ensemble method that applies the same decision tree classifier to
different training sets generated using the bootstrap sampling method. In bootstrap
sampling, a new training set is created by drawing data from the original training set
with replacement; thus some data may be used several times to construct the forest and
other data not at all. This may be one of the reasons this classifier performs worse.
It is also interesting to note from Figure 7 that, in general, the MLP classifier outperforms
all conventional machine learning classifiers, achieving a classification F1 score of
84.98%. On the other hand, the worst performance is shown by the Naive Bayes classifier,
which may be due to the imbalanced classes of the INFUSE dataset. Imbalanced classes
may bias the classifier towards the majority class, and thus the performance of the Naive
Bayes classifier can quickly deteriorate.
Another interesting observation from the bar chart shown in Figure 7 is that the iCVS
variant 1 model has a different impact on the performance of different classifiers. While
Figure 7: F1 measure of different classifiers using exact/partial match (baseline CVS), enhanced
weighting scheme (iCVS variant 1), and acquisition of related terms (iCVS variant 2)
the nonparametric and boosting machine learning techniques show a positive impact on
document classification using iCVS variant 1, the parametric techniques and the MLP
show a negative impact on classification performance, yielding worse accuracy.
5. Conclusion and Future Work
In this paper, we have investigated and analysed document classification performance
using a concept vector space model improved with a new concept weighting scheme and
a semantic document representation. The concept weighting scheme is enhanced with a
new parameter that takes into account the importance of ontology concepts. Concept
importance is computed automatically by converting the ontology into a graph and then
employing the PageRank algorithm on it. The importance of an ontology concept is then
aggregated with the concept relevance, which is computed using the frequency of
appearances of the concept in the document. A semantic representation of a document is
achieved using concepts derived from the ontology through a matching technique and
the acquisition of new terms that are semantically related to ontology concepts.
We conducted various document classification experiments on three models of document
representation, i.e. the baseline CVS model and the iCVS model with two variants.
Additionally, a comparison between seven different classifiers is performed for all three
models using precision, recall, and F1 score. For all three models, Random Forest,
Gradient Boosting, and the Multilayer Perceptron performed rather well. Furthermore, a
thorough investigation is carried out to evaluate the performance of the MLP by varying
the number of hidden layers and the number of neurons in each layer. A three-hidden-layer
MLP with 64 neurons per layer achieves higher classification performance compared to
other architecture configurations.
Generally, iCVS variant 1, which employs an enhanced weighting scheme for assessing
the weights of concepts, did not add much to the overall performance, except for Random
Forest, which gave better results than with the baseline CVS and iCVS variant 2, with an
F1 score of just over 81%. Our findings showed that adding more concepts to the ontology
improves the classification performance by 4.78 percentage points on average in all cases;
however, it is computationally expensive due to the large number of feature vectors. The
classification performance is also highly dependent upon the choice of classifier, and we
can achieve the same performance on the iCVS model (variant 1 and variant 2) with the
Random Forest and Gradient Boosting classifiers.
The investigation and analysis of classification performance was done on a real-world
ontology and a dataset consisting of a small number of documents, so in future work we
plan to conduct a performance analysis on a large-scale dataset. We also plan to implement
and test other Markov based algorithms for computing concept importance as a
fundamental part of the concept weighting scheme, and to compare those techniques with
the PageRank algorithm.
Furthermore, the primary focus of our study was addressing two major limitations of
concept vectors, namely exact matching and the weighting scheme, by proposing an
improved concept vector space model. However, our proposed approach does not handle
another limitation of concept vectors, namely ontological relationships. Future studies on
the current topic are therefore suggested in order to establish a representation of
documents in which concept vectors are redefined to consider the various relationships
that exist in an ontology.
Acknowledgment
The authors would like to thank Cristina Marco from the INFUSE project for provid-
ing the domain ontology and the dataset used in this paper.
References
[1] DOMO, Data never sleeps 6.0: How much data is generated every minute?, accessed: 2018-06-18
(2018).
URL https://www.domo.com/learn/data-never-sleeps-6
[2] R. Jacobson, 2.5 quintillion bytes of data created every day. how does cpg & retail manage it?,
accessed: 2018-07-20 (2018).
URL https://www.ibm.com/blogs/insights-on-business/consumer-products/2-5-quintillion-bytes-of-data-created-every-day-how-does-cpg-retail-manage-it/
[3] P. Raghavan, Extracting and Exploiting Structure in Text Search, in: SIGMOD Conference, ACM,
2003, p. 635.
[4] A.-A. R. Al-Azmi, Data, Text, and Web Mining for Business Intelligence: A Survey, International
journal of Data Mining and Knowledge Management Process 3 (2) (2013) 1–26.
[5] S. Khan, M. Safyan, Semantic matching in hierarchical ontologies, Journal of King Saud University -
Computer and Information Sciences 26 (3) (2014) 247 – 257.
[6] M. Keikha, A. Khonsari, F. Oroumchian, Rich Document Representation and Classification: An Anal-
ysis, Knowledge-Based Systems 22 (1) (2009) 67–71.
[7] A. Hassan, A. Mahmood, Convolutional recurrent deep learning model for sentence classification,
IEEE Access 6 (2018) 13949–13957.
[8] U. Reshma, B. Ganesh, M. Kale, P. Mankame, G. Kulkarni, Deep learning for digital text analytics:
Sentiment analysis, CoRR abs/1804.03673.
[9] N. Sanchez-Pi, L. Marti, A. C. B. Garcia, Improving Ontology-based Text Classification: An Occupa-
tional Health and Security Application, Journal of Applied Logic 17 (2016) 48–58.
[10] C. Bratsas, V. Koutkias, E. Kaimakamis, P. Bamidis, N. Maglaveras, Ontology Based Vector Space
Model and Fuzzy Query Expansion to Retrieve Knowledge on Medical Computational Problem So-
lutions, in: Proceedings of the 29th Annual International Conference of the IEEE Engineering in
Medicine and Biology Society, IEEE, 2007, pp. 3794–3797.
[11] P. Castells, M. Fernandez, D. Vallet, An Adaptation of the Vector Space Model for Ontology Based
Information Retrieval, IEEE Transactions on Knowledge and data engineering 19 (2) (2007) 261–272.
[12] S. Deng, H. Peng, Document Classification Based on Support Vector Machine Using A Concept Vector
Model, in: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, IEEE,
2006, pp. 473–476.
[13] Z. Kastrati, A. S. Imran, S. Y. Yayilgan, An improved concept vector space model for ontology based
classification, in: 2015 11th International Conference on Signal-Image Technology & Internet-Based
Systems (SITIS), IEEE, 2015, pp. 240–245.
[14] G. Wu, J. Li, L. Feng, K. Wang, Identifying Potentially Important Concepts and Relations in an On-
tology, in: Proceedings of the 7th International Conference on The Semantic Web, Springer-Verlag,
Berlin, Heidelberg, 2008, pp. 33–49.
[15] N. Sanchez-Pi, L. Marti, A. C. B. Garcia, Text Classification Techniques in Oil Industry Applications,
in: Proceedings of the International Joint Conference SOCO’13-CISIS’13-ICEUTE’13, Springer Inter-
national Publishing, 2014, pp. 211–220.
[16] X.-q. Yang, N. Sun, Y. Zhang, D.-r. Kong, General Framework for Text Classification Based on
Domain Ontology, in: Proceedings of the 3rd International Workshop on Semantic Media Adaptation
and Personalization, IEEE, 2008, pp. 147–152.
[17] J. Fang, L. Guo, X. Wang, N. Yang, Ontology-Based Automatic Classification and Ranking for Web
Documents, in: Proceedings of the 4th International Conference on Fuzzy Systems and Knowledge
Discovery, IEEE, 2007, pp. 627–631.
[18] H. Gu, Z. Kuanjiu, Text Classification Based on Domain Ontology, Journal of Communication and
Computer 3 (5) (2006) 261–272.
[19] C. d. C. Pereira, A. G. B. Tettamanzi, An Evolutionary Approach to Ontology-Based User Model
Acquisition, in: Proceedings of the 5th International Workshop on Fuzzy Logic and Applications,
Springer Berlin Heidelberg, Berlin, Heidelberg, 2006, pp. 25–32.
[20] A. S. Imran, F. A. Cheikh, Blind image quality metric for blackboard lecture images, in: Proceedings
of the 18th European Signal Processing Conference, IEEE, 2010, pp. 333–337.
[21] L. Jianzhuang, L. Wenqing, T. Yupeng, Automatic thresholding of gray-level pictures using two-
dimensional Otsu method, in: Proceedings of the International Conference on Circuits and Systems,
IEEE, 1991, vol. 1, pp. 325–327.
[22] Z. Kastrati, A. S. Imran, Document image classification using semcon, in: Proceedings of the 20th
Symposium on Signal Processing, Images and Computer Vision (STSIVA), 2015, pp. 1–6.
[23] A. S. Imran, S. Chanda, F. A. Cheikh, K. Franke, U. Pal, Cursive handwritten segmentation and recog-
nition for instructional videos, in: 2012 Eighth International Conference on Signal Image Technology
and Internet Based Systems, 2012, pp. 155–160.
[24] Z. Wu, M. Palmer, Verbs Semantics and Lexical Selection, in: Proceedings of the 32nd Annual Meeting
on Association for Computational Linguistics, Association for Computational Linguistics, 1994, pp.
133–138.
[25] A. Maedche, Ontology Learning for the Semantic Web, Springer US, 2002.
[26] S. Brin, L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, in: Proceedings of
the 7th International Conference on World Wide Web 7, Elsevier Science Publishers B. V., Amsterdam,
The Netherlands, The Netherlands, 1998, pp. 107–117.
[27] S. White, P. Smyth, Algorithms for Estimating Relative Importance in Networks, in: Proceedings of
the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM,
New York, NY, USA, 2003, pp. 266–275.
[28] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the
Web, Technical Report, Stanford InfoLab (1998).
[29] Ontotext, Graphdb workbench users guide, accessed: 2015-09-20 (2014).
URL http://owlim.ontotext.com/display/GraphDB6/GraphDBWorkbench