A Systematic Mapping Study of Language Features
Identification from Large Text Collection
Diellza Nagavci Mati
Faculty of Computer Science and Technologies
South East European University (SEEU)
Tetovo, Macedonia
dn16574@seeu.edu.mk
Jaumin Ajdari
Faculty of Computer Science and Technologies
South East European University (SEEU)
Tetovo, Macedonia
j.ajdari@seeu.edu.mk
Bujar Raufi
Faculty of Computer Science and Technologies
South East European University (SEEU)
Tetovo, Macedonia
b.raufi@seeu.edu.mk
Mentor Hamiti
Faculty of Computer Science and Technologies
South East European University (SEEU)
Tetovo, Macedonia
m.hamiti@seeu.edu.mk
Besnik Selimi
Faculty of Computer Science and Technologies
South East European University (SEEU)
Tetovo, Macedonia
b.selimi@seeu.edu.mk
Abstract— Natural Language Processing (NLP) is an emerging research area in today's era. NLP resources are quite useful when it comes to building a machine capable of translating between language pairs, a solution that strives to resolve language barrier problems. Based on this premise, we are focusing our research on feature identification from large text collections of the Albanian language. Rule-based or statistical Part-of-Speech (POS) taggers are typically employed, and they require either considerable time for rule development or a sufficient amount of manually labelled data.
In light of this, the impact of this research lies in exploring numerous cases that are conducive to progress and further development of this field. One of the goals of this paper is to conduct a systematic review study: to explore and analyze existing research that targets low-resource languages, such as the Albanian language. Based on prior observation of published research conducted since 2015, we focus on studies that have been published in areas relevant to Natural Language Processing. Given the considerable amount of related research in this field, it is essential to conduct a review and provide an outline of the research situation, as well as the current developments, in this specific but important field of research.
Keywords— Natural Language Processing; Machine Learning; Part-of-Speech; Algorithms; Chinese Whispers; Clustering.
I. INTRODUCTION
Nowadays, assigning syntactic classes to words is a crucial pre-processing step for many NLP applications. On this note, POS tags are largely utilized when we seek to parse, chunk, resolve anaphora, recognize named entities and extract knowledge, among other uses. Basically, constructing a tagger requires two ingredients: a lexicon of tags-for-words, and a mechanism that attributes such tags to the relevant tokens in a text. Based on previous research, the focus will be on analyzing Natural Language Processing and Machine Learning papers that have proposed different methods and algorithms for dealing with low-resource languages. As such, the methodology used in this paper seeks to analyze and resolve the relevant research questions that arise. The paper then summarizes a classification scheme over the fields of Machine Learning, Natural Language Processing, and related Ontology. After that, four research questions are answered and four others are proposed as future research interests. The literature review section of the paper also includes the time series of papers according to the machine learning and natural language processing areas of interest. Unsupervised POS-tagging methods and Chinese Whispers (CW) are specifically proposed as future research areas for low-resource languages.
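As a minimal illustration of these two ingredients, a dictionary-lookup tagger can be sketched in a few lines of Python. The lexicon entries below are toy assumptions for illustration only, not an actual Albanian lexicon or the tagger of any surveyed paper.

```python
# Minimal sketch of the two ingredients of a tagger:
# (1) a lexicon of tags-for-words and (2) a mechanism that assigns
# a tag to each token of a text; unknown tokens receive a default tag.

# Toy lexicon; the entries are illustrative assumptions only.
LEXICON = {
    "ai": "PRON",      # Albanian: "he"
    "lexon": "VERB",   # Albanian: "reads"
    "librin": "NOUN",  # Albanian: "the book" (accusative)
}

def tag_tokens(tokens, lexicon=LEXICON, default="UNK"):
    """Look each token up in the lexicon and attach its tag."""
    return [(token, lexicon.get(token.lower(), default)) for token in tokens]

if __name__ == "__main__":
    print(tag_tokens("Ai lexon librin".split()))
    # [('Ai', 'PRON'), ('lexon', 'VERB'), ('librin', 'NOUN')]
```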
II. METHODOLOGY
The main scope of this systematic review study is to determine and answer the research questions by analyzing and cross-referencing related articles. After conducting a comprehensive query, the most adequate and relevant papers are selected, and based on them the classification scheme is defined. The research questions are then addressed based on the outcomes of the mapping and of the entire systematic review process. This approach is appropriate, since it often provides a visual summary, a map, of its results [7]. Initially, this paper endeavors to amass all the relevant publications related to the field of interest, and at the same time to provide an outline of this research field in which the quantity and type of research, along with the available results, are identified. The second part of this paper focuses on the major leading papers in the field, excluding the remaining studies that are not relevant to the research questions.
TABLE 1. SEARCH STRINGS

No.   Search String                                                                                                           No. of papers
SS1   (("Abstract": Natural Language Processing OR "Abstract": Machine Learning) AND "Abstract": Language Identification)     205
SS2   ((("Natural Language Processing") OR "Unsupervised") AND Machine Learning) AND Language Identification                  132
SS3   (((("Natural Language Processing") OR "Unsupervised") AND Machine Learning) AND Language Identification) AND prediction 23
After this, the paper addresses the research questions that guide the structure of the research. The third part of this paper then applies the classification scheme, which seeks to capitalize on the existing published studies in order to provide more accurate results. Answering the research questions ensures data extraction and a complete mapping of studies, by identifying, analyzing and interpreting the suitable evidence. The classification scheme depicts the fields of interest mentioned earlier, namely: Machine Learning, NLP and related Ontology.
A. Research Questions and Search Strategy
The aim of this paper is to analyze publications that have tackled Natural Language Processing, guided by research questions such as:
- What is the core topic of interest discussed in the papers?
- What types of methods were used with regard to Natural Language Processing?
- How have publications evolved over time? What are the research and publication trends?
- Which algorithms can help towards finding rarely used words?
The majority of the research publications used for cross-referencing and analysis in this research were extracted from digital libraries such as IEEE Xplore and ACM, while some additional articles were taken from SpringerLink. The search strings shown in Table 1 are the ones used to perform the queries in the digital libraries mentioned above.
Several candidate search strings were considered; out of those, we selected only the ones deemed most relevant to the field of interest of this paper (shown in Table 1). Most of the selected papers were published in recent years. Table 3 shows the number of publications in recent years (from 2015 to 2019), again focusing on the ones that are appropriate to this study. The selected papers were analyzed further, keeping only the papers related to NLP, Machine Learning, unsupervised learning and language identification. As a result, after removing duplicates and irrelevant papers, only 125 relevant articles remained.
III. CLASSIFICATION SCHEME
The classification scheme is organized in three columns presenting the main fields of interest related to the research (Fig. 1). Machine Learning is the main area of interest on which the focus is placed, which will eventually lead towards creating dictionaries for low-resource languages by using unsupervised POS-tagging methods in the future. The fields of interest are defined in the first column of the scheme: Machine Learning, NLP and Ontology. Then, as shown in the second column, unsupervised learning can be used to help improve automated POS-tagging in low-resource languages. Finally, the third column lists the Machine Learning algorithms, such as the Chinese Whispers algorithm, that can be used for low-resource languages. Based on the analysis of the collected papers, the gap was identified with respect to low-resource languages, where such research can make a valuable contribution.
Figure 1. Classification scheme
IV. RESULTS
All selected research papers were classified into distinct categories in order to provide answers to the research questions. The research questions and the resulting responses of the systematic review study are presented in the following paragraphs.
A. RQ1: What is the core topic of interest that has been
discussed in the papers?
This question deals with the main field of interest investigated in each of the papers. We are interested in Natural Language Processing, but since we used several search strings, we obtained several results. In order to answer this question, we created the 'Field of Interest' classification for the papers. From Table 2 we can observe that about 54% of the papers have machine learning as their main focus [9], along with the various methods that have been utilized.
The second most mentioned area of interest is Natural Language Processing, reaching almost 37%. This category of papers includes semantic role labeling, spatial expression recognition, opinion summarization, topic linking and visualization plug-ins, etc. NLP also has many other major applications, such as OCR, parsing, natural language understanding, named entity recognition and machine translation [10].
TABLE 2. NUMBER OF PAPERS BY MAIN FIELD OF INTEREST

Field of interest             Number of papers    Percentage
Machine Learning              68                  54%
Natural Language Processing   46                  37%
Ontology                      11                  9%
B. RQ2: How have publications evolved over time? What are the research and publication trends?
While examining the publication year of each paper, we focus specifically on the time range between 2015 and 2019. The majority of the selected papers (44.80%) were published in 2018. In fact, looking at the graph in Fig. 2, it can be observed that the publication rate increases from year to year, indicating a growing interest in the field.
TABLE 3. NUMBER OF PAPERS PER YEAR UNTIL Q1-2019

Year    Number of papers    %
2019    24                  19.20%
2018    56                  44.80%
2017    18                  14.40%
2016    17                  13.60%
2015    10                  8.00%
Figure 2. Number of papers per year
C. RQ3: What types of methods were used with regard to Natural Language Processing?
In recent years, Natural Language Processing has been applied to solve different problems faced in linguistics. In his paper on natural language processing, Dipanjan Das [11] has shown the efficacy of graph-based label propagation for projecting part-of-speech information across different languages. Such results show that it is possible to learn accurate POS taggers for languages that have no annotated data but do have translations into a resource-rich language. The results also lean in support of unsupervised POS-tagging and of related approaches that rely on direct projections, bridging the gap between purely supervised and purely unsupervised POS tagging models [12]. From the results of this mapping (see Table 4), the majority of the selected papers focused on methods related to unsupervised Part-of-Speech tagging (61%), whereas the remainder, approximately 39%, deal with clustering.
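As a rough illustration of the label propagation idea mentioned above, a minimal sketch is given below. The word graph, edge weights and seed label distributions are toy assumptions; they do not reproduce the bilingual graph construction of [11].

```python
# Toy sketch of label propagation on a word graph: seed nodes carry
# POS label distributions (e.g., projected through word alignments),
# and unlabeled nodes repeatedly average the distributions of their
# weighted neighbors.
from collections import defaultdict

# Hypothetical undirected, weighted word graph (edges by similarity).
edges = {("dog", "cat"): 1.0, ("cat", "mouse"): 0.8,
         ("runs", "walks"): 1.0, ("dog", "runs"): 0.1}
seeds = {"dog": {"NOUN": 1.0}, "walks": {"VERB": 1.0}}   # projected seed labels

neighbors = defaultdict(dict)
for (u, v), w in edges.items():
    neighbors[u][v] = w
    neighbors[v][u] = w

labels = {n: dict(seeds.get(n, {})) for n in neighbors}

for _ in range(10):                               # a few propagation iterations
    new_labels = {}
    for node in neighbors:
        if node in seeds:                         # keep seed labels fixed
            new_labels[node] = dict(seeds[node])
            continue
        scores = defaultdict(float)
        for nb, w in neighbors[node].items():
            for tag, p in labels[nb].items():
                scores[tag] += w * p              # weighted vote from neighbors
        total = sum(scores.values())
        new_labels[node] = {t: s / total for t, s in scores.items()} if total else {}
    labels = new_labels

for node, dist in labels.items():
    print(node, dist)   # "cat" ends up mostly NOUN, "runs" mostly VERB
```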
[Figure 1 panels: Field of interest (Machine Learning, Natural Language Processing, Ontology); Methods (Unsupervised POS-tagger, Clustering, N-Gram); Algorithms (Chinese Whispers, Linear Regression, Logistic Regression, Classification, Active Learning).]
TABLE 4. NUMBER OF PAPERS BY FRAMEWORK TYPE

Method type                 Number of papers    Percentage
Unsupervised POS-tagging    76                  61%
Clustering                  49                  39%
D. RQ4: Which algorithms can help towards finding rarely
used words?
One important part of our research is to find out which algorithms are used for finding the rarest words in low-resource languages. Regarding Natural Language Processing, one such algorithm, as used by Biemann [5], is Chinese Whispers, a very basic algorithm for partitioning the nodes of weighted, undirected graphs.
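A minimal sketch of the Chinese Whispers idea follows: every node starts in its own class and, visiting nodes in random order, each node adopts the class with the highest total edge weight among its neighbors. The toy graph below is an assumption for illustration and is not data from [5].

```python
import random
from collections import defaultdict

def chinese_whispers(nodes, edges, iterations=20, seed=0):
    """Toy Chinese Whispers: partition the nodes of a weighted, undirected graph.

    nodes: iterable of hashable node ids.
    edges: dict mapping (u, v) pairs to positive weights.
    """
    rng = random.Random(seed)
    neighbors = defaultdict(dict)
    for (u, v), w in edges.items():
        neighbors[u][v] = w
        neighbors[v][u] = w

    labels = {n: n for n in nodes}              # every node starts in its own class
    for _ in range(iterations):
        order = list(nodes)
        rng.shuffle(order)                      # visit nodes in random order
        for node in order:
            scores = defaultdict(float)
            for nb, w in neighbors[node].items():
                scores[labels[nb]] += w         # sum edge weights per neighbor class
            if scores:
                labels[node] = max(scores, key=scores.get)   # adopt the strongest class
    return labels

# Toy word graph with two obvious clusters.
nodes = ["dog", "cat", "mouse", "run", "walk", "jump"]
edges = {("dog", "cat"): 1.0, ("cat", "mouse"): 1.0, ("dog", "mouse"): 0.5,
         ("run", "walk"): 1.0, ("walk", "jump"): 1.0, ("run", "jump"): 0.5}
print(chinese_whispers(nodes, edges))
```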
Other algorithms, such as Naïve Bayes, SGD, Logistic, HyperPipes and RBFNetwork, were used by Marenglen et al. [13] to experimentally evaluate the performance of classification algorithms for opinion mining in a multi-domain corpus of the Albanian language. They created 11 text corpora of written Albanian opinions collected from different well-known Albanian newspapers. Each corpus has an identical number of text documents categorized as positive opinions and text documents categorized as negative opinions. Another entity recognition system was created and evaluated on top of existing machine learning algorithms, such as decision trees and neural networks, by Georgios et al. [14]. These systems were evaluated on a Greek text collection, and they lead to the recognition of the disadvantages and restrictions imposed by the inspected algorithms when applied to natural language data. This new technique belongs to the category of inductive grammar learning. The fundamental advantages of this method with respect to other machine learning methods are the ability to handle textual data, as well as the possibility of using the learned grammars in actual systems, replacing manually developed grammars [16]. For applying inductive grammar learning, a new algorithm has been created that learns grammars from positive examples only. This new algorithm can infer context-free grammars, and it is founded on the existing GRIDS algorithm [15], improving both user-friendliness and the search process in the space of possible grammars, jointly increasing the applicability of the new algorithm to bigger collections of data.
Anchor-NMF has been presented by Karl et al. [18]. This learning algorithm is used to deal with the task of unsupervised POS tagging. The goal of this task is to induce the correct sequence of POS tags (hidden states) given a sequence of words (observations). For each POS tag, the anchor condition corresponds to the assumption that at least one word occurs only under that tag.
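In formulas, the anchor-word assumption described above can be paraphrased as follows (this is a generic restatement, not the exact notation of [18]): for every tag there is at least one word that can only be generated by that tag.

```latex
% For every POS tag t there exists an anchor word w_t that occurs
% only under t, i.e. its emission probability is zero for all other tags.
\forall t \in \mathcal{T}\ \exists\, w_t \in \mathcal{V}:
\quad P(w_t \mid t) > 0
\quad \text{and} \quad
P(w_t \mid t') = 0 \ \text{for all } t' \neq t .
```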
E. RQ5: How can raw text be used to generate spell-check dictionaries?
Low-resource languages have lots of raw text, so we have to see how it can be used to generate spell-check dictionaries [63] (a minimal baseline is sketched after the questions below). Some questions that need to be answered for unsupervised algorithms are the following:
- Which software should be used to generate the spell-check dictionaries?
- Are we going to obtain results from raw text?
- What methods should be applied to generate the spell-check dictionaries?
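One hedged, minimal baseline (not a method proposed in the surveyed papers) is to treat the frequency list of a raw corpus as the dictionary and, for an unknown word, to suggest the most frequent known word within edit distance one. The corpus and alphabet below are toy assumptions.

```python
# Minimal baseline: build a word-frequency "dictionary" from raw text and
# suggest frequent words within edit distance 1 for unknown tokens.
# The corpus below is a toy assumption; a real system would use a large
# raw-text collection and a more careful tokenizer.
import re
from collections import Counter

raw_text = "libri lexohet libri lexohet nga nxenesi libri"   # toy Albanian-like corpus
counts = Counter(re.findall(r"\w+", raw_text.lower()))

def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyzëç"):
    """All strings at edit distance 1 (deletes, transposes, replaces, inserts)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def suggest(word):
    """Return the most frequent in-dictionary word at edit distance <= 1."""
    if word in counts:
        return word                                   # already a known word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word

print(suggest("libra"))   # -> "libri" in this toy corpus
```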
F. RQ6: Can usage differences detect misspelled words?
In order to detect misspelled words through usage differences in low-resource languages, some questions pertaining to the unsupervised setting need to be answered:
- How can usage differences in a textual context detect misspelled words in a text collection?
- Which results can be obtained in low-resource languages?
V. DISCUSSION
After analyzing 125 research papers and shared experiences in Natural Language Processing, we are ready to apply these findings to low-resource languages. Many studies have applied different Machine Learning algorithms, using supervised or unsupervised learning [2], in different languages. The Chinese Whispers graph clustering algorithm has been used to perform the necessary abstractions and generalizations [22] for grouping words into POS classes in text collections for languages such as Dutch, Italian, Swedish, Hungarian and German.
TABLE 6. METHODS THAT ARE USED

Method                      Can be used
Unsupervised POS-tagging    Used to draw inferences from datasets consisting of input data without labelled responses.
Clustering                  Used as a statistical data analysis technique in many fields.
N-gram                      Each n-gram is composed of n words.
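As a concrete illustration of the n-gram entry in Table 6, the short sketch below extracts word n-grams from a sentence; the example sentence is arbitrary.

```python
# Extract word n-grams: every n-gram is a sequence of n consecutive words.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "natural language processing for low resource languages".split()
print(ngrams(tokens, 2))   # bigrams, e.g. ('natural', 'language'), ...
print(ngrams(tokens, 3))   # trigrams
```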
Figure 3. Time series of papers according to methods field of interest
[Figure 3 data: yearly counts of papers for the Unsupervised POS-tagging, Clustering and N-gram methods, 2015-2019.]
The system presented by Biemann et al. [5] uses CW clustering of graphs constructed by distributional similarity to induce a lexicon of supposedly non-ambiguous words with respect to Part-of-Speech, by choosing mostly safe cases and excluding questionable ones from the lexicon. In this implementation, two clusterings are combined: one for high- and medium-frequency words, and the other for medium- and low-frequency words. High- and medium-frequency words are clustered by the similarity of their stop-word context feature vectors, such that the resulting graph includes only words that are involved in highly similar pairs [4]. Clustering such a graph of typically 5,000 vertices results in several hundred clusters, which are further used as POS categories. To extend the lexicon, words of medium and low frequency are clustered using a graph that encodes the similarity of their neighboring co-occurrences. Both clusterings are mapped by overlapping elements into a lexicon that provides POS information for some 50,000 words. To obtain a clustering on data sets of this size, an efficient algorithm like CW is crucial; finally, a word tagger with a morphological extension is trained, which in turn assigns a tag to each token within the corpus [3].
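The graph construction step described above can be illustrated with a minimal sketch: words are represented by context feature vectors (here, toy counts of neighboring stop words), and an edge is kept between two words only when the cosine similarity of their vectors exceeds a threshold. The vectors and the threshold are assumptions for illustration; they do not reproduce the exact features of [5].

```python
# Toy sketch: build a word similarity graph from context feature vectors
# (e.g., counts of co-occurring stop words) and keep only highly similar pairs.
import math

# Hypothetical context feature vectors over the stop words ("the", "a", "of", "to").
vectors = {
    "dog":   [12, 7, 1, 0],
    "cat":   [11, 8, 0, 1],
    "house": [9, 6, 2, 0],
    "run":   [0, 1, 0, 9],
    "walk":  [1, 0, 1, 8],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

threshold = 0.95
edges = {}
words = list(vectors)
for i, u in enumerate(words):
    for v in words[i + 1:]:
        sim = cosine(vectors[u], vectors[v])
        if sim >= threshold:              # keep only highly similar pairs
            edges[(u, v)] = sim

print(edges)   # the resulting graph could then be clustered, e.g. with Chinese Whispers
```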
VI. FUTURE RESEARCH AREA
This paper is based on many studies that have great relevance to this topic. Orientations in this field follow the latest technology developments and trends. Natural language processing is an emerging research area in this contemporary age, especially for low-resource languages. It provides solutions for people from different linguistic backgrounds in the context of language learning. Through NLP resources, a language translator can be developed and the language barrier problem can be reduced among populations [35].
Taking into consideration the great challenges of creating a vocabulary for any language, this research will strive to contribute towards enriching this low-resource language with the results that will ensue, all the more since the Albanian language still possesses no proper digital dictionary. In this instance, the limitations of supervised learning methods would be overcome through the use of unsupervised methods for identifying language features in text collections. Two research questions are proposed as future research areas:
- Which results can be obtained when applying unsupervised POS tagging to a large text collection in low-resource languages?
- How can increasing the text collection affect the improvement of accuracy?
VII. CONCLUSION
According to the analyzed literature, unsupervised learning for NLP has seen an increase in usage in recent years, as per the research trends. This paper introduces a systematic mapping study aimed at improving language feature identification from large text collections. In addition, we have introduced several research questions as well as some future research questions which have to be elaborated further. We also examined the time series of papers with respect to the gaps in the NLP fields of interest. Most of the papers were supported by different analyses and models of NLP and unsupervised POS-tagging.
REFERENCES
[1] Alam, H., & Kumar, A. (2015). Multi-lingual author identification and
linguistic feature extraction — A machine learning approach. IEEE.
[2] Anjana, J. S., & Poorna, S. S. (2018). Language Identification from
Speech Features Using SVM and LDA. IEEE.
[3] Atzeni, M., & Atzori, M. (2018). Translating Natural Language to Code:
An Unsupervised Ontology-Based Approach. IEEE.
[4] Bais, H., Machkour, M., & Koutti, L. (2016). Querying database using a
universal natural language interface based on machine learning. IEEE.
[5] Biemann, C. (2016). Unsupervised Part-of-Speech Tagging in the Large.
Germany: Res on Lang and Comput.
[6] Bodapati, S. B., Ramaswamy, S., & Narayanan, G. (2018). A Machine
Learning Approach to Detecting Start Reading Location of eBooks. IEEE.
[7] Bodapati, S. B., Ramaswamy, S., & Narayanan, G. (2018). A Machine
Learning Approach to Detecting Start Reading Location of eBooks. IEEE.
[8] Cai, J., Li, J., Li, W., & Wang, J. (2018). Deeplearning Model Used in
Text Classification. IEEE.
[9] Caranica, A., Cucu, H., & Buzo, A. (2016). Exploring an unsupervised,
language independent, spoken document retrieval system. IEEE.
[10] Carbajal, M. J., Dawud, A., Thiollière, R., & Dupoux, E. (2016). The
“language filter” hypothesis: A feasibility study of language separation in
infancy using unsupervised clustering of I-vectors. IEEE.
[11] Celikyilmaz, A., Sarikaya, R., Jeong, M., & Deoras, A. (2016). An Empirical Investigation of Word Class-Based Features for Natural Language Understanding. IEEE.
[12] Collobert, R. (2016). A Unified Architecture for Natural Language
Processing: Deep Neural Networks with Multitask Learning. IEEE.
[13] Das, D., & Petrov, S. (2011). Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 600–609.
[14] Dodal, S. S., & Kulkarni, P. V. (2018). Multi-Lingual Information
Retrieval Using Deep Learning. IEEE.
[15] Cambria, E., & White, B. (2015). Jumping NLP Curves: A Review of Natural Language Processing Research. Digital Object Identifier 10.1109/MCI.
[16] Gharge, S., & Chavan, M. (2017). An integrated approach for malicious tweets detection using NLP. IEEE.
[17] Goldberg, D. (2015). Genetic Algorithms in Search, Optimization, and Machine Learning. IEEE.
[18] representations. IEEE.
[19] Gunn, S. R. (2015). Support Vector Machines for Classification and
Regression.
[20] Hung, C.-K. (2017). Making machine-learning tools accessible to
language teachers and other non-techies: T-SNE-lab and rocanr as first
examples. IEEE.
[21] Hutchinson, T. (2018). Protecting Privacy in the Archives: Supervised
Machine Learning and Born-Digital Records. IEEE.
[22] IEEE/ACM Transactions on Audio, S. a. (2015). Supervised Detection
and Unsupervised Discovery of Pronunciation Error Patterns for
Computer-Assisted Language Learning. IEEE.
[23] Itauma Itauma, M. S.-w. (2015). Unsupervised Learning and Image Classification in High Performance Computing Cluster. IEEE 14th International Conference on Machine Learning and Applications.