ChapterPDF Available

Named Entity Recognition for Icelandic: Annotated Corpus and Models

Authors:

Abstract and Figures

Named entity recognition (NER) can be a challenging task, especially in highly inflected languages where each entity can have many different surface forms. We have created the first NER corpus for Icelandic by annotating 48,371 named entities (NEs) using eight NE types, in a text corpus of 1 million tokens. Furthermore, we have used the corpus to train three machine learning models: first, a CRF model that makes use of shallow word features and a gazetteer function; second, a perceptron model with shallow word features and externally trained word clusters; and third, a BiLSTM model with external word embeddings. Finally, we applied simple voting to combine the model outputs. The voting method obtains an \(F_{1}\) score of 85.79, gaining 1.89 points compared to the best performing individual model. The corpus and the models are publicly available.
Content may be subject to copyright.
Named Entity Recognition for Icelandic:
Annotated Corpus and Models
Svanhvít L. Ingólfsdóttir[000000016965692X],
Ásmundur A. Guðjónsson[000000027790664X], and
Hrafn Loftsson[0000000292984830]
Language and Voice Lab
Reykjavik University
Menntavegur 1, 101 Reykjavík, Iceland
svanhviti16@ru.is,asmundur10@ru.is,hrafn@ru.is
https://lvl.ru.is
Abstract. Named entity recognition (NER) can be a challenging task,
especially in highly inflected languages where each entity can have many
different surface forms. We have created the first NER corpus for Ice-
landic by annotating 48,371 named entities (NEs) using eight NE types,
in a text corpus of 1 million tokens. Furthermore, we have used the cor-
pus to train three machine learning models: first, a CRF model that
makes use of shallow word features and a gazetteer function; second,
a perceptron model with shallow word features and externally trained
word clusters; and third, a BiLSTM model with external word embed-
dings. Finally, we applied simple voting to combine the model outputs.
The voting method obtains an F1score of 85.79, gaining 1.89 percentage
points compared to the best performing individual model. The corpus
and the models are publicly available.
Keywords: Named entity recognition ·Corpus annotation ·BiLSTM ·
CRF ·Clustering ·Machine learning
1 Introduction
Since the integration of named entity recognition (NER) into the sixth Message
Understanding Conference (MUC) in 1995 [16], NER has become recognized as
an important task in natural language processing (NLP), and NER datasets and
methods have been developed for many languages. Detecting and recognizing
named entities (NEs) in text is not a trivial task, as various patterns need to
be learned, and new proper names appear frequently. Furthermore, in highly
inflected languages, such as Icelandic, each proper name can have many different
surface forms, which further complicates the task.
No NER corpus was available for the Icelandic language before the work in-
troduced in [18], where a sample of 200,000 tokens, from the MIM-GOLD corpus
of 1 million tokens [23], was annotated with four NE types and used for training
a NER prototype. In this paper, we describe the completion of the annotation
2 S. Ingólfsdóttir et al.
of the whole corpus, using eight NE types, which has resulted in 48,371 NEs.
The corpus, MIM-GOLD-NER, is available online1. Furthermore, we describe
the evaluation of three different models trained on the corpus: first, a Condi-
tional Random Field (CRF) model that makes use of shallow word features and
a gazetteer function; second, a perceptron model with shallow word features and
externally trained word clusters; and third, a bidirectional long short-term mem-
ory (BiLSTM) model with external word embeddings. Finally, we have combined
our model outputs using a simple voting method, obtaining an F1score of 85.79
and gaining 1.89 percentage points compared to the best performing individual
model. The code for our models is available online2.
2 Related Work
The most commonly annotated entity types in NER corpora in the general do-
main are person,location and organization, both due to the fact that these
entities are very common in general texts and newswire texts, which are often
the source of NER corpora, and that they contain information that may be valu-
able for a variety of purposes. The three generic NE types were first proposed
for the MUC-6 event as a subtask called enamex [16]. The other subtasks in
MUC-6 are named timex (dates and times) and numex (monetary values and
percentages). In the CoNLL shared task, a fourth category, miscellaneous
was introduced, which covers proper names that fall outside of the three classic
categories, person,location and organization [31].
While rule-based methods for NER can be highly efficient, especially in a
well-defined domain [10], machine learning (ML) methods took over from rule-
based method as the main approach to NER as the field developed and data
became more readily available. These range from unsupervised to fully super-
vised methods (e.g. [3, 25, 35]), as well as hybrid systems, combining various
supervised and unsupervised methods and carefully engineered features, e.g. [2].
Although the majority of the early work on NER was conducted on the
English language, NER corpora and published work is now available for various
different languages. This is an important development, because of the many
structural, morphological, and orthographic features that characterize different
languages. Enriching the input representation with pre-trained word embeddings
has, for example, proven useful for NER in languages such as Turkish [12] and
Arabic [21], as have the larger and more complex language models that have
recently become popular, such as the Finnish [33] and Slavic [4] BERT models.
Various published work exists for the languages most related to Icelandic,
i.e. the Scandinavian languages. The most recent work for Swedish includes a
BiLSTM model [34]. A Danish NER corpus is presented in [13], but some of the
latest research for Danish NER focuses on cross-lingual transfer, to make up for
the limited data available [27]. For Norwegian, the most recent work includes
[19], which involves new annotated NER datasets for the two written forms of
1http://hdl.handle.net/20.500.12537/42
2http://github.com/cadia-lvl/NER
Named Entity Recognition for Icelandic: Annotated Corpus and Models 3
Norwegian, Bokmål and Nynorsk, and BiLSTM models enriched with pre-trained
word embeddings.
Limited work exists on NER for the Icelandic language, most likely due to
the lack of an annotated NER corpus. IceNER, a rule-based NER system is part
of the IceNLP toolkit [22]. It has been reported to reach an F1score of 71.5
without querying a gazetteer, and 79.3 with a gazetteer lookup [32].
3 Corpus Annotation
As a basis for our NER corpus, we used MIM-GOLD, the Icelandic Gold Stan-
dard corpus [23], a balanced text corpus of approximately 1 million tokens,
tagged with part-of-speech (PoS). All the texts are from the years 2000–2009 and
are sourced from thirteen text types, including news texts, books, blogs, web-
sites, laws, adjudications, school essays, scripted radio news, web media texts
and emails. For most languages, this variety of text genres is not common in
NER corpora, which are often centered on newswire texts (with some excep-
tions, such as the Portuguese HAREM NER contests [29, 28]). Nevertheless, our
corpus, MIM-GOLD-NER, is heavy on news-related content, as newspaper arti-
cles, web media, and radio news account for 36% of the tokens.
3.1 Annotation Process
Eight NE types are tagged in MIM-GOLD-NER: person,location,organi-
zation,miscellaneous,date,time,money and percentage. The first four
entity types are the same as used in the CoNLL shared tasks. The last four NE
types were adapted from the numex and timex types in the MUC events.
We applied a semiautomatic approach when annotating the corpus, using
gazetteers and regular expressions to extract as many entities as possible before
reviewing and correcting the corpus manually.
Preprocessing Gazetteers were collected from official Icelandic resources. For
extracting the person names, we used the Database of Modern Icelandic Inflec-
tions (DMII) [6]. Additionally, we collected lists of place names and addresses,
as well as company names. In the end, we had gazetteers with 15,000 person
names, 97,000 location names, and 90,000 organization names. Since the origi-
nal MIM-GOLD is PoS-tagged, we were able to use the PoS tags to filter out
likely proper nouns and match them with our gazetteers, to produce NE candi-
dates. Heuristics and knowledge of the language were used to resolve doubts and
ambiguities and to try to determine NE boundaries. Remaining ambiguities were
registered to be resolved manually. Regular expressions were used for automat-
ically extracting the numerical entities, taking care to match entities regardless
of how they are written out in the corpus, whether numerically or alphabetically.
4 S. Ingólfsdóttir et al.
Manual review After the automatic preprocessing step, the next task was
reviewing the resulting annotations, fixing errors and picking up any remaining
NEs missed in the previous step, since some inaccuracies were bound to appear,
both in the classification and the span (boundaries) of the entities.
Guidelines were constructed regarding the taxonomy used for the annotation
of the corpus. For the four NE types adapted from CoNLL, we mostly relied on
the CoNLL guidelines for each entity type [9], though with some modifications to
fit Icelandic settings and writing conventions. The four numerical and temporal
entity types, date,time,money, and percent, appeared less commonly in the
corpus, and were easier to find using regular expressions. The MUC guidelines
were followed as closely as possible when annotating these entities.
One implication of using a balanced corpus such as MIM-GOLD, is that only
parts of it have been reviewed/edited, so the texts vary quite a lot in writing
quality; some have been thoroughly proofread (published books, laws and adju-
dications), some have undergone some editing (news articles, some web content,
scripted texts for radio), and some have not been edited at all (blogs, emails,
web content, classified newspaper ads). The corpus contained many problematic
NE candidates, for which annotation was not clear. These were marked specially
during the annotation and resolved at the end, to keep consistency within the
corpus. In some parts of the corpus, such as laws and regulatory texts, NEs are
sparsely distributed, but we have nonetheless opted for annotating the whole
corpus without removing sentences unlikely to contain NEs. This was done both
to keep the organic distribution of NEs in the corpus and to ensure compatibility
with the original MIM-GOLD corpus.
One annotator (the first author of this paper) was in charge of defining the
annotation task and labeling the bulk of the corpus. This was done part-time
in the course of a year. A second annotator (the second author) stepped in to
help in the last weeks, and reviewed around 8% of the corpus. We estimate that
150-200 hours went into the manual annotation stage. The annotation resulted
in 48,371 NEs, split between the eight entity types as shown in Table 1.
Table 1. NE split in the MIM-GOLD-NER corpus.
Entity type Total NE count Percentage
Person 15,599 32.25%
Location 9,011 18.63%
Organization 8,966 18.54%
Miscellaneous 6,264 12.95%
Date 5,529 11.43%
Time 1,214 2.51%
Money 1,050 2.17%
Percent 738 1.53%
Total 48,371
Named Entity Recognition for Icelandic: Annotated Corpus and Models 5
Once the annotation was finished, an external linguist reviewed a randomly
chosen 10% sample of the corpus, to estimate the accuracy of the annotation. The
corpus sample contained a total of 4,527 NEs. The reviewer was presented with
the same instructions and guidelines as annotator 1 and annotator 2, and was
asked to mark any doubts or errors spotted in the corpus sample. Annotator 1
then reviewed this list of error candidates and, in accordance with the guidelines,
evaluated which of them were true positives, i.e. real errors that should be fixed.
In the end, 250 error candidates were found, out of which 180 were marked as
false positives by annotator 1. These false positives were due to either doubts
that the reviewer had on how to annotate or lack of detail in the annotation
guidelines. Thus, the total number of real errors was 70, which is equivalent to
an accuracy of 98.45% for this corpus sample.
4 Models
Until recently, the main approach for NER has been the application of BiLSTM
models, along with the best hybrid feature-engineered ML systems [36]. Even
though these methods have mostly fallen in the shadow of state-of-the-art trans-
former models [5, 15], they do a good job of solving problems where the input is
sequential data, such as in NER. Since a transformer model does not yet exist for
Icelandic, we opted for these tried and tested methods for the first experiments
with our new corpus.
Three different methods were chosen for the experiments: a CRF model, a
perceptron model, and a BiLSTM model. The three models were then combined
into one ensemble NER tagger using simple voting. Before presenting our results,
we will briefly describe each method.
4.1 Conditional Random Field (CRF)
CRF is a conditional probabilistic modeling method, which can take context into
account [20]. Passos et al. [25] implemented a stacked linear-chain CRF system
for NER that makes use of shallow word features in their baseline model, along
with gazetteers, and then compared the results when adding Word2Vec embed-
dings [24] and Brown clusters [8], among other things. Their best performing
model achieved an F1score of 90.9 on the English CoNLL-03 test set.
We implemented a model inspired by this baseline system. Our model uses
the following word features:
the word lower-cased
word suffixes, length 1 to 4
a boolean for whether the word is all in uppercase letters
a boolean for whether the first letter of the word is in uppercase
a boolean for whether the word is a digit
all the character n-grams within the word, of length 2 to 5
the four words prior to the word
6 S. Ingólfsdóttir et al.
the four words after the word
These parameters were mostly found by trial and error. If a parameter did
not improve the model, or if removing a parameter made no difference on the
validation set, it was discarded. As an example, suffixes are used, but not prefixes,
since the prefixes did not contribute any difference. The gazetteers used for the
corpus annotation stage were reused for this model, and each list had its own
boolean parameter for whether the word appeared in the list.
4.2 IXA-pipe-nerc
In a survey on different NER architectures [36], the best performing non-neural
model was IXA-pipe-nerc, a NER module which is a part of the IXA pipes NLP
tool [1]. IXA-pipe-nerc is based on a perceptron model and utilizes both shallow
local word features and semi-supervised word clusters [2]. It supports including
PoS tags as features, as well as a gazetteer lookup. The clustering features used
were Brown [8], Clark [11], and Word2Vec [24] clusters. The software is open
source and language-independent, making it straightforward to train a model
for a new language. The supported local word features are:
current lower-cased token
token shape
the previous prediction for the word
whether it is the first token in the sentence
both the prefixes and suffixes of the token, with default of length 4
word bigrams and trigrams, which both include the current token and the
token shape
all the character n-grams within the word, with default of length 2 to 5
Evaluation shows that the biggest performance boost comes from the word
clusters. The best model in [2] achieved an F1score of 91.36 on the English
CoNLL-03 test set.
The configuration for our model was chosen by experimenting on the valida-
tion set, selecting the features that gave the best results. In our model, we used
the default values for the features enumerated above, except we used character
n-grams of length 1 up to 11, as well as the word trigrams, which are disabled by
default. We trained the three types of word clusters on the Icelandic Gigaword
Corpus (IGC) [30], a corpus of around 1.4 billion words of Icelandic texts from
various sources.
4.3 BiLSTM with Pre-trained Word Embeddings
For the BiLSTM experiments, we used a program called NeuroNER [14]. Neuro-
NER is described as an easy-to-use program for training models to recognize
and classify NEs. It uses TensorFlow3for training the neural networks and has
3https://www.tensorflow.org/
Named Entity Recognition for Icelandic: Annotated Corpus and Models 7
been reported to reach an F1of 90.5 on the English CoNLL-03 dataset. Neuro-
NER is divided into three layers. First, a BiLSTM layer maps each token to
a vector representation using two types of embeddings: a word embedding and
a character-level token embedding. The resulting embeddings are fed into the
second layer, a BiLSTM which outputs the sequence of vectors containing the
probability of each label for each corresponding token. Finally, a CRF decoding
layer outputs the most likely sequence of predicted labels based on the output
from the previous label prediction layer.
Instead of implicitly learning the word embeddings, NeuroNER offers the pos-
sibility to incorporate external word embeddings, pre-trained on a larger dataset.
We have provided external word embeddings, trained on the 2018 version of the
IGC. For the sake of comparison, Word2Vec [24], GloVe [26], as well as Fast-
Text [7] embeddings were tested. GloVe embeddings turned out to give slightly
better results than the other two, so they were used in the models presented in
our results. For the GloVe embeddings, dimensions were set at 300, window size
at 10, and the minimum term count at 5. NeuroNER was configured with the
following main parameters:
character embedding dimension = 25
character lstm dimension = 25
dropout = 0.5
patience = 10
maximum number of epochs = 100
optimizer = stochastic gradient descent
learning rate = 0.005
4.4 CombiTagger
Different ML models may have different strengths and weaknesses depending on
the methods used to train them. For NER, some may, for example, work better
on the more regular numerical entities, while others may be better at overcoming
misspellings in the text. One way of leveraging the strengths of different models,
for an increased overall performance, is using a voting system for the output
tags. CombiTagger [17], a system originally developed for combining different
PoS taggers, offers simple and weighted voting. In our work, we fed our three
best NER model outputs into CombiTagger and applied simple voting.
5 Results and Discussion
In this section, we present and discuss the results from training on MIM-GOLD-
NER using the methods described in Section 4.
For the total corpus of 1 million tokens, the split between training, validation
and test sets was 80%, 10% and 10%, respectively, which gave a training set of
around 800,000 tokens, and validation and test sets of around 100,000 tokens
each. We also trained on five different sizes of the corpus, keeping the test set
8 S. Ingólfsdóttir et al.
intact in all experiments to maintain consistency in the evaluation. The CoNLL-
03 evaluation metrics are used, meaning that both the type and the boundaries
of a predicted NE need to match the gold label for it to count as correct.
Table 2. F1for the different models, in addition to CombiTagger, which combines the
output from the three best models.
Overall PER LOC ORG MISC DATE TIME MON PERC
BiLSTM-internal 73.60 80.11 74.98 65.82 44.10 86.73 91.83 81.86 92.73
CRF 82.24 87.18 86.04 77.02 53.87 90.90 94.46 84.30 98.21
IXA-pipe-nerc 83.10 87.90 85.54 77.02 61.57 91.96 94.29 85.59 97.35
BiLSTM-GloVE 83.90 89.53 85.45 79.03 61.77 90.60 94.78 89.45 95.54
CombiTagger 85.79 90.19 88.21 81.23 64.27 93.13 96.41 86.58 98.65
In Table 2, the results from training the different models on the whole corpus
are presented. Before considering the best performing method, CombiTagger,
we will discuss how the CRF, IXA-pipe-nerc, and BiLSTM models performed
overall, and on each NE type.
It is not surprising that the most advanced model, BiLSTM, outperforms
the other two in overall score, but it is interesting to see how close behind
the IXA-pipe-nerc model comes – by only 0.80 percentage points. Furthermore,
IXA-pipe-nerc, which uses externally trained (semi-supervised) word clusters,
only improves from the CRF by 0.86 percentage points. In [36], the IXA-pipe-
nerc model also obtained marginally better scores (0.46 percentage points) than
a CRF model, when evaluated on English.
Looking at the individual entity types, we see that the BiLSTM outperforms
the other two in five out of eight types, which should not come as a surprise. What
was more unexpected, however, is that the CRF, despite scoring somewhat worse
overall, outperforms the IXA-pipe-nerc model in two categories, location and
percent. The reason for the high-scoring location type may be the gazetteer
lookup, implemented as part of the CRF model, since the place names gazetteer
was quite exhaustive. The IXA-pipe-nerc model, however, gives the best results
for date, outperforming the others by over 1 percentage point.
Note the effect of using pre-trained word embeddings as external input into
the BiLSTM. BiLSTM-internal was trained without these external word em-
beddings, i.e. word and character embeddings were trained internally using the
training data itself. In contrast, BiLSTM-GloVe uses externally trained GloVe
embeddings. We attribute this gain from using the pre-trained word vectors to
the fact that Icelandic, being a highly inflected language, has so many surface
forms for any lemma, that even though during training the network has seen one
surface form of a word, it doesn’t know the next time it sees a different surface
form that it is the same word. Incorporating a pool of word vectors trained from
a corpus of 1.4 billion tokens gives the network access to information on many
different word forms.
Named Entity Recognition for Icelandic: Annotated Corpus and Models 9
The CombiTagger ensemble method improves the overall results, with an F1
score of 85.79, which is better than any of the individual models. When it is
compared with our best NER model, we see a 1.89 percentage points improve-
ment in overall F1, and 2.69 points when compared with the second best model.
The observed results show that while the three different models are not equal
in quality, each one of them is better than the other two at predicting some
particular NE type. The different models thus tend to produce different (com-
plementary) errors and the differences can be exploited to yield better results.
Therefore, CombiTagger outperforms all the others on all but one NE types.
Table 3 shows that training on increasing sizes of the data gradually improves
the performance. However, even with a dataset of 540,000 tokens, the overall F1
is 82.37, and 85.17 in the 720,000 token dataset, which is not far behind from
the result obtained on the whole corpus. Furthermore, preliminary experiments
with training on a subset of the text types, containing only news texts, indicate
that this may be a viable approach, especially if the intended use is within a
particular domain.
Table 3. F1scores for CombiTagger on different sizes of the data. For clarification of
the model names, CombiTagger-180K stands for a corpus size of around 180,000 tokens,
with a training data size of around 160,000 tokens, validation set size of approximately
20,000 tokens, and the consistent test set size of 100,693 tokens (10% of each corpus
size was reserved for the test set).
Overall PER LOC ORG MISC DATE TIME MON PERC
CombiTagger-180K 76.67 85.50 81.87 69.18 46.34 79.60 89.75 71.50 96.46
CombiTagger-360K 79.14 88.24 82.13 71.59 54.11 84.34 86.89 78.10 97.78
CombiTagger-540K 82.37 89.23 83.75 76.00 58.92 89.77 91.97 87.67 98.20
CombiTagger-720K 85.16 89.19 87.19 81.96 61.16 92.59 95.40 90.35 97.78
CombiTagger-900K 85.79 90.19 88.21 81.23 64.27 93.13 96.41 86.58 98.65
6 Conclusion
We have described the annotation of the first NER corpus for Icelandic and the
initial experiments on using the data for training and evaluating Icelandic NER
models. This corpus, with 48,371 NEs tagged in 1 million tokens, is one of the
largest manually annotated NER corpora we have come across in the literature,
and includes a variety of text types that have been annotated for eight common
entity types.
Several different model architectures and training set sizes were tested on
the data, and an ensemble method using simple voting from three models was
shown to perform considerably better than any individual model. These results
are presented without any post-processing, such as a gazetteer lookup, commonly
used to boost NER results. The morphological intricacies of Icelandic make NER
10 S. Ingólfsdóttir et al.
a nuanced problem, but based on these first results we are optimistic about
obtaining higher scores with more advanced models in the future, e.g. by using
BERT-type models.
Acknowledgments
This work was funded by the Icelandic Strategic Research and Development
Programme for Language Technology 2019, grant no. 180027-5301.
References
1. Agerri, R., Bermudez, J., Rigau, G.: IXA pipeline: Efficient and Ready to Use
Multilingual NLP tools. In: Proceedings of the Ninth International Conference on
Language Resources and Evaluation. LREC 2014, Reykjavik, Iceland (2014)
2. Agerri, R., Rigau, G.: Robust Multilingual Named Entity Recognition with Shallow
Semi-Supervised Features. arXiv e-prints arXiv:1701.09123 (2017)
3. Ahmed, I., Sathyaraj, R.: Named Entity Recognition by Using Maximum Entropy.
International Journal of Database Application and Theory 8, 43–50 (2015), https:
//doi.org/10.14257/ijdta.2015.8.2.05
4. Arkhipov, M., Trofimova, M., Kuratov, Y., Sorokin, A.: Tuning Multilingual Trans-
formers for Language-Specific Named Entity Recognition. In: Proceedings of the
7th Workshop on Balto-Slavic Natural Language Processing. Florence, Italy (2019),
https://doi.org/10.18653/v1/W19-3712
5. Baevski, A., Edunov, S., Liu, Y., Zettlemoyer, L., Auli, M.: Cloze-driven Pretrain-
ing of Self-attention Networks. In: Proceedings of the 2019 Conference on Em-
pirical Methods in Natural Language Processing and the 9nd International Joint
Conference on Natural Language Processing. EMNLP/IJCNLP, Hong Kong, China
(2019), https://doi.org/10.18653/v1/D19-1539
6. Bjarnadóttir, K.: The Database of Modern Icelandic Inflection. In: Proceedings of
the “Language Technology for Normalisation of Less-Resourced Languages” (SaLT-
MiL 8 – AfLaT2012), workshop at the 8th International Conference on Language
Resources and Evaluation. LREC 2012, Istanbul, Turkey (2012)
7. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching Word Vectors with
Subword Information. Transactions of the Association for Computational Linguis-
tics 5, 135–146 (2017), https://doi.org/10.1162/tacl_a_00051
8. Brown, P.F., deSouza, P.V., Mercer, R.L., Pietra, V.J.D., Lai, J.C.: Class-Based
n-Gram Models of Natural Language. Computational Linguistics 18(4), 467–479
(1992), https://www.aclweb.org/anthology/J92-4003
9. Chinchor, N., Brown, E., Ferro, L., Robinson, P.: Named Entity Recognition Task
Definition. Tech. Rep. Version 1.4, The MITRE Corporation and SAIC (1999)
10. Chiticariu, L., Krishnamurthy, R., Li, Y., Reiss, F., Vaithyanathan, S.: Domain
Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks. In:
Proceedings of the Conference on Empirical Methods in Natural Language Process-
ing. EMNLP 2010, Cambridge, MA, USA (2010), http://aclweb.org/anthology/
D10-1098
11. Clark, A.: Combining Distributional and Morphological Information for Part of
Speech Induction. In: Proceedings of the 10th Conference of the European Chapter
of the Association for Computational Linguistics. EACL 2003, Budapest, Hungary
(2003), https://www.aclweb.org/anthology/E03-1009
Named Entity Recognition for Icelandic: Annotated Corpus and Models 11
12. Demir, H., Özgür, A.: Improving Named Entity Recognition for Morphologically
Rich Languages Using Word Embeddings. In: Proceedings of the 13th International
Conference on Machine Learning and Applications. ICMLA 2013, Miami, FL, USA
(2014), https://doi.org/10.1109/ICMLA.2014.24
13. Derczynski, L., Field, C.V., Bøgh, K.S.: DKIE: Open Source Information Extrac-
tion for Danish. In: Proceedings of the Demonstrations at the 14th Conference of
the European Chapter of the Association for Computational Linguistics. EACL
2014, Gothenburg, Sweden (2014), https://doi.org/10.3115/v1/E14-2016
14. Dernoncourt, F., Lee, J.Y., Szolovits, P.: NeuroNER: an easy-to-use program for
named-entity recognition based on neural networks. In: Proceedings of the Con-
ference on Empirical Methods in Natural Language Processing. EMNLP 2017,
Copenhagen, Denmark (2017)
15. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. In: Proceedings of the
2019 Conference of the North American Chapter of the Association for Compu-
tational Linguistics: Human Language Technologies, Volume 1 (Long and Short
Papers). NAACL, Minneapolis, MN, USA (2019), https://doi.org/10.18653/
v1/N19-1423
16. Grishman, R., Sundheim, B.: Message Understanding Conference-6: A Brief His-
tory. In: Proceedings of the 16th Conference on Computational Linguistics – Vol-
ume 1. COLING 1996, Copenhagen, Denmark (1996), https://www.aclweb.org/
anthology/C96-1079/
17. Henrich, V., Reuter, T., Loftsson, H.: CombiTagger: A System for Develop-
ing Combined Taggers. In: Proceedings of the 22nd International FLAIRS Con-
ference, Special Track: “Applied Natural Language Processing”. Sanibel Island,
FL, USA (2009), https://www.aaai.org/ocs/index.php/FLAIRS/2009/paper/
viewFile/67/296
18. Ingólfsdóttir, S.L., Þorsteinsson, S., Loftsson, H.: Towards High Accuracy Named
Entity Recognition for Icelandic. In: Proceedings of the 22nd Nordic Conference
on Computational Linguistics. NoDaLiDa 2019, Turku, Finland (2019), https:
//www.aclweb.org/anthology/W19-6142
19. Johansen, B.: Named-Entity Recognition for Norwegian. In: Proceedings of the
22nd Nordic Conference on Computational Linguistics. NoDaLiDa 2019, Turku,
Finland (2019), https://www.aclweb.org/anthology/W19-6123
20. Lafferty, J.D., McCallum, A.K., Pereira, F.C.N.: Conditional Random Fields: Prob-
abilistic Models for Segmenting and Labeling Sequence Data. In: Proceedings
of the Eighteenth International Conference on Machine Learning. ICML 2001,
Williamstown, MA, USA (2001)
21. Liu, L., Shang, J., Han, J.: Arabic Named Entity Recognition: What Works and
What’s Next. In: Proceedings of the Fourth Arabic Natural Language Processing
Workshop. Florence, Italy (2019), https://doi.org/10.18653/v1/W19-4607
22. Loftsson, H., Rögnvaldsson, E.: IceNLP: A natural language processing toolkit for
Icelandic. In: Proceedings of the Annual Conference of the International Speech
Communication Association. Antwerp, Belgium (2007)
23. Loftsson, H., Yngvason, J.H., Helgadóttir, S., Rögnvaldsson, E.: Developing a PoS-
tagged corpus using existing tools. In: Proceedings of “Creation and use of basic
lexical resources for less-resourced languages”, workshop at the 7th International
Conference on Language Resources and Evaluation. LREC 2010, Valetta, Malta
(2010)
24. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Rep-
resentations in Vector Space. arXiv e-prints arXiv:1301.3781 (2013)
12 S. Ingólfsdóttir et al.
25. Passos, A., Kumar, V., McCallum, A.: Lexicon infused phrase embeddings for
named entity resolution. In: Proceedings of the Eighteenth Conference on Compu-
tational Natural Language Learning. CoNLL 2014, Ann Arbor, Michigan (2014),
https://doi.org/10.3115/v1/W14-1609
26. Pennington, J., Socher, R., Manning, C.: Glove: Global Vectors for Word Rep-
resentation. In: Proceedings of the Conference on Empirical Methods in Natural
Language Processing. EMNLP 2014, Doha, Qatar (2014), https://www.aclweb.
org/anthology/D14-1162/
27. Plank, B.: Neural Cross-Lingual Transfer and Limited Annotated Data for Named
Entity Recognition in Danish. In: Proceedings of the 22nd Nordic Conference on
Computational Linguistics. NoDaLiDa 2019, Turku, Finland (2019), https://www.
aclweb.org/anthology/W19-6143
28. Santos, D., Freitas, C., Gonçalo Oliveira, H., Carvalho, P.: Second HAREM: New
Challenges and Old Wisdom. In: Computational Processing of the Portuguese Lan-
guage, 8th International Conference, Proceedings. PROPOR 2008, Aveiro, Portugal
(2008), https://doi.org/10.1007/978-3-540-85980- 2_22
29. Santos, D., Seco, N., Cardoso, N., Vilela, R.: HAREM: An Advanced NER Eval-
uation Contest for Portuguese. In: Proceedings of the Fifth International Confer-
ence on Language Resources and Evaluation. LREC 2006, Genoa, Italy (2006),
http://www.lrec-conf.org/proceedings/lrec2006/pdf/59_pdf.pdf
30. Steingrímsson, S., Helgadóttir, S., Rögnvaldsson, E., Barkarson, S., Gudnason, J.:
Risamálheild: A Very Large Icelandic Text Corpus. In: Proceedings of the Eleventh
International Conference on Language Resources and Evaluation. LREC 2018,
Miyazaki, Japan (2018), https://www.aclweb.org/anthology/L18-1690
31. Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 Shared
Task: Language-Independent Named Entity Recognition. In: Proceedings of the
Conference on Computational Natural Language Learning. CoNLL 2003, Edmon-
ton, Canada (2003), https://www.aclweb.org/anthology/W03-0419
32. Tryggvason, A.: Named Entity Recognition for Icelandic. Research report, Reykja-
vik University (2009)
33. Virtanen, A., Kanerva, J., Ilo, R., Luoma, J., Luotolahti, J., Salakoski, T., Gin-
ter, F., Pyysalo, S.: Multilingual is not enough: BERT for Finnish. arXiv e-prints
arXiv:1912.07076 (2019)
34. Weegar, R., Pérez, A., Casillas, A., Oronoz, M.: Recent advances in Swedish and
Spanish medical entity recognition in clinical texts using deep neural approaches.
BMC Medical Informatics and Decision Making 19 (2019)
35. Wu, Y.C., Fan, T.K., Lee, Y.S., Yen, S.J.: Extracting Named Entities Using Sup-
port Vector Machines. In: Bremer, E.G., Hakenberg, J., Han, E.H.S., Berrar, D.,
Dubitzky, W. (eds.) Knowledge Discovery in Life Science Literature. pp. 91–103.
Springer Berlin Heidelberg, Berlin, Heidelberg (2006), https://doi.org/10.1007/
11683568_8
36. Yadav, V., Bethard, S.: A Survey on Recent Advances in Named Entity Recognition
from Deep Learning models. In: Proceedings of the 27th International Conference
on Computational Linguistics. COLING 2018, Santa Fe, NM, USA (2018), https:
//www.aclweb.org/anthology/C18-1182.pdf
... A version of this corpus, MIM-GOLD-NER, in which named entities (NEs) have been annotated, has also been made available. In MIM-GOLD-NER, about 48,000 Icelandic NEs are tagged with one of eight NE types (Ingólfsdóttir et al., 2020). ...
... • Named Entity Recognizer: In parallel to the construction of MIM-GOLD-NER (see Section 3.1), three different machine learning models were evaluated (Ingólfsdóttir et al., 2020). The best performing model was a bidirectional LSTM (BiLSTM) model, obtaining an overall F 1 -score of 83.9, for the entities Person, Organization, and Location. ...
Conference Paper
Full-text available
In this paper we describe how a fairly new CLARIN member is building a broad collection of national language resources for use in language technology (LT). As a CLARIN C-centre,CLARIN-IS is hosting metadata for various text and speech corpora, lexical resources, software packages and models. The providers of the resources are universities, institutions and private companies working on a national LT infrastructure initiative, Language Technology Programme for Icelandic. All deliverables of the programme are published under open licences and are freely accessible for research as well as commercial use. We provide a broad overview of the available repositories and the core publishing guidelines.
... Latest experiments show an accuracy of 98.9% on known word-forms and 91.7% on unknown word-forms. A named entity recognizer based on a fine-tuned ELECTRA-Base model, trained on the IGC, is in development, newest experiments showing an F 1 -score of 91.9%, a dramatic improvement from previous models (Ingólfsdóttir et al., 2020). Two previously published parsers have been updated within the LTPI, a full parser and a shallow parser. ...
Conference Paper
Full-text available
In this paper, we describe how a fairly new CLARIN member is building a broad collection of national language resources for use in language technology (LT). As a CLARIN C-centre, CLARIN-IS is hosting metadata for various text and speech corpora, lexical resources, software packages and models. The providers of the resources are universities, institutions and private companies working on a national (Icelandic) LT infrastructure initiative.
... Lemmatization is often employed as a preprocessing step for NLP tasks in Icelandic, as it dramatically reduces the size of the vocabulary. Although it has been shown to be beneficial for tasks such as named entity recognition (Ingólfsdóttir et al., 2020), information extraction (Steingrímsson et al., 2020) and machine translation (Barkarson and Steingrímsson, 2019), previous experiments with non-neural network-based models failed to show any improvement for extractive text summarization. We find that the same holds true for neural network-based models. ...
Chapter
The definition of the smart city has changed over time and is expected to continue to change. For example, Japan stepped into the Society 5.0 vision, which encompasses robotics, artificial intelligence, the Internet of Things, and big data. Society 5.0 is Japan’s national vision within the fifth Science and Technology Fundamental Plan, which aims to realize a data-driven, human-centered, next-generation society (Kobashi et al., 2020; Zhang et al., 2020). This vision means that everyone, regardless of location, will benefit from innovation and advances in technology, including the elderly population in rural areas. Community 5.0 emphasizes solving all kinds of social problems while harmonizing sustainability and economic growth. Therefore, Community 5.0 guides cities of all sizes on how to build smart infrastructure. It is no longer just about new technologies, but also about understanding a community’s problems and providing appropriate solutions through these technologies. In the context of the smart city, this approach is called “problem-solution-oriented smart city” (Neirotti et al., 2014), and it is the basis for achieving the greater goals of Community 5.0.
Article
Full-text available
Background: Text mining and natural language processing of clinical text, such as notes from electronic health records, requires specific consideration of the specialized characteristics of these texts. Deep learning methods could potentially mitigate domain specific challenges such as limited access to in-domain tools and data sets. Methods: A bi-directional Long Short-Term Memory network is applied to clinical notes in Spanish and Swedish for the task of medical named entity recognition. Several types of embeddings, both generated from in-domain and out-of-domain text corpora, and a number of generation and combination strategies for embeddings have been evaluated in order to investigate different input representations and the influence of domain on the final results. Results: For Spanish, a micro averaged F1-score of 75.25 was obtained and for Swedish, the corresponding score was 76.04. The best results for both languages were achieved using embeddings generated from in-domain corpora extracted from electronic health records, but embeddings generated from related domains were also found to be beneficial. Conclusions: A recurrent neural network with in-domain embeddings improved the medical named entity recognition compared to shallow learning methods, showing this combination to be suitable for entity recognition in clinical text for both languages.
Conference Paper
Full-text available
We present Risamálheild, the Icelandic Gigaword Corpus (IGC), a corpus containing more than one billion running words from mostly contemporary texts. The work was carried out with minimal amount of work and resources, focusing on material that is not protected by copyright and sources which could provide us with large chunks of text for each cleared permission. The two main sources considered were therefore official texts and texts from news media. Only digitally available texts are included in the corpus and formats that can be problematic are not processed. The corpus texts are morphosyntactically tagged and provided with metadata. Processes have been set up for continuous text collection, cleaning and annotation. The corpus is available for search and download with permissive licenses. The dataset is intended to be clearly versioned with the first version released in early 2018. Texts will be collected continually and a new version published every year.
Conference Paper
Full-text available
We present a multilingual Named Entity Recognition approach based on a robust and general set of features across languages and datasets. Our system combines shallow local information with clustering semi-supervised features induced on large amounts of unlabeled text. Understanding via empiricalexperimentation how to effectively combine various types of clustering features allows us to seamlessly export our system to other datasets and languages. The result is a simple but highly competitive system which obtains state of the art results across five languages and twelve datasets. The results are reported on standard shared task evaluation data such as CoNLL for English, Spanish and Dutch. Furthermore, and despite the lack of linguistically motivated features, we also report best results for languages such as Basque and German. In addition, we demonstrate that our method also obtains very competitive results even when the amount of supervised data is cut by half, alleviating the dependency on manually annotated data. Finally, the results show that our emphasis on clustering features is crucial to develop robust out-of-domain models. The system and models are freely available to facilitate its use and guarantee the reproducibility of results.
Article
Full-text available
Named-entity recognition (NER) aims at identifying entities of interest in a text. Artificial neural networks (ANNs) have recently been shown to outperform existing NER systems. However, ANNs remain challenging to use for non-expert users. In this paper, we present NeuroNER, an easy-to-use named-entity recognition tool based on ANNs. Users can annotate entities using a graphical web-based user interface (BRAT): the annotations are then used to train an ANN, which in turn predict entities' locations and categories in new texts. NeuroNER makes this annotation-training-prediction flow smooth and accessible to anyone.
Conference Paper
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
Article
Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Many popular models to learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for morphologically rich languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skip-gram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram, words being represented as the sum of these representations. Our method is fast, allowing to train models on large corpus quickly. We evaluate the obtained word representations on five different languages, on word similarity and analogy tasks.