ChapterPDF Available

Named Entity Recognition for Icelandic: Annotated Corpus and Models

September 2020

September 2020

DOI:10.1007/978-3-030-59430-5_4

In book: Statistical Language and Speech Processing (pp.46-57)

Authors:

Svanhvít Lilja Ingólfsdóttir

Reykjavik University

Ásmundur A. Guðjónsson

Reykjavik University

Hrafn Loftsson

Reykjavik University

Named entity recognition (NER) can be a challenging task, especially in highly inflected languages where each entity can have many different surface forms. We have created the first NER corpus for Icelandic by annotating 48,371 named entities (NEs) using eight NE types, in a text corpus of 1 million tokens. Furthermore, we have used the corpus to train three machine learning models: first, a CRF model that makes use of shallow word features and a gazetteer function; second, a perceptron model with shallow word features and externally trained word clusters; and third, a BiLSTM model with external word embeddings. Finally, we applied simple voting to combine the model outputs. The voting method obtains an \(F_{1}\) score of 85.79, gaining 1.89 points compared to the best performing individual model. The corpus and the models are publicly available.

NE split in the MIM-GOLD-NER corpus. Entity type Total NE count Percentage

…

F1 for the different models, in addition to CombiTagger, which combines the output from the three best models.

…

Figures - uploaded by Hrafn Loftsson

Content may be subject to copyright.

Content uploaded by Hrafn Loftsson

Content may be subject to copyright.

Named Entity Recognition for Icelandic:

Annotated Corpus and Models

Svanhvít L. Ingólfsdóttir[0000−0001−6965−692X],

Ásmundur A. Guðjónsson[0000−0002−7790−664X], and

Hrafn Loftsson[0000−0002−9298−4830]

Language and Voice Lab

Reykjavik University

Menntavegur 1, 101 Reykjavík, Iceland

svanhviti16@ru.is,asmundur10@ru.is,hrafn@ru.is

https://lvl.ru.is

Abstract. Named entity recognition (NER) can be a challenging task,

especially in highly inﬂected languages where each entity can have many

diﬀerent surface forms. We have created the ﬁrst NER corpus for Ice-

landic by annotating 48,371 named entities (NEs) using eight NE types,

in a text corpus of 1 million tokens. Furthermore, we have used the cor-

pus to train three machine learning models: ﬁrst, a CRF model that

makes use of shallow word features and a gazetteer function; second,

a perceptron model with shallow word features and externally trained

word clusters; and third, a BiLSTM model with external word embed-

dings. Finally, we applied simple voting to combine the model outputs.

The voting method obtains an F1score of 85.79, gaining 1.89 percentage

points compared to the best performing individual model. The corpus

and the models are publicly available.

Keywords: Named entity recognition ·Corpus annotation ·BiLSTM ·

CRF ·Clustering ·Machine learning

1 Introduction

Since the integration of named entity recognition (NER) into the sixth Message

Understanding Conference (MUC) in 1995 [16], NER has become recognized as

an important task in natural language processing (NLP), and NER datasets and

methods have been developed for many languages. Detecting and recognizing

named entities (NEs) in text is not a trivial task, as various patterns need to

be learned, and new proper names appear frequently. Furthermore, in highly

inﬂected languages, such as Icelandic, each proper name can have many diﬀerent

surface forms, which further complicates the task.

No NER corpus was available for the Icelandic language before the work in-

troduced in [18], where a sample of 200,000 tokens, from the MIM-GOLD corpus

of 1 million tokens [23], was annotated with four NE types and used for training

a NER prototype. In this paper, we describe the completion of the annotation

2 S. Ingólfsdóttir et al.

of the whole corpus, using eight NE types, which has resulted in 48,371 NEs.

The corpus, MIM-GOLD-NER, is available online1. Furthermore, we describe

the evaluation of three diﬀerent models trained on the corpus: ﬁrst, a Condi-

tional Random Field (CRF) model that makes use of shallow word features and

a gazetteer function; second, a perceptron model with shallow word features and

externally trained word clusters; and third, a bidirectional long short-term mem-

ory (BiLSTM) model with external word embeddings. Finally, we have combined

our model outputs using a simple voting method, obtaining an F1score of 85.79

and gaining 1.89 percentage points compared to the best performing individual

model. The code for our models is available online2.

2 Related Work

The most commonly annotated entity types in NER corpora in the general do-

main are person,location and organization, both due to the fact that these

entities are very common in general texts and newswire texts, which are often

the source of NER corpora, and that they contain information that may be valu-

able for a variety of purposes. The three generic NE types were ﬁrst proposed

for the MUC-6 event as a subtask called enamex [16]. The other subtasks in

MUC-6 are named timex (dates and times) and numex (monetary values and

percentages). In the CoNLL shared task, a fourth category, miscellaneous

was introduced, which covers proper names that fall outside of the three classic

categories, person,location and organization [31].

While rule-based methods for NER can be highly eﬃcient, especially in a

well-deﬁned domain [10], machine learning (ML) methods took over from rule-

based method as the main approach to NER as the ﬁeld developed and data

became more readily available. These range from unsupervised to fully super-

vised methods (e.g. [3, 25, 35]), as well as hybrid systems, combining various

supervised and unsupervised methods and carefully engineered features, e.g. [2].

Although the majority of the early work on NER was conducted on the

English language, NER corpora and published work is now available for various

diﬀerent languages. This is an important development, because of the many

structural, morphological, and orthographic features that characterize diﬀerent

languages. Enriching the input representation with pre-trained word embeddings

has, for example, proven useful for NER in languages such as Turkish [12] and

Arabic [21], as have the larger and more complex language models that have

recently become popular, such as the Finnish [33] and Slavic [4] BERT models.

Various published work exists for the languages most related to Icelandic,

i.e. the Scandinavian languages. The most recent work for Swedish includes a

BiLSTM model [34]. A Danish NER corpus is presented in [13], but some of the

latest research for Danish NER focuses on cross-lingual transfer, to make up for

the limited data available [27]. For Norwegian, the most recent work includes

[19], which involves new annotated NER datasets for the two written forms of

1http://hdl.handle.net/20.500.12537/42

2http://github.com/cadia-lvl/NER

Named Entity Recognition for Icelandic: Annotated Corpus and Models 3

Norwegian, Bokmål and Nynorsk, and BiLSTM models enriched with pre-trained

word embeddings.

Limited work exists on NER for the Icelandic language, most likely due to

the lack of an annotated NER corpus. IceNER, a rule-based NER system is part

of the IceNLP toolkit [22]. It has been reported to reach an F1score of 71.5

without querying a gazetteer, and 79.3 with a gazetteer lookup [32].

3 Corpus Annotation

As a basis for our NER corpus, we used MIM-GOLD, the Icelandic Gold Stan-

dard corpus [23], a balanced text corpus of approximately 1 million tokens,

tagged with part-of-speech (PoS). All the texts are from the years 2000–2009 and

are sourced from thirteen text types, including news texts, books, blogs, web-

sites, laws, adjudications, school essays, scripted radio news, web media texts

and emails. For most languages, this variety of text genres is not common in

NER corpora, which are often centered on newswire texts (with some excep-

tions, such as the Portuguese HAREM NER contests [29, 28]). Nevertheless, our

corpus, MIM-GOLD-NER, is heavy on news-related content, as newspaper arti-

cles, web media, and radio news account for 36% of the tokens.

3.1 Annotation Process

Eight NE types are tagged in MIM-GOLD-NER: person,location,organi-

zation,miscellaneous,date,time,money and percentage. The ﬁrst four

entity types are the same as used in the CoNLL shared tasks. The last four NE

types were adapted from the numex and timex types in the MUC events.

We applied a semiautomatic approach when annotating the corpus, using

gazetteers and regular expressions to extract as many entities as possible before

reviewing and correcting the corpus manually.

Preprocessing Gazetteers were collected from oﬃcial Icelandic resources. For

extracting the person names, we used the Database of Modern Icelandic Inﬂec-

tions (DMII) [6]. Additionally, we collected lists of place names and addresses,

as well as company names. In the end, we had gazetteers with 15,000 person

names, 97,000 location names, and 90,000 organization names. Since the origi-

nal MIM-GOLD is PoS-tagged, we were able to use the PoS tags to ﬁlter out

likely proper nouns and match them with our gazetteers, to produce NE candi-

dates. Heuristics and knowledge of the language were used to resolve doubts and

ambiguities and to try to determine NE boundaries. Remaining ambiguities were

registered to be resolved manually. Regular expressions were used for automat-

ically extracting the numerical entities, taking care to match entities regardless

of how they are written out in the corpus, whether numerically or alphabetically.

4 S. Ingólfsdóttir et al.

Manual review After the automatic preprocessing step, the next task was

reviewing the resulting annotations, ﬁxing errors and picking up any remaining

NEs missed in the previous step, since some inaccuracies were bound to appear,

both in the classiﬁcation and the span (boundaries) of the entities.

Guidelines were constructed regarding the taxonomy used for the annotation

of the corpus. For the four NE types adapted from CoNLL, we mostly relied on

the CoNLL guidelines for each entity type [9], though with some modiﬁcations to

ﬁt Icelandic settings and writing conventions. The four numerical and temporal

entity types, date,time,money, and percent, appeared less commonly in the

corpus, and were easier to ﬁnd using regular expressions. The MUC guidelines

were followed as closely as possible when annotating these entities.

One implication of using a balanced corpus such as MIM-GOLD, is that only

parts of it have been reviewed/edited, so the texts vary quite a lot in writing

quality; some have been thoroughly proofread (published books, laws and adju-

dications), some have undergone some editing (news articles, some web content,

scripted texts for radio), and some have not been edited at all (blogs, emails,

web content, classiﬁed newspaper ads). The corpus contained many problematic

NE candidates, for which annotation was not clear. These were marked specially

during the annotation and resolved at the end, to keep consistency within the

corpus. In some parts of the corpus, such as laws and regulatory texts, NEs are

sparsely distributed, but we have nonetheless opted for annotating the whole

corpus without removing sentences unlikely to contain NEs. This was done both

to keep the organic distribution of NEs in the corpus and to ensure compatibility

with the original MIM-GOLD corpus.

One annotator (the ﬁrst author of this paper) was in charge of deﬁning the

annotation task and labeling the bulk of the corpus. This was done part-time

in the course of a year. A second annotator (the second author) stepped in to

help in the last weeks, and reviewed around 8% of the corpus. We estimate that

150-200 hours went into the manual annotation stage. The annotation resulted

in 48,371 NEs, split between the eight entity types as shown in Table 1.

Table 1. NE split in the MIM-GOLD-NER corpus.

Entity type Total NE count Percentage

Person 15,599 32.25%

Location 9,011 18.63%

Organization 8,966 18.54%

Miscellaneous 6,264 12.95%

Date 5,529 11.43%

Time 1,214 2.51%

Money 1,050 2.17%

Percent 738 1.53%

Total 48,371

Named Entity Recognition for Icelandic: Annotated Corpus and Models 5

Once the annotation was ﬁnished, an external linguist reviewed a randomly

chosen 10% sample of the corpus, to estimate the accuracy of the annotation. The

corpus sample contained a total of 4,527 NEs. The reviewer was presented with

the same instructions and guidelines as annotator 1 and annotator 2, and was

asked to mark any doubts or errors spotted in the corpus sample. Annotator 1

then reviewed this list of error candidates and, in accordance with the guidelines,

evaluated which of them were true positives, i.e. real errors that should be ﬁxed.

In the end, 250 error candidates were found, out of which 180 were marked as

false positives by annotator 1. These false positives were due to either doubts

that the reviewer had on how to annotate or lack of detail in the annotation

guidelines. Thus, the total number of real errors was 70, which is equivalent to

an accuracy of 98.45% for this corpus sample.

4 Models

Until recently, the main approach for NER has been the application of BiLSTM

models, along with the best hybrid feature-engineered ML systems [36]. Even

though these methods have mostly fallen in the shadow of state-of-the-art trans-

former models [5, 15], they do a good job of solving problems where the input is

sequential data, such as in NER. Since a transformer model does not yet exist for

Icelandic, we opted for these tried and tested methods for the ﬁrst experiments

with our new corpus.

Three diﬀerent methods were chosen for the experiments: a CRF model, a

perceptron model, and a BiLSTM model. The three models were then combined

into one ensemble NER tagger using simple voting. Before presenting our results,

we will brieﬂy describe each method.

4.1 Conditional Random Field (CRF)

CRF is a conditional probabilistic modeling method, which can take context into

account [20]. Passos et al. [25] implemented a stacked linear-chain CRF system

for NER that makes use of shallow word features in their baseline model, along

with gazetteers, and then compared the results when adding Word2Vec embed-

dings [24] and Brown clusters [8], among other things. Their best performing

model achieved an F1score of 90.9 on the English CoNLL-03 test set.

We implemented a model inspired by this baseline system. Our model uses

the following word features:

–the word lower-cased

–word suﬃxes, length 1 to 4

–a boolean for whether the word is all in uppercase letters

–a boolean for whether the ﬁrst letter of the word is in uppercase

–a boolean for whether the word is a digit

–all the character n-grams within the word, of length 2 to 5

–the four words prior to the word

6 S. Ingólfsdóttir et al.

–the four words after the word

These parameters were mostly found by trial and error. If a parameter did

not improve the model, or if removing a parameter made no diﬀerence on the

validation set, it was discarded. As an example, suﬃxes are used, but not preﬁxes,

since the preﬁxes did not contribute any diﬀerence. The gazetteers used for the

corpus annotation stage were reused for this model, and each list had its own

boolean parameter for whether the word appeared in the list.

4.2 IXA-pipe-nerc

In a survey on diﬀerent NER architectures [36], the best performing non-neural

model was IXA-pipe-nerc, a NER module which is a part of the IXA pipes NLP

tool [1]. IXA-pipe-nerc is based on a perceptron model and utilizes both shallow

local word features and semi-supervised word clusters [2]. It supports including

PoS tags as features, as well as a gazetteer lookup. The clustering features used

were Brown [8], Clark [11], and Word2Vec [24] clusters. The software is open

source and language-independent, making it straightforward to train a model

for a new language. The supported local word features are:

–current lower-cased token

–token shape

–the previous prediction for the word

–whether it is the ﬁrst token in the sentence

–both the preﬁxes and suﬃxes of the token, with default of length 4

–word bigrams and trigrams, which both include the current token and the

token shape

–all the character n-grams within the word, with default of length 2 to 5

Evaluation shows that the biggest performance boost comes from the word

clusters. The best model in [2] achieved an F1score of 91.36 on the English

CoNLL-03 test set.

The conﬁguration for our model was chosen by experimenting on the valida-

tion set, selecting the features that gave the best results. In our model, we used

the default values for the features enumerated above, except we used character

n-grams of length 1 up to 11, as well as the word trigrams, which are disabled by

default. We trained the three types of word clusters on the Icelandic Gigaword

Corpus (IGC) [30], a corpus of around 1.4 billion words of Icelandic texts from

various sources.

4.3 BiLSTM with Pre-trained Word Embeddings

For the BiLSTM experiments, we used a program called NeuroNER [14]. Neuro-

NER is described as an easy-to-use program for training models to recognize

and classify NEs. It uses TensorFlow3for training the neural networks and has

3https://www.tensorflow.org/

Named Entity Recognition for Icelandic: Annotated Corpus and Models 7

been reported to reach an F1of 90.5 on the English CoNLL-03 dataset. Neuro-

NER is divided into three layers. First, a BiLSTM layer maps each token to

a vector representation using two types of embeddings: a word embedding and

a character-level token embedding. The resulting embeddings are fed into the

second layer, a BiLSTM which outputs the sequence of vectors containing the

probability of each label for each corresponding token. Finally, a CRF decoding

layer outputs the most likely sequence of predicted labels based on the output

from the previous label prediction layer.

Instead of implicitly learning the word embeddings, NeuroNER oﬀers the pos-

sibility to incorporate external word embeddings, pre-trained on a larger dataset.

We have provided external word embeddings, trained on the 2018 version of the

IGC. For the sake of comparison, Word2Vec [24], GloVe [26], as well as Fast-

Text [7] embeddings were tested. GloVe embeddings turned out to give slightly

better results than the other two, so they were used in the models presented in

our results. For the GloVe embeddings, dimensions were set at 300, window size

at 10, and the minimum term count at 5. NeuroNER was conﬁgured with the

following main parameters:

–character embedding dimension = 25

–character lstm dimension = 25

–dropout = 0.5

–patience = 10

–maximum number of epochs = 100

–optimizer = stochastic gradient descent

–learning rate = 0.005

4.4 CombiTagger

Diﬀerent ML models may have diﬀerent strengths and weaknesses depending on

the methods used to train them. For NER, some may, for example, work better

on the more regular numerical entities, while others may be better at overcoming

misspellings in the text. One way of leveraging the strengths of diﬀerent models,

for an increased overall performance, is using a voting system for the output

tags. CombiTagger [17], a system originally developed for combining diﬀerent

PoS taggers, oﬀers simple and weighted voting. In our work, we fed our three

best NER model outputs into CombiTagger and applied simple voting.

5 Results and Discussion

In this section, we present and discuss the results from training on MIM-GOLD-

NER using the methods described in Section 4.

For the total corpus of 1 million tokens, the split between training, validation

and test sets was 80%, 10% and 10%, respectively, which gave a training set of

around 800,000 tokens, and validation and test sets of around 100,000 tokens

each. We also trained on ﬁve diﬀerent sizes of the corpus, keeping the test set

8 S. Ingólfsdóttir et al.

intact in all experiments to maintain consistency in the evaluation. The CoNLL-

03 evaluation metrics are used, meaning that both the type and the boundaries

of a predicted NE need to match the gold label for it to count as correct.

Table 2. F1for the diﬀerent models, in addition to CombiTagger, which combines the

output from the three best models.

Overall PER LOC ORG MISC DATE TIME MON PERC

BiLSTM-internal 73.60 80.11 74.98 65.82 44.10 86.73 91.83 81.86 92.73

CRF 82.24 87.18 86.04 77.02 53.87 90.90 94.46 84.30 98.21

IXA-pipe-nerc 83.10 87.90 85.54 77.02 61.57 91.96 94.29 85.59 97.35

BiLSTM-GloVE 83.90 89.53 85.45 79.03 61.77 90.60 94.78 89.45 95.54

CombiTagger 85.79 90.19 88.21 81.23 64.27 93.13 96.41 86.58 98.65

In Table 2, the results from training the diﬀerent models on the whole corpus

are presented. Before considering the best performing method, CombiTagger,

we will discuss how the CRF, IXA-pipe-nerc, and BiLSTM models performed

overall, and on each NE type.

It is not surprising that the most advanced model, BiLSTM, outperforms

the other two in overall score, but it is interesting to see how close behind

the IXA-pipe-nerc model comes – by only 0.80 percentage points. Furthermore,

IXA-pipe-nerc, which uses externally trained (semi-supervised) word clusters,

only improves from the CRF by 0.86 percentage points. In [36], the IXA-pipe-

nerc model also obtained marginally better scores (0.46 percentage points) than

a CRF model, when evaluated on English.

Looking at the individual entity types, we see that the BiLSTM outperforms

the other two in ﬁve out of eight types, which should not come as a surprise. What

was more unexpected, however, is that the CRF, despite scoring somewhat worse

overall, outperforms the IXA-pipe-nerc model in two categories, location and

percent. The reason for the high-scoring location type may be the gazetteer

lookup, implemented as part of the CRF model, since the place names gazetteer

was quite exhaustive. The IXA-pipe-nerc model, however, gives the best results

for date, outperforming the others by over 1 percentage point.

Note the eﬀect of using pre-trained word embeddings as external input into

the BiLSTM. BiLSTM-internal was trained without these external word em-

beddings, i.e. word and character embeddings were trained internally using the

training data itself. In contrast, BiLSTM-GloVe uses externally trained GloVe

embeddings. We attribute this gain from using the pre-trained word vectors to

the fact that Icelandic, being a highly inﬂected language, has so many surface

forms for any lemma, that even though during training the network has seen one

surface form of a word, it doesn’t know the next time it sees a diﬀerent surface

form that it is the same word. Incorporating a pool of word vectors trained from

a corpus of 1.4 billion tokens gives the network access to information on many

diﬀerent word forms.

Named Entity Recognition for Icelandic: Annotated Corpus and Models 9

The CombiTagger ensemble method improves the overall results, with an F1

score of 85.79, which is better than any of the individual models. When it is

compared with our best NER model, we see a 1.89 percentage points improve-

ment in overall F1, and 2.69 points when compared with the second best model.

The observed results show that while the three diﬀerent models are not equal

in quality, each one of them is better than the other two at predicting some

particular NE type. The diﬀerent models thus tend to produce diﬀerent (com-

plementary) errors and the diﬀerences can be exploited to yield better results.

Therefore, CombiTagger outperforms all the others on all but one NE types.

Table 3 shows that training on increasing sizes of the data gradually improves

the performance. However, even with a dataset of 540,000 tokens, the overall F1

is 82.37, and 85.17 in the 720,000 token dataset, which is not far behind from

the result obtained on the whole corpus. Furthermore, preliminary experiments

with training on a subset of the text types, containing only news texts, indicate

that this may be a viable approach, especially if the intended use is within a

particular domain.

Table 3. F1scores for CombiTagger on diﬀerent sizes of the data. For clariﬁcation of

the model names, CombiTagger-180K stands for a corpus size of around 180,000 tokens,

with a training data size of around 160,000 tokens, validation set size of approximately

20,000 tokens, and the consistent test set size of 100,693 tokens (10% of each corpus

size was reserved for the test set).

Overall PER LOC ORG MISC DATE TIME MON PERC

CombiTagger-180K 76.67 85.50 81.87 69.18 46.34 79.60 89.75 71.50 96.46

CombiTagger-360K 79.14 88.24 82.13 71.59 54.11 84.34 86.89 78.10 97.78

CombiTagger-540K 82.37 89.23 83.75 76.00 58.92 89.77 91.97 87.67 98.20

CombiTagger-720K 85.16 89.19 87.19 81.96 61.16 92.59 95.40 90.35 97.78

CombiTagger-900K 85.79 90.19 88.21 81.23 64.27 93.13 96.41 86.58 98.65

6 Conclusion

We have described the annotation of the ﬁrst NER corpus for Icelandic and the

initial experiments on using the data for training and evaluating Icelandic NER

models. This corpus, with 48,371 NEs tagged in 1 million tokens, is one of the

largest manually annotated NER corpora we have come across in the literature,

and includes a variety of text types that have been annotated for eight common

entity types.

Several diﬀerent model architectures and training set sizes were tested on

the data, and an ensemble method using simple voting from three models was

shown to perform considerably better than any individual model. These results

are presented without any post-processing, such as a gazetteer lookup, commonly

used to boost NER results. The morphological intricacies of Icelandic make NER

10 S. Ingólfsdóttir et al.

a nuanced problem, but based on these ﬁrst results we are optimistic about

obtaining higher scores with more advanced models in the future, e.g. by using

BERT-type models.

Acknowledgments

This work was funded by the Icelandic Strategic Research and Development

Programme for Language Technology 2019, grant no. 180027-5301.

References

1. Agerri, R., Bermudez, J., Rigau, G.: IXA pipeline: Eﬃcient and Ready to Use

Multilingual NLP tools. In: Proceedings of the Ninth International Conference on

Language Resources and Evaluation. LREC 2014, Reykjavik, Iceland (2014)

2. Agerri, R., Rigau, G.: Robust Multilingual Named Entity Recognition with Shallow

Semi-Supervised Features. arXiv e-prints arXiv:1701.09123 (2017)

3. Ahmed, I., Sathyaraj, R.: Named Entity Recognition by Using Maximum Entropy.

International Journal of Database Application and Theory 8, 43–50 (2015), https:

//doi.org/10.14257/ijdta.2015.8.2.05

4. Arkhipov, M., Troﬁmova, M., Kuratov, Y., Sorokin, A.: Tuning Multilingual Trans-

formers for Language-Speciﬁc Named Entity Recognition. In: Proceedings of the

7th Workshop on Balto-Slavic Natural Language Processing. Florence, Italy (2019),

https://doi.org/10.18653/v1/W19-3712

5. Baevski, A., Edunov, S., Liu, Y., Zettlemoyer, L., Auli, M.: Cloze-driven Pretrain-

ing of Self-attention Networks. In: Proceedings of the 2019 Conference on Em-

pirical Methods in Natural Language Processing and the 9nd International Joint

Conference on Natural Language Processing. EMNLP/IJCNLP, Hong Kong, China

(2019), https://doi.org/10.18653/v1/D19-1539

6. Bjarnadóttir, K.: The Database of Modern Icelandic Inﬂection. In: Proceedings of

the “Language Technology for Normalisation of Less-Resourced Languages” (SaLT-

MiL 8 – AfLaT2012), workshop at the 8th International Conference on Language

Resources and Evaluation. LREC 2012, Istanbul, Turkey (2012)

7. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching Word Vectors with

Subword Information. Transactions of the Association for Computational Linguis-

tics 5, 135–146 (2017), https://doi.org/10.1162/tacl_a_00051

8. Brown, P.F., deSouza, P.V., Mercer, R.L., Pietra, V.J.D., Lai, J.C.: Class-Based

n-Gram Models of Natural Language. Computational Linguistics 18(4), 467–479

(1992), https://www.aclweb.org/anthology/J92-4003

9. Chinchor, N., Brown, E., Ferro, L., Robinson, P.: Named Entity Recognition Task

Deﬁnition. Tech. Rep. Version 1.4, The MITRE Corporation and SAIC (1999)

10. Chiticariu, L., Krishnamurthy, R., Li, Y., Reiss, F., Vaithyanathan, S.: Domain

Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks. In:

Proceedings of the Conference on Empirical Methods in Natural Language Process-

ing. EMNLP 2010, Cambridge, MA, USA (2010), http://aclweb.org/anthology/

D10-1098

11. Clark, A.: Combining Distributional and Morphological Information for Part of

Speech Induction. In: Proceedings of the 10th Conference of the European Chapter

of the Association for Computational Linguistics. EACL 2003, Budapest, Hungary

(2003), https://www.aclweb.org/anthology/E03-1009

Named Entity Recognition for Icelandic: Annotated Corpus and Models 11

12. Demir, H., Özgür, A.: Improving Named Entity Recognition for Morphologically

Rich Languages Using Word Embeddings. In: Proceedings of the 13th International

Conference on Machine Learning and Applications. ICMLA 2013, Miami, FL, USA

(2014), https://doi.org/10.1109/ICMLA.2014.24

13. Derczynski, L., Field, C.V., Bøgh, K.S.: DKIE: Open Source Information Extrac-

tion for Danish. In: Proceedings of the Demonstrations at the 14th Conference of

the European Chapter of the Association for Computational Linguistics. EACL

2014, Gothenburg, Sweden (2014), https://doi.org/10.3115/v1/E14-2016

14. Dernoncourt, F., Lee, J.Y., Szolovits, P.: NeuroNER: an easy-to-use program for

named-entity recognition based on neural networks. In: Proceedings of the Con-

ference on Empirical Methods in Natural Language Processing. EMNLP 2017,

Copenhagen, Denmark (2017)

15. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep

Bidirectional Transformers for Language Understanding. In: Proceedings of the

2019 Conference of the North American Chapter of the Association for Compu-

tational Linguistics: Human Language Technologies, Volume 1 (Long and Short

Papers). NAACL, Minneapolis, MN, USA (2019), https://doi.org/10.18653/

v1/N19-1423

16. Grishman, R., Sundheim, B.: Message Understanding Conference-6: A Brief His-

tory. In: Proceedings of the 16th Conference on Computational Linguistics – Vol-

ume 1. COLING 1996, Copenhagen, Denmark (1996), https://www.aclweb.org/

anthology/C96-1079/

17. Henrich, V., Reuter, T., Loftsson, H.: CombiTagger: A System for Develop-

ing Combined Taggers. In: Proceedings of the 22nd International FLAIRS Con-

ference, Special Track: “Applied Natural Language Processing”. Sanibel Island,

FL, USA (2009), https://www.aaai.org/ocs/index.php/FLAIRS/2009/paper/

viewFile/67/296

18. Ingólfsdóttir, S.L., Þorsteinsson, S., Loftsson, H.: Towards High Accuracy Named

Entity Recognition for Icelandic. In: Proceedings of the 22nd Nordic Conference

on Computational Linguistics. NoDaLiDa 2019, Turku, Finland (2019), https:

//www.aclweb.org/anthology/W19-6142

19. Johansen, B.: Named-Entity Recognition for Norwegian. In: Proceedings of the

22nd Nordic Conference on Computational Linguistics. NoDaLiDa 2019, Turku,

Finland (2019), https://www.aclweb.org/anthology/W19-6123

20. Laﬀerty, J.D., McCallum, A.K., Pereira, F.C.N.: Conditional Random Fields: Prob-

abilistic Models for Segmenting and Labeling Sequence Data. In: Proceedings

of the Eighteenth International Conference on Machine Learning. ICML 2001,

Williamstown, MA, USA (2001)

21. Liu, L., Shang, J., Han, J.: Arabic Named Entity Recognition: What Works and

What’s Next. In: Proceedings of the Fourth Arabic Natural Language Processing

Workshop. Florence, Italy (2019), https://doi.org/10.18653/v1/W19-4607

22. Loftsson, H., Rögnvaldsson, E.: IceNLP: A natural language processing toolkit for

Icelandic. In: Proceedings of the Annual Conference of the International Speech

Communication Association. Antwerp, Belgium (2007)

23. Loftsson, H., Yngvason, J.H., Helgadóttir, S., Rögnvaldsson, E.: Developing a PoS-

tagged corpus using existing tools. In: Proceedings of “Creation and use of basic

lexical resources for less-resourced languages”, workshop at the 7th International

Conference on Language Resources and Evaluation. LREC 2010, Valetta, Malta

(2010)

24. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Eﬃcient Estimation of Word Rep-

resentations in Vector Space. arXiv e-prints arXiv:1301.3781 (2013)

12 S. Ingólfsdóttir et al.

25. Passos, A., Kumar, V., McCallum, A.: Lexicon infused phrase embeddings for

named entity resolution. In: Proceedings of the Eighteenth Conference on Compu-

tational Natural Language Learning. CoNLL 2014, Ann Arbor, Michigan (2014),

https://doi.org/10.3115/v1/W14-1609

26. Pennington, J., Socher, R., Manning, C.: Glove: Global Vectors for Word Rep-

resentation. In: Proceedings of the Conference on Empirical Methods in Natural

Language Processing. EMNLP 2014, Doha, Qatar (2014), https://www.aclweb.

org/anthology/D14-1162/

27. Plank, B.: Neural Cross-Lingual Transfer and Limited Annotated Data for Named

Entity Recognition in Danish. In: Proceedings of the 22nd Nordic Conference on

Computational Linguistics. NoDaLiDa 2019, Turku, Finland (2019), https://www.

aclweb.org/anthology/W19-6143

28. Santos, D., Freitas, C., Gonçalo Oliveira, H., Carvalho, P.: Second HAREM: New

Challenges and Old Wisdom. In: Computational Processing of the Portuguese Lan-

guage, 8th International Conference, Proceedings. PROPOR 2008, Aveiro, Portugal

(2008), https://doi.org/10.1007/978-3-540-85980- 2_22

29. Santos, D., Seco, N., Cardoso, N., Vilela, R.: HAREM: An Advanced NER Eval-

uation Contest for Portuguese. In: Proceedings of the Fifth International Confer-

ence on Language Resources and Evaluation. LREC 2006, Genoa, Italy (2006),

http://www.lrec-conf.org/proceedings/lrec2006/pdf/59_pdf.pdf

30. Steingrímsson, S., Helgadóttir, S., Rögnvaldsson, E., Barkarson, S., Gudnason, J.:

Risamálheild: A Very Large Icelandic Text Corpus. In: Proceedings of the Eleventh

International Conference on Language Resources and Evaluation. LREC 2018,

Miyazaki, Japan (2018), https://www.aclweb.org/anthology/L18-1690

31. Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 Shared

Task: Language-Independent Named Entity Recognition. In: Proceedings of the

Conference on Computational Natural Language Learning. CoNLL 2003, Edmon-

ton, Canada (2003), https://www.aclweb.org/anthology/W03-0419

32. Tryggvason, A.: Named Entity Recognition for Icelandic. Research report, Reykja-

vik University (2009)

33. Virtanen, A., Kanerva, J., Ilo, R., Luoma, J., Luotolahti, J., Salakoski, T., Gin-

ter, F., Pyysalo, S.: Multilingual is not enough: BERT for Finnish. arXiv e-prints

arXiv:1912.07076 (2019)

34. Weegar, R., Pérez, A., Casillas, A., Oronoz, M.: Recent advances in Swedish and

Spanish medical entity recognition in clinical texts using deep neural approaches.

BMC Medical Informatics and Decision Making 19 (2019)

35. Wu, Y.C., Fan, T.K., Lee, Y.S., Yen, S.J.: Extracting Named Entities Using Sup-

port Vector Machines. In: Bremer, E.G., Hakenberg, J., Han, E.H.S., Berrar, D.,

Dubitzky, W. (eds.) Knowledge Discovery in Life Science Literature. pp. 91–103.

Springer Berlin Heidelberg, Berlin, Heidelberg (2006), https://doi.org/10.1007/

11683568_8

36. Yadav, V., Bethard, S.: A Survey on Recent Advances in Named Entity Recognition

from Deep Learning models. In: Proceedings of the 27th International Conference

on Computational Linguistics. COLING 2018, Santa Fe, NM, USA (2018), https:

//www.aclweb.org/anthology/C18-1182.pdf

Help Yourself from the Buffet: National Language Technology Infrastructure Initiative on CLARIN-IS

Conference Paper

Full-text available

Jul 2022

In this paper we describe how a fairly new CLARIN member is building a broad collection of national language resources for use in language technology (LT). As a CLARIN C-centre,CLARIN-IS is hosting metadata for various text and speech corpora, lexical resources, software packages and models. The providers of the resources are universities, institutions and private companies working on a national LT infrastructure initiative, Language Technology Programme for Icelandic. All deliverables of the programme are published under open licences and are freely accessible for research as well as commercial use. We provide a broad overview of the available repositories and the core publishing guidelines.

Help Yourself from the Buffet: National Language Technology Infrastructure Initiative on CLARIN-IS

Conference Paper

Full-text available

Sep 2021

In this paper, we describe how a fairly new CLARIN member is building a broad collection of national language resources for use in language technology (LT). As a CLARIN C-centre, CLARIN-IS is hosting metadata for various text and speech corpora, lexical resources, software packages and models. The providers of the resources are universities, institutions and private companies working on a national (Icelandic) LT infrastructure initiative.

IceSum: An Icelandic Text Summarization Corpus

Conference Paper

Full-text available

Jan 2021

Effective Entity Disambiguation in Low-Resource Languages: A Study of Icelandic

Conference Paper

Oct 2023

A Framework for Providing Structured Invoice Document using Optimized Bert Enabled Deep Convolutional Neural Network Classifier

Conference Paper

Oct 2023

Artificial Intelligence

Chapter

Sep 2022

Fatih Demir

The definition of the smart city has changed over time and is expected to continue to change. For example, Japan stepped into the Society 5.0 vision, which encompasses robotics, artificial intelligence, the Internet of Things, and big data. Society 5.0 is Japan’s national vision within the fifth Science and Technology Fundamental Plan, which aims to realize a data-driven, human-centered, next-generation society (Kobashi et al., 2020; Zhang et al., 2020). This vision means that everyone, regardless of location, will benefit from innovation and advances in technology, including the elderly population in rural areas. Community 5.0 emphasizes solving all kinds of social problems while harmonizing sustainability and economic growth. Therefore, Community 5.0 guides cities of all sizes on how to build smart infrastructure. It is no longer just about new technologies, but also about understanding a community’s problems and providing appropriate solutions through these technologies. In the context of the smart city, this approach is called “problem-solution-oriented smart city” (Neirotti et al., 2014), and it is the basis for achieving the greater goals of Community 5.0.

Recent advances in Swedish and Spanish medical entity recognition in clinical texts using deep neural approaches

Article

Full-text available

Dec 2019
BMC MED INFORM DECIS

Background: Text mining and natural language processing of clinical text, such as notes from electronic health records, requires specific consideration of the specialized characteristics of these texts. Deep learning methods could potentially mitigate domain specific challenges such as limited access to in-domain tools and data sets. Methods: A bi-directional Long Short-Term Memory network is applied to clinical notes in Spanish and Swedish for the task of medical named entity recognition. Several types of embeddings, both generated from in-domain and out-of-domain text corpora, and a number of generation and combination strategies for embeddings have been evaluated in order to investigate different input representations and the influence of domain on the final results. Results: For Spanish, a micro averaged F1-score of 75.25 was obtained and for Swedish, the corresponding score was 76.04. The best results for both languages were achieved using embeddings generated from in-domain corpora extracted from electronic health records, but embeddings generated from related domains were also found to be beneficial. Conclusions: A recurrent neural network with in-domain embeddings improved the medical named entity recognition compared to shallow learning methods, showing this combination to be suitable for entity recognition in clinical text for both languages.

Tuning Multilingual Transformers for Language-Specific Named Entity Recognition

Conference Paper

Full-text available

Jan 2019

Risamálheild: A Very Large Icelandic Text Corpus

Conference Paper

Full-text available

May 2018

We present Risamálheild, the Icelandic Gigaword Corpus (IGC), a corpus containing more than one billion running words from mostly contemporary texts. The work was carried out with minimal amount of work and resources, focusing on material that is not protected by copyright and sources which could provide us with large chunks of text for each cleared permission. The two main sources considered were therefore official texts and texts from news media. Only digitally available texts are included in the corpus and formats that can be problematic are not processed. The corpus texts are morphosyntactically tagged and provided with metadata. Processes have been set up for continuous text collection, cleaning and annotation. The corpus is available for search and download with permissive licenses. The dataset is intended to be clearly versioned with the first version released in early 2018. Texts will be collected continually and a new version published every year.

NeuroNER: an easy-to-use program for named-entity recognition based on neural networks

Conference Paper

Full-text available

Jan 2017

Robust Multilingual Named Entity Recognition with Shallow Semi-supervised Features (Extended Abstract)

Conference Paper

Full-text available

Aug 2017

We present a multilingual Named Entity Recognition approach based on a robust and general set of features across languages and datasets. Our system combines shallow local information with clustering semi-supervised features induced on large amounts of unlabeled text. Understanding via empiricalexperimentation how to effectively combine various types of clustering features allows us to seamlessly export our system to other datasets and languages. The result is a simple but highly competitive system which obtains state of the art results across five languages and twelve datasets. The results are reported on standard shared task evaluation data such as CoNLL for English, Spanish and Dutch. Furthermore, and despite the lack of linguistically motivated features, we also report best results for languages such as Basque and German. In addition, we demonstrate that our method also obtains very competitive results even when the amount of supervised data is cut by half, alleviating the dependency on manually annotated data. Finally, the results show that our emphasis on clustering features is crucial to develop robust out-of-domain models. The system and models are freely available to facilitate its use and guarantee the reproducibility of results.

NeuroNER: an easy-to-use program for named-entity recognition based on neural networks

Article

Full-text available

May 2017

Named-entity recognition (NER) aims at identifying entities of interest in a text. Artificial neural networks (ANNs) have recently been shown to outperform existing NER systems. However, ANNs remain challenging to use for non-expert users. In this paper, we present NeuroNER, an easy-to-use named-entity recognition tool based on ANNs. Users can annotate entities using a graphical web-based user interface (BRAT): the annotations are then used to train an ANN, which in turn predict entities' locations and categories in new texts. NeuroNER makes this annotation-training-prediction flow smooth and accessible to anyone.

Cloze-driven Pretraining of Self-attention Networks

Conference Paper

Jan 2019

Arabic Named Entity Recognition: What Works and What’s Next

Conference Paper

Jan 2019

Efficient Estimation of Word Representations in Vector Space

Conference Paper

Jan 2013

We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

Enriching Word Vectors with Subword Information

Article

Jul 2016

Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Many popular models to learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for morphologically rich languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skip-gram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram, words being represented as the sum of these representations. Our method is fast, allowing to train models on large corpus quickly. We evaluate the obtained word representations on five different languages, on word similarity and analogy tasks.

Named Entity Recognition for Icelandic: Annotated Corpus and Models

Abstract and Figures

Recommended publications

Towards a Methodology for Named Entities Annotation

ADnOTO: A Self-adaptive System for Automatic Ontology-Based Annotation of Unstructured Documents

Semantic URL Analytics to Support Efficient Annotation of Large Scale Web Archives

Exploiting Multiply Annotated Corpora in Biomedical Information Extraction Tasks.