Fig 1 - uploaded by Arkaitz Zubiaga
F1 scores achieved by submitted systems for different tweet lengths (tweet lengths measured as character counts after removing hashtags, user mentions, and URLs) 

Source publication
Article
Language identification, the task of determining the language a given text is written in, has progressed substantially in recent decades. However, three main issues remain unresolved: (1) distinction of similar languages, (2) detection of multilingualism in a single document, and (3) identifying the language of short texts. In this paper,...

Context in source publication

Context 1
... line with our motivation to study three key unresolved issues in language identification, we now delve into the analysis of results by looking into the performance of the systems when it comes to these three aspects separately: (i) performance results by tweet length, (ii) performance results for monolingual and multilingual tweets, and (iii) performance between similar languages by looking at the confusion matrix. (i) Evaluation by Tweet Length. Figure 1 shows the performance of the systems by tweet length. These boxplots enable the visualization of quartiles in the ranked list of performance values; the bottom and top edges represent 0% and 100% percentiles, the bottom and top of the box represent the 25% and 75% percentiles, and the middle line represents the median, which allows to compare the distributions of performances for different tweet lengths. These results clearly show the tendency of language identifiers to classify with substantially higher accuracy the tweets with more than 60 characters; the performance of the systems progressively drops especially for tweets with fewer than 60 characters. The performance is dramatically lower for tweets as short as 20 characters or fewer. While this corroborates the findings in previous works on language identification, it shows that language identifiers can also perform accurately for long tweets. Even though there is still room for improvement with long tweets, the main challenge remains in the correct identification of language for short ...
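The length measure in the caption and the boxplot statistics described above can be sketched roughly as follows. This is a minimal illustration, not the authors' evaluation code: the regular expressions, the 20-character bin width, and the helper names (`clean_length`, `length_bin`, `five_number_summary`) are all assumptions introduced here.

```python
import re
from statistics import quantiles

# Simplified, assumed patterns for the tokens excluded from the length count.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def clean_length(tweet: str) -> int:
    """Character count after dropping URLs, user mentions, and hashtags."""
    text = URL_RE.sub("", tweet)
    text = MENTION_RE.sub("", text)
    text = HASHTAG_RE.sub("", text)
    # Collapse the whitespace left behind by the removed tokens.
    return len(" ".join(text.split()))


def length_bin(length: int, width: int = 20) -> str:
    """Label of the length bucket, e.g. '1-20' or '21-40' (bin width is assumed)."""
    if length <= 0:
        return "0"
    low = ((length - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"


def five_number_summary(scores):
    """Return (min, 25%, median, 75%, max), the values a boxplot draws."""
    q1, median, q3 = quantiles(scores, n=4, method="inclusive")
    return min(scores), q1, median, q3, max(scores)
```

Under these assumptions, each tweet is assigned to a bucket with `length_bin(clean_length(tweet))`, and `five_number_summary` is applied to the list of per-system F1 scores for one bucket to obtain the whiskers, box edges, and median of that bucket's boxplot.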

Similar publications

Preprint
We evaluate five English NLP benchmark datasets (available on the SuperGLUE leaderboard) for bias, along multiple axes. The datasets are the following: Boolean Questions (BoolQ), CommitmentBank (CB), Winograd Schema Challenge (WSC), Winogender diagnostic (AXg), and Recognising Textual Entailment (RTE). Bias can be harmful and it is known to be commo...
Conference Paper
Structured document understanding has attracted considerable attention and made significant progress recently, owing to its crucial role in intelligent document processing. However, most existing related models can only deal with the document data of specific language(s) (typically English) included in the pre-training collection, which is extremel...
Article
Specialized vocabulary is one of the most important and dynamically developing subsystems of a national language. Due to the continuous development within society, most of the subject fields are characterized by considerable variability and terminological inconsistency. One of the possibilities to contribute to more effective and clearer communicat...

Citations

... Research in the area of NLP on code-switching (CSW) has mostly focused on Language Modeling, especially for Automatic Speech Recognition (ASR) (Pratapa et al. [2018]; Garg et al. [2018]; Gonen and Goldberg [2018]; Winata et al. [2019]; Lee and Li [2020]). Evaluation tasks and benchmarks have also been prepared for LID in user-generated CSW content (Zubiaga et al. [2016]; Molina et al. [2019]), Named Entity Recognition (Aguilar et al. [2019]), Part-of-Speech tagging (Ball and Garrette [2018]; Khanuja et al. [2020]) and Sentiment Analysis (Patwa et al. [2020]). CSW was also found useful in foreign language teaching: Renduchintala et al. [2019] showed that replacing words with their counterparts in a foreign language helps to learn foreign language vocabulary. ...
Preprint
One of the things that need to change when it comes to machine translation is the models' ability to translate code-switching content, especially with the rise of social media and user-generated content. In this paper, we are proposing a way of training a single machine translation model that is able to translate monolingual sentences from one language to another, along with translating code-switched sentences to either language. This model can be considered a bilingual model in the human sense. For better use of parallel data, we generated synthetic code-switched (CSW) data along with an alignment loss on the encoder to align representations across languages. Using the WMT14 English-French (En-Fr) dataset, the trained model strongly outperforms bidirectional baselines on code-switched translation while maintaining quality for non-code-switched (monolingual) data.
... However, Twitter is limited to recognizing only 34 languages, and there are still some errors with Twitter's language tags [7]. It has become difficult to automatically identify specific languages on Twitter [8], especially regional or native languages such as Javanese and Sundanese in Indonesia. The current automatic language identification (LI) embedded in Twitter does not perfectly work for both languages. ...
... Language identification is the process of automatically determining which language is contained in a document based on its content [9]. Automatic language identification is a necessary preprocessing step in more complex natural language processing systems, including machine translation, sentiment analysis, named entity recognition, and text summarization [8]. With successful language identification, large corpora of short texts, such as tweets, can be analyzed for marketing, political, and socioeconomic reasons [10]. ...
Conference Paper
Twitter is a well-known social media platform with over 500 million users worldwide. More than 100 languages were identified from among a million tweets. However, only 34 formal languages are supported by Twitter. It becomes challenging to do regional language identification. In Indonesia, Twitter does not recognize the Javanese and Sundanese languages. In this paper, a mechanism to automatically identify Javanese and Sundanese languages is proposed. It implements six deep learning methods: RNN, LSTM, Bi-LSTM, GRU, CNN, and Multichannel CNN. Twitter data is used in the experiment. Generally, those methods performed well in discriminating the languages. A comparison of the results shows that the LSTM technique performed better than the other techniques with an F1-score of 0.9985.
... Therefore, the combination of short, informal, and multilingual posts on Twitter makes language detection much more difficult compared to LID of formal documents [114]. Finally, the lack of large collections of verified ground-truth across most languages is challenging for data scientists seeking to fine-tune language detection models using Twitter data [81,115,116]. ...
Article
Working from a dataset of 118 billion messages running from the start of 2009 to the end of 2019, we identify and explore the relative daily use of over 150 languages on Twitter. We find that eight languages comprise 80% of all tweets, with English, Japanese, Spanish, Arabic, and Portuguese being the most dominant. To quantify social spreading in each language over time, we compute the ‘contagion ratio’: The balance of retweets to organic messages. We find that for the most common languages on Twitter there is a growing tendency, though not universal, to retweet rather than share new content. By the end of 2019, the contagion ratios for half of the top 30 languages, including English and Spanish, had reached above 1—the naive contagion threshold. In 2019, the top 5 languages with the highest average daily ratios were, in order, Thai (7.3), Hindi, Tamil, Urdu, and Catalan, while the bottom 5 were Russian, Swedish, Esperanto, Cebuano, and Finnish (0.26). Further, we show that over time, the contagion ratios for most common languages are growing more strongly than those of rare languages.
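The contagion ratio described in this abstract can be illustrated with a short sketch. It assumes the "balance of retweets to organic messages" is a plain quotient of daily counts; the function names and the zero-count handling are hypothetical and not taken from the paper.

```python
def contagion_ratio(retweets: int, organic: int) -> float:
    """Retweeted messages divided by organic (newly authored) messages."""
    if organic <= 0:
        # Assumed behaviour: the ratio is undefined without organic messages.
        raise ValueError("organic message count must be positive")
    return retweets / organic


def above_naive_threshold(ratio: float, threshold: float = 1.0) -> bool:
    """True when retweets outnumber organic messages (ratio above 1)."""
    return ratio > threshold
```

Read this way, the reported value of 7.3 for Thai would correspond to about 7.3 retweets per organic message, and 0.26 for Finnish to roughly one retweet for every four organic messages.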
... Twitter is widely used in NLP for tasks such as mining opinions about specific products or topics (Villena et al., 2013; Rosenthal et al., 2017), detecting political stance (Mohammad et al., 2016; Derczynski et al., 2017) and hate speech (Basile et al., 2019) or for basic tasks such as POS tagging (Ritter et al., 2011), named entity recognition (Baldwin et al., 2015), normalization (Alegria et al., 2015) and language identification (Zubiaga et al., 2016). ...
Preprint
In this paper we take into account both social and linguistic aspects to perform demographic analysis by processing a large number of tweets in the Basque language. The study of demographic characteristics and social relationships is approached by applying machine learning and modern deep-learning Natural Language Processing (NLP) techniques, combining social sciences with automatic text processing. More specifically, our main objective is to combine demographic inference and social analysis in order to detect young Basque Twitter users and to identify the communities that arise from their relationships or shared content. This social and demographic analysis will be entirely based on the automatically collected tweets, using NLP to convert unstructured textual information into interpretable knowledge.
... Use of regional dialects has been pointed out in communication and identification of its context meaning is handled in the paper. The paper [25,26] presents the state of the art in language identification. The paper [27] points out the use of MNN (Multi-Layer Neural Network) along with LSTM (Long Short-Term Memory) for ambiguity minimization in mixed script textual data. ...
Article
The paper describes the use of a self-learning Hierarchical LSTM technique for classifying hatred and trolling content in social media code-mixed data. The Hierarchical LSTM-based learning is a novel learning architecture inspired by neural learning models. The proposed HLSTM model is trained to identify the hatred and trolling words found in social media content. The proposed HLSTM model is equipped with a self-learning and predicting mechanism for annotating hatred words in the transliteration domain. The Hindi–English data are ordered into Hindi, English, and hatred labels for classification. Word-embedding and character-embedding features are used for word representation in the sentence to detect hatred words. The method based on the HLSTM model helps in recognizing the context of a hatred word by mining the user's intention in using that word in the sentence. Extensive experiments suggest that the HLSTM-based classification model achieves an accuracy of 97.49% when evaluated against standard models such as BLSTM, CRF, LR, SVM, Random Forest, and Decision Tree, especially when there are some hatred and trolling words in the social media data.
... The focus was on transliterating short form to full form. Zubiaga et al. had mentioned language identification, as the mission of defining the language of a given text [13]. On the other hand, certain issues like quantifying the individuality of similar languages in multilingualism document and analyzing the language of short texts are still unresolved. ...
Chapter
Transliteration is currently one of the most active research areas in the field of Natural Language Processing. Transliteration means transferring a word from one language to another, and it is mostly used in cross-language platforms. People generally use code-mixed language for sharing their views on social media such as Twitter and WhatsApp. Code-mixed language means one language is written using another language's script, and identifying the language of each word is very important for processing this type of text. Therefore, in this paper a deep learning model is implemented using Bidirectional Long Short-Term Memory (BLSTM) for Indian social media texts. This model identifies the origin of a word from a language perspective based on the specific words that have come before it in the sequence. The proposed model gives better accuracy for the word-embedding model than for character embedding.
... The focus was on transliterating short form to full form. Zubiaga et al. [20] had mentioned language identification, as the mission of defining the language of a given text. On the other hand, certain issues like quantifying the individuality of similar languages in multilingualism document and analyzing the language of short texts are still unresolved. ...
Article
The paper describes the application of the code-mixed index to Indian social media texts and compares the complexity of identifying language at the word level using a BLSTM neural model. Transliteration is one of the important and relatively less mature areas of Natural Language Processing. During transliteration, issues like language identification, script specification, and missing sounds arise in code-mixed data. Social media platforms are now widely used by people to express their opinions or interests. Users on social media nowadays write code-mixed text, i.e., text mixing two or more languages. In code-mixed data, one language is written using another language's script. To process such code-mixed text, it is important to identify the language of each word. The major contribution of the work is a technique for identifying the language of Hindi-English code-mixed data used on three social media platforms, namely Facebook, Twitter, and WhatsApp. We propose a deep learning framework based on cBoW and Skip-gram models for language identification in code-mixed data. Popular word embedding features were used for the representation of each word. Much research has recently been done in the field of language identification, but word-level language identification in the transliterated environment remains an open research issue in code-mixed data. We have implemented a deep learning model based on BLSTM that predicts the origin of a word from a language perspective based on the specific words that have come before it in the sequence. Multichannel neural networks combining CNN and BLSTM are used for word-level language identification of code-mixed data in which English and romanized Hindi are used, and are combined with cBoW and Skip-gram for evaluation.
The BLSTM context capture module of the proposed system gives better accuracy for the word embedding model than for character embedding when evaluated on our two testing sets. The problem is modeled collectively with the deep-learning design. We present an in-depth empirical analysis of the proposed methodology against standard approaches for language identification.
... The main concern is of conversion from short form to full form. The paper [20,23] [29] had worked on the concept of language identification. The paper [32] discusses the concept of multichannel neural network along with the use of LSTM. ...
Article
The main focus of the paper is to propose an artificial immune systems-based classification model for code-mixed social media data. Artificial immune systems are computational models inspired by the biological immune system. In this paper, artificial immune systems are used to optimize the initial parameters of a Long Short-Term Memory (LSTM) model. The proposed artificial immune systems-based LSTM model is then used for the classification of code-mixed data. Hindi-English code-mixed data are classified into Hindi, English, and ambiguous words. Popular word embedding features and character embedding features were used for the representation of each word. The proposed method helps in identifying the word context by extracting the user's intent in using the ambiguous word in a code-mixed sentence. Extensive experiments reveal that the artificial immune systems-based classification model outperforms competitive models, especially when there are some ambiguous words in the social media data.
... Therefore, the combination of short, informal, and multilingual posts on Twitter makes language detection much more difficult compared to LID of formal documents [114]. Finally, the lack of large collections of verified ground-truth across most languages is challenging for data scientists seeking to fine-tune language detection models using Twitter data [81,115,116]. ...
Preprint
Working from a dataset of 118 billion messages running from the start of 2009 to the end of 2019, we identify and explore the relative daily use of over 150 languages on Twitter. We find that eight languages comprise 80% of all tweets, with English, Japanese, Spanish, and Portuguese being the most dominant. To quantify each language's level of being a Twitter ‘echo chamber’ over time, we compute the ‘contagion ratio’: the balance of retweets to organic messages. We find that for the most common languages on Twitter there is a growing tendency, though not universal, to retweet rather than share new content. By the end of 2019, the contagion ratios for half of the top 30 languages, including English and Spanish, had reached above 1, the naive contagion threshold. In 2019, the top 5 languages with the highest average daily ratios were, in order, Thai (7.3), Hindi, Tamil, Urdu, and Catalan, while the bottom 5 were Russian, Swedish, Esperanto, Cebuano, and Finnish (0.26). Further, we show that over time, the contagion ratios for most common languages are growing more strongly than those of rare languages.
... The focus was on transliterating short form to full form. Zubiaga et al. (2015) had mentioned language identification, as the mission of defining the language of a given text. On the other hand, three main issues still persist unresolved: ...