F1 scores achieved by submitted systems for different tweet lengths (tweet lengths measured as character counts after removing hashtags, user mentions, and URLs)

Source publication

TweetLID: a benchmark for tweet language identification

Article

Full-text available

Dec 2016

Language identification, the task of determining the language a given text is written in, has progressed substantially in recent decades. However, three main issues remain unresolved: (1) distinction of similar languages, (2) detection of multilingualism in a single document, and (3) identifying the language of short texts. In this paper,...

Context 1

... line with our motivation to study three key unresolved issues in language identification, we now analyze the results by looking at system performance on each of these three aspects separately: (i) performance by tweet length, (ii) performance for monolingual and multilingual tweets, and (iii) performance between similar languages, by looking at the confusion matrix. (i) Evaluation by Tweet Length. Figure 1 shows the performance of the systems by tweet length. These boxplots visualize the quartiles in the ranked list of performance values: the bottom and top edges represent the 0% and 100% percentiles, the bottom and top of the box represent the 25% and 75% percentiles, and the middle line represents the median, which makes it possible to compare the distributions of performance across tweet lengths. These results clearly show that language identifiers classify tweets with more than 60 characters with substantially higher accuracy; performance drops progressively for tweets with fewer than 60 characters, and is dramatically lower for tweets of 20 characters or fewer. While this corroborates the findings of previous work on language identification, it also shows that language identifiers can perform accurately on long tweets. Even though there is still room for improvement with long tweets, the main challenge remains the correct identification of language for short ...
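The length measure and boxplot statistics described in this excerpt can be sketched in a few lines of Python. This is an illustrative sketch only, not the benchmark's evaluation code: the regular expression for URLs, mentions, and hashtags, the 20-character bucket width, the 100-character cap, and all function names are assumptions made here for the example.

```python
import re
from statistics import quantiles

# Assumed patterns for URLs, @mentions and #hashtags; the benchmark's own
# tokenization rules are not given in the excerpt above.
_NON_CONTENT = re.compile(r"https?://\S+|www\.\S+|[@#]\w+")


def content_length(tweet: str) -> int:
    """Character count after dropping URLs, mentions and hashtags."""
    stripped = _NON_CONTENT.sub("", tweet)
    # Collapse the whitespace left behind by the removed tokens.
    return len(" ".join(stripped.split()))


def length_bucket(n: int, width: int = 20, cap: int = 100) -> str:
    """Map a length to a label such as '0-20' or '21-40' (hypothetical bins)."""
    if n > cap:
        return f">{cap}"
    if n <= width:
        return f"0-{width}"
    lo = ((n - 1) // width) * width + 1
    return f"{lo}-{lo + width - 1}"


def five_number_summary(scores):
    """Return (min, Q1, median, Q3, max), the values a boxplot draws."""
    ordered = sorted(scores)
    if len(ordered) < 2:
        raise ValueError("need at least two scores")
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, med, q3, ordered[-1]
```

Grouping each system's F1 score by `length_bucket(content_length(tweet))` and passing every group to `five_number_summary` would give one box per length range, in the spirit of the figure.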

Adapting and Improving Methods to Manage Cognitive Pretesting of Multilingual Survey Instruments

Article

Full-text available

Dec 2013

Fig. 1. Top-5 gender frequent terms in Boolq by RoBERTa.

Fig. 2. Top-5 gender frequent terms in Boolq by DeBERTa.

Fig. 4. Top-5 gender frequent terms in CB by RoBERTa.

Fig. 6. Top-5 gender frequent terms in CB by Electra.

Fig. 7. Top-5 gender frequent terms in RTE by RoBERTa.

Bipol: Multi-axes Evaluation of Bias with Explainability in Benchmark Datasets

Preprint

Full-text available

Jan 2023

We evaluate five English NLP benchmark datasets (available on the superGLUE leaderboard) for bias, along multiple axes. The datasets are the following: Boolean Question (Boolq), CommitmentBank (CB), Winograd Schema Challenge (WSC), Winogender diagnostic (AXg), and Recognising Textual Entailment (RTE). Bias can be harmful and it is known to be commo...

POLYGLOT: Multilingual Semantic Role Labeling with Unified Labels

Conference Paper

Full-text available

Jan 2016

LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding

Conference Paper

Full-text available

Mar 2022

Structured document understanding has attracted considerable attention and made significant progress recently, owing to its crucial role in intelligent document processing. However, most existing related models can only deal with the document data of specific language(s) (typically English) included in the pre-training collection, which is extremel...

Some remarks on the elaboration of a multilingual specialized dictionary of hippology

Article

Full-text available

Apr 2021

Specialized vocabulary is one of the most important and dynamically developing subsystems of a national language. Due to the continuous development within society, most of the subject fields are characterized by considerable variability and terminological inconsistency. One of the possibilities to contribute to more effective and clearer communicat...

The Effect of Alignment Objectives on Code-Switching Translation

Preprint

Full-text available

Sep 2023

Mohamed Anwar

One of the things that need to change when it comes to machine translation is the models' ability to translate code-switching content, especially with the rise of social media and user-generated content. In this paper, we are proposing a way of training a single machine translation model that is able to translate monolingual sentences from one language to another, along with translating code-switched sentences to either language. This model can be considered a bilingual model in the human sense. For better use of parallel data, we generated synthetic code-switched (CSW) data along with an alignment loss on the encoder to align representations across languages. Using the WMT14 English-French (En-Fr) dataset, the trained model strongly outperforms bidirectional baselines on code-switched translation while maintaining quality for non-code-switched (monolingual) data.

Conference Paper

Mar 2023

Twitter is a well-known social media platform with over 500 million users worldwide. More than 100 languages were identified from among a million tweets. However, only 34 formal languages are supported by Twitter. It becomes challenging to do regional language identification. In Indonesia, Twitter does not recognize the Javanese and Sundanese languages. In this paper, a mechanism to automatically identify Javanese and Sundanese languages is proposed. It implements six deep learning methods: RNN, LSTM, Bi-LSTM, GRU, CNN, and Multichannel CNN. Twitter data is used in the experiment. Generally, those methods performed well in discriminating the languages. A comparison of the results shows that the LSTM technique performed better than the other techniques with an F1-score of 0.9985.

The growing amplification of social media: measuring temporal and social contagion dynamics for over 150 languages on Twitter for 2009–2020

Article

Full-text available

Dec 2021

Working from a dataset of 118 billion messages running from the start of 2009 to the end of 2019, we identify and explore the relative daily use of over 150 languages on Twitter. We find that eight languages comprise 80% of all tweets, with English, Japanese, Spanish, Arabic, and Portuguese being the most dominant. To quantify social spreading in each language over time, we compute the ‘contagion ratio’: The balance of retweets to organic messages. We find that for the most common languages on Twitter there is a growing tendency, though not universal, to retweet rather than share new content. By the end of 2019, the contagion ratios for half of the top 30 languages, including English and Spanish, had reached above 1—the naive contagion threshold. In 2019, the top 5 languages with the highest average daily ratios were, in order, Thai (7.3), Hindi, Tamil, Urdu, and Catalan, while the bottom 5 were Russian, Swedish, Esperanto, Cebuano, and Finnish (0.26). Further, we show that over time, the contagion ratios for most common languages are growing more strongly than those of rare languages.
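As a minimal sketch of the quantity this abstract defines, the contagion ratio is the number of retweets divided by the number of organic messages, with 1 as the naive threshold. The function names and the example counts below are hypothetical and are not taken from the paper's data.

```python
def contagion_ratio(retweets: int, organic: int) -> float:
    """Balance of retweets to organic (non-retweet) messages."""
    if organic <= 0:
        raise ValueError("organic message count must be positive")
    return retweets / organic


def exceeds_naive_threshold(ratio: float, threshold: float = 1.0) -> bool:
    """True when retweets outnumber organic messages."""
    return ratio > threshold


# Hypothetical daily counts for one language, for illustration only.
ratio = contagion_ratio(retweets=730, organic=100)
print(ratio, exceeds_naive_threshold(ratio))
```

A ratio above 1 means that, for that language and period, more messages were retweets than newly written content.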

Social Analysis of Young Basque Speaking Communities in Twitter

Preprint

Full-text available

Sep 2021

In this paper we take into account both social and linguistic aspects to perform demographic analysis by processing a large number of tweets in the Basque language. The study of demographic characteristics and social relationships is approached by applying machine learning and modern deep-learning Natural Language Processing (NLP) techniques, combining social sciences with automatic text processing. More specifically, our main objective is to combine demographic inference and social analysis in order to detect young Basque Twitter users and to identify the communities that arise from their relationships or shared content. This social and demographic analysis will be entirely based on the automatically collected tweets, using NLP to convert unstructured textual information into interpretable knowledge.

Hatred and trolling detection transliteration framework using hierarchical LSTM in code-mixed social media text

Article

Full-text available

Aug 2021

The paper describes the use of a self-learning Hierarchical LSTM (HLSTM) technique for classifying hatred and trolling content in code-mixed social media data. Hierarchical LSTM-based learning is a novel learning architecture inspired by neural learning models. The proposed HLSTM model is trained to identify the hatred and trolling words present in social media content. It is equipped with a self-learning and predicting mechanism for annotating hatred words in the transliteration domain. The Hindi–English data are labeled as Hindi, English, or hatred for classification. Word-embedding and character-embedding features are used to represent the words in a sentence in order to detect hatred words. The HLSTM-based method helps recognize the context of a hatred word by mining the user's intention in using that word in the sentence. Extensive experiments suggest that the HLSTM-based classification model achieves an accuracy of 97.49% when evaluated against standard models such as BLSTM, CRF, LR, SVM, Random Forest, and Decision Tree, especially when there are some hatred and trolling words in the social media data.

A New Methodology for Language Identification in Social Media Code-Mixed Text

Chapter

Full-text available

Jan 2021

Transliteration is currently an active research area in Natural Language Processing. Transliteration means transferring a word from one language to another, and it is mostly used on cross-language platforms. People commonly use code-mixed language to share their views on social media such as Twitter and WhatsApp. In code-mixed language, one language is written using the script of another, so identifying the language of each word is essential for processing this type of text. In this paper, a deep learning model based on Bidirectional Long Short-Term Memory (BLSTM) is therefore implemented for Indian social media texts. The model identifies the language of origin of each word in a sequence based on the specific words that precede it. The proposed model gives better accuracy with word embeddings than with character embeddings.

An Effective Bi-LSTM Word Embedding System for Analysis and Identification of Language in Code-Mixed Social Media Text in English and Roman Hindi

Article

Full-text available

Dec 2020

The paper describes the application of the code-mixed index to Indian social media texts and compares the complexity of identifying language at the word level using a BLSTM neural model. Transliteration is one of the important and relatively less mature areas of Natural Language Processing. During transliteration, issues such as language identification, script specification, and missing sounds arise in code-mixed data. Social media platforms are now widely used by people to express their opinions or interests. The language used on social media today is often code-mixed text, i.e., a mix of two or more languages. In code-mixed data, one language is written using the script of another, so identifying the language of each word is important for processing such text. The major contribution of the work is a technique for identifying the language of Hindi-English code-mixed data from three social media platforms, namely Facebook, Twitter, and WhatsApp. We propose a deep learning framework based on the cBoW and Skip-gram models for language identification in code-mixed data. Popular word embedding features are used to represent each word. Much research has recently been done on language identification, but word-level language identification in a transliterated environment remains an open research issue for code-mixed data. We have implemented a deep learning model based on BLSTM that predicts the language of origin of each word in a sequence based on the specific words that precede it. Multichannel neural networks combining CNN and BLSTM are used for word-level language identification of code-mixed data written in English and Roman-transliterated Hindi, combined with cBoW and Skip-gram for evaluation. The BLSTM context capture module of the proposed system gives better accuracy with word embeddings than with character embeddings when evaluated on our two test sets. The problem is modeled jointly within the deep learning design. We present an in-depth empirical analysis of the proposed methodology against standard approaches for language identification.

Artificial Immune Systems-Based Classification Model for Code-Mixed Social Media Data

Article

Jul 2020

The main focus of the paper is to propose an artificial immune systems-based classification model for code-mixed social media data. Artificial immune systems are computational models inspired by the biological immune system. In this paper, artificial immune systems are used to optimize the initial parameters of a Long Short-Term Memory (LSTM) model. The proposed artificial immune systems-based LSTM model is then used to classify Hindi-English code-mixed data into Hindi, English, and ambiguous words. Both word embedding and character embedding features are used to represent each word. The proposed method helps identify the context of a word by extracting the user's intent in using the ambiguous word in a code-mixed sentence. Extensive experiments reveal that the artificial immune systems-based classification model outperforms competitive models, especially when there are some ambiguous words in the social media data.

The growing echo chamber of social media: Measuring temporal and social contagion dynamics for over 150 languages on Twitter for 2009–2020

Preprint

Full-text available

Mar 2020

Working from a dataset of 118 billion messages running from the start of 2009 to the end of 2019, we identify and explore the relative daily use of over 150 languages on Twitter. We find that eight languages comprise 80% of all tweets, with English, Japanese, Spanish, and Portuguese being the most dominant. To quantify each language's level of being a Twitter ‘echo chamber’ over time, we compute the ‘contagion ratio’: the balance of retweets to organic messages. We find that for the most common languages on Twitter there is a growing tendency, though not universal, to retweet rather than share new content. By the end of 2019, the contagion ratios for half of the top 30 languages, including English and Spanish, had reached above 1, the naive contagion threshold. In 2019, the top 5 languages with the highest average daily ratios were, in order, Thai (7.3), Hindi, Tamil, Urdu, and Catalan, while the bottom 5 were Russian, Swedish, Esperanto, Cebuano, and Finnish (0.26). Further, we show that over time, the contagion ratios for most common languages are growing more strongly than those of rare languages.

Computational linguistic retrieval framework using negative bootstrapping for retrieving transliteration variants

Article

Jan 2020
