Figure 1: Out-of-vocabulary word distribution in English Gigaword (NYT), Twitter and SMS data

Source publication
Conference Paper
Full-text available
Twitter provides access to large volumes of data in real time, but is notoriously noisy, hampering its utility for NLP. In this paper, we target out-of-vocabulary words in short text messages and propose a method for identifying and normalising ill-formed words. Our method uses a classifier to detect ill-formed words, and generates correction candi...

Context in source publication

Context 1
... three corpora we compare are the New York Times (NYT), SMS, and Twitter. The results are presented in Figure 1. ...

Similar publications

Article
Full-text available
This paper analyzes the new challenges facing the Galician language in the field of Natural Language Processing (NLP). In particular, it focuses on Microtext: a new graphic modality used in electronic interaction (SMS, WhatsApp, Twitter, etc.) and characterized by several linguistic licenses that distance it from the standard code,...
Conference Paper
Full-text available
The introduction of new technologies has radically changed the communication paradigm, particularly among young people, who are currently our university students. This new form of communication, such as SMS, WhatsApp, Twitter, e-mail, etc., is characterized by the brevity of the message and the lack of an organizational structure...
Conference Paper
Full-text available
Group buying can be defined as buying with the aim of receiving a quantity discount. The group-buying phenomenon was launched in 2008 by the American website Groupon.com, which marked the beginning of group buying's development. Group buying is offered by a group buying website, which acts as an intermediary between seller and buyers. Group buying websites...
Article
Full-text available
Microblogging is considered one of the most important Web 2.0 technologies of this year. With solid experience in using Web 2.0 technologies in education, the authors try to provide arguments for using microblogging systems in education, underlining the advantages but also the possible drawbacks. Twitter is the most popular microblogging a...

Citations

... Out-of-vocabulary (OOV) words are prevalent in social media text, and they pose significant challenges [14]. Furthermore, the evolving nature of online language necessitates periodic model updates [6]. ...
Conference Paper
Full-text available
In the rapidly evolving landscape of text-based communication, the importance of the initial interaction phase remains paramount. This study investigates the potential benefits that a proposed AI chat assistant equipped with text recommendation and polishing functionalities can bring during initial textual interactions. The system allows the users to personalise the language style, choosing between humorous and respectful. They can also choose between three different levels of AI extraversion to suit their preferences. Results of user evaluations indicate the system received a "good" usability rating, affirming its effectiveness. Users reported heightened comfort levels and increased willingness to continue interactions when using the AI chat assistant. The analysis of the results offers insights into harnessing AI to amplify user engagement, especially in the critical initial stage of textual interaction.
... The work of Han and Baldwin (2011) is a well-known reference for the task of lexical normalization of tweets, although their study focused on English tweets and one-to-one normalization. There is also work that concentrates on enhancing current state-of-the-art NLP tools, such as sentiment analysis, by using microtext normalization techniques as a preprocessing step on the input text. ...
Article
Full-text available
The use of computer-mediated communication has resulted in a new form of written text called Microtext, which is very different from well-written text. Most previous approaches deal with microtext at the character level rather than just words, resulting in increased processing time. In this paper, we propose to transform static word vectors to dynamic form by modelling the effect of neighbouring words and their sentiment strength in the AffectiveSpace. To evaluate the approach, we crawled Tweets from diverse topics and used human annotation to label their sentiments. We also manually normalized the tweets to fix phonetic variations, spelling errors, and abbreviations. A total of 1432 out-of-vocabulary (OOV) texts and their IV texts made it to the final corpus with their corresponding polarity. To assess the quality of the corpus, we used several OOV classifiers such as linear regression and observed over 90% accuracy. Next, we inferred word vectors using a novel four-gram model based on sentiment intensity and reported accuracy on both open-domain and closed-domain sentiment classifiers. We observed an improvement in the range of 4–20 on Twitter, Movie and Airline reviews over baselines.
... Compared to word-level translation, character-level translation is more robust for unseen words. Another model first generates a set of normalization candidates for each word, employs a Support Vector Machine (SVM) for unnormalized word detection and classification, and then uses an n-gram lookup for candidate selection [21]. Another character-focused model used two-step statistical machine translation [22]: non-standard words are first detected with conditional random fields (CRF) sequence labeling, the character sequence is then translated to a phonetic sequence, and this new sequence is finally translated to words; segmenting the words based on their phonetic meaning, symbol, or pronunciation improves the characters' alignment with parts of a word. ...
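The pipeline described in [21] (candidate generation, SVM-based detection of ill-formed words, n-gram candidate selection) can be illustrated with a much-simplified sketch. The code below is not the cited authors' implementation: the toy lexicon, the similarity threshold, the lexicon lookup standing in for the SVM detector, and raw corpus frequency standing in for the n-gram lookup are all assumptions made purely for illustration.

    # Minimal illustration of a detect -> generate candidates -> select pipeline.
    # LEXICON, the 0.5 similarity threshold, and frequency-based selection are toy stand-ins.
    from difflib import SequenceMatcher

    LEXICON = {"tomorrow": 120, "see": 300, "you": 500, "later": 80}  # toy IV lexicon with counts

    def is_oov(token):
        # Stand-in for the SVM ill-formed word detector: anything outside the lexicon.
        return token.lower() not in LEXICON

    def candidates(token, threshold=0.5):
        # Stand-in for candidate generation: rank in-vocabulary words by string similarity.
        scored = [(SequenceMatcher(None, token.lower(), w).ratio(), w) for w in LEXICON]
        return [w for score, w in sorted(scored, reverse=True) if score >= threshold]

    def normalise(token):
        # Stand-in for n-gram candidate selection: pick the most frequent candidate.
        if not is_oov(token):
            return token
        cands = candidates(token)
        return max(cands, key=LEXICON.get) if cands else token

    print([normalise(t) for t in "c u 2moro".split()])  # ['c', 'you', 'tomorrow']

Note that a purely string-similarity generator leaves phonetic abbreviations such as "c" untouched; the richer candidate generation and contextual n-gram modelling in the cited work are what make such cases tractable.
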
Article
Full-text available
Lexical Normalization (LN) aims to normalize nonstandard text to standard text. This problem is of extreme importance in natural language processing (NLP) when applying existing trained models to user-generated text on social media. Users of social media tend to use non-standard language: they heavily use abbreviations, phonetic substitutions, and colloquial language. Nevertheless, most existing NLP-based systems are designed with the standard language in mind, and they suffer from significant performance drops due to the many out-of-vocabulary words found in social media text. In this paper, we present a new LN technique that utilizes a transformer-based sequence-to-sequence (Seq2Seq) model to build a multilingual characters-to-words machine translation model. Unlike the majority of current methods, the proposed model is capable of recognizing and generating previously unseen words. It also greatly reduces the difficulties involved in tokenizing and preprocessing the nonstandard text input and the standard text output. The proposed model outperforms the winning entry to the Multilingual Lexical Normalization (MultiLexNorm) shared task at W-NUT 2021 on both intrinsic and extrinsic evaluations.
... Preprocessing is crucial when working with Twitter data, which can be quite noisy and in general may contain various non-canonical text elements such as user handles (@username), hashtags, emojis, and misspellings, among others. In their work, Nguyen et al. (2020) attempted two normalization strategies: a soft one that made minor changes, such as replacing usernames and hashtags, and a more aggressive one based on the ideas of Han and Baldwin (2011). However, the authors found no significant improvement from using the latter normalization strategy. ...
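A sketch of what such a "soft" normalization pass might look like is given below; the regular expressions and placeholder tokens (@USER, URL) are my own illustrative choices, not the exact rules used by the cited authors.

    import re

    def soft_normalise(tweet):
        tweet = re.sub(r"@\w+", "@USER", tweet)        # mask user handles
        tweet = re.sub(r"#(\w+)", r"\1", tweet)        # drop the hash, keep the word
        tweet = re.sub(r"https?://\S+", "URL", tweet)  # mask links
        return tweet

    print(soft_normalise("@bob loving #NLProc today! https://t.co/abc"))
    # @USER loving NLProc today! URL
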
Preprint
Full-text available
In recent years, the extraction of opinions and information from user-generated text has attracted a lot of interest, largely due to the unprecedented volume of content in Social Media. However, social researchers face some issues in adopting cutting-edge tools for these tasks, as they are usually behind commercial APIs, unavailable for languages other than English, or very complex to use for non-experts. To address these issues, we present pysentimiento, a comprehensive multilingual Python toolkit designed for opinion mining and other Social NLP tasks. This open-source library brings state-of-the-art models for Spanish, English, Italian, and Portuguese in an easy-to-use Python library, allowing researchers to leverage these techniques. We present a comprehensive assessment of performance for several pre-trained language models across a variety of tasks, languages, and datasets, including an evaluation of fairness in the results.
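For reference, the toolkit's documented entry point is a create_analyzer factory; a minimal usage sketch follows (output field names such as .output and .probas may differ slightly across library versions).

    from pysentimiento import create_analyzer

    analyzer = create_analyzer(task="sentiment", lang="en")   # also: "es", "it", "pt"
    result = analyzer.predict("I love this new phone!")
    print(result.output, result.probas)   # e.g. POS {'POS': 0.99, 'NEU': ..., 'NEG': ...}
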
... As detailed below, the method is divided into two parts. Using lexical normalization, the invoice line-item string was normalized following Han et al. (2010) [44]. Recognizing ill-formed out-of-vocabulary (OOV) terms is called lexical normalization. ...
Article
Full-text available
Accounts Payable (AP) is a time-consuming and labor-intensive process used by large corporations to compensate vendors on time for goods and services received. A comprehensive verification procedure is executed before disbursing funds to a supplier or vendor. After the successful conclusion of these validations, the invoice undergoes further processing by traversing multiple stages, including vendor identification; line-item matching; accounting code identification; tax code identification, ensuring proper calculation and remittance of taxes, verifying payment terms, approval routing, and compliance with internal control policies and procedures, for a comprehensive approach to invoice processing. At the moment, each of these processes is almost entirely manual and laborious, which makes the process time-consuming and prone to mistakes in the ongoing education of agents. It is difficult to accomplish the task of automatically processing these invoices for payment without any human involvement. To provide a solution, we implemented an automated invoicing system with modules based on artificial intelligence. This system processes invoices from beginning to finish. It takes very little work to configure it to meet the specific needs of each unique customer. Currently, the system has been put into production use for two customers. It has handled roughly 80 thousand invoices, of which 76 percent were automatically processed with little or no human interaction.
... We also see issues in fitting the binomial regression models in the first place. The "Pairs" column indicates how many of the 66 Han and Baldwin (2011) ... In Figure 12, we compare the number of smoothing iterations to the average AIC (top graphs), average McFadden's pseudo-R² (middle graphs), and the number of pairs that were successfully fit. We see that Retrofitting approaches get substantially worse with more iterations. ...
... and Liu et al. (2011). The Han and Baldwin (2011) dataset was formed from three annotators normalizing 1,184 out-of-vocabulary tokens from 549 English Tweets. The Liu et al. (2011) dataset was formed from Amazon Turkers normalizing 3,802 nonstandard tokens (tokens that are rare and diverge from a standard form) from 6,150 Tweets. ...
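The model-quality measures mentioned in the context above (AIC and McFadden's pseudo-R²) can be reproduced with a short sketch on synthetic data; the data-generating process here is invented purely to show where those numbers come from.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = (rng.random(200) < 1 / (1 + np.exp(-2 * x))).astype(int)  # synthetic binary outcome

    fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)             # binomial (logistic) regression

    mcfadden = 1 - fit.llf / fit.llnull                           # McFadden's pseudo-R²
    print(f"AIC = {fit.aic:.1f}")
    print(f"pseudo-R² = {mcfadden:.3f} (statsmodels reports the same value as fit.prsquared)")
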
Article
Full-text available
Linguistic variation across a region of interest can be captured by partitioning the region into areas and using social media data to train embeddings that represent language use in those areas. Recent work has focused on larger areas, such as cities or counties, to ensure that enough social media data is available in each area, but larger areas have a limited ability to find fine-grained distinctions, such as intracity differences in language use. We demonstrate that it is possible to embed smaller areas, which can provide higher-resolution analyses of language variation. We embed voting precincts, which are tiny, evenly sized political divisions for the administration of elections. The issue with modeling language use in small areas is that the data becomes incredibly sparse, with many areas having scant social media data. We propose a novel embedding approach that alternates training with smoothing, which mitigates these sparsity issues. We focus on linguistic variation across Texas as it is relatively understudied. We developed two novel quantitative evaluations that measure how well the embeddings can be used to capture linguistic variation. The first evaluation measures how well a model can map a dialect given terms specific to that dialect. The second evaluation measures how well a model can map preference of lexical variants. These evaluations show how embedding models could be used directly by sociolinguists and measure how much sociolinguistic information is contained within the embeddings. We complement this second evaluation with a methodology for using embeddings as a kind of genetic code where we identify “genes” that correspond to a sociological variable and connect those “genes” to a linguistic phenomenon, thereby connecting sociological phenomena to linguistic ones. Finally, we explore approaches for inferring isoglosses using embeddings.
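The abstract does not spell out the smoothing step, but one plausible reading of "alternating training with smoothing" is to pull each sparse area's vector toward the mean of its geographic neighbours between training passes. The sketch below is that reading only, not the paper's method; the neighbours mapping and the alpha weight are hypothetical.

    import numpy as np

    def smooth(embeddings, neighbours, alpha=0.5):
        # embeddings: {area: vector}; neighbours: {area: [adjacent area ids]} (hypothetical inputs)
        smoothed = {}
        for area, vec in embeddings.items():
            nbr = [embeddings[n] for n in neighbours.get(area, []) if n in embeddings]
            if nbr:
                smoothed[area] = (1 - alpha) * vec + alpha * np.mean(nbr, axis=0)
            else:
                smoothed[area] = vec
        return smoothed

    # Toy usage: three precincts, where precinct "A" borders "B" and "C".
    emb = {p: np.random.default_rng(i).normal(size=4) for i, p in enumerate("ABC")}
    adj = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
    emb = smooth(emb, adj)
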
... A hashtag is a word or phrase without spaces preceded by a hash symbol (#), which is used as a keyword to indicate the content of a tweet or the topic it is related to [53]. On Twitter, users primarily use hashtags to convey sentiments and opinions [24]. Many studies have shown the effectiveness of hashtags in tweet-analysis tasks [6, 51]. ...
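Since hashtags are just #-prefixed tokens, pulling them out of raw tweets for such analyses needs only a small helper; the regex below is a common simplification of Twitter's actual hashtag rules.

    import re
    from collections import Counter

    def hashtags(tweet):
        return [tag.lower() for tag in re.findall(r"#(\w+)", tweet)]

    tweets = ["Feeling great today #blessed #Monday", "#monday again..."]
    print(Counter(tag for t in tweets for tag in hashtags(t)))
    # Counter({'monday': 2, 'blessed': 1})
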
Preprint
Full-text available
People who share similar opinions towards controversial topics could form an echo chamber and may share similar political views toward other topics as well. The existence of such connections, which we call connected behavior, gives researchers a unique opportunity to predict how one would behave for a future event given their past behaviors. In this work, we propose a framework to conduct connected behavior analysis. Neural stance detection models are trained on Twitter data collected on three seemingly independent topics, i.e., wearing a mask, racial equality, and Trump, to detect people's stance, which we consider as their online behavior in each topic-related event. Our results reveal a strong connection between the stances toward the three topical events and demonstrate the power of past behaviors in predicting one's future behavior.
... Social media, however, can capture low-frequency data that traditional corpora cannot; tokens of interest that may occur a handful of times in a traditional sociolinguistic corpus (e.g., seven instances of third person quotative talkin' 'bout and 23 tokens of associative 'nem in the Corpus of Regional African American Language, Kendall and Farrington 2020) occur hundreds of thousands of times on social media (Jones, 2015). The format is inherently informal (Han and Baldwin, 2011; van Halteren and Oostdijk, 2012; Eisenstein, 2013b), people write for their social networks (Eisenstein, 2013a; Doyle, 2014; Eisenstein et al., 2014; Yuan et al., 2016), and unconventional spellings that pose challenges for traditional NLP applications nevertheless provide rich linguistic information as people engage in identity construction, often through intentionally representing their accents and pronunciation through innovative orthography (Jones, 2016c). People also navigate linguistic taboos orthographically: as Smith (2019) notes, "most white Facebookers (and a few blacks) variably spelled nigga as n***a, nga, ninja, nucca, and nicca, betraying some degree of awareness of the word's taboo status in wider social circles." ...
Article
Full-text available
There are some linguistic forms that may be known to both speakers and linguists, but that occur naturally with such low frequency that traditional sociolinguistic methods do not allow for study. This study investigates one such phenomenon: the grammatical reanalysis of an intensifier in some forms of African American English, from a full phrase [than a mother(fucker)] to a lexical word (represented here as dennamug), using data gathered from Twitter. This paper investigates the relationship between apparent lexicalization and deletion of the comparative morpheme on the preceding adjective. While state-of-the-art traditional corpora contain so few tokens they can be counted on one hand, Twitter yields almost 300,000 tokens over a 10-year sample period. This paper uses web scraping of Twitter to gather all plausible orthographic representations of the intensifier, and uses logistic regression to analyze the extent to which markers of lexicalization and reanalysis are associated with a corresponding shift from comparative to bare morphology on the adjective the intensifier modifies, finding that, indeed, degree of apparent lexicalization is strongly associated with bare morphology, suggesting ongoing lexicalization and subsequent reanalysis at the phrase level. This digital approach reveals ongoing grammatical change, with the new intensifier associated with bare, not comparative, adjectives, and seemingly stable variation correlated with the degree to which the intensifier has lexicalized. Orthographic representations of African American English on social media are shown to be a locus of identity construction and grammatical change.
... "makan where?" → "where should we eat?"). from character/token level manipulation, early studies utilized lexical-based methods like dictionary lookup, word similarity, and N-gram probabilities (Han and Baldwin, 2011;Supranovich and Patsepnia, 2015). MoNoise (van der Goot and van Noord, 2017) built a pipeline that is similar to a ranking-retrieval approach. ...
Conference Paper
Full-text available
Within the natural language processing community, English is by far the most resource-rich language. There is emerging interest in conducting translation via computational approaches to conform its dialects or creole languages back to standard English. This computational approach paves the way to leverage generic English language backbones, which are beneficial for various downstream tasks. However, in practical online communication scenarios, the use of language varieties is often accompanied by noisy user-generated content, making this translation task more challenging. In this work, we introduce a joint paraphrasing task of creole translation and text normalization of Singlish messages, which can shed light on how to process other language varieties and dialects. We formulate the task in three different linguistic dimensions: lexical level normalization, syntactic level editing, and semantic level rewriting. We build an annotated dataset of Singlish-to-Standard English messages, and report performance on a perturbation-resilient sequence-to-sequence model. Experimental results show that the model produces reasonable generation results, and can improve the performance of downstream tasks like stance detection.
... Another is lexical normalization before training: transforming non-standard tokens into a more standardised form to reduce the number of out-of-vocabulary tokens (Haruechaiyasak and Kongthon, 2013; Cook and Stevenson, 2009; Han and Baldwin, 2011; Liu et al., 2012). Both approaches therefore ignore the hidden semantics of misspelling, either by explicitly removing it or by losing the connection to the standard form. ...
... More recent works started investigating different types of misspelling formation. Cook and Stevenson (2009) and Han and Baldwin (2011) presented the consistent observation that the majority of misspellings found on the internet come from morphophonemic variations (transformations of the surface form of a word that preserve a similar pronunciation) and abbreviations. This finding is then used as a guideline for building their lexical normalization models. ...
Preprint
Full-text available
User-generated content is full of misspellings. Rather than being just random noise, we hypothesise that many misspellings contain hidden semantics that can be leveraged for language understanding tasks. This paper presents a fine-grained annotated corpus of misspelling in Thai, together with an analysis of misspelling intention and its possible semantics to get a better understanding of the misspelling patterns observed in the corpus. In addition, we introduce two approaches to incorporate the semantics of misspelling: Misspelling Average Embedding (MAE) and Misspelling Semantic Tokens (MST). Experiments on a sentiment analysis task confirm our overall hypothesis: additional semantics from misspelling can boost the micro F1 score up to 0.4-2%, while blindly normalising misspelling is harmful and suboptimal.
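Read literally, the "Misspelling Average Embedding" idea above amounts to representing a standard word by the average of its own vector and the vectors of its observed misspelling variants. The sketch below follows that reading with toy random vectors and a hypothetical variant list; it is not the authors' code or data.

    import numpy as np

    rng = np.random.default_rng(42)
    emb = {w: rng.normal(size=8) for w in ["love", "luv", "loove", "happy", "happi"]}  # toy vectors

    variants = {"love": ["luv", "loove"], "happy": ["happi"]}  # hypothetical misspelling clusters

    def misspelling_average(word):
        # Average the standard word's vector with those of its misspelling variants.
        vecs = [emb[word]] + [emb[v] for v in variants.get(word, []) if v in emb]
        return np.mean(vecs, axis=0)

    print(misspelling_average("love").shape)  # (8,)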