Figure 1: Out-of-vocabulary word distribution in English Gigaword (NYT), Twitter and SMS data

Source publication
Conference Paper
Full-text available
Twitter provides access to large volumes of data in real time, but is notoriously noisy, hampering its utility for NLP. In this paper, we target out-of-vocabulary words in short text messages and propose a method for identifying and normalising ill-formed words. Our method uses a classifier to detect ill-formed words, and generates correction candi...

Context in source publication

Context 1
... three corpora we compare are the New York Times (NYT), SMS, and Twitter. The results are presented in Figure 1. ...

Similar publications

Article
Full-text available
This paper analyzes the new challenges facing the Galician language in the field of Natural Language Processing (NLP). In particular, it focuses on Microtext: a new graphic modality used in electronic interaction (SMS, WhatsApp, Twitter, etc.) and characterized by several linguistic licenses that distance it from the standard code,...
Conference Paper
Full-text available
The introduction of new technologies has radically changed the communication paradigm, particularly among young people, who are currently our university students. This new form of communication, such as SMS, WhatsApp, Twitter, e-mail, etc., is characterized by the brevity of the message and the lack of an organizational structure...
Conference Paper
Full-text available
Group buying can be defined as buying with the aim of receiving a quantity discount. The group-buying phenomenon was launched in 2008 by the American website Groupon.com, which marked the beginning of group buying's development. Group buying is offered by a group buying website, which acts as an intermediary between seller and buyers. Group buying websites...
Article
Full-text available
Microblogging is considered one of the most important Web 2.0 technologies of this year. With solid experience in using Web 2.0 technologies in education, the authors try to provide arguments for using microblogging systems in education, underlining the advantages but also the possible drawbacks. Twitter is the most popular microblogging a...

Citations

... Out-of-vocabulary (OOV) words are prevalent in social media text, and they pose significant challenges [14]. Furthermore, the evolving nature of online language necessitates periodic model updates [6]. ...
Conference Paper
Full-text available
In the rapidly evolving landscape of text-based communication, the importance of the initial interaction phase remains paramount. This study investigates the potential benefits that a proposed AI chat assistant equipped with text recommendation and polishing functionalities can bring during initial textual interactions. The system allows the users to personalise the language style, choosing between humorous and respectful. They can also choose between three different levels of AI extraversion to suit their preferences. Results of user evaluations indicate the system received a "good" usability rating, affirming its effectiveness. Users reported heightened comfort levels and increased willingness to continue interactions when using the AI chat assistant. The analysis of the results offers insights into harnessing AI to amplify user engagement, especially in the critical initial stage of textual interaction.
... The work of Han and Baldwin (2011) is a well-known reference for the task of lexical normalization of tweets, although their study focused on English tweets and one-to-one normalization. There is also work that concentrates on enhancing current state-of-the-art NLP tools, such as sentiment analysis, by using microtext normalization techniques as a preprocessing step on the input text. ...
Article
Full-text available
The use of computer-mediated communication has resulted in a new form of written text called Microtext, which is very different from well-written text. Most previous approaches deal with microtext at the character level rather than just words, resulting in increased processing time. In this paper, we propose to transform static word vectors to dynamic form by modelling the effect of neighbouring words and their sentiment strength in the AffectiveSpace. To evaluate the approach, we crawled Tweets from diverse topics and used human annotation to label their sentiments. We also manually normalized the tweets to fix phonetic variations, spelling errors, and abbreviations. A total of 1432 out-of-vocabulary (OOV) texts and their IV texts made it to the final corpus with their corresponding polarity. To assess the quality of the corpus, we used several OOV classifiers such as linear regression and observed over 90% accuracy. Next, we inferred word vectors using a novel four-gram model based on sentiment intensity and reported accuracy on both open-domain and closed-domain sentiment classifiers. We observed an improvement in the range of 4–20 on Twitter, Movie and Airline reviews over baselines.
... Compared to word-level translation, character-level translation is more robust for unseen words. Another model first generates a set of normalization candidates for each word, employs a Support Vector Machine (SVM) for unnormalized word detection and classification, and then uses an n-gram lookup for candidate selection [21]. Another character-focused model used two-step statistical machine translation [22]: non-standard words are first detected with conditional random fields (CRF) sequence labeling, the character sequence is then translated to a phonetic sequence, and this new sequence is finally translated to words; segmenting the words based on their phonetic meaning, symbol, or pronunciation improves the characters' alignment with parts of a word. ...
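The pipeline described in [21] (candidate generation, SVM-based detection of ill-formed words, n-gram candidate selection) can be illustrated with a much-simplified sketch. The code below is not the cited authors' implementation: the toy lexicon, the similarity threshold, the lexicon lookup standing in for the SVM detector, and raw corpus frequency standing in for the n-gram lookup are all assumptions made purely for illustration.

    # Minimal illustration of a detect -> generate candidates -> select pipeline.
    # LEXICON, the 0.5 similarity threshold, and frequency-based selection are toy stand-ins.
    from difflib import SequenceMatcher

    LEXICON = {"tomorrow": 120, "see": 300, "you": 500, "later": 80}  # toy IV lexicon with counts

    def is_oov(token):
        # Stand-in for the SVM ill-formed word detector: anything outside the lexicon.
        return token.lower() not in LEXICON

    def candidates(token, threshold=0.5):
        # Stand-in for candidate generation: rank in-vocabulary words by string similarity.
        scored = [(SequenceMatcher(None, token.lower(), w).ratio(), w) for w in LEXICON]
        return [w for score, w in sorted(scored, reverse=True) if score >= threshold]

    def normalise(token):
        # Stand-in for n-gram candidate selection: pick the most frequent candidate.
        if not is_oov(token):
            return token
        cands = candidates(token)
        return max(cands, key=LEXICON.get) if cands else token

    print([normalise(t) for t in "c u 2moro".split()])  # ['c', 'you', 'tomorrow']

Note that a purely string-similarity generator leaves phonetic abbreviations such as "c" untouched; the richer candidate generation and contextual n-gram modelling in the cited work are what make such cases tractable.
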
Article
Full-text available
Lexical Normalization (LN) aims to normalize nonstandard text to standard text. This problem is of extreme importance in natural language processing (NLP) when applying existing trained models to user-generated text on social media. Users of social media tend to use non-standard language: they heavily use abbreviations, phonetic substitutions, and colloquial language. Nevertheless, most existing NLP-based systems are designed with the standard language in mind, and they suffer from significant performance drops due to the many out-of-vocabulary words found in social media text. In this paper, we present a new LN technique that utilizes a transformer-based sequence-to-sequence (Seq2Seq) model to build a multilingual characters-to-words machine translation model. Unlike the majority of current methods, the proposed model is capable of recognizing and generating previously unseen words. It also greatly reduces the difficulties involved in tokenizing and preprocessing the nonstandard text input and the standard text output. The proposed model outperforms the winning entry to the Multilingual Lexical Normalization (MultiLexNorm) shared task at W-NUT 2021 on both intrinsic and extrinsic evaluations.
... Preprocessing is crucial when working with Twitter data, which can be quite noisy and in general may contain various non-canonical text elements such as user handles (@username), hashtags, emojis, and misspellings, among others. In their work, Nguyen et al. (2020) attempted two normalization strategies: a soft one that made minor changes, such as replacing usernames and hashtags, and a more aggressive one based on the ideas of Han and Baldwin (2011). However, the authors found no significant improvement from using the latter normalization strategy. ...
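A sketch of what such a "soft" normalization pass might look like is given below; the regular expressions and placeholder tokens (@USER, URL) are my own illustrative choices, not the exact rules used by the cited authors.

    import re

    def soft_normalise(tweet):
        tweet = re.sub(r"@\w+", "@USER", tweet)        # mask user handles
        tweet = re.sub(r"#(\w+)", r"\1", tweet)        # drop the hash, keep the word
        tweet = re.sub(r"https?://\S+", "URL", tweet)  # mask links
        return tweet

    print(soft_normalise("@bob loving #NLProc today! https://t.co/abc"))
    # @USER loving NLProc today! URL
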
Preprint
Full-text available
In recent years, the extraction of opinions and information from user-generated text has attracted a lot of interest, largely due to the unprecedented volume of content in Social Media. However, social researchers face some issues in adopting cutting-edge tools for these tasks, as they are usually behind commercial APIs, unavailable for languages other than English, or very complex to use for non-experts. To address these issues, we present pysentimiento, a comprehensive multilingual Python toolkit designed for opinion mining and other Social NLP tasks. This open-source library brings state-of-the-art models for Spanish, English, Italian, and Portuguese in an easy-to-use Python library, allowing researchers to leverage these techniques. We present a comprehensive assessment of performance for several pre-trained language models across a variety of tasks, languages, and datasets, including an evaluation of fairness in the results.
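For reference, the toolkit's documented entry point is a create_analyzer factory; a minimal usage sketch follows (output field names such as .output and .probas may differ slightly across library versions).

    from pysentimiento import create_analyzer

    analyzer = create_analyzer(task="sentiment", lang="en")   # also: "es", "it", "pt"
    result = analyzer.predict("I love this new phone!")
    print(result.output, result.probas)   # e.g. POS {'POS': 0.99, 'NEU': ..., 'NEG': ...}
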
... As detailed below, the method is divided into two parts. Using lexical normalization, the invoice line-item string was normalized following Han et al. (2010) [44]. Recognizing ill-formed out-of-vocabulary (OOV) terms is called lexical normalization. ...
Article
Full-text available
Accounts Payable (AP) is a time-consuming and labor-intensive process used by large corporations to compensate vendors on time for goods and services received. A comprehensive verification procedure is executed before disbursing funds to a supplier or vendor. After the successful conclusion of these validations, the invoice undergoes further processing by traversing multiple stages, including vendor identification; line-item matching; accounting code identification; tax code identification, ensuring proper calculation and remittance of taxes, verifying payment terms, approval routing, and compliance with internal control policies and procedures, for a comprehensive approach to invoice processing. At the moment, each of these processes is almost entirely manual and laborious, which makes the process time-consuming and prone to mistakes in the ongoing education of agents. It is difficult to accomplish the task of automatically processing these invoices for payment without any human involvement. To provide a solution, we implemented an automated invoicing system with modules based on artificial intelligence. This system processes invoices from beginning to finish. It takes very little work to configure it to meet the specific needs of each unique customer. Currently, the system has been put into production use for two customers. It has handled roughly 80 thousand invoices, of which 76 percent were automatically processed with little or no human interaction.
... We also see issues in fitting the binomial regression models in the first place. The "Pairs" column indicates how many of the 66 Han and Baldwin (2011) ... In Figure 12, we compare the number of smoothing iterations to the average AIC (top graphs), average McFadden's pseudo-R² (middle graphs), and the number of pairs that were successfully fit. We see that Retrofitting approaches get substantially worse with more iterations. ...
... and Liu et al. (2011). The Han and Baldwin (2011) dataset was formed from three annotators normalizing 1,184 out-of-vocabulary tokens from 549 English Tweets. The Liu et al. (2011) dataset was formed from Amazon Turkers normalizing 3,802 nonstandard tokens (tokens that are rare and diverge from a standard form) from 6,150 Tweets. ...
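The model-quality measures mentioned in the context above (AIC and McFadden's pseudo-R²) can be reproduced with a short sketch on synthetic data; the data-generating process here is invented purely to show where those numbers come from.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = (rng.random(200) < 1 / (1 + np.exp(-2 * x))).astype(int)  # synthetic binary outcome

    fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)             # binomial (logistic) regression

    mcfadden = 1 - fit.llf / fit.llnull                           # McFadden's pseudo-R²
    print(f"AIC = {fit.aic:.1f}")
    print(f"pseudo-R² = {mcfadden:.3f} (statsmodels reports the same value as fit.prsquared)")
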
Article
Full-text available
Linguistic variation across a region of interest can be captured by partitioning the region into areas and using social media data to train embeddings that represent language use in those areas. Recent work has focused on larger areas, such as cities or counties, to ensure that enough social media data is available in each area, but larger areas have a limited ability to find fine-grained distinctions, such as intracity differences in language use. We demonstrate that it is possible to embed smaller areas, which can provide higher-resolution analyses of language variation. We embed voting precincts, which are tiny, evenly sized political divisions for the administration of elections. The issue with modeling language use in small areas is that the data becomes incredibly sparse, with many areas having scant social media data. We propose a novel embedding approach that alternates training with smoothing, which mitigates these sparsity issues. We focus on linguistic variation across Texas as it is relatively understudied. We developed two novel quantitative evaluations that measure how well the embeddings can be used to capture linguistic variation. The first evaluation measures how well a model can map a dialect given terms specific to that dialect. The second evaluation measures how well a model can map preference of lexical variants. These evaluations show how embedding models could be used directly by sociolinguists and measure how much sociolinguistic information is contained within the embeddings. We complement this second evaluation with a methodology for using embeddings as a kind of genetic code where we identify “genes” that correspond to a sociological variable and connect those “genes” to a linguistic phenomenon, thereby connecting sociological phenomena to linguistic ones. Finally, we explore approaches for inferring isoglosses using embeddings.
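The abstract does not spell out the smoothing step, but one plausible reading of "alternating training with smoothing" is to pull each sparse area's vector toward the mean of its geographic neighbours between training passes. The sketch below is that reading only, not the paper's method; the neighbours mapping and the alpha weight are hypothetical.

    import numpy as np

    def smooth(embeddings, neighbours, alpha=0.5):
        # embeddings: {area: vector}; neighbours: {area: [adjacent area ids]} (hypothetical inputs)
        smoothed = {}
        for area, vec in embeddings.items():
            nbr = [embeddings[n] for n in neighbours.get(area, []) if n in embeddings]
            if nbr:
                smoothed[area] = (1 - alpha) * vec + alpha * np.mean(nbr, axis=0)
            else:
                smoothed[area] = vec
        return smoothed

    # Toy usage: three precincts, where precinct "A" borders "B" and "C".
    emb = {p: np.random.default_rng(i).normal(size=4) for i, p in enumerate("ABC")}
    adj = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
    emb = smooth(emb, adj)
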
... A hashtag is a word or phrase without spaces preceded by a hash symbol (#), which is used as a keyword to indicate the content of a tweet or the topic it is related to [53]. On Twitter, users primarily use hashtags to convey sentiments and opinions [24]. Many studies have shown the effectiveness of hashtags in tweet-analysis tasks [6, 51]. ...
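Since hashtags are just #-prefixed tokens, pulling them out of raw tweets for such analyses needs only a small helper; the regex below is a common simplification of Twitter's actual hashtag rules.

    import re
    from collections import Counter

    def hashtags(tweet):
        return [tag.lower() for tag in re.findall(r"#(\w+)", tweet)]

    tweets = ["Feeling great today #blessed #Monday", "#monday again..."]
    print(Counter(tag for t in tweets for tag in hashtags(t)))
    # Counter({'monday': 2, 'blessed': 1})
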
Preprint
Full-text available
People who share similar opinions towards controversial topics could form an echo chamber and may share similar political views toward other topics as well. The existence of such connections, which we call connected behavior, gives researchers a unique opportunity to predict how one would behave for a future event given their past behaviors. In this work, we propose a framework to conduct connected behavior analysis. Neural stance detection models are trained on Twitter data collected on three seemingly independent topics, i.e., wearing a mask, racial equality, and Trump, to detect people's stance, which we consider as their online behavior in each topic-related event. Our results reveal a strong connection between the stances toward the three topical events and demonstrate the power of past behaviors in predicting one's future behavior.
... Social media, however, can capture low-frequency data that traditional corpora cannot; tokens of interest that may occur a handful of times in a traditional sociolinguistic corpus (e.g., seven instances of third person quotative talkin' 'bout and 23 tokens of associative 'nem in the Corpus of Regional African American Language, Kendall and Farrington 2020) occur hundreds of thousands of times on social media (Jones, 2015). The format is inherently informal (Han and Baldwin, 2011; van Halteren and Oostdijk, 2012; Eisenstein, 2013b), people write for their social networks (Eisenstein, 2013a; Doyle, 2014; Eisenstein et al., 2014; Yuan et al., 2016), and unconventional spellings that pose challenges for traditional NLP applications nevertheless provide rich linguistic information as people engage in identity construction, often through intentionally representing their accents and pronunciation through innovative orthography (Jones, 2016c). People also navigate linguistic taboos orthographically: as Smith (2019) notes, "most white Facebookers (and a few blacks) variably spelled nigga as n***a, nga, ninja, nucca, and nicca, betraying some degree of awareness of the word's taboo status in wider social circles." ...
Article
Full-text available
There are some linguistic forms that may be known to both speakers and linguists, but that occur naturally with such low frequency that traditional sociolinguistic methods do not allow for study. This study investigates one such phenomenon: the grammatical reanalysis of an intensifier in some forms of African American English, from a full phrase [than a mother(fucker)] to a lexical word (represented here as dennamug), using data gathered from Twitter. This paper investigates the relationship between apparent lexicalization and deletion of the comparative morpheme on the preceding adjective. While state-of-the-art traditional corpora contain so few tokens they can be counted on one hand, Twitter yields almost 300,000 tokens over a 10-year sample period. This paper uses web scraping of Twitter to gather all plausible orthographic representations of the intensifier, and uses logistic regression to analyze the extent to which markers of lexicalization and reanalysis are associated with a corresponding shift from comparative to bare morphology on the adjective the intensifier modifies, finding that, indeed, degree of apparent lexicalization is strongly associated with bare morphology, suggesting ongoing lexicalization and subsequent reanalysis at the phrase level. This digital approach reveals ongoing grammatical change, with the new intensifier associated with bare, not comparative, adjectives, and seemingly stable variation correlated with the degree to which the intensifier has lexicalized. Orthographic representations of African American English on social media are shown to be a locus of identity construction and grammatical change.
... "makan where?" → "where should we eat?"). from character/token level manipulation, early studies utilized lexical-based methods like dictionary lookup, word similarity, and N-gram probabilities (Han and Baldwin, 2011;Supranovich and Patsepnia, 2015). MoNoise (van der Goot and van Noord, 2017) built a pipeline that is similar to a ranking-retrieval approach. ...
Conference Paper
Full-text available
Within the natural language processing community, English is by far the most resource-rich language. There is emerging interest in conducting translation via computational approaches to conform its dialects or creole languages back to standard English. This computational approach paves the way to leverage generic English language backbones, which are beneficial for various downstream tasks. However, in practical online communication scenarios, the use of language varieties is often accompanied by noisy user-generated content, making this translation task more challenging. In this work, we introduce a joint paraphrasing task of creole translation and text normalization of Singlish messages, which can shed light on how to process other language varieties and dialects. We formulate the task in three different linguistic dimensions: lexical level normalization, syntactic level editing, and semantic level rewriting. We build an annotated dataset of Singlish-to-Standard English messages, and report performance on a perturbation-resilient sequence-to-sequence model. Experimental results show that the model produces reasonable generation results, and can improve the performance of downstream tasks like stance detection.
... Another is lexical normalization before training: transforming non-standard tokens into a more standardised form to reduce the number of out-of-vocabulary tokens (Haruechaiyasak and Kongthon, 2013; Cook and Stevenson, 2009; Han and Baldwin, 2011; Liu et al., 2012). Both approaches therefore ignore the hidden semantics of misspelling, either by explicitly removing it or by losing the connection to the standard form. ...
... More recent works started investigating different types of misspelling formation. Cook and Stevenson (2009) and Han and Baldwin (2011) presented the consistent observation that the majority of misspellings found on the internet come from morphophonemic variations (transformations of the surface form of a word that preserve a similar pronunciation) and abbreviations. This finding is then used as a guideline for building their lexical normalization models. ...
Preprint
Full-text available
User-generated content is full of misspellings. Rather than being just random noise, we hypothesise that many misspellings contain hidden semantics that can be leveraged for language understanding tasks. This paper presents a fine-grained annotated corpus of misspelling in Thai, together with an analysis of misspelling intention and its possible semantics to get a better understanding of the misspelling patterns observed in the corpus. In addition, we introduce two approaches to incorporate the semantics of misspelling: Misspelling Average Embedding (MAE) and Misspelling Semantic Tokens (MST). Experiments on a sentiment analysis task confirm our overall hypothesis: additional semantics from misspelling can boost the micro F1 score up to 0.4-2%, while blindly normalising misspelling is harmful and suboptimal.
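Read literally, the "Misspelling Average Embedding" idea above amounts to representing a standard word by the average of its own vector and the vectors of its observed misspelling variants. The sketch below follows that reading with toy random vectors and a hypothetical variant list; it is not the authors' code or data.

    import numpy as np

    rng = np.random.default_rng(42)
    emb = {w: rng.normal(size=8) for w in ["love", "luv", "loove", "happy", "happi"]}  # toy vectors

    variants = {"love": ["luv", "loove"], "happy": ["happi"]}  # hypothetical misspelling clusters

    def misspelling_average(word):
        # Average the standard word's vector with those of its misspelling variants.
        vecs = [emb[word]] + [emb[v] for v in variants.get(word, []) if v in emb]
        return np.mean(vecs, axis=0)

    print(misspelling_average("love").shape)  # (8,)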