ORIGINAL ARTICLE
Code-mixed Hindi-English text correction using fuzzy graph and word embedding
Minni Jain¹ | Rajni Jindal¹ | Amita Jain²

¹ Computer Science and Engineering, Delhi Technological University, New Delhi, India
² Computer Science and Engineering, Netaji Subhas University of Technology Delhi, New Delhi, India

Correspondence
Amita Jain, Computer Science and Engineering, Netaji Subhas University of Technology Delhi, New Delhi, India.
Email: amita.jain@nsut.ac.in
Abstract
Interaction via social media involves frequent code-mixed text, spelling errors, and noisy elements, which creates a bottleneck for the performance of natural language processing applications. This work proposes the first approach for code-mixed Hindi-English social media text that comprises language identification together with the detection and correction of non-word (out-of-vocabulary) errors and real-word errors occurring simultaneously. Each identified language (Devanagari Hindi, Roman Hindi, and English) has its own complexities and challenges. Errors are detected individually for each language, and a suggestion list is created for each erroneous word. A fuzzy graph over the words of the suggestion lists is then generated using the various semantic relations in Hindi WordNet. Word embeddings and fuzzy graph-based centrality measures are used to find the correct word. Several experiments are performed on social media datasets taken from Instagram, Twitter, YouTube comments, blogs, and WhatsApp. The experimental results demonstrate that the proposed system corrects out-of-vocabulary words as well as real-word errors with a maximum recall of 0.90 and 0.67, respectively, for Dev_Hindi and 0.87 and 0.66, respectively, for Rom_Hindi. The proposed method is also applied to state-of-the-art sentiment analysis approaches, where the F1-score improves visibly.
KEYWORDS
fuzzy centrality measures, fuzzy graphs, Hindi WordNet, real-word error and non-word error, text normalization, Word2Vec
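As an illustration of the detection step the abstract describes, the following is a minimal sketch in Python, not the authors' implementation: a word absent from the per-language lexicon is flagged as a non-word error, and lexicon entries within a small Levenshtein edit distance form its suggestion list. The toy Roman-Hindi lexicon, the distance threshold, and the use of plain edit distance are assumptions made for the example; the paper builds such lists separately for each identified language (Devanagari Hindi, Roman Hindi, English) and does not commit to this exact procedure.

    # Minimal sketch (assumptions flagged above): non-word error detection
    # and suggestion-list generation against a toy per-language lexicon.
    LEXICON = {"ghar", "gharana", "makaan", "acha", "nahi"}  # toy Roman-Hindi lexicon

    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def suggestion_list(word: str, max_dist: int = 2) -> list[str]:
        """Empty list for in-vocabulary words; otherwise the lexicon words
        within max_dist edits of `word`, nearest first."""
        if word in LEXICON:
            return []  # in vocabulary: not a non-word error
        near = sorted((levenshtein(word, w), w) for w in LEXICON)
        return [w for d, w in near if d <= max_dist]

    print(suggestion_list("gar"))  # -> ['ghar']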
1 | INTRODUCTION
Social media facilitates sharing and exchanging ideas, information, thoughts, and other forms of expression via virtual networks and communities (Mayfield, 2008). Social media activities include blogging, social networks (Twitter, Facebook), messaging, product reviews (Google Play Store, Amazon, Flipkart, Myntra), service reviews (Ola, Uber, 1mg, UrbanClap, TripAdvisor), and many more. As of 2021, the number of people using social media worldwide is over 4.20 billion.¹ In India specifically, because of the widespread availability of internet access, the number of social media users grew to 448 million in 2021.² According to the latest survey by Statista, social media penetration makes India a leading country in terms of Facebook, Instagram, and WhatsApp audiences (shown in Figure 1).³ This clearly indicates that social media has become an essential part of daily internet usage in India.⁴,⁵ India is a multilingual country with 528 million Hindi and 125 million English speakers. A large number of users blend Hindi and English during informal social media communication. This blended language is known as Hindi-English code-mixed language. The huge amount of code-mixed data on social media is a rich source of information about the interests, behaviour, and concerns of users. The extracted information can be further utilized for applications such as search engines, data mining, sentiment analysis, cyber security, text-based forecasting, and so forth. But this enormous code-mixed data cannot be utilized in its original form. It comprises
Received: 10 October 2022 | Revised: 13 February 2023 | Accepted: 19 April 2023
DOI: 10.1111/exsy.13328
Expert Systems. 2024;41:e13328. wileyonlinelibrary.com/journal/exsy © 2023 John Wiley & Sons Ltd.
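To make the fuzzy-graph correction step of the abstract concrete, here is a second minimal sketch, again under the caveat that it is not the authors' implementation. Candidate corrections and in-vocabulary context words become vertices; edge weights in [0, 1] act as fuzzy membership degrees (the paper derives them from Hindi WordNet semantic relations and Word2Vec similarities, whereas the toy RELATEDNESS table below stands in for both); the candidate with the highest fuzzy degree centrality, that is, the largest sum of incident membership degrees, is chosen. The paper employs several fuzzy centrality measures; this sketch shows only the simplest one.

    # Minimal sketch of the fuzzy-graph selection step; the relatedness table
    # is toy data standing in for Hindi WordNet relations + Word2Vec similarity.
    import itertools
    import networkx as nx

    RELATEDNESS = {
        frozenset({"ghar", "makaan"}): 0.9,    # near-synonyms ('house')
        frozenset({"ghar", "gharana"}): 0.4,
        frozenset({"makaan", "gharana"}): 0.3,
    }

    def fuzzy_membership(w1: str, w2: str) -> float:
        """Edge membership degree in [0, 1]; 0 when no relation is known."""
        return RELATEDNESS.get(frozenset({w1, w2}), 0.0)

    def pick_correction(suggestions: list[str], context_words: list[str]) -> str:
        """Build the fuzzy graph over candidates and context words, then
        return the candidate with the highest fuzzy degree centrality."""
        g = nx.Graph()
        vertices = set(suggestions) | set(context_words)
        g.add_nodes_from(vertices)
        for u, v in itertools.combinations(sorted(vertices), 2):
            mu = fuzzy_membership(u, v)
            if mu > 0.0:
                g.add_edge(u, v, weight=mu)
        centrality = {s: sum(d["weight"] for _, _, d in g.edges(s, data=True))
                      for s in suggestions}
        return max(centrality, key=centrality.get)

    # The misspelling 'gar' in a context containing 'makaan' ('house'):
    print(pick_correction(["ghar", "gharana"], ["makaan"]))  # -> 'ghar'

Ranking candidates by how centrally they sit among semantically related context words, rather than by surface similarity alone, is what plausibly lets the same machinery handle real-word errors, where the erroneous token is itself a valid dictionary word.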