Code‐mixed Hindi‐English text correction using fuzzy graph and word embedding

Social media interaction frequently involves code-mixed text, spelling errors, and noisy elements, which create a bottleneck for natural language processing applications. The proposed work is the first approach for code-mixed Hindi-English social media text that combines language identification with the detection and correction of non-word (out-of-vocabulary) errors and real-word errors occurring simultaneously. Each identified language (Devanagari Hindi, Roman Hindi, and English) has its own complexities and challenges. Errors are detected separately for each language, and a suggestive list for the erroneous words is created. A fuzzy graph over the words of the suggestive lists is then generated using various semantic relations in Hindi WordNet. Word embeddings and fuzzy graph-based centrality measures are used to select the correct word. Experiments are performed on social media datasets drawn from Instagram, Twitter, YouTube comments, blogs, and WhatsApp. The results show that the proposed system corrects out-of-vocabulary words and real-word errors with a maximum recall of 0.90 and 0.67, respectively, for Dev_Hindi and 0.87 and 0.66, respectively, for Rom_Hindi. The proposed method is also applied to state-of-the-art sentiment analysis approaches, where it visibly improves the F1-score.
Minni Jain | Rajni Jindal | Amita Jain
Computer Science and Engineering, Delhi Technological University, New Delhi, India
Computer Science and Engineering, Netaji Subhas University of Technology Delhi, New Delhi, India
Correspondence: Amita Jain, Computer Science and Engineering, Netaji Subhas University of Technology Delhi, New Delhi, India.
Keywords: fuzzy centrality measures, fuzzy graphs, Hindi WordNet, real-word error and non-word error, text normalization, Word2Vec
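
The candidate selection step summarized in the abstract (a fuzzy graph over the words of the suggestive lists, followed by a centrality measure) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the simplest fuzzy centrality, the fuzzy degree (the sum of the membership values of a node's incident edges), and the function names, the Roman Hindi example words, and every membership value are hypothetical placeholders for the relatedness scores that Hindi WordNet relations or word embeddings would supply.

```python
from collections import defaultdict


def fuzzy_degree_centrality(edges):
    """Sum the membership values of the edges incident to each node."""
    centrality = defaultdict(float)
    for (u, v), mu in edges.items():
        if not 0.0 <= mu <= 1.0:
            raise ValueError(f"membership {mu} for edge ({u}, {v}) is outside [0, 1]")
        centrality[u] += mu
        centrality[v] += mu
    return dict(centrality)


def pick_candidates(suggestions, edges):
    """For each erroneous word, return the suggestion with the highest centrality."""
    centrality = fuzzy_degree_centrality(edges)
    return {
        wrong: max(candidates, key=lambda c: centrality.get(c, 0.0))
        for wrong, candidates in suggestions.items()
    }


# Hypothetical suggestive lists for two misspelled Roman Hindi words.
suggestions = {
    "khna": ["khana", "kahna"],   # "food" vs "to say"
    "pkaya": ["pakaya", "paya"],  # "cooked" vs "found"
}

# Hypothetical membership values standing in for semantic relatedness
# between candidates of different erroneous words.
edges = {
    ("khana", "pakaya"): 0.8,
    ("khana", "paya"): 0.3,
    ("kahna", "pakaya"): 0.1,
    ("kahna", "paya"): 0.2,
}

chosen = pick_candidates(suggestions, edges)
print(chosen)
```

With these made-up values, the candidates that are strongly related to each other ("khana" and "pakaya") receive the highest fuzzy degree and are selected together, which is the intuition behind resolving several errors in one sentence jointly rather than word by word.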
Social media facilitates the sharing and exchange of ideas, information, thoughts, and other forms of expression via virtual networks and communities (Mayfield, 2008). Social media activities include blogging, social networking (Twitter, Facebook), messaging, product reviews (Google Play Store, Amazon, Flipkart, Myntra), service reviews (Ola, Uber, 1mg, UrbanClap, TripAdvisor), and many more. As of 2021, more than 4.20 billion people worldwide use social media. In India specifically, because of the widespread availability of internet access, the number of social media users grew to 448 million in 2021. According to the latest survey by Statista, social media penetration makes India a leading country by Facebook, Instagram, and WhatsApp audience (shown in Figure 1). This clearly indicates that social media has become an essential part of daily internet usage in India. India is a multilingual country with 528 million Hindi and 125 million English speakers. Many users blend Hindi and English during informal social media communication. This blended language is known as Hindi-English code-mixed language. The huge volume of code-mixed data on social media is a rich source of information about the interests, behaviour, and concerns of users. The extracted information can be further utilized for applications such as search engines, data mining, sentiment analysis, cyber security, text-based forecasting, and so forth. However, this enormous code-mixed data cannot be utilized in its original form. It comprises
Received: 10 October 2022 Revised: 13 February 2023 Accepted: 19 April 2023
DOI: 10.1111/exsy.13328
Expert Systems. 2024;41:e13328. © 2023 John Wiley & Sons Ltd.
