ORIGINAL ARTICLE
Code-mixed Hindi-English text correction using fuzzy graph and word embedding
Minni Jain¹ | Rajni Jindal¹ | Amita Jain²

¹ Computer Science and Engineering, Delhi Technological University, New Delhi, India
² Computer Science and Engineering, Netaji Subhas University of Technology Delhi, New Delhi, India

Correspondence
Amita Jain, Computer Science and Engineering, Netaji Subhas University of Technology Delhi, New Delhi, India.
Email: amita.jain@nsut.ac.in
Abstract
Interaction via social media involves frequent code-mixed text, spelling errors, and noisy elements, which creates a bottleneck for the performance of natural language processing applications. This work proposes the first approach for code-mixed Hindi-English social media text that comprises language identification together with the detection and correction of non-word (out-of-vocabulary) errors and real-word errors occurring simultaneously. Each identified language (Devanagari Hindi, Roman Hindi, and English) has its own complexities and challenges. Errors are detected individually for each language, and a suggestion list is created for each erroneous word. A fuzzy graph over the words of the suggestion lists is then generated using the various semantic relations in Hindi WordNet. Word embeddings and fuzzy graph-based centrality measures are used to find the correct word. Several experiments are performed on social media datasets taken from Instagram, Twitter, YouTube comments, blogs, and WhatsApp. The experimental results demonstrate that the proposed system corrects out-of-vocabulary words as well as real-word errors with a maximum recall of 0.90 and 0.67, respectively, for Dev_Hindi and 0.87 and 0.66, respectively, for Rom_Hindi. The proposed method is also applied to state-of-the-art sentiment analysis approaches, where the F1-score improves visibly.
KEYWORDS
fuzzy centrality measures, fuzzy graphs, Hindi WordNet, real-word error and non-word error, text normalization, Word2Vec
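As an illustration of the detection step the abstract describes, the following is a minimal sketch in Python, not the authors' implementation: a word absent from the per-language lexicon is flagged as a non-word error, and lexicon entries within a small Levenshtein edit distance form its suggestion list. The toy Roman-Hindi lexicon, the distance threshold, and the use of plain edit distance are assumptions made for the example; the paper builds such lists separately for each identified language (Devanagari Hindi, Roman Hindi, English) and does not commit to this exact procedure.

    # Minimal sketch (assumptions flagged above): non-word error detection
    # and suggestion-list generation against a toy per-language lexicon.
    LEXICON = {"ghar", "gharana", "makaan", "acha", "nahi"}  # toy Roman-Hindi lexicon

    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def suggestion_list(word: str, max_dist: int = 2) -> list[str]:
        """Empty list for in-vocabulary words; otherwise the lexicon words
        within max_dist edits of `word`, nearest first."""
        if word in LEXICON:
            return []  # in vocabulary: not a non-word error
        near = sorted((levenshtein(word, w), w) for w in LEXICON)
        return [w for d, w in near if d <= max_dist]

    print(suggestion_list("gar"))  # -> ['ghar']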
1 | INTRODUCTION
Social media facilitates sharing and exchanging ideas, information, thoughts, and other forms of expression via virtual networks and communities (Mayfield, 2008). Social media activities include blogging, social networks (Twitter, Facebook), messaging, product reviews (Google Play Store, Amazon, Flipkart, Myntra), service reviews (Ola, Uber, 1mg, UrbanClap, TripAdvisor), and many more. As of 2021, the number of people using social media worldwide is over 4.20 billion.¹ In India specifically, because of the widespread availability of internet access, the number of social media users grew to 448 million in 2021.² According to the latest survey by Statista, social media penetration makes India a leading country in terms of Facebook, Instagram, and WhatsApp audiences (shown in Figure 1).³ This clearly indicates that social media has become an essential part of daily internet usage in India.⁴,⁵ India is a multilingual country with 528 million Hindi and 125 million English speakers. A large number of users blend Hindi and English during informal social media communication. This blended language is known as Hindi-English code-mixed language. The huge amount of code-mixed data on social media is a rich source of information about the interests, behaviour, and concerns of users. The extracted information can be further utilized for applications such as search engines, data mining, sentiment analysis, cyber security, text-based forecasting, and so forth. But this enormous code-mixed data cannot be utilized in its original form. It comprises
Received: 10 October 2022 | Revised: 13 February 2023 | Accepted: 19 April 2023
DOI: 10.1111/exsy.13328
Expert Systems. 2024;41:e13328. wileyonlinelibrary.com/journal/exsy © 2023 John Wiley & Sons Ltd.
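To make the fuzzy-graph correction step of the abstract concrete, here is a second minimal sketch, again under the caveat that it is not the authors' implementation. Candidate corrections and in-vocabulary context words become vertices; edge weights in [0, 1] act as fuzzy membership degrees (the paper derives them from Hindi WordNet semantic relations and Word2Vec similarities, whereas the toy RELATEDNESS table below stands in for both); the candidate with the highest fuzzy degree centrality, that is, the largest sum of incident membership degrees, is chosen. The paper employs several fuzzy centrality measures; this sketch shows only the simplest one.

    # Minimal sketch of the fuzzy-graph selection step; the relatedness table
    # is toy data standing in for Hindi WordNet relations + Word2Vec similarity.
    import itertools
    import networkx as nx

    RELATEDNESS = {
        frozenset({"ghar", "makaan"}): 0.9,    # near-synonyms ('house')
        frozenset({"ghar", "gharana"}): 0.4,
        frozenset({"makaan", "gharana"}): 0.3,
    }

    def fuzzy_membership(w1: str, w2: str) -> float:
        """Edge membership degree in [0, 1]; 0 when no relation is known."""
        return RELATEDNESS.get(frozenset({w1, w2}), 0.0)

    def pick_correction(suggestions: list[str], context_words: list[str]) -> str:
        """Build the fuzzy graph over candidates and context words, then
        return the candidate with the highest fuzzy degree centrality."""
        g = nx.Graph()
        vertices = set(suggestions) | set(context_words)
        g.add_nodes_from(vertices)
        for u, v in itertools.combinations(sorted(vertices), 2):
            mu = fuzzy_membership(u, v)
            if mu > 0.0:
                g.add_edge(u, v, weight=mu)
        centrality = {s: sum(d["weight"] for _, _, d in g.edges(s, data=True))
                      for s in suggestions}
        return max(centrality, key=centrality.get)

    # The misspelling 'gar' in a context containing 'makaan' ('house'):
    print(pick_correction(["ghar", "gharana"], ["makaan"]))  # -> 'ghar'

Ranking candidates by how centrally they sit among semantically related context words, rather than by surface similarity alone, is what plausibly lets the same machinery handle real-word errors, where the erroneous token is itself a valid dictionary word.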