Multilingual Personalised Hashtag Recommendation for Low
Resource Indic Languages using Graph-based Deep Neural
Network
Shubhi Bansala(phd2001201007@iiti.ac.in), Kushaan Gowdaa(cse190001031@iiti.ac.in),
Nagendra Kumara(nagendra@iiti.ac.in)
aDepartment of Computer Science and Engineering, Indian Institute of Technology, Indore, India
Corresponding Author:
Shubhi Bansal
Department of Computer Science and Engineering, Indian Institute of Technology, Indore, India
Email: phd2001201007@iiti.ac.in
This paper is accepted in Expert Systems with Applications, 2023.
DOI: https://doi.org/10.1016/j.eswa.2023.121188
Abstract
Users from dierent cultures and backgrounds often feel comfortable expressing their thoughts
on trending topics by generating content in their regional languages. Recently, there has
been an explosion in multilingual information, and a massive amount of multilingual textual
data is added daily on the Internet. Using hashtags for multilingual low-resource content can
be an eective way to overcome language barriers because it allows content to be discovered
by a wider audience and makes it easier for people interested in the topic to nd relevant
content, regardless of the language in which it was written. To account for linguistic diversity
and universal access to information, hashtag recommendation for multilingual low-resource
content is essential. Several approaches have been put forth to recommend content-based
and personalized hashtags for multimodal content in high-resource languages. Data avail-
ability and linguistic dierences often limit the development of hashtag recommendation
methods for low-resource Indic languages. Hashtag recommendation for tweets dissemi-
nated in low-resource Indic languages has seldom been addressed. Moreover, personaliza-
tion and language usage aspects to recommend hashtags for tweets posted in low-resource
Indic languages have yet to be explored. In view of the foregoing, we propose an automated
hashtag recommendation system for tweets posted in low-resource Indic languages dubbed
as TAGALOG, capable of recommending personalized and language-specic hashtags. We
employ user-guided and language-guided attention mechanisms to distill indicative features
from low-resource tweets according to the user’s topical and linguistic preferences. We pro-
pose a graph-based neural network to mine users’ posting behavior by connecting historical
tweets of a particular user and language relatedness by linking tweets according to language
families, i.e., Indo-Aryan and Dravidian. Experimental results on the curated dataset from
Twitter demonstrate that the proposed model outperformed recognized pre-trained language
models and extant research, showing an average improvement of 12.3% and 12.8% in the
F1-score, respectively. TAGALOG recommends hashtags that align with the user’s interests
and linguistic predilections, leading to a heightened level of tailored and engaging user ex-
perience. Personalized and multilingual hashtag recommendation systems for low-resource
Indic languages can help to improve the discoverability and relevance of content in these
languages.
Keywords: Multilingual Text, Low-Resource Languages, Indic Languages, Hashtag
Recommendation, Graph Convolutional Networks
Preprint submitted to Expert Systems with Applications October 18, 2023
1. Introduction
Due to the active participation of users on Social Networking Services (SNS) like Twitter1
or Facebook2, real-time news and trends can now reach anywhere regardless of geographical
location or time dierence. Twitter users create nearly 500 million tweets daily Dusart
et al. (2023), immediately disseminating information about current events and trending
topics. Tweets are user-generated messages with a specied character limit that provide
scant and ambiguous information. More context and understanding of the subject matter
are frequently required to grasp a tweet’s message better. Hashtags are words preceded
by an octothorpe (#) symbol that clarify, decipher, and enrich tweets’ content by adding
information about the subject, sentiment, and attitude. Hashtags are an integral part of
Twitter and help categorize content so users can easily find it. Statistics indicate that tweets
with hashtags receive twice the engagement of tweets without them (Myers et al., 2023),
making them a great way to spread content.
Twitter public conversations have affected popular discourse and modern culture since,
on Twitter, information spreads across languages and countries. Regionally specific content
generates much traction. In the realm of Twitter, English emerges as the predominant
language, encompassing almost 53% of the total volume of tweets3. It is worth noting that
Twitter is experiencing a surge in popularity in various nations, particularly in regions where
languages with fewer resources are prevalent. India, for instance, constitutes the third largest
market for Twitter in terms of user base, trailing behind the United States and Japan,
boasting an impressive daily active user count of 22.1 million4. By providing support for
vernacular languages and allowing users to converse in Indic languages, Twitter has trans-
formed the spread of content and its reachability. Twitter research from 2019 shows that 51%
of Indian users tweet in English and 49% in other languages5. More and more Indian users
have now begun to tweet on trending topics in their native tongues. According to the census
of 2001 (Pandey & Jha, 2021), 1,635 rationalized mother tongues, 234 identifiable mother
tongues, and 22 major languages are spoken in India. It is possible to present semantically
related posts across various sources and languages. These posts cannot be directly matched
due to language script differences and morphology. It poses a problem when linking and
accessing tweets in multiple languages that exhibit semantic similarity and belong to the
same topics. Hashtags come to the rescue as they can be used as matching criteria for
semantically related posts across different data sources. Unfortunately, despite hashtags’ value, very
few tweets use them. The volume of tweets posted during events of widespread interest is
overwhelming, making it challenging to weed out irrelevant tweets while searching for the
1. https://twitter.com/
2. https://www.facebook.com/
3. https://semiocast.com/top-languages-on-twitter-stats/
4. https://backlinko.com/twitter-users#twitter-users
5. https://telanganatoday.com/twitter-giving-people-more-control-over-conversations-in-india
most pertinent information. Users follow international news and events on Twitter, but it
is hard to nd hashtags for topics in languages other than English. Local content creators
and brands intend to reach a broader audience on social media. Language learners intend
to nd engaging content and connect with other learners in their target language. However,
they need help nding relevant hashtags in multiple languages. Researchers studying mul-
tilingualism and language contact tend to use relevant hashtags for research but need help
seeing them in multiple languages. A tool that recommends multilingual hashtags would
save time and improve the visibility of their work, help content creators discover pertinent
content, and build connections, making finding and engaging with content from diverse
communities easier. To effectively retrieve relevant content while overcoming information
scarcity and the ambiguous nature of tweets, we frequently need to annotate hashtags to
tweets. Manual hashtag annotation takes time and money. Thus, developing automated
hashtag recommendation systems is the need of the hour as it drastically reduces the need
for human annotation while facilitating content categorization and management. Accord-
ing to the statistics on our collected dataset for tweets posted in multiple low-resource Indic
languages, up to 24.16% of the 31,07,866 tweets have fewer than two hashtags. Therefore,
creating a system to suggest hashtags for low-resource Indic tweets is a worthwhile and pressing
research topic. These factors motivate us to develop a novel polyglot model for low-resource
Indic tweets that can automatically recommend meaningful hashtags for tweets.
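As a back-of-the-envelope illustration of the statistic above, the share of tweets carrying fewer than two hashtags can be computed with a simple counter. The regex and helper names below are illustrative assumptions, not the authors’ actual preprocessing pipeline.

```python
import re

# \w matches Unicode letters and digits, so Indic-script hashtags are
# detected too (combining vowel signs may truncate a match, which is
# fine for counting purposes).
HASHTAG = re.compile(r"#\w+")

def hashtag_count(tweet_text):
    """Number of hashtags appearing in one tweet."""
    return len(HASHTAG.findall(tweet_text))

def share_with_few_hashtags(tweets, threshold=2):
    """Fraction of tweets carrying fewer than `threshold` hashtags."""
    few = sum(1 for t in tweets if hashtag_count(t) < threshold)
    return few / len(tweets)
```

Applied to a curated corpus, a ratio like the reported 24.16% falls directly out of `share_with_few_hashtags`.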
Prior works have attempted to recommend hashtags for textual (Kumar et al., 2021;
Mao et al., 2022; Chakrabarti et al., 2023), visual (Hachaj & Miazga, 2020; Park et al.,
2016; Kurunkar et al., 2022), and multimodal content (Djenouri et al., 2022; Panchal &
Prajapati, 2023; Yang & Lin, 2022; Nama & Deepak, 2023). Efforts have been made to
suggest personalized (Wei et al., 2019; Padungkiatwattana & Maneeroj, 2022) hashtags by
considering content, user, and metadata information. Despite the extensive research for
hashtag recommendation via leveraging textual content, researchers have primarily focused
on high-resource languages, namely English (Zhang et al., 2019; Wang et al., 2019) and
Chinese (Kou et al., 2018; Javari et al., 2020; Mao et al., 2022). However, recommending
hashtags for content generated in low-resource Indic languages on social media platforms is
mainly unexplored. Indic languages are considered low-resource owing to the unavailability
of many written texts, audio recordings, or other digital resources. In low-resource language
settings, the data can often be noisy or incomplete. The existing methods to recommend
hashtags for content written in high-resource languages cannot be applied directly to low-
resource languages. The reason is that the development of linguistic knowledge requires
specialized expertise or a native speaker’s proficiency in that language.
(Zhang et al., 2019) employed a parallel co-attention technique to simulate the correlation
of visual and textual information constituting the post. The authors consider the similarity
of the current post with the user’s historical posts to capture his tagging behavior and
suggest plausible hashtags for his current post. One drawback of using similarity with
historical posts is that it may not account for changes in the user’s interests or posting
habits over time, potentially leading to less relevant hashtag recommendations. (Jeong et al.,
2022) recommended hashtags based on post content and user demographic information. The
authors computed the similarity of demographic features with content features to recommend
Figure 1: Tweets of a User U
plausible hashtags. If the system relies solely on demographic data to recommend hashtags,
it may not accurately predict the user’s preferences. A user may have a unique interest in
a topic not commonly discussed by others in their demographic group. Users’ individual
preferences or behaviors that do not necessarily align with the general trends or patterns
observed in the larger population are known as idiosyncrasies. Modeling idiosyncrasies in
social media posts helps mitigate potential biases from relying solely on demographic or user
profile data. It also aids in the identification of patterns and trends that may not be visible
through demographic or user profile data alone. (Zhang et al., 2022) created a bipartite
graph comprising tweets and users to mine socially similar tweets and predict hashtags
for multilingual content. Despite this, TwHIN-BERT fails to recommend hashtags following
users’ interests and language usage style. The user who creates a post can provide important
contextual information about the post, such as the user’s interests, preferences, expertise,
language choice, and usage style.
An illustrative example from Twitter is seen in Fig. 1, where a particular user has posted
two dierent tweets yet used similar hashtags, indicating his topic of interest. In the rst
tweet, the user wishes Happy Flowers Day and annotates it with #phooldei. Phooldei is a
festival of owers and springtime celebrated in Uttarakhand. According to tweet content, he
assigned #owers, #Uttarakhand and #nature. In the second tweet, he emphasizes living
in the present through lines of a Hindi Bollywood song. According to the tweet content, he
annotates #present, #moment, and #songs to his tweet. The tweet has no relation with
owers, yet he assigns #owers and #nature to the second tweet, reecting his interest in
topics, i.e., nature and owers. Therefore, mining information from users’ posts can help to
understand their personal preferences and identify patterns in their posting behavior. This
results in a richer and more comprehensive understanding of how users engage with content
on social media platforms. Twitter users often develop their style and tone when tweeting,
which can be inuenced by their personality, background, interests, and communication
style. Some users may use a lot of slang and abbreviations, while others may use more
formal language and punctuation. Some users may use a lot of humor and sarcasm in
their tweets, while others may be more serious and straightforward. Users’ unique and
personal characteristics of language usage include vocabulary, punctuation, and emojis. It
is, therefore, essential to capture the highly idiosyncratic language patterns to comprehend
the dierences and commonalities in language use across users.
Additionally, the user from Fig. 1 recommends hashtags in the same language (Hindi).
He has also transliterated the Devanagari hashtag #फूलदेई to English #phooldei. This
emphasizes that the user tends to take language into consideration when posting tweets and
annotating hashtags. On the contrary, TwHIN-BERT doesn’t consider the user’s linguistic
preferences and also fails to capture relatedness among languages. Language relatedness
refers to the degree of similarity between different languages regarding their grammar, vocabulary, and other linguistic fea-
tures. Closely related languages share many similarities, while distantly related languages
may have fewer similarities. Modeling relatedness among languages in a language family
assists in overcoming some of the corpora limitations of low-resource languages by leverag-
ing shared knowledge and resources. This approach is particularly valuable in multilingual
settings, where users speak multiple languages within the same language family.
In this paper, we devise an automatic hashtag recommendation system for orphan
tweets posted in low-resource Indic languages, dubbed TAGALOG, that leverages tweet
content, language relatedness, and user preferences to recommend topic-relevant, personalized,
and language-focused hashtags. We refine tweet representations in line with language
usage style and user interests by employing language-guided and user-guided attention
mechanisms. We employ a graph neural network to capture relatedness among languages
of separate families (Indo-Aryan and Dravidian) and user posting behavior. The recommended
hashtags can be used to identify the main content for specific topics regardless of the
language. Our proposed system can help regional-language Twitter users to effectively retrieve
content and keep up to date with the latest information.
Below are the key highlights of our contributions.
1. We devise a deep learning-based graph neural network to suggest semantically related,
personalized, and language-specific hashtags for tweets posted in low-resource Indic
languages.
2. We not only capture the distinct topical and linguistic inclinations of individual users
on a local scale but also their long-term behavior and global interests.
3. On a local scale, we refine the content of tweets by devising a novel way of attending
to users’ topical interests and language usage style.
4. Globally, we construct a graph to model users’ interactions with tweets by considering
their historical tweets and capturing the long-term posting behavior.
5. We also leverage relatedness among languages belonging to the same language family.
The framework can mine correlation among languages of the same family group, i.e.,
Indo-Aryan and Dravidian.
6. We have constructed a new text-based hashtag recommendation dataset containing
tweets in Indic languages called Indic Hash. The collected tweet samples span various
low-resource languages: Bangla, Marathi, Gujarati, Telugu, Tamil, Kannada, and
Hindi besides English. Our curated dataset can be a primary resource to recommend
hashtags for tweets posted in Indic regional languages.
7. Our experimental ndings show that the proposed hashtag recommendation model
performs well in a low-resource environment with a minimal amount of labeled data.
The subsequent sections of the paper are arranged in the following manner. Section 2
outlines related work in hashtag suggestion while touching upon Indic languages. Section 3
formalizes the multilingual hashtag suggestion task. Section 4 focuses on our proposed
approach. Section 5 describes the experimental setup, outcomes, and analysis of the studies.
Section 6 outlines the limitations, practical implications, and potential applications of the
proffered system. The concluding remarks are mentioned in Section 7.
2. Related Work
This section provides a high-level summary of the work pertaining to the domain of
hashtag recommendation followed by low-resource Indic languages and multilingual hashtag
prediction.
2.1. Hashtag Recommendation
In this part, we rst discuss several works that recommend personalized hashtags. Fol-
lowing that, we outline Graph Convolutional Network (GCN)-based techniques for hashtag
recommendation.
2.1.1. Personalised Hashtag Recommendation
Non-personalized hashtag recommendations (Tang et al., 2019; Ma et al., 2019; Kaviani
& Rahmani, 2020; Yang et al., 2020a) are limited in their capacity to offer personalized sug-
gestions since they only account for content-based factors while neglecting user preferences.
In essence, these recommendations are generated based solely on the textual semantics of the
content, potentially leading to mismatches with user preferences. In response, personalized
hashtag recommendations have been proposed, aiming to leverage both content information
and user preferences to provide personalized recommendations.
(Zhang et al., 2019) employed a parallel co-attention technique to simulate the correla-
tion of visual and textual information constituting the post. The authors also consider the
similarity of the current post with the user’s historical posts to capture his tagging behavior
and suggest plausible hashtags for his current post. One drawback of using similarity with
historical posts is that it simply considers the content of posts without taking into consider-
ation the larger network of connections between users and posts. This can make it difficult
to capture more subtle patterns of user behavior, such as the impact of social networks, com-
munity norms, or user demographics on post engagement. To model users’ extensive posting
histories for tailored hashtag recommendation tasks, (Peng et al., 2019) put forth a unique
neural memory network that incorporates both textual material and hashtags. This model
is equipped with a gating mechanism to tackle scenarios where hashtag usage is entirely
unrelated to earlier posts. To suggest personalized hashtags, (Jeong et al., 2022) presented
an attention-based neural network that used user demographic data derived from their selfie
photographs along with textual and visual information. (Padungkiatwattana & Maneeroj,
2022) put forth a personalized hashtag recommender PAC-MAN, which integrates a multi-
tude of high-order relations to represent users and hashtags. A Multi-relational Attentive
Network (MAN) uses a GNN to record relationships between hashtags and users, users
and users, and hashtags and hashtags. PAC-MAN is a Person-And-Content-Based BERT
(PAC) that blends MAN user representation with content customization at the word level.
Finally, the authors execute a hashtag prediction task with MAN hashtag representations
incorporated into BERT to model sequenceless hashtag correlations.
2.1.2. GCN-based Hashtag Recommendation
GCN (Kipf & Welling, 2016) was initially introduced as a method to address the chal-
lenges associated with semi-supervised learning. (Wei et al., 2019) employed GCN strategies,
such as information diusion and attentiveness, to acquire micro-video and hashtag repre-
sentations that reect user choices. The resulting user-specic representations enable the
calculation of the similarity score of hashtags with respect to micro-videos facilitating more
eective hashtag recommendations. (Mehta et al., 2021) co-learned latent embeddings of
features gleaned from extended videos and semantic embeddings of prominent hashtags
on social media platforms. The authors adopt GCN to anticipate relationships between
videos and hashtags in a heterogeneous graph and recommend popular hashtags for videos.
To recommend micro-video hashtags, (Li et al., 2019) introduced a multi-view representation
interactive embedding model that uses graph-based information propagation. The model in-
tegrates hashtag associations, multiview learning, and video-user-hashtag interaction, with
a graph directing the spread of information among hashtags. This method establishes a
consistent pattern of relatedness between hashtags, which considerably improves the effectiveness
of hashtag recommendations for both popular and long-tail hashtags. (Chen et al.,
2021) created an image similarity graph to illustrate the relationship between posts assuming
visually comparable images use similar hashtags. The Triplet Attention module captures
the inuence of visuals, captions, and users to derive node features. Aggregated Graph Con-
volution component learns the attended features and spreads information among vertices to
suggest suitable hashtags.
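The GCN machinery common to the works above follows the layer-wise propagation rule of (Kipf & Welling, 2016). The NumPy sketch below uses illustrative shapes and a ReLU activation; it is a minimal single layer, not any specific recommender’s architecture.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W).

    A: (n, n) adjacency matrix of the graph,
    H: (n, d_in) node feature matrix,
    W: (d_in, d_out) learnable weight matrix.
    """
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    # Symmetric normalization: D^{-1/2} A_hat D^{-1/2}
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)  # ReLU activation
```

Stacking such layers lets each node (e.g., a micro-video or hashtag) aggregate features from progressively larger neighborhoods before similarity scoring.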
2.2. Low-Resource Languages and Multilingual Hashtag Prediction
The task of suggesting hashtags for textual content can be posed as one of the tradi-
tional problems in Natural Language Processing (NLP), i.e., text categorisation (Li et al.,
2022; Dogra et al., 2022; Li et al., 2023; Lei et al., 2020). As far as we are aware, although
many works have been carried out for classifying text in low-resource Indic languages (Pathak
& Jain, 2022; Sanghvi et al., 2023; Rehman et al., 2023), there is only one work that predicts
hashtags for multilingual content (Zhang et al., 2022).
Low-resource languages (LRLs), also known as “less studied, under-resourced, low den-
sity” languages are languages with limited linguistic resources, such as textual material,
language processing tools, grammar and speech databases, dictionaries, and human com-
petence (Besacier et al., 2014). These languages are frequently spoken by small groups,
lack standardized writing systems, and have a scarce digital presence. Researchers in NLP
distinguish LRLs based on the availability of data and NLP tools. LRLs have a relatively
small amount of data, i.e., text corpora and parallel corpora, and lack language-specific tools
such as spell checkers and grammar checkers, as well as manually crafted linguistic resources for
training NLP models. There are a number of advantages to working with low-resource
languages: the potential to impact the lives of people who speak these languages, the
opportunity to develop new NLP techniques that can be applied to other languages, and
the challenge of working with limited data. Linguists, researchers, and organizations are
making efforts to document languages, construct corpora, develop technology and tools,
and run community-driven language revival campaigns for LRLs, since LRLs offer substantial
benefits, some of which are listed below.
Social Inclusion: Strengthening LRLs promotes inclusion and gives underrepresented
communities a voice online. They can use it to interact with technology, participate
in online debates, and get information in their language.
Enhanced Cross-Cultural Understanding: Supporting and researching LRLs stimu-
lates collaboration across diverse linguistic communities and improves cross-cultural
understanding. It helps to bridge barriers and promote mutual tolerance and appre-
ciation for other cultures and languages.
Enhanced Communication: Supporting low-resource Indic languages enables effective
communication and understanding within linguistic communities. It strengthens inter-
generational bonds, fosters social cohesion, and promotes local participation in various
social, cultural, and economic activities.
Economic Opportunities: Developing language technologies, content, and services for
low-resource Indic languages might lead to the emergence of industries, such as lo-
calization services, translation, interpretation, content creation, and digital platforms
aimed at specic linguistic communities.
Due to small corpora and unseen scripts, labeled data for diverse Indic languages is sparse
or nonexistent in real applications compared to high-resource languages like English and Chi-
nese. To get beyond corpus restrictions inherent in low-resource languages, (Khemchandani
et al., 2021) proposed RelateLM to effectively customize language models for low-resource
languages. Since numerous Indic scripts descended from Brahmi script, the authors take
advantage of script relatedness through transliteration. RelateLM artificially translates rel-
atively well-known language content into low-resource language corpora using comparable
sentence structures to get around corpus limitations. (Aggarwal et al., 2021) performed
zero-shot text classication for Indic languages by leveraging lexical similarity. To this end,
the authors performed script conversion to Devanagari and divided words into sub-words to
optimize the vocabulary overlap among the related Indic languages datasets. (Khatri et al.,
2021) investigated the inuence of sharing encoder-decoder parameters between related lan-
guages in Multilingual Neural Machine Translation. They developed a system trained from
the languages by grouping them based on language family i.e., Indo-Aryan (group) to En-
glish and Dravidian (group) to English. Then, the authors convert the entire language data
to the same script, which helps the model learn better translation by utilizing shared vo-
cabulary. This approach obscures the underlying structural similarities between languages.
Language families are typically dened based on shared ancestry and historical relationships
between languages. Transliteration-based methods may not accurately capture these rela-
tionships between languages, as they focus primarily on the surface features of languages,
which can lead to inaccurate results for downstream tasks. (Marreddy et al., 2022) put for-
ward a supervised graph-reconstruction approach called Multi-Task Text GCN. This method
utilizes a Graph AutoEncoder (GAE) (Schlichtkrull et al., 2018) to learn the latent word and
sentence embeddings from a graph, which are employed to carry out Telugu text categorization
for various downstream tasks.
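The script-relatedness trick discussed above (mapping related Indic scripts onto a common script such as Devanagari) can be approximated by exploiting the largely parallel layout of the Indic Unicode blocks, which are each 128 code points wide. This character-offset sketch is an illustration only; real systems like RelateLM handle the script-specific exceptions this ignores.

```python
def to_devanagari(text):
    """Map letters from Bengali through Malayalam blocks onto Devanagari
    by Unicode-block offset. Because the Indic blocks inherit a common
    ISCII-derived layout, the same in-block offset usually yields the
    corresponding Devanagari letter; exceptional code points are ignored.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x0980 <= cp <= 0x0D7F:  # Bengali .. Malayalam blocks
            block_start = cp - ((cp - 0x0900) % 0x80)
            out.append(chr(0x0900 + (cp - block_start)))
        else:
            out.append(ch)
    return "".join(out)
```

For example, Telugu క (U+0C15) and Bengali ক (U+0995) both land on Devanagari क (U+0915), letting a shared vocabulary emerge across related scripts.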
(Zhang et al., 2022) proposed a Twitter Heterogeneous Information Network (TwHIN-
BERT) to anticipate hashtags for multilingual content. The authors employ Approximate
Nearest Neighbor (ANN) search to identify pairs of socially appealing tweets. This method
falls short of capturing the user’s language and topical choices. Furthermore, it does not
take linguistic relatedness within language groups into account to address the low-resource
nature of numerous languages featured in the dataset.
Therefore, a quick assessment reveals that research has primarily focused on text-only,
image-only, or multimodal information posted in high-resource languages, i.e., English and
Chinese. These studies do not consider recommending hashtags for content posted in low-
resource languages. To tackle this issue, we propose a novel polyglot paradigm i.e., TAGA-
LOG, which extracts the content-based, user-based, and language-based features to recom-
mend personalized and language-specific hashtags for content created in low-resource Indic
languages.
3. Problem Denition
Let us consider a dataset with a tweet set $T = \{t_i\}_{i=1}^{|T|}$, a set of users
$U = \{u_j\}_{j=1}^{|U|}$, a set of hashtags $H = \{h_k\}_{k=1}^{|H|}$, and a set of languages
$L = \{IA(\text{Hindi, Gujarati, Marathi, Bangla}), D(\text{Kannada, Tamil, Telugu}), \text{English}\}$.
Here, $|T|$, $|U|$, and $|H|$ denote the cardinality of the tweet set, user set, and hashtag set.
$IA$ and $D$ refer to the Indo-Aryan and Dravidian family groups.
Given a user $u \in U$ who uploads a tweet $t$ written in language $l \in L$, we aim to recommend
a personalized and language-specific set of hashtags $RH \subseteq H$ that are relevant to the user’s
posting and language usage behavior.
Our objective is to develop a customized hashtag recommendation model for tweets in low-
resource Indic languages that can automatically recommend hashtags from $H$ for a new tweet
$t$ uploaded by a user $u$.
Given a tweet written in $l$ by a user $u$, we intend to learn a function $f(\cdot)$ that can capture
the user’s topical and linguistic preferences:
$$t_u, t_l = f(\mathrm{UGA}(t, u), \mathrm{LGA}(t, l)) \tag{1}$$
Here, UGA refers to the user-guided attention and LGA to the language-guided attention
mechanism; they yield latent user and language representations denoted by $t_u$ and
$t_l$. Hashtags are a potent tool for self-expression because they allow users to succinctly and
rapidly communicate their interests, thoughts, feelings, and views on a certain topic. To
address the variances in hashtag labels that result from how individuals express themselves
and their unique language usage style, we devise two attention mechanisms to fine-tune user
and language representations. To further enhance tweet representation, we aim to learn a
function g(.)to model various types of interactions.
t'_u, t' = g(t_u, t)    (2)

Here, t'_u and t' denote the enhanced user and tweet representations derived from the graph, and g(.) resembles a graph neural network. We employ a graph neural network to model tweet-tweet interactions based on language relatedness and user-tweet interactions. We construct a heterogeneous graph G = (V, E) such that V = (U, T), where V is the set of nodes comprising users and tweets, and E is the set of edges. Each edge e ∈ E is based on either
the relatedness of the language in which the tweet is written with tweets published in other
languages within the same language group or whether the user created that tweet in the
past. Hashtag recommendations can then be formulated as given in Equation 3.
RH = HASH_REC(t'_u, t_l)    (3)

Here, HASH_REC refers to the hashtag recommender, which resembles a deep neural network. It takes the enhanced tweet representation derived from the graph, denoted by t'_u, and the language-guided tweet representation, i.e., t_l, to recommend a reasonable collection of hashtags denoted by RH. We posit that TAGALOG encodes not only the user's topical and linguistic preferences but also relatedness among languages of a family group pertaining to the language in which a tweet is written. The following sections provide more information on UGA, LGA, f(.), g(.), and HASH_REC.
4. Methodology
Figure 2: Overall Architecture of TAGALOG
In this section, we present a detailed overview of our proposed approach. Fig. 2 show-
cases the overview of our innovative polyglot hashtag recommender. We propose a deep
neural network based on graphs to recommend hashtags for tweets posted in multiple Indic
languages. Our system receives a tweet as input, together with information on the language
used in the tweet and the user who posted it. The proposed system first retrieves features from a tweet's textual modality to obtain its low-dimensional feature vector representation. Then we use attention techniques to model how the language and the user affect the representation
of a tweet. We create a graph to capture the correlation between tweets and the interaction
between tweets and users. The node embeddings, which are modified in response to information dissemination and neighborhood aggregation, are fed into the hashtag recommendation
module. After assessing the plausibility of each hashtag, this module yields a sorted list of
hashtags for polyglot tweets. As demonstrated in Fig. 2, our proposed framework comprises
four components: (a) feature extraction; (b) feature refinement; (c) feature interaction; and (d) hashtag recommendation. Each component is discussed in depth below.
4.1. Feature Extraction
In this section, we elucidate the textual, linguistic, and user feature retrieval from tweets.
Textual Feature Retrieval. We encode tweets written in various resource-scarce Indic lan-
guages using Multilingual Bidirectional Encoder Representations from Transformers (Pires
et al., 2019), abbreviated as the mBERT model. Wikipedia articles written in 104 different languages serve as the training data for the multilingual variant of BERT. Since mBERT shares a common input space at the sub-word level, this pre-trained neural language model is utilized to generate context-aware embeddings of tweets posted in different languages.
The input tweet tis enclosed within two special tokens, class (CLS) and separator (SEP)
to signal its start and endpoints. We pass the raw tweet through mBERT’s tokenizer to
produce the corresponding set of tokens as shown in Equation 4.
M = mBERT_Tokenizer([CLS] + t + [SEP])    (4)

Here, M represents the created collection of tokens. The number of tokens in the sequence
denoted by S is capped at 50. We shorten or lengthen the token sequence derived from the tweet to S if it is longer or shorter than S to construct a uniform-sized token sequence for all
tweets. Then, we encode tokens using an mBERT encoder to generate token representations
according to Equation 5.
T_f = mBERT(M)    (5)

The derived textual feature matrix is denoted by T_f ∈ R^{S×D}, where S = 50 denotes the number of tokens derived from the tweet, and D = 768 denotes the embedding size of every token. The textual feature matrix of the encoded tweet is passed to the feature refinement module.
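The fixed-length tokenization step above (capping every sequence at S = 50) can be sketched as follows; the function name and PAD_ID are illustrative placeholders, not the authors' implementation, and a real pipeline would use the tokenizer's own padding id.

```python
# Illustrative sketch: force every token-id sequence to a fixed length S,
# truncating longer sequences and padding shorter ones.
S = 50
PAD_ID = 0  # placeholder for the tokenizer's actual padding token id

def pad_or_truncate(token_ids, length=S, pad_id=PAD_ID):
    """Return a copy of token_ids with exactly `length` entries."""
    if len(token_ids) >= length:
        return token_ids[:length]  # shorten long tweets
    return token_ids + [pad_id] * (length - len(token_ids))  # pad short ones
```

For example, `pad_or_truncate([101, 7592, 102])` yields a 50-element list whose first three entries are the original ids.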
Language Feature Retrieval. Social media language is often informal, abbreviated, and contains hashtags, emojis, and other elements that are specific to these platforms. By learning
language embeddings from a large corpus of social media data, we can better capture these
unique linguistic characteristics and represent them in a way that captures their mean-
ing. Language embeddings are vector representations of words or phrases that are learned
through training on large amounts of text data. Language feature retrieval consists of two steps, namely language identification and language embedding generation.
Language Identication We used the langdetect6library to identify the language in
which tweet tis published. About 50 languages can be recognized by this package, which
is a direct transfer of Google’s language-detection library from Java to Python. Nakatani
Shuyo created the software at Cybozu Laboratories, Inc. We determine the language used
to write the tweet tas depicted in Equation 6.
l = langdetect(t)    (6)

Here, l is the language identified for tweet t.
Language Embedding Generation. Language embeddings are used for tweet representation because they enable us to capture the meaning and context of words used in
tweets. They capture the semantic and syntactic relationships between words, which allows
us to understand the meaning of individual words and the overall context. Using language
embeddings to represent tweets allows us to capture the nuances of language used on social
media platforms. After identifying the language in which the tweet was written, we generate
the feature vector for the language using the Keras embedding layer7, as shown in the equation below.

l_f = Embedding(l)    (7)

Here, l_f ∈ R^D refers to a feature vector representing the language, with a dimensionality (D) of 768.
User Feature Retrieval. User embeddings can be useful in deriving post features because
they capture information about the users who created the posts. In many cases, the user
who creates a post can provide important contextual information about the post, such
as the user’s interests, preferences, or expertise. By incorporating this information into
post features, models can improve their ability to understand and analyze posts. This
can help the model make personalized recommendations that are more relevant to the user’s
interests. The publisher of the tweet t is expressed as u. We encode u into a low-dimensional embedding vector (u_f) by employing the Keras embedding layer, as demonstrated in the following equation.

u_f = Embedding(u)    (8)

Here, u_f ∈ R^D refers to a feature vector representing the user, with a dimensionality of 768. Users' hidden features, such as preferences, may theoretically be captured by user embeddings and used to direct how the tweet representation is learned.
6https://pypi.org/project/langdetect/
7https://keras.io/api/layers/core_layers/embedding/
4.2. Feature Renement
The cornerstones of the feature renement module comprising our proposed model are
language-guided and user-guided attention mechanisms that successfully capture the topical
and linguistic inclinations of individual users at a local level to enrich the tweet representa-
tion. We discuss these two mechanisms below.
4.2.1. Language-guided Attention Mechanism
We devise a novel language-specic attention block that selectively attends to language-
oriented information in the tweet and lters out unnecessary information thus, enriching its
representation. For the tweet embedding obtained using the mBERT encoder, we denote
it as Tf={es}S
s=1. We use an attention technique to identify key terms, then aggregate
the acquired word representations to create a comprehensive representation of the tweet’s
textual content with respect to the linguistic preferences of the user. To this end, we feed the
token-based embedding matrix Tfthrough a dense layer to create its hidden representation,
as illustrated in the equation below.
hl=tanh(TfWl+bl)(9)
Here, hl={hl
s}S
s=1, where hl
sis the hidden representation of es. We then determine how
closely the token’s latent representation (hl
s)resembles the language embedding vector (lf)
and run the outcome through a softmax algorithm to generate attention scores (αs) using
the formula presented in Equation 10.
α=softmax(hllf)(10)
Here, α={αs}S
s=1, where αsdesignates a word’s signicance with respect to language. The
language-guided tweet representation is then derived by computing the weighted sum of
token embeddings with attention scores αsserving as weights as presented below.
tl=
S
s=1
αshl
s(11)
Here, tlrepresents the language-guided tweet representation.
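Equations 9-11 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the weight matrix W_l and bias b_l are passed in as arguments here, whereas in the actual model they are learned parameters of a dense layer.

```python
import numpy as np

def language_guided_attention(T_f, W_l, b_l, l_f):
    """Sketch of Equations 9-11. T_f is the (S x D) token matrix and
    l_f the D-dimensional language embedding; returns the language-guided
    tweet representation t_l and the attention weights alpha."""
    h = np.tanh(T_f @ W_l + b_l)           # Eq. 9: hidden token states (S x D)
    scores = h @ l_f                       # similarity of each token to l_f
    alpha = np.exp(scores - scores.max())  # numerically stable softmax (Eq. 10)
    alpha = alpha / alpha.sum()
    t_l = (alpha[:, None] * h).sum(axis=0)  # Eq. 11: attention-weighted sum
    return t_l, alpha
```

The user-guided attention of Section 4.2.2 follows the same pattern with u_f in place of l_f.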
4.2.2. User-guided Attention Mechanism
Users tend to express their interest in the semantic attributes of a tweet’s text. Thus,
exploring users’ attention to words appearing in tweets towards recommending hashtags is
crucial. By using user-guided attention, the model can capture the user’s unique perspec-
tives, which can provide additional context and improve the accuracy of post features. We
utilize a user-guided attention mechanism for identifying salient words and combining their
corresponding representations to obtain a comprehensive representation of the tweet's textual content with respect to the user. To achieve this, we first process the mBERT-based token embedding matrix (T_f) using a MultiLayer Perceptron (MLP) to derive h^u, as illustrated in the subsequent equation.

h^u = tanh(T_f W_u + b_u)    (12)

Here, h^u = {h^u_s}_{s=1}^S, where h^u_s is the hidden representation of e_s. We first calculate how similar h^u_s and u_f are, then run the result through a softmax function to produce the normalized weights β_s, as demonstrated below.

β = softmax(h^u · u_f)    (13)

Here, β = {β_s}_{s=1}^S, where β_s signifies the relevance of a term with respect to the user. The user-guided tweet representation is determined by computing the weighted sum of the hidden token representations, with β_s serving as weights, as shown below.

t_u = Σ_{s=1}^S β_s h^u_s    (14)

Here, t_u denotes the user-guided tweet representation. The obtained representations are forwarded to the feature interaction component.
4.3. Feature Interaction
The feature interaction module employs a graph neural network to capture global inter-
ests by analyzing long-term user behavior and preferences, in addition to tweet correlation.
It comprises two major stages, namely graph construction and feature encoding. We discuss
these two stages in detail below.
4.3.1. Graph Construction
To mine the correlation between tweets and the interaction between tweets and users,
we create an undirected heterogeneous graph as illustrated in Algorithm 1. Here, G = (V, E) is the resultant user-tweet graph, and V and E denote the collection of vertices and the edges between them, respectively. We construct a graph with two different kinds of nodes, as shown in Line 1 of Algorithm 1. The total number of nodes in the graph is I, where I = |T| + |U|, and E ⊆ V × V is the set of relationships among nodes that model
tweet-tweet correlations and user-tweet interactions. The edges constructed based on tweet-
tweet correlations are weighted, whereas those corresponding to user-tweet interactions are
unweighted. First, we compute the pairwise similarity between tweets appearing in the
tweet set T, as depicted in Line 4. We then assign an edge between tweets of related
language families corresponding to the language in which the tweet under consideration
is written, as shown in Lines 5-8, corresponding to the Indo-Aryan and Dravidian family
groups. The tweets not falling under these two groups imply they are written in English,
as shown in Lines 9-10. The edge weight is the similarity score between mBERT-based
embeddings of a tweet with tweets written in related languages comprising the language
group. Grouping posts concerning their language family, like Indo-Aryan and Dravidian,
can help in recommendations by personalizing content and recommendations based on the
user’s linguistic and cultural background. Language families are a collection of languages
that share the same ancestor. Languages in the same family often share similar grammatical
structures, vocabulary, and cultural contexts. By grouping posts based on a language family,
we identify posts that are likely to be relevant and exciting to users with a particular
linguistic background. For example, suppose a user writes tweets in a language from the
Algorithm 1 Graph Construction
Input: T: Tweets
       U: Users
Output: G(V, E): User-Tweet Graph
function get_graph(T, U)
1:  V = T ∪ U
2:  E = []
3:  for all (t1, t2) ∈ T × T do
4:      sim_score = cos_sim(t1, t2)
5:      if langdetect(t1) & langdetect(t2) ∈ [bn, hi, mr, gu] then
6:          E = E ∪ (t1, t2, sim_score)
7:      else if langdetect(t1) & langdetect(t2) ∈ [kn, te, ta] then
8:          E = E ∪ (t1, t2, sim_score)
9:      else if langdetect(t1) & langdetect(t2) ∈ [en] then
10:         E = E ∪ (t1, t2, sim_score)
11:     end if
12: end for
13: for all t ∈ T do
14:     u = get_user(t)
15:     E = E ∪ (t, u, 1)
16: end for
17: G = (V, E)
18: return G
Indo-Aryan family. In that case, we can group posts that are written in languages from this
family, such as Bangla (Bn), Hindi (Hi), Marathi (Mr), and Gujarati (Gu), and recommend
hashtags to the user. Similarly, suppose a user uses a language from the Dravidian family. In
that case, we can group posts that are written in languages from this family, such as Kannada
(Kn), Telugu (Te), and Tamil (Ta), and recommend them to the user. By personalizing
recommendations in this way, we can increase the relevance and engagement of content for
users. Furthermore, as depicted in Lines 13-16, for every tweet, we retrieve its corresponding
user. We then create an edge to connect the user to his uploaded tweets. By capturing the
user-tweet relationship through edge creation, tweet representations can be enriched with
the contextual information of the associated user, such as the user’s topical interests and
historical posting patterns. Incorporating the user context allows for more contextualized
and personalized tweet representations. It considers the relationship between the user and
his tweets, allowing for a more nuanced understanding of their behavior and motivations.
Unlike similarity-based analysis (Zhang et al., 2019), which overlooks the unique context and significance of individual posts, treating them as isolated entities, the edge-based approach
explicitly models the relationship between a user and his tweets within the graph structure,
thus enabling a comprehensive analysis of interdependencies and interactions between users
and their tweeted content. The edge connecting a user to their tweets indicates the range
and diversity of their topical interests. We utilize this edge information to identify patterns
and recommend accurate hashtags.
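A minimal Python sketch of Algorithm 1 is given below. It assumes language labels are precomputed per tweet (standing in for calling langdetect inside the loop) and that `cos_sim` is a caller-supplied function computing cosine similarity over mBERT embeddings; both are placeholders, not the authors' code.

```python
from itertools import combinations

# Language family groups used for tweet-tweet edges (ISO codes as in Algorithm 1).
INDO_ARYAN = {"bn", "hi", "mr", "gu"}
DRAVIDIAN = {"kn", "te", "ta"}

def same_family(l1, l2):
    """True when both languages belong to the same family group (or both are English)."""
    return ({l1, l2} <= INDO_ARYAN or {l1, l2} <= DRAVIDIAN
            or l1 == l2 == "en")

def build_graph(tweets, langs, users, cos_sim):
    """tweets: list of tweet ids; langs/users: dicts keyed by tweet id;
    cos_sim: callable returning a tweet-tweet similarity score."""
    V = set(tweets) | set(users.values())       # Line 1: tweet and user nodes
    E = []
    for t1, t2 in combinations(tweets, 2):      # Lines 3-12: tweet-tweet edges
        if same_family(langs[t1], langs[t2]):
            E.append((t1, t2, cos_sim(t1, t2)))  # weighted by similarity
    for t in tweets:                            # Lines 13-16: user-tweet edges
        E.append((t, users[t], 1))              # unweighted (weight fixed to 1)
    return V, E
```

Note that tweets from different family groups receive no edge at all, which is what restricts information flow to related languages during graph encoding.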
Figure 3: Graph AutoEncoder
4.3.2. Graph Feature Encoding
Our primary goal is to create and train a model to learn tweet and user embeddings
given an input graph G in order to perform hashtag recommendations. A Graph AutoEncoder (GAE) is a type of unsupervised learning model used for graph representation learning. GAE can capture
complex, non-linear relationships between nodes in a graph, which cannot be easily cap-
tured by traditional graph embedding techniques such as DeepWalk (Perozzi et al., 2014)
or node2vec (Grover & Leskovec, 2016). GAE preserves the structural properties of nodes
even when the data is noisy. GAE can be used for hashtag recommendation, where the
input data consists of both user-tweet interaction data and tweet features represented as
a graph. This allows for a more comprehensive recommendation system that takes into
account both user behavior and tweet attributes. The proposed GAE pipeline is shown in
Fig. 3. Let G = (V, E) represent a graph with N nodes, and let A be its adjacency matrix. Let F be the feature matrix with N rows, where each row represents the feature vector of a vertex. The goal of GAE is to acquire a reduced-dimensional latent representation Z that encompasses the structural and semantic information of the graph. The adjacency and feature matrices, when combined (AF), form the encoder's input. Graph Sample and Aggregate
(GraphSAGE) (Hamilton et al., 2017) can be used as the encoder in the GAE by adapt-
ing it to aggregate information from the entire graph. GraphSAGE is a neural network
that is designed to learn node embeddings by compiling information from its immediate surroundings. The input to the GraphSAGE encoder is F_v, the feature vector with which node v is initialized, and N(v), the set of neighboring nodes of node v in the graph.
The tweet node is initialized by employing word level attention (Yang et al., 2016) over
the textual feature matrix of tweet tas discussed in Section 4.1 since tweets contain noisy
user-generated text. User nodes are initialized with a feature vector obtained as depicted
in Section 4.2.2. Generally, h^k_v is the embedding vector of node v at the kth layer of the GraphSAGE encoder, and N_L is the number of layers in the encoder. We adopt the mean aggregator in GraphSAGE, as evident in Equation 15.

h^k_v = GraphSAGE_mean(h^{k-1}_v, A),  ∀k ∈ [1, N_L]    (15)
The updated feature matrix Z is obtained from the last layer, as shown in Equation 16.

Z = h^{N_L}_v    (16)

Here, Z consists of the updated user representation (t'_u) and text feature (t'). The decoder maps this latent representation back to the original graph structure. It consists of a sigmoid activation function, as shown in Equation 17.

Â = sigmoid(Z · Z^T)    (17)

Here, Â is the reconstructed adjacency matrix.
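The two numeric operations above can be sketched as follows. This is a simplified illustration: a full GraphSAGE layer as in Equation 15 also applies a learned weight matrix and nonlinearity, whereas the function below shows only the neighborhood-mean step; the inner-product decoder matches Equation 17.

```python
import numpy as np

def mean_aggregate(F, A):
    """Neighborhood-mean step of a GraphSAGE-style layer (cf. Eq. 15):
    each node's new feature is the mean of its neighbors' features.
    F: (N x D) feature matrix, A: (N x N) adjacency matrix."""
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1  # avoid division by zero for isolated nodes
    return (A @ F) / deg

def decode(Z):
    """Inner-product decoder of Eq. 17: A_hat = sigmoid(Z Z^T)."""
    return 1.0 / (1.0 + np.exp(-(Z @ Z.T)))
```

Stacking `mean_aggregate` N_L times (with learned transforms in between) yields Z, and `decode(Z)` produces the reconstructed adjacency matrix used by the reconstruction loss.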
4.4. Hashtag Recommendation
By considering both the user and the language used in a tweet, we can better capture
the user’s intent, perspective, language usage style, and the meaning of the words they use.
To this end, we derive the overall tweet representation by concatenating the updated tweet
embedding obtained from GAE and language-guided tweet representation as shown below.
t_f = concat(t'_u, t_l)    (18)

Here, t_f is the overall tweet representation. The hashtag recommendation module receives t_f as input and outputs a reasonable set of hashtags RH, as given in Equation 19.

RH = HASH_REC(t_f)    (19)
The hashtag recommendation task is structured as a multilabel classification problem. Given that a tweet can belong to numerous classes simultaneously, this formulation can assist in forecasting labels for non-exclusive classes. A pool of preconfigured hashtags H is employed to assign suitable hashtags to the multilingual tweet, as exhibited in Equation 20.

y_pred = softmax(Dense(units = |H|)(t_f))    (20)

Here, y_pred ∈ R^{|H|} refers to the softmax probabilities of the supplied hashtags, and |H| is the cardinality of the set of hashtags. These probabilities are used to rank hashtags and generate the final set of predicted hashtags (RH).

RH = argsort(y_pred)    (21)
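The ranking step of Equations 20-21 can be sketched as follows; the function name is illustrative, and the dense layer producing the logits is omitted. The default k = 8 follows the mean number of hashtags per tweet reported for the dataset, but k is configurable.

```python
import numpy as np

def recommend_hashtags(logits, hashtags, k=8):
    """Sketch of Eqs. 20-21: softmax over the hashtag vocabulary,
    then return the k highest-probability hashtags."""
    z = np.exp(logits - logits.max())
    y_pred = z / z.sum()                  # softmax probabilities (Eq. 20)
    top_k = np.argsort(y_pred)[::-1][:k]  # indices sorted by probability (Eq. 21)
    return [hashtags[i] for i in top_k]
```

Since softmax is monotonic, sorting by y_pred and sorting by the raw logits give the same ranking; the probabilities are still useful for thresholding or calibration.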
The objective loss function for training TAGALOG can be seen in Equation 22.

L = L_GAE + L_HR    (22)

Here, L is the overall loss function, L_GAE is the reconstruction loss of GAE, and L_HR is the loss function for the hashtag recommendation module. The loss function L_GAE is described in Equation 23.

L_GAE = ||A − Â||²    (23)
Here, A and Â represent the actual and reconstructed adjacency matrices, and ||·||² denotes the squared norm. The objective of L_GAE is to reduce the difference between the predicted and actual adjacency matrices across the entire training dataset, with the purpose of achieving better reconstruction accuracy. The optimization problem is solved by minimizing L_GAE with respect to the parameters of the encoder and decoder (θ_e and θ_d) using a gradient-based optimization algorithm. Through this process, GAE learns a compressed representation of
the input graph. The training loss function for the hashtag recommendation module is
described in Equation 24.
L_HR = −(1/|M|) Σ_{(t,G)∈M} Σ_{g∈G} log(P(g|t))    (24)

Here, the current tweet is represented by t, the related ground-truth hashtag set is indicated by G, the softmax probability that the ground-truth hashtag g will be used for the tweet t is given by P(g|t), and the variable M represents the training set of multilingual tweets.
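A toy computation of the objective in Equations 22-24 is sketched below. The representation is deliberately simplified (probabilities stored as plain dicts, one loop per tweet); the function names are illustrative and not the authors' code.

```python
import numpy as np

def gae_loss(A, A_hat):
    """Squared-norm reconstruction loss of Eq. 23."""
    return float(((A - A_hat) ** 2).sum())

def hashtag_loss(prob_rows, gt_sets):
    """Negative log-likelihood of Eq. 24. prob_rows[i] maps hashtags to
    softmax probabilities for tweet i; gt_sets[i] is its ground-truth set."""
    total = 0.0
    for probs, gt in zip(prob_rows, gt_sets):
        for g in gt:                      # sum over ground-truth hashtags
            total -= np.log(probs[g])     # -log P(g|t)
    return total / len(prob_rows)         # average over the training set

def total_loss(A, A_hat, prob_rows, gt_sets):
    """Overall objective of Eq. 22: L = L_GAE + L_HR."""
    return gae_loss(A, A_hat) + hashtag_loss(prob_rows, gt_sets)
```

Both terms decrease together during training: the reconstruction term shapes the graph embeddings while the likelihood term pushes probability mass onto the ground-truth hashtags.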
5. Experimental Evaluations
In the ensuing subsections, we go over the experimental settings, followed by the experimental findings that validate the viability of our proposed framework.
5.1. Experimental Setup
Here, we present our curated dataset on which experiments were performed. Next, we go
into state-of-the-art approaches and existing models for comparison, followed by the criteria
employed for evaluation.
5.1.1. Dataset
To the best of our knowledge, we have curated the first large-scale multilingual low-resource Indic tweets dataset, dubbed IndicHash. This dataset is designed for the task of recommending hashtags for tweets posted in multiple low-resource Indic languages. We create an exhaustive dataset from tweets published by Indian users covering seven low-resource languages besides English. Regional language tweets have increased significantly on Twitter. This served as our inspiration to broaden the endeavor to Indic languages. We chose a total of seven different Indic languages, namely Bangla, Hindi, Kannada, Gujarati, Tamil, Telugu, and Marathi. This decision was primarily motivated by the widespread usage of these Indic languages across various regions of India. We now elucidate the techniques used to gather and process the independent tweets, followed by a description of the dataset's specifications.
Data Collection. We gathered nearly equal numbers of posts for each keyword and a similar
amount of keywords for each category. We rst curated a generic list of categories namely
technology, business, education, environment, gadgets, sports, festivals, people’s movement,
politics, cricket, entertainment, movies, music, news, culture, food, military, career, fash-
ion, tness, gaming, nature, weather, emotions, pets, hobbies, astrology, and crisis. The
total number of keywords considered for data collection is 213. For example, keywords
18
under the education category: education, ed-tech, ParikshaPeCharcha, teacher, learning,
school, university, neweducationpoilcy, students, and exams. Likewise, under the category
of people’s movements which is a hot topic on Twitter, we included keywords such as Stu-
dentLivesMatter, ShaheenBagh, FarmersProtest, KisaanAndolan, metoo, BlackLivesMatter,
pride, feminism, NeverAgain, and EnoughIsEnough. We used Scraper for Social Networking
Services (SNS) abbreviated as snscrape8to download tweets. We scraped attributes like
user IDs, and hashtags, and retrieve the relevant tweets using keywords as a search query.
We gathered user tweet data in a variety of languages since people use hashtags regardless
of their language of origin. The dataset collection comprises a total of 31,07,866 tweets, and
9,17,833 hashtags posted by 4,78,120 users for a total of 8 languages. The average numbers of tweets per keyword and tweets per user in the collected dataset amount to 14,591 and 7, respectively, whereas the average number of hashtags per tweet is 5.
Data Pre-processing. The subsequent measures were adopted to ensure a high-quality input
for our model. We removed tweets that contain less than three words. The acquired data
was noisy due to Twitter’s quick and erratic nature. The data was sanitized by deleting
duplicate posts with null values. The pre-processed data underwent several modications,
including the removal of links, conversion of text to lowercase, and exclusion of all non-
alphanumeric characters except space and full stop. Hashtags were also collected from these
pre-processed posts. Post information such as the content of the original post, hashtags
used, and the user id of the user who created that tweet was extracted. To balance the
dataset, we randomly sampled an equal number of tweets from each language. The final dataset collection comprises a total of 81,944 tweets, 17,660 users, and 37,151 hashtags.
Table 1 provides a summary of the dataset’s statistics.
Characteristic                     Original     Pre-processed    Final
No. of tweets                      31,07,866    10,65,848        81,944
No. of users                       4,78,120     1,36,348         17,660
No. of keywords                    213          213              205
No. of hashtags                    9,17,833     45,535           37,151
No. of tweets/keyword              14,591       5,004            400
Average no. of hashtags/tweet      5            8                8
Average no. of tweets/user         7            8                5

Table 1: Dataset Statistics
5.1.2. Compared Methods
In order to assess the ecacy of the suggested model, we conducted a comparative
analysis against prior research endeavors in the domain of hashtag recommendation as well
as established language models based on transformer architecture.
8https://github.com/JustAnotherArchivist/snscrape
Existing Research Works. To evaluate the efficiency of the proposed model, we contrast our approach with recent research works on hashtag recommendation.
1. AMNN: (Yang et al., 2020b) generated hashtags by developing a sequence-to-sequence
encoder–decoder framework. The encoder retrieves visual and textual embeddings
individually which are then subjected to an attention technique. The attended visual
and textual features upon concatenation are fed into GRU, which generates hashtags
sequentially according to softmax probabilities.
2. TwHIN-BERT: (Zhang et al., 2022) developed the Twitter Heterogeneous Information
Network which is a polyglot language model that frames the objective of predicting
hashtags as a problem of multi-class classication. It is trained with a vast volume of
tweets and rich social interactions in order to emulate the brief and noisy nature of
user-generated content.
3. SEGTRM: (Mao et al., 2022) introduced a transformer-based model which produces hashtags in a sequential manner. SEGTRM consists of three components: an encoder, a segments-selector, and a hashtag generator. The encoder removes extraneous data at various granularities within text, segments, and tokens in order to derive global textual representations. The segments-selector selects multiple segments and reorganizes them into a novel sequence to serve as input to the decoder, enabling end-to-end hashtag construction. To predict hashtags in terms of both quality and quantity concurrently, the authors employ a sequential decoding algorithm.
4. DESIGN: (Bansal et al., 2022) incorporated pertinent data encoded in linguistic and
visual modalities of social media posts besides analyzing users’ tagging behavior to
suggest a personalized and credible set of hashtags. The authors use a word-level
parallel co-attention mechanism to enhance the multimodal information and create a
richer post representation. The decoder capitalizes on hashtags produced using multilabel classification and sequence generation procedures for the recommendation.
Existing Models. We discuss various transformer-based models against which we compare the performance of our devised framework. To derive features of tweets in our dataset, we investigated different transformer-based models. These models can be tailored for classification tasks after being trained on general tasks. (Devlin et al., 2019) introduced BERT, a transformer-based approach for pre-training NLP models that learns contextual representations during pre-training. It is a deep, bidirectional, and flexible model that can be fine-tuned by appending a few output layers. Consequently, BERT serves as the underlying architecture for all fundamental models.
1. mBERT: (Pires et al., 2019) devised mBERT, which stands for multilingual BERT. It
is a transformer-based model trained on and usable with 104 languages with Wikipedia
(2.5B words) with 110 thousand shared word-piece vocabulary using a masked language
modeling (MLM) objective. The input is transformed into vectors with BERT’s capa-
bility of bidirectionally training the language model which captures a deeper context
and ow of the language.
2. mBERT with Transliteration: We used the IndicTrans9 package released by AI4Bharat to transliterate the text of tweets. We employ transliteration (script conversion) for Indic languages since it helps in reducing the lexical gap among different Indic languages. After transliteration, we obtain embeddings for transliterated tweets using mBERT, which in turn are employed to recommend suitable hashtags.
3. IndicBERT: (Kakwani et al., 2020) introduced an ALBERT-based multilingual model
featured in AI4Bharat’s IndicNLPSuite. This model was trained on a massive corpus
containing over 9 billion tokens in 12 major Indian languages. IndicBERT is capable
of extracting sentence and word embeddings.
4. XLMR: (Conneau et al., 2020) proposed the multilingual RoBERTa variant called XLM-RoBERTa, which is used to carry out various NLP tasks. It has been pre-trained on an enormous amount of multilingual data covering 100 languages using the MLM objective. More intriguingly, large-scale cross-lingual training has a major positive impact on languages with few resources. XLM-RoBERTa uses SentencePiece tokenization on raw text without any performance loss. Since it uses the same training procedure as the RoBERTa model, the moniker "RoBERTa" was incorporated.
5. DistilmBERT: (Sanh et al., 2019) developed a condensed adaptation of mBERT with the objective of reducing its size, cost, processing time, and computational load. It contains up to 40% fewer parameters than BERT-base-uncased, and it guarantees a 60% faster runtime while maintaining 97% of the original performance. Furthermore, it is trained on Wikipedia texts in 102 distinct languages. It has 134M parameters in all. DistilmBERT is typically twice as fast as mBERT-base.
5.1.3. Evaluation Metrics
To evaluate the performance of our suggested hashtag recommendation system, we use
assessment criteria from the literature on multi-label classication. The standard evalu-
ation metrics for analyzing the performance of hashtag recommendation methods are Hit
rate, Precision, Recall, and F1-score. These metrics are computed by comparing predicted
hashtags and ground-truth hashtags for each tweet. We describe each evaluation metric
below.
The occurrence of at least one common hashtag (GH ∩ RH ≠ ∅) between the set of recommended hashtags (RH) and ground-truth hashtags (GH) accounts for the hit-rate metric when dealing with hashtag recommendation systems. The hit rate is described in the following equation.

Hit rate (HR) = min(|GH ∩ RH|, 1)    (25)
Dividing the number of hashtags present in both the ground-truth and recommended hashtag sets by the cardinality of the set of recommended hashtags yields precision. The formula for precision is as follows.

Precision (P) = |GH ∩ RH| / |RH|    (26)
9https://ai4bharat.org/indic-trans
Recall is the ratio of the number of hashtags shared between the ground-truth and recommended hashtag sets to the number of ground-truth hashtags. The recall is computed as given in Equation 27.

Recall (R) = |GH ∩ RH| / |GH|    (27)

To compute the F1-score, we derive the harmonic mean of the precision and recall measures, as shown in Equation 28.

F1-score (F1) = 2 · P · R / (P + R)    (28)
The outcome of each evaluation metric is denoted as HR@K, P@K, R@K, and F1@K, where K denotes the number of recommended hashtags. Note that larger values imply better performance.
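Equations 25-28 for a single tweet can be computed as follows; the function name is illustrative, and the per-tweet scores would be averaged over the test set to obtain HR@K, P@K, R@K, and F1@K.

```python
def evaluate(gt, rec):
    """Per-tweet metrics of Eqs. 25-28.
    gt: set of ground-truth hashtags; rec: set of recommended hashtags."""
    common = len(gt & rec)                       # |GH ∩ RH|
    hr = min(common, 1)                          # Eq. 25: hit rate
    p = common / len(rec) if rec else 0.0        # Eq. 26: precision
    r = common / len(gt) if gt else 0.0          # Eq. 27: recall
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0  # Eq. 28: F1-score
    return hr, p, r, f1
```

For instance, with ground truth {#a, #b, #c} and recommendations {#a, #d}, this yields a hit rate of 1, precision 0.5, recall 1/3, and F1-score 0.4.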
5.2. Experimental Results
In this segment, we present an exposition of the empirical findings resulting from the comparison of the proposed framework to state-of-the-art approaches and extant models, an analysis of the performance enhancement, and an examination of visual representations of the recommendations.
5.2.1. Eectiveness Comparisons
We begin by outlining TAGALOG's overall benefits, particularly its superiority over
previous research works and various transformer-based models. We regard
the top-K hashtags as the recommended ones, with K being 8, since the mean number of
hashtags per tweet is 8.

Technique | Hit rate | Precision | Recall | F1-score
AMNN (Yang et al., 2020b) | 0.489 | 0.195 | 0.210 | 0.202
SEGTRM (Mao et al., 2022) | 0.520 | 0.211 | 0.228 | 0.219
TwHIN-BERT (Zhang et al., 2022) | 0.600 | 0.179 | 0.194 | 0.187
DESIGN (Bansal et al., 2022) | 0.771 | 0.284 | 0.311 | 0.297
TAGALOG | 0.824 | 0.334 | 0.366 | 0.349

Table 2: Effectiveness Comparison Results with Existing Research Works

As can be seen in Table 2, the performance gain achieved by TAGALOG is 33.5%, 13.9%,
15.6%, and 14.7% over AMNN; 30.4%, 12.3%, 13.8%, and 13.0% over SEGTRM; 22.4%,
15.5%, 17.2%, and 16.2% over TwHIN-BERT; and 5.3%, 5.0%, 5.5%, and 5.2% over DESIGN
in terms of hit rate, precision, recall, and F1-score, respectively. The improvement in
performance achieved by TAGALOG over AMNN is due to the superiority of mBERT over
LSTM (Graves & Graves, 2012). The bidirectional and multilingual nature of the BERT-based
feature extractor captures the multilingual context more effectively. Further, TAGALOG
considers language and user characteristics when creating the tweet representation to
recommend high-quality hashtags, in contrast to the purely content-based information used
by AMNN. The performance enhancement over SEGTRM arises because SEGTRM filters
text at different granularities, whereas TAGALOG adopts language-guided and user-guided
attention mechanisms to filter content with respect to the user's topical and linguistic
interests. The remarkable improvement of TAGALOG over TwHIN-BERT is due to modeling
user preferences, in addition to user interaction with tweets and language relatedness,
through graph construction. DESIGN employs a word-level attention mechanism in addition
to multi-label classification and sequence generation techniques. The user-guided and
language-guided attention mechanisms in TAGALOG filter the tweet content to construct a
tweet representation in accordance with the user's topical interests and linguistic style,
which aids in suggesting relevant hashtags. Unlike DESIGN, which samples a fixed number
of a user's historical posts, TAGALOG captures the user's entire tweet history through a
graph neural network.
Figure 4: Effectiveness Comparison Curves on IndicHash (panels: (a) Hit rate, (b) Precision, (c) Recall, (d) F1-score)
Fig. 4 contrasts the performance of the various hashtag recommendation models in terms
of the evaluation metrics. The x-axis shows the number of recommended hashtags, while the
y-axis represents the respective performance indicator. The recommended hashtag count
ranges from 1 to 9. It is noteworthy that an increase in the number of recommended
hashtags leads to a higher hit rate and recall but lower precision. The curves for TAGALOG
consistently lie above those of the other models on all metrics, regardless of the number
of hashtags recommended. Furthermore, the gaps between the curves gradually widen,
underscoring the substantial advances made by our proposed model over existing research
methods. These findings provide empirical support for TAGALOG's superiority and efficacy
across all four assessment criteria.
Technique | Hit rate | Precision | Recall | F1-score
mBERT (Pires et al., 2019) | 0.757 | 0.261 | 0.286 | 0.273
mBERT with transliteration | 0.715 | 0.240 | 0.263 | 0.251
IndicBERT (Kakwani et al., 2020) | 0.637 | 0.213 | 0.229 | 0.221
XLMR (Conneau et al., 2020) | 0.655 | 0.200 | 0.221 | 0.210
DistilmBERT (Sanh et al., 2019) | 0.549 | 0.147 | 0.159 | 0.153
TAGALOG | 0.824 | 0.334 | 0.366 | 0.349

Table 3: Effectiveness Comparison Results with Existing Models
Table 3 shows the performance comparison of TAGALOG with existing transformer-based
models. The performance gain achieved by TAGALOG is 6.7%, 7.3%, 8.0%, and 7.6%
over mBERT without transliteration; 10.9%, 9.4%, 10.3%, and 9.8% over mBERT with
transliteration; 18.7%, 12.1%, 13.7%, and 12.8% over IndicBERT; 16.9%, 13.4%, 14.5%,
and 13.9% over XLMR; and 27.5%, 18.7%, 20.7%, and 19.6% over DistilmBERT in terms of
the four performance measures. The reasons behind this gap are the incorporation of a novel
language-guided attention mechanism in addition to user-guided attention, the construction
of a user-tweet graph to capture interactions among tweets belonging to languages of the
same family, and user-tweet interaction to enrich the user and tweet embeddings. These
procedures help in constructing an effective tweet representation, which in turn yields
high-quality and relevant hashtags for tweets posted in low-resource Indic languages.
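The graph construction sketched above can be illustrated with a small example. The edge-building routine and the language-to-family mapping below are simplified assumptions for illustration, not the authors' exact implementation: tweets by the same user are linked to capture posting behavior, and tweets whose languages share a family are linked to capture language relatedness.

```python
# Hypothetical mapping of languages to the two families the paper models.
FAMILY = {"hindi": "indo-aryan", "bangla": "indo-aryan", "gujarati": "indo-aryan",
          "tamil": "dravidian", "telugu": "dravidian", "kannada": "dravidian"}

def build_edges(tweets):
    """tweets: list of (tweet_id, user_id, language) triples.

    Returns a set of typed edges: 'user' edges connect a user's historical
    tweets; 'family' edges connect tweets of related languages.
    """
    edges = set()
    for i, (tid_a, user_a, lang_a) in enumerate(tweets):
        for tid_b, user_b, lang_b in tweets[i + 1:]:
            if user_a == user_b:                      # same user's history
                edges.add((tid_a, tid_b, "user"))
            if FAMILY[lang_a] == FAMILY[lang_b]:      # same language family
                edges.add((tid_a, tid_b, "family"))
    return edges
```

In a full system, these edges would feed a graph neural network that propagates information between connected tweet and user nodes to enrich their embeddings.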
5.2.2. Performance Gain Analysis
We analyze the performance gains of the proposed approach in this section. Following
a performance comparison across model components, we examine how TAGALOG
performs with various attention techniques.

Attention Techniques. We discuss how TAGALOG performs with diverse attention strategies
in this part. The variants of TAGALOG that use no attention, language-guided attention
only, user-guided attention only, and user-guided together with language-guided attention
are TAGALOG-NA, TAGALOG-LGA, TAGALOG-UGA, and TAGALOG-UGA+LGA, respectively.
Here, TAGALOG-UGA+LGA refers to our devised system.
Mechanism | Hit rate | Precision | Recall | F1-score
TAGALOG-NA | 0.784 | 0.285 | 0.313 | 0.299
TAGALOG-LGA | 0.783 | 0.292 | 0.321 | 0.306
TAGALOG-UGA | 0.824 | 0.330 | 0.361 | 0.345
TAGALOG-UGA+LGA | 0.824 | 0.334 | 0.366 | 0.349

Table 4: Performance of TAGALOG with Different Attention Techniques
Table 4 illustrates the performance obtained when the attention mechanisms that comprise
the feature refinement module are removed. Here, UGA and LGA refer to the user-guided
attention and language-guided attention mechanisms. The performance difference when
TAGALOG is implemented without any attention mechanism is 5.0% in terms of the F1-score.
To derive the overall tweet representation in the no-attention model, we compute the
average of the mBERT-based token embeddings. The performance of TAGALOG is lowest
in the absence of any attention mechanism. The drop in the F1-score on eliminating UGA
from TAGALOG, termed TAGALOG-LGA, is 4.3%, while the difference on excluding LGA
from TAGALOG, termed TAGALOG-UGA, is 0.4%. UGA helps to learn the context in
which a user created a post, and LGA assists in learning the user's language choice and
usage style. UGA is typically used to improve the relevance and usefulness of tweets for
individual users and to enhance the overall user experience, while LGA focuses on modeling
idiosyncratic language behavior. The above-mentioned performance gap demonstrates the
significance of the language-guided and user-guided attention techniques.
Model Component Analysis. We conduct a model component analysis to emphasize the
significance of the various components constituting the proposed model. Below, we report
the performance of the Feature Refinement (FR) and Feature Interaction (FI) components
comprising TAGALOG. We eliminate the feature refinement component to stress its
pertinence; the resulting model is referred to as TAGALOG-FI. Similarly, the model obtained
by excluding feature interaction from TAGALOG is referred to as TAGALOG-FR. We use
TAGALOG-FR+FI and TAGALOG interchangeably, since TAGALOG-FR+FI is the model
we have developed.
Technique | Hit rate | Precision | Recall | F1-score
TAGALOG-FI | 0.784 | 0.285 | 0.313 | 0.299
TAGALOG-FR | 0.806 | 0.314 | 0.342 | 0.328
TAGALOG-FR+FI | 0.824 | 0.334 | 0.366 | 0.349

Table 5: Performance Comparison with Different Components
Table 5 shows the performance of TAGALOG when its different components are eliminated.
The performance gap in terms of the evaluation metrics on the exclusion of FR is 4.0%,
4.9%, 5.3%, and 5.0%, respectively, while that on the exclusion of FI is 1.8%, 2.0%, 2.4%,
and 2.1%, which demonstrates the significance of these components. Additionally, the
proposed model, which includes both FR and FI, beats the performance of the individual
components. This implies that the components complement each other when recommending
hashtags. FR captures the local topical and linguistic interests of individual users through
UGA and LGA, while FI captures global interests by analyzing the long-term behavior and
preferences of the user, in addition to tweet correlation based on language relatedness.
Overall, the experimental results show that each component contributes positively to
TAGALOG's performance.
5.2.3. Qualitative Analysis
We conduct qualitative investigations to demonstrate the effectiveness of our framework.
We show user-created tweets together with the hashtags proposed by the different models.
For sample tweets chosen from the test data, accurate hashtags are shown in green,
pertinent ones in blue, and erroneous ones in red. Hashtags that models recommend and
that match the ground-truth hashtags are considered accurate. Pertinent hashtags, on the
other hand, do not belong to the set of ground-truth hashtags but are compatible with
the tweet's content.
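The three-way labeling above can be sketched as follows. Deciding which non-ground-truth hashtags count as "pertinent" requires human judgment; the sketch mocks that judgment with a caller-supplied set of topically related tags, so this is an illustration of the labeling scheme rather than an automatic procedure.

```python
def label_hashtags(recommended, ground_truth, related):
    """Label each recommended hashtag as accurate, pertinent, or erroneous.

    ground_truth: hashtags the user actually assigned.
    related: hashtags judged (by a human annotator) to fit the tweet's topic.
    """
    labels = {}
    for tag in recommended:
        if tag in ground_truth:
            labels[tag] = "accurate"      # matches the ground truth
        elif tag in related:
            labels[tag] = "pertinent"     # topically compatible, not in ground truth
        else:
            labels[tag] = "erroneous"     # neither in ground truth nor on topic
    return labels
```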
The tweet given in Fig. 5a, written in Bangla, concerns the Punjab elections held in 2022.
As can be seen, the user assigns a few hashtags to the tweet in his native language,
indicating that these hashtags are widely trending about the Punjab elections among
Bangla Twitter users. The user assigns #congress and #bjp not only in English but also in
Bangla. Besides assigning hashtags in English, users tend to tag topics of interest with
hashtags in their native language. Users are more inclined to adopt hashtags in their
native language to connect with others who share their cultural background or interests.
Hashtags in different languages can also promote diversity and inclusivity on social media
platforms, allowing users to find content and connect with others from a broader range of
backgrounds and perspectives. The hashtags recommended in Bangla indicate the ability
of our model to recommend language-specific topical hashtags. This implies that our model
recommends multilingual hashtags and learns the user's language usage style by adopting
his linguistic behavior. The hashtag #punjab is directly related to the event of the Punjab
elections; #pmmodi and #rahulgandhi are prominent political figures and are therefore
deemed pertinent. TAGALOG recommends seven accurate and three pertinent hashtags.
DESIGN recommends four accurate, five pertinent, and one erroneous hashtag. SEGTRM
recommends three accurate and six pertinent hashtags. AMNN recommends one accurate,
five pertinent, and one erroneous hashtag. TwHIN-BERT recommends one accurate, four
pertinent, and five erroneous hashtags. Our model recommends the highest number of
accurate hashtags, indicating that mining users' posting and linguistic behavior helps
suggest plausible hashtags.
The tweet in Fig. 5b is written in Gujarati in the context of a global event, the
Russia-Ukraine war. TAGALOG recommends seven accurate and three pertinent hashtags;
DESIGN recommends five accurate, four pertinent, and one erroneous hashtag; SEGTRM
recommends three accurate, one pertinent, and one erroneous hashtag; AMNN recommends
two accurate, one pertinent, and two erroneous hashtags; and TwHIN-BERT recommends
two accurate, one pertinent, and seven erroneous hashtags. The example posts demonstrate
how, by suggesting customized hashtags based on users' thematic and linguistic preferences,
TAGALOG surpasses earlier research methods.

Figure 5: Example Posts (panels: (a) Post 1, (b) Post 2)
6. Discussion
This article introduces a technique to recommend hashtags for tweets posted in multiple
low-resource Indic languages. Our method leverages the user's topical and linguistic
preferences, in addition to the user's posting behavior, to enrich the overall tweet
representation and yield pertinent hashtags. The overall comparison results show that the
proposed system outperforms pre-trained language models and state-of-the-art methods by
a significant margin. While our proposed system offers exciting possibilities, it is crucial
to acknowledge its limitations. This section delves into these limitations, discusses the
practical implications, and explores potential applications that can leverage its strengths.
6.1. Limitations and Future Work
While our proposed model exhibits notable strengths, it is not immune to limitations.
One limitation is that it considers only two prominent language families, i.e., Indo-Aryan
and Dravidian. However, other distinct language families are represented in India, such
as Austroasiatic (e.g., Santali), Tibeto-Burman (e.g., Manipuri), and Andamanese (e.g.,
Great Andamanese), that contribute to the diverse linguistic landscape of the Indian
subcontinent. Our system is scalable, as it can be applied to tweets written in languages
belonging to these language groups. Moreover, we have only considered relatedness among
languages of the same family. Indic languages exhibit varying degrees of cross-family
language relatedness due to historical and linguistic influences. Different language families
have varying degrees of interaction and influence with the Indo-Aryan and Dravidian
languages, resulting in some cross-family language relatedness in the Indian subcontinent,
which can be explored in the future. Future directions also encompass employing data
augmentation methodologies to artificially amplify the quantity of data available for
training the model and devising models that can effectively grasp the diverse patterns of
hashtag usage across various cultural and linguistic contexts.
6.2. Practical Implications
As discussed below, the practical implications of multilingual and personalized hashtag
recommendation in low-resource Indic languages are far-reaching, revolutionizing how
individuals engage with social media platforms.

1. Improved Content Discovery: Hashtags are a powerful tool for content discovery and
organization. Users can easily find relevant content in their preferred language when
provided with multilingual and personalized hashtag recommendations. This enhances
their browsing experience and encourages active engagement with the platform. By
suggesting hashtags that align with users' linguistic and cultural context, users are
more likely to engage with the content, participate in discussions, and contribute to
online communities. This can lead to increased user retention and overall platform
activity.
2. Language Inclusivity: Low-resource Indic languages often face marginalization in
digital spaces due to the dominance of primary languages. Multilingual hashtag
recommendation systems address this issue by promoting inclusivity. They enable users
to express themselves in their native languages, facilitating active participation and
fostering a sense of belonging within language communities.

3. Language Learning and Education: Personalized hashtag recommendations can benefit
individuals learning low-resource Indic languages. By suggesting hashtags that match
their language proficiency level, users can explore relevant content and engage with
native speakers, thereby enhancing their language skills and cultural understanding.

4. Bridging Language Divides and Promoting Heritage: Hashtag recommendation systems
act as linguistic tools and cultural signifiers. They bridge language divides by
suggesting common hashtags across low-resource Indic and widely spoken languages,
facilitating cross-lingual communication and collaboration. Additionally, these
systems preserve and promote the cultural heritage of low-resource Indic languages,
allowing users to express cultural identity, share traditions, and engage in community
discussions using hashtags. The systems also enable the analysis of hashtag usage
patterns, revealing linguistic and cultural trends across languages.
6.3. Potential Applications
The potential applications of multilingual and personalized hashtag recommendation in
low-resource Indic languages can unlock the immense potential of online communication for
users and communities, as listed below.

1. Social and Political Discourse: Hashtags play a significant role in shaping public
opinion and facilitating discussions around social and political issues. A multilingual
hashtag recommendation system for low-resource Indic languages can ensure that diverse
linguistic communities can actively participate in such discussions. It can empower
individuals to express their opinions, promote social causes, drive activism, raise
awareness, and contribute to democratic processes. This can amplify their voices and
facilitate collective action within their linguistic communities.

2. Market Reach and Business Opportunities: Multilingual hashtag recommendations open
doors for businesses and marketers to reach untapped markets, engaging a wider audience
and driving engagement. By using relevant hashtags, businesses can effectively target
specific language communities, promote their products or services, and connect with
potential customers who prefer using their native languages online.

3. Data Analysis and Research: Hashtags provide valuable metadata that can be analyzed
to gain insights into social trends, public opinions, and user behavior. By recommending
hashtags in low-resource Indic languages, researchers, social scientists, and data
analysts can access a wider range of data, enabling them to study and understand the
dynamics and patterns within these language communities.
7. Conclusion
In this paper, we have tackled hashtag recommendation to facilitate multilingual content
retrieval and break through the language barriers inherent in social media platforms. The
proposed polyglot model, TAGALOG, can recommend personalized and language-specific
hashtags for online content generated in various low-resource Indic languages. The system
proposed in this study comprises feature extraction, refinement, and interaction modules.
We first extract content-based, linguistic, and user-based features using transformer- and
deep learning-based models. We then employ language-guided and user-guided attention
mechanisms to fine-tune the tweet representation in line with users' linguistic and topical
preferences. In the feature interaction module, we connect the historical tweets of a
particular user to mine his posting behavior. Furthermore, we group tweets written in
various languages according to their families, i.e., Indo-Aryan and Dravidian, to capture
their interrelatedness. Extensive experiments conducted on the curated Twitter dataset
reveal that our proposed model outperforms pre-trained language models and
state-of-the-art methods.
References
Aggarwal, S., Kumar, S., & Mamidi, R. (2021). Efficient multilingual text classification for
Indian languages. In Proceedings of the International Conference on Recent Advances
in Natural Language Processing (RANLP 2021) (pp. 19–25).
Bansal, S., Gowda, K., & Kumar, N. (2022). A hybrid deep neural network for multimodal
personalized hashtag recommendation. IEEE Transactions on Computational Social
Systems, (pp. 1–21).
Besacier, L., Barnard, E., Karpov, A., & Schultz, T. (2014). Automatic speech recognition
for under-resourced languages: A survey. Speech communication,56, 85–100.
Chakrabarti, P., Malvi, E., Bansal, S., & Kumar, N. (2023). Hashtag recommendation for
enhancing the popularity of social media posts. Social Network Analysis and Mining,
13, 21.
Chen, Y.-C., Lai, K.-T., Liu, D., & Chen, M.-S. (2021). Tagnet: triplet-attention graph
networks for hashtag recommendation. IEEE Transactions on Circuits and Systems for
Video Technology,32, 1148–1159.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave,
É., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2020). Unsupervised cross-lingual repre-
sentation learning at scale. In Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics (pp. 8440–8451).
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep
bidirectional transformers for language understanding. In Proceedings of the 2019 Con-
ference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186).
Djenouri, Y., Belhadi, A., Srivastava, G., & Lin, J. C.-W. (2022). Deep learning based
hashtag recommendation system for multimedia data. Information Sciences,609, 1506–
1517.
Dogra, V., Verma, S., Chatterjee, P., Sha, J., Choi, J., Ijaz, M. F. et al. (2022). A complete
process of text classification system using state-of-the-art NLP models. Computational
Intelligence and Neuroscience, 2022.
Dusart, A., Pinel-Sauvagnat, K., & Hubert, G. (2023). Tssubert: How to sum up multiple
years of reading in a few tweets. ACM Transactions on Information Systems,41, 1–33.
Graves, A., & Graves, A. (2012). Long short-term memory. Supervised sequence labelling
with recurrent neural networks, (pp. 37–45).
Grover, A., & Leskovec, J. (2016). node2vec: Scalable feature learning for networks. In
Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery
and data mining (pp. 855–864).
Hachaj, T., & Miazga, J. (2020). Image hashtag recommendations using a voting deep neural
network and associative rules mining approach. Entropy,22, 1351.
Hamilton, W., Ying, Z., & Leskovec, J. (2017). Inductive representation learning on large
graphs. Advances in neural information processing systems,30.
Javari, A., He, Z., Huang, Z., Jeetu, R., & Chen-Chuan Chang, K. (2020). Weakly supervised
attention for hashtag recommendation using graph data. In Proceedings of The Web
Conference 2020 (pp. 1038–1048).
Jeong, D., Oh, S., & Park, E. (2022). Demohash: Hashtag recommendation based on user
demographic information. Expert Systems with Applications,210, 118375.
Kakwani, D., Kunchukuttan, A., Golla, S., Gokul, N., Bhattacharyya, A., Khapra, M. M.,
& Kumar, P. (2020). Indicnlpsuite: Monolingual corpora, evaluation benchmarks and
pre-trained multilingual language models for indian languages. In Findings of the As-
sociation for Computational Linguistics: EMNLP 2020 (pp. 4948–4961).
Kaviani, M., & Rahmani, H. (2020). Emhash: Hashtag recommendation using neural net-
work based on bert embedding. In 2020 6th International Conference on Web Research
(ICWR) (pp. 113–118). IEEE.
Khatri, J., Saini, N., & Bhattacharyya, P. (2021). Language relatedness and lexical closeness
can help improve multilingual nmt: Iitbombay@ multiindicnmt wat2021. In Proceedings
of the 8th Workshop on Asian Translation (WAT2021) (pp. 217–223).
Khemchandani, Y., Mehtani, S., Patil, V., & Awasthi, A. (2021). Exploiting language
relatedness for low resource language model adaptation: An indic languages study. In
ACL-IJCNLP Main Conference.
Kipf, T. N., & Welling, M. (2016). Semi-supervised classication with graph convolutional
networks. In International Conference on Learning Representations.
Kou, F.-F., Du, J.-P., Yang, C.-X., Shi, Y.-S., Cui, W.-Q., Liang, M.-Y., & Geng, Y. (2018).
Hashtag recommendation based on multi-features of microblogs. Journal of Computer
Science and Technology,33, 711–726.
Kumar, N., Baskaran, E., Konjengbam, A., & Singh, M. (2021). Hashtag recommendation
for short social media texts using word-embeddings and external knowledge. Knowledge
and Information Systems,63, 175–198.
Kurunkar, P., Sawant, O., Mene, P., & Varghese, N. (2022). An image-based hashtag recom-
mendation system as a social media workow tool. In 2022 International Conference on
Smart Generation Computing, Communication and Networking (SMART GENCON)
(pp. 1–5). IEEE.
Lei, K., Fu, Q., Yang, M., & Liang, Y. (2020). Tag recommendation by text classification
with attention-based capsule network. Neurocomputing, 391, 65–73.
Li, M., Gan, T., Liu, M., Cheng, Z., Yin, J., & Nie, L. (2019). Long-tail hashtag recom-
mendation for micro-videos with graph convolutional network. In Proceedings of the
28th ACM International Conference on Information and Knowledge Management (pp.
509–518).
Li, X., Wu, X., Luo, Z., Du, Z., Wang, Z., & Gao, C. (2023). Integration of global and local
information for text classification. Neural Computing and Applications, 35, 2471–2486.
Li, Z., Wang, X., Yang, W., Wu, J., Zhang, Z., Liu, Z., Sun, M., Zhang, H., & Liu, S. (2022).
A unified understanding of deep NLP models for text classification. IEEE Transactions
on Visualization and Computer Graphics, 28, 4980–4994.
Ma, R., Qiu, X., Zhang, Q., Hu, X., Jiang, Y.-G., & Huang, X. (2019). Co-attention memory
network for multimodal microblog’s hashtag recommendation. IEEE Transactions on
Knowledge and Data Engineering,33, 388–400.
Mao, Q., Li, X., Liu, B., Guo, S., Hao, P., Li, J., & Wang, L. (2022). Attend and select:
A segment selective transformer for microblog hashtag generation. Knowledge-Based
Systems,254, 109581.
Marreddy, M., Oota, S. R., Vakada, L. S., Chinni, V. C., & Mamidi, R. (2022). Multi-task
text classification using graph convolutional networks for large-scale low resource
language. In 2022 International Joint Conference on Neural Networks (IJCNN) (pp.
1–8). IEEE.
Mehta, S., Sarkhel, S., Chen, X., Mitra, S., Swaminathan, V., Rossi, R., Aminian, A., Guo,
H., & Garg, K. (2021). Open-domain trending hashtag recommendation for videos. In
2021 IEEE International Symposium on Multimedia (ISM) (pp. 174–181). IEEE.
Myers, S., Syrdal, H. A., Mahto, R. V., & Sen, S. S. (2023). Social religion: A cross-
platform examination of the impact of religious inuencer message cues on engagement–
the christian context. Technological Forecasting and Social Change,191, 122442.
Nama, V., & Deepak, G. (2023). Dtagrecpls: Diversification of tag recommendation for
videos using preferential learning and differential semantics. In Proceedings of the 14th
International Conference on Soft Computing and Pattern Recognition (SoCPaR 2022)
(pp. 887–898). Springer.
Padungkiatwattana, U., & Maneeroj, S. (2022). Pac-man: Multi-relation network in so-
cial community for personalized hashtag recommendation. IEEE Access,10, 131202–
131228.
Panchal, P., & Prajapati, D. J. (2023). The social hashtag recommendation for image
and video using deep learning approach. In Sentiment Analysis and Deep Learning:
Proceedings of ICSADL 2022 (pp. 241–261). Springer.
Pandey, K. K., & Jha, S. (2021). Exploring the interrelationship between culture and
learning: the case of english as a second language in india. Asian Englishes, (pp. 1–17).
Park, M., Li, H., & Kim, J. (2016). Harrison: A benchmark on hashtag recommendation
for real-world images in social networks. arXiv preprint arXiv:1605.05054, .
Pathak, M., & Jain, A. (2022). µboost: An effective method for solving Indic multilingual
text classification problem. In 2022 IEEE Eighth International Conference on
Multimedia Big Data (BigMM) (pp. 96–100). IEEE.
Peng, M., Lin, Y., Zeng, L., Gui, T., & Zhang, Q. (2019). Modeling the long-term post his-
tory for personalized hashtag recommendation. In Chinese Computational Linguistics:
18th China National Conference, CCL 2019, Kunming, China, October 18–20, 2019,
Proceedings 18 (pp. 495–507). Springer.
Perozzi, B., Al-Rfou, R., & Skiena, S. (2014). Deepwalk: Online learning of social rep-
resentations. In Proceedings of the 20th ACM SIGKDD international conference on
Knowledge discovery and data mining (pp. 701–710).
Pires, T., Schlinger, E., & Garrette, D. (2019). How multilingual is multilingual bert? In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
(pp. 4996–5001).
Rehman, M. Z. U., Mehta, S., Singh, K., Kaushik, K., & Kumar, N. (2023). User-aware
multilingual abusive content detection in social media. Information Processing & Man-
agement,60, 103450.
Sanghvi, D., Fernandes, L. M., D’Souza, S., Vasaani, N., & Kavitha, K. (2023). Fine-tuning
of multilingual models for sentiment classication in code-mixed indian language texts.
In Distributed Computing and Intelligent Technology: 19th International Conference,
ICDCIT 2023, Bhubaneswar, India, January 18–22, 2023, Proceedings (pp. 224–239).
Springer.
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). Distilbert, a distilled version of bert:
smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, .
Schlichtkrull, M., Kipf, T. N., Bloem, P., Van Den Berg, R., Titov, I., & Welling, M. (2018).
Modeling relational data with graph convolutional networks. In The Semantic Web:
15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018,
Proceedings 15 (pp. 593–607). Springer.
Tang, S., Yao, Y., Zhang, S., Xu, F., Gu, T., Tong, H., Yan, X., & Lu, J. (2019). An integral
tag recommendation model for textual content. In Proceedings of the AAAI Conference
on Artificial Intelligence (pp. 5109–5116). volume 33.
Wang, Y., Li, J., King, I., & Shi, M. R. L. S. (2019). Microblog hashtag generation via
encoding conversation contexts. In Proceedings of NAACL-HLT (pp. 1624–1633).
Wei, Y., Cheng, Z., Yu, X., Zhao, Z., Zhu, L., & Nie, L. (2019). Personalized hashtag recom-
mendation for micro-videos. In Proceedings of the 27th ACM International Conference
on Multimedia (pp. 1446–1454).
Yang, C., Wang, X., & Jiang, B. (2020a). Sentiment enhanced multi-modal hashtag recom-
mendation for micro-videos. IEEE Access,8, 78252–78264.
Yang, Q., Wu, G., Li, Y., Li, R., Gu, X., Deng, H., & Wu, J. (2020b). Amnn: Attention-
based multimodal neural network model for hashtag recommendation. IEEE Transac-
tions on Computational Social Systems,7, 768–779.
Yang, Z., & Lin, Z. (2022). Interpretable video tag recommendation with multimedia deep
learning framework. Internet Research,32, 518–535.
Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., & Hovy, E. (2016). Hierarchical attention
networks for document classification. In Proceedings of the 2016 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language
Technologies (pp. 1480–1489).
Zhang, S., Yao, Y., Xu, F., Tong, H., Yan, X., & Lu, J. (2019). Hashtag recommendation for
photo sharing services. In Proceedings of the AAAI Conference on Artificial Intelligence
(pp. 5805–5812). volume 33.
Zhang, X., Malkov, Y., Florez, O., Park, S., McWilliams, B., Han, J., & El-Kishky, A.
(2022). Twhin-bert: A socially-enriched pre-trained language model for multilingual
tweet representations. arXiv preprint arXiv:2209.07562, .
... Sarcasm, characterized by the use of irony to mock or convey contempt, adds layers of complexity to written and spoken language [1], [2]. Beyond its colloquial usage in everyday conversations, the ability to detect sarcasm holds profound implications for diverse applications [3], [4]. In social media, where brevity is a norm, understanding sarcastic remarks is essential for interpreting user sentiments accurately [5]. ...
Article
Full-text available
This study navigates the intricate landscape of sarcasm detection within the condensed confines of newspaper titles, addressing the nuanced challenge of decoding layered meanings. Leveraging natural language processing (NLP) techniques, we explore the efficacy of various machine learning models—linear regression, support vector machines (SVM), random forest, na¨ıve Bayes multinomial, and gaussian na¨ıve Bayes—tailored for sarcasm detection. Our investigation aims to provide insights into sarcasm within the succinct framework of newspaper titles, offering a comparative analysis of the selected models. We highlight the varied strengths and weaknesses of these models. Random forest exhibits superior performance, achieving a remarkable 94% accuracy in accurately identifying sarcasm in text. It is closely trailed by SVM with 90% accuracy and logistic regression with 83% accuracy.
... This process requires linguistic proficiency, cultural awareness, and the ability to effectively communicate the essence of the text in the translated language [2]. In the realm of English translation, information retrieval plays a vital role in ensuring the fidelity and clarity of the translated content, ultimately facilitating effective cross-cultural communication and understanding [3].Multilingual information retrieval (MIR) is a specialized field that focuses on retrieving relevant information from multilingual sources and presenting it in a coherent and understandable manner, particularly when translating into English [4]. In the context of translation, MIR involves accessing and processing information from texts written in different languages, analyzing their content, and extracting key information that needs to be translated accurately into English [5]. ...
Article
Full-text available
Multilingual information retrieval using graph neural networks offers practical applications in English translation by leveraging advanced computational models to enhance the efficiency and accuracy of cross-lingual search and translation tasks. By representing textual data as graphs and utilizing graph neural networks (GNNs), this approach captures intricate relationships between words and phrases across different languages, enabling more effective language understanding and translation. GNNs can learn complex linguistic structures and semantic similarities from multilingual corpora, facilitating the development of more robust translation systems that are capable of handling diverse language pairs and domains. The paper introduces a novel approach termed the Multilingual Ant Bee Optimization Graph Neural Network (MABO-GNN) for addressing optimization, classification, and multilingual translation tasks. MABO-GNN integrates ant bee optimization algorithms with graph neural networks to provide a versatile framework capable of optimizing objective functions, improving classification accuracy iteratively, and facilitating high-quality translations across multiple languages. Through comprehensive experimentation, the efficacy of MABO-GNN is demonstrated across various tasks, languages, and datasets. In optimization experiments, MABO-GNN achieves objective function values of 0.012, 0.015, 0.011, and 0.013 in Experiments 1 through 4, respectively, with convergence times ranging from 90 to 150 seconds. In classification tasks, the model exhibits notable performance improvements over iterations, with BLEU scores reaching 0.84 and METEOR scores reaching 0.78 in the fifth iteration. The translation results showcase BLEU scores of 0.85 for English, 0.82 for French, 0.79 for German, 0.81 for Spanish, and 0.75 for Chinese, indicating the model's proficiency in generating high-quality translations across diverse languages.
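The core GNN operation the abstract relies on (representing text as a graph and aggregating neighbor information) can be illustrated with one round of GCN-style normalized message passing over a tiny word graph. The graph, features, and weights below are made-up toy values, not anything from MABO-GNN itself.

```python
import numpy as np

# Adjacency with self-loops for a 4-node graph
# (e.g., words linked by cross-lingual co-occurrence)
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)

# Symmetric normalization: D^{-1/2} A D^{-1/2}
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt

X = np.random.default_rng(0).normal(size=(4, 8))  # node (word) features
W = np.random.default_rng(1).normal(size=(8, 3))  # learnable projection

# One message-passing layer: ReLU(A_hat @ X @ W)
H = np.maximum(A_hat @ X @ W, 0)
print(H.shape)  # (4, 3)
```

Each node's new embedding mixes its own features with its neighbors', which is what lets a GNN encode relational structure between words across languages.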
... The ongoing advancement in the Internet of Things (IoT) technology and deep learning presents a promising avenue to deliver intelligent, real-time, and personalized sports information dissemination services for individuals with disabilities by integrating these cutting-edge technologies [13], [14], [15]. IoT sensor technology facilitates the capture of intricate scene details, while deep learning models excel in comprehending and processing such complex data, thereby enhancing the overall sports viewing experience for individuals with disabilities [16], [17], [18]. ...
Article
Full-text available
The ever-growing landscape of Internet of Things (IoT) technology and the evolution of deep learning algorithms have ushered in transformative changes in the communication strategy for disseminating information on disabled sports. This specialized information resource aims to provide relevant support and services related to sports activities for disabled individuals. This study investigates the communication strategy of disabled sports information driven by deep learning within the framework of the IoT and assesses the practical application performance of the proposed model. To achieve this objective, an appropriate deep learning model for the dissemination of sports information for the disabled is selected through a thorough literature review. Subsequently, an experimental framework is proposed for comprehensive performance verification, evaluating the model’s performance in reasoning time and user satisfaction through comparative experiments. By constructing deep learning models, extensive data on disabled sports activities are analyzed, enabling the identification and prediction of key factors in information dissemination. The results indicate that the proposed sports information dissemination model outperforms similar models across various performance metrics, particularly in real-time performance and user experience. Comparative analysis with attention-based deep neural networks and traditional machine learning algorithms reveals that the proposed model achieves an accuracy rate as high as 0.85, significantly surpassing the 0.78 and 0.82 accuracies of these models, respectively. Moreover, the proposed model demonstrates the shortest inference time (15ms), surpassing both aforementioned models. This study validates the relative advantages of the proposed model through comparison with similar studies, offering a novel solution for the dissemination of sports information for the disabled.
... Foundational models refer to large-scale language models that serve as the basis or foundation for various downstream applications and tasks [2][3][4]. They have become a fundamental building block for a wide range of AI applications covering natural language understanding (text classification including sentiment analysis, spam detection and topic categorization, named entity recognition, language translation, etc.) [5][6][7][8], text generation (content creation, code generation for programming languages, etc.) [9,10], question answering, conversational AI [11,12], language summarization [13,14], content recommendation and moderation [15,16], search engines, web pages and documents ranking [17], and data extraction and knowledge graph creation [18,19]. ...
Article
Full-text available
In recent years, transformer-based models have played a significant role in advancing language modeling for natural language processing. However, they require substantial amounts of data and there is a shortage of high-quality non-English corpora. Some recent initiatives have introduced multilingual datasets obtained through web crawling. However, there are notable limitations in the results for some languages, including Spanish. These datasets are either smaller compared to other languages or suffer from lower quality due to insufficient cleaning and deduplication. In this paper, we present esCorpius-m, a multilingual corpus extracted from around 1 petabyte of Common Crawl data. It is the most extensive corpus for some languages with such a level of high-quality content extraction, cleanliness, and deduplication. Our data curation process involves an efficient cleaning pipeline and various deduplication methods that maintain the integrity of document and paragraph boundaries. We also ensure compliance with EU regulations by retaining both the source web page URL and the WARC shared origin URL.
Article
Full-text available
Social media has gained huge importance in our lives wherein there is an enormous demand of getting high social popularity. With the emergence of many social media platforms and an overload of information, attaining high popularity requires efficient usage of hashtags, which can increase the reachability of a post. However, with little awareness about using appropriate hashtags, it becomes the need of the hour to build an efficient system to recommend relevant hashtags which in turn can enhance the social popularity of a post. In this paper, we thus propose a novel method hashTag RecommendAtion for eNhancing Social popularITy to recommend context-relevant hashtags that enhance popularity. Our proposed method utilizes the trending nature of hashtags by using post keywords along with the popularity of users and posts. With the prevalent evaluation techniques of this field being quite unreliable and non-uniform, we have devised a novel evaluation algorithm that is more robust and reliable. The experimental results show that our proposed method significantly outperforms the current state-of-the-art methods.
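The idea sketched in the abstract (scoring candidate hashtags by combining keyword relevance with a popularity/trend signal) can be illustrated with a few lines of plain Python. The weighting scheme, data, and `trend` field are illustrative assumptions, not the paper's actual method or evaluation algorithm.

```python
def score_hashtags(post_keywords, hashtag_stats, alpha=0.7):
    """Rank hashtags by a weighted mix of keyword overlap and trendiness."""
    scores = {}
    for tag, info in hashtag_stats.items():
        # Fraction of the post's keywords covered by this hashtag's topic words
        overlap = len(post_keywords & info["keywords"]) / max(len(post_keywords), 1)
        scores[tag] = alpha * overlap + (1 - alpha) * info["trend"]
    return sorted(scores, key=scores.get, reverse=True)

stats = {
    "#worldcup": {"keywords": {"football", "goal", "match"}, "trend": 0.9},
    "#recipes":  {"keywords": {"cook", "dinner"},            "trend": 0.4},
}
print(score_hashtags({"football", "match"}, stats))  # ['#worldcup', '#recipes']
```

Tuning `alpha` trades off content relevance against riding a trending tag, which mirrors the popularity-versus-relevance tension the paper addresses.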
Article
Full-text available
Despite growing efforts to halt distasteful content on social media, multilingualism has added a new dimension to this problem. The scarcity of resources makes the challenge even greater when it comes to low-resource languages. This work focuses on providing a novel method for abusive content detection in multiple low resource Indic languages. Our observation indicates that a post’s tendency to attract abusive comments, as well as features such as user history and social context, significantly aid in the detection of abusive content. The proposed method first learns social and text context features in two separate modules. The integrated representation from these modules is learned and used for the final prediction. To evaluate the performance of our method against different classical and state-of-the-art methods, we have performed extensive experiments on SCIDN and MACI datasets consisting of 1.5M and 665K multilingual comments, respectively. Our proposed method outperforms state-of-the-art baseline methods with an average increase of 4.08% and 9.52% in the F1 score on SCIDN and MACI datasets, respectively.
Chapter
Video tag recommendation is not just necessary but also mandatory in the present-day scenario where multimedia content, specifically videos, is trending and becoming viral on the internet. In this paper, a video recommendation framework DTagRecPLS that is semantically driven and knowledge-centric has been proposed. It extracts the categories from the video dataset and enriches them by subjecting them to Latent Semantic Indexing. The proposed framework is ontology centered as ontology alignment to the enriched categories of the videos with that of the standard domain ontologies has been achieved and moreover, ontology-driven knowledge harvesting from differential heterogeneous knowledge stores has been used to enrich the number of instances and format them into an enriched knowledge pool. The model encompasses a deep learning framework namely the Convolutional Neural Network to classify video datasets from the perspective of using the actual video and image features, while the Logistic Regression Classifier classifies the dataset by extracting entities from the enriched feature pool with a perspective of annotations and labels. The common incidences are then used to compute the semantic similarity with that of the enriched knowledge pool and are sent for ranking and review. The DTagRecPLS gives the highest precision of 97.09%, highest average recall of 98.72%, highest overall accuracy and F-Measure of 97.91% and 97.90% respectively, and an overall average lowest False Discovery Rate of 0.03.
Keywords: Video tag recommendation; Preferential learning; Differential semantics; Diversification; Convolutional neural networks
Article
Religion is a key factor in how American consumers spend their time and money. It serves as a significant component of the U.S. economy, with religiously affiliated people contributing trillions to the economy annually. The majority of religious consumers in the U.S. are Christians, making them a critical segment for marketers. Influencer marketing, which involves the use of endorsements from individuals with large social media followings, has emerged as an effective advertising tactic for reaching Christians on social media. However, there is a lack of research exploring the complexities of religion in advertising messaging, especially in the context of influencer marketing. To fill this gap, we apply the social identity, persuasion knowledge, and symbolic interactionism theories to propose relationships between message cues in Christian influencers' social media posts and follower engagement. We analyzed 20,068 Facebook posts, 20,517 tweets, and 13,857 Instagram posts to determine the impact of three categories of message cues on engagement. Across multiple studies, key findings indicate religious and promotional cues increase and decrease engagement across platforms, respectively. The impact of social media cues, such as hashtags and mentions, differs depending on the platform.
Article
The development of deep neural networks and the emergence of pre-trained language models such as BERT allow to increase performance on many NLP tasks. However, these models do not meet the same popularity for tweet stream summarization, probably because their computational limitations require drastically truncating the textual input. Our contribution in this article is threefold: (1) we propose a neural model to automatically and incrementally summarize huge tweet streams. This extractive model combines in an original way pre-trained language models and vocabulary frequency-based representations to predict tweet salience. An additional advantage of the model is that it automatically adapts the size of the output summary according to the input tweet stream, (2) we detail an original methodology to construct tweet stream summarization datasets requiring little human effort, and (3) we release the TES 2012-2016 dataset constructed using the aforementioned methodology. Baselines, oracle summaries, gold standard, and qualitative assessments are made publicly available. To evaluate our approach, we conducted extensive quantitative experiments using three different tweet collections as well as an additional qualitative evaluation. Results show that our method outperforms state-of-the-art ones. We believe that this work opens avenues of research for incremental summarization, which has not received much attention yet.
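The vocabulary-frequency side of the salience signal mentioned above can be sketched in plain Python: score each tweet by the mean corpus frequency of its words and keep the most salient one. The tweets and the one-tweet "summary" are toy assumptions; the paper's model additionally combines this with pre-trained language model representations.

```python
from collections import Counter

tweets = [
    "earthquake hits the coast",
    "major earthquake felt across the coast tonight",
    "my lunch was great",
]

# Corpus-level word frequencies act as the salience signal
freq = Counter(w for t in tweets for w in t.split())

def salience(tweet):
    """Mean corpus frequency of the tweet's words."""
    words = tweet.split()
    return sum(freq[w] for w in words) / len(words)

# Keep the most salient tweet as a minimal extractive summary
summary = max(tweets, key=salience)
print(summary)
```

In a streaming setting the `freq` counter would be updated incrementally as tweets arrive, which is what makes frequency-based representations cheap enough to pair with heavier language-model features.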
Chapter
We use XLM (Cross-lingual Language Model), a transformer-based model, to perform sentiment analysis on Kannada-English code-mixed texts. The model was fine-tuned for sentiment analysis using the KanCMD dataset. We assessed the model’s performance on English-only and Kannada-only scripts. Also, Malayalam and Tamil datasets were used to evaluate the model. Our work shows that transformer-based architectures for sequential classification tasks, at least for sentiment analysis, perform better than traditional machine learning solutions for code-mixed data.
Keywords: DICT-MLM; Task adaptive pre-training; Domain adaptive pre-training; Transfer learning; Transductive transfer; LSTM; Pseudo labelling
Chapter
There has been a lot of interest in the recent year in recommending hashtags for images/videos or posts on social media. Several researchers have researched the impact from numerous perspectives. In this paper, we enhance tag recommendation by recommending suitable hashtags considering both the content of the image/video and the user’s hashtag history. On social media image/video-sharing websites (such as Facebook, Instagram, Flickr, and Twitter), users can upload images or videos and annotate them with tags. The proposed method generates candidate keywords, i.e., hashtags, by combining techniques for textual tags, image and video activity/object recognition content, and acoustic data. To this end, this paper examines different methodologies that associate multi-modal information and suggests hashtags so that image or video uploaders can generate tags for their images or videos. Although a substantial amount of study has been carried out on item/product recommendations for E-commerce websites, video recommendations for YouTube and Netflix, and friend suggestions on social media websites, research has not been carried out as much on hashtag recommendations for images/videos on social media platforms/apps/websites, which have now turned out to play a vital role on these social media platforms. Here, this paper provides an overview of hashtag recommendation for images/videos.