Multilingual Personalised Hashtag Recommendation for Low
Resource Indic Languages using Graph-based Deep Neural
Network
Shubhi Bansala(phd2001201007@iiti.ac.in), Kushaan Gowdaa(cse190001031@iiti.ac.in),
Nagendra Kumara(nagendra@iiti.ac.in)
aDepartment of Computer Science and Engineering, Indian Institute of Technology, Indore, India
Corresponding Author:
Shubhi Bansal
Department of Computer Science and Engineering, Indian Institute of Technology, Indore, India
Email: phd2001201007@iiti.ac.in
This paper is accepted in Expert Systems with Applications, 2023.
DOI: https://doi.org/10.1016/j.eswa.2023.121188
Abstract
Users from dierent cultures and backgrounds often feel comfortable expressing their thoughts
on trending topics by generating content in their regional languages. Recently, there has
been an explosion in multilingual information, and a massive amount of multilingual textual
data is added daily on the Internet. Using hashtags for multilingual low-resource content can
be an eective way to overcome language barriers because it allows content to be discovered
by a wider audience and makes it easier for people interested in the topic to nd relevant
content, regardless of the language in which it was written. To account for linguistic diversity
and universal access to information, hashtag recommendation for multilingual low-resource
content is essential. Several approaches have been put forth to recommend content-based
and personalized hashtags for multimodal content in high-resource languages. Data avail-
ability and linguistic dierences often limit the development of hashtag recommendation
methods for low-resource Indic languages. Hashtag recommendation for tweets dissemi-
nated in low-resource Indic languages has seldom been addressed. Moreover, personaliza-
tion and language usage aspects to recommend hashtags for tweets posted in low-resource
Indic languages have yet to be explored. In view of the foregoing, we propose an automated
hashtag recommendation system for tweets posted in low-resource Indic languages dubbed
as TAGALOG, capable of recommending personalized and language-specic hashtags. We
employ user-guided and language-guided attention mechanisms to distill indicative features
from low-resource tweets according to the user’s topical and linguistic preferences. We pro-
pose a graph-based neural network to mine users’ posting behavior by connecting historical
tweets of a particular user and language relatedness by linking tweets according to language
families, i.e., Indo-Aryan and Dravidian. Experimental results on the curated dataset from
Twitter demonstrate that the proposed model outperformed recognized pre-trained language
models and extant research, showing an average improvement of 12.3% and 12.8% in the
F1-score, respectively. TAGALOG recommends hashtags that align with the user’s interests
and linguistic predilections, leading to a heightened level of tailored and engaging user ex-
perience. Personalized and multilingual hashtag recommendation systems for low-resource
Indic languages can help to improve the discoverability and relevance of content in these
languages.
Keywords: Multilingual Text, Low-Resource Languages, Indic Languages, Hashtag
Recommendation, Graph Convolutional Networks
Preprint submitted to Expert Systems with Applications October 18, 2023
1. Introduction
Due to the active participation of users on Social Networking Services (SNS) like Twitter1
or Facebook2, real-time news and trends can now reach anywhere regardless of geographical
location or time dierence. Twitter users create nearly 500 million tweets daily Dusart
et al. (2023), immediately disseminating information about current events and trending
topics. Tweets are user-generated messages with a specied character limit that provide
scant and ambiguous information. More context and understanding of the subject matter
are frequently required to grasp a tweet’s message better. Hashtags are words preceded
by an octothorpe (#) symbol that clarify, decipher, and enrich tweets’ content by adding
information about the subject, sentiment, and attitude. Hashtags are an integral part of
Twitter and help categorize content so users can easily find it. Statistics indicate that tweets
with hashtags receive twice the engagement of tweets without them (Myers et al., 2023),
making them a great way to spread content.
Twitter public conversations have affected popular discourse and modern culture since,
on Twitter, information spreads across languages and countries. Regionally specific content
generates much traction. In the realm of Twitter, English emerges as the predominant
language, encompassing almost 53% of the total volume of tweets3. It is worth noting that
Twitter is experiencing a surge in popularity in various nations, particularly in regions where
languages with fewer resources are prevalent. India, for instance, constitutes the third largest
market for Twitter in terms of user base, trailing behind the United States and Japan,
boasting an impressive daily active user count of 22.1 million4. By providing support for
vernacular languages and allowing users to converse in Indic languages, Twitter has trans-
formed the spread of content and its reachability. Twitter research from 2019 shows that 51%
of Indian users tweet in English and 49% in other languages5. More and more Indian users
have now begun to tweet on trending topics in their native tongues. According to the census
of 2001 (Pandey & Jha, 2021), 1,635 rationalized mother tongues, 234 identifiable mother
tongues, and 22 major languages are spoken in India. It is possible to present semantically
related posts across various sources and languages. These posts cannot be directly matched
due to language script differences and morphology. It poses a problem when linking and
accessing tweets in multiple languages that exhibit semantic similarity and belong to the
same topics. Hashtags come to the rescue as they can be used as matching criteria for
semantically related posts across different data sources. Unfortunately, despite hashtags’ value, very
few tweets use them. The volume of tweets posted during events of widespread interest is
overwhelming, making it challenging to weed out irrelevant tweets while searching for the
1. https://twitter.com/
2. https://www.facebook.com/
3. https://semiocast.com/top-languages-on-twitter-stats/
4. https://backlinko.com/twitter-users#twitter-users
5. https://telanganatoday.com/twitter-giving-people-more-control-over-conversations-in-india
most pertinent information. Users follow international news and events on Twitter, but it
is hard to nd hashtags for topics in languages other than English. Local content creators
and brands intend to reach a broader audience on social media. Language learners intend
to nd engaging content and connect with other learners in their target language. However,
they need help nding relevant hashtags in multiple languages. Researchers studying mul-
tilingualism and language contact tend to use relevant hashtags for research but need help
seeing them in multiple languages. A tool that recommends multilingual hashtags would
save time and improve the visibility of their work, help content creators discover pertinent
content, and build connections, making finding and engaging with content from diverse
communities easier. To effectively retrieve relevant content while overcoming information
scarcity and the ambiguous nature of tweets, we frequently need to annotate hashtags to
tweets. Manual hashtag annotation takes time and money. Thus, developing automated
hashtag recommendation systems is the need of the hour as it drastically reduces the need
for human annotation while facilitating content categorization and management. Accord-
ing to the statistics on our collected dataset for tweets posted in multiple low-resource Indic
languages, up to 24.16% of the 31,07,866 tweets have fewer than two hashtags. Therefore,
creating a system to suggest hashtags for low-resource Indic tweets is a worthwhile and pressing
research topic. These factors motivate us to develop a novel polyglot model for low-resource
Indic tweets that can automatically recommend meaningful hashtags for tweets.
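As a back-of-the-envelope illustration of the statistic above, the share of tweets carrying fewer than two hashtags can be computed with a simple counter. The regex and helper names below are illustrative assumptions, not the authors’ actual preprocessing pipeline.

```python
import re

# \w matches Unicode letters and digits, so Indic-script hashtags are
# detected too (combining vowel signs may truncate a match, which is
# fine for counting purposes).
HASHTAG = re.compile(r"#\w+")

def hashtag_count(tweet_text):
    """Number of hashtags appearing in one tweet."""
    return len(HASHTAG.findall(tweet_text))

def share_with_few_hashtags(tweets, threshold=2):
    """Fraction of tweets carrying fewer than `threshold` hashtags."""
    few = sum(1 for t in tweets if hashtag_count(t) < threshold)
    return few / len(tweets)
```

Applied to a curated corpus, a ratio like the reported 24.16% falls directly out of `share_with_few_hashtags`.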
Prior works have attempted to recommend hashtags for textual (Kumar et al., 2021;
Mao et al., 2022; Chakrabarti et al., 2023), visual (Hachaj & Miazga, 2020; Park et al.,
2016; Kurunkar et al., 2022), and multimodal content (Djenouri et al., 2022; Panchal &
Prajapati, 2023; Yang & Lin, 2022; Nama & Deepak, 2023). Efforts have been made to
suggest personalized (Wei et al., 2019; Padungkiatwattana & Maneeroj, 2022) hashtags by
considering content, user, and metadata information. Despite the extensive research for
hashtag recommendation via leveraging textual content, researchers have primarily focused
on high-resource languages, namely English (Zhang et al., 2019; Wang et al., 2019) and
Chinese (Kou et al., 2018; Javari et al., 2020; Mao et al., 2022). However, recommending
hashtags for content generated in low-resource Indic languages on social media platforms is
mainly unexplored. Indic languages are considered low-resource owing to the unavailability
of many written texts, audio recordings, or other digital resources. In low-resource language
settings, the data can often be noisy or incomplete. The existing methods to recommend
hashtags for content written in high-resource languages cannot be applied directly to low-
resource languages. The reason is that the development of linguistic knowledge requires
specialized expertise or a native speaker’s proficiency in that language.
(Zhang et al., 2019) employed a parallel co-attention technique to simulate the correlation
of visual and textual information constituting the post. The authors consider the similarity
of the current post with the user’s historical posts to capture his tagging behavior and
suggest plausible hashtags for his current post. One drawback of using similarity with
historical posts is that it may not account for changes in the user’s interests or posting
habits over time, potentially leading to less relevant hashtag recommendations. (Jeong et al.,
2022) recommended hashtags based on post content and user demographic information. The
authors computed the similarity of demographic features with content features to recommend
Figure 1: Tweets of a User U
plausible hashtags. If the system relies solely on demographic data to recommend hashtags,
it may not accurately predict the user’s preferences. A user may have a unique interest in
a topic not commonly discussed by others in their demographic group. Users’ individual
preferences or behaviors that do not necessarily align with the general trends or patterns
observed in the larger population are known as idiosyncrasies. Modeling idiosyncrasies in
social media posts helps mitigate potential biases from relying solely on demographic or user
profile data. It also aids in the identification of patterns and trends that may not be visible
through demographic or user profile data alone. (Zhang et al., 2022) created a bipartite
graph comprising tweets and users to mine socially similar tweets and predict hashtags
for multilingual content. Despite this, TwHIN-BERT fails to recommend hashtags following
users’ interests and language usage style. The user who creates a post can provide important
contextual information about the post, such as the user’s interests, preferences, expertise,
language choice, and usage style.
An illustrative example from Twitter is seen in Fig. 1, where a particular user has posted
two dierent tweets yet used similar hashtags, indicating his topic of interest. In the rst
tweet, the user wishes Happy Flowers Day and annotates it with #phooldei. Phooldei is a
festival of owers and springtime celebrated in Uttarakhand. According to tweet content, he
assigned #owers, #Uttarakhand and #nature. In the second tweet, he emphasizes living
in the present through lines of a Hindi Bollywood song. According to the tweet content, he
annotates #present, #moment, and #songs to his tweet. The tweet has no relation with
owers, yet he assigns #owers and #nature to the second tweet, reecting his interest in
topics, i.e., nature and owers. Therefore, mining information from users’ posts can help to
understand their personal preferences and identify patterns in their posting behavior. This
results in a richer and more comprehensive understanding of how users engage with content
on social media platforms. Twitter users often develop their style and tone when tweeting,
which can be inuenced by their personality, background, interests, and communication
style. Some users may use a lot of slang and abbreviations, while others may use more
formal language and punctuation. Some users may use a lot of humor and sarcasm in
their tweets, while others may be more serious and straightforward. Users’ unique and
personal characteristics of language usage include vocabulary, punctuation, and emojis. It
is, therefore, essential to capture the highly idiosyncratic language patterns to comprehend
the dierences and commonalities in language use across users.
Additionally, the user from Fig. 1 recommends hashtags in the same language (Hindi).
He has also transliterated the Devanagari hashtag #फूलदेई to English #phooldei. This
emphasizes that the user tends to take language into consideration when posting tweets and
annotating hashtags. On the contrary, TwHIN-BERT doesn’t consider the user’s linguistic
preferences and also fails to capture relatedness among languages. Language relatedness
refers to the degree of similarity between different languages regarding their grammar, vocabulary, and other linguistic fea-
tures. Closely related languages share many similarities, while distantly related languages
may have fewer similarities. Modeling relatedness among languages in a language family
assists in overcoming some of the corpora limitations of low-resource languages by leverag-
ing shared knowledge and resources. This approach is particularly valuable in multilingual
settings, where users speak multiple languages within the same language family.
In this paper, we devise an automatic hashtag recommendation system for orphan
tweets posted in low-resource Indic languages, dubbed TAGALOG, that leverages tweet
content, language relatedness, and user preferences to recommend topic-relevant, personalized,
and language-focused hashtags. We refine tweet representations in line with language
usage style and user interests by employing language-guided and user-guided attention
mechanisms. We employ a graph neural network to capture relatedness among languages
of separate families (Indo-Aryan and Dravidian) and user posting behavior. The recommended
hashtags can be used to identify the main content for specific topics regardless of the
language. Our proposed system can help regional-language Twitter users to effectively retrieve
content and keep up to date with the latest information.
Below are the key highlights of our contributions.
1. We devise a deep learning-based graph neural network to suggest semantically related,
personalized, and language-specific hashtags for tweets posted in low-resource Indic
languages.
2. We not only capture the distinct topical and linguistic inclinations of individual users
on a local scale but also their long-term behavior and global interests.
3. On a local scale, we refine the content of tweets by devising a novel way of attending
to users’ topical interests and language usage style.
4. Globally, we construct a graph to model users’ interactions with tweets by considering
their historical tweets and capturing the long-term posting behavior.
5. We also leverage relatedness among languages belonging to the same language family.
The framework can mine correlation among languages of the same family group, i.e.,
Indo-Aryan and Dravidian.
6. We have constructed a new text-based hashtag recommendation dataset containing
tweets in Indic languages called Indic Hash. The collected tweet samples span various
low-resource languages: Bangla, Marathi, Gujarati, Telugu, Tamil, Kannada, and
Hindi besides English. Our curated dataset can be a primary resource to recommend
hashtags for tweets posted in Indic regional languages.
7. Our experimental ndings show that the proposed hashtag recommendation model
performs well in a low-resource environment with a minimal amount of labeled data.
The subsequent sections of the paper are arranged in the following manner. Section 2
outlines related work in hashtag suggestion while touching upon Indic languages. Section 3
formalizes the multilingual hashtag suggestion task. Section 4 focuses on our proposed
approach. Section 5 describes the experimental setup, outcomes, and analysis of the studies.
Section 6 outlines the limitations, practical implications, and potential applications of the
proffered system. The concluding remarks are mentioned in Section 7.
2. Related Work
This section provides a high-level summary of the work pertaining to the domain of
hashtag recommendation followed by low-resource Indic languages and multilingual hashtag
prediction.
2.1. Hashtag Recommendation
In this part, we rst discuss several works that recommend personalized hashtags. Fol-
lowing that, we outline Graph Convolutional Network (GCN)-based techniques for hashtag
recommendation.
2.1.1. Personalised Hashtag Recommendation
Non-personalized hashtag recommendations (Tang et al., 2019; Ma et al., 2019; Kaviani
& Rahmani, 2020; Yang et al., 2020a) are limited in their capacity to offer personalized sug-
gestions since they only account for content-based factors while neglecting user preferences.
In essence, these recommendations are generated based solely on the textual semantics of the
content, potentially leading to mismatches with user preferences. In response, personalized
hashtag recommendations have been proposed, aiming to leverage both content information
and user preferences to provide personalized recommendations.
(Zhang et al., 2019) employed a parallel co-attention technique to simulate the correla-
tion of visual and textual information constituting the post. The authors also consider the
similarity of the current post with the user’s historical posts to capture his tagging behavior
and suggest plausible hashtags for his current post. One drawback of using similarity with
historical posts is that it simply considers the content of posts without taking into consider-
ation the larger network of connections between users and posts. This can make it difficult
to capture more subtle patterns of user behavior, such as the impact of social networks, com-
munity norms, or user demographics on post engagement. To model users’ extensive posting
histories for tailored hashtag recommendation tasks, (Peng et al., 2019) put forth a unique
neural memory network that incorporates both textual material and hashtags. This model
is equipped with a gating mechanism to tackle scenarios where hashtag usage is entirely
unrelated to earlier posts. To suggest personalized hashtags, (Jeong et al., 2022) presented
an attention-based neural network that used user demographic data derived from their selfie
photographs along with textual and visual information. (Padungkiatwattana & Maneeroj,
2022) put forth a personalized hashtag recommender PAC-MAN, which integrates a multi-
tude of high-order relations to represent users and hashtags. A Multi-relational Attentive
Network (MAN) uses a GNN to record relationships between hashtags and users, users
and users, and hashtags and hashtags. PAC-MAN is a Person-And-Content-Based BERT
(PAC) that blends MAN user representation with content customization at the word level.
Finally, the authors execute a hashtag prediction task with MAN hashtag representations
incorporated into BERT to model sequenceless hashtag correlations.
2.1.2. GCN-based Hashtag Recommendation
GCN (Kipf & Welling, 2016) was initially introduced as a method to address the chal-
lenges associated with semi-supervised learning. (Wei et al., 2019) employed GCN strategies,
such as information diusion and attentiveness, to acquire micro-video and hashtag repre-
sentations that reect user choices. The resulting user-specic representations enable the
calculation of the similarity score of hashtags with respect to micro-videos facilitating more
eective hashtag recommendations. (Mehta et al., 2021) co-learned latent embeddings of
features gleaned from extended videos and semantic embeddings of prominent hashtags
on social media platforms. The authors adopt GCN to anticipate relationships between
videos and hashtags in a heterogeneous graph and recommend popular hashtags for videos.
To recommend micro-video hashtags, (Li et al., 2019) introduced a multi-view representation
interactive embedding model that uses graph-based information propagation. The model in-
tegrates hashtag associations, multiview learning, and video-user-hashtag interaction, with
a graph directing the spread of information among hashtags. This method establishes a
consistent pattern of relatedness between hashtags, which considerably improves the effectiveness
of hashtag recommendations for both popular and long-tail hashtags. (Chen et al.,
2021) created an image similarity graph to illustrate the relationship between posts assuming
visually comparable images use similar hashtags. The Triplet Attention module captures
the inuence of visuals, captions, and users to derive node features. Aggregated Graph Con-
volution component learns the attended features and spreads information among vertices to
suggest suitable hashtags.
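The GCN machinery common to the works above follows the layer-wise propagation rule of (Kipf & Welling, 2016). The NumPy sketch below uses illustrative shapes and a ReLU activation; it is a minimal single layer, not any specific recommender’s architecture.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W).

    A: (n, n) adjacency matrix of the graph,
    H: (n, d_in) node feature matrix,
    W: (d_in, d_out) learnable weight matrix.
    """
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    # Symmetric normalization: D^{-1/2} A_hat D^{-1/2}
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)  # ReLU activation
```

Stacking such layers lets each node (e.g., a micro-video or hashtag) aggregate features from progressively larger neighborhoods before similarity scoring.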
2.2. Low-Resource Languages and Multilingual Hashtag Prediction
The task of suggesting hashtags for textual content can be posed as one of the tradi-
tional problems in Natural Language Processing (NLP), i.e., text categorisation (Li et al.,
2022; Dogra et al., 2022; Li et al., 2023; Lei et al., 2020). As far as we are aware, although
many works have been carried out for classifying text in low-resource Indic languages (Pathak
& Jain, 2022; Sanghvi et al., 2023; Rehman et al., 2023), there is only one work that predicts
hashtags for multilingual content (Zhang et al., 2022).
Low-resource languages (LRLs), also known as “less studied, under-resourced, low den-
sity” languages are languages with limited linguistic resources, such as textual material,
language processing tools, grammar and speech databases, dictionaries, and human com-
petence (Besacier et al., 2014). These languages are frequently spoken by small groups,
lack standardized writing systems, and have a scarce digital presence. Researchers in NLP
distinguish LRLs based on the availability of data and NLP tools. LRLs have a relatively
small amount of data, i.e., text corpora and parallel corpora, and lack language-specific tools
such as spell checkers and grammar checkers, as well as manually crafted linguistic resources for
training NLP models. There are a number of advantages to working with low-resource
languages: the potential to impact the lives of people who speak these languages, the
opportunity to develop new NLP techniques that can be applied to other languages, and
the challenge of working with limited data. Linguists, researchers, and organizations are
making efforts to document languages, construct corpora, develop technology and tools,
and run community-driven language revival campaigns for LRLs, since LRLs offer substantial
benefits, some of which are listed below.
Social Inclusion: Strengthening LRLs promotes inclusion and gives underrepresented
communities a voice online. They can use it to interact with technology, participate
in online debates, and get information in their language.
Enhanced Cross-Cultural Understanding: Supporting and researching LRLs stimu-
lates collaboration across diverse linguistic communities and improves cross-cultural
understanding. It helps to bridge barriers and promote mutual tolerance and appre-
ciation for other cultures and languages.
Enhanced Communication: Supporting low-resource Indic languages enables effective
communication and understanding within linguistic communities. It strengthens inter-
generational bonds, fosters social cohesion, and promotes local participation in various
social, cultural, and economic activities.
Economic Opportunities: Developing language technologies, content, and services for
low-resource Indic languages might lead to the emergence of industries, such as lo-
calization services, translation, interpretation, content creation, and digital platforms
aimed at specic linguistic communities.
Due to small corpora and unseen scripts, labeled data for diverse Indic languages is sparse
or nonexistent in real applications compared to high-resource languages like English and Chi-
nese. To get beyond corpus restrictions inherent in low-resource languages, (Khemchandani
et al., 2021) proposed RelateLM to effectively customize language models for low-resource
languages. Since numerous Indic scripts descended from Brahmi script, the authors take
advantage of script relatedness through transliteration. RelateLM artificially translates rel-
atively well-known language content into low-resource language corpora using comparable
sentence structures to get around corpus limitations. (Aggarwal et al., 2021) performed
zero-shot text classication for Indic languages by leveraging lexical similarity. To this end,
the authors performed script conversion to Devanagari and divided words into sub-words to
optimize the vocabulary overlap among the related Indic languages datasets. (Khatri et al.,
2021) investigated the inuence of sharing encoder-decoder parameters between related lan-
guages in Multilingual Neural Machine Translation. They developed a system trained from
the languages by grouping them based on language family i.e., Indo-Aryan (group) to En-
glish and Dravidian (group) to English. Then, the authors convert the entire language data
to the same script, which helps the model learn better translation by utilizing shared vo-
cabulary. This approach obscures the underlying structural similarities between languages.
Language families are typically dened based on shared ancestry and historical relationships
between languages. Transliteration-based methods may not accurately capture these rela-
tionships between languages, as they focus primarily on the surface features of languages,
which can lead to inaccurate results for downstream tasks. (Marreddy et al., 2022) put for-
ward a supervised graph-reconstruction approach called Multi-Task Text GCN. This method
utilizes a Graph AutoEncoder (GAE) (Schlichtkrull et al., 2018) to learn the latent word and
sentence embeddings from a graph, which are employed to carry out Telugu text categorization
for various downstream tasks.
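The script-relatedness trick discussed above (mapping related Indic scripts onto a common script such as Devanagari) can be approximated by exploiting the largely parallel layout of the Indic Unicode blocks, which are each 128 code points wide. This character-offset sketch is an illustration only; real systems like RelateLM handle the script-specific exceptions this ignores.

```python
def to_devanagari(text):
    """Map letters from Bengali through Malayalam blocks onto Devanagari
    by Unicode-block offset. Because the Indic blocks inherit a common
    ISCII-derived layout, the same in-block offset usually yields the
    corresponding Devanagari letter; exceptional code points are ignored.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x0980 <= cp <= 0x0D7F:  # Bengali .. Malayalam blocks
            block_start = cp - ((cp - 0x0900) % 0x80)
            out.append(chr(0x0900 + (cp - block_start)))
        else:
            out.append(ch)
    return "".join(out)
```

For example, Telugu క (U+0C15) and Bengali ক (U+0995) both land on Devanagari क (U+0915), letting a shared vocabulary emerge across related scripts.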
(Zhang et al., 2022) proposed a Twitter Heterogeneous Information Network (TwHIN-
BERT) to anticipate hashtags for multilingual content. The authors employ Approximate
Nearest Neighbor (ANN) search to identify pairs of socially appealing tweets. This method
falls short of capturing the user’s language and topical choices. Furthermore, it does not
take linguistic relatedness within language groups into account to address the low-resource
nature of numerous languages featured in the dataset.
Therefore, a quick assessment reveals that research has primarily focused on text-only,
image-only, or multimodal information posted in high-resource languages, i.e., English and
Chinese. These studies do not consider recommending hashtags for content posted in low-
resource languages. To tackle this issue, we propose a novel polyglot paradigm i.e., TAGA-
LOG, which extracts the content-based, user-based, and language-based features to recom-
mend personalized and language-specific hashtags for content created in low-resource Indic
languages.
3. Problem Denition
Let us consider a dataset with a tweet set $T = \{t_i\}_{i=1}^{|T|}$, a set of users
$U = \{u_j\}_{j=1}^{|U|}$, a set of hashtags $H = \{h_k\}_{k=1}^{|H|}$, and a set of languages
$L = \{IA(\text{Hindi, Gujarati, Marathi, Bangla}), D(\text{Kannada, Tamil, Telugu}), \text{English}\}$.
Here, $|T|$, $|U|$, and $|H|$ denote the cardinality of the tweet set, user set, and hashtag set.
$IA$ and $D$ refer to the Indo-Aryan and Dravidian family groups.
Given a user $u \in U$ who uploads a tweet $t$ written in language $l \in L$, we aim to recommend
a personalized and language-specific set of hashtags $RH \subseteq H$ that are relevant to the user’s
posting and language usage behavior.
Our objective is to develop a customized hashtag recommendation model for tweets in low-
resource Indic languages that can automatically recommend hashtags from $H$ for a new tweet
$t$ uploaded by a user $u$.
Given a tweet written in $l$ by a user $u$, we intend to learn a function $f(\cdot)$ that can capture
the user’s topical and linguistic preferences:
$$t_u, t_l = f(\mathrm{UGA}(t, u), \mathrm{LGA}(t, l)) \tag{1}$$
Here, UGA refers to the user-guided attention and LGA to the language-guided attention
mechanism; they yield latent user and language representations denoted by $t_u$ and
$t_l$. Hashtags are a potent tool for self-expression because they allow users to succinctly and
rapidly communicate their interests, thoughts, feelings, and views on a certain topic. To
address the variances in hashtag labels that result from how individuals express themselves
and their unique language usage style, we devise two attention mechanisms to fine-tune user
and language representations. To further enhance tweet representation, we aim to learn a
function g(.)to model various types of interactions.
t'_u, t' = g(t_u, t)    (2)

Here, t'_u and t' denote the enhanced user and tweet representations derived from the graph, and g(.) resembles a graph neural network. We employ a graph neural network to model tweet-tweet interactions based on language relatedness and user-tweet interactions. We construct a heterogeneous graph G = (V, E) such that V = (U, T), where V is the set of nodes comprising users and tweets, and E is the set of edges. Each edge e ∈ E is based on either
the relatedness of the language in which the tweet is written with tweets published in other
languages within the same language group or whether the user created that tweet in the
past. Hashtag recommendations can then be formulated as given in Equation 3.
RH = HASH_REC(t'_u, t_l)    (3)

Here, HASH_REC refers to the hashtag recommender, which resembles a deep neural network. It takes the enhanced tweet representation derived from the graph, denoted by t'_u, and the language-guided tweet representation, i.e., t_l, to recommend a reasonable collection of hashtags denoted by RH. We posit that TAGALOG encodes not only the user's topical and linguistic preferences but also relatedness among languages of a family group pertaining to the language in which a tweet is written. The following sections provide more information on UGA, LGA, f(.), g(.), and HASH_REC.
4. Methodology
Figure 2: Overall Architecture of TAGALOG
In this section, we present a detailed overview of our proposed approach. Fig. 2 show-
cases the overview of our innovative polyglot hashtag recommender. We propose a deep
neural network based on graphs to recommend hashtags for tweets posted in multiple Indic
languages. Our system receives a tweet as input, together with information on the language
used in the tweet and the user who posted it. The proposed system first retrieves features from a tweet's textual modality to obtain its low-dimensional feature vector representation. Then we use attention techniques to model how the language and the user affect the representation
of a tweet. We create a graph to capture the correlation between tweets and the interaction
between tweets and users. The node embeddings, which are modified in response to information dissemination and neighborhood aggregation, are fed into the hashtag recommendation
module. After assessing the plausibility of each hashtag, this module yields a sorted list of
hashtags for polyglot tweets. As demonstrated in Fig. 2, our proposed framework comprises
four components: (a) feature extraction; (b) feature refinement; (c) feature interaction; and (d) hashtag recommendation. Each component is discussed in depth below.
4.1. Feature Extraction
In this section, we elucidate the textual, linguistic, and user feature retrieval from tweets.
Textual Feature Retrieval. We encode tweets written in various resource-scarce Indic lan-
guages using Multilingual Bidirectional Encoder Representations from Transformers (Pires
et al., 2019), abbreviated as the mBERT model. Wikipedia articles written in 104 different languages serve as the training data for the multilingual variant of BERT. Since mBERT shares a common input space at the sub-word level, this pre-trained neural language model is utilized to generate context-aware embeddings of tweets posted in different languages.
The input tweet tis enclosed within two special tokens, class (CLS) and separator (SEP)
to signal its start and endpoints. We pass the raw tweet through mBERT’s tokenizer to
produce the corresponding set of tokens as shown in Equation 4.
M = mBERT_Tokenizer([CLS] + t + [SEP])    (4)

Here, M represents the created collection of tokens. The number of tokens in the sequence
denoted by S is capped at 50. We shorten or lengthen the token sequence derived from the tweet to S if it is longer or shorter than S to construct a uniform-sized token sequence for all
tweets. Then, we encode tokens using an mBERT encoder to generate token representations
according to Equation 5.
T_f = mBERT(M)    (5)

The derived textual feature matrix is denoted by T_f ∈ R^{S×D}, where S = 50 denotes the number of tokens derived from the tweet, and D = 768 denotes the embedding size of every token. The textual feature matrix of the encoded tweet is passed to the feature refinement module.
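The fixed-length tokenization step above (capping every sequence at S = 50) can be sketched as follows; the function name and PAD_ID are illustrative placeholders, not the authors' implementation, and a real pipeline would use the tokenizer's own padding id.

```python
# Illustrative sketch: force every token-id sequence to a fixed length S,
# truncating longer sequences and padding shorter ones.
S = 50
PAD_ID = 0  # placeholder for the tokenizer's actual padding token id

def pad_or_truncate(token_ids, length=S, pad_id=PAD_ID):
    """Return a copy of token_ids with exactly `length` entries."""
    if len(token_ids) >= length:
        return token_ids[:length]  # shorten long tweets
    return token_ids + [pad_id] * (length - len(token_ids))  # pad short ones
```

For example, `pad_or_truncate([101, 7592, 102])` yields a 50-element list whose first three entries are the original ids.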
Language Feature Retrieval. Social media language is often informal, abbreviated, and contains hashtags, emojis, and other elements that are specific to these platforms. By learning
language embeddings from a large corpus of social media data, we can better capture these
unique linguistic characteristics and represent them in a way that captures their mean-
ing. Language embeddings are vector representations of words or phrases that are learned
through training on large amounts of text data. Language feature retrieval consists of two steps, namely language identification and language embedding generation.
Language Identication We used the langdetect6library to identify the language in
which tweet tis published. About 50 languages can be recognized by this package, which
is a direct transfer of Google’s language-detection library from Java to Python. Nakatani
Shuyo created the software at Cybozu Laboratories, Inc. We determine the language used
to write the tweet tas depicted in Equation 6.
l = langdetect(t)    (6)

Here, l is the language identified for tweet t.
Language Embedding Generation. Language embeddings are used for tweet representation because they enable us to capture the meaning and context of words used in
tweets. They capture the semantic and syntactic relationships between words, which allows
us to understand the meaning of individual words and the overall context. Using language
embeddings to represent tweets allows us to capture the nuances of language used on social
media platforms. After identifying the language in which the tweet was written, we generate
the feature vector for the language using the Keras embedding layer7, as shown in the equation below.

l_f = Embedding(l)    (7)

Here, l_f ∈ R^D refers to a feature vector representing the language, with a dimensionality (D) of 768.
User Feature Retrieval. User embeddings can be useful in deriving post features because
they capture information about the users who created the posts. In many cases, the user
who creates a post can provide important contextual information about the post, such
as the user’s interests, preferences, or expertise. By incorporating this information into
post features, models can improve their ability to understand and analyze posts. This
can help the model make personalized recommendations that are more relevant to the user’s
interests. The publisher of the tweet t is expressed as u. We encode u into a low-dimensional embedding vector (u_f) by employing the Keras embedding layer, as demonstrated in the following equation.

u_f = Embedding(u)    (8)

Here, u_f ∈ R^D refers to a feature vector representing the user, with a dimensionality of 768. Users' hidden features, such as preferences, may theoretically be captured by user embeddings and used to direct how the tweet representation is learned.
6https://pypi.org/project/langdetect/
7https://keras.io/api/layers/core_layers/embedding/
4.2. Feature Renement
The cornerstones of the feature renement module comprising our proposed model are
language-guided and user-guided attention mechanisms that successfully capture the topical
and linguistic inclinations of individual users at a local level to enrich the tweet representa-
tion. We discuss these two mechanisms below.
4.2.1. Language-guided Attention Mechanism
We devise a novel language-specic attention block that selectively attends to language-
oriented information in the tweet and lters out unnecessary information thus, enriching its
representation. For the tweet embedding obtained using the mBERT encoder, we denote
it as Tf={es}S
s=1. We use an attention technique to identify key terms, then aggregate
the acquired word representations to create a comprehensive representation of the tweet’s
textual content with respect to the linguistic preferences of the user. To this end, we feed the
token-based embedding matrix Tfthrough a dense layer to create its hidden representation,
as illustrated in the equation below.
hl=tanh(TfWl+bl)(9)
Here, hl={hl
s}S
s=1, where hl
sis the hidden representation of es. We then determine how
closely the token’s latent representation (hl
s)resembles the language embedding vector (lf)
and run the outcome through a softmax algorithm to generate attention scores (αs) using
the formula presented in Equation 10.
α=softmax(hllf)(10)
Here, α={αs}S
s=1, where αsdesignates a word’s signicance with respect to language. The
language-guided tweet representation is then derived by computing the weighted sum of
token embeddings with attention scores αsserving as weights as presented below.
tl=
S
s=1
αshl
s(11)
Here, tlrepresents the language-guided tweet representation.
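Equations 9-11 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the weight matrix W_l and bias b_l are passed in as arguments here, whereas in the actual model they are learned parameters of a dense layer.

```python
import numpy as np

def language_guided_attention(T_f, W_l, b_l, l_f):
    """Sketch of Equations 9-11. T_f is the (S x D) token matrix and
    l_f the D-dimensional language embedding; returns the language-guided
    tweet representation t_l and the attention weights alpha."""
    h = np.tanh(T_f @ W_l + b_l)           # Eq. 9: hidden token states (S x D)
    scores = h @ l_f                       # similarity of each token to l_f
    alpha = np.exp(scores - scores.max())  # numerically stable softmax (Eq. 10)
    alpha = alpha / alpha.sum()
    t_l = (alpha[:, None] * h).sum(axis=0)  # Eq. 11: attention-weighted sum
    return t_l, alpha
```

The user-guided attention of Section 4.2.2 follows the same pattern with u_f in place of l_f.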
4.2.2. User-guided Attention Mechanism
Users tend to express their interest in the semantic attributes of a tweet’s text. Thus,
exploring users’ attention to words appearing in tweets towards recommending hashtags is
crucial. By using user-guided attention, the model can capture the user’s unique perspec-
tives, which can provide additional context and improve the accuracy of post features. We
utilize a user-guided attention mechanism for identifying salient words and combining their
corresponding representations to obtain a comprehensive representation of the tweet's textual content with respect to the user. To achieve this, we first process the mBERT-based token embedding matrix (T_f) using a MultiLayer Perceptron (MLP) to derive h^u, as illustrated in the subsequent equation.

h^u = tanh(T_f W_u + b_u)    (12)

Here, h^u = {h^u_s}_{s=1}^S, where h^u_s is the hidden representation of e_s. We first calculate how similar h^u_s and u_f are, then run the result through a softmax function to produce the normalized weights β_s, as demonstrated below.

β = softmax(h^u · u_f)    (13)

Here, β = {β_s}_{s=1}^S, where β_s signifies the relevance of a term with respect to the user. The user-guided tweet representation is determined by computing the weighted sum of the hidden token representations, with β_s serving as weights, as shown below.

t_u = Σ_{s=1}^S β_s h^u_s    (14)

Here, t_u denotes the user-guided tweet representation. The obtained representations are forwarded to the feature interaction component.
4.3. Feature Interaction
The feature interaction module employs a graph neural network to capture global inter-
ests by analyzing long-term user behavior and preferences, in addition to tweet correlation.
It comprises two major stages, namely graph construction and feature encoding. We discuss
these two stages in detail below.
4.3.1. Graph Construction
To mine the correlation between tweets and the interaction between tweets and users,
we create an undirected heterogeneous graph as illustrated in Algorithm 1. Here, G = (V, E) is the resultant user-tweet graph, and V and E denote the collection of vertices and the edges between them, respectively. We construct a graph with two different kinds of nodes, as shown in Line 1 of Algorithm 1. The total number of nodes in the graph is I, where I = |T| + |U|, and E ⊆ V × V is the set of relationships among nodes that model
tweet-tweet correlations and user-tweet interactions. The edges constructed based on tweet-
tweet correlations are weighted, whereas those corresponding to user-tweet interactions are
unweighted. First, we compute the pairwise similarity between tweets appearing in the
tweet set T, as depicted in Line 4. We then assign an edge between tweets of related
language families corresponding to the language in which the tweet under consideration
is written, as shown in Lines 5-8, corresponding to the Indo-Aryan and Dravidian family
groups. The tweets not falling under these two groups imply they are written in English,
as shown in Lines 9-10. The edge weight is the similarity score between mBERT-based
embeddings of a tweet with tweets written in related languages comprising the language
group. Grouping posts concerning their language family, like Indo-Aryan and Dravidian,
can help in recommendations by personalizing content and recommendations based on the
user’s linguistic and cultural background. Language families are a collection of languages
that share the same ancestor. Languages in the same family often share similar grammatical
structures, vocabulary, and cultural contexts. By grouping posts based on a language family,
we identify posts that are likely to be relevant and exciting to users with a particular
linguistic background. For example, suppose a user writes tweets in a language from the
Algorithm 1 Graph Construction
Input: T: Tweets
       U: Users
Output: G(V, E): User-Tweet Graph
function get_graph(T, U)
1:  V = T ∪ U
2:  E = []
3:  for all (t1, t2) ∈ T × T do
4:      sim_score = cos_sim(t1, t2)
5:      if langdetect(t1) & langdetect(t2) ∈ [bn, hi, mr, gu] then
6:          E = E ∪ (t1, t2, sim_score)
7:      else if langdetect(t1) & langdetect(t2) ∈ [kn, te, ta] then
8:          E = E ∪ (t1, t2, sim_score)
9:      else if langdetect(t1) & langdetect(t2) ∈ [en] then
10:         E = E ∪ (t1, t2, sim_score)
11:     end if
12: end for
13: for all t ∈ T do
14:     u = get_user(t)
15:     E = E ∪ (t, u, 1)
16: end for
17: G = (V, E)
18: return G
Indo-Aryan family. In that case, we can group posts that are written in languages from this
family, such as Bangla (Bn), Hindi (Hi), Marathi (Mr), and Gujarati (Gu), and recommend
hashtags to the user. Similarly, suppose a user uses a language from the Dravidian family. In
that case, we can group posts that are written in languages from this family, such as Kannada
(Kn), Telugu (Te), and Tamil (Ta), and recommend them to the user. By personalizing
recommendations in this way, we can increase the relevance and engagement of content for
users. Furthermore, as depicted in Lines 13-16, for every tweet, we retrieve its corresponding
user. We then create an edge to connect the user to his uploaded tweets. By capturing the
user-tweet relationship through edge creation, tweet representations can be enriched with
the contextual information of the associated user, such as the user’s topical interests and
historical posting patterns. Incorporating the user context allows for more contextualized
and personalized tweet representations. It considers the relationship between the user and
his tweets, allowing for a more nuanced understanding of their behavior and motivations.
Unlike similarity-based analysis (Zhang et al., 2019), which overlooks the unique context and significance of individual posts, treating them as isolated entities, the edge-based approach
explicitly models the relationship between a user and his tweets within the graph structure,
thus enabling a comprehensive analysis of interdependencies and interactions between users
and their tweeted content. The edge connecting a user to their tweets indicates the range
and diversity of their topical interests. We utilize this edge information to identify patterns
and recommend accurate hashtags.
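A minimal Python sketch of Algorithm 1 is given below. It assumes language labels are precomputed per tweet (standing in for calling langdetect inside the loop) and that `cos_sim` is a caller-supplied function computing cosine similarity over mBERT embeddings; both are placeholders, not the authors' code.

```python
from itertools import combinations

# Language family groups used for tweet-tweet edges (ISO codes as in Algorithm 1).
INDO_ARYAN = {"bn", "hi", "mr", "gu"}
DRAVIDIAN = {"kn", "te", "ta"}

def same_family(l1, l2):
    """True when both languages belong to the same family group (or both are English)."""
    return ({l1, l2} <= INDO_ARYAN or {l1, l2} <= DRAVIDIAN
            or l1 == l2 == "en")

def build_graph(tweets, langs, users, cos_sim):
    """tweets: list of tweet ids; langs/users: dicts keyed by tweet id;
    cos_sim: callable returning a tweet-tweet similarity score."""
    V = set(tweets) | set(users.values())       # Line 1: tweet and user nodes
    E = []
    for t1, t2 in combinations(tweets, 2):      # Lines 3-12: tweet-tweet edges
        if same_family(langs[t1], langs[t2]):
            E.append((t1, t2, cos_sim(t1, t2)))  # weighted by similarity
    for t in tweets:                            # Lines 13-16: user-tweet edges
        E.append((t, users[t], 1))              # unweighted (weight fixed to 1)
    return V, E
```

Note that tweets from different family groups receive no edge at all, which is what restricts information flow to related languages during graph encoding.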
Figure 3: Graph AutoEncoder
4.3.2. Graph Feature Encoding
Our primary goal is to create and train a model to learn tweet and user embeddings
given an input graph G in order to perform hashtag recommendations. A Graph AutoEncoder (GAE) is a type of unsupervised learning model used for graph representation learning. GAE can capture
complex, non-linear relationships between nodes in a graph, which cannot be easily cap-
tured by traditional graph embedding techniques such as DeepWalk (Perozzi et al., 2014)
or node2vec (Grover & Leskovec, 2016). GAE preserves the structural properties of nodes
even when the data is noisy. GAE can be used for hashtag recommendation, where the
input data consists of both user-tweet interaction data and tweet features represented as
a graph. This allows for a more comprehensive recommendation system that takes into
account both user behavior and tweet attributes. The proposed GAE pipeline is shown in
Fig. 3. Let G = (V, E) represent a graph with N nodes, and let A be its adjacency matrix. Let F be the feature matrix with N rows, where each row represents the feature vector of a vertex. The goal of GAE is to acquire a reduced-dimensional latent representation Z that encompasses the structural and semantic information of the graph. The adjacency and feature matrices, when combined (AF), form the encoder's input. Graph Sample and Aggregate
(GraphSAGE) (Hamilton et al., 2017) can be used as the encoder in the GAE by adapt-
ing it to aggregate information from the entire graph. GraphSAGE is a neural network
that is designed to learn node embeddings by compiling information from its immediate surroundings. The input to the GraphSAGE encoder is F_v, the feature vector with which node v is initialized, and N(v), the set of neighboring nodes of node v in the graph.
The tweet node is initialized by employing word level attention (Yang et al., 2016) over
the textual feature matrix of tweet tas discussed in Section 4.1 since tweets contain noisy
user-generated text. User nodes are initialized with a feature vector obtained as depicted
in Section 4.2.2. Generally, h^k_v is the embedding vector of node v at the kth layer of the GraphSAGE encoder, and N_L is the number of layers in the encoder. We adopt the mean aggregator in GraphSAGE, as evident in Equation 15.

h^k_v = GraphSAGE_mean(h^{k-1}_v, A),  ∀k ∈ [1, N_L]    (15)
The updated feature matrix Z is obtained from the last layer, as shown in Equation 16.

Z = h^{N_L}_v    (16)

Here, Z consists of the updated user representation (t'_u) and text feature (t'). The decoder maps this latent representation back to the original graph structure. It consists of a sigmoid activation function, as shown in Equation 17.

Â = sigmoid(Z · Z^T)    (17)

Here, Â is the reconstructed adjacency matrix.
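The two numeric operations above can be sketched as follows. This is a simplified illustration: a full GraphSAGE layer as in Equation 15 also applies a learned weight matrix and nonlinearity, whereas the function below shows only the neighborhood-mean step; the inner-product decoder matches Equation 17.

```python
import numpy as np

def mean_aggregate(F, A):
    """Neighborhood-mean step of a GraphSAGE-style layer (cf. Eq. 15):
    each node's new feature is the mean of its neighbors' features.
    F: (N x D) feature matrix, A: (N x N) adjacency matrix."""
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1  # avoid division by zero for isolated nodes
    return (A @ F) / deg

def decode(Z):
    """Inner-product decoder of Eq. 17: A_hat = sigmoid(Z Z^T)."""
    return 1.0 / (1.0 + np.exp(-(Z @ Z.T)))
```

Stacking `mean_aggregate` N_L times (with learned transforms in between) yields Z, and `decode(Z)` produces the reconstructed adjacency matrix used by the reconstruction loss.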
4.4. Hashtag Recommendation
By considering both the user and the language used in a tweet, we can better capture
the user’s intent, perspective, language usage style, and the meaning of the words they use.
To this end, we derive the overall tweet representation by concatenating the updated tweet
embedding obtained from GAE and language-guided tweet representation as shown below.
t_f = concat(t'_u, t_l)    (18)

Here, t_f is the overall tweet representation. The hashtag recommendation module receives t_f as input and outputs a reasonable set of hashtags RH, as given in Equation 19.

RH = HASH_REC(t_f)    (19)
The hashtag recommendation task is structured as a multilabel classification problem. Given that a tweet can belong to numerous classes simultaneously, this formulation can assist in forecasting labels for non-exclusive classes. A pool of preconfigured hashtags H is employed to assign suitable hashtags to the multilingual tweet, as exhibited in Equation 20.

y_pred = softmax(Dense(units = |H|)(t_f))    (20)

Here, y_pred ∈ R^{|H|} refers to the softmax probabilities of the supplied hashtags, and |H| is the cardinality of the set of hashtags. These probabilities are used to rank hashtags and generate the final set of predicted hashtags (RH).

RH = argsort(y_pred)    (21)
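The ranking step of Equations 20-21 can be sketched as follows; the function name is illustrative, and the dense layer producing the logits is omitted. The default k = 8 follows the mean number of hashtags per tweet reported for the dataset, but k is configurable.

```python
import numpy as np

def recommend_hashtags(logits, hashtags, k=8):
    """Sketch of Eqs. 20-21: softmax over the hashtag vocabulary,
    then return the k highest-probability hashtags."""
    z = np.exp(logits - logits.max())
    y_pred = z / z.sum()                  # softmax probabilities (Eq. 20)
    top_k = np.argsort(y_pred)[::-1][:k]  # indices sorted by probability (Eq. 21)
    return [hashtags[i] for i in top_k]
```

Since softmax is monotonic, sorting by y_pred and sorting by the raw logits give the same ranking; the probabilities are still useful for thresholding or calibration.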
The objective loss function for training TAGALOG can be seen in Equation 22.

L = L_GAE + L_HR    (22)

Here, L is the overall loss function, L_GAE is the reconstruction loss of GAE, and L_HR is the loss function for the hashtag recommendation module. The loss function L_GAE is described in Equation 23.

L_GAE = ||A − Â||²    (23)
Here, A and Â represent the actual and reconstructed adjacency matrices, and ||·||² denotes the squared norm. The objective of L_GAE is to reduce the difference between the predicted and actual adjacency matrices across the entire training dataset, with the purpose of achieving better reconstruction accuracy. The optimization problem is solved by minimizing L_GAE with respect to the parameters of the encoder and decoder (θ_e and θ_d) using a gradient-based optimization algorithm. Through this process, GAE learns a compressed representation of
the input graph. The training loss function for the hashtag recommendation module is
described in Equation 24.
L_HR = −(1/|M|) Σ_{(t,G)∈M} Σ_{g∈G} log(P(g|t))    (24)

Here, the current tweet is represented by t, the related ground-truth hashtag set is indicated by G, the softmax probability that the ground-truth hashtag g will be used for the tweet t is given by P(g|t), and the variable M represents the training set of multilingual tweets.
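A toy computation of the objective in Equations 22-24 is sketched below. The representation is deliberately simplified (probabilities stored as plain dicts, one loop per tweet); the function names are illustrative and not the authors' code.

```python
import numpy as np

def gae_loss(A, A_hat):
    """Squared-norm reconstruction loss of Eq. 23."""
    return float(((A - A_hat) ** 2).sum())

def hashtag_loss(prob_rows, gt_sets):
    """Negative log-likelihood of Eq. 24. prob_rows[i] maps hashtags to
    softmax probabilities for tweet i; gt_sets[i] is its ground-truth set."""
    total = 0.0
    for probs, gt in zip(prob_rows, gt_sets):
        for g in gt:                      # sum over ground-truth hashtags
            total -= np.log(probs[g])     # -log P(g|t)
    return total / len(prob_rows)         # average over the training set

def total_loss(A, A_hat, prob_rows, gt_sets):
    """Overall objective of Eq. 22: L = L_GAE + L_HR."""
    return gae_loss(A, A_hat) + hashtag_loss(prob_rows, gt_sets)
```

Both terms decrease together during training: the reconstruction term shapes the graph embeddings while the likelihood term pushes probability mass onto the ground-truth hashtags.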
5. Experimental Evaluations
In the ensuing subsections, we go over the experimental settings, followed by the experimental findings that validate the viability of our proposed framework.
5.1. Experimental Setup
Here, we present our curated dataset on which experiments were performed. Next, we go
into state-of-the-art approaches and existing models for comparison, followed by the criteria
employed for evaluation.
5.1.1. Dataset
To the best of our knowledge, we have curated the first large-scale multilingual low-resource Indic tweets dataset, dubbed IndicHash. This dataset is designed for the task of recommending hashtags for tweets posted in multiple low-resource Indic languages. We create an exhaustive dataset from tweets published by Indian users covering seven low-resource languages besides English. Regional language tweets have increased significantly on Twitter. This served as our inspiration to broaden the endeavor to Indic languages. We chose a total of seven different Indic languages, namely Bangla, Hindi, Kannada, Gujarati, Tamil, Telugu, and Marathi. This decision was primarily motivated by the widespread usage of these Indic languages across various regions of India. We now elucidate the techniques used to gather and process the independent tweets, followed by a description of the dataset's specifications.
Data Collection. We gathered nearly equal numbers of posts for each keyword and a similar
amount of keywords for each category. We rst curated a generic list of categories namely
technology, business, education, environment, gadgets, sports, festivals, people’s movement,
politics, cricket, entertainment, movies, music, news, culture, food, military, career, fash-
ion, tness, gaming, nature, weather, emotions, pets, hobbies, astrology, and crisis. The
total number of keywords considered for data collection is 213. For example, keywords
18
under the education category: education, ed-tech, ParikshaPeCharcha, teacher, learning,
school, university, neweducationpoilcy, students, and exams. Likewise, under the category
of people’s movements which is a hot topic on Twitter, we included keywords such as Stu-
dentLivesMatter, ShaheenBagh, FarmersProtest, KisaanAndolan, metoo, BlackLivesMatter,
pride, feminism, NeverAgain, and EnoughIsEnough. We used Scraper for Social Networking
Services (SNS) abbreviated as snscrape8to download tweets. We scraped attributes like
user IDs, and hashtags, and retrieve the relevant tweets using keywords as a search query.
We gathered user tweet data in a variety of languages since people use hashtags regardless
of their language of origin. The dataset collection comprises a total of 31,07,866 tweets, and
9,17,833 hashtags posted by 4,78,120 users for a total of 8 languages. The average numbers of tweets per keyword and tweets per user in the collected dataset amount to 14,591 and 7, respectively, whereas the average number of hashtags per tweet is 5.
Data Pre-processing. The subsequent measures were adopted to ensure a high-quality input
for our model. We removed tweets that contain less than three words. The acquired data
was noisy due to Twitter’s quick and erratic nature. The data was sanitized by deleting
duplicate posts with null values. The pre-processed data underwent several modications,
including the removal of links, conversion of text to lowercase, and exclusion of all non-
alphanumeric characters except space and full stop. Hashtags were also collected from these
pre-processed posts. Post information such as the content of the original post, hashtags
used, and the user id of the user who created that tweet was extracted. To balance the
dataset, we randomly sampled an equal number of tweets from each language. The final dataset collection comprises a total of 81,944 tweets, 17,660 users, and 37,151 hashtags.
Table 1 provides a summary of the dataset’s statistics.
Characteristic                     Original     Pre-processed    Final
No. of tweets                      31,07,866    10,65,848        81,944
No. of users                       4,78,120     1,36,348         17,660
No. of keywords                    213          213              205
No. of hashtags                    9,17,833     45,535           37,151
No. of tweets/keyword              14,591       5,004            400
Average no. of hashtags/tweet      5            8                8
Average no. of tweets/user         7            8                5

Table 1: Dataset Statistics
5.1.2. Compared Methods
In order to assess the ecacy of the suggested model, we conducted a comparative
analysis against prior research endeavors in the domain of hashtag recommendation as well
as established language models based on transformer architecture.
8https://github.com/JustAnotherArchivist/snscrape
Existing Research Works. To evaluate the efficiency of the proposed model, we contrast our approach with recent research works on hashtag recommendation.
1. AMNN: (Yang et al., 2020b) generated hashtags by developing a sequence-to-sequence
encoder–decoder framework. The encoder retrieves visual and textual embeddings
individually which are then subjected to an attention technique. The attended visual
and textual features upon concatenation are fed into GRU, which generates hashtags
sequentially according to softmax probabilities.
2. TwHIN-BERT: (Zhang et al., 2022) developed the Twitter Heterogeneous Information
Network which is a polyglot language model that frames the objective of predicting
hashtags as a problem of multi-class classication. It is trained with a vast volume of
tweets and rich social interactions in order to emulate the brief and noisy nature of
user-generated content.
3. SEGTRM: (Mao et al., 2022) introduced a transformer-based model which produces hashtags in a sequential manner. SEGTRM consists of three components: an encoder, a segments-selector, and a hashtag generator. The encoder removes extraneous data at various granularities within text, segments, and tokens in order to derive global textual representations. The segments-selector selects multiple segments and reorganizes them into a novel sequence to serve as input to the decoder, enabling end-to-end hashtag construction. To predict hashtags in terms of both quality and quantity concurrently, the authors employ a sequential decoding algorithm.
4. DESIGN: (Bansal et al., 2022) incorporated pertinent data encoded in linguistic and
visual modalities of social media posts besides analyzing users’ tagging behavior to
suggest a personalized and credible set of hashtags. The authors use a word-level
parallel co-attention mechanism to enhance the multimodal information and create a
richer post representation. The decoder capitalizes on hashtags produced using multilabel classification and sequence generation procedures for the recommendation.
Existing Models. We discuss various transformer-based models against which we compare the performance of our devised framework. To derive features of tweets in our dataset, we investigated different transformer-based models. These models can be tailored for classification tasks after being trained on general tasks. (Devlin et al., 2019) introduced BERT, a transformer-based approach for pre-training NLP models that learns contextual representations during pre-training. It is a deep, bidirectional, and flexible model that can be fine-tuned by appending a few output layers. Consequently, BERT serves as the underlying architecture for all fundamental models.
1. mBERT: (Pires et al., 2019) devised mBERT, which stands for multilingual BERT. It
is a transformer-based model trained on and usable with 104 languages with Wikipedia
(2.5B words) with 110 thousand shared word-piece vocabulary using a masked language
modeling (MLM) objective. The input is transformed into vectors with BERT’s capa-
bility of bidirectionally training the language model which captures a deeper context
and ow of the language.
2. mBERT with Transliteration: We used the IndicTrans9 package released by AI4Bharat to transliterate the text of tweets. We employ transliteration (script conversion) for Indic languages since it helps in reducing the lexical gap among different Indic languages. After transliteration, we obtain embeddings for transliterated tweets using mBERT, which in turn are employed to recommend suitable hashtags.
3. IndicBERT: (Kakwani et al., 2020) introduced an ALBERT-based multilingual model
featured in AI4Bharat’s IndicNLPSuite. This model was trained on a massive corpus
containing over 9 billion tokens in 12 major Indian languages. IndicBERT is capable
of extracting sentence and word embeddings.
4. XLMR: (Conneau et al., 2020) proposed the multilingual RoBERTa variant called XLM-RoBERTa, which is used to carry out various NLP tasks. It has been pre-trained on an enormous amount of multilingual data covering 100 languages using the MLM objective. More intriguingly, large-scale cross-lingual training has a major positive impact on languages with few resources. XLM-RoBERTa uses SentencePiece tokenization on raw text without any performance loss. Since it uses the same training procedure as the RoBERTa model, the moniker "RoBERTa" was incorporated.
5. DistilmBERT: (Sanh et al., 2019) developed a condensed adaptation of mBERT with the objective of reducing its size, cost, processing time, and computational load. It contains up to 40% fewer parameters than BERT-base-uncased, and it guarantees a 60% faster runtime while maintaining 97% of the original performance. Furthermore, it is trained on Wikipedia texts in 102 distinct languages. It has 134M parameters in all. DistilmBERT is typically twice as fast as mBERT-base.
5.1.3. Evaluation Metrics
To evaluate the performance of our suggested hashtag recommendation system, we use
assessment criteria from the literature on multi-label classication. The standard evalu-
ation metrics for analyzing the performance of hashtag recommendation methods are Hit
rate, Precision, Recall, and F1-score. These metrics are computed by comparing predicted
hashtags and ground-truth hashtags for each tweet. We describe each evaluation metric
below.
The occurrence of at least one common hashtag (GH ∩ RH ≠ ∅) between the set of recommended hashtags (RH) and ground-truth hashtags (GH) accounts for the hit-rate metric when dealing with hashtag recommendation systems. The hit rate is described in the following equation.

Hit rate (HR) = min(|GH ∩ RH|, 1)    (25)
Dividing the number of hashtags present in both the ground-truth and recommended hashtag sets by the cardinality of the set of recommended hashtags yields precision. The formula for precision is as follows.

Precision (P) = |GH ∩ RH| / |RH|    (26)
9https://ai4bharat.org/indic-trans
Recall is the ratio of the number of hashtags shared between the ground-truth and recommended hashtag sets to the number of ground-truth hashtags. The recall is computed as given in Equation 27.

Recall (R) = |GH ∩ RH| / |GH|    (27)

To compute the F1-score, we derive the harmonic mean of the precision and recall measures, as shown in Equation 28.

F1-score (F1) = 2 · P · R / (P + R)    (28)
The outcome of each evaluation metric is denoted as HR@K, P@K, R@K, and F1@K, where K denotes the number of recommended hashtags. Note that larger values imply better performance.
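Equations 25-28 for a single tweet can be computed as follows; the function name is illustrative, and the per-tweet scores would be averaged over the test set to obtain HR@K, P@K, R@K, and F1@K.

```python
def evaluate(gt, rec):
    """Per-tweet metrics of Eqs. 25-28.
    gt: set of ground-truth hashtags; rec: set of recommended hashtags."""
    common = len(gt & rec)                       # |GH ∩ RH|
    hr = min(common, 1)                          # Eq. 25: hit rate
    p = common / len(rec) if rec else 0.0        # Eq. 26: precision
    r = common / len(gt) if gt else 0.0          # Eq. 27: recall
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0  # Eq. 28: F1-score
    return hr, p, r, f1
```

For instance, with ground truth {#a, #b, #c} and recommendations {#a, #d}, this yields a hit rate of 1, precision 0.5, recall 1/3, and F1-score 0.4.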
5.2. Experimental Results
In this segment, we present an exposition of the empirical findings resulting from the comparison of the proposed framework to state-of-the-art approaches and extant models, an analysis of the performance enhancement, and an examination of visual representations of the recommendations.
5.2.1. Eectiveness Comparisons
We begin by outlining TAGALOG's overall benefits, particularly its superiority over
previous research works and various transformer-based models. We regard
the top-K hashtags as the recommended ones, with K being 8, since the mean number of
hashtags per tweet is 8.

Technique | Hit rate | Precision | Recall | F1-score
AMNN (Yang et al., 2020b) | 0.489 | 0.195 | 0.210 | 0.202
SEGTRM (Mao et al., 2022) | 0.520 | 0.211 | 0.228 | 0.219
TwHIN-BERT (Zhang et al., 2022) | 0.600 | 0.179 | 0.194 | 0.187
DESIGN (Bansal et al., 2022) | 0.771 | 0.284 | 0.311 | 0.297
TAGALOG | 0.824 | 0.334 | 0.366 | 0.349

Table 2: Effectiveness Comparison Results with Existing Research Works

As can be seen in Table 2, the performance gain achieved by TAGALOG is 33.5%, 13.9%,
15.6%, and 14.7% over AMNN; 30.4%, 12.3%, 13.8%, and 13.0% over SEGTRM; 22.4%,
15.5%, 17.2%, and 16.2% over TwHIN-BERT; and 5.3%, 5.0%, 5.5%, and 5.2% over DESIGN
in terms of hit rate, precision, recall, and F1-score, respectively. The improvement in
performance achieved by TAGALOG over AMNN is due to the superiority of mBERT over
LSTM (Graves & Graves, 2012). The bidirectional and multilingual nature of the BERT-based
feature extractor captures the multilingual context more effectively. Further, TAGALOG
considers language and user characteristics when creating the tweet representation to
recommend high-quality hashtags, in contrast to the purely content-based information used
by AMNN. The performance enhancement over SEGTRM arises because SEGTRM filters
text at different granularities, whereas TAGALOG adopts language-guided and user-guided
attention mechanisms to filter content with respect to the user's topical and linguistic
interests. The remarkable improvement of TAGALOG over TwHIN-BERT is due to modeling
user preferences, in addition to user interaction with tweets and language relatedness,
through graph construction. DESIGN employs a word-level attention mechanism in addition
to multi-label classification and sequence generation techniques. The user-guided and
language-guided attention mechanisms in TAGALOG filter the tweet content to construct a
tweet representation in accordance with the user's topical interests and linguistic style,
which aids in suggesting relevant hashtags. Unlike DESIGN, which samples a fixed number
of a user's historical posts, TAGALOG captures the user's entire tweet history through a
graph neural network.
Figure 4: Effectiveness Comparison Curves on IndicHash (panels: (a) Hit rate, (b) Precision, (c) Recall, (d) F1-score)
Fig. 4 contrasts the performance of the various hashtag recommendation models in terms
of the evaluation metrics. The x-axis shows the number of recommended hashtags, while the
y-axis represents the respective performance indicator. The recommended hashtag count
ranges from 1 to 9. It is noteworthy that an increase in the number of recommended
hashtags leads to a higher hit rate and recall but lower precision. The curves for TAGALOG
consistently lie above those of the other models on all metrics, regardless of the number
of hashtags recommended. Furthermore, the gaps between the curves gradually widen,
underscoring the substantial advances made by our proposed model over existing research
methods. These findings provide empirical support for TAGALOG's superiority and efficacy
across all four assessment criteria.
Technique | Hit rate | Precision | Recall | F1-score
mBERT (Pires et al., 2019) | 0.757 | 0.261 | 0.286 | 0.273
mBERT with transliteration | 0.715 | 0.240 | 0.263 | 0.251
IndicBERT (Kakwani et al., 2020) | 0.637 | 0.213 | 0.229 | 0.221
XLMR (Conneau et al., 2020) | 0.655 | 0.200 | 0.221 | 0.210
DistilmBERT (Sanh et al., 2019) | 0.549 | 0.147 | 0.159 | 0.153
TAGALOG | 0.824 | 0.334 | 0.366 | 0.349

Table 3: Effectiveness Comparison Results with Existing Models
Table 3 shows the performance comparison of TAGALOG with existing transformer-based
models. The performance gain achieved by TAGALOG is 6.7%, 7.3%, 8.0%, and 7.6%
over mBERT without transliteration; 10.9%, 9.4%, 10.3%, and 9.8% over mBERT with
transliteration; 18.7%, 12.1%, 13.7%, and 12.8% over IndicBERT; 16.9%, 13.4%, 14.5%,
and 13.9% over XLMR; and 27.5%, 18.7%, 20.7%, and 19.6% over DistilmBERT in terms of
the four performance measures. The reasons behind this gap are the incorporation of a novel
language-guided attention mechanism in addition to user-guided attention, the construction
of a user-tweet graph to capture interactions among tweets belonging to languages of the
same family, and user-tweet interaction to enrich the user and tweet embeddings. These
procedures help in constructing an effective tweet representation, which in turn yields
high-quality and relevant hashtags for tweets posted in low-resource Indic languages.
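The graph construction sketched above can be illustrated with a small example. The edge-building routine and the language-to-family mapping below are simplified assumptions for illustration, not the authors' exact implementation: tweets by the same user are linked to capture posting behavior, and tweets whose languages share a family are linked to capture language relatedness.

```python
# Hypothetical mapping of languages to the two families the paper models.
FAMILY = {"hindi": "indo-aryan", "bangla": "indo-aryan", "gujarati": "indo-aryan",
          "tamil": "dravidian", "telugu": "dravidian", "kannada": "dravidian"}

def build_edges(tweets):
    """tweets: list of (tweet_id, user_id, language) triples.

    Returns a set of typed edges: 'user' edges connect a user's historical
    tweets; 'family' edges connect tweets of related languages.
    """
    edges = set()
    for i, (tid_a, user_a, lang_a) in enumerate(tweets):
        for tid_b, user_b, lang_b in tweets[i + 1:]:
            if user_a == user_b:                      # same user's history
                edges.add((tid_a, tid_b, "user"))
            if FAMILY[lang_a] == FAMILY[lang_b]:      # same language family
                edges.add((tid_a, tid_b, "family"))
    return edges
```

In a full system, these edges would feed a graph neural network that propagates information between connected tweet and user nodes to enrich their embeddings.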
5.2.2. Performance Gain Analysis
We analyze the performance gains of the proposed approach in this section. Following
a performance comparison across model components, we examine how TAGALOG
performs with various attention techniques.

Attention Techniques. We discuss how TAGALOG performs with diverse attention strategies
in this part. The variants of TAGALOG that use no attention, language-guided attention
only, user-guided attention only, and user-guided together with language-guided attention
are TAGALOG-NA, TAGALOG-LGA, TAGALOG-UGA, and TAGALOG-UGA+LGA, respectively.
Here, TAGALOG-UGA+LGA refers to our devised system.
Mechanism | Hit rate | Precision | Recall | F1-score
TAGALOG-NA | 0.784 | 0.285 | 0.313 | 0.299
TAGALOG-LGA | 0.783 | 0.292 | 0.321 | 0.306
TAGALOG-UGA | 0.824 | 0.330 | 0.361 | 0.345
TAGALOG-UGA+LGA | 0.824 | 0.334 | 0.366 | 0.349

Table 4: Performance of TAGALOG with Different Attention Techniques
Table 4 illustrates the performance obtained when the attention mechanisms that comprise
the feature refinement module are removed. Here, UGA and LGA refer to the user-guided
attention and language-guided attention mechanisms. The performance difference when
TAGALOG is implemented without any attention mechanism is 5.0% in terms of the F1-score.
To derive the overall tweet representation in the no-attention model, we compute the
average of the mBERT-based token embeddings. The performance of TAGALOG is lowest
in the absence of any attention mechanism. The drop in the F1-score on eliminating UGA
from TAGALOG, termed TAGALOG-LGA, is 4.3%, while the difference on excluding LGA
from TAGALOG, termed TAGALOG-UGA, is 0.4%. UGA helps to learn the context in
which a user created a post, and LGA assists in learning the user's language choice and
usage style. UGA is typically used to improve the relevance and usefulness of tweets for
individual users and to enhance the overall user experience, while LGA focuses on modeling
idiosyncratic language behavior. The above-mentioned performance gap demonstrates the
significance of the language-guided and user-guided attention techniques.
Model Component Analysis. We conduct a model component analysis to emphasize the
significance of the various components constituting the proposed model. Below, we report
the performance of the Feature Refinement (FR) and Feature Interaction (FI) components
comprising TAGALOG. We eliminate the feature refinement component to stress its
pertinence; the resulting model is referred to as TAGALOG-FI. Similarly, the model obtained
by excluding feature interaction from TAGALOG is referred to as TAGALOG-FR. We use
TAGALOG-FR+FI and TAGALOG interchangeably, since TAGALOG-FR+FI is the model
we have developed.
Technique | Hit rate | Precision | Recall | F1-score
TAGALOG-FI | 0.784 | 0.285 | 0.313 | 0.299
TAGALOG-FR | 0.806 | 0.314 | 0.342 | 0.328
TAGALOG-FR+FI | 0.824 | 0.334 | 0.366 | 0.349

Table 5: Performance Comparison with Different Components
Table 5 shows the performance of TAGALOG when its different components are eliminated.
The performance gap in terms of the evaluation metrics on the exclusion of FR is 4.0%,
4.9%, 5.3%, and 5.0%, respectively, while that on the exclusion of FI is 1.8%, 2.0%, 2.4%,
and 2.1%, which demonstrates the significance of these components. Additionally, the
proposed model, which includes both FR and FI, beats the performance of the individual
components. This implies that the components complement each other when recommending
hashtags. FR captures the local topical and linguistic interests of individual users through
UGA and LGA, while FI captures global interests by analyzing the long-term behavior and
preferences of the user, in addition to tweet correlation based on language relatedness.
Overall, the experimental results show that each component contributes positively to
TAGALOG's performance.
5.2.3. Qualitative Analysis
We conduct qualitative investigations to demonstrate the effectiveness of our framework.
We show user-created tweets together with the hashtags proposed by the different models.
For sample tweets chosen from the test data, accurate hashtags are shown in green,
pertinent ones in blue, and erroneous ones in red. Hashtags that models recommend and
that match the ground-truth hashtags are considered accurate. Pertinent hashtags, on the
other hand, do not belong to the set of ground-truth hashtags but are compatible with
the tweet's content.
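The three-way labeling above can be sketched as follows. Deciding which non-ground-truth hashtags count as "pertinent" requires human judgment; the sketch mocks that judgment with a caller-supplied set of topically related tags, so this is an illustration of the labeling scheme rather than an automatic procedure.

```python
def label_hashtags(recommended, ground_truth, related):
    """Label each recommended hashtag as accurate, pertinent, or erroneous.

    ground_truth: hashtags the user actually assigned.
    related: hashtags judged (by a human annotator) to fit the tweet's topic.
    """
    labels = {}
    for tag in recommended:
        if tag in ground_truth:
            labels[tag] = "accurate"      # matches the ground truth
        elif tag in related:
            labels[tag] = "pertinent"     # topically compatible, not in ground truth
        else:
            labels[tag] = "erroneous"     # neither in ground truth nor on topic
    return labels
```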
The tweet given in Fig. 5a, written in Bangla, concerns the Punjab elections held in 2022.
As can be seen, the user assigns a few hashtags to the tweet in his native language,
indicating that these hashtags are widely trending about the Punjab elections among
Bangla Twitter users. The user assigns #congress and #bjp not only in English but also in
Bangla. Besides assigning hashtags in English, users tend to tag topics of interest with
hashtags in their native language. Users are more inclined to adopt hashtags in their
native language to connect with others who share their cultural background or interests.
Hashtags in different languages can also promote diversity and inclusivity on social media
platforms, allowing users to find content and connect with others from a broader range of
backgrounds and perspectives. The hashtags recommended in Bangla indicate the ability
of our model to recommend language-specific topical hashtags. This implies that our model
recommends multilingual hashtags and learns the user's language usage style by adopting
his linguistic behavior. The hashtag #punjab is directly related to the event of the Punjab
elections; #pmmodi and #rahulgandhi are prominent political figures and are therefore
deemed pertinent. TAGALOG recommends seven accurate and three pertinent hashtags.
DESIGN recommends four accurate, five pertinent, and one erroneous hashtag. SEGTRM
recommends three accurate and six pertinent hashtags. AMNN recommends one accurate,
five pertinent, and one erroneous hashtag. TwHIN-BERT recommends one accurate, four
pertinent, and five erroneous hashtags. Our model recommends the highest number of
accurate hashtags, indicating that mining users' posting and linguistic behavior helps
suggest plausible hashtags.
The tweet in Fig. 5b is written in Gujarati in the context of a global event, the
Russia-Ukraine war. TAGALOG recommends seven accurate and three pertinent hashtags;
DESIGN recommends five accurate, four pertinent, and one erroneous hashtag; SEGTRM
recommends three accurate, one pertinent, and one erroneous hashtag; AMNN recommends
two accurate, one pertinent, and two erroneous hashtags; and TwHIN-BERT recommends
two accurate, one pertinent, and seven erroneous hashtags. The example posts demonstrate
how, by suggesting customized hashtags based on users' thematic and linguistic preferences,
TAGALOG surpasses earlier research methods.

Figure 5: Example Posts (panels: (a) Post 1, (b) Post 2)
6. Discussion
This article introduces a technique to recommend hashtags for tweets posted in multiple
low-resource Indic languages. Our method leverages the user's topical and linguistic
preferences, in addition to the user's posting behavior, to enrich the overall tweet
representation and yield pertinent hashtags. The overall comparison results show that the
proposed system outperforms pre-trained language models and state-of-the-art methods by
a significant margin. While our proposed system offers exciting possibilities, it is crucial
to acknowledge its limitations. This section delves into these limitations, discusses the
practical implications, and explores potential applications that can leverage its strengths.
6.1. Limitations and Future Work
While our proposed model exhibits notable strengths, it is not immune to limitations.
One limitation is that it considers only two prominent language families, i.e., Indo-Aryan
and Dravidian. However, other distinct language families are represented in India, such
as Austroasiatic (e.g., Santali), Tibeto-Burman (e.g., Manipuri), and Andamanese (e.g.,
Great Andamanese), that contribute to the diverse linguistic landscape of the Indian
subcontinent. Our system is scalable, as it can be applied to tweets written in languages
belonging to these language groups. Moreover, we have only considered relatedness among
languages of the same family. Indic languages exhibit varying degrees of cross-family
language relatedness due to historical and linguistic influences. Different language families
have varying degrees of interaction and influence with the Indo-Aryan and Dravidian
languages, resulting in some cross-family language relatedness in the Indian subcontinent,
which can be explored in the future. Future directions also encompass employing data
augmentation methodologies to artificially amplify the quantity of data available for
training the model and devising models that can effectively grasp the diverse patterns of
hashtag usage across various cultural and linguistic contexts.
6.2. Practical Implications
As discussed below, the practical implications of multilingual and personalized hashtag
recommendation in low-resource Indic languages are far-reaching, revolutionizing how
individuals engage with social media platforms.

1. Improved Content Discovery: Hashtags are a powerful tool for content discovery and
organization. Users can easily find relevant content in their preferred language when
provided with multilingual and personalized hashtag recommendations. This enhances
their browsing experience and encourages active engagement with the platform. By
suggesting hashtags that align with users' linguistic and cultural context, users are
more likely to engage with the content, participate in discussions, and contribute to
online communities. This can lead to increased user retention and overall platform
activity.
2. Language Inclusivity: Low-resource Indic languages often face marginalization in
digital spaces due to the dominance of primary languages. Multilingual hashtag
recommendation systems address this issue by promoting inclusivity. They enable users
to express themselves in their native languages, facilitating active participation and
fostering a sense of belonging within language communities.

3. Language Learning and Education: Personalized hashtag recommendations can benefit
individuals learning low-resource Indic languages. By suggesting hashtags that match
their language proficiency level, users can explore relevant content and engage with
native speakers, thereby enhancing their language skills and cultural understanding.

4. Bridging Language Divides and Promoting Heritage: Hashtag recommendation systems
act as linguistic tools and cultural signifiers. They bridge language divides by
suggesting common hashtags across low-resource Indic and widely spoken languages,
facilitating cross-lingual communication and collaboration. Additionally, these
systems preserve and promote the cultural heritage of low-resource Indic languages,
allowing users to express cultural identity, share traditions, and engage in community
discussions using hashtags. The systems also enable the analysis of hashtag usage
patterns, revealing linguistic and cultural trends across languages.
6.3. Potential Applications
The potential applications of multilingual and personalized hashtag recommendation in
low-resource Indic languages can unlock the immense potential of online communication for
users and communities, as listed below.

1. Social and Political Discourse: Hashtags play a significant role in shaping public
opinion and facilitating discussions around social and political issues. A multilingual
hashtag recommendation system for low-resource Indic languages can ensure that diverse
linguistic communities can actively participate in such discussions. It can empower
individuals to express their opinions, promote social causes, drive activism, raise
awareness, and contribute to democratic processes. This can amplify their voices and
facilitate collective action within their linguistic communities.

2. Market Reach and Business Opportunities: Multilingual hashtag recommendations open
doors for businesses and marketers to reach untapped markets, engaging a wider audience
and driving engagement. By using relevant hashtags, businesses can effectively target
specific language communities, promote their products or services, and connect with
potential customers who prefer using their native languages online.

3. Data Analysis and Research: Hashtags provide valuable metadata that can be analyzed
to gain insights into social trends, public opinions, and user behavior. By recommending
hashtags in low-resource Indic languages, researchers, social scientists, and data
analysts can access a wider range of data, enabling them to study and understand the
dynamics and patterns within these language communities.
7. Conclusion
In this paper, we have tackled hashtag recommendation to facilitate multilingual content
retrieval and break through the language barriers inherent in social media platforms. The
proposed polyglot model, TAGALOG, can recommend personalized and language-specific
hashtags for online content generated in various low-resource Indic languages. The system
proposed in this study comprises feature extraction, refinement, and interaction modules.
We first extract content-based, linguistic, and user-based features using transformer- and
deep learning-based models. We then employ language-guided and user-guided attention
mechanisms to fine-tune the tweet representation in line with users' linguistic and topical
preferences. In the feature interaction module, we connect the historical tweets of a
particular user to mine his posting behavior. Furthermore, we group tweets written in
various languages according to their families, i.e., Indo-Aryan and Dravidian, to capture
their interrelatedness. Extensive experiments conducted on the curated Twitter dataset
reveal that our proposed model outperforms pre-trained language models and
state-of-the-art methods.
References
Aggarwal, S., Kumar, S., & Mamidi, R. (2021). Efficient multilingual text classification for
Indian languages. In Proceedings of the International Conference on Recent Advances
in Natural Language Processing (RANLP 2021) (pp. 19–25).
Bansal, S., Gowda, K., & Kumar, N. (2022). A hybrid deep neural network for multimodal
personalized hashtag recommendation. IEEE Transactions on Computational Social
Systems, (pp. 1–21).
Besacier, L., Barnard, E., Karpov, A., & Schultz, T. (2014). Automatic speech recognition
for under-resourced languages: A survey. Speech communication,56, 85–100.
Chakrabarti, P., Malvi, E., Bansal, S., & Kumar, N. (2023). Hashtag recommendation for
enhancing the popularity of social media posts. Social Network Analysis and Mining,
13, 21.
Chen, Y.-C., Lai, K.-T., Liu, D., & Chen, M.-S. (2021). Tagnet: triplet-attention graph
networks for hashtag recommendation. IEEE Transactions on Circuits and Systems for
Video Technology,32, 1148–1159.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave,
É., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2020). Unsupervised cross-lingual repre-
sentation learning at scale. In Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics (pp. 8440–8451).
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep
bidirectional transformers for language understanding. In Proceedings of the 2019 Con-
ference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186).
Djenouri, Y., Belhadi, A., Srivastava, G., & Lin, J. C.-W. (2022). Deep learning based
hashtag recommendation system for multimedia data. Information Sciences,609, 1506–
1517.
Dogra, V., Verma, S., Chatterjee, P., Sha, J., Choi, J., Ijaz, M. F. et al. (2022). A complete
process of text classification system using state-of-the-art NLP models. Computational
Intelligence and Neuroscience, 2022.
Dusart, A., Pinel-Sauvagnat, K., & Hubert, G. (2023). Tssubert: How to sum up multiple
years of reading in a few tweets. ACM Transactions on Information Systems,41, 1–33.
Graves, A., & Graves, A. (2012). Long short-term memory. Supervised sequence labelling
with recurrent neural networks, (pp. 37–45).
Grover, A., & Leskovec, J. (2016). node2vec: Scalable feature learning for networks. In
Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery
and data mining (pp. 855–864).
Hachaj, T., & Miazga, J. (2020). Image hashtag recommendations using a voting deep neural
network and associative rules mining approach. Entropy,22, 1351.
Hamilton, W., Ying, Z., & Leskovec, J. (2017). Inductive representation learning on large
graphs. Advances in neural information processing systems,30.
Javari, A., He, Z., Huang, Z., Jeetu, R., & Chen-Chuan Chang, K. (2020). Weakly supervised
attention for hashtag recommendation using graph data. In Proceedings of The Web
Conference 2020 (pp. 1038–1048).
Jeong, D., Oh, S., & Park, E. (2022). Demohash: Hashtag recommendation based on user
demographic information. Expert Systems with Applications,210, 118375.
Kakwani, D., Kunchukuttan, A., Golla, S., Gokul, N., Bhattacharyya, A., Khapra, M. M.,
& Kumar, P. (2020). Indicnlpsuite: Monolingual corpora, evaluation benchmarks and
pre-trained multilingual language models for indian languages. In Findings of the As-
sociation for Computational Linguistics: EMNLP 2020 (pp. 4948–4961).
Kaviani, M., & Rahmani, H. (2020). Emhash: Hashtag recommendation using neural net-
work based on bert embedding. In 2020 6th International Conference on Web Research
(ICWR) (pp. 113–118). IEEE.
Khatri, J., Saini, N., & Bhattacharyya, P. (2021). Language relatedness and lexical closeness
can help improve multilingual nmt: Iitbombay@ multiindicnmt wat2021. In Proceedings
of the 8th Workshop on Asian Translation (WAT2021) (pp. 217–223).
Khemchandani, Y., Mehtani, S., Patil, V., & Awasthi, A. (2021). Exploiting language
relatedness for low resource language model adaptation: An indic languages study. In
ACL-IJCNLP Main Conference.
Kipf, T. N., & Welling, M. (2016). Semi-supervised classication with graph convolutional
networks. In International Conference on Learning Representations.
Kou, F.-F., Du, J.-P., Yang, C.-X., Shi, Y.-S., Cui, W.-Q., Liang, M.-Y., & Geng, Y. (2018).
Hashtag recommendation based on multi-features of microblogs. Journal of Computer
Science and Technology,33, 711–726.
Kumar, N., Baskaran, E., Konjengbam, A., & Singh, M. (2021). Hashtag recommendation
for short social media texts using word-embeddings and external knowledge. Knowledge
and Information Systems,63, 175–198.
Kurunkar, P., Sawant, O., Mene, P., & Varghese, N. (2022). An image-based hashtag recom-
mendation system as a social media workow tool. In 2022 International Conference on
Smart Generation Computing, Communication and Networking (SMART GENCON)
(pp. 1–5). IEEE.
Lei, K., Fu, Q., Yang, M., & Liang, Y. (2020). Tag recommendation by text classification
with attention-based capsule network. Neurocomputing, 391, 65–73.
Li, M., Gan, T., Liu, M., Cheng, Z., Yin, J., & Nie, L. (2019). Long-tail hashtag recom-
mendation for micro-videos with graph convolutional network. In Proceedings of the
28th ACM International Conference on Information and Knowledge Management (pp.
509–518).
Li, X., Wu, X., Luo, Z., Du, Z., Wang, Z., & Gao, C. (2023). Integration of global and local
information for text classification. Neural Computing and Applications, 35, 2471–2486.
Li, Z., Wang, X., Yang, W., Wu, J., Zhang, Z., Liu, Z., Sun, M., Zhang, H., & Liu, S. (2022).
A unified understanding of deep NLP models for text classification. IEEE Transactions
on Visualization and Computer Graphics, 28, 4980–4994.
Ma, R., Qiu, X., Zhang, Q., Hu, X., Jiang, Y.-G., & Huang, X. (2019). Co-attention memory
network for multimodal microblog’s hashtag recommendation. IEEE Transactions on
Knowledge and Data Engineering,33, 388–400.
Mao, Q., Li, X., Liu, B., Guo, S., Hao, P., Li, J., & Wang, L. (2022). Attend and select:
A segment selective transformer for microblog hashtag generation. Knowledge-Based
Systems,254, 109581.
Marreddy, M., Oota, S. R., Vakada, L. S., Chinni, V. C., & Mamidi, R. (2022). Multi-task
text classification using graph convolutional networks for large-scale low resource
language. In 2022 International Joint Conference on Neural Networks (IJCNN) (pp.
1–8). IEEE.
Mehta, S., Sarkhel, S., Chen, X., Mitra, S., Swaminathan, V., Rossi, R., Aminian, A., Guo,
H., & Garg, K. (2021). Open-domain trending hashtag recommendation for videos. In
2021 IEEE International Symposium on Multimedia (ISM) (pp. 174–181). IEEE.
Myers, S., Syrdal, H. A., Mahto, R. V., & Sen, S. S. (2023). Social religion: A cross-
platform examination of the impact of religious inuencer message cues on engagement–
the christian context. Technological Forecasting and Social Change,191, 122442.
Nama, V., & Deepak, G. (2023). Dtagrecpls: Diversification of tag recommendation for
videos using preferential learning and differential semantics. In Proceedings of the 14th
International Conference on Soft Computing and Pattern Recognition (SoCPaR 2022)
(pp. 887–898). Springer.
Padungkiatwattana, U., & Maneeroj, S. (2022). Pac-man: Multi-relation network in so-
cial community for personalized hashtag recommendation. IEEE Access,10, 131202–
131228.
Panchal, P., & Prajapati, D. J. (2023). The social hashtag recommendation for image
and video using deep learning approach. In Sentiment Analysis and Deep Learning:
Proceedings of ICSADL 2022 (pp. 241–261). Springer.
Pandey, K. K., & Jha, S. (2021). Exploring the interrelationship between culture and
learning: the case of english as a second language in india. Asian Englishes, (pp. 1–17).
Park, M., Li, H., & Kim, J. (2016). Harrison: A benchmark on hashtag recommendation
for real-world images in social networks. arXiv preprint arXiv:1605.05054, .
Pathak, M., & Jain, A. (2022). µboost: An effective method for solving Indic multilingual
text classification problem. In 2022 IEEE Eighth International Conference on
Multimedia Big Data (BigMM) (pp. 96–100). IEEE.
Peng, M., Lin, Y., Zeng, L., Gui, T., & Zhang, Q. (2019). Modeling the long-term post his-
tory for personalized hashtag recommendation. In Chinese Computational Linguistics:
18th China National Conference, CCL 2019, Kunming, China, October 18–20, 2019,
Proceedings 18 (pp. 495–507). Springer.
Perozzi, B., Al-Rfou, R., & Skiena, S. (2014). Deepwalk: Online learning of social rep-
resentations. In Proceedings of the 20th ACM SIGKDD international conference on
Knowledge discovery and data mining (pp. 701–710).
Pires, T., Schlinger, E., & Garrette, D. (2019). How multilingual is multilingual bert? In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
(pp. 4996–5001).
Rehman, M. Z. U., Mehta, S., Singh, K., Kaushik, K., & Kumar, N. (2023). User-aware
multilingual abusive content detection in social media. Information Processing & Man-
agement,60, 103450.
Sanghvi, D., Fernandes, L. M., D’Souza, S., Vasaani, N., & Kavitha, K. (2023). Fine-tuning
of multilingual models for sentiment classication in code-mixed indian language texts.
In Distributed Computing and Intelligent Technology: 19th International Conference,
ICDCIT 2023, Bhubaneswar, India, January 18–22, 2023, Proceedings (pp. 224–239).
Springer.
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). Distilbert, a distilled version of bert:
smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, .
Schlichtkrull, M., Kipf, T. N., Bloem, P., Van Den Berg, R., Titov, I., & Welling, M. (2018).
Modeling relational data with graph convolutional networks. In The Semantic Web:
15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018,
Proceedings 15 (pp. 593–607). Springer.
Tang, S., Yao, Y., Zhang, S., Xu, F., Gu, T., Tong, H., Yan, X., & Lu, J. (2019). An integral
tag recommendation model for textual content. In Proceedings of the AAAI Conference
on Artificial Intelligence (pp. 5109–5116). volume 33.
Wang, Y., Li, J., King, I., & Shi, M. R. L. S. (2019). Microblog hashtag generation via
encoding conversation contexts. In Proceedings of NAACL-HLT (pp. 1624–1633).
Wei, Y., Cheng, Z., Yu, X., Zhao, Z., Zhu, L., & Nie, L. (2019). Personalized hashtag recom-
mendation for micro-videos. In Proceedings of the 27th ACM International Conference
on Multimedia (pp. 1446–1454).
Yang, C., Wang, X., & Jiang, B. (2020a). Sentiment enhanced multi-modal hashtag recom-
mendation for micro-videos. IEEE Access,8, 78252–78264.
Yang, Q., Wu, G., Li, Y., Li, R., Gu, X., Deng, H., & Wu, J. (2020b). Amnn: Attention-
based multimodal neural network model for hashtag recommendation. IEEE Transac-
tions on Computational Social Systems,7, 768–779.
Yang, Z., & Lin, Z. (2022). Interpretable video tag recommendation with multimedia deep
learning framework. Internet Research,32, 518–535.
Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., & Hovy, E. (2016). Hierarchical attention
networks for document classification. In Proceedings of the 2016 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language
Technologies (pp. 1480–1489).
Zhang, S., Yao, Y., Xu, F., Tong, H., Yan, X., & Lu, J. (2019). Hashtag recommendation for
photo sharing services. In Proceedings of the AAAI Conference on Artificial Intelligence
(pp. 5805–5812). volume 33.
Zhang, X., Malkov, Y., Florez, O., Park, S., McWilliams, B., Han, J., & El-Kishky, A.
(2022). Twhin-bert: A socially-enriched pre-trained language model for multilingual
tweet representations. arXiv preprint arXiv:2209.07562, .
... Sarcasm, characterized by the use of irony to mock or convey contempt, adds layers of complexity to written and spoken language [1], [2]. Beyond its colloquial usage in everyday conversations, the ability to detect sarcasm holds profound implications for diverse applications [3], [4]. In social media, where brevity is a norm, understanding sarcastic remarks is essential for interpreting user sentiments accurately [5]. ...
Article
Full-text available
This study navigates the intricate landscape of sarcasm detection within the condensed confines of newspaper titles, addressing the nuanced challenge of decoding layered meanings. Leveraging natural language processing (NLP) techniques, we explore the efficacy of various machine learning models—linear regression, support vector machines (SVM), random forest, na¨ıve Bayes multinomial, and gaussian na¨ıve Bayes—tailored for sarcasm detection. Our investigation aims to provide insights into sarcasm within the succinct framework of newspaper titles, offering a comparative analysis of the selected models. We highlight the varied strengths and weaknesses of these models. Random forest exhibits superior performance, achieving a remarkable 94% accuracy in accurately identifying sarcasm in text. It is closely trailed by SVM with 90% accuracy and logistic regression with 83% accuracy.
... This process requires linguistic proficiency, cultural awareness, and the ability to effectively communicate the essence of the text in the translated language [2]. In the realm of English translation, information retrieval plays a vital role in ensuring the fidelity and clarity of the translated content, ultimately facilitating effective cross-cultural communication and understanding [3].Multilingual information retrieval (MIR) is a specialized field that focuses on retrieving relevant information from multilingual sources and presenting it in a coherent and understandable manner, particularly when translating into English [4]. In the context of translation, MIR involves accessing and processing information from texts written in different languages, analyzing their content, and extracting key information that needs to be translated accurately into English [5]. ...
Article
Full-text available
Multilingual information retrieval using graph neural networks offers practical applications in English translation by leveraging advanced computational models to enhance the efficiency and accuracy of cross-lingual search and translation tasks. By representing textual data as graphs and utilizing graph neural networks (GNNs), this approach captures intricate relationships between words and phrases across different languages, enabling more effective language understanding and translation. GNNs can learn complex linguistic structures and semantic similarities from multilingual corpora, facilitating the development of more robust translation systems that are capable of handling diverse language pairs and domains. The paper introduces a novel approach termed the Multilingual Ant Bee Optimization Graph Neural Network (MABO-GNN) for addressing optimization, classification, and multilingual translation tasks. MABO-GNN integrates ant bee optimization algorithms with graph neural networks to provide a versatile framework capable of optimizing objective functions, improving classification accuracy iteratively, and facilitating high-quality translations across multiple languages. Through comprehensive experimentation, the efficacy of MABO-GNN is demonstrated across various tasks, languages, and datasets. In optimization experiments, MABO-GNN achieves objective function values of 0.012, 0.015, 0.011, and 0.013 in Experiments 1 through 4, respectively, with convergence times ranging from 90 to 150 seconds. In classification tasks, the model exhibits notable performance improvements over iterations, with BLEU scores reaching 0.84 and METEOR scores reaching 0.78 in the fifth iteration. The translation results showcase BLEU scores of 0.85 for English, 0.82 for French, 0.79 for German, 0.81 for Spanish, and 0.75 for Chinese, indicating the model's proficiency in generating high-quality translations across diverse languages.
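The core GNN operation the abstract relies on (representing text as a graph and aggregating neighbor information) can be illustrated with one round of GCN-style normalized message passing over a tiny word graph. The graph, features, and weights below are made-up toy values, not anything from MABO-GNN itself.

```python
import numpy as np

# Adjacency with self-loops for a 4-node graph
# (e.g., words linked by cross-lingual co-occurrence)
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)

# Symmetric normalization: D^{-1/2} A D^{-1/2}
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt

X = np.random.default_rng(0).normal(size=(4, 8))  # node (word) features
W = np.random.default_rng(1).normal(size=(8, 3))  # learnable projection

# One message-passing layer: ReLU(A_hat @ X @ W)
H = np.maximum(A_hat @ X @ W, 0)
print(H.shape)  # (4, 3)
```

Each node's new embedding mixes its own features with its neighbors', which is what lets a GNN encode relational structure between words across languages.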
... The ongoing advancement in the Internet of Things (IoT) technology and deep learning presents a promising avenue to deliver intelligent, real-time, and personalized sports information dissemination services for individuals with disabilities by integrating these cutting-edge technologies [13], [14], [15]. IoT sensor technology facilitates the capture of intricate scene details, while deep learning models excel in comprehending and processing such complex data, thereby enhancing the overall sports viewing experience for individuals with disabilities [16], [17], [18]. ...
Article
Full-text available
The ever-growing landscape of Internet of Things (IoT) technology and the evolution of deep learning algorithms have ushered in transformative changes in the communication strategy for disseminating information on disabled sports. This specialized information resource aims to provide relevant support and services related to sports activities for disabled individuals. This study investigates the communication strategy of disabled sports information driven by deep learning within the framework of the IoT and assesses the practical application performance of the proposed model. To achieve this objective, an appropriate deep learning model for the dissemination of sports information for the disabled is selected through a thorough literature review. Subsequently, an experimental framework is proposed for comprehensive performance verification, evaluating the model’s performance in reasoning time and user satisfaction through comparative experiments. By constructing deep learning models, extensive data on disabled sports activities are analyzed, enabling the identification and prediction of key factors in information dissemination. The results indicate that the proposed sports information dissemination model outperforms similar models across various performance metrics, particularly in real-time performance and user experience. Comparative analysis with attention-based deep neural networks and traditional machine learning algorithms reveals that the proposed model achieves an accuracy rate as high as 0.85, significantly surpassing the 0.78 and 0.82 accuracies of these models, respectively. Moreover, the proposed model demonstrates the shortest inference time (15ms), surpassing both aforementioned models. This study validates the relative advantages of the proposed model through comparison with similar studies, offering a novel solution for the dissemination of sports information for the disabled.
... Foundational models refer to large-scale language models that serve as the basis or foundation for various downstream applications and tasks [2][3][4]. They have become a fundamental building block for a wide range of AI applications covering natural language understanding (text classification including sentiment analysis, spam detection and topic categorization, named entity recognition, language translation, etc.) [5][6][7][8], text generation (content creation, code generation for programming languages, etc.) [9,10], question answering, conversational AI [11,12], language summarization [13,14], content recommendation and moderation [15,16], search engines, web pages and documents ranking [17], and data extraction and knowledge graph creation [18,19]. ...
Article
Full-text available
In recent years, transformer-based models have played a significant role in advancing language modeling for natural language processing. However, they require substantial amounts of data and there is a shortage of high-quality non-English corpora. Some recent initiatives have introduced multilingual datasets obtained through web crawling. However, there are notable limitations in the results for some languages, including Spanish. These datasets are either smaller compared to other languages or suffer from lower quality due to insufficient cleaning and deduplication. In this paper, we present esCorpius-m, a multilingual corpus extracted from around 1 petabyte of Common Crawl data. It is the most extensive corpus for some languages with such a level of high-quality content extraction, cleanliness, and deduplication. Our data curation process involves an efficient cleaning pipeline and various deduplication methods that maintain the integrity of document and paragraph boundaries. We also ensure compliance with EU regulations by retaining both the source web page URL and the WARC shared origin URL.
Article
Full-text available
Social media has gained huge importance in our lives wherein there is an enormous demand of getting high social popularity. With the emergence of many social media platforms and an overload of information, attaining high popularity requires efficient usage of hashtags, which can increase the reachability of a post. However, with little awareness about using appropriate hashtags, it becomes the need of the hour to build an efficient system to recommend relevant hashtags which in turn can enhance the social popularity of a post. In this paper, we thus propose a novel method hashTag RecommendAtion for eNhancing Social popularITy to recommend context-relevant hashtags that enhance popularity. Our proposed method utilizes the trending nature of hashtags by using post keywords along with the popularity of users and posts. With the prevalent evaluation techniques of this field being quite unreliable and non-uniform, we have devised a novel evaluation algorithm that is more robust and reliable. The experimental results show that our proposed method significantly outperforms the current state-of-the-art methods.
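The idea sketched in the abstract (scoring candidate hashtags by combining keyword relevance with a popularity/trend signal) can be illustrated with a few lines of plain Python. The weighting scheme, data, and `trend` field are illustrative assumptions, not the paper's actual method or evaluation algorithm.

```python
def score_hashtags(post_keywords, hashtag_stats, alpha=0.7):
    """Rank hashtags by a weighted mix of keyword overlap and trendiness."""
    scores = {}
    for tag, info in hashtag_stats.items():
        # Fraction of the post's keywords covered by this hashtag's topic words
        overlap = len(post_keywords & info["keywords"]) / max(len(post_keywords), 1)
        scores[tag] = alpha * overlap + (1 - alpha) * info["trend"]
    return sorted(scores, key=scores.get, reverse=True)

stats = {
    "#worldcup": {"keywords": {"football", "goal", "match"}, "trend": 0.9},
    "#recipes":  {"keywords": {"cook", "dinner"},            "trend": 0.4},
}
print(score_hashtags({"football", "match"}, stats))  # ['#worldcup', '#recipes']
```

Tuning `alpha` trades off content relevance against riding a trending tag, which mirrors the popularity-versus-relevance tension the paper addresses.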
Article
Full-text available
Despite growing efforts to halt distasteful content on social media, multilingualism has added a new dimension to this problem. The scarcity of resources makes the challenge even greater when it comes to low-resource languages. This work focuses on providing a novel method for abusive content detection in multiple low resource Indic languages. Our observation indicates that a post’s tendency to attract abusive comments, as well as features such as user history and social context, significantly aid in the detection of abusive content. The proposed method first learns social and text context features in two separate modules. The integrated representation from these modules is learned and used for the final prediction. To evaluate the performance of our method against different classical and state-of-the-art methods, we have performed extensive experiments on SCIDN and MACI datasets consisting of 1.5M and 665K multilingual comments, respectively. Our proposed method outperforms state-of-the-art baseline methods with an average increase of 4.08% and 9.52% in the F1 score on SCIDN and MACI datasets, respectively.
Chapter
Video tag recommendation is not just necessary but also mandatory in the present-day scenario where multimedia content, specifically videos, is trending and becoming viral on the internet. In this paper, a video recommendation framework DTagRecPLS that is semantically driven and knowledge-centric has been proposed. It extracts the categories from the video dataset and enriches them by subjecting them to Latent Semantic Indexing. The proposed framework is ontology centered as ontology alignment to the enriched categories of the videos with that of the standard domain ontologies has been achieved and moreover, ontology-driven knowledge harvesting from differential heterogeneous knowledge stores has been used to enrich the number of instances and format them into an enriched knowledge pool. The model encompasses a deep learning framework namely the Convolutional Neural Network to classify video datasets from the perspective of using the actual video and image features, while the Logistic Regression Classifier classifies the dataset by extracting entities from the enriched feature pool with a perspective of annotations and labels. The common incidences are then used to compute the semantic similarity with that of the enriched knowledge pool and are sent for ranking and review. The DTagRecPLS gives the highest precision of 97.09%, highest average recall of 98.72%, highest overall accuracy and F-Measure of 97.91% and 97.90% respectively, and an overall average lowest False Discovery Rate of 0.03.
Keywords: Video tag recommendation; Preferential learning; Differential semantics; Diversification; Convolutional neural networks
Article
Religion is a key factor in how American consumers spend their time and money. It serves as a significant component of the U.S. economy, with religiously affiliated people contributing trillions to the economy annually. The majority of religious consumers in the U.S. are Christians, making them a critical segment for marketers. Influencer marketing, which involves the use of endorsements from individuals with large social media followings, has emerged as an effective advertising tactic for reaching Christians on social media. However, there is a lack of research exploring the complexities of religion in advertising messaging, especially in the context of influencer marketing. To fill this gap, we apply the social identity, persuasion knowledge, and symbolic interactionism theories to propose relationships between message cues in Christian influencers' social media posts and follower engagement. We analyzed 20,068 Facebook posts, 20,517 tweets, and 13,857 Instagram posts to determine the impact of three categories of message cues on engagement. Across multiple studies, key findings indicate religious and promotional cues increase and decrease engagement across platforms, respectively. The impact of social media cues, such as hashtags and mentions, differs depending on the platform.
Article
The development of deep neural networks and the emergence of pre-trained language models such as BERT allow to increase performance on many NLP tasks. However, these models do not meet the same popularity for tweet stream summarization, probably because their computational limitations require drastically truncating the textual input. Our contribution in this article is threefold: (1) we propose a neural model to automatically and incrementally summarize huge tweet streams. This extractive model combines in an original way pre-trained language models and vocabulary frequency-based representations to predict tweet salience. An additional advantage of the model is that it automatically adapts the size of the output summary according to the input tweet stream, (2) we detail an original methodology to construct tweet stream summarization datasets requiring little human effort, and (3) we release the TES 2012-2016 dataset constructed using the aforementioned methodology. Baselines, oracle summaries, gold standard, and qualitative assessments are made publicly available. To evaluate our approach, we conducted extensive quantitative experiments using three different tweet collections as well as an additional qualitative evaluation. Results show that our method outperforms state-of-the-art ones. We believe that this work opens avenues of research for incremental summarization, which has not received much attention yet.
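The vocabulary-frequency side of the salience signal mentioned above can be sketched in plain Python: score each tweet by the mean corpus frequency of its words and keep the most salient one. The tweets and the one-tweet "summary" are toy assumptions; the paper's model additionally combines this with pre-trained language model representations.

```python
from collections import Counter

tweets = [
    "earthquake hits the coast",
    "major earthquake felt across the coast tonight",
    "my lunch was great",
]

# Corpus-level word frequencies act as the salience signal
freq = Counter(w for t in tweets for w in t.split())

def salience(tweet):
    """Mean corpus frequency of the tweet's words."""
    words = tweet.split()
    return sum(freq[w] for w in words) / len(words)

# Keep the most salient tweet as a minimal extractive summary
summary = max(tweets, key=salience)
print(summary)
```

In a streaming setting the `freq` counter would be updated incrementally as tweets arrive, which is what makes frequency-based representations cheap enough to pair with heavier language-model features.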
Chapter
We use XLM (Cross-lingual Language Model), a transformer-based model, to perform sentiment analysis on Kannada-English code-mixed texts. The model was fine-tuned for sentiment analysis using the KanCMD dataset. We assessed the model’s performance on English-only and Kannada-only scripts. Also, Malayalam and Tamil datasets were used to evaluate the model. Our work shows that transformer-based architectures for sequential classification tasks, at least for sentiment analysis, perform better than traditional machine learning solutions for code-mixed data.
Keywords: DICT-MLM; Task adaptive pre-training; Domain adaptive pre-training; Transfer learning; Transductive transfer; LSTM; Pseudo labelling
Chapter
There has been a lot of interest in the recent year in recommending hashtags for images/videos or posts on social media. Several researchers have researched the impact from numerous perspectives. In this paper, we enhance tag recommendation by recommending suitable hashtags considering both the content of the image/video and the user’s hashtag history. On social media image/video-sharing websites (such as Facebook, Instagram, Flickr, and Twitter), users can upload images or videos and annotate them with tags. The proposed method generates candidate keywords, i.e., hashtags, by combining techniques for textual tags, image and video activity/object recognition content, and acoustic data. To this end, this paper examines different methodologies that associate multi-modal information and suggests hashtags so that image or video uploaders can generate tags for their images or videos. Although a substantial amount of study has been carried out on item/product recommendations for E-commerce websites, video recommendations for YouTube and Netflix, and friend suggestions on social media websites, research has not been carried out as much on hashtag recommendations for images/videos on social media platforms/apps/websites, which have now turned out to play a vital role on these social media platforms. Here, this paper provides an overview of hashtag recommendation for images/videos.