Multimedia Tools and Applications
https://doi.org/10.1007/s11042-023-15792-1
ConFake: fake news identification using content based features
Mayank Kumar Jain¹ · Dinesh Gopalani¹ · Yogesh Kumar Meena¹
Received: 19 October 2021 / Revised: 4 May 2023 / Accepted: 6 May 2023
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023
Abstract
During India's COVID-19 lockdown, which lasted from March to June 2020, a majority of users moved onto the Internet and created numerous social networking accounts. A massive amount of information is now disseminated on the Internet via these accounts. Some false or fake information, in the form of "government letters or resolutions, religious comments, hate speech, and so on", has spread like wildfire. As a result, major social issues have arisen in areas such as unemployment, politics, healthcare, poverty, and religious cleavages. Given the vast volume of such information, manual detection of fake news or false information is challenging, and the problem demands automatic detection of false news. With this motivation, we present a novel 'ConFake' algorithm, which uses a set of eighty content-based features to identify fake news. Content-based and word vector features extracted from the textual content of news stories were used in the experiment; these features were combined and fed into machine learning classifiers. To validate the experimental findings, we ran all experiments on five publicly available datasets and one synthetically generated ConFake dataset that combines the five, namely Kaggle, McIntire, Reuter, BuzzFeed, and PolitiFact. The proposed model achieved the highest accuracy of 97.31% compared to other cutting-edge models.
Keywords Social media · Machine learning · Fake news · Linguistic features · Word embedding
✉ Mayank Kumar Jain
mayank261288@gmail.com
Dinesh Gopalani
dgopalani.cse@mnit.ac.in
Yogesh Kumar Meena
ymeena.cse@mnit.ac.in
1 Department of Computer Science and Engineering, Malaviya National Institute of Technology, Jaipur 302017, Rajasthan, India
1 Introduction
People are increasingly exposed to false news, since users can easily create profiles on social media sites to connect and talk with friends and to publish posts or tweets. The proportion of false information is growing by the day. Various kinds of false information exist, such as satire [39], hoaxes, clickbait [28], rumours [31], stances, fake reviews, and fake news [67]. The spread of false information through channels such as social media posts, tweets, blogs, and online newspapers has made it necessary to identify credible news originators. Fake news can exist in any format and spread through word of mouth, text in regional languages, instant messages, posts, photos, images, videos, short videos, audio, clippings, etc. As a result of the spread of fake news or false information, society may face dangerous situations such as mob lynching, risk to human life due to rumours and incorrect information about healthcare and medicine, and social and political confusion.
Fake news is disseminated under the pretence of being real, generally via news sources
on the Internet, to gain political or financial advantage, increase readership, and sway public
opinion [47]. It can potentially harm an agency, company, or individual, or bring financial or political profit [33].
The efficiency of accessing and sharing knowledge with others makes online social networks enticing. However, instantaneous data scattering at a high pace with little effort makes possible the widespread dissemination of false information, such as fake news [57].
According to Schwarz et al. [45], people generally assess the truth of a claim using the following criteria, each of which may be evaluated analytically or intuitively: consensus (do other people believe it?), consistency (is it in line with what I already know?), coherence (is the story internally consistent and plausible?), credibility (does it come from a trustworthy source?), and support (is there ample evidence to back it up?).
For the reasons stated, identifying false news on social media has become difficult. First, gathering and labelling fake information from the Internet is difficult because information shared on social networking sites is considered private data [44]. Second, those who write fake news alter their wording in order to spread false information. Third, the limited number of text-representation approaches makes it difficult to detect false news.
Figures 1 and 2 depict examples of true news¹ and disinformation² on social networks, respectively. These images are taken from the TruthOrFiction fact-checking website [56].
Many fact-checking organisations, such as PolitiFact [36], TruthOrFiction [56], Snopes [54], FactCheck [10], FullFact [13], HoaxSlayer [19], and VishwasNews [60], are fighting false news. These organisations operate on a conventional journalistic paradigm, in which reporters must analyse the facts to determine the validity of a news clip. This method is not automated and takes extra time [17]. Moreover, to identify false news, many writers employ Machine Learning (ML) models [5,7,22,26,29,41,48] using a variety of features, and many researchers utilise content-based characteristics [64] to improve accuracy. These problems raise several research concerns, and the problem's complexity necessitates innovative and solid answers. The following Research Questions (RQs) inspire this research:
RQ1: Which Linguistic Feature (LF) set is vital for detecting false news with high accuracy?
¹ Source: https://www.truthorfiction.com/u-s-military-dogs-being-evacuated-from-afghanistan/
² Source: https://www.truthorfiction.com/no-a-study-didnt-find-that-the-most-highly-educated-americans-are-also-the-most-vaccine-hesitant/
Fig. 1 Example of True news
RQ2: Which method of Word Embedding (WoE) with a set of LFs is best for finding fake
news?
RQ3: Which ML approach is best for detecting false news on available datasets?
To address the research questions (RQs) above, we suggest a three-phase approach centred on the textual content of articles:
1. Use the LF set to detect fake stories.
2. Use WoE with a set of LFs to improve the ConFake dataset’s ability to spot fake news.
3. Contrast this model with state-of-the-art algorithms.
Fig. 2 Example of Disinformation
This study requires no metadata related to the user or the media to detect false news. In
general, the following are the work’s major contributions:
a. ConFake Dataset: To minimize overfitting of ML models and facilitate better training, we created a larger ConFake dataset. It combines five datasets: Kaggle, Reuter, McIntire, BuzzFeed, and PolitiFact. This novel dataset of 72,413 news articles contains 35,396 true news articles and 37,479 false news articles.
b. Extensive Feature Collection: Using state-of-the-art approaches, we gather a wide range of language characteristics and identify, via Pearson correlation, a feature subset that works well on the ConFake dataset.
c. ConFake Model: An ensemble technique is applied to WoE combined with LFs using multiple ML methods.
The balance of this work is organized as follows: a survey of relevant work on linguistic characteristics and deception detection is found in Section 2. The LF categories and word vector features are then presented in Section 3. Section 4 outlines the classification process pipelines and the tools for each assessment step. Following that, Section 5 presents the algorithms for detecting fake news using linguistic and word vector characteristics. In contrast, Section 6 frames the performance of ML classifiers on the ConFake dataset and compares the testing findings with related works. Finally, Section 7 concludes the article and emphasises future work.
2 Related work
Fake news, satire [39], misinformation [58], rumour [24], hoaxes, disinformation, propaganda, and opinion spam are various categories of false information. These categories are not mutually exclusive, and numerous researchers have used them in different scenarios. This study focuses on several false news detection methods.
Pawan et al. [57] proposed the WELFake model, in which WoE vector features are combined with LFs to detect fake news. The work proceeds in phases. In the first phase, they created a novel dataset of approximately 72,000 news articles by combining four datasets: Kaggle, McIntire, Reuter, and BuzzFeed political news. In the second phase, they extracted 87 features from the literature and, after applying Pearson correlation with a threshold of 0.7, selected 20 features. In the third phase, they combined the LF 1, LF 2, and LF 3 sets with the Term Frequency (TF) matrix and fed them into a voting classifier. In the last phase, they applied a hard voting classifier to the Term Frequency-Inverse Document Frequency (TF-IDF) vector, the TF vector, and the previous voting classifier's result. They used six ML classifiers, including Support Vector Machine (SVM), Naive Bayes (NB), K-Nearest Neighbour (KNN), Decision Tree (DT), Bagging, and Boosting, and found that SVM provided 96.73% accuracy. The model's accuracy was 1.31% greater than that of the Bidirectional Encoder Representations from Transformers (BERT) model and 4.31% greater than that of the Convolutional Neural Network (CNN) model. A limitation of the WELFake model is the random selection of LFs from the 20 features. Additionally, they did not identify fake news stories using user-centred traits such as registration, account age, posting frequency, and social connections.
Anshika et al. [9] introduced an adequate deep learning model based on textual features to detect false news. Their research employed the syntactic, grammatical, semantic, and readability features of news articles. After extracting these characteristics, the authors created two feature sets: a combination of syntactic, semantic, and grammatical features, and a readability feature set. Character count, word count, title word count, stop word count, uppercase word count, word density, polarity, subjectivity, #nouns, #verbs, #adjectives, #pronouns, and #adverbs are all included in the first set. The second set comprises the Flesch Reading Ease, Automated Readability Index, Gunning Fog Index, Coleman-Liau score, Flesch-Kincaid score, SMOG Index, and Linsear Write formula. They used two small datasets in the experiment: BuzzFeed political news and Random political news. After preprocessing the datasets, they extracted the features and fed each feature set, independently and jointly, into a Long Short-Term Memory (LSTM) model for false news classification. In this trial, they achieved an accuracy of 86%. In addition, they applied base-learner ML models such as linear SVM, Gaussian SVM, Gaussian NB, and Kernel NB to the datasets, with Gaussian SVM achieving a maximum accuracy of 70%. This feature-based sequential model has only been tested on small datasets. Moreover, the authors did not consider temporal factors such as posting time, popular topics, the story's lifespan, and the consistency of the story.
Saqib Hakak et al. [18] proposed an ensemble ML model to detect fake news. This work used the DT, Random Forest (RF), and Extra Trees classifiers in ensemble form and applied them to the Information Security and Object Technology (ISOT) [6] and Liar [61] datasets. They achieved training and testing accuracies of 99.8% and 44.15%, respectively, on the Liar dataset. Moreover, on the ISOT dataset, they achieved a training and testing accuracy of 100%. The authors extracted 26 LFs to detect fake news, such as word count, character count, sentence count, average word length, average sentence length, sentiment score, and named entity recognition. The limitation of their work is the low testing accuracy on the Liar dataset.
H. Reddy et al. [40] identified fake news by using an ensemble ML method on stylometric and word vector features, achieving an accuracy of 95.49% on a combination of two datasets: FakeNewsNet [50] and McIntire [32]. They extracted 50 LFs and word-vector features such as TF, TF-IDF, Continuous Bag of Words (CBOW), and Skip-gram. After combining the LF vector and word vector, they applied ML classifiers such as Logistic Regression (LR), SVM, NB, RF, KNN, Gradient Boosting (GB), Adaboost, and Bagging. The small size of the dataset, as well as the overfitting of SVM on the LFs, are drawbacks of this study.
X. Zhou et al. [69] proposed a fake news prediction model on the PolitiFact and BuzzFeed datasets. The authors extracted lexical, syntactic, semantic, and discourse-level characteristics from news article text and fed this feature vector to ML models such as LR, NB, SVM, RF, and XGBoost, individually or in combination. On BuzzFeed, they obtained an accuracy of 87.9%, while on the PolitiFact dataset they reached an accuracy of 89.2%. This investigation used disinformation-related and clickbait-related features to identify fake news. This model's downside is the small number of articles in the corpus; the datasets also contain images that were not used in the experiment.
Yang et al. [65] presented a two-level CNN model to classify fake news. Their work involves the images and text of 20,015 news articles from 240 approved news websites. The authors extracted latent and explicit features from the text and images and concatenated both feature sets into one; this combined feature set classifies the news as real or fake. Based on latent and explicit features, they obtained precision, recall, and F1-score of 92%, better than a text-only LSTM. Since the authors used only a small dataset, user-specific factors such as account age, posting frequency, social connections, and registration status were not taken into account.
V. Pérez-Rosas et al. [35] used major LFs (e.g., n-grams, punctuation, psycho-linguistics, readability, and syntax) and achieved an accuracy of 76% on two novel datasets, FakeNewsAMT and Celebrity, which cover seven different domains. The limitations of this work are the small number of features used to detect fake news and the low accuracy of the models.
G. Gravanis et al. [17] introduced a new unbiased dataset of 3,004 articles by incorporating articles from four datasets (Kaggle-EXT, McIntire, BuzzFeed, and PolitiFact). They extracted 57 LFs and, feeding them together with WoE into ML classifiers (viz., SVM, NB, DT, KNN, Adaboost, and Bagging), achieved the highest accuracy of 94.9% using SVM. They also evaluated their approach on the individual corpora (i.e., Kaggle-EXT, BuzzFeed, PolitiFact, and McIntire), achieving accuracy rates of 99.0%, 72.70%, 84.7%, and 81%, respectively. Their investigation did not incorporate user-based features.
On the Kaggle dataset, R. Kaliyar et al. [27] proposed a deep CNN to classify news as false or authentic. They utilized pre-trained GloVe word embeddings to extract latent features and assessed performance in terms of accuracy, precision, recall, and F1-score. They also measured the True Negative Rate and the False Positive Rate, achieving 98.36% accuracy on text. Their investigation did not incorporate images. Ghanem et al. [14] used an emotion-infused neural network to identify different categories, such as propaganda, hoaxes, clickbait, and satire, on a news article and Twitter dataset. After extracting latent and content features of the text, they achieved a maximum of 96% accuracy on the clickbait dataset. They did not incorporate images into their investigation and used only short text. Shah et al. [46] used multimodal data to identify fake news on the Weibo and Twitter datasets, extracting sentiment-related features and performing image segmentation. Their experiment used a cultural optimization algorithm to achieve accuracies of 79.8% on Twitter and 89.01% on Weibo. Their investigation did not incorporate user-based features and used only short text.
Most authors apply supervised ML models [15,41,42,52,69] to various types of features, such as TF, TF-IDF, N-grams [64], LFs [9,31,35,57,69], and readability features [9,31,35,57,69], to detect fake news. Shu et al. [49] proposed the FakeNewsTracker (FNT) tool to collect, detect, and visualize fake news. Apart from that, many researchers used deep neural network models [16,63], such as CNN [3], Recurrent Neural Network (RNN) [3], BERT, and hybrid models [43], to extract latent rather than explicit features from news articles. A few authors proposed multi-modal approaches [25,37] such as SpotFake [53], Multimodal Variational Autoencoder (MVAE) [30], Event Adversarial Neural Network (EANN) [62], and SAFE [66] to detect fake news. Vishwakarma et al. [59] use scraping to analyze and detect false news and to verify its validity on the web. The spread of fake news on social media gained prominence following the 2016 US election [4]; however, false information is language-dependent, so several authors [12,51] developed models to recognize fake news in multiple languages. For false news identification, Huang et al. [21] presented an ensemble learning approach that combines four distinct models: embedding LSTM, depth LSTM, Linguistic Inquiry and Word Count (LIWC) CNN, and N-gram CNN. Furthermore, the ensemble model's optimal weights were calculated using the Self-Adaptive Harmony Search (SAHS) method, obtaining a higher accuracy of 99.4% in false news identification.
The following problems were identified in the works mentioned above:
A. In references [35,65,69], the suggested models do not achieve strong performance.
B. The authors used relatively few features.
C. The datasets were too limited and drawn from different domains.
3 Feature engineering
This section describes the feature engineering applied in the proposed algorithm, used to improve learning performance by improving dataset quality. Processing a dataset without checking for critical feature engineering issues can lead to incorrect conclusions about the model's accuracy. As a result, data preprocessing in terms of feature engineering must be addressed before conducting an analysis. The steps for completing the feature engineering task in this work are as follows:
3.1 Linguistic feature extraction
LFs are extracted from the textual content at the level of characters, words, sentences, paragraphs, and documents [17,23,33,40,65,69]. The main objective of feature extraction is to create a feature set that captures the relevant information in the actual dataset, increasing the speed of model training and improving learning accuracy. Feature extraction also helps with data visualization. We extracted 80 LFs after studying various linguistics-based articles that help detect fake news. Most of the features fall into the lexical, syntactic, and semantic-level categories; some features from sentiment, psycho-linguistics, readability, etc., are also used. We have used 12 categories of features to identify fake news or false information. These categories are described below:
3.1.1 Quantity features
Quantity features count the characters, words, sentences, paragraphs, syllables, articles, verbs, adverbs, adjectives, stop words, capital letters, self-references, group-references, and so on to capture speech information. The document was split up to count the number of characters, uppercase letters, words, sentences, paragraphs, and syllables, while a few characteristics, such as self-referencing and group-referencing, were evaluated by matching each word. Table 1 shows the 'quantity features' as well as the tools used to extract them.
Table 1 Quantity features
Self-programmed in Python: #Characters, #Words, #Sentences, #Paragraphs, #Uppercase letters, #Self-referencing, #Group-referencing
LIWC dictionary: #Syllables, #Articles, #Verbs, #Adverbs, #Adjectives
Natural Language Toolkit (NLTK) library: #Stopwords
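As an illustration of how the self-programmed counts can be obtained, the sketch below derives a few of the quantity features in Table 1 with NLTK. The helper function and its feature names are hypothetical; the paper's own scripts are not published, and the LIWC-based counts (syllables, parts of speech) would additionally require the proprietary LIWC dictionary.

```python
# Hedged sketch: a few quantity features from Table 1, computed with NLTK.
# The function name and dictionary keys are illustrative, not the paper's code.
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download("punkt", quiet=True)       # sentence/word tokenizer models
nltk.download("stopwords", quiet=True)   # English stop-word list
STOPWORDS = set(nltk.corpus.stopwords.words("english"))

def quantity_features(text: str) -> dict:
    words = word_tokenize(text)
    return {
        "n_characters": len(text),
        "n_words": len(words),
        "n_sentences": len(sent_tokenize(text)),
        "n_paragraphs": sum(1 for p in text.split("\n") if p.strip()),
        "n_stopwords": sum(w.lower() in STOPWORDS for w in words),
        "n_uppercase_words": sum(w.isupper() for w in words),
    }

print(quantity_features("BREAKING news!\nOfficials deny the viral claim."))
```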
3.1.2 Complexity features
The complexity features shown in Table 2 were used to measure the complexity of the news article. Table 2 also lists the tools used to extract them.

Table 2 Complexity features
Self-programmed in Python: average number of words per sentence, average number of characters per word, average number of syllables per word, average number of sentences per paragraph
3.1.3 Uncertainty features
Uncertainty features measure the ratio of modal verbs, certainty terms (e.g., always), generalising terms, tentative terms (e.g., perhaps), numbers, question marks, ellipses (...), hashtags, etc. We used a set of tokens to count features such as numbers, question marks, ellipses, and hashtags; the remaining terms were measured with the LIWC 2007 dictionary. Table 3 shows the 'uncertainty features' as well as the tools used to extract them.
3.1.4 Subjectivity features
When a headline story becomes biased, its quality should be considered lower because it no longer maintains impartiality. Subjectivity expresses an opinion, thought, or a person's sentiment and ranges from 0 (negative) to 1 (positive). We assess the subjectivity of news articles by calculating the ratio of factual verbs (e.g., observe), report verbs (e.g., proclaim), subjective verbs, motion verbs (e.g., move, shift), and greeting words. Motion verbs, subjective verbs, and factual verbs were measured with the help of the LIWC 2007 dictionary, while for greeting and reporting verbs we created a dictionary and matched each word. Table 4 shows the 'subjectivity features' as well as the tools used to extract them.
Table 3 Uncertainty features
LIWC dictionary: certainty, generalizing terms, tentative terms
Self-programmed in Python: numbers, question marks, hashtags, ellipses (...)
Table 4 Subjectivity features
LIWC dictionary: factual verbs, motion verbs, subjective verbs
Self-programmed in Python: greeting verbs, reporting verbs
3.1.5 Non-immediacy features
Non-immediacy features assess first-person, second-person, and third-person pronouns. The LIWC 2007 dictionary was used to measure them. Table 5 shows the 'non-immediacy features' as well as the tools used to extract them.
3.1.6 Sentiment features
Sentiment features in news content suggest a distinction between fake and true news. We calculate the sentiment of each news story by counting positive words, negative words, exclamation marks, anger words, anxiety words, sadness words, etc. These terms are measured with the help of the LIWC 2007 dictionary. Table 6 shows the 'sentiment features' as well as the tools used to extract them.
3.1.7 Diversity features
Diversity features measure lexical diversity, the number of unique words, content words (e.g., nouns, verbs, adverbs, and adjectives), and function words (e.g., pronouns, prepositions, determiners, and conjunctions). At the highest level, these diversity features can be evaluated by looking into the writing expression of news authors. Among these features, we counted the unique words by removing duplicates from the set of words, while the remaining terms were measured using the LIWC 2007 dictionary. Table 7 shows the 'diversity features' as well as the tools used to extract them.
3.1.8 Informality features
Informality comprises five dimensions: swear words (e.g., dammit), netspeak (e.g., btw, hahaha), assents (e.g., OK), nonfluencies (e.g., hm, hmmm), and fillers (e.g., I mean, you know). To assess the casualness of each news article, we consider how frequently each word or phrase appears across all dimensions. Table 8 shows the 'informality features' as well as the tools used to extract them.
Table 5 Non-immediacy features
LIWC dictionary: #First-person pronouns, #Second-person pronouns, #Third-person pronouns
Table 6 Sentiment features
LIWC dictionary: positive words, negative words, exclamation marks, anger words, anxiety words, sadness words
3.1.9 Specificity features
Specificity features consist of ratios of temporal, spatial, and sensory information, causation terms (e.g., because), exclusive terms, cognitive processes (i.e., insight), and perceptual processes (see, hear, and feel). These terms are measured with the help of the LIWC dictionary. Table 9 shows the 'specificity features' as well as the tools used to extract them.
3.1.10 Readability features
The readability features define the sentence complexity of the textual content; using them, we identified the grade level of the text writer. We measured the readability features [9,17,23] on the text using the textstat Python library to identify fake news. The readability features are the Flesch Reading Ease Index, the Flesch-Kincaid formula, the Automated Readability Index, the Coleman-Liau formula, the SMOG Index, the Gunning Fog Index formula, the New Dale-Chall formula, and the Linsear Write formula. Table 10 shows the 'readability features' as well as the tools used to extract them.
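For instance, all eight readability scores in Table 10 can be obtained from the textstat library roughly as follows; the function names are textstat's public API, although the exact calls the authors used are not stated.

```python
# Readability features of a text via textstat (pip install textstat).
import textstat

text = "The committee released its findings after a lengthy investigation."
scores = {
    "flesch_reading_ease": textstat.flesch_reading_ease(text),
    "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
    "automated_readability_index": textstat.automated_readability_index(text),
    "coleman_liau_index": textstat.coleman_liau_index(text),
    "smog_index": textstat.smog_index(text),
    "gunning_fog": textstat.gunning_fog(text),
    "dale_chall": textstat.dale_chall_readability_score(text),
    "linsear_write": textstat.linsear_write_formula(text),
}
print(scores)
```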
3.1.11 Writing pattern features
Writing pattern features focus on the text's writing style by counting the number of special characters (e.g., ?, !, quotation marks, #, @, etc.), short words (fewer than four characters), long words (more than 15 characters), and so on. To measure these features, we performed the calculation on sets of characters and tokens. Table 11 shows the 'writing pattern features' as well as the tools used to extract them.

Table 7 Diversity features
LIWC dictionary: content words, function words, pronouns, prepositions, determiners, conjunctions
Self-programmed in Python: lexical diversity, #Types (unique words), #Tokens, #Polysyllables

Table 8 Informality features
LIWC dictionary: swear words, netspeak words, assents, non-fluencies, fillers

Table 9 Specificity features
LIWC dictionary: temporal words, spatial words, causation terms, exclusive terms, cognitive processes, perceptual processes

Table 10 Readability features
Textstat Python library: Flesch Reading Ease Index, Flesch-Kincaid Grade Level, Automated Readability Index, Coleman-Liau formula, SMOG Index, Gunning Fog Index formula, New Dale-Chall formula, Linsear Write formula

Table 11 Writing pattern features
Self-programmed in Python: No. of '?', No. of '!', No. of single quotes, No. of '#', No. of '@', #Big words (more than 15 characters), #Short words (fewer than 4 characters), No. of double quotes, No. of ellipses
3.1.12 Psycho-linguistic features
Psycho-linguistic features estimate the text polarity, which pertains to positive and negative assertions, and the subjectivity; polarity falls between -1 and 1. We used the textblob Python library for text polarity and subjectivity. Table 12 shows the 'psycho-linguistic features' as well as the tools used to extract them.
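As a small sketch, the two features in Table 12 map directly onto TextBlob's sentiment property, where polarity lies in [-1, 1] and subjectivity in [0, 1].

```python
# Text polarity and subjectivity via TextBlob (pip install textblob).
from textblob import TextBlob

sentiment = TextBlob("The so-called report is an absolute disgrace.").sentiment
print(sentiment.polarity, sentiment.subjectivity)  # negative polarity, high subjectivity
```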
Many authors incorporate the above categories of LFs in their work; these features help to find cues in the textual content of news articles. In this work, the LIWC module [8,34,55,68] extracts the uncertainty, subjectivity, non-immediacy, sentiment, diversity, informality, and specificity features. Moreover, the textstat Python library extracted the readability features [9] of the news articles, and the textblob Python module was used for text polarity and subjectivity. We also implemented features such as quantity, complexity, greeting words, fillers, and report verbs that improve the model's performance.
3.2 Linguistic feature selection
In this process, we selected the least correlated features to classify the news, which reduces the number of features, decreases the computation required, and improves the accuracy of the ML models. We use the Pearson correlation (corr) for feature selection. The corr exhibits the strength of the relationship between features and measures the dependency among them. The corr between two features is calculated using (1), where x and y represent the feature vectors and x̄, ȳ represent the means of x and y, respectively. The corr ranges from -1 to +1, where -1 represents a negative corr and +1 a positive corr [9].
corr = \frac{\sum_{i=0}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=0}^{n}(x_i - \bar{x})^2 \sum_{i=0}^{n}(y_i - \bar{y})^2}}    (1)
We calculated the correlation between every pair of LFs within the same category and discarded features with a corr higher than 0.7, 0.8, or 0.9, values that indicate a strong positive linear relationship [38] between two features. From this, we obtained a constant correlation matrix; we then selected the feature pairs whose correlation values are above the threshold and dropped one feature of each pair. If one variable is measured more consistently or has stronger evidence of construct validity, it may be better to keep that variable, even when its correlation with the outcome is similar to that of the other variable (e.g., both 0.95), rather than dropping one of the two arbitrarily. Finally, a thorough examination of the theoretical and empirical support for each variable ought to be used to decide which variable to keep.
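A minimal pandas sketch of this pruning step is shown below, assuming `features` is a DataFrame holding the 80 LFs; the threshold of 0.7 mirrors the LF 1 set, and the function name is illustrative.

```python
# Hedged sketch of the correlation-based feature pruning of Section 3.2.
import numpy as np
import pandas as pd

def drop_correlated(features: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    corr = features.corr().abs()  # |Pearson correlation| between all feature pairs
    # Inspect only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)  # e.g. lf1 = drop_correlated(features)
```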
Table 12 Psycho-linguistic features
Textblob Python library: text polarity, subjectivity
3.3 Word embedding
WoE is employed to convert plain text into a numeric vector, because ML models cannot directly handle textual content. In our survey, we found two forms of WoE: content-based WoE, such as TF and TF-IDF, which concentrates on prior knowledge, and context-based WoE, such as Word2Vec, GloVe, and FastText, which focuses on textual writing patterns. A faker alters the news by repeating identical terms; for example, various false news reports and conspiracy theories disseminated on Twitter during the 2020 U.S. presidential election employed similar words and phrases to seem more credible and gain momentum on the platform.
3.3.1 Term frequency/Count vector
TF converts the text into a histogram vector that represents the frequency of each word in the document; the length of the vector is defined by the vocabulary of unique words. The formula for calculating TF is shown in (2).

TF = \frac{F_t}{T}    (2)

where F_t = number of times the term appears in the document and T = total number of terms in the document.
3.3.2 Term frequency-inverse document frequency
TF-IDF is also known as "normalised frequency". It is an extended version of TF that reflects the importance of words that occur in fewer documents. Equation (4) shows the formula for the normalised frequency, which is the product of (2) and (3).

IDF = \log_2\left(\frac{D}{t_D}\right)    (3)

where D = total number of documents and t_D = number of documents in which the token is present.

TF\text{-}IDF = TF \times IDF    (4)
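In practice, both vectors can be produced with scikit-learn, as sketched below. Note that sklearn's TfidfVectorizer uses a smoothed natural-log IDF rather than the log2 form of (3), so it approximates rather than reproduces the formula above; the paper does not name the vectorizer implementation it used.

```python
# TF (count) and TF-IDF vectors with scikit-learn; a sketch only.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["fake story spreads fast", "officials deny the fake story"]
tf = CountVectorizer().fit_transform(docs)     # raw term counts per document
tfidf = TfidfVectorizer().fit_transform(docs)  # counts reweighted by IDF
print(tf.shape, tfidf.shape)                   # (2, vocabulary size)
```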
4 ConFake model
The four phases of the fake news detection method shown in Fig. 3 are dataset preparation, feature engineering, feature concatenation, and classification. These phases involve collecting and cleaning data, extracting relevant features, combining them to create a feature vector, and using machine learning algorithms to classify news articles as real or fake. This method is a useful tool for identifying and filtering out fake news, but its effectiveness depends on the quality of the selected features and the accuracy of the classification algorithm used.
4.1 Dataset preparation
Data collection and preprocessing are essential tasks for machine learning models, as the quality and relevance of the data used to train the model have a direct impact on its performance and accuracy.
[Fig. 3 ConFake model for fake news detection: Phase 1, new dataset preparation and dataset preprocessing; Phase 2, feature engineering (linguistic feature extraction and selection, word embedding technique selection); Phase 3, concatenate features; Phase 4, apply machine learning models to label news as true or fake]
4.1.1 Dataset collection
Many datasets were used in previous studies [2,6,11,20,32,36,61], from which we identified datasets with comparable structures and categories. The datasets utilised in related studies had numerous issues, including size, category, and bias. As shown in Table 13, we prepared a large dataset consisting of five datasets: Kaggle, Reuter, McIntire, BuzzFeed, and PolitiFact. This large dataset minimizes the overfitting of ML models and facilitates better training. This novel dataset of 72,413 news articles contains 35,396 true news articles and 37,479 false news articles.

Table 13 ConFake dataset
Dataset            Total news   True news   Fake news
Kaggle [11]        20,800       10,387      10,413
Reuter [2]         44,898       21,416      23,482
McIntire [32]      6,335        3,171       3,164
BuzzFeed [20]      182          91          91
PolitiFact [36]    240          120         120
ConFake dataset    72,413       35,396      37,479
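A hypothetical sketch of assembling such a combined dataset is given below; the file names and the common text/label schema are assumptions, since the paper does not describe its merging script.

```python
# Hypothetical sketch of building the combined ConFake dataset with pandas.
import pandas as pd

files = ["kaggle.csv", "reuter.csv", "mcintire.csv",
         "buzzfeed.csv", "politifact.csv"]                   # assumed file names
parts = [pd.read_csv(f)[["text", "label"]] for f in files]   # assumed schema
confake = pd.concat(parts, ignore_index=True).drop_duplicates(subset="text")
print(len(confake), confake["label"].value_counts(), sep="\n")
```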
4.1.2 Data preprocessing
Data preprocessing consists of many tasks to handle noise in the text and missing data. It rebuilds unstructured data into a structured form, which helps improve the accuracy and performance of the model. In this work, preprocessing of the ConFake dataset removes NaN values, typographic errors, duplicate data, stop words, emoji, punctuation marks, dates, and special characters, and performs lemmatization and stemming using the Porter Stemmer algorithm.
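The cleaning steps listed above might look roughly like the following sketch; the regular expression and helper name are illustrative rather than the authors' exact pipeline (lemmatization is omitted for brevity).

```python
# Hedged sketch of Section 4.1.2 preprocessing: drop NaN/duplicates,
# strip non-alphabetic characters, remove stop words, Porter-stem tokens.
import re
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean(text: str) -> str:
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # drop punctuation, digits, emoji
    return " ".join(stemmer.stem(t) for t in text.split() if t not in STOP)

df = pd.DataFrame({"text": ["BREAKING!!! Read this...", None,
                            "BREAKING!!! Read this..."]})
df = df.dropna().drop_duplicates()        # remove NaN and duplicate rows
df["clean_text"] = df["text"].apply(clean)
print(df)
```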
4.2 Feature engineering
In feature engineering, we extracted the linguistic and word vector evidence from the text of the news articles, as discussed in Section 3.
4.3 Concatenate features
In this step, content-based WoE is combined with LFs to achieve better accuracy, because LFs or word vector features alone do not provide good accuracy. We combine TF or TF-IDF with the optimised LF set; the result is then fed into an ML classifier for classification.
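Concatenating the sparse TF matrix with the dense LF matrix can be done with scipy, as sketched below on stand-in data; keeping the result sparse avoids materialising a full vocabulary-sized array.

```python
# Sketch of the feature-concatenation step with stand-in matrices.
import numpy as np
from scipy.sparse import csr_matrix, hstack

tf_matrix = csr_matrix(np.random.rand(4, 1000))   # stand-in TF vectors
lf_matrix = np.random.rand(4, 40)                 # stand-in optimised LF set
combined = hstack([tf_matrix, csr_matrix(lf_matrix)])
print(combined.shape)                             # (4, 1040)
```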
4.4 Classification
Selecting the best-performing ML classifier is essential to designing a fake news detection model that accurately identifies fake news. In related work, we identified well-performing ML classifiers such as SVM, NB, RF, LR, KNN, Bagging, and Boosting. In text-mining tasks, SVM performs better than other classifiers; ensemble ML classifiers use weak learners to improve accuracy, and the Adaboost classifier's primary purpose is to identify patterns that are hard to classify. The ML classifiers are described as follows:
1. Support Vector Machine: SVM is a supervised machine learning technique suitable for classification on large datasets. In the experiment, we utilised a linear SVM to classify false news. It is also used in rumour detection, sentiment categorization, facial recognition, and other applications.
2. Random Forest: RF is a supervised machine learning method used in regression and classification problems. It builds several DTs and provides the result based on the outputs of each DT. In this experiment, we tried many numbers of DTs, and n_estimators = 200 offered the best accuracy.
3. Naive Bayes: NB is a supervised machine learning technique with two common variants, Gaussian and multinomial. This approach responds quickly in comparison to other classifiers and is mostly used in text classification tasks. Gaussian NB was utilised in the experiment because it handles negative values.
4. K-Nearest Neighbour: KNN performs classification by utilising feature similarity. It is a non-parametric classification approach. In this experiment, we set k = 7, which gives better accuracy.
5. Logistic Regression: This statistical technique utilises a logistic function to model a binary target variable in its most basic form, although many more complicated extensions exist. LR is used in regression analysis to estimate the parameters of a logistic model.
6. Bagging: Bagging is a parallel ensemble ML classifier. This method reduces the variance of the prediction model by generating more data during the training stage. In this study, the feature vector is partitioned into equal subsets; a DT is applied to each subset, and the prediction is estimated by taking the mean or mode of the classifiers' outputs.
7. Boosting: A sequential ensemble ML classifier that reduces bias errors and generates powerful prediction models. The phrase "Boosting" refers to transforming a weak classifier into a robust one. Boosting aggregates a large number of classifiers. Since the data samples are weighted, some may appear in the new sets more frequently: data points that are mistakenly predicted are detected, and their weights are increased in each phase so that the following learner gets closer to getting them right.
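The seven classifiers, with the hyperparameters the text reports (a linear SVM, n_estimators = 200 for RF, k = 7 for KNN, Gaussian NB), could be instantiated in scikit-learn as follows; every unstated setting is left at the library default and is therefore an assumption.

```python
# The classifiers described above; unreported settings are sklearn defaults.
from sklearn.svm import LinearSVC
from sklearn.ensemble import (RandomForestClassifier, BaggingClassifier,
                              AdaBoostClassifier)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

classifiers = {
    "SVM": LinearSVC(),                              # linear SVM (item 1)
    "RF": RandomForestClassifier(n_estimators=200),  # item 2
    "GNB": GaussianNB(),                             # item 3
    "KNN": KNeighborsClassifier(n_neighbors=7),      # item 4
    "LR": LogisticRegression(max_iter=1000),         # item 5
    "Bagging": BaggingClassifier(),                  # item 6
    "Adaboost": AdaBoostClassifier(),                # item 7
}
```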
5 Parameters and methods
The procedure to detect fake news by ML classifiers using LFs is shown in Algorithm 1, and
the procedure with word vector features is shown in Algorithm 2.
Following are the steps of Algorithm 1:
1. The first line represents the collection of the datasets and their combination. After that, preprocessing steps such as removing missing data, redundant data, emoji, etc., are applied.
2. In lines 2 and 3, the lists of document features (feature_doc) and all document features (feature_dataset) are initialised.
3. In the feature extraction step, 80 LFs are first extracted from each document, and each feature_doc vector is appended to the feature_dataset list.
4. Transform the feature_dataset values through standardisation to improve accuracy. Data standardisation is the process of rescaling the features so that they have a mean of 0 and a variance of 1. Standardisation is used when features of an input dataset have large differences between ranges or are simply measured in different measurement units. The ultimate goal of standardisation is to bring all the features down to a common scale without distorting the differences in the range of the values. In standardisation, there is no specific upper or lower bound for the maximum and minimum values. The Z-score is one of the most popular methods to standardise data; its formula is shown in (5).

Z_{score} = \frac{Value - Mean}{Standard\ deviation}    (5)

5. Apply the correlation function to the attributes of the transformed feature_dataset and obtain the feature sets LF 1, LF 2, LF 3, and LF 4 with respect to corr values less than 0.7, 0.8, 0.9, and 1. The selection of least-correlated features to classify the news is discussed in Section 3.2.
6. Building a model that works well on additional data is one of the objectives of supervised learning. It is a good idea to test our model on new data if we have any. The issue is that we do not have any new data; however, a technique like the train-test split may be used to simulate this experience. For the training and testing sets, 80% of the rows were randomly sampled without replacement and placed into the training set, while the remaining 20% were placed into the test set.
7. Finally, we applied each ML classifier to every LF set with its labels and calculated metrics using the confusion matrix discussed in Section 6.1 (see the sketch after Algorithm 1).
Algorithm 1 Algorithm to apply ML classifiers on LFs.
Input: Datasets (Kaggle, Reuter, McIntire, BuzzFeed, PolitiFact)
Output: Accuracy, Precision, Recall, F1-score.
1. ConFake_dataset(D) ← collection(Kaggle, Reuter, McIntire, BuzzFeed, PolitiFact);
2. feature_doc ← [ ];
3. feature_dataset ← [ ];
4. for doc ← 1 to len(D) do
       feature_doc.append(linguistic_features_extracted_from_textual_content_of_doc);
       feature_dataset.append(feature_doc);
5. Apply standardisation on feature_dataset.
6. Apply the correlation function to feature_dataset and create the feature sets LF 1, LF 2, LF 3, and LF 4 with respect to correlation values below 0.7, 0.8, 0.9, and 1.
7. Perform the train_test_split function on LF 1 / LF 2 / LF 3 / LF 4 and labels with a ratio of 80:20.
8. Select an ML classifier from (NB, SVM, LR, KNN, Bagging, Adaboost, and RF).
9. Training of the model: classifier.fit(feature_train, labels_train)
10. Prediction: classifier.predict(feature_test)
11. Print confusion_matrix(feature_test, labels_test)
12. Evaluate metrics using the confusion matrix.
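A compact scikit-learn rendering of steps 4-7 of Algorithm 1 (standardisation, 80:20 split, fitting, confusion matrix) might look like the sketch below; `X` and `y` stand in for the LF matrix and labels and are generated randomly here.

```python
# Sketch of Algorithm 1, steps 4-7, on stand-in data.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

X = np.random.rand(200, 80)              # stand-in 80-dimensional LF matrix
y = np.random.randint(0, 2, size=200)    # stand-in labels (0 = true, 1 = fake)

X_std = StandardScaler().fit_transform(X)   # Z-score standardisation, Eq. (5)
X_tr, X_te, y_tr, y_te = train_test_split(X_std, y, test_size=0.2)

clf = RandomForestClassifier(n_estimators=200).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(confusion_matrix(y_te, pred))
print(accuracy_score(y_te, pred))
```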
Algorithm 2 Algorithm to apply ML classifiers on word vector features.
Input: Datasets (Kaggle, Reuter, McIntire, BuzzFeed, PolitiFact).
Output: Accuracy, Precision, Recall, F1-score.
1. ConFake_dataset(D) ← collection(Kaggle, Reuter, McIntire, BuzzFeed, PolitiFact);
2. preprocess_doc ← [ ];
3. preprocess_dataset ← [ ];
4. tf_feature_doc ← [ ];
5. tf_feature_dataset ← [ ];
6. tf_idf_feature_doc ← [ ];
7. tf_idf_feature_dataset ← [ ];
8. for doc ← 1 to len(D) do
       preprocess_doc.append(preprocessing(remove_special_characters, remove_digits, remove_punctuation_marks, lowercasing, stemming, lemmatization));
       preprocess_dataset.append(preprocess_doc);
9. Calculate BOW-TF and BOW-TFIDF;
10. for doc ← 1 to len(preprocess_dataset) do
       tf_feature_doc.append(BOW_TF(doc));
       tf_feature_dataset.append(tf_feature_doc);
       tfidf_feature_doc.append(BOW_TFIDF(doc));
       tfidf_feature_dataset.append(tfidf_feature_doc);
11. Perform the train_test_split function on the feature_dataset of BOW-TF or BOW-TFIDF and labels with a ratio of 80:20.
12. Select an ML classifier from (NB, SVM, LR, KNN, Bagging, Adaboost, and RF).
13. Training of the model: classifier.fit(feature_train, labels_train)
14. Prediction: classifier.predict(feature_test)
15. Print confusion_matrix(feature_test, labels_test)
16. Evaluate metrics.
Similarly, the following steps are taken in Algorithm 2:
1. The first line represents the collection of the datasets and their combination.
2. Lines 2-7 represent the initialization of lists, where preprocess_doc is the list of sentences of a document, preprocess_dataset is the list of sentences of all documents, tf_feature_doc is the TF list of a document, tf_feature_dataset is the TF list of all documents, tf_idf_feature_doc is the TF-IDF list of a document, and tf_idf_feature_dataset is the TF-IDF list of all documents.
3. In line 8, preprocessing steps are applied to the dataset, such as removing missing data, redundant data, stop words, URLs, special characters, and punctuation, and performing stemming and lemmatization on the text data. Each preprocess_doc is appended to preprocess_dataset.
4. Lines 9 and 10 show the feature extraction step, which extracts the word vector features (TF/TF-IDF) of each document using (2) and (4). After that, these vectors are appended to the lists tf_feature_dataset and tf_idf_feature_dataset.
5. Apply standardisation to both the TF and TF-IDF lists, the same as used for LFs.
6. Perform a train-test split on both transformed TF and TF-IDF lists with labels, the same as used for LFs.
7. Finally, apply each ML classifier and calculate metrics using the confusion matrix discussed in Section 6.1.
Algorithm 3 follows Algorithms 1 and 2: lines 1 and 2 show the combination of tf_feature_dataset with the optimised feature_dataset, stored in FS1 (feature set), and the combination of tf_idf_feature_dataset with the optimised feature_dataset, stored in FS2. Standardisation is then performed on FS1 and FS2 as for the LFs. After that, the train-test split is performed on the standardised FS1 and FS2 with labels. Finally, each ML classifier is applied, and metrics are calculated using the confusion matrix discussed in Section 6.1.
Algorithm 3 Algorithm to feed content-based features (i.e., LFs and TF/TF-IDF) into ML classifiers.
Input: feature_dataset, tf_feature_dataset, tf_idf_feature_dataset.
Output: Accuracy, Precision, Recall, F1-score.
1. FS1 = concat(feature_dataset, tf_feature_dataset);
2. FS2 = concat(feature_dataset, tf_idf_feature_dataset);
3. Apply standardisation on FS1, FS2.
4. Perform the train_test_split function on FS1/FS2 and labels with a ratio of 80:20.
5. Select an ML classifier from (NB, SVM, LR, KNN, Bagging, Adaboost, and RF).
6. Training of the model: classifier.fit(feature_train, labels_train)
7. Prediction: classifier.predict(feature_test)
8. Print confusion_matrix(feature_test, labels_test)
9. Evaluate metrics using a confusion matrix.
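As a sketch of Algorithm 3 on stand-in matrices (dense here for brevity, although a real TF matrix would usually stay sparse):

```python
# Sketch of Algorithm 3: FS1 = concat(LFs, TF), standardise, split, evaluate.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

lf = np.random.rand(100, 40)     # stand-in optimised LF set
tf = np.random.rand(100, 500)    # stand-in (dense) TF matrix
y = np.random.randint(0, 2, 100)

fs1 = StandardScaler().fit_transform(np.hstack([lf, tf]))          # lines 1, 3
X_tr, X_te, y_tr, y_te = train_test_split(fs1, y, test_size=0.2)   # line 4
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)            # lines 5-6
print(classification_report(y_te, clf.predict(X_te)))              # lines 7-9
```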
6 Results and discussion
The experiments in this study were performed in PyCharm Community 2019.3 with Python 2.7 on the Mac OS X operating system with 8 GB of memory.
6.1 Evaluation metrics
A confusion matrix measured the performance of the classification models. To measure the performance of the proposed method, we use four metrics: Accuracy (ACC), Precision (P), Recall (R), and F1-score. The calculation of these metrics necessitates the parameters "True positive" (Tp), "True negative" (Tn), "False positive" (Fp), and "False negative" (Fn). The following performance metrics are measured using the confusion matrix shown in Table 14:
1. Accuracy: defined as the proportion of correct predictions to the total number of predictions.

ACC = \frac{T_p + T_n}{T_p + T_n + F_p + F_n}

2. Precision: expressed as the proportion of correctly identified true positives to the total number of positive predictions. It is used to calculate the positive predicted value.

P = \frac{T_p}{T_p + F_p}

3. Recall: expressed as the ratio of correctly identified positive predictions to the total number of actual positives.

R = \frac{T_p}{T_p + F_n}

4. F1-score: defined as the harmonic mean of precision and recall. It assists in determining the model's testing accuracy.

F1\text{-}score = \frac{2}{P^{-1} + R^{-1}}
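Equivalently, the four metrics can be computed directly from predictions with scikit-learn, as in this small check:

```python
# The confusion-matrix metrics above, computed with scikit-learn.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(accuracy_score(y_true, y_pred),   # (Tp+Tn)/(Tp+Tn+Fp+Fn)
      precision_score(y_true, y_pred),  # Tp/(Tp+Fp)
      recall_score(y_true, y_pred),     # Tp/(Tp+Fn)
      f1_score(y_true, y_pred))         # harmonic mean of P and R
```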
6.2 Machine learning classifiers
All the experiments were conducted in four phases, and seven ML classifiers were used to analyse the accuracy of the proposed work.
In the first phase, the accuracy was calculated for different ranges of Pearson correlation between features. First, 40 features with low correlation, i.e., lying in the range of -0.7 to 0.7, were selected for accuracy calculation. The second set of 45 features contains features with a correlation between -0.8 and 0.8. The third set of 51 features was then obtained with correlation coefficients ranging from -0.9 to 0.9. Finally, accuracy was calculated using all 80 features extracted from the dataset. It can be seen from Table 15 that the inclusion of redundant or highly correlated features (correlation magnitude greater than 0.7) does not yield much improvement in accuracy. Table 15 shows the ML classifier performance on the LFs of the ConFake dataset for different values of corr, where RF and Bagging achieve the best accuracy, i.e., 95.48% and 95.19%, respectively. Moreover, other classifiers such as SVM, KNN, DT, Adaboost, and LR provided accuracies of 89.72%, 84.75%, 91.19%, 89.62%, and 89.20%, respectively, while GNB provided a low accuracy of 60.59%.
In the second phase, we extracted LFs from all benchmark datasets: Kaggle, McIntire, Reuter, BuzzFeed, and PolitiFact. After that, ML classifiers such as GNB, SVM, KNN, Adaboost, Bagging, LR, and RF were applied to the LFs, where RF performs better on the Kaggle, McIntire, and Reuter datasets with accuracies of 98.21%, 92.02%, and 96.92%, respectively, and the remaining datasets also achieve good accuracy. Bagging performs better on the small BuzzFeed and PolitiFact datasets, achieving accuracies of 86.48% and 79.16%. The classifiers' performance on the five datasets is shown in Table 16.

Table 14 Confusion matrix
                 Predicted true   Predicted false
Actual true      Tp               Fn
Actual false     Fp               Tn

Table 15 ML classifiers' performance (accuracy, %) on LFs of the ConFake dataset for different values of correlation
Classifier   Corr <= |1|   Corr < |0.7|   Corr < |0.8|   Corr < |0.9|
GNB          60.59         61.19          60.70          60.64
SVM          89.72         85.35          85.66          85.97
KNN          84.75         84.51          83.89          83.98
DT           91.19         91.14          91.21          91.58
Adaboost     89.62         88.44          88.75          88.66
Bagging      95.19         95.00          95.03          95.12
LR           89.20         84.39          84.67          84.92
RF           95.48         95.38          95.29          95.49
The third phase extracts content-based word embeddings, i.e., TF and TF-IDF features, from ConFake and feeds them into the ML classifiers. RF and Bagging consistently perform well on both WoE variants: the accuracy of RF was 94.34% on TF and 94.51% on TF-IDF, and similarly, the accuracy of Bagging was 94.61% on TF and 94.10% on TF-IDF. LR achieves the best accuracy of 95.21% on TF. Other classifiers such as SVM, Adaboost, KNN, and GNB also perform reasonably on both word vector features. The performance of the ML classifiers on the WoE of ConFake is shown in Table 17.
In the last phase, we combined the WoE with the LFs and applied the ML classifiers to them. RF performs best on both the TF-LFs and TF-IDF-LFs of the ConFake dataset: its accuracy was 97.31% on TF-LFs and 97.14% on TF-IDF-LFs. Apart from that, a few classifiers such as LR, SVM, Bagging, and Adaboost also provide good accuracy on TF-LFs and TF-IDF-LFs, up to 97%. The performance of all ML classifiers on LFs + TF features is shown in Table 18 and on LFs + TF-IDF features in Table 19.
The overall results show that RF achieves the highest accuracy of 97.31% on the combined linguistic and WoE (TF) features of the ConFake dataset. Moreover, RF performs better in all phases, achieving the best accuracy compared to the other ML classifiers.
Table 16 ML classifiers' performance (accuracy, %) on five different datasets
Classifier   Kaggle   McIntire   Reuter   BuzzFeed   PolitiFact
GNB          96.05    62.86      67.72    62.16      61.11
SVM          97.88    89.50      95.12    75.67      68.75
KNN          89.62    82.00      88.69    70.27      56.25
Adaboost     98.04    88.95      94.30    81.08      72.91
Bagging      98.09    91.47      96.15    86.48      79.16
LR           97.54    88.95      94.85    75.67      68.75
RF           98.21    92.02      96.92    81.08      75.00
Table 17 ML classifiers' performance (accuracy, %) on word vector features of the ConFake dataset
Classifier   TF      TF-IDF
GNB          80.00   84.45
SVM          89.61   92.27
KNN          82.40   74.05
Adaboost     91.97   91.40
Bagging      94.61   94.10
LR           95.21   92.41
RF           94.34   94.51
6.3 Related work comparison
We compare the proposed ConFake method with four related methods, compiled in Table 20. We use the seven ML classifiers on six datasets, including the ConFake dataset, as shown in Table 13. We preprocess the dataset using the LF and WoE features before applying the ML classifiers. In the case of LFs, we preprocess the dataset by removing punctuation marks, excluding single quotes, double quotes, "#", "@", ellipses, "?", and "!", as well as missing values, duplicate values, emoji signs, and irrelevant statements, and by performing stemming and lemmatization on the text. For WoE, we removed all punctuation marks, missing values, duplicate values, emoji signs, and irrelevant statements, and then performed stemming and lemmatization on the text. This preprocessing method differs from other cutting-edge techniques. We tested each ML model on datasets with an 80%-20% data split. We used six datasets of different sizes from various sources in this work.
1. Ahmed et al. [1] performed an experiment on the Kaggle-EXT dataset, which contains 25,200 articles; they used the TF-IDF word vector feature rather than LFs and achieved 92% accuracy using a linear SVM.
2. X. Zhou et al. [69] used 53 LFs to detect fake news on the PolitiFact and BuzzFeed datasets, which contain only 240 and 182 articles, respectively. They achieved their best accuracy with a linear SVM: 89.2% on PolitiFact and 87.9% on BuzzFeed.
3. G. Gravanis et al. [17] introduced a new unbiased dataset of 3,004 articles by incorporating articles from four datasets (Kaggle-EXT, McIntire, BuzzFeed, and PolitiFact). They extracted 57 LFs along with the WoE, fed them into ML classifiers (viz., SVM, NB, DT, KNN, Adaboost, and Bagging), and achieved the highest accuracy of 94.9% using SVM. They also evaluated their approach on the individual corpora (i.e., Kaggle-EXT, BuzzFeed, PolitiFact, and McIntire) and achieved accuracies of 99.0%, 72.70%, 84.7%, and 81%, respectively.
4. H. Reddy et al. [40] used 50 features to detect fake news on a combination of the FakeNewsNet and McIntire datasets, which contains 6,755 articles. They combined the LFs with word-vector features and fed them into ML classifiers. In this experiment, they achieved the highest accuracy of 95.49% using the Adaboost and GB classifiers.
5. ConFake: utilised a massive dataset of 72,413 news items from five distinct datasets (Kaggle, Reuter, McIntire, BuzzFeed, and PolitiFact) and achieved the highest accuracy of 97.31% with RF compared to the related techniques. To provide a valid assessment, we applied this method to each dataset separately. Compared to [17], our approach improves the accuracy on the McIntire dataset from 81.0% to 92.02% and on the BuzzFeed dataset from 72.7% to 81.08%.

Table 18 ML classifiers' performance on linguistic and word vector (TF) features of the ConFake dataset
Classifier   Accuracy (%)   Precision   Recall   F1-score
KNN          77.95          .77         .81      .79
RF           97.31          .96         .99      .97
GNB          82.92          .84         .82      .83
LR           96.21          .96         .97      .96
Bagging      97.09          .96         .99      .97
Adaboost     94.41          .94         .95      .95
SVM          95.08          .94         .96      .95

Table 19 ML classifiers' performance on linguistic and word vector (TF-IDF) features of the ConFake dataset
Classifier   Accuracy (%)   Precision   Recall   F1-score
KNN          70.64          .69         .78      .73
RF           97.14          .96         .99      .97
GNB          82.30          .84         .81      .82
LR           96.27          .96         .97      .96
Bagging      96.82          .95         .99      .97
Adaboost     94.37          .94         .95      .95
SVM          94.69          .94         .96      .95
6.4 Discussion
In this section, we discuss the performance of various ML algorithms on the ConFake dataset
with varying levels of correlation between the features. The ConFake dataset is a combi-
nation of five different datasets and has a moderate size of 72,413 instances. Given the
size and dimensionality of the ConFake dataset, some of the more computationally efficient
algorithms for this type of dataset could include RF and GB. These algorithms can han-
dle high-dimensional data well and can parallelize computations, which can improve their
efficiency. On the other hand, algorithms that may require more memory or computational
resources, such as GNB or SVM, may be less efficient for this dataset, especially if not
optimised properly.
The efficiency of an ML classifier can be assessed on several factors, including training time, testing time, memory requirements, and computational complexity. A dataset with a large number of features can significantly affect efficiency, as more features lead to higher computational requirements and longer training and testing times. RF and Bagging proved to be the most efficient algorithms for the ConFake dataset, owing to their ability to parallelize computations and to tolerate noisy or irrelevant features. These ensembles can also be faster and more memory-efficient than models such as SVM and KNN, which may require more computational resources and memory to train and test. Some of the other algorithms used in the study, such as GNB, are less computationally expensive but also offer lower accuracy. The efficiency of a classifier ultimately depends on all of these factors together, as summarised in Table 20.
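As a concrete illustration of how such efficiency comparisons can be carried out, the hypothetical harness below times model fitting for the ensembles discussed above against a linear SVM. It is not the benchmarking code used in this study; X_train and y_train are assumed to come from a pipeline such as the one sketched earlier.

# Hypothetical harness for comparing training-time efficiency.
# X_train and y_train are assumed to come from the pipeline sketched earlier.
import time
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.svm import LinearSVC

def fit_seconds(clf, X, y):
    start = time.perf_counter()
    clf.fit(X, y)
    return time.perf_counter() - start

def compare_training_times(X_train, y_train):
    models = {
        "RF (parallel)": RandomForestClassifier(n_estimators=100, n_jobs=-1),
        "Bagging (parallel)": BaggingClassifier(n_estimators=10, n_jobs=-1),
        "Linear SVM": LinearSVC(),
    }
    for name, clf in models.items():
        print(f"{name}: {fit_seconds(clf, X_train, y_train):.1f} s")

Because RF and Bagging accept n_jobs=-1, they can spread estimator construction across all available cores, whereas a single LinearSVC fit is sequential; this is one source of the efficiency gap discussed above.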
Table 20 ConFake comparison with related methods

Attribute: Ahmed et al. [1] | X. Zhou et al. [69] | G. Gravanis et al. [17] | H. Reddy et al. [40] | ConFake Method
Dataset: 1. Kaggle-EXT | 1. PolitiFact; 2. BuzzFeed | 1. Kaggle-EXT; 2. BuzzFeed; 3. PolitiFact; 4. McIntire; 5. UNBiased | Combination of FakeNewsNet and McIntire datasets | 1. Kaggle; 2. Reuter; 3. McIntire; 4. BuzzFeed; 5. PolitiFact; 6. ConFake
No. of news articles: 25,200 | 1. 240; 2. 182 | 1. 23,340; 2. 240; 3. 182; 4. 6,310; 5. 3,004 | 6,755 | 1. 20,800; 2. 44,898; 3. 6,335; 4. 182; 5. 240; 6. 72,413
LFs: No | Yes | Yes | Yes | Yes
No. of LFs used: Nil | 53 | 57 | 50 | 80
Word vector feature: TF-IDF | No | Word2Vec | TF, TF-IDF, Word2Vec | TF, TF-IDF
Training and testing ratio: 80%-20% | 80%-20% | 80%-20% | 80%-20% | 80%-20%
Best classifier: Linear SVM | Linear SVM | SVM | Adaboost, GB | RF
Accuracy (per dataset, in the order listed above): 1. 92% | 1. 89.2%; 2. 87.9% | 1. 99.0%; 2. 72.7%; 3. 84.7%; 4. 81.0%; 5. 94.9% | 95.49% | 1. 98.21%; 2. 96.92%; 3. 92.02%; 4. 81.08%; 5. 75.00%; 6. 97.31%
7 Conclusion
In this work, a new dataset, “ConFake”, is proposed; it is built from five open-source corpora (Kaggle, McIntire, Reuter, BuzzFeed, and PolitiFact) to reduce the limitations and biases of any single corpus in distinguishing fake news from real news. It contains a total of 72,413 news articles, comprising 35,396 true news articles and 37,479 false news articles. The LFs were extracted with the LIWC 2007 dictionary, the readability features with the textstat Python library, and the text polarity and subjectivity features with the textblob library; the remaining features were self-programmed in Python (a minimal sketch of this extraction is given below). In this experiment, 80 features were extracted and fed into seven ML classifiers. To evaluate the WoE features, namely TF and TF-IDF, each feature set was first fed separately into the ML classifiers; then TF was combined with the LFs and, similarly, TF-IDF was combined with the LFs, and both combinations were fed into the classifiers. The performances of the classifiers were compared, and the RF classifier achieved the highest accuracy of 97.31% on the combination of TF and LFs. To provide a valid assessment, the ConFake method was also applied to each dataset separately. When compared to cutting-edge methods, our method improves the accuracy on the McIntire dataset by 11% and on the BuzzFeed dataset by 9%.

However, the study has two limitations: 1) it examines only the text field of news articles, ignoring other metadata such as images, user information, events, social statistics, and propagation paths, and relying on text alone makes fake stories harder to distinguish from real ones; and 2) our method is not well suited to the early detection of fake news on social media.
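A minimal sketch of the readability and sentiment portion of this feature extraction is given below, assuming only the textstat and textblob libraries named above. The particular indices shown are an illustrative subset rather than the full 80-feature set, and the LIWC-based LFs are omitted because LIWC 2007 is a licensed dictionary.

# Illustrative sketch of per-article readability and sentiment features,
# using the textstat and textblob libraries named above. The chosen
# indices are an assumed subset, not the full 80-feature ConFake set.
import textstat
from textblob import TextBlob

def extract_content_features(text):
    blob = TextBlob(text)
    return {
        # Readability features (textstat)
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "gunning_fog": textstat.gunning_fog(text),
        "smog_index": textstat.smog_index(text),
        # Sentiment features (textblob)
        "polarity": blob.sentiment.polarity,          # in [-1, 1]
        "subjectivity": blob.sentiment.subjectivity,  # in [0, 1]
        # Simple self-programmed surface features
        "word_count": len(blob.words),
        "sentence_count": len(blob.sentences),
    }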
In future work, we will incorporate user- and social-context-based features to improve the performance of this approach. User-based features (e.g., the user's registration ID, age, gender, #followers, #posts) and social-context-based features (e.g., likes, comments, shares) can effectively discriminate fake news from actual news. Moreover, we will utilise deep learning models in place of the ML models to recognise fake news, since deep learning models handle large datasets very effectively. Many efforts have been made in recent years to increase the dependability and credibility of online material, but some of the most critical issues remain unsolved. First, most research concentrates on LFs in texts written in English; other widely used and regional languages still need to be taken into consideration. Second, the current work has relied primarily on supervised learning methodologies; given the vast amounts of unlabeled data on social media, unsupervised models must be designed. Third, as most of this research has been conducted on customised datasets, constructing compelling standard datasets is critically important. The shortage of publicly accessible large-scale datasets restricts benchmark comparison of the various techniques.
Funding The authors did not receive support from any organization for the submitted work.
Data Availability The data that support the findings of this study are available from the corresponding author
upon reasonable request.
Declaration
Conflicts of interest The authors declare they have no financial interests.
References
1. Ahmed, H (2017) Detecting opinion spam and fake news using n-gram analysis and semantic similarity.
PhD thesis, University of Victoria
2. Ahmed, H, Traore, I, Saad, S (2017) Detection of online fake news using n-gram analysis and machine
learning techniques. In: Intelligent, secure, and dependable systems in distributed and cloud environments:
first international conference, ISDDC 2017, Vancouver, BC, Canada, October 26-28, 2017, Proceedings
1, Springer, pp 127–138
3. Ajao, O, Bhowmik, D, Zargari, S (2018) Fake news identification on twitter with hybrid cnn and rnn
models. In: Proceedings of the 9th international conference on social media and society, pp 226–230
4. Allcott H, Gentzkow M (2017) Social media and fake news in the 2016 election. J Econ Perspect 31(2):211–236
5. Bali, APS, Fernandes, M, Choubey, S, Goel, M (2019) Comparative performance of machine learning
algorithms for fake news detection. In: Advances in computing and data sciences: third international
conference, ICACDS 2019, Ghaziabad, India, April 12–13, 2019, Revised Selected Papers, Part II 3,
Springer, pp 420–430
6. Bezerra JFR (2021) Content-based fake news classification through modified voting ensemble. J Inf
Telecommun 5(4):499–513
7. Brașoveanu, AMP, Andonie, R (2019) Semantic fake news detection: A machine learning perspective. In:
Advances in Computational Intelligence: 15th international work-conference on artificial neural networks,
IWANN 2019, Gran Canaria, Spain, June 12–14, 2019, Proceedings, Part I 15, Springer, pp 656–667
8. Burgoon, JK, Blair, JP, Qin, T, Nunamaker, JF (2003) Detecting deception through linguistic analysis. In:
Intelligence and Security Informatics: first NSF/NIJ symposium, ISI 2003, Tucson, AZ, USA, June 2–3,
2003 Proceedings 1, Springer, pp 91–101
9. Choudhary A, Arora A (2021) Linguistic feature based learning model for fake news detection and
classification. Exp Syst Appl 169:114171
10. Fact Check. https://www.factcheck.org/. Accessed: 31 Mar 2020
11. Fake News Kaggle dataset. https://www.kaggle.com/c/fake-news/data?select=train.csv. Accessed: 15
Apr 2020
12. Faustini PHA, Covoes TF (2020) Fake news detection in multiple platforms and languages. Exp Syst
Appl 158:113503
13. Fullfact. https://fullfact.org/. Accessed: 31 Mar 2020
14. Ghanem B, Rosso P, Rangel F (2020) An emotional analysis of false information in social media and
news articles. ACM Trans Int Technol (TOIT) 20(2):1–18
15. Gilda, S (2017) Notice of violation of IEEE publication principles: evaluating machine learning algorithms
for fake news detection. In: 2017 IEEE 15th student conference on research and development (SCOReD),
IEEE, pp 110–115
16. Gogate, M, Adeel, A, Hussain, A (2017) Deep learning driven multimodal fusion for automated deception
detection. In: 2017 IEEE symposium series on computational intelligence (SSCI), IEEE, pp 1–6
17. Gravanis G, Vakali A, Diamantaras K, Karadais P (2019) Behind the cues: a benchmarking study for fake
news detection. Exp Syst Appl 128:201–213
18. Hakak S, Alazab M, Khan S, Gadekallu TR, Maddikunta PKR, Khan WZ (2021) An ensemble machine
learning approach through effective feature extraction to classify fake news. Future Gener Comput Syst
117:47–58
19. Hoax Slayer. http://hoaxslayer.com/. Accessed: 31 Mar 2020
20. Horne, B, Adali, S (2017) This just in: Fake news packs a lot in title, uses simpler, repetitive content in
text body, more similar to satire than real news. In: Proceedings of the international AAAI conference on
web and social media, vol 11, pp 759–766
21. Huang Y-F, Chen P-H (2020) Fake news detection using an ensemble learning model based on self-adaptive
harmony search algorithms. Exp Syst Appl 159:113584
22. Jain, MK, Garg, R, Gopalani, D, Meena, YK (2022) Review on analysis of classifiers for fake news
detection. In: Emerging technologies in computer engineering: cognitive computing and intelligent IoT,
Springer, pp 395–407
23. Jain, MK, Gopalani, D, Meena, YK, Kumar, R (2020) Machine learning based fake news detection
using linguistic features and word vector features. In: 2020 IEEE 7th Uttar pradesh section international
conference on electrical, electronics and computer engineering (UPCON), IEEE, pp 1–6
24. Jin, Z, Cao, J, Guo, H, Zhang, Y, Luo, J (2017) Multimodal fusion with recurrent neural networks for
rumor detection on microblogs. In: Proceedings of the 25th ACM international conference on multimedia,
pp 795–816
25. Jin Z, Cao J, Zhang Y, Zhou J, Tian Q (2016) Novel visual and statistical image features for microblogs
news verification. IEEE Trans Multimed 19(3):598–608
26. Kaliyar, RK, Goswami, A, Narang, P (2019) Multiclass fake news detection using ensemble machine
learning. In: 2019 IEEE 9th international conference on advanced computing (IACC), IEEE, pp 103–107
27. Kaliyar RK, Goswami A, Narang P, Sinha S (2020) FNDNet-a deep convolutional neural network for
fake news detection. Cogn Syst Res 61:32–44
28. Kaur S, Kumar P, Kumaraguru P (2020) Detecting clickbaits using two-phase hybrid CNN-LSTM biterm
model. Exp Syst Appl 151:113350
29. Khan JY, Khondaker MTI, Afroz S, Uddin G, Iqbal A (2021) A benchmark study of machine learning
models for online fake news detection. Mach Learn Appl 4:100032
30. Khattar, D, Goud, JS, Gupta, M, Varma, V (2019) MVAE: multimodal variational autoencoder for fake
news detection. In: The world wide web conference, pp 2915–2921
31. Maan, M, Jain, MK, Trivedi, S, Sharma, R (2022) Machine learning based rumor detection on twitter data.
In: Emerging technologies in computer engineering: cognitive computing and intelligent IoT. Springer,
pp 259–273
32. McIntire dataset. https://github.com/lutzhamel/fake-news/tree/master/data. Accessed: 31 Mar 2020
33. Meel P, Vishwakarma DK (2020) Fake news, rumor, information pollution in social media and web: a
contemporary survey of state-of-the-arts, challenges and opportunities. Exp Syst Appl 153:112986
34. Newman ML, Pennebaker JW, Berry DS, Richards JM (2003) Lying words: Predicting deception from
linguistic styles. Pers Soc Psychol Bull 29(5):665–675
35. Pérez-Rosas, V, Kleinberg, B, Lefevre, A, Mihalcea, R (2017) Automatic detection of fake news.
arXiv:1708.07104
36. Politifact news dataset. http://www.politifact.com/. Accessed: 31 Mar 2020
37. Qi, P, Cao, J, Yang, T, Guo, J, Li, J (2019) Exploiting multi-domain visual information for fake news
detection. In: 2019 IEEE international conference on data mining (ICDM), IEEE, pp 518–527
38. Ratner B (2009) The correlation coefficient: Its values range between +1/−1, or do they? J Target Meas Anal Mark 17(2):139–142
39. Ravi K, Ravi V (2017) A novel automatic satire and irony detection using ensembled feature selection
and data mining. Knowledge-Based Syst 120:15–33
40. Reddy H, Raj N, Gala M, Basava A (2020) Text-mining-based fake news detection using ensemble
methods. Int J Autom Comput 17(2):210–221
41. Reis, JCS, Correia, A, Murai, F, Veloso, A, Benevenuto, F (2019) Explainable machine learning for fake
news detection. In: Proceedings of the 10th ACM conference on web science, pp 17–26
42. Reis JCS, Correia A, Murai F, Veloso A, Benevenuto F (2019) Supervised learning for fake news detection.
IEEE Intell Syst 34(2):76–81
43. Ruchansky, N, Seo, S, Liu, Y (2017) CSI: a hybrid deep model for fake news detection. In: Proceedings
of the 2017 ACM on conference on information and knowledge management, pp 797–806
44. Saquete E, Tomás D, Moreda P, Martínez-Barco P, Palomar M (2020) Fighting post-truth using natural
language processing: a review and open challenges. Exp Syst Appl 141:112943
45. Schwarz N, Newman E, Leach W (2016) Making the truth stick and the myths fade: lessons from cognitive
psychology. Behav Sci Policy 2:85–95
46. Shah, P, Kobti, Z (2020) Multimodal fake news detection using a cultural algorithm with situational and
normative knowledge. In: 2020 IEEE congress on evolutionary computation (CEC), IEEE, pp 1–7
47. Sharma K, Qian F, Jiang H, Ruchansky N, Zhang M, Liu Y (2019) Combating fake news: a survey on
identification and mitigation techniques. ACM Trans Intell Syst Technol (TIST) 10(3):1–42
48. Shu, K, Wang, S, Liu, H (2019) Beyond news contents: The role of social context for fake news detection.
In: Proceedings of the twelfth ACM international conference on web search and data mining, pp 312–320
49. Shu K, Mahudeswaran D, Liu H (2019) FakeNewsTracker: a tool for fake news collection, detection, and
visualization. Comput Math Organ Theory 25:60–71
50. Shu K, Mahudeswaran D, Wang S, Lee D, Liu H (2020) Fakenewsnet: a data repository with news
content, social context, and spatiotemporal information for studying fake news on social media. Big Data
8(3):171–188
51. Silva RM, Santos RLS, Almeida TA, Pardo TAS (2020) Towards automatically filtering fake news in
portuguese. Exp Syst Appl 146:113199
52. Singh, V, Dasgupta, R, Sonagra, D, Raman, K, Ghosh, I (2017) Automated fake news detection using
linguistic analysis and machine learning. In: International conference on social computing, behavioral-
cultural modeling, & prediction and behavior representation in modeling and simulation (SBP-BRiMS),
pp 1–3
53. Singhal, S, Shah, RR, Chakraborty, T, Kumaraguru, P, Satoh, S (2019) Spotfake: a multi-modal framework
for fake news detection. In: 2019 IEEE fifth international conference on multimedia big data (BigMM),
IEEE, pp 39–47
54. Snopes. https://www.snopes.com/. Accessed: 31 Mar 2020
55. Tausczik YR, Pennebaker JW (2010) The psychological meaning of words: LIWC and computerized text
analysis methods. J Lang Soc Psychol 29(1):24–54
56. Truthorfiction. https://www.truthorfiction.com/. Accessed: 31 Mar 2020
57. Verma PK, Agrawal P, Amorim I, Prodan R (2021) WELFake: word embedding over linguistic features
for fake news detection. IEEE Trans Comput Soc Syst 8(4):881–893
58. Vicario MD, Quattrociocchi W, Scala A, Zollo F (2019) Polarization and fake news: early warning of
potential misinformation targets. ACM Trans Web (TWEB) 13(2):1–22
59. Vishwakarma DK, Varshney D, Yadav A (2019) Detection and veracity analysis of fake news via scrapping
and authenticating the web search. Cogn Syst Res 58:217–229
60. Viswas News. http://www.vishvasnews.com/. Accessed: 31 Mar 2020
61. Wang, WY (2017) “Liar, liar pants on fire”: a new benchmark dataset for fake news detection. In: Pro-
ceedings of the 55th annual meeting of the association for computational linguistics (Vol 2: Short Papers),
Association for Computational Linguistics, pp 422–426
62. Wang, Y, Ma, F, Jin, Z, Yuan, Y, Xun, G, Jha, K, Su, L, Gao, J (2018) EANN: event adversarial neural
networks for multi-modal fake news detection. In: Proceedings of the 24th ACM sigkdd international
conference on knowledge discovery & data mining, pp 849–857
63. Wu Y, Fang Y, Shang S, Jin J, Wei L, Wang H (2021) A novel framework for detecting social bots with
deep neural networks and active learning. Knowl-Based Syst 211:106525
64. Wynne, HE, Wint, ZZ (2019) Content based fake news detection using n-gram models. In: Proceedings
of the 21st international conference on information integration and web-based applications & services,
pp 669–673
65. Yang, Y, Zheng, L, Zhang, J, Cui, Q, Li, Z, Yu, PS (2018) TI-CNN: convolutional neural networks for
fake news detection. arXiv:1806.00749
66. Zhou, X, Wu, J, Zafarani, R (2020) Similarity-aware multi-modal fake news detection. In: Advances
in knowledge discovery and data mining: 24th pacific-asia conference, PAKDD 2020, Singapore, May
11–14, 2020, Proceedings, Part II, Springer, pp 354–367
67. Zhou X, Zafarani R (2020) A survey of fake news: Fundamental theories, detection methods, and oppor-
tunities. ACM Comput Surv (CSUR) 53(5):1–40
68. Zhou L, Burgoon JK, Nunamaker JF, Twitchell D (2004) Automating linguistics-based cues for detecting
deception in text-based asynchronous computer-mediated communications. Group Decis Negot 13:81–106
69. Zhou X, Jain A, Phoha VV, Zafarani R (2020) Fake news early detection: a theory-driven model. Digit
Threats Res Pract 1(2):1–25
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and applicable
law.