Multimedia Tools and Applications
https://doi.org/10.1007/s11042-023-15792-1
ConFake: fake news identification using content based features
Mayank Kumar Jain¹ · Dinesh Gopalani¹ · Yogesh Kumar Meena¹
Received: 19 October 2021 / Revised: 4 May 2023 / Accepted: 6 May 2023
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023
Abstract
During India's COVID-19 lockdown, which lasted from March to June 2020, a majority of users moved onto the Internet and created numerous social networking accounts. A massive amount of information is now disseminated on the Internet via these accounts. Some false or fake information, in the form of "government letters or resolutions, religious comments, hate speech, and so on", has spread like wildfire. As a result, major social issues have arisen in areas such as unemployment, politics, healthcare, poverty, and religious cleavages. Given the vast volume of such information, manual detection of fake news or false information is challenging, and the problem demands automatic detection of false news. With this motivation, we present a novel 'ConFake' algorithm, which uses a set of eighty content-based features to identify fake news. Content-based and word vector features extracted from the textual content of news stories were used in the experiment; these features were combined and fed into machine learning classifiers. To validate the experimental findings, we ran all experiments on five publicly available datasets and one synthetically generated ConFake dataset that combines the five, namely Kaggle, McIntire, Reuter, BuzzFeed, and PolitiFact. The proposed model achieved the highest accuracy of 97.31% compared to other cutting-edge models.
Keywords Social media · Machine learning · Fake news · Linguistic features · Word embedding
✉ Mayank Kumar Jain
mayank261288@gmail.com
Dinesh Gopalani
dgopalani.cse@mnit.ac.in
Yogesh Kumar Meena
ymeena.cse@mnit.ac.in
1 Department of Computer Science and Engineering, Malaviya National Institute of Technology, Jaipur 302017, Rajasthan, India
1 Introduction
People are increasingly exposed to false news, since users can easily create profiles on social media sites to connect and talk with friends and to publish posts or tweets. The proportion of false information is growing by the day. Various kinds of false information exist, such as satire [39], hoaxes, clickbait [28], rumours [31], stances, fake reviews, and fake news [67]. The spread of false information through channels such as social media posts, tweets, blogs, and online newspapers has made it necessary to identify credible news originators. Fake news can exist in any format and spread through word of mouth, text in regional languages, instant messages, posts, photos, images, videos, short videos, audio, clippings, etc. As a result of the spread of fake news or false information, society may face dangerous situations such as mob lynching, risk to human life due to rumours and incorrect information about healthcare and medicine, and social and political confusion.
Fake news is disseminated under the pretence of being real, generally via news sources
on the Internet, to gain political or financial advantage, increase readership, and sway public
opinion [47]. It can potentially harm an agency, company, or individual, or bring financial or political profit [33].
The efficiency of accessing and sharing knowledge with others makes online social networks enticing. However, instantaneous data scattering at a high pace with little effort makes possible the widespread dissemination of false information, such as fake news [57].
According to Schwarz et al. [45], people generally assess the truth of a claim using the following criteria, each of which may be evaluated analytically or intuitively: consensus (do other people believe it?), consistency (is it in line with what I already know?), coherence (is the story internally consistent and plausible?), credibility (does it come from a trustworthy source?), and support (is there ample evidence to back it up?).
For the reasons stated, identifying false news on social media has become difficult. First, gathering and labelling fake information from the Internet is difficult because information shared on social networking sites is considered private data [44]. Second, those who write fake news alter their wording in order to spread false information. Third, the limited number of text-representation approaches makes it difficult to detect false news.
Figures 1 and 2 depict examples of true news¹ and disinformation² on social networks, respectively. These images are taken from the TruthOrFiction fact-checking website [56].
Many fact-checking organisations, such as PolitiFact [36], TruthOrFiction [56], Snopes [54], FactCheck [10], FullFact [13], HoaxSlayer [19], and VishwasNews [60], are fighting false news. These organisations operate on a conventional journalistic paradigm, in which reporters must analyse the facts to determine the validity of a news clip. This method is not automated and takes extra time [17]. Moreover, to identify false news, many writers employ Machine Learning (ML) models [5,7,22,26,29,41,48] using a variety of features, and many researchers utilise content-based characteristics [64] to improve accuracy. These problems raise several research concerns, and the problem's complexity necessitates innovative and solid answers. The following Research Questions (RQs) inspire this research:
RQ1: Which Linguistic Feature (LF) set is vital for detecting false news with high accuracy?
¹ Source: https://www.truthorfiction.com/u-s-military-dogs-being-evacuated-from-afghanistan/
² Source: https://www.truthorfiction.com/no-a-study-didnt-find-that-the-most-highly-educated-americans-are-also-the-most-vaccine-hesitant/
Fig. 1 Example of True news
RQ2: Which method of Word Embedding (WoE) with a set of LFs is best for finding fake
news?
RQ3: Which ML approach is best for detecting false news on available datasets?
To address the research questions (RQs) above, we suggest a three-phase approach centred on the textual content of articles:
1. Use the LF set to detect fake stories.
2. Use WoE with a set of LFs to improve the ConFake dataset’s ability to spot fake news.
3. Contrast this model with state-of-the-art algorithms.
Fig. 2 Example of Disinformation
This study requires no metadata related to the user or the media to detect false news. In
general, the following are the work’s major contributions:
a. ConFake Dataset: To minimize overfitting of ML models and facilitate better training, we created a larger ConFake dataset. It combines five datasets: Kaggle, Reuter, McIntire, BuzzFeed, and PolitiFact. This novel dataset of 72,413 news articles contains 35,396 true news articles and 37,479 false news articles.
b. Extensive Feature Collection: Using state-of-the-art approaches, we gather a wide range of language characteristics and identify, via Pearson correlation, a feature subset that works well on the ConFake dataset.
c. ConFake Model: An ensemble technique is applied to WoE combined with LFs using multiple ML methods.
The balance of this work is organized as follows: a survey of relevant work on linguistic characteristics and deception detection is found in Section 2. The LF categories and word vector features are then presented in Section 3. Section 4 outlines the classification process pipelines and the tools for each assessment step. Following that, Section 5 presents the algorithms for detecting fake news using linguistic and word vector characteristics. In contrast, Section 6 frames the performance of ML classifiers on the ConFake dataset and compares the testing findings with related works. Finally, Section 7 concludes the article and emphasises future work.
2 Related work
Fake news, satire [39], misinformation [58], rumour [24], hoaxes, disinformation, propaganda, and opinion spam are various categories of false information. These categories are not mutually exclusive, and numerous researchers have used them in different scenarios. This study focuses on several false news detection methods.
Pawan et al. [57] proposed the WELFake model, in which WoE vector features are combined with LFs to detect fake news. The work proceeds in phases. In the first phase, they created a novel dataset of approximately 72,000 news articles by combining four datasets: Kaggle, McIntire, Reuter, and BuzzFeed political news. In the second phase, they extracted 87 features from the literature and, after applying Pearson correlation with a threshold of 0.7, selected 20 features. In the third phase, they combined the LF 1, LF 2, and LF 3 sets with the Term Frequency (TF) matrix and fed them into a voting classifier. In the last phase, they applied a hard voting classifier to the Term Frequency-Inverse Document Frequency (TF-IDF) vector, the TF vector, and the previous voting classifier's result. They used six ML classifiers, including Support Vector Machine (SVM), Naive Bayes (NB), K-Nearest Neighbour (KNN), Decision Tree (DT), Bagging, and Boosting, and found that SVM provided 96.73% accuracy. The model's accuracy was 1.31% greater than that of the Bidirectional Encoder Representations from Transformers (BERT) model and 4.31% greater than that of the Convolutional Neural Network (CNN) model. A limitation of the WELFake model is the random selection of LFs from the 20 features. Additionally, they did not identify fake news stories using user-centred traits such as registration, account age, posting frequency, and social connections.
Anshika et al. [9] introduced an adequate deep learning model based on textual features to detect false news. Their research employed the syntactic, grammatical, semantic, and readability features of news articles. After extracting these characteristics, the authors created two feature sets: a combination of syntactic, semantic, and grammatical features, and a readability feature set. Character count, word count, title word count, stop word count, uppercase word count, word density, polarity, subjectivity, #nouns, #verbs, #adjectives, #pronouns, and #adverbs are all included in the first set. The second set comprises the Flesch Reading Ease, Automated Readability Index, Gunning Fog Index, Coleman-Liau score, Flesch-Kincaid score, SMOG Index, and Linsear Write formula. They used two small datasets in the experiment: BuzzFeed political news and Random political news. After preprocessing the datasets, they extracted the features and fed each feature set, independently and jointly, into a Long Short-Term Memory (LSTM) model for false news classification. In this trial, they achieved an accuracy of 86%. In addition, they applied base-learner ML models such as linear SVM, Gaussian SVM, Gaussian NB, and Kernel NB to the datasets, with Gaussian SVM achieving a maximum accuracy of 70%. This feature-based sequential model has only been tested on small datasets. Moreover, the authors did not consider temporal factors such as posting time, popular topics, the story's lifespan, and the consistency of the story.
Saqib Hakak et al. [18] proposed an ensemble ML model to detect fake news. This work used the DT, Random Forest (RF), and Extra Trees classifiers in ensemble form and applied them to the Information Security and Object Technology (ISOT) [6] and Liar [61] datasets. They achieved training and testing accuracies of 99.8% and 44.15%, respectively, on the Liar dataset. Moreover, on the ISOT dataset, they achieved a training and testing accuracy of 100%. The authors extracted 26 LFs to detect fake news, such as word count, character count, sentence count, average word length, average sentence length, sentiment score, and named entity recognition. The limitation of their work is the low testing accuracy on the Liar dataset.
H. Reddy et al. [40] identified fake news by using an ensemble ML method on stylometric and word vector features, achieving an accuracy of 95.49% on a combination of two datasets: FakeNewsNet [50] and McIntire [32]. They extracted 50 LFs and word-vector features such as TF, TF-IDF, Continuous Bag of Words (CBOW), and Skip-gram. After combining the LF vector and word vector, they applied ML classifiers such as Logistic Regression (LR), SVM, NB, RF, KNN, Gradient Boosting (GB), Adaboost, and Bagging. The small size of the dataset, as well as the overfitting of SVM on the LFs, are drawbacks of this study.
X. Zhou et al. [69] proposed a fake news prediction model on the PolitiFact and BuzzFeed datasets. The authors extracted lexical, syntactic, semantic, and discourse-level characteristics from news article text and fed this feature vector to ML models such as LR, NB, SVM, RF, and XGBoost, individually or in combination. On BuzzFeed, they obtained an accuracy of 87.9%, while on the PolitiFact dataset they reached an accuracy of 89.2%. This investigation used disinformation-related and clickbait-related features to identify fake news. This model's downside is the small number of articles in the corpus; the datasets also contain images that were not used in the experiment.
Yang et al. [65] presented a two-level CNN model to classify fake news. Their work involves the images and text of 20,015 news articles from 240 approved news websites. The authors extracted latent and explicit features from the text and images and concatenated both feature sets into one; this combined feature set classifies the news as real or fake. Based on latent and explicit features, they obtained precision, recall, and F1-score of 92%, better than a text-only LSTM. Since the authors used only a small dataset, user-specific factors such as account age, posting frequency, social connections, and registration status were not taken into account.
V. Pérez-Rosas et al. [35] used major LFs (e.g., n-grams, punctuation, psycho-linguistics, readability, and syntax) and achieved an accuracy of 76% on two novel datasets, FakeNewsAMT and Celebrity, which cover seven different domains. The limitations of this work are the small number of features used to detect fake news and the low accuracy of the models.
G. Gravanis et al. [17] introduced a new unbiased dataset of 3,004 articles by incorporating articles from four datasets (Kaggle-EXT, McIntire, BuzzFeed, and PolitiFact). They extracted 57 LFs and, feeding them together with WoE into ML classifiers (viz., SVM, NB, DT, KNN, Adaboost, and Bagging), achieved the highest accuracy of 94.9% using SVM. They also evaluated their approach on the individual corpora (i.e., Kaggle-EXT, BuzzFeed, PolitiFact, and McIntire), achieving accuracy rates of 99.0%, 72.70%, 84.7%, and 81%, respectively. Their investigation did not incorporate user-based features.
On the Kaggle dataset, R. Kaliyar et al. [27] proposed a deep CNN to classify news as false or authentic. They utilized pre-trained GloVe word embeddings to extract latent features and assessed performance in terms of accuracy, precision, recall, and F1-score. They also measured the True Negative Rate and the False Positive Rate, achieving 98.36% accuracy on text. Their investigation did not incorporate images. Ghanem et al. [14] used an emotion-infused neural network to identify different categories, such as propaganda, hoaxes, clickbait, and satire, on a news article and Twitter dataset. After extracting latent and content features of the text, they achieved a maximum of 96% accuracy on the clickbait dataset. They did not incorporate images into their investigation and used only short text. Shah et al. [46] used multimodal data to identify fake news on the Weibo and Twitter datasets, extracting sentiment-related features and performing image segmentation. Their experiment used a cultural optimization algorithm to achieve accuracies of 79.8% on Twitter and 89.01% on Weibo. Their investigation did not incorporate user-based features and used only short text.
Most authors apply supervised ML models [15,41,42,52,69] to various types of features, such as TF, TF-IDF, N-grams [64], LFs [9,31,35,57,69], and readability features [9,31,35,57,69], to detect fake news. Shu et al. [49] proposed the FakeNewsTracker (FNT) tool to collect, detect, and visualize fake news. Apart from that, many researchers used deep neural network models [16,63], such as CNN [3], Recurrent Neural Network (RNN) [3], BERT, and hybrid models [43], to extract latent rather than explicit features from news articles. A few authors proposed multi-modal approaches [25,37] such as SpotFake [53], Multimodal Variational Autoencoder (MVAE) [30], Event Adversarial Neural Network (EANN) [62], and SAFE [66] to detect fake news. Vishwakarma et al. [59] use scraping to analyze and detect false news and to verify its validity on the web. The spread of fake news on social media gained prominence following the 2016 US election [4]; however, false information is language-dependent, so several authors [12,51] developed models to recognize fake news in multiple languages. For false news identification, Huang et al. [21] presented an ensemble learning approach that combines four distinct models: embedding LSTM, depth LSTM, Linguistic Inquiry and Word Count (LIWC) CNN, and N-gram CNN. Furthermore, the ensemble model's optimal weights were calculated using the Self-Adaptive Harmony Search (SAHS) method, obtaining a higher accuracy of 99.4% in false news identification.
The following problems were identified in the works mentioned above:
A. In references [35,65,69], the suggested models do not achieve strong performance.
B. The authors used relatively few features.
C. The datasets were too limited and drawn from different domains.
3 Feature engineering
This section describes the feature engineering applied in the proposed algorithm, used to improve learning performance by improving dataset quality. Processing a dataset without checking for critical feature engineering issues can lead to incorrect conclusions about the model's accuracy. As a result, data preprocessing in terms of feature engineering must be addressed before conducting an analysis. The steps for completing the feature engineering task in this work are as follows:
3.1 Linguistic feature extraction
LFs are extracted from the textual content at the level of characters, words, sentences, paragraphs, and documents [17,23,33,40,65,69]. The main objective of feature extraction is to create a feature set that captures the relevant information in the actual dataset, increasing the speed of model training and improving learning accuracy. Feature extraction also helps with data visualization. We extracted 80 LFs after studying various linguistics-based articles that help detect fake news. Most of the features fall into the lexical, syntactic, and semantic-level categories; some features from sentiment, psycho-linguistics, readability, etc., are also used. We have used 12 categories of features to identify fake news or false information. These categories are described below:
3.1.1 Quantity features
Quantity features count the characters, words, sentences, paragraphs, syllables, articles, verbs, adverbs, adjectives, stop words, capital letters, self-references, group-references, and so on to capture speech information. The document was split up to count the number of characters, uppercase letters, words, sentences, paragraphs, and syllables, while a few characteristics, such as self-referencing and group-referencing, were evaluated by matching each word. Table 1 shows the 'quantity features' as well as the tools used to extract them.
Table 1 Quantity features
Self-programmed in Python: #Characters, #Words, #Sentences, #Paragraphs, #Uppercase letters, #Self-referencing, #Group-referencing
LIWC dictionary: #Syllables, #Articles, #Verbs, #Adverbs, #Adjectives
Natural Language Toolkit (NLTK) library: #Stopwords
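As an illustration of how the self-programmed counts can be obtained, the sketch below derives a few of the quantity features in Table 1 with NLTK. The helper function and its feature names are hypothetical; the paper's own scripts are not published, and the LIWC-based counts (syllables, parts of speech) would additionally require the proprietary LIWC dictionary.

```python
# Hedged sketch: a few quantity features from Table 1, computed with NLTK.
# The function name and dictionary keys are illustrative, not the paper's code.
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download("punkt", quiet=True)       # sentence/word tokenizer models
nltk.download("stopwords", quiet=True)   # English stop-word list
STOPWORDS = set(nltk.corpus.stopwords.words("english"))

def quantity_features(text: str) -> dict:
    words = word_tokenize(text)
    return {
        "n_characters": len(text),
        "n_words": len(words),
        "n_sentences": len(sent_tokenize(text)),
        "n_paragraphs": sum(1 for p in text.split("\n") if p.strip()),
        "n_stopwords": sum(w.lower() in STOPWORDS for w in words),
        "n_uppercase_words": sum(w.isupper() for w in words),
    }

print(quantity_features("BREAKING news!\nOfficials deny the viral claim."))
```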
3.1.2 Complexity features
The complexity features shown in Table 2 were used to measure the complexity of the news article. Table 2 also lists the tools used to extract them.

Table 2 Complexity features
Self-programmed in Python: average number of words per sentence, average number of characters per word, average number of syllables per word, average number of sentences per paragraph
3.1.3 Uncertainty features
Uncertainty features measure the ratio of modal verbs, certainty terms (e.g., always), generalising terms, tentative terms (e.g., perhaps), numbers, question marks, ellipses (...), hashtags, etc. We used a set of tokens to count features such as numbers, question marks, ellipses, and hashtags; the remaining terms were measured with the LIWC 2007 dictionary. Table 3 shows the 'uncertainty features' as well as the tools used to extract them.
3.1.4 Subjectivity features
When a headline story becomes biased, its quality should be considered lower because it no longer maintains impartiality. Subjectivity expresses an opinion, thought, or a person's sentiment and ranges from 0 (negative) to 1 (positive). We assess the subjectivity of news articles by calculating the ratio of factual verbs (e.g., observe), report verbs (e.g., proclaim), subjective verbs, motion verbs (e.g., move, shift), and greeting words. Motion verbs, subjective verbs, and factual verbs were measured with the help of the LIWC 2007 dictionary, while for greeting and reporting verbs we created a dictionary and matched each word. Table 4 shows the 'subjectivity features' as well as the tools used to extract them.
Table 3 Uncertainty features
LIWC dictionary: certainty, generalizing terms, tentative terms
Self-programmed in Python: numbers, question marks, hashtags, ellipses (...)
Table 4 Subjectivity features
LIWC dictionary: factual verbs, motion verbs, subjective verbs
Self-programmed in Python: greeting verbs, reporting verbs
3.1.5 Non-immediacy features
Non-immediacy features assess first-person, second-person, and third-person pronouns. The LIWC 2007 dictionary was used to measure them. Table 5 shows the 'non-immediacy features' as well as the tools used to extract them.
3.1.6 Sentiment features
Sentiment features in news content suggest a distinction between fake and true news. We calculate the sentiment of each news story by counting positive words, negative words, exclamation marks, anger words, anxiety words, sadness words, etc. These terms are measured with the help of the LIWC 2007 dictionary. Table 6 shows the 'sentiment features' as well as the tools used to extract them.
3.1.7 Diversity features
Diversity features measure lexical diversity, the number of unique words, content words (e.g., nouns, verbs, adverbs, and adjectives), and function words (e.g., pronouns, prepositions, determiners, and conjunctions). At the highest level, these diversity features can be evaluated by looking into the writing expression of news authors. Among these features, we counted the unique words by removing duplicates from the set of words, while the remaining terms were measured using the LIWC 2007 dictionary. Table 7 shows the 'diversity features' as well as the tools used to extract them.
3.1.8 Informality features
Informality comprises five dimensions: swear words (e.g., dammit), netspeak (e.g., btw, hahaha), assents (e.g., OK), nonfluencies (e.g., hm, hmmm), and fillers (e.g., I mean, you know). To assess the casualness of each news article, we consider how frequently each word or phrase appears across all dimensions. Table 8 shows the 'informality features' as well as the tools used to extract them.
Table 5 Non-immediacy features
LIWC dictionary: #First-person pronouns, #Second-person pronouns, #Third-person pronouns
Table 6 Sentiment features
LIWC dictionary: positive words, negative words, exclamation marks, anger words, anxiety words, sadness words
3.1.9 Specificity features
Specificity features consist of ratios of temporal, spatial, and sensory information, causation terms (e.g., because), exclusive terms, cognitive processes (i.e., insight), and perceptual processes (see, hear, and feel). These terms are measured with the help of the LIWC dictionary. Table 9 shows the 'specificity features' as well as the tools used to extract them.
3.1.10 Readability features
The readability features define the sentence complexity of the textual content; using them, we identified the grade level of the text writer. We measured the readability features [9,17,23] on the text using the textstat Python library to identify fake news. The readability features are the Flesch Reading Ease Index, the Flesch-Kincaid formula, the Automated Readability Index, the Coleman-Liau formula, the SMOG Index, the Gunning Fog Index formula, the New Dale-Chall formula, and the Linsear Write formula. Table 10 shows the 'readability features' as well as the tools used to extract them.
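For instance, all eight readability scores in Table 10 can be obtained from the textstat library roughly as follows; the function names are textstat's public API, although the exact calls the authors used are not stated.

```python
# Readability features of a text via textstat (pip install textstat).
import textstat

text = "The committee released its findings after a lengthy investigation."
scores = {
    "flesch_reading_ease": textstat.flesch_reading_ease(text),
    "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
    "automated_readability_index": textstat.automated_readability_index(text),
    "coleman_liau_index": textstat.coleman_liau_index(text),
    "smog_index": textstat.smog_index(text),
    "gunning_fog": textstat.gunning_fog(text),
    "dale_chall": textstat.dale_chall_readability_score(text),
    "linsear_write": textstat.linsear_write_formula(text),
}
print(scores)
```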
3.1.11 Writing pattern features
Writing pattern features focus on the text's writing style by counting the number of special characters (e.g., ?, !, quotation marks, #, @, etc.), short words (fewer than four characters), long words (more than 15 characters), and so on. To measure these features, we performed the calculation on sets of characters and tokens. Table 11 shows the 'writing pattern features' as well as the tools used to extract them.

Table 7 Diversity features
LIWC dictionary: content words, function words, pronouns, prepositions, determiners, conjunctions
Self-programmed in Python: lexical diversity, #Types (unique words), #Tokens, #Polysyllables

Table 8 Informality features
LIWC dictionary: swear words, netspeak words, assents, non-fluencies, fillers

Table 9 Specificity features
LIWC dictionary: temporal words, spatial words, causation terms, exclusive terms, cognitive processes, perceptual processes

Table 10 Readability features
Textstat Python library: Flesch Reading Ease Index, Flesch-Kincaid Grade Level, Automated Readability Index, Coleman-Liau formula, SMOG Index, Gunning Fog Index formula, New Dale-Chall formula, Linsear Write formula

Table 11 Writing pattern features
Self-programmed in Python: No. of '?', No. of '!', No. of single quotes, No. of '#', No. of '@', #Big words (more than 15 characters), #Short words (fewer than 4 characters), No. of double quotes, No. of ellipses
3.1.12 Psycho-linguistic features
Psycho-linguistic features estimate the text polarity, which pertains to positive and negative assertions, and the subjectivity; polarity falls between -1 and 1. We used the textblob Python library for text polarity and subjectivity. Table 12 shows the 'psycho-linguistic features' as well as the tools used to extract them.
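As a small sketch, the two features in Table 12 map directly onto TextBlob's sentiment property, where polarity lies in [-1, 1] and subjectivity in [0, 1].

```python
# Text polarity and subjectivity via TextBlob (pip install textblob).
from textblob import TextBlob

sentiment = TextBlob("The so-called report is an absolute disgrace.").sentiment
print(sentiment.polarity, sentiment.subjectivity)  # negative polarity, high subjectivity
```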
Many authors incorporate the above categories of LFs in their work; these features help to find cues in the textual content of news articles. In this work, the LIWC module [8,34,55,68] extracts the uncertainty, subjectivity, non-immediacy, sentiment, diversity, informality, and specificity features. Moreover, the textstat Python library extracted the readability features [9] of the news articles, and the textblob Python module was used for text polarity and subjectivity. We also implemented features such as quantity, complexity, greeting words, fillers, and report verbs that improve the model's performance.
3.2 Linguistic feature selection
In this process, we selected the least correlated features to classify the news, which reduces the number of features, decreases the computation required, and improves the accuracy of the ML models. We use the Pearson correlation (corr) for feature selection. The corr exhibits the strength of the relationship between features and measures the dependency among them. The corr between two features is calculated using (1), where x and y represent the feature vectors and x̄, ȳ represent the means of x and y, respectively. The corr ranges from -1 to +1, where -1 represents a negative corr and +1 a positive corr [9].
corr = \frac{\sum_{i=0}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=0}^{n}(x_i - \bar{x})^2 \sum_{i=0}^{n}(y_i - \bar{y})^2}}    (1)
We calculated the correlation between every pair of LFs within the same category and discarded features with a corr higher than 0.7, 0.8, or 0.9, values that indicate a strong positive linear relationship [38] between two features. From this, we obtained a constant correlation matrix; we then selected the feature pairs whose correlation values are above the threshold and dropped one feature of each pair. If one variable is measured more consistently or has stronger evidence of construct validity, it may be better to keep that variable, even when its correlation with the outcome is similar to that of the other variable (e.g., both 0.95), rather than dropping one of the two arbitrarily. Finally, a thorough examination of the theoretical and empirical support for each variable ought to be used to decide which variable to keep.
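A minimal pandas sketch of this pruning step is shown below, assuming `features` is a DataFrame holding the 80 LFs; the threshold of 0.7 mirrors the LF 1 set, and the function name is illustrative.

```python
# Hedged sketch of the correlation-based feature pruning of Section 3.2.
import numpy as np
import pandas as pd

def drop_correlated(features: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    corr = features.corr().abs()  # |Pearson correlation| between all feature pairs
    # Inspect only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)  # e.g. lf1 = drop_correlated(features)
```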
Table 12 Psycho-linguistic features
Textblob Python library: text polarity, subjectivity
3.3 Word embedding
WoE is employed to convert plain text into a numeric vector, because ML models cannot directly handle textual content. In our survey, we found two forms of WoE: content-based WoE, such as TF and TF-IDF, which concentrates on prior knowledge, and context-based WoE, such as Word2Vec, GloVe, and FastText, which focuses on textual writing patterns. A faker alters the news by repeating identical terms; for example, various false news reports and conspiracy theories disseminated on Twitter during the 2020 U.S. presidential election employed similar words and phrases to seem more credible and gain momentum on the platform.
3.3.1 Term frequency/Count vector
TF converts the text into a histogram vector that represents the frequency of each word in the document; the length of the vector is defined by the vocabulary of unique words. The formula for calculating TF is shown in (2).

TF = \frac{F_t}{T}    (2)

where F_t = number of times the term appears in the document and T = total number of terms in the document.
3.3.2 Term frequency-inverse document frequency
TF-IDF is also known as "normalised frequency". It is an extended version of TF that reflects the importance of words that occur in fewer documents. Equation (4) shows the formula for the normalised frequency, which is the product of (2) and (3).

IDF = \log_2\left(\frac{D}{t_D}\right)    (3)

where D = total number of documents and t_D = number of documents in which the token is present.

TF\text{-}IDF = TF \times IDF    (4)
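In practice, both vectors can be produced with scikit-learn, as sketched below. Note that sklearn's TfidfVectorizer uses a smoothed natural-log IDF rather than the log2 form of (3), so it approximates rather than reproduces the formula above; the paper does not name the vectorizer implementation it used.

```python
# TF (count) and TF-IDF vectors with scikit-learn; a sketch only.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["fake story spreads fast", "officials deny the fake story"]
tf = CountVectorizer().fit_transform(docs)     # raw term counts per document
tfidf = TfidfVectorizer().fit_transform(docs)  # counts reweighted by IDF
print(tf.shape, tfidf.shape)                   # (2, vocabulary size)
```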
4 ConFake model
The four phases of the fake news detection method shown in Fig. 3 are dataset preparation, feature engineering, feature concatenation, and classification. These phases involve collecting and cleaning data, extracting relevant features, combining them to create a feature vector, and using machine learning algorithms to classify news articles as real or fake. This method is a useful tool for identifying and filtering out fake news, but its effectiveness depends on the quality of the selected features and the accuracy of the classification algorithm used.
4.1 Dataset preparation
Data collection and preprocessing are essential tasks for machine learning models, as the quality and relevance of the data used to train the model have a direct impact on its performance and accuracy.
[Fig. 3 ConFake model for fake news detection: Phase 1, new dataset preparation and dataset preprocessing; Phase 2, feature engineering (linguistic feature extraction and selection, word embedding technique selection); Phase 3, concatenate features; Phase 4, apply machine learning models to label news as true or fake]
4.1.1 Dataset collection
Many datasets were used in previous studies [2,6,11,20,32,36,61], from which we identified datasets with comparable structures and categories. The datasets utilised in related studies had numerous issues, including size, category, and bias. As shown in Table 13, we prepared a large dataset consisting of five datasets: Kaggle, Reuter, McIntire, BuzzFeed, and PolitiFact. This large dataset minimizes the overfitting of ML models and facilitates better training. This novel dataset of 72,413 news articles contains 35,396 true news articles and 37,479 false news articles.

Table 13 ConFake dataset
Dataset            Total news   True news   Fake news
Kaggle [11]        20,800       10,387      10,413
Reuter [2]         44,898       21,416      23,482
McIntire [32]      6,335        3,171       3,164
BuzzFeed [20]      182          91          91
PolitiFact [36]    240          120         120
ConFake dataset    72,413       35,396      37,479
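A hypothetical sketch of assembling such a combined dataset is given below; the file names and the common text/label schema are assumptions, since the paper does not describe its merging script.

```python
# Hypothetical sketch of building the combined ConFake dataset with pandas.
import pandas as pd

files = ["kaggle.csv", "reuter.csv", "mcintire.csv",
         "buzzfeed.csv", "politifact.csv"]                   # assumed file names
parts = [pd.read_csv(f)[["text", "label"]] for f in files]   # assumed schema
confake = pd.concat(parts, ignore_index=True).drop_duplicates(subset="text")
print(len(confake), confake["label"].value_counts(), sep="\n")
```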
4.1.2 Data preprocessing
Data preprocessing consists of many tasks to handle noise in the text and missing data. It rebuilds unstructured data into a structured form, which helps improve the accuracy and performance of the model. In this work, preprocessing of the ConFake dataset removes NaN values, typographic errors, duplicate data, stop words, emoji, punctuation marks, dates, and special characters, and performs lemmatization and stemming using the Porter Stemmer algorithm.
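The cleaning steps listed above might look roughly like the following sketch; the regular expression and helper name are illustrative rather than the authors' exact pipeline (lemmatization is omitted for brevity).

```python
# Hedged sketch of Section 4.1.2 preprocessing: drop NaN/duplicates,
# strip non-alphabetic characters, remove stop words, Porter-stem tokens.
import re
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean(text: str) -> str:
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # drop punctuation, digits, emoji
    return " ".join(stemmer.stem(t) for t in text.split() if t not in STOP)

df = pd.DataFrame({"text": ["BREAKING!!! Read this...", None,
                            "BREAKING!!! Read this..."]})
df = df.dropna().drop_duplicates()        # remove NaN and duplicate rows
df["clean_text"] = df["text"].apply(clean)
print(df)
```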
4.2 Feature engineering
In feature engineering, we extracted the linguistic and word vector evidence from the text of the news articles, as discussed in Section 3.
4.3 Concatenate features
In this step, content-based WoE is combined with LFs to achieve better accuracy, because LFs or word vector features alone do not provide good accuracy. We combine TF or TF-IDF with the optimised LF set; the result is then fed into an ML classifier for classification.
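Concatenating the sparse TF matrix with the dense LF matrix can be done with scipy, as sketched below on stand-in data; keeping the result sparse avoids materialising a full vocabulary-sized array.

```python
# Sketch of the feature-concatenation step with stand-in matrices.
import numpy as np
from scipy.sparse import csr_matrix, hstack

tf_matrix = csr_matrix(np.random.rand(4, 1000))   # stand-in TF vectors
lf_matrix = np.random.rand(4, 40)                 # stand-in optimised LF set
combined = hstack([tf_matrix, csr_matrix(lf_matrix)])
print(combined.shape)                             # (4, 1040)
```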
4.4 Classification
Selecting the best-performing ML classifier is essential to designing a fake news detection model that accurately identifies fake news. In related work, we identified well-performing ML classifiers such as SVM, NB, RF, LR, KNN, Bagging, and Boosting. In text-mining tasks, SVM performs better than other classifiers; ensemble ML classifiers use weak learners to improve accuracy, and the Adaboost classifier's primary purpose is to identify patterns that are hard to classify. The ML classifiers are described as follows:
1. Support Vector Machine: SVM is a supervised machine learning technique suitable for classification on large datasets. In the experiment, we utilised a linear SVM to classify false news. It is also used in rumour detection, sentiment categorization, facial recognition, and other applications.
2. Random Forest: RF is a supervised machine learning method used in regression and classification problems. It builds several DTs and provides the result based on the outputs of each DT. In this experiment, we tried many numbers of DTs, and n_estimators = 200 offered the best accuracy.
3. Naive Bayes: NB is a supervised machine learning technique with two common variants, Gaussian and multinomial. This approach responds quickly in comparison to other classifiers and is mostly used in text classification tasks. Gaussian NB was utilised in the experiment because it handles negative values.
4. K-Nearest Neighbour: KNN performs classification by utilising feature similarity. It is a non-parametric classification approach. In this experiment, we set k = 7, which gives better accuracy.
5. Logistic Regression: This statistical technique utilises a logistic function to model a binary target variable in its most basic form, although many more complicated extensions exist. LR is used in regression analysis to estimate the parameters of a logistic model.
6. Bagging: Bagging is a parallel ensemble ML classifier. This method reduces the variance of the prediction model by generating more data during the training stage. In this study, the feature vector is partitioned into equal subsets; a DT is applied to each subset, and the prediction is estimated by taking the mean or mode of the classifiers' outputs.
7. Boosting: A sequential ensemble ML classifier that reduces bias errors and generates powerful prediction models. The phrase "Boosting" refers to transforming a weak classifier into a robust one. Boosting aggregates a large number of classifiers. Since the data samples are weighted, some may appear in the new sets more frequently: data points that are mistakenly predicted are detected, and their weights are increased in each phase so that the following learner gets closer to getting them right.
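The seven classifiers, with the hyperparameters the text reports (a linear SVM, n_estimators = 200 for RF, k = 7 for KNN, Gaussian NB), could be instantiated in scikit-learn as follows; every unstated setting is left at the library default and is therefore an assumption.

```python
# The classifiers described above; unreported settings are sklearn defaults.
from sklearn.svm import LinearSVC
from sklearn.ensemble import (RandomForestClassifier, BaggingClassifier,
                              AdaBoostClassifier)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

classifiers = {
    "SVM": LinearSVC(),                              # linear SVM (item 1)
    "RF": RandomForestClassifier(n_estimators=200),  # item 2
    "GNB": GaussianNB(),                             # item 3
    "KNN": KNeighborsClassifier(n_neighbors=7),      # item 4
    "LR": LogisticRegression(max_iter=1000),         # item 5
    "Bagging": BaggingClassifier(),                  # item 6
    "Adaboost": AdaBoostClassifier(),                # item 7
}
```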
5 Parameters and methods
The procedure to detect fake news by ML classifiers using LFs is shown in Algorithm 1, and
the procedure with word vector features is shown in Algorithm 2.
Following are the steps of Algorithm 1:
1. The first line represents the collection of the datasets and their combination. After that, preprocessing steps such as removing missing data, redundant data, emoji, etc., are applied.
2. In lines 2 and 3, the lists of document features (feature_doc) and all document features (feature_dataset) are initialised.
3. In the feature extraction step, 80 LFs are first extracted from each document, and each feature_doc vector is appended to the feature_dataset list.
4. Transform the feature_dataset values through standardisation to improve accuracy. Data standardisation is the process of rescaling the features so that they have a mean of 0 and a variance of 1. Standardisation is used when features of an input dataset have large differences between ranges or are simply measured in different measurement units. The ultimate goal of standardisation is to bring all the features down to a common scale without distorting the differences in the range of the values. In standardisation, there is no specific upper or lower bound for the maximum and minimum values. The Z-score is one of the most popular methods to standardise data; its formula is shown in (5).

Z_{score} = \frac{Value - Mean}{Standard\ deviation}    (5)

5. Apply the correlation function to the attributes of the transformed feature_dataset and obtain the feature sets LF 1, LF 2, LF 3, and LF 4 with respect to corr values less than 0.7, 0.8, 0.9, and 1. The selection of least-correlated features to classify the news is discussed in Section 3.2.
6. Building a model that works well on additional data is one of the objectives of supervised learning. It is a good idea to test our model on new data if we have any. The issue is that we do not have any new data; however, a technique like the train-test split may be used to simulate this experience. For the training and testing sets, 80% of the rows were randomly sampled without replacement and placed into the training set, while the remaining 20% were placed into the test set.
7. Finally, we applied each ML classifier to every LF set with its labels and calculated metrics using the confusion matrix discussed in Section 6.1 (see the sketch after Algorithm 1).
Algorithm 1 Algorithm to apply ML classifiers on LFs.
Input: Datasets (Kaggle, Reuter, McIntire, BuzzFeed, PolitiFact)
Output: Accuracy, Precision, Recall, F1-score.
1. ConFake_dataset(D) ← collection(Kaggle, Reuter, McIntire, BuzzFeed, PolitiFact);
2. feature_doc ← [ ];
3. feature_dataset ← [ ];
4. for doc ← 1 to len(D) do
       feature_doc.append(linguistic_features_extracted_from_textual_content_of_doc);
       feature_dataset.append(feature_doc);
5. Apply standardisation on feature_dataset.
6. Apply the correlation function to feature_dataset and create the feature sets LF 1, LF 2, LF 3, and LF 4 with respect to correlation values below 0.7, 0.8, 0.9, and 1.
7. Perform the train_test_split function on LF 1 / LF 2 / LF 3 / LF 4 and labels with a ratio of 80:20.
8. Select an ML classifier from (NB, SVM, LR, KNN, Bagging, Adaboost, and RF).
9. Training of the model: classifier.fit(feature_train, labels_train)
10. Prediction: classifier.predict(feature_test)
11. Print confusion_matrix(feature_test, labels_test)
12. Evaluate metrics using the confusion matrix.
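A compact scikit-learn rendering of steps 4-7 of Algorithm 1 (standardisation, 80:20 split, fitting, confusion matrix) might look like the sketch below; `X` and `y` stand in for the LF matrix and labels and are generated randomly here.

```python
# Sketch of Algorithm 1, steps 4-7, on stand-in data.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

X = np.random.rand(200, 80)              # stand-in 80-dimensional LF matrix
y = np.random.randint(0, 2, size=200)    # stand-in labels (0 = true, 1 = fake)

X_std = StandardScaler().fit_transform(X)   # Z-score standardisation, Eq. (5)
X_tr, X_te, y_tr, y_te = train_test_split(X_std, y, test_size=0.2)

clf = RandomForestClassifier(n_estimators=200).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(confusion_matrix(y_te, pred))
print(accuracy_score(y_te, pred))
```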
Algorithm 2 Algorithm to apply ML classifiers on word vector features.
Input: Datasets (Kaggle, Reuter, McIntire, BuzzFeed, PolitiFact).
Output: Accuracy, Precision, Recall, F1-score.
1. ConFake_dataset(D) ← collection(Kaggle, Reuter, McIntire, BuzzFeed, PolitiFact);
2. preprocess_doc ← [ ];
3. preprocess_dataset ← [ ];
4. tf_feature_doc ← [ ];
5. tf_feature_dataset ← [ ];
6. tf_idf_feature_doc ← [ ];
7. tf_idf_feature_dataset ← [ ];
8. for doc ← 1 to len(D) do
       preprocess_doc.append(preprocessing(remove_special_characters, remove_digits, remove_punctuation_marks, lowercasing, stemming, lemmatization));
       preprocess_dataset.append(preprocess_doc);
9. Calculate BOW-TF and BOW-TFIDF;
10. for doc ← 1 to len(preprocess_dataset) do
       tf_feature_doc.append(BOW_TF(doc));
       tf_feature_dataset.append(tf_feature_doc);
       tfidf_feature_doc.append(BOW_TFIDF(doc));
       tfidf_feature_dataset.append(tfidf_feature_doc);
11. Perform the train_test_split function on the feature_dataset of BOW-TF or BOW-TFIDF and labels with a ratio of 80:20.
12. Select an ML classifier from (NB, SVM, LR, KNN, Bagging, Adaboost, and RF).
13. Training of the model: classifier.fit(feature_train, labels_train)
14. Prediction: classifier.predict(feature_test)
15. Print confusion_matrix(feature_test, labels_test)
16. Evaluate metrics.
Similarly, the following steps are taken in Algorithm 2:
1. The first line represents the collection of the datasets and their combination.
2. Lines 2-7 represent the initialization of lists, where preprocess_doc is the list of sentences of a document, preprocess_dataset is the list of sentences of all documents, tf_feature_doc is the TF list of a document, tf_feature_dataset is the TF list of all documents, tf_idf_feature_doc is the TF-IDF list of a document, and tf_idf_feature_dataset is the TF-IDF list of all documents.
3. In line 8, preprocessing steps are applied to the dataset, such as removing missing data, redundant data, stop words, URLs, special characters, and punctuation, and performing stemming and lemmatization on the text data. Each preprocess_doc is appended to preprocess_dataset.
4. Lines 9 and 10 show the feature extraction step, which extracts the word vector features (TF/TF-IDF) of each document using (2) and (4). After that, these vectors are appended to the lists tf_feature_dataset and tf_idf_feature_dataset.
5. Apply standardisation to both the TF and TF-IDF lists, the same as used for LFs.
6. Perform a train-test split on both transformed TF and TF-IDF lists with labels, the same as used for LFs.
7. Finally, apply each ML classifier and calculate metrics using the confusion matrix discussed in Section 6.1.
Algorithm 3 follows Algorithms 1 and 2: lines 1 and 2 show the combination of tf_feature_dataset with the optimised feature_dataset, stored in FS1 (feature set), and the combination of tf_idf_feature_dataset with the optimised feature_dataset, stored in FS2. Standardisation is then performed on FS1 and FS2 as for the LFs. After that, the train-test split is performed on the standardised FS1 and FS2 with labels. Finally, each ML classifier is applied, and metrics are calculated using the confusion matrix discussed in Section 6.1.
Algorithm 3 Algorithm to feed content-based features (i.e., LFs and TF/TF-IDF) into ML classifiers.
Input: feature_dataset, tf_feature_dataset, tf_idf_feature_dataset.
Output: Accuracy, Precision, Recall, F1-score.
1. FS1 = concat(feature_dataset, tf_feature_dataset);
2. FS2 = concat(feature_dataset, tf_idf_feature_dataset);
3. Apply standardisation on FS1, FS2.
4. Perform the train_test_split function on FS1/FS2 and labels with a ratio of 80:20.
5. Select an ML classifier from (NB, SVM, LR, KNN, Bagging, Adaboost, and RF).
6. Training of the model: classifier.fit(feature_train, labels_train)
7. Prediction: classifier.predict(feature_test)
8. Print confusion_matrix(feature_test, labels_test)
9. Evaluate metrics using a confusion matrix.
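As a sketch of Algorithm 3 on stand-in matrices (dense here for brevity, although a real TF matrix would usually stay sparse):

```python
# Sketch of Algorithm 3: FS1 = concat(LFs, TF), standardise, split, evaluate.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

lf = np.random.rand(100, 40)     # stand-in optimised LF set
tf = np.random.rand(100, 500)    # stand-in (dense) TF matrix
y = np.random.randint(0, 2, 100)

fs1 = StandardScaler().fit_transform(np.hstack([lf, tf]))          # lines 1, 3
X_tr, X_te, y_tr, y_te = train_test_split(fs1, y, test_size=0.2)   # line 4
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)            # lines 5-6
print(classification_report(y_te, clf.predict(X_te)))              # lines 7-9
```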
6 Results and discussion
The experiments in this study were performed in PyCharm Community 2019.3 with Python 2.7 on the Mac OS X operating system with 8 GB of memory.
6.1 Evaluation metrics
A confusion matrix measured the performance of the classification models. To measure the performance of the proposed method, we use four metrics: Accuracy (ACC), Precision (P), Recall (R), and F1-score. The calculation of these metrics necessitates the parameters "True positive" (Tp), "True negative" (Tn), "False positive" (Fp), and "False negative" (Fn). The following performance metrics are measured using the confusion matrix shown in Table 14:
1. Accuracy: defined as the proportion of correct predictions to the total number of predictions.

ACC = \frac{T_p + T_n}{T_p + T_n + F_p + F_n}

2. Precision: expressed as the proportion of correctly identified true positives to the total number of positive predictions. It is used to calculate the positive predicted value.

P = \frac{T_p}{T_p + F_p}

3. Recall: expressed as the ratio of correctly identified positive predictions to the total number of actual positives.

R = \frac{T_p}{T_p + F_n}

4. F1-score: defined as the harmonic mean of precision and recall. It assists in determining the model's testing accuracy.

F1\text{-}score = \frac{2}{P^{-1} + R^{-1}}
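Equivalently, the four metrics can be computed directly from predictions with scikit-learn, as in this small check:

```python
# The confusion-matrix metrics above, computed with scikit-learn.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(accuracy_score(y_true, y_pred),   # (Tp+Tn)/(Tp+Tn+Fp+Fn)
      precision_score(y_true, y_pred),  # Tp/(Tp+Fp)
      recall_score(y_true, y_pred),     # Tp/(Tp+Fn)
      f1_score(y_true, y_pred))         # harmonic mean of P and R
```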
6.2 Machine learning classifiers
All the experiments were conducted in four phases, and seven ML classifiers were used to analyse the accuracy of the proposed work.
In the first phase, the accuracy was calculated for different ranges of Pearson correlation between features. First, 40 features with low correlation, i.e., lying in the range of -0.7 to 0.7, were selected for accuracy calculation. The second set of 45 features contains features with a correlation between -0.8 and 0.8. The third set of 51 features was then obtained with correlation coefficients ranging from -0.9 to 0.9. Finally, accuracy was calculated using all 80 features extracted from the dataset. It can be seen from Table 15 that the inclusion of redundant or highly correlated features (correlation magnitude greater than 0.7) does not yield much improvement in accuracy. Table 15 shows the ML classifier performance on the LFs of the ConFake dataset for different values of corr, where RF and Bagging achieve the best accuracy, i.e., 95.48% and 95.19%, respectively. Moreover, other classifiers such as SVM, KNN, DT, Adaboost, and LR provided accuracies of 89.72%, 84.75%, 91.19%, 89.62%, and 89.20%, respectively, while GNB provided a low accuracy of 60.59%.
In the second phase, we extracted LFs from all benchmark datasets: Kaggle, McIntire, Reuter, BuzzFeed, and PolitiFact. After that, ML classifiers such as GNB, SVM, KNN, Adaboost, Bagging, LR, and RF were applied to the LFs, where RF performs better on the Kaggle, McIntire, and Reuter datasets with accuracies of 98.21%, 92.02%, and 96.92%, respectively, and the remaining datasets also achieve good accuracy. Bagging performs better on the small BuzzFeed and PolitiFact datasets, achieving accuracies of 86.48% and 79.16%. The classifiers' performance on the five datasets is shown in Table 16.

Table 14 Confusion matrix
                 Predicted true   Predicted false
Actual true      Tp               Fn
Actual false     Fp               Tn

Table 15 ML classifiers' performance (accuracy, %) on LFs of the ConFake dataset for different values of correlation
Classifier   Corr <= |1|   Corr < |0.7|   Corr < |0.8|   Corr < |0.9|
GNB          60.59         61.19          60.70          60.64
SVM          89.72         85.35          85.66          85.97
KNN          84.75         84.51          83.89          83.98
DT           91.19         91.14          91.21          91.58
Adaboost     89.62         88.44          88.75          88.66
Bagging      95.19         95.00          95.03          95.12
LR           89.20         84.39          84.67          84.92
RF           95.48         95.38          95.29          95.49
The third phase extracts content-based word embeddings, i.e., TF and TF-IDF features, from ConFake and feeds them into the ML classifiers. RF and Bagging consistently perform well on both WoE variants: the accuracy of RF was 94.34% on TF and 94.51% on TF-IDF, and similarly, the accuracy of Bagging was 94.61% on TF and 94.10% on TF-IDF. LR achieves the best accuracy of 95.21% on TF. Other classifiers such as SVM, Adaboost, KNN, and GNB also perform reasonably on both word vector features. The performance of the ML classifiers on the WoE of ConFake is shown in Table 17.
In the last phase, we combined the WoE with the LFs and applied the ML classifiers to them. RF performs best on both the TF-LFs and TF-IDF-LFs of the ConFake dataset: its accuracy was 97.31% on TF-LFs and 97.14% on TF-IDF-LFs. Apart from that, a few classifiers such as LR, SVM, Bagging, and Adaboost also provide good accuracy on TF-LFs and TF-IDF-LFs, up to 97%. The performance of all ML classifiers on LFs + TF features is shown in Table 18 and on LFs + TF-IDF features in Table 19.
The overall results show that RF achieves the highest accuracy of 97.31% on the combined linguistic and WoE (TF) features of the ConFake dataset. Moreover, RF performs better in all phases, achieving the best accuracy compared to the other ML classifiers.
Table 16 ML classifiers' performance (accuracy, %) on five different datasets
Classifier   Kaggle   McIntire   Reuter   BuzzFeed   PolitiFact
GNB          96.05    62.86      67.72    62.16      61.11
SVM          97.88    89.50      95.12    75.67      68.75
KNN          89.62    82.00      88.69    70.27      56.25
Adaboost     98.04    88.95      94.30    81.08      72.91
Bagging      98.09    91.47      96.15    86.48      79.16
LR           97.54    88.95      94.85    75.67      68.75
RF           98.21    92.02      96.92    81.08      75.00
Table 17 ML classifiers' performance (accuracy, %) on word vector features of the ConFake dataset
Classifier   TF      TF-IDF
GNB          80.00   84.45
SVM          89.61   92.27
KNN          82.40   74.05
Adaboost     91.97   91.40
Bagging      94.61   94.10
LR           95.21   92.41
RF           94.34   94.51
6.3 Related work comparison
We compare the proposed ConFake method with four related methods, compiled in Table 20. We use the seven ML classifiers on six datasets, including the ConFake dataset, as shown in Table 13. We preprocess the dataset using the LF and WoE features before applying the ML classifiers. In the case of LFs, we preprocess the dataset by removing punctuation marks, excluding single quotes, double quotes, "#", "@", ellipses, "?", and "!", as well as missing values, duplicate values, emoji signs, and irrelevant statements, and by performing stemming and lemmatization on the text. For WoE, we removed all punctuation marks, missing values, duplicate values, emoji signs, and irrelevant statements, and then performed stemming and lemmatization on the text. This preprocessing method differs from other cutting-edge techniques. We tested each ML model on datasets with an 80%-20% data split. We used six datasets of different sizes from various sources in this work.
1. Ahmed et al. [1] performed an experiment on the Kaggle-EXT dataset, which contains 25,200 articles; they used the TF-IDF word vector feature rather than LFs and achieved 92% accuracy using a linear SVM.
2. X. Zhou et al. [69] used 53 LFs to detect fake news on the PolitiFact and BuzzFeed datasets, which contain only 240 and 182 articles, respectively. They achieved their best accuracy with a linear SVM: 89.2% on PolitiFact and 87.9% on BuzzFeed.
3. G. Gravanis et al. [17] introduced a new unbiased dataset of 3,004 articles by incorporating articles from four datasets (Kaggle-EXT, McIntire, BuzzFeed, and PolitiFact). They extracted 57 LFs along with the WoE, fed them into ML classifiers (viz., SVM, NB, DT, KNN, Adaboost, and Bagging), and achieved the highest accuracy of 94.9% using SVM. They also evaluated their approach on the individual corpora (i.e., Kaggle-EXT, BuzzFeed, PolitiFact, and McIntire) and achieved accuracies of 99.0%, 72.70%, 84.7%, and 81%, respectively.
4. H. Reddy et al. [40] used 50 features to detect fake news on a combination of the FakeNewsNet and McIntire datasets, which contains 6,755 articles. They combined the LFs with word-vector features and fed them into ML classifiers. In this experiment, they achieved the highest accuracy of 95.49% using the Adaboost and GB classifiers.
5. ConFake: utilised a massive dataset of 72,413 news items from five distinct datasets (Kaggle, Reuter, McIntire, BuzzFeed, and PolitiFact) and achieved the highest accuracy of 97.31% with RF compared to the related techniques. To provide a valid assessment, we applied this method to each dataset separately. Compared to [17], our approach improves the accuracy on the McIntire dataset from 81.0% to 92.02% and on the BuzzFeed dataset from 72.7% to 81.08%.

Table 18 ML classifiers' performance on linguistic and word vector (TF) features of the ConFake dataset
Classifier   Accuracy (%)   Precision   Recall   F1-score
KNN          77.95          .77         .81      .79
RF           97.31          .96         .99      .97
GNB          82.92          .84         .82      .83
LR           96.21          .96         .97      .96
Bagging      97.09          .96         .99      .97
Adaboost     94.41          .94         .95      .95
SVM          95.08          .94         .96      .95

Table 19 ML classifiers' performance on linguistic and word vector (TF-IDF) features of the ConFake dataset
Classifier   Accuracy (%)   Precision   Recall   F1-score
KNN          70.64          .69         .78      .73
RF           97.14          .96         .99      .97
GNB          82.30          .84         .81      .82
LR           96.27          .96         .97      .96
Bagging      96.82          .95         .99      .97
Adaboost     94.37          .94         .95      .95
SVM          94.69          .94         .96      .95
6.4 Discussion
In this section, we discuss the performance of various ML algorithms on the ConFake dataset
with varying levels of correlation between the features. The ConFake dataset is a combi-
nation of five different datasets and has a moderate size of 72,413 instances. Given the
size and dimensionality of the ConFake dataset, some of the more computationally efficient
algorithms for this type of dataset could include RF and GB. These algorithms can han-
dle high-dimensional data well and can parallelize computations, which can improve their
efficiency. On the other hand, algorithms that may require more memory or computational
resources, such as GNB or SVM, may be less efficient for this dataset, especially if not
optimised properly.
The efficiency of an ML classifier can be assessed on several factors, including training time, testing time, memory requirements, and computational complexity. A dataset with a large number of features can significantly affect efficiency, as more features lead to higher computational requirements and longer training and testing times. RF and Bagging proved to be the most efficient algorithms for the ConFake dataset, owing to their ability to parallelize computations and to tolerate noisy or irrelevant features. These ensembles can also be faster and more memory-efficient than models such as SVM and KNN, which may require more computational resources and memory to train and test. Some of the other algorithms used in the study, such as GNB, are less computationally expensive but also offer lower accuracy. The efficiency of a classifier ultimately depends on all of these factors together, as summarised in Table 20.
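As a concrete illustration of how such efficiency comparisons can be carried out, the hypothetical harness below times model fitting for the ensembles discussed above against a linear SVM. It is not the benchmarking code used in this study; X_train and y_train are assumed to come from a pipeline such as the one sketched earlier.

# Hypothetical harness for comparing training-time efficiency.
# X_train and y_train are assumed to come from the pipeline sketched earlier.
import time
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.svm import LinearSVC

def fit_seconds(clf, X, y):
    start = time.perf_counter()
    clf.fit(X, y)
    return time.perf_counter() - start

def compare_training_times(X_train, y_train):
    models = {
        "RF (parallel)": RandomForestClassifier(n_estimators=100, n_jobs=-1),
        "Bagging (parallel)": BaggingClassifier(n_estimators=10, n_jobs=-1),
        "Linear SVM": LinearSVC(),
    }
    for name, clf in models.items():
        print(f"{name}: {fit_seconds(clf, X_train, y_train):.1f} s")

Because RF and Bagging accept n_jobs=-1, they can spread estimator construction across all available cores, whereas a single LinearSVC fit is sequential; this is one source of the efficiency gap discussed above.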
Table 20 ConFake comparison with related methods

Attribute: Ahmed et al. [1] | X. Zhou et al. [69] | G. Gravanis et al. [17] | H. Reddy et al. [40] | ConFake Method
Dataset: 1. Kaggle-EXT | 1. PolitiFact; 2. BuzzFeed | 1. Kaggle-EXT; 2. BuzzFeed; 3. PolitiFact; 4. McIntire; 5. UNBiased | Combination of FakeNewsNet and McIntire datasets | 1. Kaggle; 2. Reuter; 3. McIntire; 4. BuzzFeed; 5. PolitiFact; 6. ConFake
No. of news articles: 25,200 | 1. 240; 2. 182 | 1. 23,340; 2. 240; 3. 182; 4. 6,310; 5. 3,004 | 6,755 | 1. 20,800; 2. 44,898; 3. 6,335; 4. 182; 5. 240; 6. 72,413
LFs: No | Yes | Yes | Yes | Yes
No. of LFs used: Nil | 53 | 57 | 50 | 80
Word vector feature: TF-IDF | No | Word2Vec | TF, TF-IDF, Word2Vec | TF, TF-IDF
Training and testing ratio: 80%-20% | 80%-20% | 80%-20% | 80%-20% | 80%-20%
Best classifier: Linear SVM | Linear SVM | SVM | Adaboost, GB | RF
Accuracy (per dataset, in the order listed above): 1. 92% | 1. 89.2%; 2. 87.9% | 1. 99.0%; 2. 72.7%; 3. 84.7%; 4. 81.0%; 5. 94.9% | 95.49% | 1. 98.21%; 2. 96.92%; 3. 92.02%; 4. 81.08%; 5. 75.00%; 6. 97.31%
7 Conclusion
In this work, a new dataset, “ConFake”, is proposed; it is built from five open-source corpora (Kaggle, McIntire, Reuter, BuzzFeed, and PolitiFact) to reduce the limitations and biases of any single corpus in distinguishing fake news from real news. It contains a total of 72,413 news articles, comprising 35,396 true news articles and 37,479 false news articles. The LFs were extracted with the LIWC 2007 dictionary, the readability features with the textstat Python library, and the text polarity and subjectivity features with the textblob library; the remaining features were self-programmed in Python (a minimal sketch of this extraction is given below). In this experiment, 80 features were extracted and fed into seven ML classifiers. To evaluate the WoE features, namely TF and TF-IDF, each feature set was first fed separately into the ML classifiers; then TF was combined with the LFs and, similarly, TF-IDF was combined with the LFs, and both combinations were fed into the classifiers. The performances of the classifiers were compared, and the RF classifier achieved the highest accuracy of 97.31% on the combination of TF and LFs. To provide a valid assessment, the ConFake method was also applied to each dataset separately. When compared to cutting-edge methods, our method improves the accuracy on the McIntire dataset by 11% and on the BuzzFeed dataset by 9%.

However, the study has two limitations: 1) it examines only the text field of news articles, ignoring other metadata such as images, user information, events, social statistics, and propagation paths, and relying on text alone makes fake stories harder to distinguish from real ones; and 2) our method is not well suited to the early detection of fake news on social media.
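A minimal sketch of the readability and sentiment portion of this feature extraction is given below, assuming only the textstat and textblob libraries named above. The particular indices shown are an illustrative subset rather than the full 80-feature set, and the LIWC-based LFs are omitted because LIWC 2007 is a licensed dictionary.

# Illustrative sketch of per-article readability and sentiment features,
# using the textstat and textblob libraries named above. The chosen
# indices are an assumed subset, not the full 80-feature ConFake set.
import textstat
from textblob import TextBlob

def extract_content_features(text):
    blob = TextBlob(text)
    return {
        # Readability features (textstat)
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "gunning_fog": textstat.gunning_fog(text),
        "smog_index": textstat.smog_index(text),
        # Sentiment features (textblob)
        "polarity": blob.sentiment.polarity,          # in [-1, 1]
        "subjectivity": blob.sentiment.subjectivity,  # in [0, 1]
        # Simple self-programmed surface features
        "word_count": len(blob.words),
        "sentence_count": len(blob.sentences),
    }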
In future work, we will incorporate user- and social-context-based features to improve the performance of this approach. User-based features (e.g., the user's registration ID, age, gender, #followers, #posts) and social-context-based features (e.g., likes, comments, shares) can effectively discriminate fake news from actual news. Moreover, we will utilise deep learning models in place of the ML models to recognise fake news, since deep learning models handle large datasets very effectively. Many efforts have been made in recent years to increase the dependability and credibility of online material, but some of the most critical issues remain unsolved. First, most research concentrates on LFs in texts written in English; other widely used and regional languages still need to be taken into consideration. Second, the current work has relied primarily on supervised learning methodologies; given the vast amounts of unlabeled data on social media, unsupervised models must be designed. Third, as most of this research has been conducted on customised datasets, constructing compelling standard datasets is critically important. The shortage of publicly accessible large-scale datasets restricts benchmark comparison of the various techniques.
Funding The authors did not receive support from any organization for the submitted work.
Data Availability The data that support the findings of this study are available from the corresponding author
upon reasonable request.
Declaration
Conflicts of interest The authors declare they have no financial interests.
References
1. Ahmed, H (2017) Detecting opinion spam and fake news using n-gram analysis and semantic similarity.
PhD thesis, University of Victoria
2. Ahmed, H, Traore, I, Saad, S (2017) Detection of online fake news using n-gram analysis and machine
learning techniques. In: Intelligent, secure, and dependable systems in distributed and cloud environments:
first international conference, ISDDC 2017, Vancouver, BC, Canada, October 26-28, 2017, Proceedings
1, Springer, pp 127–138
3. Ajao, O, Bhowmik, D, Zargari, S (2018) Fake news identification on twitter with hybrid cnn and rnn
models. In: Proceedings of the 9th international conference on social media and society, pp 226–230
4. Allcott H, Gentzkow M (2017) Social media and fake news in the 2016 election. J Econ Perspect 31(2):211–236
5. Bali, APS, Fernandes, M, Choubey, S, Goel, M (2019) Comparative performance of machine learning
algorithms for fake news detection. In: Advances in computing and data sciences: third international
conference, ICACDS 2019, Ghaziabad, India, April 12–13, 2019, Revised Selected Papers, Part II 3,
Springer, pp 420–430
6. Bezerra JFR (2021) Content-based fake news classification through modified voting ensemble. J Inf
Telecommun 5(4):499–513
7. Brașoveanu, AMP, Andonie, R (2019) Semantic fake news detection: A machine learning perspective. In:
Advances in Computational Intelligence: 15th international work-conference on artificial neural networks,
IWANN 2019, Gran Canaria, Spain, June 12–14, 2019, Proceedings, Part I 15, Springer, pp 656–667
8. Burgoon, JK, Blair, JP, Qin, T, Nunamaker, JF (2003) Detecting deception through linguistic analysis. In:
Intelligence and Security Informatics: first NSF/NIJ symposium, ISI 2003, Tucson, AZ, USA, June 2–3,
2003 Proceedings 1, Springer, pp 91–101
9. Choudhary A, Arora A (2021) Linguistic feature based learning model for fake news detection and
classification. Exp Syst Appl 169:114171
10. Fact Check. https://www.factcheck.org/. Accessed: 31 Mar 2020
11. Fake News Kaggle dataset. https://www.kaggle.com/c/fake-news/data?select=train.csv. Accessed: 15
Apr 2020
12. Faustini PHA, Covoes TF (2020) Fake news detection in multiple platforms and languages. Exp Syst
Appl 158:113503
13. Fullfact. https://fullfact.org/. Accessed: 31 Mar 2020
14. Ghanem B, Rosso P, Rangel F (2020) An emotional analysis of false information in social media and
news articles. ACM Trans Int Technol (TOIT) 20(2):1–18
15. Gilda, S (2017) Notice of violation of IEEE publication principles: evaluating machine learning algorithms
for fake news detection. In: 2017 IEEE 15th student conference on research and development (SCOReD),
IEEE, pp 110–115
16. Gogate, M, Adeel, A, Hussain, A (2017) Deep learning driven multimodal fusion for automated deception
detection. In: 2017 IEEE symposium series on computational intelligence (SSCI), IEEE, pp 1–6
17. Gravanis G, Vakali A, Diamantaras K, Karadais P (2019) Behind the cues: a benchmarking study for fake
news detection. Exp Syst Appl 128:201–213
18. Hakak S, Alazab M, Khan S, Gadekallu TR, Maddikunta PKR, Khan WZ (2021) An ensemble machine
learning approach through effective feature extraction to classify fake news. Future Gener Comput Syst
117:47–58
19. Hoax Slayer. http://hoaxslayer.com/. Accessed: 31 Mar 2020
20. Horne, B, Adali, S (2017) This just in: Fake news packs a lot in title, uses simpler, repetitive content in
text body, more similar to satire than real news. In: Proceedings of the international AAAI conference on
web and social media, vol 11, pp 759–766
21. Huang Y-F, Chen P-H (2020) Fake news detection using an ensemble learning model based on self-adaptive
harmony search algorithms. Exp Syst Appl 159:113584
22. Jain, MK, Garg, R, Gopalani, D, Meena, YK (2022) Review on analysis of classifiers for fake news
detection. In: Emerging technologies in computer engineering: cognitive computing and intelligent IoT,
Springer, pp 395–407
23. Jain, MK, Gopalani, D, Meena, YK, Kumar, R (2020) Machine learning based fake news detection
using linguistic features and word vector features. In: 2020 IEEE 7th Uttar pradesh section international
conference on electrical, electronics and computer engineering (UPCON), IEEE, pp 1–6
24. Jin, Z, Cao, J, Guo, H, Zhang, Y, Luo, J (2017) Multimodal fusion with recurrent neural networks for
rumor detection on microblogs. In: Proceedings of the 25th ACM international conference on multimedia,
pp 795–816
25. Jin Z, Cao J, Zhang Y, Zhou J, Tian Q (2016) Novel visual and statistical image features for microblogs
news verification. IEEE Trans Multimed 19(3):598–608
26. Kaliyar, RK, Goswami, A, Narang, P (2019) Multiclass fake news detection using ensemble machine
learning. In: 2019 IEEE 9th international conference on advanced computing (IACC), IEEE, pp 103–107
27. Kaliyar RK, Goswami A, Narang P, Sinha S (2020) FNDNet-a deep convolutional neural network for
fake news detection. Cogn Syst Res 61:32–44
28. Kaur S, Kumar P, Kumaraguru P (2020) Detecting clickbaits using two-phase hybrid CNN-LSTM biterm
model. Exp Syst Appl 151:113350
29. Khan JY, Khondaker MTI, Afroz S, Uddin G, Iqbal A (2021) A benchmark study of machine learning
models for online fake news detection. Mach Learn Appl 4:100032
30. Khattar, D, Goud, JS, Gupta, M, Varma, V (2019) MVAE: multimodal variational autoencoder for fake
news detection. In: The world wide web conference, pp 2915–2921
31. Maan, M, Jain, MK, Trivedi, S, Sharma, R (2022) Machine learning based rumor detection on twitter data.
In: Emerging technologies in computer engineering: cognitive computing and intelligent IoT. Springer,
pp 259–273
32. McIntire dataset. https://github.com/lutzhamel/fake-news/tree/master/data. Accessed: 31 Mar 2020
33. Meel P, Vishwakarma DK (2020) Fake news, rumor, information pollution in social media and web: a
contemporary survey of state-of-the-arts, challenges and opportunities. Exp Syst Appl 153:112986
34. Newman ML, Pennebaker JW, Berry DS, Richards JM (2003) Lying words: Predicting deception from
linguistic styles. Pers Soc Psychol Bull 29(5):665–675
35. Pérez-Rosas, V, Kleinberg, B, Lefevre, A, Mihalcea, R (2017) Automatic detection of fake news.
arXiv:1708.07104
36. Politifact news dataset. http://www.politifact.com/. Accessed: 31 Mar 2020
37. Qi, P, Cao, J, Yang, T, Guo, J, Li, J (2019) Exploiting multi-domain visual information for fake news
detection. In: 2019 IEEE international conference on data mining (ICDM), IEEE, pp 518–527
38. Ratner B (2009) The correlation coefficient: Its values range between +1/−1, or do they? J Target Meas Anal Mark 17(2):139–142
39. Ravi K, Ravi V (2017) A novel automatic satire and irony detection using ensembled feature selection
and data mining. Knowledge-Based Syst 120:15–33
40. Reddy H, Raj N, Gala M, Basava A (2020) Text-mining-based fake news detection using ensemble
methods. Int J Autom Comput 17(2):210–221
41. Reis, JCS, Correia, A, Murai, F, Veloso, A, Benevenuto, F (2019) Explainable machine learning for fake
news detection. In: Proceedings of the 10th ACM conference on web science, pp 17–26
42. Reis JCS, Correia A, Murai F, Veloso A, Benevenuto F (2019) Supervised learning for fake news detection.
IEEE Intell Syst 34(2):76–81
43. Ruchansky, N, Seo, S, Liu, Y (2017) CSI: a hybrid deep model for fake news detection. In: Proceedings
of the 2017 ACM on conference on information and knowledge management, pp 797–806
44. Saquete E, Tomás D, Moreda P, Martínez-Barco P, Palomar M (2020) Fighting post-truth using natural
language processing: a review and open challenges. Exp Syst Appl 141:112943
45. Schwarz N, Newman E, Leach W (2016) Making the truth stick and the myths fade: lessons from cognitive
psychology. Behav Sci Policy 2:85–95
46. Shah, P, Kobti, Z (2020) Multimodal fake news detection using a cultural algorithm with situational and
normative knowledge. In: 2020 IEEE congress on evolutionary computation (CEC), IEEE, pp 1–7
47. Sharma K, Qian F, Jiang H, Ruchansky N, Zhang M, Liu Y (2019) Combating fake news: a survey on
identification and mitigation techniques. ACM Trans Intell Syst Technol (TIST) 10(3):1–42
48. Shu, K, Wang, S, Liu, H (2019) Beyond news contents: The role of social context for fake news detection.
In: Proceedings of the twelfth ACM international conference on web search and data mining, pp 312–320
49. Shu K, Mahudeswaran D, Liu H (2019) FakeNewsTracker: a tool for fake news collection, detection, and
visualization. Comput Math Organ Theory 25:60–71
50. Shu K, Mahudeswaran D, Wang S, Lee D, Liu H (2020) Fakenewsnet: a data repository with news
content, social context, and spatiotemporal information for studying fake news on social media. Big Data
8(3):171–188
51. Silva RM, Santos RLS, Almeida TA, Pardo TAS (2020) Towards automatically filtering fake news in
portuguese. Exp Syst Appl 146:113199
52. Singh, V, Dasgupta, R, Sonagra, D, Raman, K, Ghosh, I (2017) Automated fake news detection using
linguistic analysis and machine learning. In: International conference on social computing, behavioral-
cultural modeling, & prediction and behavior representation in modeling and simulation (SBP-BRiMS),
pp 1–3
53. Singhal, S, Shah, RR, Chakraborty, T, Kumaraguru, P, Satoh, S (2019) Spotfake: a multi-modal framework
for fake news detection. In: 2019 IEEE fifth international conference on multimedia big data (BigMM),
IEEE, pp 39–47
54. Snopes. https://www.snopes.com/. Accessed: 31 Mar 2020
55. Tausczik YR, Pennebaker JW (2010) The psychological meaning of words: LIWC and computerized text
analysis methods. J Lang Soc Psychol 29(1):24–54
56. Truthorfiction. https://www.truthorfiction.com/. Accessed: 31 Mar 2020
57. Verma PK, Agrawal P, Amorim I, Prodan R (2021) WELFake: word embedding over linguistic features
for fake news detection. IEEE Trans Comput Soc Syst 8(4):881–893
58. Vicario MD, Quattrociocchi W, Scala A, Zollo F (2019) Polarization and fake news: early warning of
potential misinformation targets. ACM Trans Web (TWEB) 13(2):1–22
59. Vishwakarma DK, Varshney D, Yadav A (2019) Detection and veracity analysis of fake news via scrapping
and authenticating the web search. Cogn Syst Res 58:217–229
60. Viswas News. http://www.vishvasnews.com/. Accessed: 31 Mar 2020
61. Wang, WY (2017) “Liar, liar pants on fire”: a new benchmark dataset for fake news detection. In: Pro-
ceedings of the 55th annual meeting of the association for computational linguistics (Vol 2: Short Papers),
Association for Computational Linguistics, pp 422–426
62. Wang, Y, Ma, F, Jin, Z, Yuan, Y, Xun, G, Jha, K, Su, L, Gao, J (2018) EANN: event adversarial neural
networks for multi-modal fake news detection. In: Proceedings of the 24th ACM sigkdd international
conference on knowledge discovery & data mining, pp 849–857
63. Wu Y, Fang Y, Shang S, Jin J, Wei L, Wang H (2021) A novel framework for detecting social bots with
deep neural networks and active learning. Knowl-Based Syst 211:106525
64. Wynne, HE, Wint, ZZ (2019) Content based fake news detection using n-gram models. In: Proceedings
of the 21st international conference on information integration and web-based applications & services,
pp 669–673
65. Yang, Y, Zheng, L, Zhang, J, Cui, Q, Li, Z, Yu, PS (2018) TI-CNN: convolutional neural networks for
fake news detection. arXiv:1806.00749
66. Zhou, X, Wu, J, Zafarani, R (2020) Similarity-aware multi-modal fake news detection. In: Advances
in knowledge discovery and data mining: 24th pacific-asia conference, PAKDD 2020, Singapore, May
11–14, 2020, Proceedings, Part II, Springer, pp 354–367
67. Zhou X, Zafarani R (2020) A survey of fake news: Fundamental theories, detection methods, and oppor-
tunities. ACM Comput Surv (CSUR) 53(5):1–40
68. Zhou L, Burgoon JK, Nunamaker JF, Twitchell D (2004) Automating linguistics-based cues for detecting
deception in text-based asynchronous computer-mediated communications. Group Decis Negot 13:81–106
69. Zhou X, Jain A, Phoha VV, Zafarani R (2020) Fake news early detection: a theory-driven model. Digit
Threats Res Pract 1(2):1–25
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and applicable
law.