SocialTruth Project Approach to Online
Disinformation (Fake News) Detection and Mitigation
Michał Choraś
Marek Pawlicki
Rafał Kozik
UTP University of Science and
Technology
Bydgoszcz, Poland
Konstantinos Demestichas
Pavlos Kosmides
ICCS, National Technical University
of Athens
Athens, Greece
Manik Gupta
London South Bank University
London, United Kingdom
ABSTRACT
The extreme growth and adoption of Social Media, in combination with their poor governance and the lack of quality control over the digital content being published and shared, has led to a continuous deterioration of information veracity. Current approaches entrust content verification to a single centralised authority, lack resilience towards attempts to successfully "game" verification checks, and make content verification difficult to access and use. In response, our ambition is to create an open, democratic, pluralistic and distributed ecosystem that allows easy access to various verification services (both internal and third-party), ensuring scalability and establishing trust in a completely decentralized environment. This is the ambition of the EU H2020 SocialTruth project. In this paper, we present the innovative project approach and the vision of effective online disinformation detection for various practical use-cases.
KEYWORDS
pattern recognition, security, safety, detection, fake news,
networks.
ACM Reference Format:
Michał Choraś, Marek Pawlicki, Rafał Kozik, Konstantinos Demestichas, Pavlos Kosmides, and Manik Gupta. 2019. SocialTruth Project Approach to Online Disinformation (Fake News) Detection and Mitigation. In Proceedings of the 14th International Conference on Availability, Reliability and Security (ARES 2019) (ARES '19), August 26–29, 2019, Canterbury, United Kingdom. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3339252.3341497
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ARES ’19, August 26–29, 2019, Canterbury, United Kingdom
©2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-7164-3/19/08. . . $15.00
https://doi.org/10.1145/3339252.3341497
1 INTRODUCTION
During the last decade, there has been an unprecedented revolution in how people interconnect and socialize. From the early days of Facebook to today's proliferation of Social Media, people have been embracing this new form of socialization. Social networks, media and platforms have become the standard means by which our societies communicate, exchange information, conduct business, co-create, learn and acquire knowledge. However, the extreme growth and adoption of Social Media, in combination with the lack of control over the digital content being published and shared, has called information veracity into question. Establishing synergies with innovative information and communication technologies (such as semantic analysis tools, blockchains, emotional descriptors, lifelong learning) can enhance the auditability, reliability and accuracy of the information being shared in Social Media, leading to a society better grounded in truth. The key is to safeguard the distributed and open nature of Social Media, strengthening pluralism and participation and mitigating censorship.
According to a recent MIT study [20], false information spreads six times faster than truth and reaches more people than true stories, often with devastating impacts. A single rumour spread in 2013 by a compromised Associated Press account on Twitter resulted in an estimated $136.5 billion drop in the S&P 500 index [10]. Over the past two years, fake and hoaxed news have gained tremendous proportions, particularly with Donald Trump's presidential campaign in the United States, as many people used the social networks as a distribution system to spread highly inaccurate or completely erroneous stories [14]. Cases of fake news are becoming countless [11], and the motives for spreading them are often financial or political.
In the Freedom on the Net 2017 report [19], Freedom House reaches the same conclusion. The report studied 65 countries worldwide between June 2016 and May 2017 and found that online manipulation and disinformation tactics played an important role in elections in at least 18 of the 65 countries during this period, including the United States.
Because of this high rate of false information spread, large media organisations face increasing pressure to respond quickly and accurately to breaking news stories. Although established workflows and editorial structures, such as the use of copytasters, have been able to deal with the task in the past, the challenge has severely increased, as news sources have multiplied both in number and diversity in the era of social media. In such an environment, publishing organizations face the increasingly difficult task of identifying a breaking news story early, confirming its accuracy, providing appropriate background, and publishing or broadcasting it as quickly as possible, thus providing high-quality journalism [17].
Therefore, in this paper we give an overview of the approach adopted by the EU H2020 SocialTruth project. The project develops innovative tools to fight online disinformation and tests them in specific use-cases, namely:
(1) journalists and news editors (the solution will be tested at ADNKRONOS, Italy)
(2) search engines (the solution will be tested at QWANT, France)
(3) citizens and web users (the solution will be supported by Infocons, Romania)
(4) teaching material providers (the solution will be tested by De-Agostini, Italy)
The paper is structured as follows: in Section 2 the challenges and the vision on how to solve the problem are presented. Section 3 details the SocialTruth project's means and techniques to counter the online disinformation problem. Conclusions are given thereafter.
2 STATE OF AFFAIRS, CHALLENGES AND VISION
Facebook plans to use improved machine learning methods to identify potential fake news articles, which can be passed on to external fact checkers. Other attempts to deal with the problem also exist, namely those of FakeBox [1], FightHoax [2] and Truly Media [3]. However, after examining current approaches, SocialTruth advocates that:
a) content verification cannot be entrusted to a single centralised authority;
b) the aim should not be to devise the "single most perfect verification algorithm", since even the most sophisticated deep learning classification model is optimized at the time it is created - and as a result its accuracy deteriorates as new sources of fake news arise every day and the writing style of fake news changes, in order to successfully "game" and bypass verification checks;
c) content verification should be easy and flexible to use "as a service" by individual users and professional organisations alike.
In response to these unmet challenges, it is necessary to take into consideration the existing approaches with their limitations, but also to focus strongly on creating an open, democratic, pluralistic and distributed ecosystem that enables easy access to various verification services, ensuring scalability and establishing trust in a completely decentralized environment. An ideal system uniquely combines several cutting-edge ICT technologies, including social mining, multimodal content analytics, blockchains and lifelong learning machines, enabling deep analysis, contextualization and understanding of mass-volume digital content, collected in real time from different social networks and web sources. By bringing together and incorporating business insights not just from the ICT domain but also from several other fields, such as digital journalism, mass communication media and social media, such a system promises a flexible, robust and pluralistic digital content verification and trust establishment solution that is open and easily adoptable by the concerned stakeholders, service providers and user communities alike.
A. Fake news data analysis strategies and sources
The analyzed news can contain a variety of data types, such as unstructured text, images/videos, references to other sources, etc. One of the methods for dealing with textual information is proposed in [15]. One should consider the following principles to address fake news detection:
• Indexing and gathering of information published on the Internet in order to cross-reference current news with previous ones (e.g. to detect duplicates or pictures/photos used in a different context). We consider:
– State-of-the-art image processing techniques for content analysis and image comparison
– State-of-the-art text analysis techniques (e.g. document-term matrices, etc.)
• Reputation scoring to identify the reliability of the person and/or information source providing the news. We propose to consider reputation evaluation of:
– The webpage providing/forwarding the news
– The person publishing information/news via a social network, etc.
– The content itself
• Comparison of similar news published by different information sources
• Machine learning techniques for content feature analysis
• Analysis of semantics by means of applied ontologies
Machine learning in such a complex and dynamic environment as fake news detection requires effective techniques and the efficient analysis of heterogeneous data sources, as presented in Fig. 1 [8].
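The cross-referencing principle above (comparing a new story against previously indexed ones via document-term representations) can be sketched as follows. This is an illustrative sketch only, not the SocialTruth implementation; the function names and the similarity threshold are our own assumptions.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Bag-of-words term-frequency vector for one document."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def find_near_duplicates(new_article, archive, threshold=0.8):
    """Flag archived articles whose wording closely matches the new one."""
    new_vec = term_vector(new_article)
    return [doc_id for doc_id, text in archive.items()
            if cosine_similarity(new_vec, term_vector(text)) >= threshold]
```

A real index would use an inverted index and TF-IDF weighting rather than a linear scan, but the comparison step is the same in spirit.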
Figure 1: Online Disinformation and Fake News Detection data analysis strategies and sources
The advantages of an open, democratic, pluralistic and distributed ecosystem are straightforward:
• No "one size fits all" solution and no vendor lock-in: an open ecosystem that provides access to configurable combinations of content analytics and verification services (with support for text, image and video content) through standard Application Programming Interfaces (APIs).
• Distributed trust and reputation establishment powered by blockchain technology, strengthening auditability and revealing information cascades.
• Integration of Lifelong Learning Machines (LLM) that constantly accumulate experience and learn new paradigms of fake news.
• A Digital Companion for convenient everyday access of individual users to verification services from within their browsers.
In this context, ICT engineers, data scientists and blockchain experts need to work together with media specialists and user communities, in an effective and constructive manner, to co-create such an open and innovative content verification solution. This will help:
– Individual users to verify the validity of Social Media content and stop the spreading of false information. Besides this, individual users will be able to check and verify the original author/source of the content.
– Media organisations, story writers, content authors and journalists to boost their investigative and creative capabilities by enabling them to cross-check and combine various multimedia information sources, retrieve and use relevant and verifiable background information, and maintain a stream of real-time updates.
– Search engines, social media platforms and online advertising networks to improve information veracity and contribute to a healthier and more sustainable web and social media ecosystem.
A system performing these functions needs to fulfil the following set of objectives:
• Develop a distributed content verification solution with a complexity-free Digital Companion for online credibility verification of digital content found on the web and social media.
• Compose a digital content analytics and verification ecosystem with support for text, image and video, open to third-party service providers.
• Leverage blockchain technologies to establish distributed reputation and trust in digital content sharing.
• Deploy a distributed and thoroughly validated architecture (TRL-7) for the delivery of the credibility evaluation services.
• Introduce innovative business models for news, web and social media stakeholders, and provide support to the EU strategic agendas and policies.
The main ambition of the project is to provide multiple, extensible and reusable capabilities for verifying the credibility of digital content and detecting hoaxes and chains of news-scams spread in social media. A distributed architecture would make it possible to scan and process vast amounts of digital content from social media and the Web to identify fake news and to provide individuals and professionals with a certain degree of confidence about its accuracy and credibility.
One of the crucial components is the Digital Companion - an open source browser plugin that enables individuals to invoke one or more verification services and customise their use. When multiple verification services are to be combined, an open-design software engine assists in carrying out the meta-verification process. Together with the Digital Companion, it increases the credibility of the content shared in social media and improves the reliability of search engines, online advertising and click-stream analytics by limiting click percentages on hoaxes and falsified stories.
The project enables third-party service providers to plug into the ecosystem and make their own content analytics and verification services available through standard APIs. The system exploits a distributed architecture and blockchain technology in order to provide a decentralised mechanism for content verification. This decentralised approach will radically facilitate the assessment of the shared content itself and will pave the way for a wider adoption of decentralised and community-based approaches in the next generation of Social Media platforms.
This foundation enhances the role of prosumers, communities and small businesses, overcoming technological barriers, introducing innovative and participatory forms of quality journalism, and using various data in a secure manner. The focus is on a series of advanced technologies to significantly enhance the production, management, use and reuse of digital content in Social Media, including social mining, lifelong learning machines, blockchains, multi-level content analytics and multimedia verification services. This enables the exploitation of the different data sources in order to introduce a novel and effective content verification mechanism. Such a mechanism promotes high-quality journalism, enhancing the role of prosumers, communities and small businesses in the field of media.
The use of novel social mining technologies, such as emotional content descriptors, advanced machine learning techniques, multimedia verification algorithms and blockchain technologies, enables the emergence of a highly efficient and highly distributed solution for identifying falsified information and discovering information cascades across Social Media. The scope of the project is to provide a distributed solution, in order to avoid the current situation where information is collected in a centralized manner by big data companies outside Europe. Such companies harvest individual users' data in order to provide personalised or easily "clickable" content to the users, thus increasing their revenue when the users click on the provided content.
As for the development of intermediary-free solutions addressing information veracity for Social Media, the solutions contribute to the understanding of information cascades, the spreading of information and the identification of information sources, the openness of algorithms, and users' access to and control of their personal data (such as profiles, images, videos, biometric, geolocation and local data).
Personal data is better protected by limiting clicks on dubious digital content and suspicious web sources. Furthermore, the solution is based on open algorithms (namely, those of the expert meta-verification engines, the lifelong learning models, as well as part of the verification services), in order for the community of developers to be able to evolve them to tackle the future evolution of fake news industries. The entire ecosystem is open to third-party verification service providers through open interfaces.
C. Design Principles
The distributed architecture we propose follows an open and
modular design, embodying the following core capabilities:
• Digital Companion: This is an easy-to-use browser plugin that allows a non-professional user to invoke a meta-verification process upon some form of digital content (e.g. an article), passing its URI as input to a meta-verification engine. In the case of non-professional use, the Digital Companion can be used by the author of the digital article, by a reproducer (who shares the article in Social Media) or even by a simple reader of the article, who wishes to get an estimation of the credibility of the content before or after reading it. In the case of professional use, the Digital Companion is a web front-end for medium/large organisations (e.g. news agencies, search engines, etc.) that allows several calls per day to the APIs of the meta-verification engine(s). SocialTruth follows a user-centred design approach for the Digital Companion.
• Distributed Verification Services: This is a set of heterogeneous verification services, each one providing a specific type of content analytics (e.g. for text, image, video) or verification-relevant functionality (e.g. emotional descriptors, social influence mapping). Each service can be deployed at a different hosting facility (e.g. different servers or clouds), hence there is no imposed centralization. All of them use the same standard interfaces, which allows them to be easily accessible, reusable and interchangeable. The registrar of service providers and the services they offer is stored and maintained in the blockchain.
• Expert Meta-Verification Engine(s): Much like a meta-search engine combines and presents results from multiple search engines, the expert meta-verification engine combines verification results from various sources to compute a meta-score that reflects the credibility of the digital content under consideration. It follows an open design, open algorithms and an expert-systems approach, while most of its settings and weights (e.g. which verification services to prefer or to avoid, with which priority, etc.) can optionally be configured through its standard web-service interfaces. An example of a fake-news verification engine based on a two-level convolutional neural network with reasoning based on collective user intelligence can be found in [16].
• Blockchain: The blockchain is used as a distributed system of record with respect to digital content verification history. Since the complex web and social media landscape is characterised by several competing content creators and distributors, each with their own motives, interests, strategies and practices, the blockchain is an ideal tool to establish reputation and trust without the need for a central authority or intermediary (thus also avoiding centralising even more regulatory power in the US Internet giants, such as Facebook or Google). Hence, a public distributed ledger provides an auditable and immutable trail of verification actions and reputation scores. The blockchain stores article identification information, article descriptors (e.g. hash codes for digital content integrity), author identification information, verification and meta-verification scores, as well as identification information for the verification services that have been used to calculate them. It also holds the registrar of verification service providers and the services they offer.
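The idea of a meta-score combining results from several verification services, with configurable per-service weights, can be sketched as follows. This is a minimal illustrative sketch under our own assumptions (a simple weighted average over scores in [0, 1]); the actual engine follows an expert-systems approach and is more elaborate.

```python
def meta_score(service_scores, weights=None):
    """
    Combine per-service credibility scores (0.0 = surely fake,
    1.0 = surely credible) into a single weighted meta-score.
    `weights` lets a user prefer or avoid particular services,
    mirroring the configurable settings described above.
    """
    if weights is None:
        # Default: treat every responding service equally.
        weights = {name: 1.0 for name in service_scores}
    total = sum(weights.get(name, 0.0) for name in service_scores)
    if total == 0:
        raise ValueError("no active verification services")
    return sum(score * weights.get(name, 0.0)
               for name, score in service_scores.items()) / total
```

The service names here ("text_style", "image_forensics", etc.) are hypothetical labels for illustration; doubling the weight of one service pulls the meta-score towards its verdict.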
D. Meta-Verification process
For our purposes, digital content is defined as text, photos and videos, produced, uploaded and/or edited by individuals, journalists, reporters, writers, bloggers, etc., on news sites, social networks and web channels. The concept is to break down the digital content into its constituent elements and subsequently call individual verification services, ultimately combining their results.
The verification process will scan the content shared through Social Media and will identify its sources, metadata, media elements and writing style, as well as how it has spread across the Internet. It will attempt to verify these elements and will flag the content as fake or not, with a degree of certainty about its accuracy, providing relevant background if applicable, in a process that is partially similar to [13]. In this light, the verification process encompasses the following capabilities:
• Seamless data crawling and social mining from multiple, heterogeneous external web sources and media. Streamlined multi-layer semantic analysis, forgery detection of multimedia content and indexing of corresponding sources.
• Deep understanding and analysis of authors' writing styles and content semantics based on multidimensional contextual reasoning and inference.
The procedure will integrate these capabilities through an intelligent meta-verification algorithm and make them easily usable through a flexible end-user tool (the Digital Companion), designed via an efficient user-centred methodology. By integrating and interfacing with advanced services in digital content analysis, it promises to open new applications and opportunities to wider audiences and stakeholders, including but not limited to digital content production professionals, aggregation and supply professionals, journalists and editors, online advertising, search engine and e-commerce companies, content prosumers, social media sharing networks, etc.
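The decomposition step described above (break content into constituent elements, call an individual verification service per element, then combine results) can be sketched as a simple dispatch loop. This is our own illustrative sketch, not the project's implementation; the element types and service callables are assumptions.

```python
def verify_article(article, services):
    """
    Break a piece of digital content into its constituent elements
    and call the matching verification service for each element,
    collecting per-element results for later meta-verification.
    `article` maps element types ("text", "image", ...) to content;
    `services` maps the same types to callables returning a
    credibility score in [0, 1].
    """
    results = {}
    for element_type, content in article.items():
        service = services.get(element_type)
        if service is not None:  # skip elements with no registered service
            results[element_type] = service(content)
    return results
```

In a distributed deployment each callable would wrap a remote API call to a registered verification service rather than a local function.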
3 SOCIALTRUTH APPROACH, TECHNIQUES AND
METHODS FOR ONLINE DISINFORMATION
DETECTION
This section provides insights into specific aspects of content verification, employed on the basis of the general architecture and the meta-verification concept presented earlier.
A. Understanding Author and Writing Style
Semantic technologies are about understanding a story's author and writing style, which helps in the credibility evaluation process. To address this, the solution uses as a starting point, and extends, the semantic technology of an existing tool developed by Expert System (ESF), namely COGITO, which facilitates deep understanding of language.
The solution extends this capability through advanced and innovative features like "writeprint" or stylometric analysis, which make it possible to analyse the style of writing behind each story acquired from the web, social networks and any other textual source, with the aim to:
• Understand whether an individual who is publishing a story has a style of writing that can be related and mapped to another style from a historical database.
• Understand whether contents published on the web using different accounts or nicknames are actually related to the same person (i.e. clustering of different virtual IDs).
This stylometric analysis is based on a series of technical parameters, such as the usage of short words, conjunctions, vocabulary richness and complexity, lexical differentiation, etc., which are strictly related to specific human factors and behaviour, thus effectively defining a sort of "fingerprint" (writeprint) of the writer.
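A few of the stylometric parameters just mentioned can be computed very directly. The sketch below is illustrative only: it covers short-word usage, conjunction frequency and vocabulary richness (type-token ratio), whereas a real writeprint uses many more features; the conjunction list is our own assumption.

```python
import re

# Small illustrative list; a production system would use a fuller one.
CONJUNCTIONS = {"and", "but", "or", "nor", "for", "so", "yet",
                "because", "although", "while", "since"}

def writeprint_features(text):
    """Compute a handful of stylometric features of a text."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    if not words:
        return {"short_word_ratio": 0.0, "conjunction_ratio": 0.0,
                "type_token_ratio": 0.0}
    return {
        # Share of words with three letters or fewer.
        "short_word_ratio": sum(len(w) <= 3 for w in words) / len(words),
        # How often the author uses conjunctions.
        "conjunction_ratio": sum(w in CONJUNCTIONS for w in words) / len(words),
        # Vocabulary richness: distinct words over total words.
        "type_token_ratio": len(set(words)) / len(words),
    }
```

Comparing such feature vectors across texts (e.g. with a distance metric) is one way to relate an unknown story to styles in a historical database.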
B. Semantic analysis and Clustering of similar news
Using the semantic analysis capability of COGITO as a starting point, the goal is to identify the primary and secondary subjects of articles and stories, as well as other elements relevant for classification and entity extraction (such as people, organizations and places). Thus, the solution aims at being able to classify text according to a detailed, customizable tree of categories, enabling modification according to the end-user's requirements. These capabilities make it possible to:
(1) Make information management automatic, more efficient and independent of subjective criteria.
(2) Immediately identify useful information and reduce search time, by simplifying access to content and enabling search by subject/topic.
(3) Cluster information according to customizable taxonomies and categories.
Clustering using categorisation and entity extraction is one element. Extracting more semantic tags, such as relations, verbs and events, can lead the proposed solution to find correlations based on different analogy or similarity criteria. This is valuable since, for example, the following two sentences are different from the statistical point of view, but similar from the semantic point of view: "John is driving his car to home" versus "John is reaching home by his vehicle".
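The gap between statistical and semantic similarity in this sentence pair can be demonstrated numerically. The sketch below is illustrative only: the tiny hand-made synonym map stands in for a full semantic network and is entirely our own assumption.

```python
def tokens(sentence):
    """Lower-cased token set of a sentence."""
    return set(sentence.lower().split())

def jaccard(a, b):
    """Purely lexical (statistical) overlap between two token sets."""
    return len(a & b) / len(a | b)

# Hand-made synonym groups standing in for a semantic network;
# each surface form maps to a shared concept label.
SYNONYMS = {"car": "vehicle", "driving": "travelling",
            "reaching": "travelling"}

def normalise(token_set):
    """Map each token to its concept label before comparing."""
    return {SYNONYMS.get(t, t) for t in token_set}

s1 = tokens("John is driving his car to home")
s2 = tokens("John is reaching home by his vehicle")
```

On the raw tokens the Jaccard overlap is only 0.4, while after mapping words to shared concepts it rises to 0.75, illustrating why semantic tags find correlations that plain statistics miss.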
C. Sentiment/Emotional Analysis
The solution needs to have the capability to extract a full set of emotions from textual content, not just using a standard positive/negative/neutral evaluation approach (sentiment) but also providing a finer granularity of the different kinds of feelings (i.e. stress, fear, trust, anger, etc.). When applied to sources like social networks and the web, this feature can be used, amongst others, as a bias estimator (e.g. detecting users that may have bad intentions towards an event, a person, an infrastructure, etc.) or for assessing public opinion about a specific topic of interest.
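A minimal way to go beyond a positive/negative/neutral label is lexicon-based emotion counting, sketched below. The toy lexicon is entirely our own assumption; real systems rely on large curated emotional resources and contextual disambiguation.

```python
# A toy emotion lexicon for illustration only.
EMOTION_LEXICON = {
    "fear":  {"afraid", "threat", "panic", "danger"},
    "anger": {"outrage", "furious", "disgrace", "attack"},
    "trust": {"reliable", "official", "confirmed", "verified"},
}

def emotion_profile(text):
    """Count lexicon hits per emotion, giving a finer-grained signal
    than a plain positive/negative/neutral sentiment label."""
    words = set(text.lower().split())
    return {emotion: len(words & vocab)
            for emotion, vocab in EMOTION_LEXICON.items()}
```

A profile dominated by fear or anger terms on an otherwise factual topic could serve as one input to the bias estimator mentioned above.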
D. Natural Language Processing baseline
For the implementation of the aforementioned advanced Natural Language Processing (NLP) and semantic functionalities, the solution uses the semantic engine COGITO. This allows it to provide advanced semantic solutions, including semantic search, text analytics, ontology and taxonomy management, automatic categorization, automated self-help solutions, extraction of unstructured information and natural language processing. COGITO is powered by two main components:
• The Disambiguator, a multi-level linguistic engine able to disambiguate the meaning of a word by recognising the context in which that word occurs. Disambiguation can be described as the process of resolving conflicts that arise when a term can express more than one meaning, leading to different interpretations of the same string of text. The ultimate aim of such a process is to associate each term with the author's intended use of the term itself.
• The Sensigrafo, a semantic network that represents and stores the different semantic relations between the words of a language. Unlike traditional dictionaries, where words are listed in alphabetical order, words contained in this database are arranged in groups of items expressing identical or similar meaning. These groups are connected to each other by millions of logical and language-related links.
The COGITO semantic engine starts from this consolidated architecture, but will require further development to customize it for the domain of disinformation and fake news, in terms of taxonomy/category definition, entities to be extracted, as well as emotions and writeprints to be identified from text. In order to apply this dedicated tuning to the semantic network, a manual or semi-automatic approach can be used. The second approach is preferable and relies on the use of machine learning techniques to create new ontologies automatically, which can then be validated by human beings. This technique can also be used in the opposite direction, so that the output of the semantic analysis serves as an input to the machine learning algorithms; these bidirectional processes allow the creation of a truly high-quality hybrid engine.
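The context-based disambiguation idea behind the Disambiguator can be illustrated with a simplified Lesk-style algorithm: pick the sense whose gloss shares the most words with the surrounding context. This is a generic textbook sketch, not COGITO's proprietary method; the senses and glosses are our own toy data.

```python
# Toy sense inventory: each sense of an ambiguous word has a short gloss.
SENSES = {
    "bank": {
        "finance": "an institution that accepts deposits and lends money",
        "river":   "the sloping land alongside a body of water",
    }
}

def disambiguate(word, context):
    """Return the sense label whose gloss overlaps the context most
    (a simplified Lesk algorithm)."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES.get(word, {}).items():
        overlap = len(context_words & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```

A semantic network like the Sensigrafo replaces the flat glosses here with millions of interlinked concept groups, but the principle of resolving a term via its context is the same.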
E. Image/Photo/Video Verification
Nowadays, photo and video content is an essential part of media coverage, often attracting the consumer's attention more than the actual information provided as text. Thus, image and video data can be used as powerful tools to create untruthful information or to unreasonably strengthen a message provided to a recipient. For example, fake images illustrating the textual content can be considered common elements of fake news. Fake photos can be linked to both factual (true) news as well as to totally fake information shared online. Fabricated photos might be posted for propaganda reasons, but also for strengthening the intended message and for inducing emotions. Often, publishers highlight the textual content of news by means of a random photo stored in the publisher's archives that is contextually related to an ongoing situation, yet not taken at the time and place of the presented or analysed event. Other possible indicators of fake photos are: excessively drastic and unrealistic scenes, emotional messages, and the time elapsed between the described event and photo publication (sometimes seconds). Sometimes, poor quality of the image/video content in relation to the context (e.g. excessively low or excessively high quality) can indicate that the image/video could not have been captured during the given situation. Hence, image verification becomes crucial for fake news analysis and can be used to determine whether a given image is similar or not to previously posted images. To verify the trustworthiness of photos, one can employ and combine approaches such as the following:
• Person re-identification - This relates to the problem of identifying people across images that have been taken using different cameras, or across time using a single camera. Due to variations in viewpoint, pose, illumination, expression, aging, cosmetics and occlusion, a given individual may appear considerably different across different camera views. Hence, re-identification is an important capability for fake image analysis. Deciding whether two image patches correspond to each other is quite a challenging and difficult problem, because large variations in viewpoint and lighting across different views can cause two images of the same person to look quite different and can cause images of different people to look very similar. Typically, methods for re-identification include two components: a method for extracting invariant and discriminative features from input images, and a similarity metric for comparing those features across images. Research on re-identification will focus either on finding an improved set of features, finding an improved similarity metric for comparing features, or a combination of both.
• Contextual image analysis - Inconsistencies in elements such as the actual weather or season, the architecture of the depicted place, or time indicators visible in the photo (e.g. clocks) can help verify that the presented photo was taken at another place or at another time than it pretends. Therefore, techniques like image geolocalization, which use recognition techniques from computer vision, can be used to estimate the location (at city, region, or global scale) of ordinary ground-level photographs and to determine the actual context within which the image was taken. Additionally, GPS information and metadata annotations can be useful in geolocating images.
• Verification of the publisher - Examining the history of posts/news can help detect anomalies in the publication history and possibly identify an invalid publisher, since a fake piece of news or photo is often the first piece of content published by a malevolent user, or even his/her first online activity ever.
• Analysis of image features - Services such as http://fotoforensics.com allow for online analysis of modifications made to a given photo, using Error Level Analysis (ELA) of submitted JPEG files. In general, these algorithms detect the modified regions in a photo, exploiting the fact that such regions generate larger errors during repeated (lossy) JPEG re-savings.
• Reverse image searching - Finding photos similar to, or slightly modified versions of, the searched photo, including photos with changed ratio and resolution, altered colours, horizontal flips, trimming, etc. Reverse image search engines also allow ordering the search results by the time of online publication; it is therefore possible to discover that a photo illustrating current events has already been shared on the web for months or years. The most popular online tools for reverse image searching include images.google.com and tineye.com. On the other hand, there is a limited number of online tools available for analysing video similarities, and thus for detecting modified, fake video content. One such example is the YouTube Data Viewer developed by Amnesty International [4], allowing for detection of older versions of the same video. It can be used to determine the original video when multiple copies of the same video from the same date are available. Given an arbitrary target object marked with a bounding box at the beginning of a video, the goal of visual tracking is to localize this target in subsequent video frames. An ideal tracker should be fast, accurate and robust to appearance changes of the target; this is a challenging problem that needs to be addressed as part of the video verification innovation activities.
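The two-component recipe from the person re-identification bullet above (a feature extractor plus a similarity metric) can be sketched in a few lines. This is a minimal illustration: the toy feature vectors and the 0.9 decision threshold are assumptions, and in practice the features would come from a learned embedding network.

```python
import math

def cosine_similarity(a, b):
    """Similarity metric for comparing feature vectors extracted from two images."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def is_same_person(features_a, features_b, threshold=0.9):
    """Decide whether two image patches depict the same individual by
    thresholding the similarity of their invariant, discriminative features."""
    return cosine_similarity(features_a, features_b) >= threshold
```

Research on improved features changes what goes into the vectors; research on improved metrics replaces the cosine step with a learned distance.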
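The Error Level Analysis idea from the image-features bullet can likewise be illustrated with a small sketch. Real ELA re-saves a JPEG with an imaging library and compares the compression error per region; here flat lists of grayscale values stand in for images, and the amplification factor and threshold are illustrative assumptions.

```python
def error_level_map(original, resaved, scale=10):
    """Per-pixel absolute difference between an image and its lossily
    re-saved copy, amplified for inspection. Edited regions tend to show
    a markedly different error level than the rest of the image."""
    return [min(255, abs(o - r) * scale) for o, r in zip(original, resaved)]

def suspicious_regions(ela, threshold=50):
    """Pixel indices whose error level stands out, hinting at a locally
    modified (pasted or retouched) region."""
    return [i for i, e in enumerate(ela) if e > threshold]
```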
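Reverse image search engines typically rely on perceptual hashing to find slightly modified copies of a photo. The following difference-hash (dHash) sketch operates on an already downscaled grayscale grid; real implementations first resize the image (commonly to 9x8 pixels), which is assumed to have happened here.

```python
def dhash_bits(pixels):
    """Difference hash over a grayscale pixel grid: one bit per horizontal
    neighbour comparison. Re-scaled, re-compressed or brightness-shifted
    copies of a photo yield hashes close to the original's."""
    return [1 if left > right else 0
            for row in pixels
            for left, right in zip(row, row[1:])]

def hamming_distance(a, b):
    """Number of differing bits; a small distance indicates near-duplicates."""
    return sum(x != y for x, y in zip(a, b))
```

Because only neighbour comparisons are hashed, a uniform brightness change leaves the hash untouched, which is exactly the robustness reverse search needs.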
4 SOCIALTRUTH - DETECTION SYSTEM DESIGN
A. Distributed reputation and trust through Blockchain technologies
Blockchain is a security architecture allowing different users to share sensitive data in a secure and decentralised manner, without a central authority. This architecture will be used to share the reputation or credibility (verification scores) of digital content between the different end-users in a trusted way. These end-users, interested in the details behind this
scoring, will be able to read the verification history and to decide whether or not they trust the specific content. The solution bases its blockchain algorithm on a public blockchain. This way, it is able to directly use a complete, functional infrastructure, with a specifically developed part on top of it. The solution has to manage remuneration and data transactions simultaneously. These data transactions will integrate data and metadata (such as Article ID, Author ID, Article Descriptors and different verification results) with different links between them. To that end, a blockchain implementation with smart contracts or chaincode is mandatory (such as Ethereum or Hyperledger). One also has to take into account privacy (including General Data Protection Regulation - GDPR - compliance) and confidentiality requirements (for example, to encrypt any sensitive data). Different solutions exist but need to be adapted for this kind of data (like the zk-SNARKs solution used for the Zcash blockchain). One can complement the solution with an access control mechanism, also based on blockchain, able to manage access control to the data. For that, one can use and extend the innovative Access Control - BlockChain (AC-BC) mechanism, based on the Ethereum blockchain. Another important aspect is auditability, i.e. using the data and the access-control logs stored publicly in the blockchain. These data will not only enable inference of abnormal behaviours, but could also be used by one or several Expert Meta-verification Engines to improve the data processing performed on the multimedia data.
In conclusion, the proposed innovation strikes a balance between the different blockchain aspects (public vs private, Proof-of-Work vs Proof-of-Stake consensus, smart contract vs chaincode) to reach a commercially viable solution.
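The hash-chained sharing of verification scores described above can be illustrated with a toy in-memory ledger that records score transactions, exposes an auditable per-article history, and detects tampering. This is a sketch only: a production deployment would use Ethereum or Hyperledger smart contracts/chaincode, and the field names (article, author, score) are illustrative.

```python
import hashlib
import json

class VerificationLedger:
    """Toy hash-chained ledger of content verification scores."""

    def __init__(self):
        # Genesis block anchors the chain.
        self.blocks = [{"prev": "0" * 64, "data": {"genesis": True}}]

    def _digest(self, block):
        # Canonical JSON so the hash is deterministic.
        return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

    def record(self, article_id, author_id, score):
        """Append a verification-score transaction linked to the previous block."""
        self.blocks.append({
            "prev": self._digest(self.blocks[-1]),
            "data": {"article": article_id, "author": author_id, "score": score},
        })

    def history(self, article_id):
        """Auditability: any end-user can read an article's verification history."""
        return [b["data"] for b in self.blocks
                if b["data"].get("article") == article_id]

    def chain_valid(self):
        """Tampering with any past block breaks every later `prev` link."""
        return all(self.blocks[i]["prev"] == self._digest(self.blocks[i - 1])
                   for i in range(1, len(self.blocks)))
```

Reading `history()` corresponds to the end-user deciding whether to trust the content; `chain_valid()` is what decentralisation buys, since no single authority can silently rewrite past scores.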
B. Security and privacy by design
The design of the solution architecture is performed in accordance with the Security-by-Design and Privacy-by-Design principles to:
• implement basic security controls at the design phase, so as to minimise the number of vulnerabilities discovered in later phases, decreasing their potential impact on the prototype system's security, and
• ensure the privacy of users whose data is analysed using the system (open data available on the web, social media data, end-user data).
For the purposes of system security assurance, important Security-by-Design principles defined by OWASP are implemented, by means of establishing appropriate security defaults, minimising the attack surface area by reducing the number of authorised users for any given functionality, realising the so-called "least privilege" principle by assigning only the minimum set of user rights needed to operate the services, and applying separation of duties (considering various user roles with various levels of trust). Also, the use of third-party components will be thoroughly analysed in terms of security, and its impact on the overall level of protection will be evaluated.
In the design phase, applicable Privacy-by-Design principles [5] are also implemented, including privacy-preserving means as default system features, embedded in its design, end-to-end data protection and privacy preservation management through the entire data lifecycle, and considering privacy and data protection without an impact on the platform functionalities. From the technical viewpoint, encryption by default will be considered, in order to mitigate the security concerns associated with unauthorised access to the data, integrating it into data workflows in an automatic and seamless manner, whereas mechanisms for secure destruction or removal of stored data will be established. The implementation of appropriate Privacy Enhancing Technologies (PETs) is also considered, such as the use of privacy keys, digital signatures, secure authentication and the adoption of secure communication protocols. For the purposes of disassociating the data / generated content that might allow detection of the identity of the data owner, applicable techniques will be employed, such as anonymisation, pseudonymisation and data clustering, to prevent the correlation of Personally Identifiable Information (PII) for purposes other than the detection of fake news considered in the project. Also, the data gathered, stored and analysed (e.g. during use-case execution) will be used only for the predefined purposes according to the defined goals of the use cases. On the other hand, ethics management in the project will ensure that the stored or analysed data will not be subject to any commercial use and will not be shared publicly without the consent of the data owner.
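One of the PETs mentioned above, pseudonymisation, can be sketched with keyed hashing: HMAC-SHA256 under a secret key is one standard realisation. The key value and the example PII field are illustrative; the point is that equal inputs map to equal tokens (so records can still be correlated for fake-news detection) while the original value cannot be recovered without the key.

```python
import hashlib
import hmac

def pseudonymise(pii_value, secret_key):
    """Replace a PII value with a stable, irreversible pseudonym using
    HMAC-SHA256. The same input always yields the same 64-hex-char token."""
    return hmac.new(secret_key, pii_value.encode("utf-8"), hashlib.sha256).hexdigest()
```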
C. Web and social media data crawling
Within the project, work in the area of web and social media data crawling uses the relevant software tools and services. For the search engine, the hardware and software architecture hosted in Data Centers located in Paris could be employed. This architecture processes over 70 million queries every day, crawls 0.5 billion Web pages and maintains an index of 16 billion Web pages. The architecture is scalable and can grow rapidly.
D. Verification in social media through deep learning
With the increasing popularity of the various social media
platforms, detecting and dealing with misinformation and
their creators becomes a critical problem. Malicious users
can be tracked down and their online linguistic expressions
can be used to infer their personal attributes and diverse
geographical, sociological and political demographics. Deep learning models can be used to construct behavioural representations for user personality identification, and private traits/attributes can be predicted from digital records of human behaviour. Moreover, social network connections and flow observations can be examined to detect misinformation, and false content can be found using geosemantic information. Similarly, the problem of detecting bots has gained tremendous attention, since bots can be used to automate social media accounts and post fake content, hampering data credibility. For example, bots can be used to sway political elections by distorting the online discourse, to manipulate the stock market, or to push anti-vaccine conspiracy theories that might cause health epidemics. There are techniques to detect bots at the account level, by processing large amounts of social media posts and leveraging information from the network graph structure, temporal dynamics, sentiment analysis, etc. A repository of fake news with in-depth analysis of this kind of data can be found in [18]. Deep neural networks based on the long short-term memory (LSTM) architecture can be used to exploit both content and metadata to detect bots at the tweet level.
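The account-level signals listed above (posting rate, network structure, profile metadata) can be sketched as a simple heuristic score. The thresholds and weights below are illustrative assumptions, not the project's trained model; production systems learn such decision boundaries from labelled data, e.g. with an LSTM over tweet content and metadata.

```python
def bot_indicators(tweets_per_day, followers, following, default_avatar):
    """Combine a few account-level bot heuristics into a score in [0, 1]."""
    score = 0.0
    if tweets_per_day > 50:                 # inhuman posting rate
        score += 0.4
    if following > 10 * max(followers, 1):  # heavily skewed follow ratio
        score += 0.3
    if default_avatar:                      # no profile customisation
        score += 0.3
    return score
```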
E. Lifelong Learning Intelligent Systems
One of the current challenges in machine learning is to develop intelligent systems that are able to learn consecutive tasks and to transfer knowledge from previously learnt tasks when learning new ones. Such a capability is termed Lifelong Learning [12] and tries to mimic "human learning". In principle, humans are capable of accumulating knowledge from the past and using it to solve future problems. Currently, existing classical machine learning algorithms are not able to achieve that. Our ambition is to adapt Lifelong Machine Learning techniques in order to gradually improve detection models and incrementally update the knowledge, so that the system can learn faster (reusing historical knowledge), e.g. by transferring knowledge from different news topics and tasks [9]. In fact, there is an existing line of research [6] showing promising results (also in the area of text mining). Moreover, there are also competitions (such as DARPA L2M - Lifelong Learning Machines) aiming at fostering research in these directions. Therefore, the project's ambition is to use the Lifelong Learning paradigm to address challenges in fake news detection. In particular, hybrid classifier systems (HCS) and ensemble learning will be considered, as these have already been successfully applied to solve complex machine learning problems in various other domains. In fact, the multi-classifier paradigm is akin to some of the algorithms proposed for Lifelong Learning that build and maintain a kind of reservoir of latent models/modules that may be useful or reused in the future.
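The "reservoir of latent models" idea can be sketched as follows. This is a minimal illustration under stated assumptions: the task-similarity function is supplied by the caller, and plain strings stand in for trained models.

```python
class ModelReservoir:
    """Lifelong-learning sketch: keep previously trained detectors and
    reuse the best-matching one as the starting point for a new task
    (e.g. a new news topic)."""

    def __init__(self):
        self.models = {}  # task descriptor -> trained model object

    def add(self, task, model):
        self.models[task] = model

    def transfer_from(self, new_task, similarity):
        """Return the stored model whose task is most similar to `new_task`,
        or None when the reservoir is empty."""
        if not self.models:
            return None
        best = max(self.models, key=lambda task: similarity(task, new_task))
        return self.models[best]
```

Training the returned model further on the new task is the transfer step; the reservoir itself is what lets the system "learn faster by reusing historical knowledge".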
5 PROTOTYPE
In [7], a prototype of a solution geared towards detecting forged images is presented. The proposed method revolves around image assessment, with the supposition that if the image is forged, the whole piece might be fake. Three factors are taken into account: ELA analysis, copycat search and metadata analysis. A publicly available dataset of 800 original and over 900 forged images was used. The detection of pasted elements is performed by dividing the image into overlapping blocks and running the SURF and FLANN algorithms. Metadata analysis allows spotting whether the image was modified with any image processing tools. A fusion of decisions allows our method to spot image forgeries with 64% accuracy.
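The decision-fusion step can be sketched as a vote over the three detector outputs. Majority voting is an illustrative assumption here; the cited prototype fuses decisions but does not mandate this particular rule.

```python
def fuse_decisions(ela_flag, copymove_flag, metadata_flag):
    """Report an image as forged when at least two of the three detectors
    (ELA, copy-move/copycat search, metadata analysis) raise a flag."""
    return sum([ela_flag, copymove_flag, metadata_flag]) >= 2
```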
6 CONCLUSION
The "fake news" phenomenon is a serious issue in modern media and communication, with false information about current events and incidents spreading through society. The classification of fake news is challenging due to its vague definition and tensions related to freedom of speech.
Therefore, the EU H2020 SocialTruth project tries to tackle this important problem facing modern society. In this paper, we have presented the project approach, the design principles and the techniques used. Our idea is that the selected hands-on trials will cover a wide range of requirements and operational conditions, allowing for a multi-faceted system evaluation from the perspective of the various end-users enumerated in the paper.
ACKNOWLEDGMENTS
This work is funded under the SocialTruth project, which has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 825477.
REFERENCES
[1] [n. d.]. FakeBox project homepage. https://machinebox.io/docs/fakebox. Accessed 24-Mar-2019.
[2] [n. d.]. FightHoax project website. http://fighthoax.com. Accessed 24-Mar-2019.
[3] [n. d.]. Truly Media tackles fake news ahead of German elections. https://www.truly.media/truly-media-tackles-fake-news-ahead-of-german-elections/. Accessed 24-Mar-2019.
[4] [n. d.]. YouTube Data Viewer. http://www.amnestyusa.org/sites/default/customscripts/citizenevidence/. Accessed 24-Mar-2019.
[5] Ann Cavoukian. 2012. Operationalizing Privacy by Design: A Guide to Implementing Strong Privacy Practices. http://www.ontla.on.ca/library/repository/mon/26012/320221.pdf.
[6] Zhiyuan Chen, Nianzu Ma, and Bing Liu. 2018. Lifelong Learning for Sentiment Classification. (01 2018).
[7] Michał Choraś, Agata Giełczyk, Konstantinos Demestichas, Damian Puchalski, and Rafał Kozik. 2018. Pattern Recognition Solutions for Fake News Detection. In Computer Information Systems and Industrial Management - 17th International Conference, CISIM 2018, Olomouc, Czech Republic, September 27-29, 2018, Proceedings. 130–139. https://doi.org/10.1007/978-3-319-99954-8_12
[8] Michal Choras, Agata Gielczyk, Konstantinos P. Demestichas, Damian Puchalski, and Rafal Kozik. 2018. Pattern Recognition Solutions for Fake News Detection. In Computer Information Systems and Industrial Management - 17th International Conference, CISIM 2018, Olomouc, Czech Republic, September 27-29, 2018, Proceedings. 130–139. https://doi.org/10.1007/978-3-319-99954-8_12
[9] Michał Choraś, Rafał Kozik, Rafal Renk, and Witold Hołubowicz. 2017. The Concept of Applying Lifelong Learning Paradigm to Cybersecurity. 663–671. https://doi.org/10.1007/978-3-319-63315-2_58
[10] Patti Domm. [n. d.]. False Rumor of Explosion at White House Causes Stocks to Briefly Plunge; AP Confirms Its Twitter Feed Was Hacked. https://www.cnbc.com/id/100646197.
[11] Chloe Farand. [n. d.]. French social media awash with fake news stories from sources 'exposed to Russian influence' ahead of presidential election. https://tinyurl.com/y5gdzvz4.
[12] Patxi Galán-García, José Gaviria de la Puerta, Carlos Laorden, Igor Santos, and Pablo García Bringas. 2013. Supervised machine learning for the detection of troll profiles in Twitter social network: application to a real case of cyberbullying. Logic Journal of the IGPL 24 (2013), 42–53.
[13] Mykhailo Granik and Volodymyr Mesyura. 2017. Fake news detection using naive Bayes classifier. 2017 IEEE First Ukraine Conference on Electrical and Computer Engineering (UKRCON) (2017), 900–903.
[14] Mathew Ingram. [n. d.]. Google's Fake News Problem Could Be Worse Than on Facebook. http://fortune.com/2017/03/06/google-facebook-fake-news/.
[15] Jiawei Zhang, Limeng Cui, Yanjie Fu, and Fisher B. Gouza. 2018. Fake News Detection with Deep Diffusive Network Model.
[16] Feng Qian, Chengyue Gong, Karishma Sharma, and Yan Liu. 2018. Neural User Response Generator: Fake News Detection with Collective User Intelligence. https://doi.org/10.24963/ijcai.2018/533
[17] Kevin Rawlinson. [n. d.]. How newsroom pressure is letting fake stories on to the web. https://www.theguardian.com/media/2016/apr/17/fake-news-stories-clicks-fact-checking.
[18] Kai Shu, Deepak Mahudeswaran, Suhang Wang, Dongwon Lee, and Huan Liu. 2018. FakeNewsNet: A Data Repository with News Content, Social Context and Dynamic Information for Studying Fake News on Social Media.
[19] Olivia Solon. [n. d.]. Tim Berners-Lee: we must regulate tech firms to prevent 'weaponised' web.
[20] Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online. Science 359, 6380 (2018), 1146–1151. https://doi.org/10.1126/science.aap9559