A Benchmark Comparison of State-of-the-Practice
Sentiment Analysis Methods
Pollyanna Gonçalves, Federal University of Minas Gerais
Matheus Araújo, Federal University of Minas Gerais
Filipe Ribeiro, Federal University of Minas Gerais and Federal University of Ouro Preto
Fabrício Benevenuto, Federal University of Minas Gerais
Marcos Gonçalves, Federal University of Minas Gerais
In the last few years, thousands of scientific papers have explored sentiment analysis, several startups that measure opinions on real data have emerged, and a number of innovative products related to this theme have been developed. There are multiple methods for measuring sentiments, including lexical-based approaches and supervised machine learning methods. Despite the vast interest in the theme and the wide popularity of some methods, it is unclear which method is better for identifying the polarity (i.e., positive or negative) of a message. Thus, there is a strong need to conduct a thorough apples-to-apples comparison of sentiment analysis methods, as they are used in practice, across multiple datasets originating from different data sources. Such a comparison is key for understanding the potential limitations, advantages, and disadvantages of popular methods. This study aims at filling this gap by presenting a benchmark comparison of twenty-one popular sentiment analysis methods (which we call the state-of-the-practice methods). Our evaluation is based on a benchmark of twenty labeled datasets, covering messages posted on social networks, movie and product reviews, as well as opinions and comments in news articles. Our results highlight the extent to which the prediction performance of these methods varies widely across datasets. Aiming at boosting the development of this research area, we release the methods' codes and the datasets used in this paper, and we deploy a benchmark system, which provides an open API for accessing and comparing sentence-level sentiment analysis methods.
CCS Concepts: • Information systems → Sentiment analysis; • Networks → Social media networks; Online social networks;
Additional Key Words and Phrases: Sentiment analysis, social media, online social networks, sentence-level
1. INTRODUCTION
Sentiment analysis has become an extremely popular tool, applied in several analytical domains,
especially on the Web and social media. To illustrate the growth of interest in the field, Figure 1 shows the steady increase in the number of searches on the topic, according to Google Trends1, mainly after the popularization of online social networks (OSNs). More than 7,000 articles have
been written about sentiment analysis and various startups are developing tools and strategies to
extract sentiments from text [Feldman 2013].
The number of possible applications of such a technique is also considerable. Many of them are
focused on monitoring the reputation or opinion of a company or a brand with the analysis of re-
views of consumer products or services [Hu and Liu 2004]. Sentiment analysis can also provide
analytical perspectives for financial investors who want to discover and respond to market opin-
ions [Oliveira et al. 2013; Bollen et al. 2010]. Another important set of applications is in politics,
where marketing campaigns are interested in tracking sentiments expressed by voters associated
with candidates [Tumasjan et al. 2010].
Due to the enormous interest and applicability, there has been a corresponding increase in the
number of proposed sentiment analysis methods in the last years. The proposed methods rely on
many different techniques from different computer science fields. Some of them employ machine
learning methods that often rely on supervised classification approaches, requiring labeled data to
train classifiers [Pang et al. 2002]. Others are lexical-based methods that make use of predefined
lists of words, in which each word is associated with a specific sentiment. The lexical methods
vary according to the context in which they were created. For instance, LIWC [Tausczik and Pen-
nebaker 2010] was originally proposed to analyze sentiment patterns in formally written English
1 https://www.google.com/trends/explore#q=sentiment%20analysis
arXiv:1512.01818v1 [cs.CL] 6 Dec 2015
Fig. 1. Searches on Google for the Query: “Sentiment Analysis”
texts, whereas PANAS-t [Gonc¸alves et al. 2013b] and POMS-ex [Bollen et al. 2009] were proposed
as psychometric scales adapted to the Web context.
Overall, the above techniques are accepted by the research community and it is common to see concurrent important papers, sometimes published in the same computer science conference, using completely different methods. For example, the famous Facebook experiment [Kramer et al. 2014], which manipulated users' feeds to study emotional contagion, used LIWC [Tausczik and Pennebaker
2010]. Concurrently, Reis et al. used SentiStrength [Thelwall 2013] to measure the negativity or positivity of online news headlines [Reis et al. 2014; Reis et al. 2015], whereas Tamersoy et al. [Tamersoy et al. 2015] explored Vader's lexicon [Hutto and Gilbert 2014] to study patterns of smoking and
drinking abstinence in social media.
As the state-of-the-art has not been clearly established, researchers tend to accept any popular
method as a valid methodology to measure sentiments. However, little is known about the relative
performance of the several existing sentiment analysis methods. In fact, most of the newly pro-
posed methods are rarely compared with all other pre-existing ones using a large number of existing
datasets. This is a very unusual situation from a scientific perspective, in which benchmark compar-
isons are the rule. In fact, most applications and experiments reported in the literature make use of
previously developed methods exactly how they were released with no changes and adaptations and
with none or almost none parameter setting. In other words, the methods have been used as a black-
box, without a deeper investigation on their the suitability to a particular context or application.
To sum up, existing methods have been widely deployed for developing applications without a
deeper understanding regarding either their applicability in different contexts or their advantages,
disadvantages, and limitations in comparison with one another. Thus, there is a strong need to conduct a thorough apples-to-apples comparison of sentiment analysis methods, as they are used in practice, across multiple datasets originating from different data sources.
This state-of-the-practice situation is what we propose to investigate in this article. We do this
by providing a thorough benchmark comparison of twenty-one state-of-the-practice methods using twenty labeled datasets. In particular, given the recent popularity of online social networks and of short texts on the Web, many methods are focused on detecting sentiments at the sentence level, usually used to measure the sentiment of small sets of sentences in which the topic is known a priori. We focus on this context; thus, our datasets cover messages posted on social networks, movie and product reviews, and opinions and comments in news articles, TED talks, and blogs.
We survey an extensive literature on sentiment analysis to identify existing sentence-level methods
covering several different techniques. We contacted authors asking for their codes when available
or we implemented existing methods when they were unavailable but could be reproduced based on
their descriptions in a published paper.
Our experimental results unveil a number of important findings. First, we show that there is no
single method that always achieves the best prediction performance for all different datasets, a re-
sult consistent with the “there is no free lunch theorem” [Wolpert and Macready 1997]. We also
show that existing methods vary widely regarding their agreement, even across similar datasets (e.g.
random tweets). This suggests that the same content could be interpreted very differently depending
on the choice of a sentiment method. We noted that most methods are more accurate in correctly
classifying positive than negative text, suggesting that current existing approaches tend to be biased
in their analysis towards positivity. Finally, we quantify the relative prediction performance of ex-
isting efforts in the field across different types of datasets, identifying those with higher prediction
performance across different datasets.
Based on these observations, our final contribution consists of releasing our gold standard dataset and the codes of the compared methods2. We also created a Web system through which we allow other researchers to easily use our data and codes to compare results with the existing methods. More importantly, by using our system one can easily test which method would be the most suitable for a particular dataset and/or application. We hope that our tool will not only help researchers and practitioners access and compare a wide range of sentiment analysis techniques, but also help advance the development of this research field as a whole.
The remainder of this paper is organized as follows. In Section 2, we briefly describe related
efforts. Then, in Section 3 we describe the sentiment analysis methods we compare. Section 4
presents the gold standard data used for comparison. Section 5 summarizes our results and findings.
Finally, Section 6 concludes the article and discusses directions for future work.
2. BACKGROUND AND RELATED WORK
Next we discuss important definitions and justify the focus of our benchmark comparison. We also
briefly survey existing related efforts that compare sentiment analysis methods.
2.1. Focus on Sentiment Level
Since sentiment analysis can be applied to different tasks, we restrict our focus to comparing those efforts related to detecting the polarity (i.e., positivity or negativity) of a given short text (i.e., sentence-level). Polarity detection is a common function across all sentiment methods considered in our work, providing valuable information to a number of different applications, especially those that explore short messages that are commonly available in social media [Feldman 2013].
Sentence-level sentiment analysis can be performed with supervision (i.e. requiring labeled train-
ing data) or not. An advantage of supervised methods is their ability to adapt and create trained models for specific purposes and contexts. A drawback is the need for labeled data, which might
be highly costly and even prohibitive for some tasks. On the other hand, the lexical-based meth-
ods make use of a pre-defined list of words, where each word is associated with a specific senti-
ment. The lexical methods vary according to the context in which they were created. For instance,
LIWC [Tausczik and Pennebaker 2010] was originally proposed to analyze sentiment patterns in
English texts, whereas PANAS-t [Gonc¸alves et al. 2013b] and POMS-ex [Bollen et al. 2009] are
psychometric scales adapted to the Web context. Although lexical-based methods do not rely on
labeled data, it is hard to create a unique lexical-based dictionary to be used for different contexts.
We focus our effort on evaluating unsupervised methods, as they can be easily deployed in Web services and applications without the need for human labeling or any other type of manual inter-
vention. As described in Section 3, some of the methods we consider have used machine learning
to build lexicon dictionaries or even to build models and tune specific parameters. We incorporate
those methods in our study, since they have been released as black-box tools that can be used in an
unsupervised manner.
2.2. Existing Efforts on Methods’ Comparison
Despite the large number of existing methods, only a limited number of studies have performed a comparison
among sentiment analysis methods, usually with limited datasets. Overall, lexical methods and ma-
2Except for one paid method
chine learning approaches have been evolving in parallel in the last years, and it comes as no surprise
that studies have started to compare their performance on specific datasets and use one or another
strategy as baseline for comparison. A recent survey summarizes several of these efforts [Tsytsarau
and Palpanas 2012] and concludes that a systematic comparative study that implements and eval-
uates all relevant algorithms under the same framework is still missing in the literature. As new
methods emerge and compare themselves only against one or another method, using different evaluation datasets and testing methodologies, it is hard to conclude whether a single method triumphs over the others, even in specific scenarios. To the best of our knowledge, our effort is the first of its kind to create a benchmark and provide such a comparison.
Another noteworthy effort is an annual workshop, the International Workshop on Semantic Evaluation (SemEval). It consists of a series of exercises grouped into tracks, including sentiment analysis and text similarity, among others, that bring together competing systems. Some new methods, such as Umigon [Levallois 2013], have been proposed after obtaining good results on part of these tracks. Although SemEval has been playing an important role in identifying the current important methods, it requires authors of the methods to register for the challenge, and many popular methods have not been evaluated in these exercises. Additionally, SemEval labeled datasets are usually focused on one specific type of data, such as tweets, and do not represent a wide range of social media data. In our evaluation effort, we consider one dataset from SemEval 2013 and two methods that participated in the competition in that same year.
Finally, in a previous effort [Gonc¸alves et al. 2013a], we compared eight sentence-level sentiment
analysis methods, based on one public dataset used to evaluate the method SentiStrength [Thelwall 2013]. The present effort largely extends our previous work by comparing many more methods across many different datasets, providing a much deeper benchmark evaluation of currently popular sentiment analysis methods. The methods used in this paper were also incorporated as part of an existing system, namely iFeel [Araujo et al. 2014].
3. SENTIMENT ANALYSIS METHODS
This section provides a brief description of the twenty-one sentence-level sentiment analysis meth-
ods investigated in this paper.
Our effort to identify important sentence-level sentiment analysis methods consisted of systematically searching for them in the main conferences in the field and then checking the papers that cited them as well as their own references. Some of the methods are available for download on the Web; others were kindly shared by their authors upon request; and a small part of them were implemented by us based on their descriptions in the original papers. This usually happened when authors shared only the lexical dictionaries they created, leaving the implementation of the method that uses the lexical resource to us.
Table I and Table II present an overview of these methods, providing a description of each method
as well as the techniques they employ (L for Lexicon Dictionary and ML for Machine Learning),
their outputs (e.g., -1, 0, 1, meaning negative, neutral, and positive, respectively), the datasets used to validate them, the baseline methods used for comparison, and finally lexicon details. The methods are organized in chronological order to allow a better overview of the existing efforts over the years. We can note that the methods generate different output formats. We colored in blue the positive outputs, in black the neutral ones, and in red those that are negative.
Since we are comparing sentiment analysis methods on a sentence-level basis, we need to work
with mechanisms that are able to receive sentences as input and give polarities as output. Some of
the approaches considered in this paper, shown in Table II, are complex dictionaries built with great
effort. However, a lexicon alone has no natural ability to infer polarity in sentence-level tasks. The purpose of a lexicon goes beyond detecting the polarity of a sentence [Feldman 2013; Liu 2012], but it can also be used to that end [Godbole et al. 2007; Kouloumpis et al. 2011].
Several existing sentence-level sentiment analysis methods, like Vader [Hutto and Gilbert 2014] and SO-CAL [Taboada et al. 2011], combine a lexicon with the processing of sentence char-
Table I. Overview of the sentence-level methods available in the literature.
Name Description L ML
Emoticons [Gonc¸alves et al. 2013a] Messages containing positive/negative emoticons are positive/negative. Messages
without emoticons are not classified. X
Opinion Lexicon [Hu and Liu
2004]
Focus on Product Reviews. Builds a Lexicon to predict polarity of product features
phrases that are summarized to provide an overall score to that product feature. X
Opinion Finder (MPQA) [Wilson
et al. 2005a] [Wilson et al. 2005b]
Performs subjectivity analysis through a framework with a lexical analysis step followed by a machine learning step. X X
Happiness Index [Dodds and
Danforth 2009]
Quantifies happiness levels for large-scale texts as lyrics and blogs. It uses ANEW
words [Bradley and Lang 1999] to rank the documents. X
SentiWordNet [Esuli and
Sebastiani 2006] [Baccianella et al.
2010]
Construction of a lexical resource for Opinion Mining based on WordNet [Miller
1995]. The authors grouped adjectives, nouns, etc in synonym sets (synsets) and
associated three polarity scores (positive, negative and neutral) for each one.
X X
LIWC [Tausczik and Pennebaker
2010]
Text analysis paid tool to evaluate emotional, cognitive, and structural components of
a given text. It uses a dictionary with words classified into categories (anxiety, health,
leisure, etc).
X
SenticNet [Cambria et al. 2010]
Uses dimensionality reduction to infer the polarity of common sense concepts and
hence provide a public resource for mining opinions from natural language text at a
semantic, rather than just syntactic level.
X
AFINN [Nielsen 2011] - A new ANEW Builds a Twitter-based sentiment lexicon including Internet slang and obscene words. X
SO-CAL [Taboada et al. 2011]
Creates a new Lexicon with unigrams (verbs, adverbs, nouns and adjectives) and
multi-grams (phrasal verbs and intensifiers) hand ranked with scale +5 (strongly
positive) to -5 (strongly negative). Authors also included part of speech processing,
negation and intensifiers.
X
Emoticons DS (Distant
Supervision)[Hannak et al. 2012]
Creates a scored lexicon based on a large dataset of tweets. It is based on the frequency with which each lexicon entry occurs with positive or negative emoticons. X
NRC Hashtag [Mohammad 2012]
Builds a lexicon dictionary using a distant supervision approach. In a nutshell, it uses a known hashtag to “classify” the tweet (e.g., #joy, #sadness, etc.). Afterwards, it verifies the occurrence of each specific n-gram in that emotion. Then, the score of an n-gram occurring with an emotion is calculated.
X
Pattern.en [De Smedt and
Daelemans 2012]
Python programming package (toolkit) for NLP, Web mining, and sentiment analysis. Sentiment analysis is provided by averaging the scores of the adjectives in the sentence according to a bundled lexicon of adjectives.
X
SASA [Wang et al. 2012]
Detects public sentiments on Twitter during the 2012 U.S. presidential election. It is
based on the statistical model obtained from a Naïve Bayes classifier on unigram
features. It also explores emoticons and exclamations.
X
PANAS-t [Gonc¸ alves et al. 2013b]
Detects mood fluctuations of users on Twitter. The method consists of an adapted version of the Positive Affect Negative Affect Scale (PANAS) [Watson and Clark 1985], a well-known method in psychology, with a large set of words associated with eleven moods (surprise, fear, etc.).
X
EmoLex [Mohammad and Turney
2013]
Builds a general sentiment Lexicon crowdsourcing supported. Each entry lists the
association of a token with 8 basic sentiments: joy, sadness, anger, etc defined
by [Plutchik 1980]. Proposed Lexicon includes unigrams and bigrams from
Macquarie Thesaurus and also words from GI and Wordnet.
X
SANN [Pappas and Popescu-Belis
2013]
Infers additional user ratings of reviews by performing sentiment analysis (SA) of user comments and integrating its output into a nearest neighbor (NN) model that provides
multimedia recommendations over TED Talks.
X X
Sentiment140 Lexicon
[Mohammad et al. 2013]
Creation of a lexicon dictionary in a similar way to [Mohammad 2012] and a SVM
Classifier with features like: number and categories of emoticons, sum of the
sentiment scores for all tokens (calculated with lexicons), etc.
X
SentiStrength [Thelwall 2013] Builds a lexicon dictionary annotated by humans and improved with the use of
Machine Learning. X X
Stanford Recursive Deep Model
[Socher et al. 2013]
Proposes a model called Recursive Neural Tensor Network (RNTN) that processes all
sentences dealing with their structures and compute the interactions between them.
This approach is interesting since RNTN take into account the order of words in a
sentence, which is ignored in most of methods.
X X
Umigon [Levallois 2013] Disambiguates tweets using lexicon with heuristics to detect negations plus elongated
words and hashtags evaluation. X
Vader [Hutto and Gilbert 2014]
It is a human-validated sentiment analysis method developed for twitter and social
media contexts. Vader was created from a generalizable, valence-based,
human-curated gold standard sentiment lexicon.
X
Semantria [Lexalytics 2015]
It is a paid tool that employs multi-level analysis of sentences. Basically, it has four levels: part-of-speech tagging, assignment of prior scores from dictionaries, application of intensifiers, and finally machine learning techniques to deliver a final weight to the sentence.
X X
Table II. Overview of the sentence-level methods available in the literature - 2.
Name Output Validation Compared To Lexicon C
Emoticons -1,1- - 79
Opinion Lexicon -1,0,1Product Reviews from
Amazon and CNet - 6787 X
Opinion Finder
(MPQA)
Negative,
Neutral,
Positive
MPQA [Wiebe et al.
2005]
Compared to itself in different
versions. 20611 X
Happiness Index 1,2,3,4,5,6,7,
8,9
Lyrics, Blogs,STU
Messages 3, British
National Corpus 4,
- 1034 X
SentiWordNet -1,0,1-General Inquirer (GI)[Stone et al.
1966] 117658
LIWC negEmo,
posEmo - - ?X
SenticNet Negative,
Positive
Patient Opinions
(Unavailable) SentiStrength [Thelwall 2013] 15000
AFINN -1,0,1 Twitter [Biever 2010]
OpinonFinder [Wilson et al.
2005a], ANEW [Bradley and Lang
1999], GI [Stone et al. 1966] and
Sentistrength [Thelwall 2013]
2477 X
SO-CAL (<0), 0, (>0)
Epinion [Taboada et al.
2006a], MPQA[Wiebe
et al. 2005],
Myspace[Thelwall
2013],
MPQA[Wiebe et al. 2005],
GI[Stone et al. 1966],
SentiWordNet [Esuli and
Sebastiani 2006],”Maryland” Dict.
[Mohammad et al. 2009], Google
Generated Dict. [Taboada et al.
2006b]
9928
Emoticons DS
(Distant
Supervision)
-1,0,1
Validation with
unlabeled twitter data
[Cha et al. 2010]
- 1162894 X
NRC Hashtag -1,0,1
Twitter (SemEval-2007
Affective Text Corpus)
[Strapparava and
Mihalcea 2007]
- 679468 X
Pattern.en <0.1,0.1] Product reviews, but the
source was not specified - 2973
SASA [Wang et al.
2012]
Negative,
Neutral,
Unsure, Positive
“Political” Tweets
labeled by “turkers”
(AMT) (unavailable)
- 21012
PANAS-t -1,0,1
Validation with
unlabeled twitter data
[Cha et al. 2010]
-50 X
EmoLex -1,0,1-
Compared with existing gold
standard data but it was not
specified
141820 X
SANN neg, neu, pos Their own dataset - Ted
Talks
Comparison with other multimedia
recommendation approaches. 8701
Sentiment140
Negative,
Neutral,
Positive
Twitter and SMS from
Semeval 2013, task 2
[Nakov et al. 2013].
Other Semeval 2013, task 2
approaches 1220176 X
SentiStrength -1,0,1
Their own datasets -
Twitter, Youtube, Digg,
Myspace, BBC Forums
and Runners World.
The best of nine Machine Learning
techniques for each test. 2698
Stanford Recursive
Deep Model
very negative,
negative,
neutral,
positive,very
positive
Movie Reviews [Pang
and Lee 2004]
Naïve Bayes and SVM with bag of
words features and bag of bigram
features.
227009
Umigon
Negative,
Neutral,
Positive
Twitter and SMS from
Semeval 2013, task 2
[Nakov et al. 2013].
[Mohammad et al. 2013] 1053
Vader -1,0,1
Their own datasets -
Twitter, Movie
Reviews, Technical
Product Reviews, NYT
User’s Opinions.
(GI)[Stone et al. 1966], LIWC,
[Tausczik and Pennebaker 2010],
SentiWordNet [Esuli and
Sebastiani 2006], ANEW [Bradley
and Lang 1999], SenticNet
[Cambria et al. 2010] and some
Machine Learning Approaches.
7517
Semantria negative,
neutral, positive not available not available not available
acteristics to determine a sentence's polarity. These approaches make use of a series of intensifiers, punctuation transformations, emoticons, and many other heuristics.
Thus, to evaluate each lexicon dictionary as the basis for a sentence-level sentiment analysis method, we considered Vader's implementation. In other words, we used Vader's code to determine whether a sentence is positive or not, considering different lexicons as dictionaries.
Vader's heuristics were proposed by means of qualitative analyses of textual properties and characteristics that affect the perceived sentiment intensity of the text. Vader's authors identified five heuristics based on grammatical and syntactical cues that convey changes to sentiment intensity and go beyond the bag-of-words model. The heuristics include treatments for: 1) punctuation (e.g., the number of '!'s); 2) capitalization (e.g., "I HATE YOU" is more intense than "i hate you"); 3) degree modifiers (e.g., "The service here is extremely good" is more intense than "The service here is good"); 4) the contrastive conjunction "but", which shifts the polarity; 5) tri-gram examination to identify negation (e.g., "The food here isn't really all that great."). We chose Vader because it is the newest method among those we considered and it is becoming widely used, having even been implemented as part of the well-known NLTK Python library5.
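For illustration, the following is a minimal sketch (not the code used in our benchmark) of applying Vader's heuristics to a single sentence through the NLTK implementation mentioned above; the 0.05 thresholds used to map the compound score to a polarity label are an assumption of this example, not part of our setup.

```python
# Minimal sketch (assumes nltk is installed and the 'vader_lexicon' resource
# has been downloaded via nltk.download('vader_lexicon')).
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("The food here isn't really all that great!")

# Map the continuous compound score to a polarity label; the 0.05 cut-offs
# below are an assumption of this sketch.
if scores["compound"] >= 0.05:
    label = "positive"
elif scores["compound"] <= -0.05:
    label = "negative"
else:
    label = "neutral"
print(label, scores)
```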
We applied such heuristics with the following lexicons: Emolex, EmoticonsDS, NRC Hashtag, Opinion Lexicon, Panas, Sentiment 140, SentiWordNet. We noticed that this strategy drastically improved the results of the lexicons for sentence-level sentiment analysis in comparison with a simple baseline approach that averages the occurrence of positive and negative words to classify the polarity of a sentence (sketched below). Table II also has a column Lexicon that describes the number of terms the proposed dictionary contains, and column C (changed) indicates the methods we slightly modified to adapt their output formats to the polarity detection task.
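The sketch below illustrates, under assumed toy lexicons, the simple word-counting baseline mentioned above; the word lists are placeholders and not any method's actual dictionary.

```python
# Minimal sketch of a word-counting baseline: classify a sentence by comparing
# counts of positive and negative lexicon words. The tiny lexicons here are
# illustrative placeholders only.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "poor"}

def baseline_polarity(sentence: str) -> int:
    """Return 1 (positive), -1 (negative), or 0 (neutral/undefined)."""
    tokens = sentence.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos > neg:
        return 1
    if neg > pos:
        return -1
    return 0

print(baseline_polarity("The service here is extremely good"))  # -> 1
```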
Some other methods required similar adaptations. Methods that are based on machine learning,
like SASA and SentiStrength, are used here as unsupervised approaches as their trained models were
released by the authors and they have been used in other efforts as tools that require no training data.
We plan to release all the codes used in this article, except for paid software like LIWC and SentiStrength, as an attempt to allow reproducibility as well as possible corrections of our decisions.
There are a few other methods for sentiment detection proposed in the literature and not consid-
ered here. Most of them consist of variations of the techniques used by the above methods, such as
WordNet-Affect[Valitutti 2004] and ANEW [Bradley and Lang 1999] (the same used by Happiness
Index, SentiWordNet, SenticNet, etc.). Finally, there exist a few other methods which are not available on the Web or upon request and that could not be re-implemented based on their descriptions in the original papers (e.g., Profile of Mood States (POMS) [Bollen et al. 2009]).
From Table II we can also note that the validation strategies, the datasets used, and the baseline comparisons of these methods vary greatly, from toy examples to large labeled datasets. PANAS-t and Happiness Index validate their methods by presenting evaluations of
events in which some bias towards positivity and negativity would be expected. Panas-t is tested with
unlabeled twitter data related to Michael Jackson’s death and the release of a Harry Potter movie
whereas Happiness Index was used to measure the happiness of song lyrics from 1967 to 2007. Lexical dictionaries were validated in very different ways. AFINN [Nielsen 2011] compared its lexicon with other dictionaries. Emoticons DS (Distant Supervision) [Hannak et al. 2012] used the Pearson correlation between human labeling and the predicted value. SentiWordNet [Esuli and Sebastiani 2006]
validated the proposed dictionary through comparisons with other dictionaries, but also used human validation of the proposed lexicon. These efforts attempt to validate the created lexicon, without evaluating the lexicon as a sentiment analysis method itself. Vader [Hutto and Gilbert 2014] compared its results with lexical approaches on labeled datasets from different social media sources.
SenticNet [Cambria et al. 2010] was compared with SentiStrength [Thelwall 2013] with a specific
dataset related to patient opinions, which could not be made available. Stanford Recursive Deep
Model [Socher et al. 2013] and SentiStrength [Thelwall 2013] were both compared with standard
machine learning approaches, with their own datasets.
5 http://www.nltk.org/_modules/nltk/sentiment/vader.html
Table III. Labeled datasets.
Dataset  Nomenclature  # Msgs  # Pos  # Neg  # Neu  Average # of phrases  Average # of words  Annotators Expertise  # of Annotators  R (%)
Comments (BBC) Comments BBC 1,000 99 653 248 3,98 64,39 NonExpert 3 87,0
[Thelwall 2013]
Comments (Digg) Comments Digg 1,077 210 572 295 2,50 33,97 Non Expert 3 88,0
[Thelwall 2013]
Comments (NYT) Comments NYT 5,190 2,204 2,742 244 1,01 17,76 AMT 20 88,0
[Hutto and Gilbert 2014]
Comments (TED) Comments TED 839 318 409 112 1 16,95 Non Expert 6 82,0
[Pappas and Popescu-Belis 2013]
Comments (Youtube) Comments YTB 3,407 1,665 767 975 1,78 17,68 Non Expert 3 90,0
[Thelwall 2013]
Movie-reviews Reviews I 10,662 5,331 5,331 - 1,15 18,99 User - 66,0
[Pang and Lee 2004] Rating
Movie-reviews Reviews II 10,605 5,242 5,326 37 1,12 19,33 AMT 20 97,0
[Hutto and Gilbert 2014]
Myspace posts Myspace 1,041 702 132 207 2,22 21,12 NonExpert 3 91,0
[Thelwall 2013]
Product reviews Amazon 3,708 2,128 1,482 98 1,03 16,59 AMT 20 94,0
[Hutto and Gilbert 2014]
Tweets (Debate) Tweets DBT 3,238 730 1249 1259 1,86 14,86 AMT + Undef. 60
[Diakopoulos and Shamma 2010] Expert
Tweets (Irony) Irony 100 38 43 19 1,01 17,44 Expert 3 -
(Labeled by us)
Tweets (Sarcasm) Sarcasm 100 38 38 24 1 15,55 Expert 3 -
(Labeled by us)
Tweets (Random) Tweets RND I 4,242 1,340 949 1953 1,77 15,81 Non Expert 3 88,0
[Thelwall 2013]
Tweets (Random) Tweets RND II 4,200 2,897 1,299 4 1,87 14,10 AMT 20 97,5
[Hutto and Gilbert 2014]
Tweets (Random) Tweets RND III 3,771 739 488 2,536 1,54 14,32 AMT 3 90,0
[Narr et al. 2012]
Tweets (Random) Tweets RND IV 500 139 119 222 1,90 15,44 Expert Undef. 90,0
[Aisopos 2014]
Tweets (Specific domains w/ emot.) Tweets STF 359 182 177 - 1,0 15,1 NonExpert Undef. 97,0
[Go et al. 2009]
Tweets (Specific topics) Tweets SAN 3737 580 654 2503 1,60 15,03 Expert 1 97,0
[Sanders 2011]
Tweets (Semeval2013 Task2) Tweets Semeval 6,087 2,223 837 3027 1,86 20,05 AMT 5 100,0
[Nakov et al. 2013]
Runners World forum RW 1,046 484 221 341 4,79 66,12 Non Expert 3 86,0
[Thelwall 2013]
This scenario, where every newly developed solution compares itself with different solutions using different datasets, happens because there is no standard benchmark for evaluating new methods. This problem is exacerbated because many methods have been proposed in different research communities (e.g., NLP, Information Science, Information Retrieval, Machine Learning), exploiting different techniques, with little knowledge of related efforts in other communities. Next, we describe how
we created a large gold standard to properly compare all the considered sentiment analysis methods.
4. GOLD STANDARD DATA
A key aspect in evaluating sentiment analysis methods consists of using an accurate gold standard
(datasets). Several existing efforts have generated labeled data produced by expert or non-expert evaluators. Previous studies suggest that both approaches are valid, as non-expert labeling may be as effective as annotations produced by experts for affect recognition, a closely related task [Snow et al. 2008]. Thus, our effort to build a large and representative gold standard dataset consists of obtaining labeled data from trustworthy previous works that cover a wide range of sources and kinds of data. We
also attempt to assess the “quality” of our gold standard in terms of the accuracy of the labeling
process.
Table III summarizes the main characteristics of twenty of the exploited datasets, such as number
of messages and the average number of words per message in each dataset. It also defines a simpler
nomenclature that is used in the remainder of this paper. The table also presents the methodology
employed in the classification. Human labeling was implemented in almost all datasets, usually with the use of non-expert reviewers. The Reviews I dataset relies on five-star ratings, in which users rate and provide a comment about an entity of interest (e.g., a movie or an establishment).
Labeling based on Amazon Mechanical Turk (AMT) was used in seven out of the twenty datasets,
while volunteers and other strategies that involve non-expert evaluators were used in ten datasets.
Usually, an agreement strategy (i.e. majority voting) is applied to ensure that, in the end, each
sentence has an agreed-upon polarity assigned to it. The number of annotators used to build the
datasets is also shown in Table III.
Tweets DBT was the only dataset built with a combination of AMT labeling and expert validation. Its authors selected 200 random tweets to be classified by experts and compared them with the AMT results to ensure accurate ratings. We note that the Tweets Semeval dataset was provided as a list of Twitter IDs, due to Twitter's policies on data sharing. While crawling the respective tweets, a small part of them could not be accessed, as they had been deleted. In any case, we plan to release all gold standard datasets on a request basis, which is in agreement with Twitter's policies.
In order to assess the extent to which these datasets are trustworthy, we used a strategy similar to the one used by Tweets DBT. Our goal was not to redo all the human evaluation performed, but simply to inspect a small sample of it to infer the level of agreement with our own evaluation. We randomly selected 1% of all sentences to be evaluated by experts (two of the authors) as an attempt to assess whether these gold standard data are really trustworthy. It is important to mention that we do not have access to the instructions provided by the authors. We also could not get access to a small amount of the raw data in a few datasets, which was discarded. Finally, our manual inspection unveiled a few sentences in languages other than English in a few datasets, such as Tweets STA and TED, which were obviously discarded.
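A minimal sketch of this quality check, under assumed data structures (lists of (text, label) pairs), is shown below; the sampling fraction mirrors the 1% used in our inspection.

```python
# Minimal sketch of the gold standard quality check described above:
# draw a 1% random sample and measure agreement with an expert re-annotation.
import random

def sample_for_inspection(sentences, fraction=0.01, seed=42):
    """sentences: list of (text, gold_label) pairs; returns a random sample."""
    random.seed(seed)
    k = max(1, int(len(sentences) * fraction))
    return random.sample(sentences, k)

def agreement(gold_labels, expert_labels):
    """Fraction of sentences where the expert label matches the gold label."""
    matches = sum(g == e for g, e in zip(gold_labels, expert_labels))
    return matches / len(gold_labels)
```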
Column R from Table III exhibits the level of agreement for each dataset in our evaluation. After a closer look at the cases in which we disagreed with the evaluations in the gold standard, we understand that other interpretations could be given to the text, and we found cases of sentences with mixed polarity. Some of them are strongly linked to context and very hard to evaluate. Some NYT comments, for instance, are directly related to the news articles they were posted on. We can also note that some of the datasets
do not contain neutral messages. This might be a characteristic of the data or even a result of how
annotators were instructed to label their pieces of text. Most of the cases of disagreement involve
neutral messages. Thus, we considered these cases as well as the amount of disagreement we had
with the gold standard data as reasonable and expected.
5. COMPARISON RESULTS
Next, we present comparison results for the twenty one methods considered in this paper based on
the twenty considered gold standard datasets.
5.1. Experimental details
At least three distinct approaches have been proposed to deal with sentiment analysis of sentences. The first of them splits this task into two steps: (i) identifying sentences with no sentiment, also named objective or neutral sentences, and then (ii) detecting the polarity (positive or negative) only of the subjective sentences. Another common way to detect sentence polarity considers three distinct classes (positive, negative, and neutral) in a single task. Finally, some methods classify a sentence as positive or negative only, assuming that only polarized sentences are present, given the context of a given application. For example, product reviews are expected to contain only polarized opinions.
Aiming at providing a more thorough comparison among these distinct approaches, we perform
two rounds of tests. In the first, we consider the performance of the methods in identifying 3 classes (positive, negative, and neutral). The second considers only positive and negative as output and assumes
that a first step of removing the neutral messages was already performed. In the 3-classes experi-
ments we used only datasets containing a considerable number of neutral messages (which excludes
Tweets RND II, Amazon and Reviews II that contain an insignificant number of neutral sentences).
Despite being 2-classes methods, as highlighted in Table IV, we decided to include LIWC, Emoti-
cons and Senticnet in the 3-classes experiments to present a full set of comparative experiments.
LIWC, Emoticons and Senticnet cannot define, for some sentences, their positive or negative polarity, considering it as undefined. This occurs due to the absence in the sentence of emoticons (in the case of the Emoticons method) or of words belonging to the methods' sentiment lexicons. As a neutral (objective) sentence is one that expresses no sentiment at all about a topic, we assumed, in the case of these 2-class methods, undefined polarities to be equivalent to neutral ones.
The 2-classes experiments, in turn, were performed with all datasets described in Table III without
the neutral sentences. We also included all methods in these experiments, even those that produce
neutral outputs. As discussed before, when 2-class methods cannot detect the polarity (positive or negative) of a sentence, they usually assign it an undefined polarity. As we know that all sentences in the 2-class experiments are positive or negative, we created the coverage metric to determine the percentage of sentences a method can in fact classify as positive or negative. For instance, suppose that the Emoticons method can classify only 10% of the sentences in a dataset, corresponding to the actual percentage of sentences with emoticons. This means that the coverage of this method on this specific dataset is 10%. Note that coverage is quite an important metric for a more complete evaluation in the 2-class experiments: even though Emoticons presents high accuracy for the classified phrases, in this example it would not be able to make a prediction for 90% of the sentences. More formally, coverage is calculated as the total number of sentences minus the number of undefined sentences, divided by the total number of sentences, where the number of undefined sentences includes neutral outputs for 3-class methods.
Coverage = (#Sentences - #Undefined) / #Sentences
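A minimal sketch of this computation (assuming predictions encoded as 1, -1, and 0 for undefined or neutral outputs):

```python
# Minimal sketch of the coverage metric defined above: the fraction of
# sentences for which a method emits a defined (positive/negative) polarity.
def coverage(predictions):
    """predictions: list of labels in {1, -1, 0}, where 0 means undefined."""
    undefined = sum(1 for p in predictions if p == 0)
    return (len(predictions) - undefined) / len(predictions)

print(coverage([1, -1, 0, 0, 1, 0, -1, 0, 0, 0]))  # -> 0.4
```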
5.2. Comparison Metrics
Considering the 3-classes comparison experiments, we used the traditional Precision, Recall and F1
measures for the automated classification.
                        Predicted
                 Positive  Neutral  Negative
Actual Positive      a        b        c
       Neutral       d        e        f
       Negative      g        h        i
Each letter in the above table represents the number of instances that are actually in class X and predicted as class Y, where X, Y ∈ {positive, neutral, negative}. The recall (R) of a class X is the ratio of the number of elements correctly classified as X to the number of known elements in class X. The precision (P) of a class X is the ratio of the number of elements correctly classified as X to the total predicted as class X. For example, the precision of the negative class is computed as P(neg) = i / (c + f + i); its recall as R(neg) = i / (g + h + i); and the F1 measure is the harmonic mean of precision and recall. In this case, F1(neg) = 2 * P(neg) * R(neg) / (P(neg) + R(neg)).
We also compute the overall accuracy as A = (a + e + i) / (a + b + c + d + e + f + g + h + i). It considers equally important the correct classification of each sentence, independently of the class, and basically measures the capability of the method to predict the correct output. A variation of F1, namely macro-F1, is
normally reported to evaluate classification effectiveness on skewed datasets. Macro-F1 values are
computed by first calculating F1 values for each class in isolation, as exemplified above for negative,
and then averaging over all classes. Macro-F1 considers equally important the effectiveness in each
class, independently of the relative size of the class. Thus, accuracy and Macro-F1 provide com-
plementary assessments of the classification effectiveness. Macro-F1 is especially important when
the class distribution is very skewed, to verify the capability of the method to perform well in the
smaller classes.
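As an illustration, the following minimal sketch (with purely illustrative counts) computes these metrics from a confusion matrix whose rows are actual classes and columns are predicted classes; the same functions apply unchanged to the 2-class case by using a 2x2 matrix.

```python
# Minimal sketch of the metrics above, computed from a confusion matrix
# (rows = actual classes, columns = predicted classes, here ordered
# positive, neutral, negative). The counts are illustrative only.
def per_class_metrics(cm):
    n = len(cm)
    metrics = []
    for k in range(n):
        predicted_k = sum(cm[i][k] for i in range(n))   # column total
        actual_k = sum(cm[k])                            # row total
        p = cm[k][k] / predicted_k if predicted_k else 0.0
        r = cm[k][k] / actual_k if actual_k else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        metrics.append((p, r, f1))
    return metrics

def accuracy(cm):
    return sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

def macro_f1(cm):
    return sum(f1 for _, _, f1 in per_class_metrics(cm)) / len(cm)

cm = [[50, 10, 5],   # actual positive
      [8, 30, 7],    # actual neutral
      [4, 6, 40]]    # actual negative
print(accuracy(cm), macro_f1(cm))
```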
The described metrics can be easily computed for the 2-class experiments by simply removing the neutral column and row, as shown below. In this case, the precision of the positive class is computed as P(pos) = a / (a + c); its recall as R(pos) = a / (a + b); and its F1 as F1(pos) = 2 * P(pos) * R(pos) / (P(pos) + R(pos)).
As we have a large number of combinations among base methods, metrics, and datasets, a global analysis of the performance of all these combinations is not an easy task. We propose a simple but
                        Predicted
                 Positive  Negative
Actual Positive      a        b
       Negative      c        d
informative measure to assess the overall performance ranking. The Mean Ranking is basically the
sum of ranks obtained by a method in each dataset divided by the total number of datasets, as per
below:
MR = ( sum over i=1..nd of r_i ) / nd

where nd is the number of datasets and r_i is the rank of the method for dataset i. It is important to notice that the rank was calculated based on Macro-F1.
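A minimal sketch of this computation, with illustrative Macro-F1 values only:

```python
# Minimal sketch of the Mean Ranking described above: rank the methods within
# each dataset by Macro-F1 (rank 1 = best) and average each method's rank
# across datasets. The Macro-F1 values below are illustrative placeholders.
macro_f1 = {                      # dataset -> {method: Macro-F1}
    "Tweets SAN": {"A": 0.80, "B": 0.70, "C": 0.60},
    "Comments BBC": {"A": 0.55, "B": 0.65, "C": 0.50},
}

ranks = {m: [] for m in next(iter(macro_f1.values()))}
for scores in macro_f1.values():
    ordered = sorted(scores, key=scores.get, reverse=True)
    for position, method in enumerate(ordered, start=1):
        ranks[method].append(position)

mean_rank = {m: sum(r) / len(r) for m, r in ranks.items()}
print(mean_rank)  # e.g. {'A': 1.5, 'B': 1.5, 'C': 3.0}
```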
The last evaluation metric we exploit is Friedman's Test [Berenson et al. 2014]. It allows us to verify whether, in a specific experiment, the observed values are globally similar. In other words, are the methods presenting similar performance across different datasets? To exemplify the application of this test, suppose that n restaurants are each rated by k judges. The question that arises is: are the judges' ratings consistent with each other, or are they following completely different patterns? The application in our context is very similar: the datasets play the role of the restaurants and the macro-F1 achieved by a method is the rating from a judge.
Friedman's Test is applied to rankings. Thus, to proceed with this statistical test, we sort the methods for each dataset using the macro-F1 metric for comparison. In other words, the method with the highest macro-F1 received rank '1' while the method with the lowest macro-F1 was ranked '21' for each dataset.
More formally, the Friedman's rank test statistic is defined as:

F_R = [ 12 / ( r * c * (c + 1) ) ] * ( sum over j=1..c of R_j^2 ) - 3 * r * (c + 1)

where
R_j^2 = square of the total of the ranks for group j (j = 1, 2, ..., c)
r = number of blocks
c = number of groups
In our case, the number of blocks corresponds to the number of datasets and the number of groups is the number of methods evaluated. As the number of blocks increases, the statistical test can be approximated by the chi-square distribution with c - 1 degrees of freedom. Then, if the computed F_R value is greater than the critical value for the chi-square distribution, the null hypothesis is rejected. This null hypothesis states that the ranks obtained by the judges are globally similar, so rejecting the null hypothesis means that there are significant differences among the judgment ranks (datasets). It is important to note that, in general, the critical value is obtained with significance level α = 0.05. In summary, the null hypothesis should be rejected if F_R > χ²(α), where χ²(α) is the critical value verified in the chi-square distribution table with c - 1 degrees of freedom and α equal to 0.05.
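As an illustration, the sketch below applies the Friedman test in the way described above, treating each dataset as a block and each method as a group; the Macro-F1 values are illustrative only, and the SciPy routine is used here in place of computing F_R by hand.

```python
# Minimal sketch of the Friedman test: each "block" is a dataset and each
# "group" is a method; the Macro-F1 values below are illustrative placeholders.
from scipy.stats import friedmanchisquare

# Macro-F1 of three hypothetical methods over five datasets (blocks).
method_a = [0.55, 0.60, 0.48, 0.70, 0.52]
method_b = [0.50, 0.66, 0.45, 0.64, 0.58]
method_c = [0.40, 0.42, 0.38, 0.50, 0.44]

statistic, p_value = friedmanchisquare(method_a, method_b, method_c)

# Reject the null hypothesis (globally similar ranks) when p < alpha.
alpha = 0.05
print(statistic, p_value, p_value < alpha)
```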
5.3. Comparing Prediction Performance
We start the analysis of our experiments by comparing the results of all metrics previously discussed
for all datasets. Table V and Table IV present accuracy, precision, and macro-F1 for all methods on 4 datasets for the 3-class and 2-class experiments, respectively. For simplicity, results for the other 16 datasets are presented in the appendix.
Table IV. 2-classes experiments results with 4 datasets
Dataset Method Accur. Posit. Sentiment Negat. Sentiment MacroF1 Coverage
P R F1 P R F1
AFINN 77.85 72.73 86.72 79.11 84.83 69.54 76.43 77.77 69.94
Emolex 72.06 65.42 85.67 74.19 82.61 60.06 69.55 71.87 60.04
Emoticons 85.71 87.50 89.74 88.61 82.61 79.17 80.85 84.73 5.77
Emoticons DS 47.20 47.13 99.21 63.90 55.56 0.88 1.74 32.82 98.08
Happiness Index 64.23 59.63 89.92 71.70 78.81 38.11 51.38 61.54 45.10
NRC Hashtag 69.77 72.01 58.71 64.69 68.36 79.63 73.57 69.13 93.68
LIWC 77.27 71.37 92.05 80.40 88.38 62.10 72.95 76.67 63.70
Tweets Opinion Finder 68.60 63.54 71.35 67.22 73.80 66.35 69.87 68.55 34.74
SAN Opinion Lexicon 81.77 78.18 86.74 82.24 85.98 77.05 81.27 81.75 65.35
PANAS-t 84.62 80.00 80.00 80.00 87.50 87.50 87.50 83.75 2.38
Pattern.en 74.62 68.46 90.59 77.98 86.40 58.90 70.04 74.01 72.59
SANN 70.02 64.40 85.95 73.63 80.46 54.90 65.27 69.45 45.55
SASA 61.18 61.74 74.71 67.61 60.12 45.16 51.58 59.59 43.45
Semantria 78.91 76.42 82.85 79.50 81.79 75.08 78.29 78.90 57.38
SenticNet 66.51 65.31 57.73 61.29 67.33 73.96 70.49 65.89 77.45
Sentiment140 72.45 0.00 0.00 0.00 72.45 100.00 84.02 42.01 53.90
SentiStrength 89.47 88.70 91.81 90.23 90.41 86.84 88.59 89.41 29.61
SentiWordNet 67.49 64.25 75.63 69.48 71.90 59.70 65.23 67.35 59.21
SO-CAL 79.70 74.74 84.55 79.34 85.11 75.56 80.05 79.70 68.19
Stanford DM 62.72 87.60 22.60 35.93 59.35 97.25 73.71 54.82 92.94
Umigon 82.41 83.66 80.49 82.04 81.25 84.32 82.76 82.40 67.74
Vader 77.18 71.81 88.15 79.15 85.34 66.59 74.81 76.98 78.74
AFINN 80.32 79.38 90.84 84.72 82.39 64.47 72.34 78.53 71.47
Emolex 73.28 74.49 83.19 78.60 70.93 59.03 64.43 71.52 59.02
Emoticons 85.43 90.27 90.27 90.27 71.05 71.05 71.05 80.66 13.19
Emoticons DS 58.95 58.84 99.63 73.99 72.22 1.37 2.70 38.34 99.83
Happiness Index 68.28 66.75 93.60 77.93 76.23 30.58 43.65 60.79 60.46
NRC Hashtag 65.61 72.95 65.91 69.26 57.31 65.19 61.00 65.13 95.54
LIWC 59.92 62.41 79.37 69.87 52.66 32.43 40.14 55.01 54.61
Tweets Opinion Finder 77.16 82.14 78.04 80.04 70.90 75.92 73.32 76.68 40.37
RND I Opinion Lexicon 81.56 82.00 87.68 84.74 80.84 72.98 76.71 80.73 63.74
PANAS-t 85.45 91.18 86.11 88.57 76.19 84.21 80.00 84.29 4.81
Pattern.en 78.02 79.84 85.52 82.58 74.60 66.33 70.22 76.40 77.72
SANN 75.61 75.48 87.04 80.85 75.89 59.06 66.43 73.64 50.15
SASA 65.60 70.72 70.36 70.54 58.47 58.89 58.68 64.61 58.67
Semantria 83.98 85.94 87.75 86.83 80.85 78.28 79.54 83.19 58.63
SenticNet 70.90 75.79 70.92 73.28 65.45 70.87 68.05 70.66 78.51
Sentiment140 70.19 0.00 0.00 0.00 70.19 100.00 82.49 41.24 40.45
SentiStrength 93.72 94.26 96.33 95.28 92.61 88.68 90.60 92.94 27.13
SentiWordNet 70.70 76.03 77.64 76.83 61.27 59.11 60.17 68.50 62.78
SO-CAL 80.85 82.08 85.98 83.98 78.92 73.66 76.20 80.09 64.57
Stanford DM 54.19 87.02 25.40 39.33 47.44 94.67 63.21 51.27 92.70
Umigon 82.07 89.22 80.71 84.76 73.02 84.26 78.24 81.50 67.50
Vader 80.12 78.73 91.76 84.75 83.39 62.52 71.46 78.10 81.08
AFINN 84.42 80.62 91.49 85.71 89.66 77.04 82.87 84.29 76.88
Emolex 79.65 76.09 88.98 82.03 85.23 69.44 76.53 79.28 62.95
Emoticons 85.42 80.65 96.15 87.72 94.12 72.73 82.05 84.89 13.37
Emoticons DS 51.96 51.41 100.00 67.91 100.00 2.27 4.44 36.18 99.72
Happiness Index 65.93 58.72 94.39 72.40 88.89 40.34 55.49 63.95 62.95
NRC Hashtag 71.30 73.05 70.93 71.98 69.51 71.70 70.59 71.28 92.20
LIWC 64.29 63.75 76.12 69.39 65.22 50.85 57.14 63.27 70.39
Tweets Opinion Finder 80.77 81.16 76.71 78.87 80.46 84.34 82.35 80.61 43.45
STF Opinion Lexicon 86.10 83.67 91.11 87.23 89.29 80.65 84.75 85.99 72.14
PANAS-t 94.12 88.89 100.00 94.12 100.00 88.89 94.12 94.12 4.74
Pattern.en 77.85 75.69 85.09 80.12 80.95 69.86 75.00 77.56 85.52
SANN 73.21 69.35 82.69 75.44 78.82 63.81 70.53 72.98 58.22
SASA 68.52 65.65 78.90 71.67 72.94 57.94 64.58 68.12 60.17
Semantria 88.45 89.15 88.46 88.80 87.70 88.43 88.07 88.43 69.92
SenticNet 70.49 71.31 63.50 67.18 69.88 76.82 73.19 70.18 80.22
Sentiment140 75.53 0.00 0.00 0.00 75.53 100.00 86.06 43.03 52.37
SentiStrength 95.33 95.18 96.34 95.76 95.52 94.12 94.81 95.29 41.78
SentiWordNet 72.99 73.17 78.95 75.95 72.73 65.98 69.19 72.57 58.77
SO-CAL 87.36 82.89 93.33 87.80 92.80 81.69 86.89 87.35 77.16
Stanford DM 66.56 87.69 36.31 51.35 61.24 95.18 74.53 62.94 89.97
Umigon 86.99 91.73 81.88 86.52 83.02 92.31 87.42 86.97 81.34
Vader 94.12 100.00 90.48 95.00 86.67 100.00 92.86 93.93 9.47
AFINN 66.56 23.08 81.08 35.93 96.32 64.66 77.38 56.65 85.11
Emolex 59.64 21.52 89.04 34.67 97.38 55.62 70.80 52.73 80.72
Emoticons 33.33 0.00 0.00 0.00 100.00 33.33 50.00 25.00 0.40
Emoticons DS 13.33 13.10 100.00 23.17 100.00 0.31 0.61 11.89 99.73
Happiness Index 41.81 15.65 95.52 26.89 98.41 35.03 51.67 39.28 79.52
NRC Hashtag 84.45 33.33 25.27 28.75 89.76 92.83 91.27 60.01 97.47
LIWC 50.10 15.38 58.33 24.35 88.00 48.78 62.77 43.56 69.55
Comments Opinion Finder 74.43 21.74 62.50 32.26 94.93 75.72 84.24 58.25 76.46
BBC Opinion Lexicon 74.14 29.81 84.93 44.13 97.24 72.66 83.17 63.65 80.72
PANAS-t 58.73 20.00 75.00 31.58 93.94 56.36 70.45 51.02 8.38
Pattern.en 61.09 20.00 70.73 31.18 93.48 59.72 72.88 52.03 87.50
SANN 54.34 19.09 88.06 31.38 96.88 49.80 65.78 48.58 75.13
SASA 61.61 23.50 66.20 34.69 90.80 60.77 72.81 53.75 61.30
Semantria 83.43 40.00 84.75 54.35 97.64 83.26 89.88 72.11 67.42
SenticNet 66.07 24.44 74.16 36.77 94.24 64.83 76.81 56.79 88.96
Sentiment140 92.56 0.00 0.00 0.00 92.56 100.00 96.14 48.07 51.86
SentiStrength 93.93 64.29 78.26 70.59 97.72 95.54 96.61 83.60 32.85
SentiWordNet 57.49 20.00 88.06 32.60 97.13 53.45 68.96 50.78 76.33
SO-CAL 75.28 28.93 80.28 42.54 96.71 74.64 84.25 63.40 82.85
Stanford DM 89.45 63.16 40.91 49.66 91.81 96.52 94.11 71.88 92.02
Umigon 79.37 39.13 61.02 47.68 92.10 82.72 87.15 67.42 50.93
Vader 62.19 22.12 85.54 35.15 96.77 59.02 73.32 54.23 92.15
Table V. 3-classes experiments results with 4 datasets
Dataset Method Accur. Posit. Sentiment Negat. Sentiment Neut. Sentiment MacroF1
P R F1 P R F1 P R F1
AFINN 62.36 61.10 70.09 65.28 44.08 31.91 37.02 71.43 58.57 64.37 55.56
Emolex 48.74 48.15 62.71 54.47 31.27 17.71 22.61 57.90 41.30 48.21 41.76
Emoticons 52.88 72.83 11.34 19.62 55.56 32.37 40.91 34.05 96.53 50.34 36.96
Emoticons DS 36.59 36.55 100.00 53.53 75.00 0.08 0.16 100.00 0.03 0.07 17.92
Happiness Index 48.81 43.61 65.27 52.29 36.96 7.54 12.53 36.82 45.16 40.56 35.13
NRC Hashtag 36.95 42.04 75.03 53.88 24.57 16.94 20.05 53.33 3.70 6.92 26.95
LIWC 39.54 36.52 42.33 39.21 15.14 6.25 8.84 48.64 44.83 46.66 31.57
Tweets Opinion Finder 57.63 67.57 27.94 39.53 40.75 48.62 44.34 58.20 86.06 69.44 51.10
Semeval Opinion Lexicon 60.37 62.09 62.71 62.40 41.19 34.18 37.36 66.41 60.75 63.46 54.41
PANAS-t 53.08 90.95 9.04 16.45 51.56 62.26 56.41 51.65 99.01 67.89 46.92
Pattern.en 50.19 58.07 68.47 62.84 24.68 29.82 27.01 67.73 35.22 46.34 45.40
SANN 54.77 52.72 47.59 50.02 38.91 20.92 27.21 58.95 66.90 62.67 46.64
SASA 50.63 46.34 47.77 47.04 33.07 12.14 17.76 56.39 61.12 58.66 41.15
Semantria 61.54 67.28 57.35 61.92 39.57 41.62 40.57 65.98 67.03 66.50 56.33
SenticNet 49.68 51.85 1.26 2.46 29.79 35.00 32.18 49.82 98.51 66.17 33.60
Sentiment140 42.25 0.00 0.00 0.00 26.79 100.00 42.25 50.57 66.14 57.31 33.19
SentiStrength 57.83 78.01 27.13 40.25 47.80 53.55 50.52 55.49 89.89 68.62 53.13
SentiWordNet 48.33 55.54 53.44 54.47 19.67 24.82 21.95 61.22 47.57 53.54 43.32
SO-CAL 58.83 58.89 59.02 58.95 40.39 33.14 36.41 39.89 59.96 47.91 47.76
Stanford DM 22.54 72.14 18.17 29.03 14.92 82.93 25.28 47.19 6.94 12.10 22.14
Umigon 65.88 75.18 56.14 64.28 39.66 53.18 45.44 70.65 75.78 73.13 60.95
Vader 60.05 56.08 79.26 65.68 44.13 26.60 33.19 76.88 46.02 57.57 52.15
AFINN 64.41 40.81 72.12 52.13 49.67 28.29 36.05 85.95 62.54 72.40 53.53
Emolex 54.76 31.67 59.95 41.44 40.14 19.53 26.27 77.48 54.64 64.08 43.93
Emoticons 70.22 70.06 16.78 27.07 65.62 44.21 52.83 41.29 97.56 58.02 45.98
Emoticons DS 20.34 19.78 99.46 33.00 62.07 0.60 1.19 53.85 0.55 1.09 11.76
Happiness Index 55.16 29.13 61.98 39.64 50.65 9.50 16.01 43.35 59.16 50.03 35.23
NRC Hashtag 30.47 28.25 77.40 41.39 24.18 19.59 21.64 79.08 8.77 15.78 26.27
LIWC 46.88 21.85 38.43 27.86 19.18 8.05 11.34 69.51 54.83 61.31 33.50
Tweets Opinion Finder 71.55 57.48 32.75 41.72 49.85 48.56 49.20 75.95 89.90 82.34 57.75
RND III Opinion Lexicon 63.86 40.65 66.17 50.36 48.84 27.73 35.38 81.96 64.66 72.29 52.68
PANAS-t 68.79 79.49 8.39 15.18 48.57 51.52 50.00 68.75 98.86 81.10 48.76
Pattern.en 53.57 36.25 76.86 49.26 35.19 22.50 27.45 84.20 45.68 59.23 45.31
SANN 66.88 42.70 48.71 45.51 46.35 26.93 34.07 77.99 77.99 77.99 52.52
SASA 55.37 29.42 54.53 38.22 42.46 19.28 26.52 78.30 57.15 66.08 43.60
Semantria 68.89 48.86 63.73 55.31 49.82 35.47 41.44 82.02 72.96 77.22 57.99
SenticNet 29.97 31.08 74.83 43.92 20.98 22.75 21.83 79.70 8.49 15.35 27.03
Sentiment140 55.05 0.00 0.00 0.00 28.14 100.00 43.92 71.14 66.00 68.47 37.46
SentiStrength 73.80 70.94 41.95 52.72 57.53 49.80 53.39 75.35 92.26 82.95 63.02
SentiWordNet 55.85 37.42 58.19 45.55 24.04 19.57 21.58 79.25 59.00 67.64 44.92
SO-CAL 66.51 43.06 68.88 52.99 51.84 30.55 38.44 45.77 66.94 54.37 48.60
Stanford DM 31.90 64.48 38.57 48.26 15.58 72.55 25.65 75.64 19.77 31.35 35.09
Umigon 74.12 57.67 70.23 63.33 48.83 46.71 47.75 88.80 76.34 82.10 64.39
Vader 59.82 37.52 81.73 51.43 47.99 24.25 32.22 89.26 52.28 65.94 49.86
AFINN 50.10 16.22 60.61 25.59 82.62 54.14 65.42 40.11 30.24 34.48 41.83
Emolex 44.10 15.51 65.66 25.10 83.19 45.62 58.93 35.27 31.85 33.47 39.17
Emoticons 24.60 0.00 0.00 0.00 33.33 25.00 28.57 19.77 98.79 32.95 20.51
Emoticons DS 10.00 9.85 98.99 17.92 66.67 0.22 0.44 0.00 0.00 0.00 9.18
Happiness Index 33.60 11.83 64.65 20.00 84.93 28.05 42.18 26.46 34.68 30.02 30.73
NRC Hashtag 64.00 20.72 23.23 21.90 70.20 87.13 77.76 52.50 8.47 14.58 38.08
LIWC 33.00 11.11 42.42 17.61 67.69 39.57 49.94 22.90 27.42 24.95 30.84
Comments Opinion Finder 51.80 14.96 35.35 21.02 78.76 66.39 72.04 33.71 36.29 34.95 42.67
BBC Opinion Lexicon 55.00 20.67 62.63 31.08 85.27 61.98 71.79 40.82 40.32 40.57 47.81
PANAS-t 27.10 16.67 6.06 8.89 75.61 50.82 60.78 25.35 94.35 39.97 36.55
Pattern.en 46.00 14.39 58.59 23.11 77.30 49.93 60.67 38.16 23.39 29.00 37.59
SANN 40.10 14.50 59.60 23.32 79.49 41.61 54.63 33.45 37.90 35.54 37.83
SASA 38.20 17.03 47.47 25.07 70.75 50.86 59.18 25.19 39.52 30.77 38.34
Semantria 56.00 28.90 50.51 36.76 83.82 75.20 79.28 35.86 55.24 43.49 53.18
SenticNet 47.10 17.74 66.67 28.03 72.87 55.13 62.77 25.89 11.69 16.11 35.64
Sentiment140 50.60 0.00 0.00 0.00 73.23 100.00 84.54 28.60 58.47 38.41 40.98
SentiStrength 44.20 47.37 18.18 26.28 86.64 91.45 88.98 29.37 84.68 43.61 52.96
SentiWordNet 42.40 14.90 59.60 23.84 81.63 44.57 57.66 34.56 37.90 36.15 39.22
SO-CAL 55.50 20.88 57.58 30.65 80.47 65.61 72.28 28.57 34.68 31.33 44.75
Stanford DM 65.50 43.37 36.36 39.56 71.01 92.54 80.36 37.50 14.52 20.93 46.95
Umigon 45.70 28.35 36.36 31.86 76.35 74.65 75.49 29.31 61.69 39.74 49.03
Vader 49.10 15.96 71.72 26.10 82.57 49.05 61.54 50.42 24.19 32.70 40.11
AFINN 42.45 64.81 41.79 50.81 80.29 68.59 73.98 7.89 77.87 14.32 46.37
Emolex 42.97 55.12 53.72 54.41 75.35 48.67 59.14 7.22 54.10 12.74 42.10
Emoticons 4.68 0.00 0.00 0.00 0.00 0.00 0.00 4.47 99.59 8.56 2.85
Emoticons DS 42.58 42.55 99.77 59.66 78.57 0.37 0.73 0.00 0.00 0.00 30.20
Happiness Index 31.81 48.42 50.18 49.29 71.70 25.96 38.12 5.36 54.10 9.76 32.39
NRC Hashtag 54.84 55.38 45.74 50.10 61.55 68.92 65.03 8.33 15.16 10.76 41.96
LIWC 24.35 42.88 27.72 33.67 53.42 39.12 45.16 4.67 53.28 8.58 29.14
Comments Opinion Finder 29.38 68.77 18.78 29.51 76.52 82.66 79.47 6.29 88.11 11.75 40.24
NYT Opinion Lexicon 44.57 65.95 43.15 52.17 79.81 70.65 74.95 7.94 73.77 14.34 47.15
PANAS-t 5.88 69.23 1.23 2.41 62.07 75.00 67.92 4.75 99.18 9.07 26.47
Pattern.en 45.39 55.15 44.69 49.37 63.65 61.12 62.36 7.85 45.90 13.41 41.71
SANN 27.92 56.74 29.40 38.73 78.02 55.13 64.61 5.93 79.51 11.04 38.13
SASA 30.04 49.92 30.13 37.58 59.11 52.83 55.80 5.74 61.07 10.49 34.62
Semantria 44.59 70.60 41.83 52.54 80.54 75.95 78.18 7.53 73.36 13.65 48.12
SenticNet 4.70 0.00 0.00 0.00 0.00 0.00 0.00 4.70 100.00 8.98 2.99
Sentiment140 34.66 0.00 0.00 0.00 65.76 100.00 79.34 5.83 64.34 10.69 30.01
SentiStrength 18.17 78.51 8.62 15.54 81.12 90.91 85.74 5.41 95.49 10.24 37.17
SentiWordNet 32.20 57.35 34.53 43.10 70.31 56.63 62.73 6.08 70.08 11.19 39.01
SO-CAL 50.79 64.36 51.13 56.99 77.25 68.36 72.53 8.68 65.98 15.34 48.29
Stanford DM 51.93 73.39 21.14 32.83 59.48 92.67 72.46 9.65 38.11 15.40 40.23
Umigon 24.08 68.76 16.38 26.46 68.78 80.38 74.13 5.88 88.93 11.04 37.21
Vader 48.84 61.96 52.40 56.78 80.09 63.00 70.52 9.51 70.90 16.77 48.03
Table VI. Mean Rank Table
3-Classes 2-Classes
Pos Method Mean Rank Pos Method Mean Rank
1 Semantria 3.20 1 SentiStrength 1.90(2.29)
2 SentiStrength 3.80(4.78) 2 Semantria 3.80
3 AFINN 4.40 3 Opinion Lexicon 5.70
4 Umigon 4.47 4 SO-CAL 5.90
5 Opinion Lexicon 4.87 5 AFINN 6.75
6 SO-CAL 6.93 6 Vader 7.65(8.06)
7 Vader 6.93(7.21) 7 Umigon 7.85
8 Opinion Finder 8.87 8 PANAS-t 9.85
9 Pattern.en 10.00 9 Emoticons 10.15
10 SANN 10.93 10 Pattern.en 10.30
11 Emolex 11.80 11 Opinion Finder 11.95
12 SentiWordNet 11.87 12 SenticNet 12.25
13 SenticNet 14.00 13 Emolex 12.30
14 Stanford DM 14.40 14 SANN 12.75
15 SASA 14.73 15 Stanford DM 13.80
16 LIWC 15.13 16 NRC Hashtag 14.20
17 PANAS-t 15.87 17 SentiWordNet 14.95
18 NRC Hashtag 16.73 18 SASA 15.60
19 Sentiment140 17.13 19 LIWC 15.90
20 Happiness Index 17.53 20 Happiness Index 17.65
21 Emoticons 18.00 21 Sentiment140 20.60
22 Emoticons DS 21.40 22 Emoticons DS 21.20
First, we note that the existing methods vary widely in their agreement. This suggests that the same social media text could be interpreted very differently depending on the choice of sentiment method. A few methods obtain results worse than a random baseline (i.e., a method that randomly chooses among positive, neutral, and negative as output). This usually happens when a method is biased towards one or more classes. As an example, Emoticons proved to be a good method for detecting positive and negative messages when the input data contains an emoticon. However, it labels most instances as neutral, since the majority of messages do not contain emoticons, leading to poor overall performance on most datasets.
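For concreteness, the sketch below (our own illustration, not part of the original evaluation) estimates the macro-F1 that a uniform random three-class baseline would obtain on a hypothetical labeled sample; the class proportions and sample size are made up.

# Illustrative sketch: macro-F1 of a uniform random 3-class baseline,
# the reference point mentioned in the text.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
labels = ["positive", "neutral", "negative"]

# Hypothetical gold labels for 10,000 messages (class proportions are made up).
y_true = rng.choice(labels, size=10_000, p=[0.4, 0.35, 0.25])

# Random baseline: pick one of the three classes uniformly for every message.
y_rand = rng.choice(labels, size=y_true.size)

print("random-baseline macro-F1:", f1_score(y_true, y_rand, average="macro"))

With three roughly balanced classes the expected macro-F1 of such a baseline hovers around one third, which gives a concrete sense of how weak the worst-performing methods are.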
Taking a closer look at Table IV, we note that Vader works well for Tweets RND I and Tweets STF, appearing among the top three methods, but it performs poorly on Tweets SAN and Comments BBC, reaching only ninth and eleventh place, respectively. Although the first three datasets contain tweets, they have different contexts, which can drastically affect the performance of some methods. Another important aspect to be analyzed in this table is coverage. Although SentiStrength presents good macro-F1 values, its coverage is relatively low, because this method is somewhat biased towards the neutral class. Note that some of the datasets provided by SentiStrength's authors (Thelwall and colleagues), as shown in Table III, especially the Twitter datasets, have more neutral sentences than positive and negative ones.
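As a minimal sketch of the two measures discussed above (our own illustration, not the paper's evaluation code), the snippet below computes coverage as the fraction of messages for which a method emits a polarity at all, and macro-F1 over the covered messages only; the gold and predicted labels are hypothetical.

from sklearn.metrics import f1_score

def coverage_and_macro_f1(y_true, y_pred):
    """y_pred contains 'positive', 'negative', or None when the method abstains."""
    covered = [(t, p) for t, p in zip(y_true, y_pred) if p is not None]
    coverage = len(covered) / len(y_true)
    if not covered:
        return coverage, 0.0
    t, p = zip(*covered)
    return coverage, f1_score(t, p, average="macro")

# Hypothetical outputs of a method that abstains (e.g., defaults to neutral) often.
gold = ["positive", "negative", "negative", "positive", "negative"]
pred = ["positive", None, "negative", None, "negative"]
print(coverage_and_macro_f1(gold, pred))  # high macro-F1, low coverage

This separation is what allows a method such as SentiStrength to report a high macro-F1 while covering only a fraction of the messages.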
The 3-class experiments presented in Table V show that different contexts can lead to poor performance for some methods. For instance, Umigon, the top performer on four tweet datasets, appears in the fourth overall position in the 3-class Mean Rank (Table VI) but fell to thirteenth place on the Comments NYT dataset.
Fig. 2. Average F1 Score for each class

We note from Figure 2 that most methods are more accurate when classifying positive messages than negative ones, suggesting that some methods may be biased towards positivity. Neutral messages proved to be even harder for most methods to detect. Recent efforts show that human language has a universal positivity bias [Dodds et al. 2015]. Naturally, part of this bias carries over into sentiment prediction as an intrinsic property of some methods, due to the way they are designed. For instance, [Hannak et al. 2012] developed a lexicon in which positive and negative values are associated with words, hashtags, and other tokens according to the frequency with which these tokens appear in tweets containing positive and negative emoticons. This method proved to be biased towards positivity because of the larger amount of positive content in the data used to build the lexicon. The overall poor performance of this specific method can be credited to its lack of treatment of neutral messages and its focus on Twitter messages.
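A simplified sketch of this kind of frequency-based lexicon construction is shown below; it is our own illustration, with made-up emoticon sets and a simple scoring rule, not the original implementation of [Hannak et al. 2012].

from collections import Counter

POS_EMOTICONS = {":)", ":-)", ":D"}
NEG_EMOTICONS = {":(", ":-(", ":'("}

def build_lexicon(tweets):
    pos_counts, neg_counts = Counter(), Counter()
    for tweet in tweets:
        tokens = tweet.split()
        has_pos = any(t in POS_EMOTICONS for t in tokens)
        has_neg = any(t in NEG_EMOTICONS for t in tokens)
        for t in tokens:
            if has_pos:
                pos_counts[t] += 1
            if has_neg:
                neg_counts[t] += 1
    lexicon = {}
    for token in pos_counts.keys() | neg_counts.keys():
        p, n = pos_counts[token], neg_counts[token]
        lexicon[token] = (p - n) / (p + n)   # score in [-1, 1]
    return lexicon

lex = build_lexicon(["great game :)", "awful traffic :(", "great day :D"])
print(lex.get("great"), lex.get("awful"))

Because tweets with positive emoticons typically outnumber those with negative ones, scores built this way drift towards the positive end, which is exactly the positivity bias discussed above.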
As can be seen in Table VI, the top seven methods based on Macro-F1 are SentiStrength, Semantria, AFINN, Opinion Lexicon, Umigon, Vader, and SO-CAL. This means that these methods produce good results across several datasets in both the 2-class and the 3-class tasks, and they would be preferable in situations in which no preliminary evaluation of the methods can be performed. We also note that methods usually perform better on the datasets on which they were originally validated, which is somewhat expected given fine-tuning procedures. This is especially true for SentiStrength and VADER. To understand the impact of this factor, we recalculated the Mean Rank for these methods excluding their 'original' datasets (results in parentheses in Table VI). Note that in some cases the mean rank worsens once these datasets are excluded.
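The Mean Rank values in Table VI can be reproduced, in principle, by ranking the methods within each dataset and averaging; the sketch below (ours) illustrates the computation on a few Macro-F1 values excerpted from Table IX.

# Sketch of the Mean Rank computation behind Table VI: rank methods by
# Macro-F1 within each dataset, then average each method's rank over datasets.
import pandas as pd

# Macro-F1 values excerpted from Table IX: rows = datasets, columns = methods.
macro_f1 = pd.DataFrame(
    {"SentiStrength": [83.6, 89.8, 78.3],
     "AFINN": [56.7, 68.5, 73.7],
     "Vader": [54.2, 66.9, 71.6]},
    index=["Comments_BBC", "Comments_DIGG", "Comments_NYT"])

# Rank 1 = best Macro-F1 on that dataset; averaging over datasets gives the mean rank.
ranks = macro_f1.rank(axis=1, ascending=False)
print(ranks.mean(axis=0).sort_values())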
Table VII presents the Friedman's test results and, as expected, we can conclude that there are significant differences in the mean ranks observed for the methods across all datasets. Statistically, this indicates that, in terms of accuracy and Macro-F1, no single method always achieves the best prediction performance across different datasets, which is reminiscent of the well-known "no-free-lunch theorem" [Wolpert and Macready 1997]. This suggests that at least a preliminary investigation should be performed when sentiment analysis is applied to a new dataset, in order to guarantee reasonable prediction performance.
Table VII. Friedman’s Test Results
2-classes experiments 3-classes experiments
FR 261.57 FR 219.31
Critical Value 31.41 Critical Value 31.41
Reject null hypothesis Reject null hypothesis
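A minimal sketch of how such a test can be run in practice is given below (our illustration, using SciPy rather than the paper's own tooling). Each argument holds one method's Macro-F1 over the same set of datasets (the blocks); the values are placeholders, and the critical value comes from the chi-square approximation with (number of methods - 1) degrees of freedom.

from scipy.stats import friedmanchisquare, chi2

method_a = [56.7, 68.5, 73.7, 74.3]   # e.g. one method's Macro-F1 over four datasets
method_b = [83.6, 89.8, 78.3, 82.5]   # a second method over the same datasets
method_c = [54.2, 66.9, 71.6, 74.0]   # a third method

fr, p_value = friedmanchisquare(method_a, method_b, method_c)
critical = chi2.ppf(0.95, df=3 - 1)   # df = number of methods - 1
print(f"FR={fr:.2f} (p={p_value:.3f}), critical={critical:.2f}, reject H0: {fr > critical}")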
In order to verify whether this behavior also occurs in specific contexts, such as tweets or comments, we divided all datasets into three contexts and performed the Friedman's test for each one. The contexts are Social Networks, Comments, and Reviews, and the datasets were grouped as presented below. Although its sentences were extracted from forums, we assigned the RW dataset to the Comments context, as the properties of its sentences are similar to those of the other Comments datasets (see Table III).
Context Groups
Social Networks: Myspace, Tweets DBT, Tweets RND I, Tweets RND II, Tweets RND III, Tweets RND IV, Tweets STF, Tweets SAN, Tweets Semeval
Comments: Comments BBC, Comments DIGG, Comments NYT, Comments TED, Comments YTB, RW
Reviews: Reviews I, Reviews II, Amazon
Even after grouping the datasets into these contexts, we still find significant differences in the observed ranks across the datasets. Although the values obtained for each context are considerably smaller than the global Friedman value, they are still above the critical value. Table VIII presents the results of the Friedman's test for the individual contexts in both the 2-class and 3-class experiments. Recall that, for the 3-class experiments, datasets with no neutral sentences or with an unrepresentative number of neutral sentences were removed. For this reason, there are no values for the 3-class experiments in the Reviews context, since none of the Review datasets has a significant number of neutral sentences.
Table VIII. Friedman’s Test Results By Contexts
Context: Social Networks
2-classes experiments 3-classes experiments
FR 158.95 FR 138.12
Critical Value 31.41 Critical Value 31.41
Reject null hypothesis Reject null hypothesis
Context: Comments
2-classes experiments 3-classes experiments
FR 85.39 FR 94.39
Critical Value 31.41 Critical Value 31.41
Reject null hypothesis Reject null hypothesis
Context: Reviews
2-classes experiments 3-classes experiments
FR 56.01 FR -
Critical Value 31.41 Critical Value -
Reject null hypothesis Reject null hypothesis
6. CONCLUDING REMARKS
Recent efforts to analyze the moods embedded in Web 2.0 content have adopted various sentiment
analysis methods, which were originally developed in linguistics and psychology. Several of these
methods became widely used in their knowledge fields and have now been applied as tools to quan-
tify moods in the context of unstructured short messages in online social networks. In this article,
we present a thorough comparison of twenty-one popular sentence-level sentiment analysis methods
using gold-standard datasets that span different types of data sources.
To perform this comparison, we have made significant efforts to obtain the latest working versions of the various sentiment analysis tools and datasets, which we have put together in a single webpage 6. We are releasing this Web system so that other researchers can easily compare the results of these methods on their own datasets. With this system, one can easily test which method would be most suitable for a particular dataset and application. We hope that our tool will not only help researchers and practitioners access and compare a wide range of sentiment analysis techniques, but will also foster the development of new research in this area.
APPENDIX
In this appendix, we present the full results of prediction performance of all twenty one sentiment
analysis methods on all labeled datasets.
6 http://homepages.dcc.ufmg.br/fabricio/benchmark_sentiment_analysis.html
Table IX. 2-class experiment results with 4 datasets
Dataset Method Accur. Posit. Sentiment Negat. Sentiment MacroF1 Coverage
P R F1 P R F1
AFINN 66.56 23.08 81.08 35.93 96.32 64.66 77.38 56.65 85.11
Emolex 59.64 21.52 89.04 34.67 97.38 55.62 70.80 52.73 80.72
Emoticons 33.33 0.00 0.00 0.00 100.00 33.33 50.00 25.00 0.40
Emoticons DS 13.33 13.10 100.00 23.17 100.00 0.31 0.61 11.89 99.73
Happiness Index 41.81 15.65 95.52 26.89 98.41 35.03 51.67 39.28 79.52
NRC Hashtag 84.45 33.33 25.27 28.75 89.76 92.83 91.27 60.01 97.47
LIWC 50.10 15.38 58.33 24.35 88.00 48.78 62.77 43.56 69.55
Comments Opinion Finder 74.43 21.74 62.50 32.26 94.93 75.72 84.24 58.25 76.46
BBC Opinion Lexicon 74.14 29.81 84.93 44.13 97.24 72.66 83.17 63.65 80.72
PANAS-t 58.73 20.00 75.00 31.58 93.94 56.36 70.45 51.02 8.38
Pattern.en 61.09 20.00 70.73 31.18 93.48 59.72 72.88 52.03 87.50
SANN 54.34 19.09 88.06 31.38 96.88 49.80 65.78 48.58 75.13
SASA 61.61 23.50 66.20 34.69 90.80 60.77 72.81 53.75 61.30
SenticNet 36.21 16.27 94.62 27.76 97.18 27.52 42.89 35.33 95.48
Sentiment140 92.56 0.00 0.00 0.00 92.56 100.00 96.14 48.07 51.86
SentiStrength 93.93 64.29 78.26 70.59 97.72 95.54 96.61 83.60 32.85
SentiWordNet 57.49 20.00 88.06 32.60 97.13 53.45 68.96 50.78 76.33
SO-CAL 75.28 28.93 80.28 42.54 96.71 74.64 84.25 63.40 82.85
Stanford DM 89.45 63.16 40.91 49.66 91.81 96.52 94.11 71.88 92.02
Umigon 79.37 39.13 61.02 47.68 92.10 82.72 87.15 67.42 50.93
Vader 62.19 22.12 85.54 35.15 96.77 59.02 73.32 54.23 92.15
AFINN 70.94 47.01 81.82 59.72 91.17 67.05 77.27 68.49 74.81
Emolex 61.71 34.60 75.83 47.52 88.93 57.53 69.87 58.69 67.14
Emoticons 73.08 72.22 86.67 78.79 75.00 54.55 63.16 70.97 3.32
Emoticons DS 28.24 27.30 100.00 42.89 100.00 1.77 3.48 23.19 98.72
Happiness Index 42.32 27.44 91.45 42.21 91.53 27.62 42.44 42.32 64.96
NRC Hashtag 74.69 51.01 40.64 45.24 80.80 86.48 83.54 64.39 92.97
LIWC 46.15 27.44 58.40 37.34 72.49 41.52 52.79 45.07 58.18
Comments Opinion Finder 71.14 43.04 64.76 51.71 86.88 73.13 79.42 65.56 56.27
DIGG Opinion Lexicon 71.82 47.45 86.43 61.27 93.40 66.75 77.86 69.56 69.44
PANAS-t 68.00 12.50 50.00 20.00 94.12 69.57 80.00 50.00 3.20
Pattern.en 66.72 43.49 77.44 55.70 88.25 62.75 73.35 64.53 77.62
SANN 60.04 35.56 84.96 50.13 91.83 52.33 66.67 58.40 61.13
SASA 65.54 40.26 66.91 50.27 84.82 65.06 73.64 61.95 68.29
SenticNet 43.00 30.39 92.97 45.81 91.22 25.52 39.88 42.84 91.30
Sentiment140 85.45 0.00 0.00 0.00 85.45 100.00 92.15 46.08 54.48
SentiStrength 92.09 78.69 92.31 84.96 97.40 92.02 94.64 89.80 27.49
SentiWordNet 62.17 36.86 77.68 50.00 88.84 57.18 69.58 59.79 58.82
SO-CAL 76.55 52.86 77.08 62.71 90.65 76.37 82.90 72.81 71.99
Stanford DM 79.45 66.67 37.58 48.06 81.57 93.63 87.19 67.63 83.38
Umigon 83.37 66.22 75.38 70.50 90.72 86.23 88.42 79.46 63.04
Vader 68.65 45.29 85.63 59.24 92.31 62.50 74.53 66.89 83.63
AFINN 73.82 66.35 78.85 72.07 81.55 70.04 75.36 73.71 55.14
Emolex 64.57 57.20 81.71 67.29 77.52 50.78 61.36 64.33 65.69
Emoticons 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Emoticons DS 44.75 44.66 99.86 61.72 78.57 0.40 0.80 31.26 99.84
Happiness Index 55.28 50.59 88.13 64.28 73.49 27.66 40.19 52.24 55.56
NRC Hashtag 61.89 58.47 49.85 53.82 63.98 71.55 67.55 60.69 91.77
LIWC 49.52 44.96 59.96 51.39 56.18 41.15 47.50 49.44 46.30
Comments Opinion Finder 75.11 70.17 61.61 65.61 77.64 83.58 80.50 73.06 35.26
NYT Opinion Lexicon 74.61 67.54 77.95 72.37 81.46 72.12 76.50 74.44 57.80
PANAS-t 66.32 71.05 56.25 62.79 63.16 76.60 69.23 66.01 1.92
Pattern.en 61.78 57.17 60.24 58.67 65.95 63.04 64.46 61.57 73.43
SANN 67.11 58.38 80.90 67.82 79.87 56.78 66.38 67.10 37.81
SASA 56.47 51.55 58.92 54.99 61.70 54.45 57.85 56.42 50.49
SenticNet 55.93 50.64 91.05 65.08 78.65 27.08 40.29 52.68 92.22
Sentiment140 68.13 0.00 0.00 0.00 68.13 100.00 81.05 40.52 48.73
SentiStrength 81.42 79.50 62.71 70.11 82.15 91.39 86.52 78.32 17.63
SentiWordNet 65.08 59.13 73.17 65.41 72.59 58.42 64.74 65.07 46.60
SO-CAL 72.52 66.14 75.74 70.61 78.88 70.03 74.19 72.40 69.01
Stanford DM 63.85 75.77 26.03 38.75 61.73 93.48 74.36 56.56 82.39
Umigon 70.03 69.56 55.97 62.03 70.29 80.96 75.25 68.64 29.82
Vader 71.58 63.60 80.66 71.12 81.33 64.61 72.02 71.57 66.72
AFINN 75.28 68.85 87.70 77.14 85.17 64.03 73.10 75.12 72.90
Emolex 67.27 59.88 85.46 70.42 81.03 52.03 63.37 66.89 68.50
Emoticons 91.67 100.00 75.00 85.71 88.89 100.00 94.12 89.92 1.65
Emoticons DS 43.74 43.74 100.00 60.86 0.00 0.00 0.00 30.43 100.00
Happiness Index 63.86 58.86 93.78 72.32 84.15 33.50 47.92 60.12 57.08
NRC Hashtag 71.00 68.05 58.36 62.84 72.66 80.15 76.23 69.53 92.02
LIWC 52.96 47.67 65.78 55.28 61.21 42.80 50.37 52.83 58.18
Comments Opinion Finder 70.99 66.48 66.12 66.30 74.38 74.69 74.53 70.42 58.32
TED Opinion Lexicon 74.35 68.15 84.58 75.49 82.89 65.40 73.11 74.30 74.55
PANAS-t 82.35 100.00 75.00 85.71 62.50 100.00 76.92 81.32 2.34
Pattern.en 67.21 62.89 76.29 68.94 73.15 58.93 65.28 67.11 83.91
SANN 72.55 67.82 76.96 72.10 77.73 68.77 72.98 72.54 68.64
SASA 65.94 59.63 77.40 67.36 75.00 56.40 64.38 65.87 63.00
SenticNet 55.76 50.27 90.00 64.51 77.70 28.12 41.30 52.90 95.46
Sentiment140 72.35 0.00 0.00 0.00 72.35 100.00 83.96 41.98 40.30
SentiStrength 82.81 83.59 86.29 84.92 81.72 78.35 80.00 82.46 30.40
SentiWordNet 58.67 56.70 77.46 65.48 63.08 39.42 48.52 57.00 57.91
SO-CAL 73.87 73.36 77.97 75.59 74.49 69.43 71.88 73.73 75.79
Stanford DM 75.34 83.42 58.33 68.66 71.46 90.00 79.67 74.16 81.98
Umigon 70.86 74.09 75.81 74.94 66.23 64.15 65.18 70.06 51.44
Vader 74.01 67.14 84.95 75.00 83.53 64.74 72.95 73.97 83.63
Table X. 2-class experiment results with 4 datasets
Dataset Method Accur. Posit. Sentiment Negat. Sentiment MacroF1 Coverage
P R F1 P R F1
AFINN 85.92 86.93 94.08 90.36 82.72 66.73 73.87 82.12 74.18
Emolex 79.31 82.46 87.58 84.94 71.70 62.82 66.97 75.95 58.63
Emoticons 88.44 91.50 95.31 93.37 64.00 48.48 55.17 74.27 9.25
Emoticons DS 68.86 68.91 99.63 81.47 62.50 1.34 2.62 42.05 97.99
Happiness Index 74.44 74.62 94.86 83.53 73.26 30.38 42.95 63.24 58.55
NRC Hashtag 72.24 90.94 65.25 75.99 54.76 86.62 67.10 71.54 89.31
LIWC 64.86 72.42 80.68 76.33 37.69 27.55 31.83 54.08 63.65
Comments Opinion Finder 73.64 84.12 74.60 79.08 58.43 71.72 64.40 71.74 42.43
YTB Opinion Lexicon 84.89 87.88 91.06 89.44 76.84 70.26 73.40 81.42 68.01
PANAS-t 65.45 60.00 62.50 61.22 70.00 67.74 68.85 65.04 2.26
Pattern.en 83.62 87.90 89.52 88.70 71.96 68.60 70.24 79.47 78.08
SANN 79.08 82.22 89.64 85.77 68.75 54.05 60.52 73.15 56.41
SASA 69.17 84.55 67.96 75.36 49.81 71.91 58.85 67.10 71.63
SenticNet 75.24 75.75 93.90 83.85 72.38 34.70 46.91 65.38 85.69
Sentiment140 59.29 0.00 0.00 0.00 59.29 100.00 74.44 37.22 32.32
SentiStrength 95.27 97.48 96.40 96.94 87.96 91.35 89.62 93.28 38.24
SentiWordNet 75.26 83.05 82.00 82.52 56.81 58.60 57.69 70.10 59.00
SO-CAL 85.98 90.64 89.08 89.85 75.86 78.85 77.33 83.59 68.63
Stanford DM 69.04 93.56 58.90 72.29 50.41 91.15 64.92 68.60 79.81
Umigon 82.01 94.53 80.39 86.89 60.65 86.67 71.36 79.13 71.55
Vader 85.62 86.66 93.86 90.11 82.40 66.56 73.64 81.87 81.50
AFINN 65.93 63.56 79.10 70.48 70.15 51.99 59.72 65.10 72.59
Emolex 64.77 62.37 79.35 69.85 69.30 49.34 57.64 63.74 74.39
Emoticons 60.00 0.00 0.00 0.00 60.00 100.00 75.00 37.50 0.05
Emoticons DS 50.27 50.17 99.94 66.80 89.29 0.47 0.94 33.87 99.79
Happiness Index 54.25 53.22 85.59 65.63 58.96 21.59 31.61 48.62 63.62
NRC Hashtag 62.34 62.14 64.45 63.27 62.57 60.20 61.36 62.32 93.47
LIWC 63.00 61.37 82.45 70.36 67.08 40.81 50.75 60.56 66.08
Reviews I Opinion Finder 26.55 100.00 26.55 41.96 0.00 0.00 0.00 20.98 49.12
Opinion Lexicon 69.77 69.26 74.09 71.59 70.39 65.20 67.70 69.64 77.28
PANAS-t 66.30 75.44 61.72 67.89 58.12 72.55 64.53 66.21 3.40
Pattern.en 65.60 65.24 68.68 66.92 66.00 62.43 64.17 65.54 89.06
SANN 62.34 62.00 70.29 65.88 62.82 53.81 57.97 61.93 67.31
SASA 57.41 55.81 61.54 58.54 59.27 53.46 56.22 57.38 58.24
SenticNet 55.27 53.43 88.26 66.57 64.44 21.67 32.43 49.50 94.52
Sentiment140 69.49 0.00 0.00 0.00 69.49 100.00 82.00 41.00 30.96
SentiStrength 67.54 72.40 65.28 68.66 62.84 70.23 66.33 67.49 26.98
SentiWordNet 61.45 61.12 71.36 65.84 61.97 50.69 55.77 60.80 62.53
SO-CAL 71.65 72.09 72.82 72.46 71.18 70.43 70.80 71.63 89.10
Stanford DM 82.70 88.31 75.48 81.40 78.48 89.95 83.83 82.61 91.92
Umigon 63.44 66.36 55.62 60.52 61.30 71.37 65.95 63.24 53.95
Vader 64.62 62.19 79.65 69.84 69.33 48.71 57.22 63.53 84.76
AFINN 65.95 63.40 78.99 70.34 70.44 52.33 60.05 65.19 73.93
Emolex 64.96 62.04 80.00 69.88 70.52 49.41 58.11 64.00 75.15
Emoticons 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Emoticons DS 49.86 49.77 99.92 66.45 85.19 0.43 0.86 33.65 99.80
Happiness Index 54.45 53.16 86.25 65.78 60.44 21.65 31.88 48.83 66.30
NRC Hashtag 61.56 60.96 63.80 62.35 62.22 59.33 60.74 61.55 91.65
LIWC 61.77 60.08 81.98 69.34 66.11 39.22 49.24 59.29 66.61
Reviews II Opinion Finder 60.50 69.14 34.56 46.08 57.70 85.27 68.83 57.46 62.66
Opinion Lexicon 70.11 69.28 74.97 72.01 71.15 65.00 67.93 69.97 77.98
PANAS-t 66.85 74.16 63.16 68.22 60.10 71.60 65.35 66.78 3.51
Pattern.en 65.90 65.26 68.80 66.98 66.61 62.96 64.74 65.86 90.54
SANN 62.89 62.25 71.18 66.42 63.80 54.07 58.53 62.48 68.64
SASA 57.40 56.00 61.91 58.81 59.07 53.06 55.90 57.35 58.98
SenticNet 55.51 53.35 88.66 66.61 66.21 22.28 33.34 49.98 95.33
Sentiment140 68.20 0.00 0.00 0.00 68.20 100.00 81.09 40.55 34.45
SentiStrength 69.17 74.17 66.77 70.28 64.33 72.04 67.97 69.12 27.13
SentiWordNet 61.99 61.55 71.05 65.96 62.65 52.25 56.98 61.47 62.81
SO-CAL 72.18 72.42 73.69 73.05 71.92 70.61 71.26 72.15 88.99
Stanford DM 86.17 89.11 82.37 85.61 83.64 89.95 86.68 86.15 91.46
Umigon 63.96 67.38 55.94 61.13 61.46 72.19 66.40 63.76 56.42
Vader 65.06 62.19 80.38 70.12 70.61 49.09 57.92 64.02 86.14
AFINN 87.18 94.67 90.06 92.31 54.70 70.33 61.54 76.92 74.82
Emolex 83.62 93.30 87.05 90.07 45.79 63.64 53.26 71.67 62.95
Emoticons 90.59 97.30 92.31 94.74 45.45 71.43 55.56 75.15 10.19
Emoticons DS 83.94 84.24 99.57 91.27 0.00 0.00 0.00 45.63 99.28
Happiness Index 88.71 90.37 97.25 93.69 67.50 35.53 46.55 70.12 65.83
NRC Hashtag 55.67 95.89 49.47 65.27 24.77 88.71 38.73 52.00 94.12
LIWC 83.07 90.30 90.12 90.21 37.18 37.66 37.42 63.82 68.71
Myspace Opinion Finder 72.78 94.27 73.04 82.31 28.83 71.11 41.03 61.67 40.53
Opinion Lexicon 84.54 94.17 87.19 90.55 49.11 69.62 57.59 74.07 62.83
PANAS-t 96.25 100.00 96.05 97.99 57.14 100.00 72.73 85.36 9.59
Pattern.en 83.99 93.41 87.64 90.43 43.80 60.92 50.96 70.70 76.38
SANN 81.22 92.01 85.44 88.60 39.77 56.45 46.67 67.64 51.08
SASA 61.69 92.09 57.82 71.04 30.04 78.49 43.45 57.24 59.47
SenticNet 84.08 90.17 91.19 90.68 47.12 44.14 45.58 68.13 88.13
Sentiment140 41.75 0.00 0.00 0.00 41.75 100.00 58.90 29.45 24.70
SentiStrength 97.72 100.00 97.50 98.73 79.31 100.00 88.46 93.60 31.53
SentiWordNet 77.72 92.31 81.15 86.37 30.30 54.79 39.02 62.70 67.27
SO-CAL 84.40 94.84 86.16 90.29 50.40 75.00 60.29 75.29 63.79
Stanford DM 43.04 96.08 33.73 49.94 20.78 92.66 33.95 41.94 82.73
Umigon 76.71 96.72 75.78 84.98 34.00 82.93 48.23 66.60 75.18
Vader 88.05 94.04 91.94 92.97 56.48 64.21 60.10 76.54 81.29
Table XI. 2-class experiment results with 4 datasets
Dataset Method Accur. Posit. Sentiment Negat. Sentiment MacroF1 Coverage
P R F1 P R F1
AFINN 78.69 80.17 87.87 83.84 75.43 63.14 68.74 76.29 62.80
Emolex 66.59 71.89 72.91 72.40 58.31 57.08 57.69 65.04 58.86
Emoticons 50.00 0.00 0.00 0.00 100.00 50.00 66.67 33.33 0.06
Emoticons DS 59.95 59.61 100.00 74.70 100.00 2.04 4.00 39.35 99.53
Happiness Index 65.93 65.23 96.75 77.92 74.24 15.38 25.49 51.70 46.59
NRC Hashtag 64.13 74.30 60.52 66.70 54.60 69.40 61.12 63.91 89.81
LIWC 60.97 64.39 81.77 72.05 48.75 27.72 35.34 53.69 55.57
Amazon Opinion Finder 68.07 78.34 68.95 73.35 54.97 66.53 60.20 66.77 37.40
Opinion Lexicon 80.82 82.25 88.48 85.25 77.85 67.96 72.57 78.91 67.15
PANAS-t 74.07 87.18 79.07 82.93 40.00 54.55 46.15 64.54 1.50
Pattern.en 71.57 76.75 77.51 77.13 62.95 61.93 62.43 69.78 76.68
SANN 72.03 73.92 87.60 80.18 65.87 43.64 52.50 66.34 44.27
SASA 62.18 66.94 73.56 70.10 52.84 44.91 48.55 59.32 66.43
SenticNet 63.70 63.38 92.52 75.23 65.81 21.21 32.09 53.66 91.41
Sentiment140 53.04 0.00 0.00 0.00 53.04 100.00 69.31 34.66 56.09
SentiStrength 90.52 92.24 95.51 93.85 84.31 75.00 79.38 86.62 19.58
SentiWordNet 72.89 77.52 82.35 79.86 62.42 55.11 58.54 69.20 53.85
SO-CAL 78.23 81.45 84.66 83.02 72.18 67.36 69.69 76.35 71.52
Stanford DM 68.53 89.26 54.54 67.71 56.38 89.96 69.31 68.51 80.28
Umigon 72.26 85.42 68.89 76.27 57.90 78.44 66.62 71.45 51.33
Vader 76.63 76.85 88.59 82.31 76.11 57.67 65.62 73.96 72.44
AFINN 71.08 58.32 83.71 68.74 86.38 63.34 73.09 70.92 59.58
Emolex 63.28 48.43 75.92 59.14 81.33 56.48 66.67 62.90 58.77
Emoticons 73.91 66.67 90.91 76.92 87.50 58.33 70.00 73.46 1.16
Emoticons DS 37.41 37.16 100.00 54.18 100.00 0.64 1.28 27.73 99.55
Happiness Index 50.24 39.22 83.45 53.36 79.40 33.04 46.66 50.01 42.95
NRC Hashtag 65.31 52.53 23.60 32.57 67.73 88.26 76.64 54.61 94.09
LIWC 64.64 56.44 96.19 71.14 92.42 38.50 54.35 62.75 61.45
Tweets Opinion Finder 72.93 56.72 63.68 60.00 81.97 77.26 79.55 69.77 33.60
DBT Opinion Lexicon 73.98 62.13 82.05 70.71 86.09 68.97 76.59 73.65 58.06
PANAS-t 79.59 43.75 87.50 58.33 96.97 78.05 86.49 72.41 2.48
Pattern.en 67.90 53.79 62.68 57.89 77.72 70.73 74.06 65.98 78.07
SANN 64.38 46.61 80.78 59.11 86.31 56.70 68.44 63.77 40.42
SASA 63.17 50.93 73.15 60.06 77.74 57.08 65.83 62.94 59.68
SenticNet 48.85 38.91 81.90 52.75 76.27 31.16 44.24 48.50 85.65
Sentiment140 78.42 0.00 0.00 0.00 78.42 100.00 87.91 43.95 46.13
SentiStrength 77.88 62.76 69.47 65.94 85.71 81.63 83.62 74.78 21.48
SentiWordNet 61.98 46.83 72.84 57.01 79.64 56.24 65.93 61.47 48.91
SO-CAL 72.96 61.36 75.70 67.78 82.98 71.31 76.70 72.24 57.55
Stanford DM 71.31 75.89 28.52 41.46 70.60 94.99 81.00 61.23 84.54
Umigon 76.62 66.45 73.50 69.80 83.59 78.44 80.93 75.37 38.91
Vader 68.95 55.31 84.05 66.72 86.49 60.07 70.90 68.81 70.14
AFINN 76.74 63.64 87.50 73.68 90.48 70.37 79.17 76.43 66.15
Emolex 68.42 50.00 83.33 62.50 88.89 61.54 72.73 67.61 58.46
Emoticons 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Emoticons DS 35.94 34.92 100.00 51.76 100.00 2.38 4.65 28.21 98.46
Happiness Index 54.84 50.00 92.86 65.00 80.00 23.53 36.36 50.68 47.69
NRC Hashtag 66.10 55.00 50.00 52.38 71.79 75.68 73.68 63.03 90.77
LIWC 67.50 56.00 87.50 68.29 86.67 54.17 66.67 67.48 61.54
Irony Opinion Finder 78.95 70.00 87.50 77.78 88.89 72.73 80.00 78.89 29.23
Opinion Lexicon 66.67 52.38 84.62 64.71 86.67 56.52 68.42 66.56 55.38
PANAS-t 100.00 0.00 0.00 0.00 100.00 100.00 100.00 50.00 1.54
Pattern.en 73.17 62.96 94.44 75.56 92.86 56.52 70.27 72.91 63.08
SANN 63.33 45.00 100.00 62.07 100.00 47.62 64.52 63.29 46.15
SASA 77.42 66.67 83.33 74.07 87.50 73.68 80.00 77.04 47.69
SenticNet 44.26 37.25 90.48 52.78 80.00 20.00 32.00 42.39 93.85
Sentiment140 83.87 0.00 0.00 0.00 83.87 100.00 91.23 45.61 47.69
SentiStrength 88.89 87.50 87.50 87.50 90.00 90.00 90.00 88.75 27.69
SentiWordNet 63.89 52.17 85.71 64.86 84.62 50.00 62.86 63.86 55.38
SO-CAL 78.57 65.00 86.67 74.29 90.91 74.07 81.63 77.96 64.62
Stanford DM 79.31 76.92 52.63 62.50 80.00 92.31 85.71 74.11 89.23
Umigon 67.65 56.25 69.23 62.07 77.78 66.67 71.79 66.93 52.31
Vader 71.43 55.17 94.12 69.57 95.00 59.38 73.08 71.32 75.38
AFINN 64.81 58.14 96.15 72.46 90.91 35.71 51.28 61.87 76.06
Emolex 55.00 51.52 89.47 65.38 71.43 23.81 35.71 50.55 56.34
Emoticons 100.00 100.00 100.00 100.00 0.00 0.00 0.00 50.00 1.41
Emoticons DS 45.07 45.71 96.97 62.14 0.00 0.00 0.00 31.07 100.00
Happiness Index 54.05 48.28 87.50 62.22 75.00 28.57 41.38 51.80 52.11
NRC Hashtag 70.59 66.67 70.97 68.75 74.29 70.27 72.22 70.49 95.77
LIWC 69.81 63.16 92.31 75.00 86.67 48.15 61.90 68.45 74.65
Sarcasm Opinion Finder 64.10 59.26 84.21 69.57 75.00 45.00 56.25 62.91 54.93
Opinion Lexicon 69.39 61.11 95.65 74.58 92.31 46.15 61.54 68.06 69.01
PANAS-t 50.00 0.00 0.00 0.00 100.00 50.00 66.67 33.33 2.82
Pattern.en 59.38 56.82 78.12 65.79 65.00 40.62 50.00 57.89 90.14
SANN 61.36 52.78 100.00 69.09 100.00 32.00 48.48 58.79 61.97
SASA 53.06 44.83 65.00 53.06 65.00 44.83 53.06 53.06 69.01
SenticNet 53.73 50.88 90.62 65.17 70.00 20.00 31.11 48.14 94.37
Sentiment140 72.09 0.00 0.00 0.00 72.09 100.00 83.78 41.89 60.56
SentiStrength 82.76 70.59 100.00 82.76 100.00 70.59 82.76 82.76 40.85
SentiWordNet 56.86 56.76 77.78 65.62 57.14 33.33 42.11 53.87 71.83
SO-CAL 67.31 55.88 90.48 69.09 88.89 51.61 65.31 67.20 73.24
Stanford DM 68.85 82.35 46.67 59.57 63.64 90.32 74.67 67.12 85.92
Umigon 69.39 66.67 75.00 70.59 72.73 64.00 68.09 69.34 69.01
Vader 62.50 55.77 96.67 70.73 91.67 32.35 47.83 59.28 90.14
Table XII. 2-class experiment results with 4 datasets
Dataset Method Accur. Posit. Sentiment Negat. Sentiment MacroF1 Coverage
P R F1 P R F1
AFINN 80.32 79.38 90.84 84.72 82.39 64.47 72.34 78.53 71.47
Emolex 73.28 74.49 83.19 78.60 70.93 59.03 64.43 71.52 59.02
Emoticons 85.43 90.27 90.27 90.27 71.05 71.05 71.05 80.66 13.19
Emoticons DS 58.95 58.84 99.63 73.99 72.22 1.37 2.70 38.34 99.83
Happiness Index 68.28 66.75 93.60 77.93 76.23 30.58 43.65 60.79 60.46
NRC Hashtag 65.61 72.95 65.91 69.26 57.31 65.19 61.00 65.13 95.54
LIWC 59.92 62.41 79.37 69.87 52.66 32.43 40.14 55.01 54.61
Tweets Opinion Finder 77.16 82.14 78.04 80.04 70.90 75.92 73.32 76.68 40.37
RND I Opinion Lexicon 81.56 82.00 87.68 84.74 80.84 72.98 76.71 80.73 63.74
PANAS-t 85.45 91.18 86.11 88.57 76.19 84.21 80.00 84.29 4.81
Pattern.en 78.02 79.84 85.52 82.58 74.60 66.33 70.22 76.40 77.72
SANN 75.61 75.48 87.04 80.85 75.89 59.06 66.43 73.64 50.15
SASA 65.60 70.72 70.36 70.54 58.47 58.89 58.68 64.61 58.67
SenticNet 68.48 67.00 91.22 77.26 74.35 36.16 48.65 62.96 92.44
Sentiment140 70.19 0.00 0.00 0.00 70.19 100.00 82.49 41.24 40.45
SentiStrength 93.72 94.26 96.33 95.28 92.61 88.68 90.60 92.94 27.13
SentiWordNet 70.70 76.03 77.64 76.83 61.27 59.11 60.17 68.50 62.78
SO-CAL 80.85 82.08 85.98 83.98 78.92 73.66 76.20 80.09 64.57
Stanford DM 54.19 87.02 25.40 39.33 47.44 94.67 63.21 51.27 92.70
Umigon 82.07 89.22 80.71 84.76 73.02 84.26 78.24 81.50 67.50
Vader 80.12 78.73 91.76 84.75 83.39 62.52 71.46 78.10 81.08
AFINN 96.37 97.66 96.94 97.30 93.75 95.19 94.47 95.88 80.77
Emolex 86.06 89.82 89.11 89.47 78.77 80.00 79.38 84.42 63.58
Emoticons 97.75 97.90 99.42 98.65 96.97 89.72 93.20 95.93 14.82
Emoticons DS 71.04 70.61 99.90 82.74 95.83 5.43 10.28 46.51 99.09
Happiness Index 82.39 81.92 95.51 88.20 84.30 53.33 65.33 76.77 58.60
NRC Hashtag 67.37 83.76 65.43 73.47 48.17 71.69 57.62 65.55 91.94
LIWC 66.47 74.46 78.81 76.58 44.20 38.31 41.04 58.81 73.93
Tweets Opinion Finder 78.32 93.86 71.11 80.92 63.42 91.50 74.92 77.92 41.23
RND II Opinion Lexicon 93.45 97.03 93.14 95.04 86.93 94.11 90.38 92.71 70.64
PANAS-t 90.71 96.95 88.19 92.36 82.11 95.12 88.14 90.25 5.39
Pattern.en 87.11 93.13 88.42 90.72 74.49 83.83 78.89 84.80 80.03
SANN 83.80 89.89 86.50 88.16 71.39 77.58 74.36 81.26 52.67
SASA 70.06 82.81 72.81 77.49 49.05 63.39 55.30 66.40 63.04
SenticNet 83.28 82.92 95.28 88.67 84.60 56.92 68.05 78.36 89.63
Sentiment140 59.94 0.00 0.00 0.00 59.94 100.00 74.95 37.48 38.49
SentiStrength 96.97 98.92 96.43 97.66 93.54 98.01 95.72 96.69 34.65
SentiWordNet 78.57 87.88 80.91 84.25 61.09 72.87 66.46 75.36 61.49
SO-CAL 87.76 94.25 86.99 90.47 77.34 89.32 82.90 86.68 67.18
Stanford DM 60.46 94.48 44.87 60.84 44.06 94.30 60.06 60.45 88.89
Umigon 88.63 97.73 85.92 91.45 73.64 95.17 83.03 87.24 70.83
Vader 98.97 99.05 99.45 99.25 98.77 97.89 98.33 98.79 94.61
AFINN 86.66 87.38 91.11 89.21 85.43 79.84 82.54 85.87 78.81
Emolex 82.02 83.90 87.55 85.69 78.64 73.19 75.82 80.75 67.07
Emoticons 92.74 94.66 95.38 95.02 87.50 85.71 86.60 90.81 14.59
Emoticons DS 61.98 61.51 99.73 76.09 90.00 3.77 7.23 41.66 99.02
Happiness Index 75.26 73.05 95.82 82.90 85.40 40.91 55.32 69.11 62.27
NRC Hashtag 79.28 84.99 80.22 82.54 71.52 77.80 74.53 78.53 95.19
LIWC 60.65 63.82 77.81 70.12 52.35 35.60 42.38 56.25 50.12
Tweets Opinion Finder 81.71 88.64 79.87 84.03 73.48 84.50 78.60 81.32 40.99
RND III Opinion Lexicon 88.21 88.43 92.79 90.56 87.82 81.07 84.31 87.43 70.50
PANAS-t 94.05 95.38 96.88 96.12 89.47 85.00 87.18 91.65 6.85
Pattern.en 85.03 87.12 89.45 88.27 81.23 77.54 79.34 83.81 82.23
SANN 80.66 81.26 88.67 84.81 79.46 68.20 73.40 79.10 54.36
SASA 76.94 80.44 81.41 80.92 71.52 70.21 70.86 75.89 67.16
SenticNet 78.54 75.92 94.81 84.32 86.84 53.23 66.00 75.16 90.38
Sentiment140 72.71 0.00 0.00 0.00 72.71 100.00 84.20 42.10 44.50
SentiStrength 94.99 96.27 96.57 96.42 91.97 91.30 91.64 94.03 37.41
SentiWordNet 72.54 79.19 78.75 78.97 60.14 60.76 60.45 69.71 67.97
SO-CAL 88.27 90.73 90.25 90.49 84.33 85.06 84.69 87.59 74.33
Stanford DM 65.73 94.06 45.09 60.96 54.46 95.84 69.46 65.21 86.80
Umigon 86.86 92.84 85.36 88.95 78.96 89.30 83.81 86.38 80.03
Vader 86.79 86.66 92.64 89.55 87.03 77.59 82.04 85.79 86.96
AFINN 76.77 76.15 86.84 81.15 77.94 63.10 69.74 75.44 71.22
Emolex 72.56 75.93 81.19 78.47 66.07 58.73 62.18 70.33 58.99
Emoticons 94.52 98.18 91.53 94.74 90.83 98.02 94.29 94.51 78.78
Emoticons DS 57.82 57.72 99.37 73.02 66.67 1.71 3.33 38.18 98.92
Happiness Index 67.50 64.39 94.44 76.58 82.14 32.86 46.94 61.76 57.55
NRC Hashtag 61.83 67.36 64.67 65.99 55.08 58.04 56.52 61.25 94.24
LIWC 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Tweets Opinion Finder 70.34 75.00 73.91 74.45 64.00 65.31 64.65 69.55 42.45
RND IV Opinion Lexicon 79.35 79.17 87.96 83.33 79.69 67.11 72.86 78.10 66.19
PANAS-t 56.52 63.64 53.85 58.33 50.00 60.00 54.55 56.44 8.27
Pattern.en 91.76 92.76 92.76 92.76 90.43 90.43 90.43 91.60 96.04
SANN 72.14 73.12 82.93 77.71 70.21 56.90 62.86 70.29 50.36
SASA 64.16 67.96 70.71 69.31 58.57 55.41 56.94 63.13 62.23
SenticNet 66.53 65.05 87.68 74.69 71.19 39.25 50.60 62.65 88.13
Sentiment140 75.74 0.00 0.00 0.00 75.74 100.00 86.19 43.10 48.92
SentiStrength 89.77 92.06 93.55 92.80 84.00 80.77 82.35 87.58 31.65
SentiWordNet 66.67 72.73 74.23 73.47 56.14 54.24 55.17 64.32 56.12
SO-CAL 75.53 73.50 85.15 78.90 78.87 64.37 70.89 74.89 67.63
Stanford DM 61.51 83.82 39.86 54.03 53.26 89.91 66.89 60.46 90.65
Umigon 91.25 95.65 88.59 91.99 86.40 94.74 90.38 91.18 94.60
Vader 83.66 81.18 93.24 86.79 88.51 70.64 78.57 82.68 92.45
Table XIII. 2-class experiment results with 4 datasets
Dataset Method Accur. Posit. Sentiment Negat. Sentiment MacroF1 Coverage
P R F1 P R F1
AFINN 84.42 80.62 91.49 85.71 89.66 77.04 82.87 84.29 76.88
Emolex 79.65 76.09 88.98 82.03 85.23 69.44 76.53 79.28 62.95
Emoticons 85.42 80.65 96.15 87.72 94.12 72.73 82.05 84.89 13.37
Emoticons DS 51.96 51.41 100.00 67.91 100.00 2.27 4.44 36.18 99.72
Happiness Index 65.93 58.72 94.39 72.40 88.89 40.34 55.49 63.95 62.95
NRC Hashtag 71.30 73.05 70.93 71.98 69.51 71.70 70.59 71.28 92.20
LIWC 64.29 63.75 76.12 69.39 65.22 50.85 57.14 63.27 70.39
Tweets Opinion Finder 80.77 81.16 76.71 78.87 80.46 84.34 82.35 80.61 43.45
STF Opinion Lexicon 86.10 83.67 91.11 87.23 89.29 80.65 84.75 85.99 72.14
PANAS-t 94.12 88.89 100.00 94.12 100.00 88.89 94.12 94.12 4.74
Pattern.en 77.85 75.69 85.09 80.12 80.95 69.86 75.00 77.56 85.52
SANN 73.21 69.35 82.69 75.44 78.82 63.81 70.53 72.98 58.22
SASA 68.52 65.65 78.90 71.67 72.94 57.94 64.58 68.12 60.17
SenticNet 72.62 66.80 93.06 77.78 87.37 50.92 64.34 71.06 93.59
Sentiment140 75.53 0.00 0.00 0.00 75.53 100.00 86.06 43.03 52.37
SentiStrength 95.33 95.18 96.34 95.76 95.52 94.12 94.81 95.29 41.78
SentiWordNet 72.99 73.17 78.95 75.95 72.73 65.98 69.19 72.57 58.77
SO-CAL 87.36 82.89 93.33 87.80 92.80 81.69 86.89 87.35 77.16
Stanford DM 66.56 87.69 36.31 51.35 61.24 95.18 74.53 62.94 89.97
Umigon 86.99 91.73 81.88 86.52 83.02 92.31 87.42 86.97 81.34
Vader 94.12 100.00 90.48 95.00 86.67 100.00 92.86 93.93 9.47
AFINN 77.85 72.73 86.72 79.11 84.83 69.54 76.43 77.77 69.94
Emolex 72.06 65.42 85.67 74.19 82.61 60.06 69.55 71.87 60.04
Emoticons 85.71 87.50 89.74 88.61 82.61 79.17 80.85 84.73 5.77
Emoticons DS 47.20 47.13 99.21 63.90 55.56 0.88 1.74 32.82 98.08
Happiness Index 64.23 59.63 89.92 71.70 78.81 38.11 51.38 61.54 45.10
NRC Hashtag 69.77 72.01 58.71 64.69 68.36 79.63 73.57 69.13 93.68
LIWC 77.27 71.37 92.05 80.40 88.38 62.10 72.95 76.67 63.70
Tweets Opinion Finder 68.60 63.54 71.35 67.22 73.80 66.35 69.87 68.55 34.74
SAN Opinion Lexicon 81.77 78.18 86.74 82.24 85.98 77.05 81.27 81.75 65.35
PANAS-t 84.62 80.00 80.00 80.00 87.50 87.50 87.50 83.75 2.38
Pattern.en 74.62 68.46 90.59 77.98 86.40 58.90 70.04 74.01 72.59
SANN 70.02 64.40 85.95 73.63 80.46 54.90 65.27 69.45 45.55
SASA 61.18 61.74 74.71 67.61 60.12 45.16 51.58 59.59 43.45
SenticNet 61.12 55.40 91.21 68.93 81.22 34.13 48.06 58.50 94.78
Sentiment140 72.45 0.00 0.00 0.00 72.45 100.00 84.02 42.01 53.90
SentiStrength 89.47 88.70 91.81 90.23 90.41 86.84 88.59 89.41 29.61
SentiWordNet 67.49 64.25 75.63 69.48 71.90 59.70 65.23 67.35 59.21
SO-CAL 79.70 74.74 84.55 79.34 85.11 75.56 80.05 79.70 68.19
Stanford DM 62.72 87.60 22.60 35.93 59.35 97.25 73.71 54.82 92.94
Umigon 82.41 83.66 80.49 82.04 81.25 84.32 82.76 82.40 67.74
Vader 77.18 71.81 88.15 79.15 85.34 66.59 74.81 76.98 78.74
AFINN 86.05 90.95 90.01 90.48 72.88 75.00 73.93 82.20 76.83
Emolex 79.82 86.58 86.48 86.53 59.70 59.93 59.81 73.17 70.29
Emoticons 92.24 95.09 95.45 95.27 78.95 77.59 78.26 86.77 10.52
Emoticons DS 72.75 72.72 100.00 84.20 100.00 0.36 0.71 42.46 100.00
Happiness Index 77.08 78.14 95.34 85.88 68.30 27.37 39.08 62.48 68.01
NRC Hashtag 72.15 83.82 76.83 80.17 48.25 59.29 53.20 66.69 96.80
LIWC 64.54 74.33 78.88 76.54 30.19 25.12 27.42 51.98 53.17
Tweets Opinion Finder 75.95 90.13 74.02 81.28 56.40 80.57 66.35 73.82 38.86
Semeval Opinion Lexicon 86.20 92.07 88.90 90.46 71.75 78.65 75.04 82.75 69.61
PANAS-t 91.76 96.63 93.49 95.04 70.21 82.50 75.86 85.45 8.33
Pattern.en 77.94 89.27 80.02 84.39 55.14 71.85 62.39 73.39 83.40
SANN 79.33 84.91 87.37 86.12 62.13 57.18 59.55 72.84 53.92
SASA 75.63 81.44 87.26 84.25 52.31 41.26 46.13 65.19 53.24
SenticNet 75.59 79.11 90.33 84.35 58.22 36.10 44.57 64.46 95.59
Sentiment140 51.68 0.00 0.00 0.00 51.68 100.00 68.14 34.07 36.05
SentiStrength 91.11 96.17 91.78 93.93 78.40 89.09 83.40 88.66 28.66
SentiWordNet 69.93 86.53 72.04 78.62 40.52 62.93 49.29 63.96 70.20
SO-CAL 82.52 90.17 85.03 87.53 66.28 76.05 70.83 79.18 69.93
Stanford DM 41.13 94.17 19.78 32.70 31.64 96.81 47.69 40.19 92.32
Umigon 81.44 94.98 79.34 86.46 59.02 87.64 70.54 78.50 68.86
Vader 85.65 89.22 91.39 90.29 75.08 70.13 72.52 81.40 86.31
AFINN 82.25 83.16 93.98 88.24 78.63 53.80 63.89 76.06 83.12
Emolex 74.54 79.14 86.39 82.60 59.69 46.95 52.56 67.58 77.45
Emoticons 88.89 91.30 95.45 93.33 75.00 60.00 66.67 80.00 15.32
Emoticons DS 69.23 69.05 100.00 81.69 100.00 1.82 3.57 42.63 99.57
Happiness Index 75.14 76.04 94.56 84.30 68.66 28.57 40.35 62.32 77.59
NRC Hashtag 65.06 81.10 63.52 71.24 46.71 68.35 55.49 63.37 97.02
LIWC 83.18 83.12 96.77 89.43 83.54 45.52 58.93 74.18 77.59
RW Opinion Finder 61.30 79.91 58.81 67.75 42.04 66.90 51.63 59.69 65.25
Opinion Lexicon 80.25 83.89 89.03 86.39 69.50 59.39 64.05 75.22 79.01
PANAS-t 63.64 80.00 64.52 71.43 42.11 61.54 50.00 60.71 12.48
Pattern.en 71.57 84.20 73.86 78.69 50.64 65.92 57.28 67.99 87.80
SANN 75.75 82.43 84.25 83.33 57.14 53.90 55.47 69.40 71.35
SASA 66.00 78.04 71.33 74.53 44.83 53.72 48.87 61.70 56.74
SenticNet 72.84 73.82 94.18 82.77 65.38 24.76 35.92 59.34 95.04
Sentiment140 51.32 0.00 0.00 0.00 51.32 100.00 67.83 33.91 48.37
SentiStrength 85.89 93.69 86.67 90.04 69.23 83.72 75.79 82.92 23.12
SentiWordNet 68.94 80.90 74.21 77.41 45.92 55.56 50.28 63.85 81.28
SO-CAL 76.03 86.12 78.15 81.94 58.74 71.18 64.36 73.15 79.29
Stanford DM 43.10 96.25 17.46 29.56 35.58 98.53 52.28 40.92 91.49
Umigon 60.32 91.43 48.12 63.05 42.02 89.29 57.14 60.10 80.43
Vader 81.70 81.70 95.28 87.97 81.74 49.74 61.84 74.90 89.93
Table XIV. 3-class experiment results with 4 datasets
Dataset Method Accur. Posit. Sentiment Negat. Sentiment Neut. Sentiment MacroF1
P R F1 P R F1 P R F1
AFINN 50.10 16.22 60.61 25.59 82.62 54.14 65.42 40.11 30.24 34.48 41.83
Emolex 44.10 15.51 65.66 25.10 83.19 45.62 58.93 35.27 31.85 33.47 39.17
Emoticons 24.60 0.00 0.00 0.00 33.33 25.00 28.57 19.77 98.79 32.95 20.51
Emoticons DS 10.00 9.85 98.99 17.92 66.67 0.22 0.44 0.00 0.00 0.00 9.18
Happiness Index 33.60 11.83 64.65 20.00 84.93 28.05 42.18 26.46 34.68 30.02 30.73
NRC Hashtag 64.00 20.72 23.23 21.90 70.20 87.13 77.76 52.50 8.47 14.58 38.08
LIWC 33.00 11.11 42.42 17.61 67.69 39.57 49.94 22.90 27.42 24.95 30.84
Comments Opinion Finder 51.80 14.96 35.35 21.02 78.76 66.39 72.04 33.71 36.29 34.95 42.67
BBC Opinion Lexicon 55.00 20.67 62.63 31.08 85.27 61.98 71.79 40.82 40.32 40.57 47.81
PANAS-t 27.10 16.67 6.06 8.89 75.61 50.82 60.78 25.35 94.35 39.97 36.55
Pattern.en 46.00 14.39 58.59 23.11 77.30 49.93 60.67 38.16 23.39 29.00 37.59
SANN 40.10 14.50 59.60 23.32 79.49 41.61 54.63 33.45 37.90 35.54 37.83
SASA 38.20 17.03 47.47 25.07 70.75 50.86 59.18 25.19 39.52 30.77 38.34
SenticNet 27.90 11.91 88.89 21.00 82.69 20.90 33.37 26.39 7.66 11.88 22.08
Sentiment140 50.60 0.00 0.00 0.00 73.23 100.00 84.54 28.60 58.47 38.41 40.98
SentiStrength 44.20 47.37 18.18 26.28 86.64 91.45 88.98 29.37 84.68 43.61 52.96
SentiWordNet 42.40 14.90 59.60 23.84 81.63 44.57 57.66 34.56 37.90 36.15 39.22
SO-CAL 55.50 20.88 57.58 30.65 80.47 65.61 72.28 28.57 34.68 31.33 44.75
Stanford DM 65.50 43.37 36.36 39.56 71.01 92.54 80.36 37.50 14.52 20.93 46.95
Umigon 45.70 28.35 36.36 31.86 76.35 74.65 75.49 29.31 61.69 39.74 49.03
Vader 49.10 15.96 71.72 26.10 82.57 49.05 61.54 50.42 24.19 32.70 40.11
AFINN 52.09 33.78 60.00 43.22 80.06 53.92 64.44 42.57 49.49 45.77 51.14
Emolex 43.64 25.00 43.33 31.71 75.16 46.05 57.11 36.23 49.49 41.83 43.55
Emoticons 28.69 61.90 6.19 11.26 60.00 42.86 50.00 21.71 98.31 35.56 32.27
Emoticons DS 20.98 19.92 99.05 33.17 66.67 1.18 2.32 44.44 2.71 5.11 13.54
Happiness Index 34.54 21.15 50.95 29.89 77.14 21.30 33.38 26.70 53.22 35.56 32.94
NRC Hashtag 53.57 35.02 36.19 35.60 60.57 76.81 67.73 38.20 11.53 17.71 40.35
LIWC 31.10 20.45 34.76 25.75 51.12 32.54 39.77 27.65 42.37 33.47 32.99
Comments Opinion Finder 46.24 32.69 32.38 32.54 71.64 63.64 67.40 35.10 62.71 45.01 48.32
Digg Opinion Lexicon 49.77 34.67 57.62 43.29 78.43 54.12 64.05 37.92 49.49 42.94 50.09
PANAS-t 28.23 11.11 0.48 0.91 66.67 66.67 66.67 27.49 97.29 42.87 36.82
Pattern.en 49.03 33.42 60.48 43.05 69.67 52.35 59.78 41.28 41.69 41.48 48.11
SANN 41.88 26.89 45.71 33.86 75.79 42.26 54.26 35.04 55.59 42.99 43.70
SASA 43.36 29.81 44.29 35.63 64.25 53.99 58.68 32.05 39.66 35.45 43.25
SenticNet 34.35 22.57 81.90 35.39 73.37 18.62 29.70 32.47 21.36 25.77 30.29
Sentiment140 49.30 0.00 0.00 0.00 65.70 100.00 79.30 31.93 56.61 40.83 40.04
SentiStrength 42.53 64.00 22.86 33.68 85.71 84.75 85.23 31.44 88.14 46.35 55.09
SentiWordNet 42.15 27.53 41.43 33.08 73.43 46.50 56.94 34.29 56.95 42.80 44.27
SO-CAL 53.57 38.54 52.86 44.58 75.47 64.39 69.49 28.57 49.49 36.23 50.10
Stanford DM 56.73 45.93 29.52 35.94 63.42 86.20 73.08 41.70 31.53 35.91 48.31
Umigon 53.57 52.97 46.67 49.62 71.62 78.25 74.79 36.48 56.27 44.27 56.23
Vader 53.02 32.53 70.95 44.61 81.30 49.26 61.35 48.80 41.36 44.77 50.24
AFINN 42.45 64.81 41.79 50.81 80.29 68.59 73.98 7.89 77.87 14.32 46.37
Emolex 42.97 55.12 53.72 54.41 75.35 48.67 59.14 7.22 54.10 12.74 42.10
Emoticons 4.68 0.00 0.00 0.00 0.00 0.00 0.00 4.47 99.59 8.56 2.85
Emoticons DS 42.58 42.55 99.77 59.66 78.57 0.37 0.73 0.00 0.00 0.00 30.20
Happiness Index 31.81 48.42 50.18 49.29 71.70 25.96 38.12 5.36 54.10 9.76 32.39
NRC Hashtag 54.84 55.38 45.74 50.10 61.55 68.92 65.03 8.33 15.16 10.76 41.96
LIWC 24.35 42.88 27.72 33.67 53.42 39.12 45.16 4.67 53.28 8.58 29.14
Comments Opinion Finder 29.38 68.77 18.78 29.51 76.52 82.66 79.47 6.29 88.11 11.75 40.24
NYT Opinion Lexicon 44.57 65.95 43.15 52.17 79.81 70.65 74.95 7.94 73.77 14.34 47.15
PANAS-t 5.88 69.23 1.23 2.41 62.07 75.00 67.92 4.75 99.18 9.07 26.47
Pattern.en 45.39 55.15 44.69 49.37 63.65 61.12 62.36 7.85 45.90 13.41 41.71
SANN 27.92 56.74 29.40 38.73 78.02 55.13 64.61 5.93 79.51 11.04 38.13
SASA 30.04 49.92 30.13 37.58 59.11 52.83 55.80 5.74 61.07 10.49 34.62
SenticNet 50.06 48.30 84.98 61.59 77.05 25.27 38.06 9.81 19.26 13.00 37.55
Sentiment140 34.66 0.00 0.00 0.00 65.76 100.00 79.34 5.83 64.34 10.69 30.01
SentiStrength 18.17 78.51 8.62 15.54 81.12 90.91 85.74 5.41 95.49 10.24 37.17
SentiWordNet 32.20 57.35 34.53 43.10 70.31 56.63 62.73 6.08 70.08 11.19 39.01
SO-CAL 50.79 64.36 51.13 56.99 77.25 68.36 72.53 8.68 65.98 15.34 48.29
Stanford DM 51.93 73.39 21.14 32.83 59.48 92.67 72.46 9.65 38.11 15.40 40.23
Umigon 24.08 68.76 16.38 26.46 68.78 80.38 74.13 5.88 88.93 11.04 37.21
Vader 48.84 61.96 52.40 56.78 80.09 63.00 70.52 9.51 70.90 16.77 48.03
AFINN 54.11 62.78 69.50 65.97 75.74 57.61 65.44 21.83 49.11 30.22 53.88
Emolex 45.77 54.49 61.01 57.57 68.78 46.53 55.51 17.63 43.75 25.13 46.07
Emoticons 14.66 100.00 0.94 1.87 88.89 100.00 94.12 11.93 100.00 21.31 39.10
Emoticons DS 37.90 37.90 100.00 54.97 0.00 0.00 0.00 0.00 0.00 0.00 27.48
Happiness Index 37.90 51.85 61.64 56.32 71.88 27.49 39.77 12.68 47.32 20.00 38.70
NRC Hashtag 57.69 57.54 51.57 54.39 63.86 71.99 67.68 13.43 8.04 10.06 44.04
LIWC 32.66 41.00 38.68 39.81 54.59 36.33 43.63 14.12 44.64 21.46 34.96
Comments Opinion Finder 42.31 59.31 38.05 46.36 64.75 68.44 66.54 15.13 48.21 23.03 45.31
TED Opinion Lexicon 54.47 62.76 67.30 64.95 72.97 59.81 65.74 22.59 48.21 30.77 53.82
PANAS-t 14.78 90.00 2.83 5.49 55.56 83.33 66.67 13.41 98.21 23.61 31.92
Pattern.en 52.32 56.78 69.81 62.62 62.25 52.66 57.06 19.86 25.89 22.48 47.39
SANN 48.87 60.62 55.66 58.03 68.27 61.67 64.80 17.39 42.86 24.74 49.19
SASA 40.52 51.77 50.63 51.19 63.80 48.45 55.08 12.38 33.93 18.14 41.47
SenticNet 46.48 42.99 87.74 57.70 70.13 22.59 34.18 7.69 2.68 3.97 31.95
Sentiment140 34.21 0.00 0.00 0.00 64.24 100.00 78.23 14.73 66.96 24.15 34.13
SentiStrength 33.61 80.45 33.65 47.45 75.25 74.51 74.88 16.36 88.39 27.62 49.98
SentiWordNet 36.11 50.15 51.89 51.00 55.41 33.33 41.62 15.47 50.00 23.63 38.75
SO-CAL 52.68 65.01 70.13 67.47 64.56 60.53 62.48 14.23 31.25 19.55 49.84
Stanford DM 55.66 76.30 50.63 60.87 60.13 85.21 70.50 12.08 16.07 13.79 48.39
Umigon 40.52 68.49 51.26 58.63 58.96 57.63 58.29 17.52 66.96 27.78 48.23
Vader 57.57 60.61 74.53 66.85 71.96 58.04 64.25 21.71 29.46 25.00 52.04
Table XV. 3-class experiment results with 4 datasets
Dataset Method Accur. Posit. Sentiment Negat. Sentiment Neut. Sentiment MacroF1
P R F1 P R F1 P R F1
AFINN 59.52 70.31 71.53 70.91 59.14 41.65 48.88 43.22 49.03 45.94 55.24
Emolex 48.66 65.82 49.97 56.81 49.02 40.90 44.59 34.38 54.05 42.03 47.81
Emoticons 32.55 74.09 10.99 19.14 37.21 20.00 26.02 22.60 93.33 36.39 27.18
Emoticons DS 49.16 49.39 97.96 65.67 45.45 0.59 1.17 40.96 3.49 6.43 24.43
Happiness Index 45.82 56.73 55.44 56.08 50.55 16.29 24.64 24.94 51.38 33.58 38.10
NRC Hashtag 50.19 71.46 57.30 63.60 36.80 61.75 46.12 35.16 14.46 20.49 43.40
LIWC 40.97 52.93 52.67 52.80 26.79 14.00 18.39 30.72 40.21 34.83 35.34
Comments Opinion Finder 42.41 69.84 30.87 42.82 41.98 52.56 46.68 32.85 70.26 44.77 44.76
YTB Opinion Lexicon 56.74 72.29 63.60 67.67 54.33 45.94 49.78 40.47 54.26 46.36 54.60
PANAS-t 29.12 51.72 0.90 1.77 46.67 60.00 52.50 28.68 98.05 44.38 32.88
Pattern.en 57.73 70.62 73.33 71.95 47.91 41.94 44.73 41.56 38.87 40.17 52.28
SANN 49.46 67.21 51.95 58.60 47.83 34.27 39.93 36.14 61.54 45.54 48.02
SASA 46.61 66.86 49.31 56.76 34.72 48.55 40.48 35.69 39.28 37.40 44.88
SenticNet 53.01 56.83 80.48 66.62 49.24 18.30 26.68 28.88 24.41 26.46 39.92
Sentiment140 29.70 0.00 0.00 0.00 38.35 100.00 55.44 24.91 56.00 34.48 29.97
SentiStrength 51.01 90.27 41.80 57.14 67.38 71.70 69.47 36.19 87.38 51.19 59.27
SentiWordNet 47.93 67.20 50.33 57.55 39.67 37.17 38.38 35.68 56.72 43.80 46.58
SO-CAL 57.15 74.11 62.22 67.65 54.36 52.43 53.38 28.65 52.51 37.07 52.70
Stanford DM 47.05 81.84 47.09 59.78 32.80 76.16 45.86 34.88 26.97 30.42 45.35
Umigon 57.21 79.34 62.28 69.78 44.02 59.09 50.45 43.00 53.54 47.69 55.98
Vader 61.11 68.99 78.02 73.22 57.76 40.53 47.64 46.11 39.49 42.54 54.47
AFINN 62.25 83.92 68.38 75.35 41.29 41.03 41.16 33.12 50.24 39.92 52.14
Emolex 52.35 81.59 55.56 66.10 33.11 35.77 34.39 25.54 51.21 34.08 44.86
Emoticons 26.22 86.75 10.26 18.34 38.46 31.25 34.48 17.18 94.69 29.08 27.30
Emoticons DS 67.15 67.61 99.00 80.35 0.00 0.00 0.00 40.00 1.93 3.69 28.01
Happiness Index 56.87 77.44 65.53 70.99 47.37 16.77 24.77 21.21 50.72 29.91 41.89
NRC Hashtag 44.19 86.97 46.58 60.67 18.55 69.18 29.26 31.94 11.11 16.49 35.47
LIWC 54.85 77.20 63.68 69.79 27.36 18.01 21.72 26.69 45.89 33.75 41.75
Myspace Opinion Finder 38.71 85.94 30.48 45.01 23.02 47.76 31.07 24.04 75.85 36.51 37.53
Opinion Lexicon 52.74 83.98 55.27 66.67 33.74 42.64 37.67 25.48 51.21 34.03 46.12
PANAS-t 26.80 97.33 10.40 18.79 40.00 66.67 50.00 21.13 97.58 34.74 34.51
Pattern.en 60.33 82.82 68.66 75.08 31.36 34.64 32.92 32.07 44.93 37.42 48.47
SANN 45.82 80.99 44.30 57.27 29.66 32.41 30.97 24.30 63.29 35.12 41.12
SASA 39.39 82.33 33.19 47.31 23.10 59.35 33.26 23.53 50.24 32.05 37.54
SenticNet 64.27 76.17 81.05 78.54 34.03 21.59 26.42 25.37 24.64 25.00 43.32
Sentiment140 21.52 0.00 0.00 0.00 31.27 100.00 47.65 18.02 66.67 28.37 25.34
SentiStrength 43.61 96.69 33.33 49.58 74.19 74.19 74.19 25.65 95.17 40.41 54.73
SentiWordNet 52.45 82.50 56.41 67.01 22.47 32.26 26.49 28.72 53.14 37.29 43.59
SO-CAL 53.99 85.40 54.99 66.90 36.21 48.84 41.58 21.40 54.59 30.75 46.41
Stanford DM 35.35 89.50 27.92 42.56 16.64 81.45 27.63 33.02 34.30 33.65 34.62
Umigon 56.29 88.82 58.83 70.78 25.76 56.67 35.42 33.65 50.72 40.46 48.89
Vader 65.80 82.33 76.35 79.23 41.78 34.66 37.89 36.07 42.51 39.02 52.05
AFINN 45.06 35.38 51.37 41.90 61.16 40.33 48.61 43.70 49.32 46.34 45.62
Emolex 40.83 27.79 42.33 33.55 58.98 34.72 43.71 41.80 46.54 44.04 40.43
Emoticons 38.57 29.41 1.37 2.62 43.75 22.58 29.79 27.87 97.86 43.39 25.26
Emoticons DS 23.32 22.84 99.86 37.17 42.11 0.32 0.64 66.67 1.43 2.80 13.54
Happiness Index 36.87 23.84 33.15 27.74 56.57 19.31 28.79 28.80 60.92 39.11 31.88
NRC Hashtag 41.07 27.81 21.37 24.17 43.34 72.35 54.21 49.35 9.05 15.30 31.23
LIWC 33.45 28.48 72.60 40.91 80.76 16.13 26.89 28.02 23.59 25.61 31.14
Tweets Opinion Finder 43.33 34.62 18.49 24.11 56.82 57.85 57.33 41.13 72.92 52.59 44.68
DBT Opinion Lexicon 47.10 38.73 49.45 43.44 61.05 46.13 52.55 44.85 53.61 48.84 48.28
PANAS-t 39.28 23.33 0.96 1.84 71.11 58.18 64.00 38.98 97.93 55.77 40.54
Pattern.en 40.61 32.73 46.71 38.49 47.33 50.25 48.74 38.00 21.13 27.16 38.13
SANN 41.57 29.06 28.22 28.63 59.54 38.05 46.43 41.34 66.00 50.84 41.97
SASA 39.87 30.33 44.79 36.17 51.29 35.81 42.17 40.58 43.29 41.89 40.08
SenticNet 31.72 22.97 66.30 34.12 53.09 17.49 26.31 29.18 15.81 20.50 26.98
Sentiment140 44.84 0.00 0.00 0.00 49.86 100.00 66.54 40.84 58.46 48.09 38.21
SentiStrength 43.92 41.36 12.47 19.16 64.34 65.04 64.69 41.25 86.66 55.89 46.58
SentiWordNet 39.96 28.41 33.42 30.71 52.82 36.66 43.28 40.70 55.12 46.83 40.27
SO-CAL 47.25 38.76 44.38 41.38 58.75 49.75 53.88 31.23 55.52 39.98 45.08
Stanford DM 44.47 47.09 23.29 31.16 44.02 84.27 57.83 44.67 19.62 27.26 38.75
Umigon 44.66 40.47 28.49 33.44 57.97 55.52 56.72 41.45 67.99 51.50 47.22
Vader 44.75 33.23 59.18 42.56 61.40 37.69 46.71 45.43 39.08 42.02 43.76
AFINN 56.79 63.64 63.64 63.64 79.17 70.37 74.51 37.14 81.25 50.98 63.04
Emolex 45.68 41.67 45.45 43.48 84.21 53.33 65.31 28.95 68.75 40.74 49.84
Emoticons 19.75 0.00 0.00 0.00 0.00 0.00 0.00 16.49 100.00 28.32 9.44
Emoticons DS 30.86 28.57 100.00 44.44 100.00 1.79 3.51 66.67 12.50 21.05 23.00
Happiness Index 37.04 44.83 59.09 50.98 80.00 20.00 32.00 21.67 81.25 34.21 39.06
NRC Hashtag 53.09 47.83 50.00 48.89 58.33 70.00 63.64 40.00 25.00 30.77 47.76
LIWC 49.38 50.00 63.64 56.00 86.67 48.15 61.90 34.21 81.25 48.15 55.35
Irony Opinion Finder 38.27 70.00 31.82 43.75 88.89 72.73 80.00 25.81 100.00 41.03 54.93
Opinion Lexicon 46.91 47.83 50.00 48.89 86.67 52.00 65.00 32.56 87.50 47.46 53.78
PANAS-t 20.99 0.00 0.00 0.00 100.00 100.00 100.00 20.00 100.00 33.33 44.44
Pattern.en 53.09 62.96 77.27 69.39 76.47 56.52 65.00 35.14 81.25 49.06 61.15
SANN 40.74 40.91 40.91 40.91 100.00 43.48 60.61 28.57 87.50 43.08 48.20
SASA 39.51 55.56 45.45 50.00 66.67 63.64 65.12 19.05 50.00 27.59 47.57
SenticNet 41.98 33.33 86.36 48.10 61.54 17.39 27.12 38.89 43.75 41.18 38.80
Sentiment140 40.74 0.00 0.00 0.00 65.00 100.00 78.79 17.07 43.75 24.56 34.45
SentiStrength 39.51 87.50 31.82 46.67 90.00 90.00 90.00 25.40 100.00 40.51 59.06
SentiWordNet 46.91 52.17 54.55 53.33 78.57 50.00 61.11 34.09 93.75 50.00 54.81
SO-CAL 55.56 59.09 59.09 59.09 83.33 68.97 75.47 25.53 75.00 38.10 57.55
Stanford DM 62.96 76.92 45.45 57.14 64.29 92.31 75.79 41.67 31.25 35.71 56.22
Umigon 41.98 52.94 40.91 46.15 63.64 63.64 63.64 26.19 68.75 37.93 49.24
Vader 53.09 47.06 72.73 57.14 82.61 51.35 63.33 33.33 50.00 40.00 53.49
Table XVI. 3-class experiment results with 4 datasets
Dataset Method Accur. Posit. Sentiment Negat. Sentiment Neut. Sentiment MacroF1
P R F1 P R F1 P R F1
AFINN 52.63 51.02 75.76 60.98 71.43 29.41 41.67 46.88 62.50 53.57 52.07
Emolex 37.89 39.53 51.52 44.74 71.43 16.13 26.32 31.11 58.33 40.58 37.21
Emoticons 25.26 50.00 3.03 5.71 0.00 0.00 0.00 19.83 95.83 32.86 12.86
Emoticons DS 33.68 34.04 96.97 50.39 0.00 0.00 0.00 0.00 0.00 0.00 25.20
Happiness Index 35.79 35.90 42.42 38.89 75.00 19.35 30.77 22.58 58.33 32.56 34.07
NRC Hashtag 51.58 52.38 66.67 58.67 53.06 56.52 54.74 25.00 4.17 7.14 40.18
LIWC 49.47 48.98 72.73 58.54 72.22 34.21 46.43 35.71 41.67 38.46 47.81
Sarcasm Opinion Finder 48.42 55.17 48.48 51.61 69.23 40.91 51.43 39.62 87.50 54.55 52.53
Opinion Lexicon 51.58 52.38 66.67 58.67 75.00 37.50 50.00 40.54 62.50 49.18 52.62
PANAS-t 26.32 0.00 0.00 0.00 100.00 50.00 66.67 25.81 100.00 41.03 35.90
Pattern.en 49.47 46.30 75.76 57.47 52.00 30.95 38.81 56.25 37.50 45.00 47.09
SANN 42.11 43.18 57.58 49.35 72.73 24.24 36.36 32.50 54.17 40.62 42.11
SASA 36.84 35.14 39.39 37.14 48.15 35.14 40.62 29.03 37.50 32.73 36.83
SenticNet 42.11 38.16 87.88 53.21 63.64 12.96 21.54 33.33 16.67 22.22 32.32
Sentiment140 48.42 0.00 0.00 0.00 59.62 100.00 74.70 34.88 62.50 44.78 39.82
SentiStrength 45.26 63.16 36.36 46.15 80.00 63.16 70.59 31.15 79.17 44.71 53.82
SentiWordNet 45.26 50.00 63.64 56.00 42.11 27.59 33.33 41.18 58.33 48.28 45.87
SO-CAL 52.63 45.24 57.58 50.67 84.21 41.03 55.17 30.61 62.50 41.10 48.98
Stanford DM 49.47 70.00 42.42 52.83 46.67 82.35 59.57 33.33 20.83 25.64 46.02
Umigon 50.53 60.00 54.55 57.14 55.17 57.14 56.14 38.89 58.33 46.67 53.32
Vader 51.58 44.62 87.88 59.18 78.57 23.40 36.07 56.25 37.50 45.00 46.75
Dataset: Tweets RND I
AFINN 56.67 51.35 66.64 58.01 55.39 33.23 41.54 62.54 55.81 58.98 52.84
Emolex 48.21 43.45 49.48 46.27 45.23 27.48 34.19 52.94 54.02 53.47 44.64
Emoticons 49.67 68.92 15.22 24.94 49.09 36.99 42.19 32.52 94.67 48.42 38.51
Emoticons DS 32.93 32.03 99.55 48.47 56.52 0.46 0.91 92.59 2.56 4.98 18.12
Happiness Index 47.85 40.66 57.84 47.75 49.13 13.07 20.64 35.28 55.56 43.16 37.18
NRC Hashtag 38.64 39.51 63.21 48.62 32.81 31.19 31.98 66.67 10.45 18.06 32.89
LIWC 40.74 34.96 43.36 38.71 29.89 13.45 18.55 48.51 50.13 49.31 35.52
Opinion Finder 54.83 61.57 31.57 41.74 50.26 52.35 51.28 54.16 82.59 65.42 52.81
Opinion Lexicon 56.29 53.28 55.75 54.49 54.56 40.35 46.39 59.07 61.34 60.19 53.69
PANAS-t 47.34 72.09 4.63 8.70 50.79 57.14 53.78 46.76 98.00 63.31 41.93
Pattern.en 53.49 50.41 69.18 58.32 45.55 33.58 38.66 63.34 45.11 52.69 49.89
SANN 53.42 49.96 44.10 46.85 53.27 31.88 39.88 55.06 71.58 62.24 49.66
SASA 46.04 39.93 41.27 40.59 39.09 28.28 32.82 53.12 54.89 53.99 42.47
SenticNet 41.58 36.82 84.55 51.30 46.68 13.98 21.52 39.23 16.13 22.86 31.89
Sentiment140 47.48 0.00 0.00 0.00 42.90 100.00 60.05 50.02 69.84 58.29 39.45
SentiStrength 55.23 73.64 29.40 42.03 67.63 57.14 61.94 51.36 90.17 65.44 56.47
SentiWordNet 50.45 50.00 52.09 51.02 36.55 31.30 33.72 56.88 57.55 57.22 47.32
SO-CAL 57.57 54.60 55.37 54.98 54.91 42.34 47.81 37.73 63.85 47.43 50.08
Stanford DM 30.95 65.49 23.51 34.60 24.34 83.42 37.68 49.39 8.35 14.28 28.85
Umigon 60.66 63.95 57.46 60.53 50.35 53.43 51.85 63.69 66.82 65.22 59.20
Vader 56.72 49.52 76.49 60.12 56.34 30.66 39.71 67.97 47.06 55.61 51.81
Dataset: Tweets RND III
AFINN 64.41 40.81 72.12 52.13 49.67 28.29 36.05 85.95 62.54 72.40 53.53
Emolex 54.76 31.67 59.95 41.44 40.14 19.53 26.27 77.48 54.64 64.08 43.93
Emoticons 70.22 70.06 16.78 27.07 65.62 44.21 52.83 41.29 97.56 58.02 45.98
Emoticons DS 20.34 19.78 99.46 33.00 62.07 0.60 1.19 53.85 0.55 1.09 11.76
Happiness Index 55.16 29.13 61.98 39.64 50.65 9.50 16.01 43.35 59.16 50.03 35.23
NRC Hashtag 30.47 28.25 77.40 41.39 24.18 19.59 21.64 79.08 8.77 15.78 26.27
LIWC 46.88 21.85 38.43 27.86 19.18 8.05 11.34 69.51 54.83 61.31 33.50
Opinion Finder 71.55 57.48 32.75 41.72 49.85 48.56 49.20 75.95 89.90 82.34 57.75
Opinion Lexicon 63.86 40.65 66.17 50.36 48.84 27.73 35.38 81.96 64.66 72.29 52.68
PANAS-t 68.79 79.49 8.39 15.18 48.57 51.52 50.00 68.75 98.86 81.10 48.76
Pattern.en 53.57 36.25 76.86 49.26 35.19 22.50 27.45 84.20 45.68 59.23 45.31
SANN 66.88 42.70 48.71 45.51 46.35 26.93 34.07 77.99 77.99 77.99 52.52
SASA 55.37 29.42 54.53 38.22 42.46 19.28 26.52 78.30 57.15 66.08 43.60
SenticNet 33.47 23.66 86.60 37.17 41.47 10.06 16.19 43.44 15.37 22.71 25.36
Sentiment140 55.05 0.00 0.00 0.00 28.14 100.00 43.92 71.14 66.00 68.47 37.46
SentiStrength 73.80 70.94 41.95 52.72 57.53 49.80 53.39 75.35 92.26 82.95 63.02
SentiWordNet 55.85 37.42 58.19 45.55 24.04 19.57 21.58 79.25 59.00 67.64 44.92
SO-CAL 66.51 43.06 68.88 52.99 51.84 30.55 38.44 45.77 66.94 54.37 48.60
Stanford DM 31.90 64.48 38.57 48.26 15.58 72.55 25.65 75.64 19.77 31.35 35.09
Umigon 74.12 57.67 70.23 63.33 48.83 46.71 47.75 88.80 76.34 82.10 64.39
Vader 59.82 37.52 81.73 51.43 47.99 24.25 32.22 89.26 52.28 65.94 49.86
Dataset: Tweets RND IV
AFINN 50.60 46.05 62.26 52.94 50.96 31.36 38.83 55.80 45.50 50.12 47.30
Emolex 45.40 42.27 51.57 46.46 44.05 24.83 31.76 48.65 48.65 48.65 42.29
Emoticons 77.20 76.06 67.92 71.76 82.50 74.44 78.26 42.93 80.63 56.03 68.68
Emoticons DS 32.60 32.04 98.74 48.38 66.67 0.60 1.18 57.14 1.80 3.49 17.69
Happiness Index 43.20 36.96 53.46 43.70 52.27 13.69 21.70 32.34 48.65 38.85 34.75
NRC Hashtag 36.00 39.27 61.01 47.78 29.68 30.23 29.95 52.94 8.11 14.06 30.60
LIWC 44.40 0.00 0.00 0.00 0.00 0.00 0.00 44.40 100.00 61.50 20.50
Opinion Finder 49.00 54.26 32.08 40.32 38.10 42.67 40.25 50.31 72.97 59.56 46.71
Opinion Lexicon 51.40 50.53 59.75 54.76 47.66 35.42 40.64 54.15 50.00 51.99 49.13
PANAS-t 46.00 53.85 4.40 8.14 40.00 50.00 44.44 45.97 97.75 62.54 38.37
Pattern.en 64.20 58.02 88.68 70.15 61.18 50.49 55.32 87.36 34.23 49.19 58.22
SANN 46.60 43.59 42.77 43.17 44.59 27.27 33.85 48.89 59.46 53.66 43.56
SASA 44.40 39.11 44.03 41.42 39.05 27.33 32.16 51.39 50.00 50.68 41.42
SenticNet 39.00 34.77 76.10 47.73 48.28 15.61 23.60 32.99 14.41 20.06 30.46
Sentiment140 51.40 0.00 0.00 0.00 50.49 100.00 67.10 52.03 69.37 59.46 42.19
SentiStrength 54.20 72.50 36.48 48.54 55.26 48.84 51.85 50.26 86.49 63.58 54.65
SentiWordNet 45.80 43.11 45.28 44.17 37.21 25.20 30.05 50.61 56.31 53.30 42.51
SO-CAL 54.80 49.43 54.09 51.65 53.85 38.89 45.16 37.29 59.46 45.83 47.55
Stanford DM 35.20 67.86 35.85 46.91 26.56 78.40 39.68 44.68 9.46 15.61 34.07
Umigon 74.40 69.84 83.02 75.86 65.85 65.45 65.65 89.80 59.46 71.54 71.02
Vader 59.00 50.92 86.79 64.19 60.16 36.67 45.56 79.21 36.04 49.54 53.09
Table XVII. Results of the 3-class experiments on 4 datasets
Dataset | Method | Accur. | Positive sentiment (P, R, F1) | Negative sentiment (P, R, F1) | Neutral sentiment (P, R, F1) | MacroF1
Dataset: Tweets SAN
AFINN 58.70 29.77 61.66 40.15 45.29 26.63 33.54 81.19 60.69 69.46 47.72
Emolex 50.58 21.21 50.67 29.90 42.83 17.62 24.97 74.29 54.01 62.55 39.14
Emoticons 67.11 30.17 6.74 11.02 52.78 19.00 27.94 40.69 96.19 57.19 32.05
Emoticons DS 16.44 15.03 96.34 26.01 26.32 0.18 0.35 73.42 2.49 4.81 10.39
Happiness Index 55.52 21.61 42.97 28.76 44.71 10.31 16.76 42.05 67.94 51.95 32.49
NRC Hashtag 25.53 18.66 54.53 27.80 25.64 25.84 25.74 70.00 6.90 12.56 22.03
LIWC 58.27 28.45 62.43 39.08 49.42 20.72 29.20 78.64 62.49 69.64 45.98
Opinion Finder 65.30 31.04 23.51 26.75 40.23 33.74 36.70 73.51 84.70 78.71 47.39
Opinion Lexicon 59.96 31.85 58.00 41.12 44.69 30.45 36.22 79.55 63.01 70.32 49.22
PANAS-t 68.11 44.44 1.54 2.98 45.16 58.33 50.91 68.44 99.01 80.94 44.94
Pattern.en 53.12 25.00 68.59 36.64 49.68 18.04 26.46 44.57 52.64 48.27 37.13
SANN 60.16 27.26 40.08 32.45 39.44 20.14 26.67 74.24 73.38 73.81 44.31
SASA 51.11 20.67 36.99 26.52 23.44 11.74 15.64 70.29 62.58 66.21 36.12
SenticNet 28.50 17.10 85.93 28.53 44.82 7.92 13.46 46.17 14.74 22.35 21.45
Sentiment140 56.10 0.00 0.00 0.00 29.87 100.00 46.00 74.82 64.08 69.04 38.35
SentiStrength 69.63 46.87 30.25 36.77 58.41 42.58 49.25 73.17 89.80 80.64 55.55
SentiWordNet 54.38 24.61 46.05 32.08 33.85 21.21 26.08 76.22 61.12 67.84 42.00
SO-CAL 58.67 28.02 55.88 37.32 48.40 28.91 36.20 44.54 60.69 51.38 41.63
Stanford DM 25.09 43.09 20.42 27.71 18.42 79.10 29.88 74.33 9.56 16.94 24.84
Umigon 60.43 35.96 57.23 44.16 39.69 37.10 38.35 80.57 62.58 70.45 50.99
Vader 55.05 28.12 71.68 40.39 44.98 23.43 30.81 84.04 52.38 64.54 45.25
Dataset: Tweets Semeval
AFINN 62.36 61.10 70.09 65.28 44.08 31.91 37.02 71.43 58.57 64.37 55.56
Emolex 48.74 48.15 62.71 54.47 31.27 17.71 22.61 57.90 41.30 48.21 41.76
Emoticons 52.88 72.83 11.34 19.62 55.56 32.37 40.91 34.05 96.53 50.34 36.96
Emoticons DS 36.59 36.55 100.00 53.53 75.00 0.08 0.16 100.00 0.03 0.07 17.92
Happiness Index 48.81 43.61 65.27 52.29 36.96 7.54 12.53 36.82 45.16 40.56 35.13
NRC Hashtag 36.95 42.04 75.03 53.88 24.57 16.94 20.05 53.33 3.70 6.92 26.95
LIWC 39.54 36.52 42.33 39.21 15.14 6.25 8.84 48.64 44.83 46.66 31.57
Opinion Finder 57.63 67.57 27.94 39.53 40.75 48.62 44.34 58.20 86.06 69.44 51.10
Opinion Lexicon 60.37 62.09 62.71 62.40 41.19 34.18 37.36 66.41 60.75 63.46 54.41
PANAS-t 53.08 90.95 9.04 16.45 51.56 62.26 56.41 51.65 99.01 67.89 46.92
Pattern.en 50.19 58.07 68.47 62.84 24.68 29.82 27.01 67.73 35.22 46.34 45.40
SANN 54.77 52.72 47.59 50.02 38.91 20.92 27.21 58.95 66.90 62.67 46.64
SASA 50.63 46.34 47.77 47.04 33.07 12.14 17.76 56.39 61.12 58.66 41.15
SenticNet 39.90 39.81 86.55 54.54 31.85 8.98 14.01 38.18 7.20 12.12 26.89
Sentiment140 42.25 0.00 0.00 0.00 26.79 100.00 42.25 50.57 66.14 57.31 33.19
SentiStrength 57.83 78.01 27.13 40.25 47.80 53.55 50.52 55.49 89.89 68.62 53.13
SentiWordNet 48.33 55.54 53.44 54.47 19.67 24.82 21.95 61.22 47.57 53.54 43.32
SO-CAL 58.83 58.89 59.02 58.95 40.39 33.14 36.41 39.89 59.96 47.91 47.76
Stanford DM 22.54 72.14 18.17 29.03 14.92 82.93 25.28 47.19 6.94 12.10 22.14
Umigon 65.88 75.18 56.14 64.28 39.66 53.18 45.44 70.65 75.78 73.13 60.95
Vader 60.05 56.08 79.26 65.68 44.13 26.60 33.19 76.88 46.02 57.57 52.15
Dataset: RW
AFINN 55.07 59.82 80.58 68.66 50.83 25.99 34.39 44.13 27.57 33.94 45.66
Emolex 48.95 56.31 68.18 61.68 39.29 23.12 29.11 39.77 30.79 34.71 41.83
Emoticons 37.28 65.12 17.36 27.41 46.15 21.05 28.92 24.81 86.22 38.53 31.62
Emoticons DS 46.75 46.62 99.59 63.50 66.67 0.72 1.42 50.00 0.88 1.73 22.22
Happiness Index 47.80 51.99 75.41 61.55 47.42 12.01 19.17 26.49 26.10 26.29 35.67
NRC Hashtag 44.07 57.59 61.16 59.32 30.10 40.60 34.57 43.24 4.69 8.47 34.12
LIWC 54.02 59.03 80.37 68.07 55.46 19.64 29.01 41.04 32.26 36.12 44.40
Opinion Finder 39.39 56.67 38.64 45.95 27.86 39.92 32.82 34.67 38.12 36.31 38.36
Opinion Lexicon 52.10 59.86 72.11 65.42 45.16 29.52 35.70 39.84 28.74 33.39 44.84
PANAS-t 34.99 60.61 8.26 14.55 30.19 38.10 33.68 33.44 90.91 48.90 32.38
Pattern.en 47.61 59.74 67.15 63.23 32.69 35.01 33.81 39.01 16.13 22.82 39.95
SANN 46.94 57.01 63.02 59.86 38.19 24.84 30.10 35.26 32.26 33.69 41.22
SASA 41.01 55.28 41.12 47.16 30.09 28.76 29.41 35.11 48.39 40.69 39.09
SenticNet 49.62 50.81 90.29 65.03 42.50 10.76 17.17 31.96 9.09 14.16 32.12
Sentiment140 31.36 0.00 0.00 0.00 33.08 100.00 49.72 29.59 44.87 35.66 28.46
SentiStrength 41.11 77.04 21.49 33.60 45.57 53.73 49.32 34.86 85.04 49.45 44.12
SentiWordNet 46.27 57.01 63.02 59.86 31.03 28.12 29.51 40.27 26.10 31.67 40.35
SO-CAL 49.24 63.07 62.81 62.94 36.89 40.47 38.60 27.61 26.39 26.99 42.84
Stanford DM 31.84 78.57 15.91 26.46 24.13 90.54 38.10 47.83 16.13 24.12 29.56
Umigon 41.78 70.59 39.67 50.79 27.73 65.22 38.91 40.77 27.86 33.10 40.94
Vader 55.26 57.77 87.60 69.62 51.93 23.27 32.14 45.80 17.60 25.42 42.39
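As a reading aid for Tables XVI and XVII, each per-class F1 value is the harmonic mean of the corresponding precision (P) and recall (R), and the MacroF1 column is the unweighted average of the three per-class F1 scores. The short Python sketch below is purely illustrative (it is not the evaluation code used in the paper) and recomputes these quantities for the first AFINN row of Table XVI; small deviations from the printed values are due to rounding.

# Illustrative sketch (not the evaluation code used in the paper): recompute
# the per-class F1 and the MacroF1 column of Tables XVI and XVII from the
# precision (P) and recall (R) values printed in each row (all in percent).

def f1(precision, recall):
    """Harmonic mean of precision and recall; returns 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def macro_f1(f1_scores):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_scores) / len(f1_scores)

# Example: first AFINN row of Table XVI (P and R per class, in percent).
f1_pos = f1(51.02, 75.76)   # ~60.98
f1_neg = f1(71.43, 29.41)   # ~41.67
f1_neu = f1(46.88, 62.50)   # ~53.57

print(round(f1_pos, 2), round(f1_neg, 2), round(f1_neu, 2))
print(round(macro_f1([f1_pos, f1_neg, f1_neu]), 2))  # ~52.07, matching the MacroF1 column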
ACKNOWLEDGMENTS
This work is supported by grants from CAPES, Fapemig, and CNPq.
REFERENCES
Fotis Aisopos. 2014. Manually Annotated Sentiment Analysis Twitter Dataset NTUA. (2014). www.grid.ece.ntua.gr.
Matheus Araújo, Pollyanna Gonçalves, Fabrício Benevenuto, and Meeyoung Cha. 2014. iFeel: A System that Compares and
Combines Sentiment Analysis Methods. In WWW (Companion Volume). International World Wide Web Conference
(WWW’14), 4.
Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An Enhanced Lexical Resource for
Sentiment Analysis and Opinion Mining. In LREC, Nicoletta Calzolari, Khalid Choukri, Bente Maegaard,
Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias (Eds.). European Language Resources
Association. http://dblp.uni-trier.de/db/conf/lrec/lrec2010.html#BaccianellaES10
Mark L. Berenson, David M. Levine, and Kathryn A. Szabat. 2014. Basic Business Statistics - Concepts and Applications
(13 ed.). Pearson. 840 pages.
Celeste Biever. 2010. Twitter mood maps reveal emotional states of America. The New Scientist 207 (2010). Issue 2771.
Johan Bollen, Huina Mao, and Xiao-Jun Zeng. 2010. Twitter Mood Predicts the Stock Market. CoRR abs/1010.3003 (2010).
Johan Bollen, Alberto Pepe, and Huina Mao. 2009. Modeling Public Mood and Emotion: Twitter Sentiment and Socio-
Economic Phenomena. CoRR abs/0911.1583 (2009).
M. M. Bradley and P. J. Lang. 1999. Affective norms for English words (ANEW): Stimuli, instruction manual, and affective
ratings. Technical Report. Center for Research in Psychophysiology, University of Florida, Gainesville, Florida.
Erik Cambria, Robert Speer, Catherine Havasi, and Amir Hussain. 2010. SenticNet: A Publicly Available Semantic Resource
for Opinion Mining. In AAAI Fall Symposium Series.
Meeyoung Cha, Hamed Haddadi, Fabricio Benevenuto, and Krishna P. Gummadi. 2010. Measuring User Influence in Twitter:
The Million Follower Fallacy. In International AAAI Conference on Weblogs and Social Media (ICWSM).
Tom De Smedt and Walter Daelemans. 2012. Pattern for python. The Journal of Machine Learning Research 13, 1 (2012),
2063–2067.
N.A. Diakopoulos and D.A. Shamma. 2010. Characterizing debate performance via aggregated twitter sentiment. In Pro-
ceedings of the 28th international conference on Human factors in computing systems. ACM, 1195–1198.
Peter Sheridan Dodds, Eric M. Clark, Suma Desu, Morgan R. Frank, Andrew J. Reagan, Jake Ryland Williams, Lewis
Mitchell, Kameron Decker Harris, Isabel M. Kloumann, James P. Bagrow, Karine Megerdoomian, Matthew T. McMa-
hon, Brian F. Tivnan, and Christopher M. Danforth. 2015. Human language reveals a universal positivity bias. Proceed-
ings of the National Academy of Sciences 112, 8 (2015), 2389–2394. DOI:http://dx.doi.org/10.1073/pnas.1411678112
Peter Sheridan Dodds and Christopher M Danforth. 2009. Measuring the happiness of large-scale writ-
ten expression: songs, blogs, and presidents. Journal of Happiness Studies 11, 4 (2009), 441–456.
DOI:http://dx.doi.org/10.1007/s10902-009-9150-9
Andrea Esuli and Fabrizio Sebastiani. 2006. SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining. In International
Conference on Language Resources and Evaluation (LREC). 417–422.
Ronen Feldman. 2013. Techniques and Applications for Sentiment Analysis. Commun. ACM 56, 4 (April 2013), 82–89.
DOI:http://dx.doi.org/10.1145/2436256.2436274
Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter Sentiment Classification using Distant Supervision. Processing -
(2009), 1–6.
Namrata Godbole, Manjunath Srinivasaiah, and Steven Skiena. 2007. Large-Scale Sentiment Analysis for News and Blogs.
In Proceedings of the International Conference on Weblogs and Social Media (ICWSM).
Pollyanna Gonçalves, Matheus Araújo, Fabrício Benevenuto, and Meeyoung Cha. 2013a. Comparing and Combining Sentiment Analysis Methods. In Proceedings of the 1st ACM Conference on Online Social Networks (COSN'13). 12.
Pollyanna Gonçalves, Fabrício Benevenuto, and Meeyoung Cha. 2013b. PANAS-t: A Psychometric Scale for Measuring Sentiments on Twitter. CoRR abs/1308.1857v1 (2013).
Aniko Hannak, Eric Anderson, Lisa Feldman Barrett, Sune Lehmann, Alan Mislove, and Mirek Riedewald. 2012. Tweetin’
in the Rain: Exploring societal-scale effects of weather on mood. In Int’l AAAI Conference on Weblogs and Social
Media (ICWSM).
Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews (KDD '04). 168–177. http://doi.acm.org/10.1145/1014052.1014073
CJ Hutto and Eric Gilbert. 2014. Vader: A parsimonious rule-based model for sentiment analysis of social media text. In
Eighth International AAAI Conference on Weblogs and Social Media (ICWSM).
Efthymios Kouloumpis, Theresa Wilson, and Johanna Moore. 2011. Twitter Sentiment Analysis: The Good the Bad and the
OMG!. In Int’l AAAI Conference on Weblogs and Social Media (ICWSM).
Adam D I Kramer, Jamie E Guillory, and Jeffrey T Hancock. 2014. Experimental evidence of massive-scale emotional
contagion through social networks. Proceedings of the National Academy of Sciences of the United States of America
111, 24 (June 2014), 8788–90. DOI:http://dx.doi.org/10.1073/pnas.1320040111
Clement Levallois. 2013. Umigon: sentiment analysis for tweets based on terms lists and heuristics. In Second Joint Confer-
ence on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop
on Semantic Evaluation (SemEval 2013). Association for Computational Linguistics, Atlanta, Georgia, USA, 414–417.
http://www.aclweb.org/anthology/S13-2068
Lexalytics. 2015. Sentiment Extraction - Measuring the Emotional Tone of Content. Technical Report. Lexalytics.
Bing Liu. 2012. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies 5, 1 (May
2012), 1–167. DOI:http://dx.doi.org/10.2200/s00416ed1v01y201204hlt016
George A. Miller. 1995. WordNet: a lexical database for English. Commun. ACM 38, 11 (1995), 39–41.
Saif Mohammad. 2012. #Emotional Tweets. In *SEM 2012: The First Joint Conference on Lexical and Computational
Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth
International Workshop on Semantic Evaluation (SemEval 2012). Association for Computational Linguistics, Montréal, Canada, 246–255. http://www.aclweb.org/anthology/S12-1033
Saif Mohammad, Cody Dunne, and Bonnie Dorr. 2009. Generating High-coverage Semantic Orientation Lexicons from
Overtly Marked Words and a Thesaurus. In Proceedings of the 2009 Conference on Empirical Methods in Natural
Language Processing: Volume 2 - Volume 2 (EMNLP ’09). Association for Computational Linguistics, Stroudsburg,
PA, USA, 599–608. http://dl.acm.org/citation.cfm?id=1699571.1699591
Saif Mohammad and Peter D. Turney. 2013. Crowdsourcing a Word-Emotion Association Lexicon. Computational Intelli-
gence 29, 3 (2013), 436–465.
Saif M. Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu. 2013. NRC-Canada: Building the State-of-the-Art in Sentiment
Analysis of Tweets. In Proceedings of the seventh international workshop on Semantic Evaluation Exercises (SemEval-
2013). Atlanta, Georgia, USA.
Preslav Nakov, Zornitsa Kozareva, Alan Ritter, Sara Rosenthal, Veselin Stoyanov, and Theresa Wilson. 2013. SemEval-2013
Task 2: Sentiment Analysis in Twitter. (2013).
Sascha Narr, Michael Hülfenhaus, and Sahin Albayrak. 2012. Language-independent Twitter sentiment analysis. Knowledge
Discovery and Machine Learning (KDML) (2012), 12–14.
Finn Årup Nielsen. 2011. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. arXiv preprint
arXiv:1103.2903 (2011).
Nuno Oliveira, Paulo Cortez, and Nelson Areal. 2013. On the Predictability of Stock Market Behavior Using StockTwits
Sentiment and Posting Volume. In EPIA (Lecture Notes in Computer Science), Luís Correia, Luís Paulo Reis, and José
Cascalho (Eds.), Vol. 8154. Springer, 355–365.
Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on
minimum cuts. In Proceedings of the ACL. 271–278.
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up?: sentiment classification using machine learning
techniques. In ACL Conference on Empirical Methods in Natural Language Processing. 79–86.
Nikolaos Pappas and Andrei Popescu-Belis. 2013. Sentiment analysis of user comments for one-class collaborative filtering
over TED talks. In Proceedings of the 36th international ACM SIGIR conference on Research and development in
information retrieval. ACM, 773–776.
R. Plutchik. 1980. A general psychoevolutionary theory of emotion. Academic press, New York, 3–33.
Julio Reis, Fabricio Benevenuto, Pedro Vaz de Melo, Raquel Prates, Haewoon Kwak, and Jisun An. 2015. Breaking the News:
First Impressions Matter on Online News. In Proceedings of the 9th International AAAI Conference on Weblogs and
Social Media (ICWSM).
Julio Reis, Pollyanna Goncalves, Pedro Vaz de Melo, Raquel Prates, and Fabricio Benevenuto. 2014. Magnet News: You
Choose the Polarity of What you Read. In International AAAI Conference on Weblogs and Social Media.
Niek Sanders. 2011. Twitter Sentiment Corpus by Niek Sanders. (2011). http://www.sananalytics.com/lab/twitter-sentiment/.
Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and Fast—but is It Good?: Evaluating
Non-expert Annotations for Natural Language Tasks. In Proceedings of the Conference on Empirical Methods in Natu-
ral Language Processing (EMNLP ’08).
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts.
2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In 2013 Conference on
Empirical Methods in Natural Language Processing. 1631–1642.
Philip J. Stone, Dexter C. Dunphy, Marshall S. Smith, and Daniel M. Ogilvie. 1966. The General Inquirer: A Computer
Approach to Content Analysis. MIT Press. http://www.webuse.umd.edu:9090/
Carlo Strapparava and Rada Mihalcea. 2007. SemEval-2007 Task 14: Affective Text. In Proceedings of the 4th International
Workshop on Semantic Evaluations (SemEval ’07). Association for Computational Linguistics, Stroudsburg, PA, USA,
70–74. http://dl.acm.org/citation.cfm?id=1621474.1621487
Maite Taboada, Caroline Anthony, and Kimberly Voll. 2006a. Methods for Creating Semantic Orientation Dictionaries. In
Conference on Language Resources and Evaluation (LREC). 427–432.
Maite Taboada, Caroline Anthony, and Kimberly Voll. 2006b. Methods for Creating Semantic Orientation Dictionaries. In
Conference on Language Resources and Evaluation (LREC). 427–432.
Maite Taboada, Julian Brooke, Milan Tofiloski, Kimberly Voll, and Manfred Stede. 2011. Lexicon-based Methods for Sen-
timent Analysis. Comput. Linguist. 37, 2 (June 2011), 267–307. DOI:http://dx.doi.org/10.1162/COLI_a_00049
Acar Tamersoy, Munmun De Choudhury, and Duen Horng Chau. 2015. Characterizing Smoking and Drinking Abstinence
from Social Media. In Proceedings of the 26th ACM Conference on Hypertext and Social Media (HT).
Yla R. Tausczik and James W. Pennebaker. 2010. The Psychological Meaning of Words: LIWC and Computerized Text
Analysis Methods. Journal of Language and Social Psychology 29, 1 (2010), 24–54.
Mike Thelwall. 2013. Heart and soul: Sentiment strength detection in the social web with SentiStrength. (2013). http://sentistrength.wlv.ac.uk/documentation/SentiStrengthChapter.pdf.
Mikalai Tsytsarau and Themis Palpanas. 2012. Survey on Mining Subjective Data on the Web. Data Min. Knowl. Discov. 24,
3 (May 2012), 478–514. DOI:http://dx.doi.org/10.1007/s10618-011-0238-6
Andranik Tumasjan, Timm O. Sprenger, Philipp G. Sandner, and Isabell M. Welpe. 2010. Predicting Elections with Twitter:
What 140 Characters Reveal about Political Sentiment. In International AAAI Conference on Weblogs and Social Media
(ICWSM).
Alessandro Valitutti. 2004. WordNet-Affect: an Affective Extension of WordNet. In Proceedings of the 4th International Confer-
ence on Language Resources and Evaluation. 1083–1086.
Hao Wang, Dogan Can, Abe Kazemzadeh, François Bar, and Shrikanth Narayanan. 2012. A system for real-time Twitter
sentiment analysis of 2012 U.S. presidential election cycle. In ACL System Demonstrations. 115–120.
D. Watson and L. Clark. 1985. Development and validation of brief measures of positive and negative affect: the PANAS
scales. Journal of Personality and Social Psychology 54, 1 (1985), 1063–1070.
Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating Expressions of Opinions and Emotions in Language.
Language Resources and Evaluation 1, 2 (2005), 0. http://www.cs.pitt.edu/\
Theresa Wilson, Paul Hoffmann, Swapna Somasundaran, Jason Kessler, Janyce Wiebe, Yejin Choi, Claire Cardie, Ellen
Riloff, and Siddharth Patwardhan. 2005a. OpinionFinder: a system for subjectivity analysis. In HLT/EMNLP on Inter-
active Demonstrations. 34–35.
Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005b. Recognizing Contextual Polarity in Phrase-Level Sentiment
Analysis. In ACL Conference on Empirical Methods in Natural Language Processing. 347–354.
David H. Wolpert and William G. Macready. 1997. No free lunch theorems for optimization. IEEE Transactions on Evolu-
tionary Computation 1, 1 (1997), 67–82.