A Benchmark Comparison of State-of-the-Practice
Sentiment Analysis Methods
Pollyanna Gonçalves, Federal University of Minas Gerais
Matheus Araújo, Federal University of Minas Gerais
Filipe Ribeiro, Federal University of Minas Gerais and Federal University of Ouro Preto
Fabrício Benevenuto, Federal University of Minas Gerais
Marcos Gonçalves, Federal University of Minas Gerais
In the last few years, thousands of scientific papers have explored sentiment analysis, several startups that measure opinions on real data have emerged, and a number of innovative products related to this theme have been developed. There are multiple methods for measuring sentiments, including lexical-based approaches and supervised machine learning methods. Despite the vast interest in the theme and the wide popularity of some methods, it is unclear which method is better for identifying the polarity (i.e., positive or negative) of a message. Thus, there is a strong need to conduct a thorough apples-to-apples comparison of sentiment analysis methods, as they are used in practice, across multiple datasets originating from different data sources. Such a comparison is key for understanding the potential limitations, advantages, and disadvantages of popular methods. This study aims at filling this gap by presenting a benchmark comparison of twenty-one popular sentiment analysis methods (which we call the state-of-the-practice methods). Our evaluation is based on a benchmark of twenty labeled datasets, covering messages posted on social networks, movie and product reviews, as well as opinions and comments in news articles. Our results highlight the extent to which the prediction performance of these methods varies widely across datasets. Aiming at boosting the development of this research area, we release the methods' codes and the datasets used in this paper, and we deploy a benchmark system, which provides an open API for accessing and comparing sentence-level sentiment analysis methods.
CCS Concepts: • Information systems → Sentiment analysis; • Networks → Social media networks; Online social networks;
Additional Key Words and Phrases: Sentiment analysis, social media, online social networks, sentence-level
1. INTRODUCTION
Sentiment analysis has become an extremely popular tool, applied in several analytical domains,
especially on the Web and social media. To illustrate the growth of interest in the field, Figure 1 shows the steady increase in the number of searches on the topic, according to Google Trends1, mainly after the popularization of online social networks (OSNs). More than 7,000 articles have
been written about sentiment analysis and various startups are developing tools and strategies to
extract sentiments from text [Feldman 2013].
The number of possible applications of such a technique is also considerable. Many of them are
focused on monitoring the reputation or opinion of a company or a brand with the analysis of re-
views of consumer products or services [Hu and Liu 2004]. Sentiment analysis can also provide
analytical perspectives for financial investors who want to discover and respond to market opin-
ions [Oliveira et al. 2013; Bollen et al. 2010]. Another important set of applications is in politics,
where marketing campaigns are interested in tracking sentiments expressed by voters associated
with candidates [Tumasjan et al. 2010].
Due to the enormous interest and applicability, there has been a corresponding increase in the
number of proposed sentiment analysis methods in the last years. The proposed methods rely on
many different techniques from different computer science fields. Some of them employ machine
learning methods that often rely on supervised classification approaches, requiring labeled data to
train classifiers [Pang et al. 2002]. Others are lexical-based methods that make use of predefined
lists of words, in which each word is associated with a specific sentiment. The lexical methods
vary according to the context in which they were created. For instance, LIWC [Tausczik and Pen-
nebaker 2010] was originally proposed to analyze sentiment patterns in formally written English
1 https://www.google.com/trends/explore#q=sentiment%20analysis
arXiv:1512.01818v1 [cs.CL] 6 Dec 2015
Fig. 1. Searches on Google for the Query: “Sentiment Analysis”
texts, whereas PANAS-t [Gonc¸alves et al. 2013b] and POMS-ex [Bollen et al. 2009] were proposed
as psychometric scales adapted to the Web context.
Overall, the above techniques are accepted by the research community and it is common to see concurrent important papers, sometimes published in the same computer science conference, using completely different methods. For example, the famous Facebook experiment [Kramer et al. 2014], which manipulated users' feeds to study emotional contagion, used LIWC [Tausczik and Pennebaker
2010]. Concurrently, Reis et al. used SentiStrength [Thelwall 2013] to measure the negativity or positivity of online news headlines [Reis et al. 2014; Reis et al. 2015], whereas Tamersoy et al. [Tamersoy et al. 2015] explored Vader's lexicon [Hutto and Gilbert 2014] to study patterns of smoking and
drinking abstinence in social media.
As the state-of-the-art has not been clearly established, researchers tend to accept any popular
method as a valid methodology to measure sentiments. However, little is known about the relative
performance of the several existing sentiment analysis methods. In fact, most of the newly pro-
posed methods are rarely compared with all other pre-existing ones using a large number of existing
datasets. This is a very unusual situation from a scientific perspective, in which benchmark compar-
isons are the rule. In fact, most applications and experiments reported in the literature make use of
previously developed methods exactly how they were released with no changes and adaptations and
with none or almost none parameter setting. In other words, the methods have been used as a black-
box, without a deeper investigation on their the suitability to a particular context or application.
To sum up, existing methods have been widely deployed for developing applications without a
deeper understanding regarding either their applicability in different contexts or their advantages,
disadvantages, and limitations in comparison with one another. Thus, there is a strong need to conduct a thorough apples-to-apples comparison of sentiment analysis methods, as they are used in practice, across multiple datasets originating from different data sources.
This state-of-the-practice situation is what we propose to investigate in this article. We do this
by providing a thorough benchmark comparison of twenty-one state-of-the-practice methods using twenty labeled datasets. In particular, given the recent popularity of online social networks and of short texts on the Web, many methods are focused on detecting sentiments at the sentence level, usually used to measure the sentiment of small sets of sentences in which the topic is known a priori. We focus on this context; thus, our datasets cover messages posted on social networks, movie and product reviews, and opinions and comments in news articles, TED talks, and blogs.
We survey an extensive literature on sentiment analysis to identify existing sentence-level methods
covering several different techniques. We contacted authors asking for their codes when available
or we implemented existing methods when they were unavailable but could be reproduced based on
their descriptions in a published paper.
Our experimental results unveil a number of important findings. First, we show that there is no
single method that always achieves the best prediction performance for all different datasets, a re-
sult consistent with the “there is no free lunch theorem” [Wolpert and Macready 1997]. We also
show that existing methods vary widely regarding their agreement, even across similar datasets (e.g.
random tweets). This suggests that the same content could be interpreted very differently depending
on the choice of a sentiment method. We noted that most methods are more accurate in correctly
classifying positive than negative text, suggesting that current existing approaches tend to be biased
in their analysis towards positivity. Finally, we quantify the relative prediction performance of ex-
isting efforts in the field across different types of datasets, identifying those with higher prediction
performance across different datasets.
Based on these observations, our final contribution consists of releasing our gold standard dataset and the codes of the compared methods2. We also created a Web system through which we allow other researchers to easily use our data and codes to compare results with the existing methods. More importantly, by using our system one can easily test which method would be the most suitable for a particular dataset and/or application. We hope that our tool will not only help researchers and practitioners access and compare a wide range of sentiment analysis techniques, but also help advance the development of this research field as a whole.
The remainder of this paper is organized as follows. In Section 2, we briefly describe related
efforts. Then, in Section 3 we describe the sentiment analysis methods we compare. Section 4
presents the gold standard data used for comparison. Section 5 summarizes our results and findings.
Finally, Section 6 concludes the article and discusses directions for future work.
2. BACKGROUND AND RELATED WORK
Next we discuss important definitions and justify the focus of our benchmark comparison. We also
briefly survey existing related efforts that compare sentiment analysis methods.
2.1. Focus on Sentiment Level
Since sentiment analysis can be applied to different tasks, we restrict our focus to comparing those efforts related to detecting the polarity (i.e., positivity or negativity) of a given short text (i.e., sentence-level). Polarity detection is a common function across all sentiment methods considered in our work, providing valuable information to a number of different applications, especially those that explore short messages that are commonly available in social media [Feldman 2013].
Sentence-level sentiment analysis can be performed with supervision (i.e. requiring labeled train-
ing data) or not. An advantage of supervised methods is their ability to adapt and create trained models for specific purposes and contexts. A drawback is the need for labeled data, which might
be highly costly and even prohibitive for some tasks. On the other hand, the lexical-based meth-
ods make use of a pre-defined list of words, where each word is associated with a specific senti-
ment. The lexical methods vary according to the context in which they were created. For instance,
LIWC [Tausczik and Pennebaker 2010] was originally proposed to analyze sentiment patterns in
English texts, whereas PANAS-t [Gonc¸alves et al. 2013b] and POMS-ex [Bollen et al. 2009] are
psychometric scales adapted to the Web context. Although lexical-based methods do not rely on
labeled data, it is hard to create a unique lexical-based dictionary to be used for different contexts.
We focus our effort on evaluating unsupervised methods, as they can be easily deployed in Web services and applications without the need for human labeling or any other type of manual inter-
vention. As described in Section 3, some of the methods we consider have used machine learning
to build lexicon dictionaries or even to build models and tune specific parameters. We incorporate
those methods in our study, since they have been released as black-box tools that can be used in an
unsupervised manner.
2.2. Existing Efforts on Methods’ Comparison
Despite the large number of existing methods, only a limited number of studies have performed a comparison
among sentiment analysis methods, usually with limited datasets. Overall, lexical methods and ma-
2Except for one paid method
chine learning approaches have been evolving in parallel in the last years, and it comes as no surprise
that studies have started to compare their performance on specific datasets and use one or another
strategy as baseline for comparison. A recent survey summarizes several of these efforts [Tsytsarau
and Palpanas 2012] and concludes that a systematic comparative study that implements and eval-
uates all relevant algorithms under the same framework is still missing in the literature. As new
methods emerge and compare themselves only against one or another method, using different evaluation datasets and testing methodologies, it is hard to conclude whether a single method triumphs over the others, even in specific scenarios. To the best of our knowledge, our effort is the first of its kind to create a benchmark and provide such a comparison.
Another noteworthy effort is an annual workshop, the International Workshop on Semantic Evaluation (SemEval). It consists of a series of exercises grouped into tracks, including sentiment analysis and text similarity, among others, that bring together competing systems. Some new methods, such as Umigon [Levallois 2013], have been proposed after obtaining good results on part of these tracks. Although SemEval has been playing an important role in identifying the current important methods, it requires authors of the methods to register for the challenge, and many popular methods have not been evaluated in these exercises. Additionally, SemEval labeled datasets are usually focused on one specific type of data, such as tweets, and do not represent a wide range of social media data. In our evaluation effort, we consider one dataset from SemEval 2013 and two methods that participated in the competition in that same year.
Finally, in a previous effort [Gonc¸alves et al. 2013a], we compared eight sentence-level sentiment
analysis methods, based on one public dataset used to evaluate the method SentiStrength [Thelwall 2013]. The present effort largely extends our previous work by comparing many more methods across many different datasets, providing a much deeper benchmark evaluation of currently popular sentiment analysis methods. The methods used in this paper were also incorporated as part of an existing system, namely iFeel [Araujo et al. 2014].
3. SENTIMENT ANALYSIS METHODS
This section provides a brief description of the twenty-one sentence-level sentiment analysis meth-
ods investigated in this paper.
Our effort to identify important sentence-level sentiment analysis methods consisted of systematically searching for them in the main conferences in the field and then checking the papers that cited them as well as their own references. Some of the methods are available for download on the Web; others were kindly shared by their authors upon request; and a small part of them were implemented by us based on their descriptions in the original papers. This usually happened when authors shared only the lexical dictionaries they created, leaving the implementation of the method that uses the lexical resource to us.
Table I and Table II present an overview of these methods, providing a description of each method
as well as the techniques they employ (L for Lexicon Dictionary and ML for Machine Learning),
their outputs (e.g., -1, 0, 1, meaning negative, neutral, and positive, respectively), the datasets used to validate them, the baseline methods used for comparison, and finally lexicon details. The methods are organized in chronological order to allow a better overview of the existing efforts over the years. We can note that the methods generate different output formats. We colored in blue the positive outputs, in black the neutral ones, and in red those that are negative.
Since we are comparing sentiment analysis methods on a sentence-level basis, we need to work
with mechanisms that are able to receive sentences as input and give polarities as output. Some of
the approaches considered in this paper, shown in Table II, are complex dictionaries built with great
effort. However, a lexicon alone has no natural ability to infer polarity in sentence-level tasks. The purpose of a lexicon goes beyond detecting the polarity of a sentence [Feldman 2013; Liu 2012], but it can also be used to that end [Godbole et al. 2007; Kouloumpis et al. 2011].
Several existing sentence-level sentiment analysis methods, like Vader [Hutto and Gilbert 2014] and SO-CAL [Taboada et al. 2011], combine a lexicon with the processing of sentence char-
Table I. Overview of the sentence-level methods available in the literature.
Name Description L ML
Emoticons [Gonc¸alves et al. 2013a] Messages containing positive/negative emoticons are positive/negative. Messages
without emoticons are not classified. X
Opinion Lexicon [Hu and Liu
2004]
Focus on Product Reviews. Builds a Lexicon to predict polarity of product features
phrases that are summarized to provide an overall score to that product feature. X
Opinion Finder (MPQA) [Wilson
et al. 2005a] [Wilson et al. 2005b]
Performs subjectivity analysis through a framework with a lexical analysis step followed by a machine learning step. X X
Happiness Index [Dodds and
Danforth 2009]
Quantifies happiness levels for large-scale texts as lyrics and blogs. It uses ANEW
words [Bradley and Lang 1999] to rank the documents. X
SentiWordNet [Esuli and
Sebastiani 2006] [Baccianella et al.
2010]
Construction of a lexical resource for Opinion Mining based on WordNet [Miller
1995]. The authors grouped adjectives, nouns, etc in synonym sets (synsets) and
associated three polarity scores (positive, negative and neutral) for each one.
X X
LIWC [Tausczik and Pennebaker
2010]
Text analysis paid tool to evaluate emotional, cognitive, and structural components of
a given text. It uses a dictionary with words classified into categories (anxiety, health,
leisure, etc).
X
SenticNet [Cambria et al. 2010]
Uses dimensionality reduction to infer the polarity of common sense concepts and
hence provide a public resource for mining opinions from natural language text at a
semantic, rather than just syntactic level.
X
AFINN [Nielsen 2011] - A new ANEW Builds a Twitter-based sentiment lexicon including Internet slang and obscene words. X
SO-CAL [Taboada et al. 2011]
Creates a new Lexicon with unigrams (verbs, adverbs, nouns and adjectives) and
multi-grams (phrasal verbs and intensifiers) hand ranked with scale +5 (strongly
positive) to -5 (strongly negative). Authors also included part of speech processing,
negation and intensifiers.
X
Emoticons DS (Distant
Supervision)[Hannak et al. 2012]
Creates a scored lexicon based on a large dataset of tweets. It is based on the frequency with which each lexicon entry occurs with positive or negative emoticons. X
NRC Hashtag [Mohammad 2012]
Builds a lexicon dictionary using a distant supervision approach. In a nutshell, it uses a known hashtag to “classify” the tweet (e.g., #joy, #sadness, etc.). Afterwards, it verifies the occurrence of each specific n-gram in that emotion. Then, the score of an n-gram occurring with an emotion is calculated.
X
Pattern.en [De Smedt and
Daelemans 2012]
Python programming package (toolkit) for NLP, Web mining, and sentiment analysis. Sentiment analysis is provided by averaging the scores of the adjectives in the sentence according to a bundled lexicon of adjectives.
X
SASA [Wang et al. 2012]
Detects public sentiments on Twitter during the 2012 U.S. presidential election. It is
based on the statistical model obtained from a Naïve Bayes classifier on unigram
features. It also explores emoticons and exclamations.
X
PANAS-t [Gonc¸ alves et al. 2013b]
Detects mood fluctuations of users on Twitter. The method consists of an adapted version of the Positive Affect Negative Affect Scale (PANAS) [Watson and Clark 1985], a well-known method in psychology, with a large set of words associated with eleven moods (surprise, fear, etc.).
X
EmoLex [Mohammad and Turney
2013]
Builds a general sentiment Lexicon crowdsourcing supported. Each entry lists the
association of a token with 8 basic sentiments: joy, sadness, anger, etc defined
by [Plutchik 1980]. Proposed Lexicon includes unigrams and bigrams from
Macquarie Thesaurus and also words from GI and Wordnet.
X
SANN [Pappas and Popescu-Belis
2013]
Infers additional user ratings of reviews by performing sentiment analysis (SA) of user comments and integrating its output into a nearest neighbor (NN) model that provides
multimedia recommendations over TED Talks.
X X
Sentiment140 Lexicon
[Mohammad et al. 2013]
Creation of a lexicon dictionary in a similar way to [Mohammad 2012] and a SVM
Classifier with features like: number and categories of emoticons, sum of the
sentiment scores for all tokens (calculated with lexicons), etc.
X
SentiStrength [Thelwall 2013] Builds a lexicon dictionary annotated by humans and improved with the use of
Machine Learning. X X
Stanford Recursive Deep Model
[Socher et al. 2013]
Proposes a model called Recursive Neural Tensor Network (RNTN) that processes all
sentences dealing with their structures and compute the interactions between them.
This approach is interesting since RNTN take into account the order of words in a
sentence, which is ignored in most of methods.
X X
Umigon [Levallois 2013] Disambiguates tweets using lexicon with heuristics to detect negations plus elongated
words and hashtags evaluation. X
Vader [Hutto and Gilbert 2014]
It is a human-validated sentiment analysis method developed for twitter and social
media contexts. Vader was created from a generalizable, valence-based,
human-curated gold standard sentiment lexicon.
X
Semantria [Lexalytics 2015]
It is a paid tool that employs multi-level analysis of sentences. Basically, it has four levels: part-of-speech tagging, assignment of prior scores from dictionaries, application of intensifiers, and finally machine learning techniques to deliver a final weight to the sentence.
X X
Table II. Overview of the sentence-level methods available in the literature - 2.
Name Output Validation Compared To Lexicon C
Emoticons -1,1- - 79
Opinion Lexicon -1,0,1Product Reviews from
Amazon and CNet - 6787 X
Opinion Finder
(MPQA)
Negative,
Neutral,
Positive
MPQA [Wiebe et al.
2005]
Compared to itself in different
versions. 20611 X
Happiness Index 1,2,3,4,5,6,7,
8,9
Lyrics, Blogs,STU
Messages 3, British
National Corpus 4,
- 1034 X
SentiWordNet -1,0,1-General Inquirer (GI)[Stone et al.
1966] 117658
LIWC negEmo,
posEmo - - ?X
SenticNet Negative,
Positive
Patient Opinions
(Unavailable) SentiStrength [Thelwall 2013] 15000
AFINN -1,0,1 Twitter [Biever 2010]
OpinonFinder [Wilson et al.
2005a], ANEW [Bradley and Lang
1999], GI [Stone et al. 1966] and
Sentistrength [Thelwall 2013]
2477 X
SO-CAL (<0), 0, (>0)
Epinion [Taboada et al.
2006a], MPQA[Wiebe
et al. 2005],
Myspace[Thelwall
2013],
MPQA[Wiebe et al. 2005],
GI[Stone et al. 1966],
SentiWordNet [Esuli and
Sebastiani 2006],”Maryland” Dict.
[Mohammad et al. 2009], Google
Generated Dict. [Taboada et al.
2006b]
9928
Emoticons DS
(Distant
Supervision)
-1,0,1
Validation with
unlabeled twitter data
[Cha et al. 2010]
- 1162894 X
NRC Hashtag -1,0,1
Twitter (SemEval-2007
Affective Text Corpus)
[Strapparava and
Mihalcea 2007]
- 679468 X
Pattern.en <0.1,0.1] Product reviews, but the
source was not specified - 2973
SASA [Wang et al.
2012]
Negative,
Neutral,
Unsure, Positive
“Political” Tweets
labeled by “turkers”
(AMT) (unavailable)
- 21012
PANAS-t -1,0,1
Validation with
unlabeled twitter data
[Cha et al. 2010]
-50 X
EmoLex -1,0,1-
Compared with existing gold
standard data but it was not
specified
141820 X
SANN neg, neu, pos Their own dataset - Ted
Talks
Comparison with other multimedia
recommendation approaches. 8701
Sentiment140
Negative,
Neutral,
Positive
Twitter and SMS from
Semeval 2013, task 2
[Nakov et al. 2013].
Other Semeval 2013, task 2
approaches 1220176 X
SentiStrength -1,0,1
Their own datasets -
Twitter, Youtube, Digg,
Myspace, BBC Forums
and Runners World.
The best of nine Machine Learning
techniques for each test. 2698
Stanford Recursive
Deep Model
very negative,
negative,
neutral,
positive,very
positive
Movie Reviews [Pang
and Lee 2004]
Naïve Bayes and SVM with bag of
words features and bag of bigram
features.
227009
Umigon
Negative,
Neutral,
Positive
Twitter and SMS from
Semeval 2013, task 2
[Nakov et al. 2013].
[Mohammad et al. 2013] 1053
Vader -1,0,1
Their own datasets -
Twitter, Movie
Reviews, Technical
Product Reviews, NYT
User’s Opinions.
(GI)[Stone et al. 1966], LIWC,
[Tausczik and Pennebaker 2010],
SentiWordNet [Esuli and
Sebastiani 2006], ANEW [Bradley
and Lang 1999], SenticNet
[Cambria et al. 2010] and some
Machine Learning Approaches.
7517
Semantria negative,
neutral, positive not available not available not available
acteristics to determine a sentence's polarity. These approaches make use of a series of intensifiers, punctuation transformations, emoticons, and many other heuristics.
Thus, to evaluate each lexicon dictionary as the basis for a sentence-level sentiment analysis method, we considered Vader's implementation. In other words, we used Vader's code to determine whether a sentence is positive or not, considering different lexicons as dictionaries.
Vader's heuristics were proposed by means of qualitative analyses of textual properties and characteristics that affect the perceived sentiment intensity of the text. Vader's authors identified five heuristics based on grammatical and syntactical cues that convey changes to sentiment intensity and go beyond the bag-of-words model. The heuristics include treatments for: 1) punctuation (e.g., the number of '!'s); 2) capitalization (e.g., "I HATE YOU" is more intense than "i hate you"); 3) degree modifiers (e.g., "The service here is extremely good" is more intense than "The service here is good"); 4) the contrastive conjunction "but", which shifts the polarity; 5) tri-gram examination to identify negation (e.g., "The food here isn't really all that great."). We chose Vader because it is the newest method among those we considered and it is becoming widely used, having even been implemented as part of the well-known NLTK Python library5.
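For illustration, the following is a minimal sketch (not the code used in our benchmark) of applying Vader's heuristics to a single sentence through the NLTK implementation mentioned above; the 0.05 thresholds used to map the compound score to a polarity label are an assumption of this example, not part of our setup.

```python
# Minimal sketch (assumes nltk is installed and the 'vader_lexicon' resource
# has been downloaded via nltk.download('vader_lexicon')).
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("The food here isn't really all that great!")

# Map the continuous compound score to a polarity label; the 0.05 cut-offs
# below are an assumption of this sketch.
if scores["compound"] >= 0.05:
    label = "positive"
elif scores["compound"] <= -0.05:
    label = "negative"
else:
    label = "neutral"
print(label, scores)
```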
We applied such heuristics with the following lexicons: Emolex, EmoticonsDS, NRC Hashtag, Opinion Lexicon, Panas, Sentiment 140, SentiWordNet. We noticed that this strategy drastically improved the results of the lexicons for sentence-level sentiment analysis in comparison with a simple baseline approach that averages the occurrence of positive and negative words to classify the polarity of a sentence (sketched below). Table II also has a column Lexicon that describes the number of terms the proposed dictionary contains, and column C (changed) indicates the methods we slightly modified to adapt their output formats to the polarity detection task.
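The sketch below illustrates, under assumed toy lexicons, the simple word-counting baseline mentioned above; the word lists are placeholders and not any method's actual dictionary.

```python
# Minimal sketch of a word-counting baseline: classify a sentence by comparing
# counts of positive and negative lexicon words. The tiny lexicons here are
# illustrative placeholders only.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "poor"}

def baseline_polarity(sentence: str) -> int:
    """Return 1 (positive), -1 (negative), or 0 (neutral/undefined)."""
    tokens = sentence.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos > neg:
        return 1
    if neg > pos:
        return -1
    return 0

print(baseline_polarity("The service here is extremely good"))  # -> 1
```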
Some other methods required similar adaptations. Methods that are based on machine learning,
like SASA and SentiStrength, are used here as unsupervised approaches as their trained models were
released by the authors and they have been used in other efforts as tools that require no training data.
We plan to release all the codes used in this article, except for paid software like LIWC and SentiStrength, as an attempt to allow reproducibility as well as possible corrections of our decisions.
There are a few other methods for sentiment detection proposed in the literature and not consid-
ered here. Most of them consist of variations of the techniques used by the above methods, such as
WordNet-Affect[Valitutti 2004] and ANEW [Bradley and Lang 1999] (the same used by Happiness
Index, SentiWordNet, SenticNet, etc.). Finally, there exist a few other methods which are not available on the Web or upon request and that could not be re-implemented based on their descriptions in the original papers (e.g., Profile of Mood States (POMS) [Bollen et al. 2009]).
From Table II we can also note that the validation strategies, the datasets used, and the baseline comparisons of these methods vary greatly, from toy examples to large labeled datasets. PANAS-t and Happiness Index validate their methods by presenting evaluations of
events in which some bias towards positivity and negativity would be expected. Panas-t is tested with
unlabeled twitter data related to Michael Jackson’s death and the release of a Harry Potter movie
whereas Happiness Index was used to measure the happiness of song lyrics from 1967 to 2007. Lexical dictionaries were validated in very different ways. AFINN [Nielsen 2011] compared its lexicon with other dictionaries. Emoticons DS (Distant Supervision) [Hannak et al. 2012] used the Pearson correlation between human labeling and the predicted value. SentiWordNet [Esuli and Sebastiani 2006]
validated the proposed dictionary through comparisons with other dictionaries, but also used human validation of the proposed lexicon. These efforts attempt to validate the created lexicon, without evaluating the lexicon as a sentiment analysis method itself. Vader [Hutto and Gilbert 2014] compared its results with lexical approaches on labeled datasets from different social media sources.
SenticNet [Cambria et al. 2010] was compared with SentiStrength [Thelwall 2013] with a specific
dataset related to patient opinions, which could not be made available. Stanford Recursive Deep
Model [Socher et al. 2013] and SentiStrength [Thelwall 2013] were both compared with standard
machine learning approaches, with their own datasets.
5 http://www.nltk.org/_modules/nltk/sentiment/vader.html
Table III. Labeled datasets.
Dataset  Nomenclature  # Msgs  # Pos  # Neg  # Neu  Average # of phrases  Average # of words  Annotators Expertise  # of Annotators  R (%)
Comments (BBC) Comments BBC 1,000 99 653 248 3,98 64,39 NonExpert 3 87,0
[Thelwall 2013]
Comments (Digg) Comments Digg 1,077 210 572 295 2,50 33,97 Non Expert 3 88,0
[Thelwall 2013]
Comments (NYT) Comments NYT 5,190 2,204 2,742 244 1,01 17,76 AMT 20 88,0
[Hutto and Gilbert 2014]
Comments (TED) Comments TED 839 318 409 112 1 16,95 Non Expert 6 82,0
[Pappas and Popescu-Belis 2013]
Comments (Youtube) Comments YTB 3,407 1,665 767 975 1,78 17,68 Non Expert 3 90,0
[Thelwall 2013]
Movie-reviews Reviews I 10,662 5,331 5,331 - 1,15 18,99 User - 66,0
[Pang and Lee 2004] Rating
Movie-reviews Reviews II 10,605 5,242 5,326 37 1,12 19,33 AMT 20 97,0
[Hutto and Gilbert 2014]
Myspace posts Myspace 1,041 702 132 207 2,22 21,12 NonExpert 3 91,0
[Thelwall 2013]
Product reviews Amazon 3,708 2,128 1,482 98 1,03 16,59 AMT 20 94,0
[Hutto and Gilbert 2014]
Tweets (Debate) Tweets DBT 3,238 730 1249 1259 1,86 14,86 AMT + Undef. 60
[Diakopoulos and Shamma 2010] Expert
Tweets (Irony) Irony 100 38 43 19 1,01 17,44 Expert 3 -
(Labeled by us)
Tweets (Sarcasm) Sarcasm 100 38 38 24 1 15,55 Expert 3 -
(Labeled by us)
Tweets (Random) Tweets RND I 4,242 1,340 949 1953 1,77 15,81 Non Expert 3 88,0
[Thelwall 2013]
Tweets (Random) Tweets RND II 4,200 2,897 1,299 4 1,87 14,10 AMT 20 97,5
[Hutto and Gilbert 2014]
Tweets (Random) Tweets RND III 3,771 739 488 2,536 1,54 14,32 AMT 3 90,0
[Narr et al. 2012]
Tweets (Random) Tweets RND IV 500 139 119 222 1,90 15,44 Expert Undef. 90,0
[Aisopos 2014]
Tweets (Specific domains w/ emot.) Tweets STF 359 182 177 - 1,0 15,1 NonExpert Undef. 97,0
[Go et al. 2009]
Tweets (Specific topics) Tweets SAN 3737 580 654 2503 1,60 15,03 Expert 1 97,0
[Sanders 2011]
Tweets (Semeval2013 Task2) Tweets Semeval 6,087 2,223 837 3027 1,86 20,05 AMT 5 100,0
[Nakov et al. 2013]
Runners World forum RW 1,046 484 221 341 4,79 66,12 Non Expert 3 86,0
[Thelwall 2013]
This scenario, where every newly developed solution compares itself with different solutions using different datasets, happens because there is no standard benchmark for evaluating new methods. This problem is exacerbated because many methods have been proposed in different research communities (e.g., NLP, Information Science, Information Retrieval, Machine Learning), exploiting different techniques, with little knowledge of related efforts in other communities. Next, we describe how
we created a large gold standard to properly compare all the considered sentiment analysis methods.
4. GOLD STANDARD DATA
A key aspect in evaluating sentiment analysis methods consists of using an accurate gold standard
(datasets). Several existing efforts have generated labeled data produced by expert or non-expert evaluators. Previous studies suggest that both approaches are valid, as non-expert labeling may be as effective as annotations produced by experts for affect recognition, a closely related task [Snow et al. 2008]. Thus, our effort to build a large and representative gold standard dataset consists of obtaining labeled data from trustworthy previous works that cover a wide range of sources and kinds of data. We
also attempt to assess the “quality” of our gold standard in terms of the accuracy of the labeling
process.
Table III summarizes the main characteristics of twenty of the exploited datasets, such as number
of messages and the average number of words per message in each dataset. It also defines a simpler
nomenclature that is used in the remainder of this paper. The table also presents the methodology
employed in the classification. Human labeling was implemented in almost all datasets, usually with the use of non-expert reviewers. The Reviews I dataset relies on five-star ratings, in which users rate and provide a comment about an entity of interest (e.g., a movie or an establishment).
Labeling based on Amazon Mechanical Turk (AMT) was used in seven out of the twenty datasets,
while volunteers and other strategies that involve non-expert evaluators were used in ten datasets.
Usually, an agreement strategy (i.e. majority voting) is applied to ensure that, in the end, each
sentence has an agreed-upon polarity assigned to it. The number of annotators used to build the
datasets is also shown in Table III.
Tweets DBT was the only dataset built with a combination of AMT labeling and expert validation. Its authors selected 200 random tweets to be classified by experts and compared them with the AMT results to ensure accurate ratings. We note that the Tweets Semeval dataset was provided as a list of Twitter IDs, due to Twitter's policies on data sharing. While crawling the respective tweets, a small part of them could not be accessed, as they had been deleted. In any case, we plan to release all gold standard datasets on a request basis, which is in agreement with Twitter's policies.
In order to assess the extent to which these datasets are trustworthy, we used a strategy similar to the one used by Tweets DBT. Our goal was not to redo all the human evaluation performed, but simply to inspect a small sample of it to infer the level of agreement with our own evaluation. We randomly selected 1% of all sentences to be evaluated by experts (two of the authors) as an attempt to assess whether these gold standard data are really trustworthy. It is important to mention that we do not have access to the instructions provided by the authors. We also could not get access to a small amount of the raw data in a few datasets, which was discarded. Finally, our manual inspection unveiled a few sentences in languages other than English in a few datasets, such as Tweets STA and TED, which were obviously discarded.
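A minimal sketch of this quality check, under assumed data structures (lists of (text, label) pairs), is shown below; the sampling fraction mirrors the 1% used in our inspection.

```python
# Minimal sketch of the gold standard quality check described above:
# draw a 1% random sample and measure agreement with an expert re-annotation.
import random

def sample_for_inspection(sentences, fraction=0.01, seed=42):
    """sentences: list of (text, gold_label) pairs; returns a random sample."""
    random.seed(seed)
    k = max(1, int(len(sentences) * fraction))
    return random.sample(sentences, k)

def agreement(gold_labels, expert_labels):
    """Fraction of sentences where the expert label matches the gold label."""
    matches = sum(g == e for g, e in zip(gold_labels, expert_labels))
    return matches / len(gold_labels)
```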
Column R from Table III exhibits the level of agreement for each dataset in our evaluation. After a closer look at the cases in which we disagreed with the evaluations in the gold standard, we understand that other interpretations could be given to the text, and we found cases of sentences with mixed polarity. Some of them are strongly linked to context and very hard to evaluate. Some NYT comments, for instance, are directly related to the news articles they were posted on. We can also note that some of the datasets
do not contain neutral messages. This might be a characteristic of the data or even a result of how
annotators were instructed to label their pieces of text. Most of the cases of disagreement involve
neutral messages. Thus, we considered these cases as well as the amount of disagreement we had
with the gold standard data as reasonable and expected.
5. COMPARISON RESULTS
Next, we present comparison results for the twenty one methods considered in this paper based on
the twenty considered gold standard datasets.
5.1. Experimental details
At least three distinct approaches have been proposed to deal with sentiment analysis of sentences. The first of them splits this task into two steps: (i) identifying sentences with no sentiment, also named objective or neutral sentences, and then (ii) detecting the polarity (positive or negative) only of the subjective sentences. Another common way to detect sentence polarity considers three distinct classes (positive, negative, and neutral) in a single task. Finally, some methods classify a sentence as positive or negative only, assuming that only polarized sentences are present, given the context of a given application. For example, product reviews are expected to contain only polarized opinions.
Aiming at providing a more thorough comparison among these distinct approaches, we perform
two rounds of tests. In the first, we consider the performance of the methods in identifying 3 classes (positive, negative, and neutral). The second considers only positive and negative as output and assumes
that a first step of removing the neutral messages was already performed. In the 3-classes experi-
ments we used only datasets containing a considerable number of neutral messages (which excludes
Tweets RND II, Amazon and Reviews II that contain an insignificant number of neutral sentences).
Despite being 2-classes methods, as highlighted in Table IV, we decided to include LIWC, Emoti-
cons and Senticnet in the 3-classes experiments to present a full set of comparative experiments.
LIWC, Emoticons and Senticnet cannot define, for some sentences, their positive or negative polarity, considering it as undefined. This occurs due to the absence in the sentence of emoticons (in the case of the Emoticons method) or of words belonging to the methods' sentiment lexicons. As a neutral (objective) sentence is one that expresses no sentiment at all about a topic, we assumed, in the case of these 2-class methods, undefined polarities to be equivalent to neutral ones.
The 2-classes experiments, in turn, were performed with all datasets described in Table III without
the neutral sentences. We also included all methods in these experiments, even those that produce
neutral outputs. As discussed before, when 2-class methods cannot detect the polarity (positive or negative) of a sentence, they usually assign it an undefined polarity. As we know that all sentences in the 2-class experiments are positive or negative, we created the coverage metric to determine the percentage of sentences a method can in fact classify as positive or negative. For instance, suppose that the Emoticons method can classify only 10% of the sentences in a dataset, corresponding to the actual percentage of sentences with emoticons. This means that the coverage of this method on this specific dataset is 10%. Note that coverage is quite an important metric for a more complete evaluation in the 2-class experiments: even though Emoticons presents high accuracy for the classified phrases, in this example it would not be able to make a prediction for 90% of the sentences. More formally, coverage is calculated as the total number of sentences minus the number of undefined sentences, divided by the total number of sentences, where the number of undefined sentences includes neutral outputs for 3-class methods.
Coverage = (#Sentences - #Undefined) / #Sentences
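A minimal sketch of this computation (assuming predictions encoded as 1, -1, and 0 for undefined or neutral outputs):

```python
# Minimal sketch of the coverage metric defined above: the fraction of
# sentences for which a method emits a defined (positive/negative) polarity.
def coverage(predictions):
    """predictions: list of labels in {1, -1, 0}, where 0 means undefined."""
    undefined = sum(1 for p in predictions if p == 0)
    return (len(predictions) - undefined) / len(predictions)

print(coverage([1, -1, 0, 0, 1, 0, -1, 0, 0, 0]))  # -> 0.4
```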
5.2. Comparison Metrics
Considering the 3-classes comparison experiments, we used the traditional Precision, Recall and F1
measures for the automated classification.
                        Predicted
                 Positive  Neutral  Negative
Actual Positive      a        b        c
       Neutral       d        e        f
       Negative      g        h        i
Each letter in the above table represents the number of instances that are actually in class X and predicted as class Y, where X, Y ∈ {positive, neutral, negative}. The recall (R) of a class X is the ratio of the number of elements correctly classified as X to the number of known elements in class X. The precision (P) of a class X is the ratio of the number of elements correctly classified as X to the total predicted as class X. For example, the precision of the negative class is computed as P(neg) = i / (c + f + i); its recall as R(neg) = i / (g + h + i); and the F1 measure is the harmonic mean of precision and recall. In this case, F1(neg) = 2 * P(neg) * R(neg) / (P(neg) + R(neg)).
We also compute the overall accuracy as A = (a + e + i) / (a + b + c + d + e + f + g + h + i). It considers equally important the correct classification of each sentence, independently of the class, and basically measures the capability of the method to predict the correct output. A variation of F1, namely macro-F1, is
normally reported to evaluate classification effectiveness on skewed datasets. Macro-F1 values are
computed by first calculating F1 values for each class in isolation, as exemplified above for negative,
and then averaging over all classes. Macro-F1 considers equally important the effectiveness in each
class, independently of the relative size of the class. Thus, accuracy and Macro-F1 provide com-
plementary assessments of the classification effectiveness. Macro-F1 is especially important when
the class distribution is very skewed, to verify the capability of the method to perform well in the
smaller classes.
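As an illustration, the following minimal sketch (with purely illustrative counts) computes these metrics from a confusion matrix whose rows are actual classes and columns are predicted classes; the same functions apply unchanged to the 2-class case by using a 2x2 matrix.

```python
# Minimal sketch of the metrics above, computed from a confusion matrix
# (rows = actual classes, columns = predicted classes, here ordered
# positive, neutral, negative). The counts are illustrative only.
def per_class_metrics(cm):
    n = len(cm)
    metrics = []
    for k in range(n):
        predicted_k = sum(cm[i][k] for i in range(n))   # column total
        actual_k = sum(cm[k])                            # row total
        p = cm[k][k] / predicted_k if predicted_k else 0.0
        r = cm[k][k] / actual_k if actual_k else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        metrics.append((p, r, f1))
    return metrics

def accuracy(cm):
    return sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

def macro_f1(cm):
    return sum(f1 for _, _, f1 in per_class_metrics(cm)) / len(cm)

cm = [[50, 10, 5],   # actual positive
      [8, 30, 7],    # actual neutral
      [4, 6, 40]]    # actual negative
print(accuracy(cm), macro_f1(cm))
```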
The described metrics can be easily computed for the 2-class experiments by simply removing the neutral column and row, as shown below. In this case, the precision of the positive class is computed as P(pos) = a / (a + c); its recall as R(pos) = a / (a + b); and its F1 as F1(pos) = 2 * P(pos) * R(pos) / (P(pos) + R(pos)).
As we have a large number of combinations among base methods, metrics, and datasets, a global analysis of the performance of all these combinations is not an easy task. We propose a simple but
                        Predicted
                 Positive  Negative
Actual Positive      a        b
       Negative      c        d
informative measure to assess the overall performance ranking. The Mean Ranking is basically the
sum of ranks obtained by a method in each dataset divided by the total number of datasets, as per
below:
MR = ( sum over i=1..nd of r_i ) / nd

where nd is the number of datasets and r_i is the rank of the method for dataset i. It is important to notice that the rank was calculated based on Macro-F1.
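A minimal sketch of this computation, with illustrative Macro-F1 values only:

```python
# Minimal sketch of the Mean Ranking described above: rank the methods within
# each dataset by Macro-F1 (rank 1 = best) and average each method's rank
# across datasets. The Macro-F1 values below are illustrative placeholders.
macro_f1 = {                      # dataset -> {method: Macro-F1}
    "Tweets SAN": {"A": 0.80, "B": 0.70, "C": 0.60},
    "Comments BBC": {"A": 0.55, "B": 0.65, "C": 0.50},
}

ranks = {m: [] for m in next(iter(macro_f1.values()))}
for scores in macro_f1.values():
    ordered = sorted(scores, key=scores.get, reverse=True)
    for position, method in enumerate(ordered, start=1):
        ranks[method].append(position)

mean_rank = {m: sum(r) / len(r) for m, r in ranks.items()}
print(mean_rank)  # e.g. {'A': 1.5, 'B': 1.5, 'C': 3.0}
```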
The last evaluation metric we exploit is Friedman's Test [Berenson et al. 2014]. It allows us to verify whether, in a specific experiment, the observed values are globally similar. In other words, are the methods presenting similar performance across different datasets? To exemplify the application of this test, suppose that n restaurants are each rated by k judges. The question that arises is: are the judges' ratings consistent with each other, or are they following completely different patterns? The application in our context is very similar: the datasets play the role of the restaurants and the macro-F1 achieved by a method is the rating from a judge.
Friedman's Test is applied to rankings. Thus, to proceed with this statistical test, we sort the methods for each dataset using the macro-F1 metric for comparison. In other words, the method with the highest macro-F1 received rank '1' while the method with the lowest macro-F1 was ranked '21' for each dataset.
More formally, the Friedman's rank test statistic is defined as:

F_R = [ 12 / ( r * c * (c + 1) ) ] * ( sum over j=1..c of R_j^2 ) - 3 * r * (c + 1)

where
R_j^2 = square of the total of the ranks for group j (j = 1, 2, ..., c)
r = number of blocks
c = number of groups
In our case, the number of blocks corresponds to the number of datasets and the number of groups is the number of methods evaluated. As the number of blocks increases, the statistical test can be approximated by the chi-square distribution with c - 1 degrees of freedom. Then, if the computed F_R value is greater than the critical value for the chi-square distribution, the null hypothesis is rejected. This null hypothesis states that the ranks obtained by the judges are globally similar, so rejecting the null hypothesis means that there are significant differences among the judgment ranks (datasets). It is important to note that, in general, the critical value is obtained with significance level α = 0.05. In summary, the null hypothesis should be rejected if F_R > χ²(α), where χ²(α) is the critical value verified in the chi-square distribution table with c - 1 degrees of freedom and α equal to 0.05.
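As an illustration, the sketch below applies the Friedman test in the way described above, treating each dataset as a block and each method as a group; the Macro-F1 values are illustrative only, and the SciPy routine is used here in place of computing F_R by hand.

```python
# Minimal sketch of the Friedman test: each "block" is a dataset and each
# "group" is a method; the Macro-F1 values below are illustrative placeholders.
from scipy.stats import friedmanchisquare

# Macro-F1 of three hypothetical methods over five datasets (blocks).
method_a = [0.55, 0.60, 0.48, 0.70, 0.52]
method_b = [0.50, 0.66, 0.45, 0.64, 0.58]
method_c = [0.40, 0.42, 0.38, 0.50, 0.44]

statistic, p_value = friedmanchisquare(method_a, method_b, method_c)

# Reject the null hypothesis (globally similar ranks) when p < alpha.
alpha = 0.05
print(statistic, p_value, p_value < alpha)
```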
5.3. Comparing Prediction Performance
We start the analysis of our experiments by comparing the results of all metrics previously discussed
for all datasets. Table V and Table IV present accuracy, precision, and macro-F1 for all methods on 4 datasets for the 3-class and 2-class experiments, respectively. For simplicity, results for the other 16 datasets are presented in the appendix.
Table IV. 2-classes experiments results with 4 datasets
Dataset Method Accur. Posit. Sentiment Negat. Sentiment MacroF1 Coverage
P R F1 P R F1
AFINN 77.85 72.73 86.72 79.11 84.83 69.54 76.43 77.77 69.94
Emolex 72.06 65.42 85.67 74.19 82.61 60.06 69.55 71.87 60.04
Emoticons 85.71 87.50 89.74 88.61 82.61 79.17 80.85 84.73 5.77
Emoticons DS 47.20 47.13 99.21 63.90 55.56 0.88 1.74 32.82 98.08
Happiness Index 64.23 59.63 89.92 71.70 78.81 38.11 51.38 61.54 45.10
NRC Hashtag 69.77 72.01 58.71 64.69 68.36 79.63 73.57 69.13 93.68
LIWC 77.27 71.37 92.05 80.40 88.38 62.10 72.95 76.67 63.70
Tweets Opinion Finder 68.60 63.54 71.35 67.22 73.80 66.35 69.87 68.55 34.74
SAN Opinion Lexicon 81.77 78.18 86.74 82.24 85.98 77.05 81.27 81.75 65.35
PANAS-t 84.62 80.00 80.00 80.00 87.50 87.50 87.50 83.75 2.38
Pattern.en 74.62 68.46 90.59 77.98 86.40 58.90 70.04 74.01 72.59
SANN 70.02 64.40 85.95 73.63 80.46 54.90 65.27 69.45 45.55
SASA 61.18 61.74 74.71 67.61 60.12 45.16 51.58 59.59 43.45
Semantria 78.91 76.42 82.85 79.50 81.79 75.08 78.29 78.90 57.38
SenticNet 66.51 65.31 57.73 61.29 67.33 73.96 70.49 65.89 77.45
Sentiment140 72.45 0.00 0.00 0.00 72.45 100.00 84.02 42.01 53.90
SentiStrength 89.47 88.70 91.81 90.23 90.41 86.84 88.59 89.41 29.61
SentiWordNet 67.49 64.25 75.63 69.48 71.90 59.70 65.23 67.35 59.21
SO-CAL 79.70 74.74 84.55 79.34 85.11 75.56 80.05 79.70 68.19
Stanford DM 62.72 87.60 22.60 35.93 59.35 97.25 73.71 54.82 92.94
Umigon 82.41 83.66 80.49 82.04 81.25 84.32 82.76 82.40 67.74
Vader 77.18 71.81 88.15 79.15 85.34 66.59 74.81 76.98 78.74
AFINN 80.32 79.38 90.84 84.72 82.39 64.47 72.34 78.53 71.47
Emolex 73.28 74.49 83.19 78.60 70.93 59.03 64.43 71.52 59.02
Emoticons 85.43 90.27 90.27 90.27 71.05 71.05 71.05 80.66 13.19
Emoticons DS 58.95 58.84 99.63 73.99 72.22 1.37 2.70 38.34 99.83
Happiness Index 68.28 66.75 93.60 77.93 76.23 30.58 43.65 60.79 60.46
NRC Hashtag 65.61 72.95 65.91 69.26 57.31 65.19 61.00 65.13 95.54
LIWC 59.92 62.41 79.37 69.87 52.66 32.43 40.14 55.01 54.61
Tweets Opinion Finder 77.16 82.14 78.04 80.04 70.90 75.92 73.32 76.68 40.37
RND I Opinion Lexicon 81.56 82.00 87.68 84.74 80.84 72.98 76.71 80.73 63.74
PANAS-t 85.45 91.18 86.11 88.57 76.19 84.21 80.00 84.29 4.81
Pattern.en 78.02 79.84 85.52 82.58 74.60 66.33 70.22 76.40 77.72
SANN 75.61 75.48 87.04 80.85 75.89 59.06 66.43 73.64 50.15
SASA 65.60 70.72 70.36 70.54 58.47 58.89 58.68 64.61 58.67
Semantria 83.98 85.94 87.75 86.83 80.85 78.28 79.54 83.19 58.63
SenticNet 70.90 75.79 70.92 73.28 65.45 70.87 68.05 70.66 78.51
Sentiment140 70.19 0.00 0.00 0.00 70.19 100.00 82.49 41.24 40.45
SentiStrength 93.72 94.26 96.33 95.28 92.61 88.68 90.60 92.94 27.13
SentiWordNet 70.70 76.03 77.64 76.83 61.27 59.11 60.17 68.50 62.78
SO-CAL 80.85 82.08 85.98 83.98 78.92 73.66 76.20 80.09 64.57
Stanford DM 54.19 87.02 25.40 39.33 47.44 94.67 63.21 51.27 92.70
Umigon 82.07 89.22 80.71 84.76 73.02 84.26 78.24 81.50 67.50
Vader 80.12 78.73 91.76 84.75 83.39 62.52 71.46 78.10 81.08
AFINN 84.42 80.62 91.49 85.71 89.66 77.04 82.87 84.29 76.88
Emolex 79.65 76.09 88.98 82.03 85.23 69.44 76.53 79.28 62.95
Emoticons 85.42 80.65 96.15 87.72 94.12 72.73 82.05 84.89 13.37
Emoticons DS 51.96 51.41 100.00 67.91 100.00 2.27 4.44 36.18 99.72
Happiness Index 65.93 58.72 94.39 72.40 88.89 40.34 55.49 63.95 62.95
NRC Hashtag 71.30 73.05 70.93 71.98 69.51 71.70 70.59 71.28 92.20
LIWC 64.29 63.75 76.12 69.39 65.22 50.85 57.14 63.27 70.39
Tweets Opinion Finder 80.77 81.16 76.71 78.87 80.46 84.34 82.35 80.61 43.45
STF Opinion Lexicon 86.10 83.67 91.11 87.23 89.29 80.65 84.75 85.99 72.14
PANAS-t 94.12 88.89 100.00 94.12 100.00 88.89 94.12 94.12 4.74
Pattern.en 77.85 75.69 85.09 80.12 80.95 69.86 75.00 77.56 85.52
SANN 73.21 69.35 82.69 75.44 78.82 63.81 70.53 72.98 58.22
SASA 68.52 65.65 78.90 71.67 72.94 57.94 64.58 68.12 60.17
Semantria 88.45 89.15 88.46 88.80 87.70 88.43 88.07 88.43 69.92
SenticNet 70.49 71.31 63.50 67.18 69.88 76.82 73.19 70.18 80.22
Sentiment140 75.53 0.00 0.00 0.00 75.53 100.00 86.06 43.03 52.37
SentiStrength 95.33 95.18 96.34 95.76 95.52 94.12 94.81 95.29 41.78
SentiWordNet 72.99 73.17 78.95 75.95 72.73 65.98 69.19 72.57 58.77
SO-CAL 87.36 82.89 93.33 87.80 92.80 81.69 86.89 87.35 77.16
Stanford DM 66.56 87.69 36.31 51.35 61.24 95.18 74.53 62.94 89.97
Umigon 86.99 91.73 81.88 86.52 83.02 92.31 87.42 86.97 81.34
Vader 94.12 100.00 90.48 95.00 86.67 100.00 92.86 93.93 9.47
AFINN 66.56 23.08 81.08 35.93 96.32 64.66 77.38 56.65 85.11
Emolex 59.64 21.52 89.04 34.67 97.38 55.62 70.80 52.73 80.72
Emoticons 33.33 0.00 0.00 0.00 100.00 33.33 50.00 25.00 0.40
Emoticons DS 13.33 13.10 100.00 23.17 100.00 0.31 0.61 11.89 99.73
Happiness Index 41.81 15.65 95.52 26.89 98.41 35.03 51.67 39.28 79.52
NRC Hashtag 84.45 33.33 25.27 28.75 89.76 92.83 91.27 60.01 97.47
LIWC 50.10 15.38 58.33 24.35 88.00 48.78 62.77 43.56 69.55
Comments Opinion Finder 74.43 21.74 62.50 32.26 94.93 75.72 84.24 58.25 76.46
BBC Opinion Lexicon 74.14 29.81 84.93 44.13 97.24 72.66 83.17 63.65 80.72
PANAS-t 58.73 20.00 75.00 31.58 93.94 56.36 70.45 51.02 8.38
Pattern.en 61.09 20.00 70.73 31.18 93.48 59.72 72.88 52.03 87.50
SANN 54.34 19.09 88.06 31.38 96.88 49.80 65.78 48.58 75.13
SASA 61.61 23.50 66.20 34.69 90.80 60.77 72.81 53.75 61.30
Semantria 83.43 40.00 84.75 54.35 97.64 83.26 89.88 72.11 67.42
SenticNet 66.07 24.44 74.16 36.77 94.24 64.83 76.81 56.79 88.96
Sentiment140 92.56 0.00 0.00 0.00 92.56 100.00 96.14 48.07 51.86
SentiStrength 93.93 64.29 78.26 70.59 97.72 95.54 96.61 83.60 32.85
SentiWordNet 57.49 20.00 88.06 32.60 97.13 53.45 68.96 50.78 76.33
SO-CAL 75.28 28.93 80.28 42.54 96.71 74.64 84.25 63.40 82.85
Stanford DM 89.45 63.16 40.91 49.66 91.81 96.52 94.11 71.88 92.02
Umigon 79.37 39.13 61.02 47.68 92.10 82.72 87.15 67.42 50.93
Vader 62.19 22.12 85.54 35.15 96.77 59.02 73.32 54.23 92.15
Table V. 3-classes experiments results with 4 datasets
Dataset Method Accur. Posit. Sentiment Negat. Sentiment Neut. Sentiment MacroF1
P R F1 P R F1 P R F1
AFINN 62.36 61.10 70.09 65.28 44.08 31.91 37.02 71.43 58.57 64.37 55.56
Emolex 48.74 48.15 62.71 54.47 31.27 17.71 22.61 57.90 41.30 48.21 41.76
Emoticons 52.88 72.83 11.34 19.62 55.56 32.37 40.91 34.05 96.53 50.34 36.96
Emoticons DS 36.59 36.55 100.00 53.53 75.00 0.08 0.16 100.00 0.03 0.07 17.92
Happiness Index 48.81 43.61 65.27 52.29 36.96 7.54 12.53 36.82 45.16 40.56 35.13
NRC Hashtag 36.95 42.04 75.03 53.88 24.57 16.94 20.05 53.33 3.70 6.92 26.95
LIWC 39.54 36.52 42.33 39.21 15.14 6.25 8.84 48.64 44.83 46.66 31.57
Tweets Opinion Finder 57.63 67.57 27.94 39.53 40.75 48.62 44.34 58.20 86.06 69.44 51.10
Semeval Opinion Lexicon 60.37 62.09 62.71 62.40 41.19 34.18 37.36 66.41 60.75 63.46 54.41
PANAS-t 53.08 90.95 9.04 16.45 51.56 62.26 56.41 51.65 99.01 67.89 46.92
Pattern.en 50.19 58.07 68.47 62.84 24.68 29.82 27.01 67.73 35.22 46.34 45.40
SANN 54.77 52.72 47.59 50.02 38.91 20.92 27.21 58.95 66.90 62.67 46.64
SASA 50.63 46.34 47.77 47.04 33.07 12.14 17.76 56.39 61.12 58.66 41.15
Semantria 61.54 67.28 57.35 61.92 39.57 41.62 40.57 65.98 67.03 66.50 56.33
SenticNet 49.68 51.85 1.26 2.46 29.79 35.00 32.18 49.82 98.51 66.17 33.60
Sentiment140 42.25 0.00 0.00 0.00 26.79 100.00 42.25 50.57 66.14 57.31 33.19
SentiStrength 57.83 78.01 27.13 40.25 47.80 53.55 50.52 55.49 89.89 68.62 53.13
SentiWordNet 48.33 55.54 53.44 54.47 19.67 24.82 21.95 61.22 47.57 53.54 43.32
SO-CAL 58.83 58.89 59.02 58.95 40.39 33.14 36.41 39.89 59.96 47.91 47.76
Stanford DM 22.54 72.14 18.17 29.03 14.92 82.93 25.28 47.19 6.94 12.10 22.14
Umigon 65.88 75.18 56.14 64.28 39.66 53.18 45.44 70.65 75.78 73.13 60.95
Vader 60.05 56.08 79.26 65.68 44.13 26.60 33.19 76.88 46.02 57.57 52.15
AFINN 64.41 40.81 72.12 52.13 49.67 28.29 36.05 85.95 62.54 72.40 53.53
Emolex 54.76 31.67 59.95 41.44 40.14 19.53 26.27 77.48 54.64 64.08 43.93
Emoticons 70.22 70.06 16.78 27.07 65.62 44.21 52.83 41.29 97.56 58.02 45.98
Emoticons DS 20.34 19.78 99.46 33.00 62.07 0.60 1.19 53.85 0.55 1.09 11.76
Happiness Index 55.16 29.13 61.98 39.64 50.65 9.50 16.01 43.35 59.16 50.03 35.23
NRC Hashtag 30.47 28.25 77.40 41.39 24.18 19.59 21.64 79.08 8.77 15.78 26.27
LIWC 46.88 21.85 38.43 27.86 19.18 8.05 11.34 69.51 54.83 61.31 33.50
Tweets Opinion Finder 71.55 57.48 32.75 41.72 49.85 48.56 49.20 75.95 89.90 82.34 57.75
RND III Opinion Lexicon 63.86 40.65 66.17 50.36 48.84 27.73 35.38 81.96 64.66 72.29 52.68
PANAS-t 68.79 79.49 8.39 15.18 48.57 51.52 50.00 68.75 98.86 81.10 48.76
Pattern.en 53.57 36.25 76.86 49.26 35.19 22.50 27.45 84.20 45.68 59.23 45.31
SANN 66.88 42.70 48.71 45.51 46.35 26.93 34.07 77.99 77.99 77.99 52.52
SASA 55.37 29.42 54.53 38.22 42.46 19.28 26.52 78.30 57.15 66.08 43.60
Semantria 68.89 48.86 63.73 55.31 49.82 35.47 41.44 82.02 72.96 77.22 57.99
SenticNet 29.97 31.08 74.83 43.92 20.98 22.75 21.83 79.70 8.49 15.35 27.03
Sentiment140 55.05 0.00 0.00 0.00 28.14 100.00 43.92 71.14 66.00 68.47 37.46
SentiStrength 73.80 70.94 41.95 52.72 57.53 49.80 53.39 75.35 92.26 82.95 63.02
SentiWordNet 55.85 37.42 58.19 45.55 24.04 19.57 21.58 79.25 59.00 67.64 44.92
SO-CAL 66.51 43.06 68.88 52.99 51.84 30.55 38.44 45.77 66.94 54.37 48.60
Stanford DM 31.90 64.48 38.57 48.26 15.58 72.55 25.65 75.64 19.77 31.35 35.09
Umigon 74.12 57.67 70.23 63.33 48.83 46.71 47.75 88.80 76.34 82.10 64.39
Vader 59.82 37.52 81.73 51.43 47.99 24.25 32.22 89.26 52.28 65.94 49.86
AFINN 50.10 16.22 60.61 25.59 82.62 54.14 65.42 40.11 30.24 34.48 41.83
Emolex 44.10 15.51 65.66 25.10 83.19 45.62 58.93 35.27 31.85 33.47 39.17
Emoticons 24.60 0.00 0.00 0.00 33.33 25.00 28.57 19.77 98.79 32.95 20.51
Emoticons DS 10.00 9.85 98.99 17.92 66.67 0.22 0.44 0.00 0.00 0.00 9.18
Happiness Index 33.60 11.83 64.65 20.00 84.93 28.05 42.18 26.46 34.68 30.02 30.73
NRC Hashtag 64.00 20.72 23.23 21.90 70.20 87.13 77.76 52.50 8.47 14.58 38.08
LIWC 33.00 11.11 42.42 17.61 67.69 39.57 49.94 22.90 27.42 24.95 30.84
Comments Opinion Finder 51.80 14.96 35.35 21.02 78.76 66.39 72.04 33.71 36.29 34.95 42.67
BBC Opinion Lexicon 55.00 20.67 62.63 31.08 85.27 61.98 71.79 40.82 40.32 40.57 47.81
PANAS-t 27.10 16.67 6.06 8.89 75.61 50.82 60.78 25.35 94.35 39.97 36.55
Pattern.en 46.00 14.39 58.59 23.11 77.30 49.93 60.67 38.16 23.39 29.00 37.59
SANN 40.10 14.50 59.60 23.32 79.49 41.61 54.63 33.45 37.90 35.54 37.83
SASA 38.20 17.03 47.47 25.07 70.75 50.86 59.18 25.19 39.52 30.77 38.34
Semantria 56.00 28.90 50.51 36.76 83.82 75.20 79.28 35.86 55.24 43.49 53.18
SenticNet 47.10 17.74 66.67 28.03 72.87 55.13 62.77 25.89 11.69 16.11 35.64
Sentiment140 50.60 0.00 0.00 0.00 73.23 100.00 84.54 28.60 58.47 38.41 40.98
SentiStrength 44.20 47.37 18.18 26.28 86.64 91.45 88.98 29.37 84.68 43.61 52.96
SentiWordNet 42.40 14.90 59.60 23.84 81.63 44.57 57.66 34.56 37.90 36.15 39.22
SO-CAL 55.50 20.88 57.58 30.65 80.47 65.61 72.28 28.57 34.68 31.33 44.75
Stanford DM 65.50 43.37 36.36 39.56 71.01 92.54 80.36 37.50 14.52 20.93 46.95
Umigon 45.70 28.35 36.36 31.86 76.35 74.65 75.49 29.31 61.69 39.74 49.03
Vader 49.10 15.96 71.72 26.10 82.57 49.05 61.54 50.42 24.19 32.70 40.11
AFINN 42.45 64.81 41.79 50.81 80.29 68.59 73.98 7.89 77.87 14.32 46.37
Emolex 42.97 55.12 53.72 54.41 75.35 48.67 59.14 7.22 54.10 12.74 42.10
Emoticons 4.68 0.00 0.00 0.00 0.00 0.00 0.00 4.47 99.59 8.56 2.85
Emoticons DS 42.58 42.55 99.77 59.66 78.57 0.37 0.73 0.00 0.00 0.00 30.20
Happiness Index 31.81 48.42 50.18 49.29 71.70 25.96 38.12 5.36 54.10 9.76 32.39
NRC Hashtag 54.84 55.38 45.74 50.10 61.55 68.92 65.03 8.33 15.16 10.76 41.96
LIWC 24.35 42.88 27.72 33.67 53.42 39.12 45.16 4.67 53.28 8.58 29.14
Comments Opinion Finder 29.38 68.77 18.78 29.51 76.52 82.66 79.47 6.29 88.11 11.75 40.24
NYT Opinion Lexicon 44.57 65.95 43.15 52.17 79.81 70.65 74.95 7.94 73.77 14.34 47.15
PANAS-t 5.88 69.23 1.23 2.41 62.07 75.00 67.92 4.75 99.18 9.07 26.47
Pattern.en 45.39 55.15 44.69 49.37 63.65 61.12 62.36 7.85 45.90 13.41 41.71
SANN 27.92 56.74 29.40 38.73 78.02 55.13 64.61 5.93 79.51 11.04 38.13
SASA 30.04 49.92 30.13 37.58 59.11 52.83 55.80 5.74 61.07 10.49 34.62
Semantria 44.59 70.60 41.83 52.54 80.54 75.95 78.18 7.53 73.36 13.65 48.12
SenticNet 4.70 0.00 0.00 0.00 0.00 0.00 0.00 4.70 100.00 8.98 2.99
Sentiment140 34.66 0.00 0.00 0.00 65.76 100.00 79.34 5.83 64.34 10.69 30.01
SentiStrength 18.17 78.51 8.62 15.54 81.12 90.91 85.74 5.41 95.49 10.24 37.17
SentiWordNet 32.20 57.35 34.53 43.10 70.31 56.63 62.73 6.08 70.08 11.19 39.01
SO-CAL 50.79 64.36 51.13 56.99 77.25 68.36 72.53 8.68 65.98 15.34 48.29
Stanford DM 51.93 73.39 21.14 32.83 59.48 92.67 72.46 9.65 38.11 15.40 40.23
Umigon 24.08 68.76 16.38 26.46 68.78 80.38 74.13 5.88 88.93 11.04 37.21
Vader 48.84 61.96 52.40 56.78 80.09 63.00 70.52 9.51 70.90 16.77 48.03
Table VI. Mean Rank Table
3-Classes 2-Classes
Pos Method Mean Rank Pos Method Mean Rank
1 Semantria 3.20 1 SentiStrength 1.90(2.29)
2 SentiStrength 3.80(4.78) 2 Semantria 3.80
3 AFINN 4.40 3 Opinion Lexicon 5.70
4 Umigon 4.47 4 SO-CAL 5.90
5 Opinion Lexicon 4.87 5 AFINN 6.75
6 SO-CAL 6.93 6 Vader 7.65(8.06)
7 Vader 6.93(7.21) 7 Umigon 7.85
8 Opinion Finder 8.87 8 PANAS-t 9.85
9 Pattern.en 10.00 9 Emoticons 10.15
10 SANN 10.93 10 Pattern.en 10.30
11 Emolex 11.80 11 Opinion Finder 11.95
12 SentiWordNet 11.87 12 SenticNet 12.25
13 SenticNet 14.00 13 Emolex 12.30
14 Stanford DM 14.40 14 SANN 12.75
15 SASA 14.73 15 Stanford DM 13.80
16 LIWC 15.13 16 NRC Hashtag 14.20
17 PANAS-t 15.87 17 SentiWordNet 14.95
18 NRC Hashtag 16.73 18 SASA 15.60
19 Sentiment140 17.13 19 LIWC 15.90
20 Happiness Index 17.53 20 Happiness Index 17.65
21 Emoticons 18.00 21 Sentiment140 20.60
22 Emoticons DS 21.40 22 Emoticons DS 21.20
First, we note that the existing methods vary widely in their agreement. This suggests that the same social media text could be interpreted very differently depending on the choice of sentiment method. A few methods obtain results worse than a random baseline (i.e., a method that randomly chooses among positive, neutral, and negative as output). This usually happens when a method is biased towards one or more classes. As an example, Emoticons proved to be a good method for detecting positive and negative messages when the input data contains an emoticon. However, it labels most instances as neutral, since the majority of messages do not contain emoticons, leading to poor overall performance on most datasets.
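For concreteness, the sketch below (our own illustration, not part of the original evaluation) estimates the macro-F1 that a uniform random three-class baseline would obtain on a hypothetical labeled sample; the class proportions and sample size are made up.

# Illustrative sketch: macro-F1 of a uniform random 3-class baseline,
# the reference point mentioned in the text.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
labels = ["positive", "neutral", "negative"]

# Hypothetical gold labels for 10,000 messages (class proportions are made up).
y_true = rng.choice(labels, size=10_000, p=[0.4, 0.35, 0.25])

# Random baseline: pick one of the three classes uniformly for every message.
y_rand = rng.choice(labels, size=y_true.size)

print("random-baseline macro-F1:", f1_score(y_true, y_rand, average="macro"))

With three roughly balanced classes the expected macro-F1 of such a baseline hovers around one third, which gives a concrete sense of how weak the worst-performing methods are.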
Taking a closer look at Table IV, we note that Vader works well for Tweets RND I and Tweets STF, appearing among the top three methods, but it performs poorly on Tweets SAN and Comments BBC, reaching only ninth and eleventh place, respectively. Although the first three datasets contain tweets, they have different contexts, which can drastically affect the performance of some methods. Another important aspect to be analyzed in this table is coverage. Although SentiStrength presents good macro-F1 values, its coverage is relatively low, because this method is somewhat biased towards the neutral class. Note that some of the datasets provided by SentiStrength's authors (Thelwall and colleagues), as shown in Table III, especially the Twitter datasets, have more neutral sentences than positive and negative ones.
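As a minimal sketch of the two measures discussed above (our own illustration, not the paper's evaluation code), the snippet below computes coverage as the fraction of messages for which a method emits a polarity at all, and macro-F1 over the covered messages only; the gold and predicted labels are hypothetical.

from sklearn.metrics import f1_score

def coverage_and_macro_f1(y_true, y_pred):
    """y_pred contains 'positive', 'negative', or None when the method abstains."""
    covered = [(t, p) for t, p in zip(y_true, y_pred) if p is not None]
    coverage = len(covered) / len(y_true)
    if not covered:
        return coverage, 0.0
    t, p = zip(*covered)
    return coverage, f1_score(t, p, average="macro")

# Hypothetical outputs of a method that abstains (e.g., defaults to neutral) often.
gold = ["positive", "negative", "negative", "positive", "negative"]
pred = ["positive", None, "negative", None, "negative"]
print(coverage_and_macro_f1(gold, pred))  # high macro-F1, low coverage

This separation is what allows a method such as SentiStrength to report a high macro-F1 while covering only a fraction of the messages.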
The 3-class experiments presented in Table V show that different contexts can lead to poor performance for some methods. For instance, Umigon, the top performer on four tweet datasets, appears in the fourth overall position in the 3-class Mean Rank (Table VI) but fell to thirteenth place on the Comments NYT dataset.
Fig. 2. Average F1 Score for each class

We note from Figure 2 that most methods are more accurate when classifying positive messages than negative ones, suggesting that some methods may be biased towards positivity. Neutral messages proved to be even harder for most methods to detect. Recent efforts show that human language has a universal positivity bias [Dodds et al. 2015]. Naturally, part of this bias carries over into sentiment prediction as an intrinsic property of some methods, due to the way they are designed. For instance, [Hannak et al. 2012] developed a lexicon in which positive and negative values are associated with words, hashtags, and other tokens according to the frequency with which these tokens appear in tweets containing positive and negative emoticons. This method proved to be biased towards positivity because of the larger amount of positive content in the data used to build the lexicon. The overall poor performance of this specific method can be credited to its lack of treatment of neutral messages and its focus on Twitter messages.
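A simplified sketch of this kind of frequency-based lexicon construction is shown below; it is our own illustration, with made-up emoticon sets and a simple scoring rule, not the original implementation of [Hannak et al. 2012].

from collections import Counter

POS_EMOTICONS = {":)", ":-)", ":D"}
NEG_EMOTICONS = {":(", ":-(", ":'("}

def build_lexicon(tweets):
    pos_counts, neg_counts = Counter(), Counter()
    for tweet in tweets:
        tokens = tweet.split()
        has_pos = any(t in POS_EMOTICONS for t in tokens)
        has_neg = any(t in NEG_EMOTICONS for t in tokens)
        for t in tokens:
            if has_pos:
                pos_counts[t] += 1
            if has_neg:
                neg_counts[t] += 1
    lexicon = {}
    for token in pos_counts.keys() | neg_counts.keys():
        p, n = pos_counts[token], neg_counts[token]
        lexicon[token] = (p - n) / (p + n)   # score in [-1, 1]
    return lexicon

lex = build_lexicon(["great game :)", "awful traffic :(", "great day :D"])
print(lex.get("great"), lex.get("awful"))

Because tweets with positive emoticons typically outnumber those with negative ones, scores built this way drift towards the positive end, which is exactly the positivity bias discussed above.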
As can be seen in Table VI, the top seven methods based on Macro-F1 are SentiStrength, Semantria, AFINN, Opinion Lexicon, Umigon, Vader, and SO-CAL. This means that these methods produce good results across several datasets in both the 2-class and the 3-class tasks, and they would be preferable in situations in which no preliminary evaluation of the methods can be performed. We also note that methods usually perform better on the datasets on which they were originally validated, which is somewhat expected given fine-tuning procedures. This is especially true for SentiStrength and VADER. To understand the impact of this factor, we recalculated the Mean Rank for these methods excluding their 'original' datasets (results in parentheses in Table VI). Note that in some cases the mean rank worsens once these datasets are excluded.
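The Mean Rank values in Table VI can be reproduced, in principle, by ranking the methods within each dataset and averaging; the sketch below (ours) illustrates the computation on a few Macro-F1 values excerpted from Table IX.

# Sketch of the Mean Rank computation behind Table VI: rank methods by
# Macro-F1 within each dataset, then average each method's rank over datasets.
import pandas as pd

# Macro-F1 values excerpted from Table IX: rows = datasets, columns = methods.
macro_f1 = pd.DataFrame(
    {"SentiStrength": [83.6, 89.8, 78.3],
     "AFINN": [56.7, 68.5, 73.7],
     "Vader": [54.2, 66.9, 71.6]},
    index=["Comments_BBC", "Comments_DIGG", "Comments_NYT"])

# Rank 1 = best Macro-F1 on that dataset; averaging over datasets gives the mean rank.
ranks = macro_f1.rank(axis=1, ascending=False)
print(ranks.mean(axis=0).sort_values())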
Table VII presents the Friedman's test results and, as expected, we can conclude that there are significant differences in the mean ranks observed for the methods across all datasets. Statistically, this indicates that, in terms of accuracy and Macro-F1, no single method always achieves the best prediction performance across different datasets, which is reminiscent of the well-known "no-free-lunch theorem" [Wolpert and Macready 1997]. This suggests that at least a preliminary investigation should be performed when sentiment analysis is applied to a new dataset, in order to guarantee reasonable prediction performance.
Table VII. Friedman’s Test Results
2-classes experiments 3-classes experiments
FR 261.57 FR 219.31
Critical Value 31.41 Critical Value 31.41
Reject null hypothesis Reject null hypothesis
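A minimal sketch of how such a test can be run in practice is given below (our illustration, using SciPy rather than the paper's own tooling). Each argument holds one method's Macro-F1 over the same set of datasets (the blocks); the values are placeholders, and the critical value comes from the chi-square approximation with (number of methods - 1) degrees of freedom.

from scipy.stats import friedmanchisquare, chi2

method_a = [56.7, 68.5, 73.7, 74.3]   # e.g. one method's Macro-F1 over four datasets
method_b = [83.6, 89.8, 78.3, 82.5]   # a second method over the same datasets
method_c = [54.2, 66.9, 71.6, 74.0]   # a third method

fr, p_value = friedmanchisquare(method_a, method_b, method_c)
critical = chi2.ppf(0.95, df=3 - 1)   # df = number of methods - 1
print(f"FR={fr:.2f} (p={p_value:.3f}), critical={critical:.2f}, reject H0: {fr > critical}")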
In order to verify whether this behavior also occurs in specific contexts, such as tweets or comments, we divided all datasets into three contexts and performed the Friedman's test for each one. The contexts are Social Networks, Comments, and Reviews, and the datasets were grouped as presented below. Although its sentences were extracted from forums, we assigned the RW dataset to the Comments context, as the properties of its sentences are similar to those of the other Comments datasets (see Table III).
Context Groups
Social Networks: Myspace, Tweets DBT, Tweets RND I, Tweets RND II, Tweets RND III, Tweets RND IV, Tweets STF, Tweets SAN, Tweets Semeval
Comments: Comments BBC, Comments DIGG, Comments NYT, Comments TED, Comments YTB, RW
Reviews: Reviews I, Reviews II, Amazon
Even after grouping the datasets into these contexts, we still find significant differences in the observed ranks across the datasets. Although the values obtained for each context are considerably smaller than the global Friedman value, they are still above the critical value. Table VIII presents the results of the Friedman's test for the individual contexts in both the 2-class and 3-class experiments. Recall that, for the 3-class experiments, datasets with no neutral sentences or with an unrepresentative number of neutral sentences were removed. For this reason, there are no values for the 3-class experiments in the Reviews context, since none of the Review datasets has a significant number of neutral sentences.
Table VIII. Friedman’s Test Results By Contexts
Context: Social Networks
2-classes experiments 3-classes experiments
FR 158.95 FR 138.12
Critical Value 31.41 Critical Value 31.41
Reject null hypothesis Reject null hypothesis
Context: Comments
2-classes experiments 3-classes experiments
FR 85.39 FR 94.39
Critical Value 31.41 Critical Value 31.41
Reject null hypothesis Reject null hypothesis
Context: Reviews
2-classes experiments 3-classes experiments
FR 56.01 FR -
Critical Value 31.41 Critical Value -
Reject null hypothesis Reject null hypothesis
6. CONCLUDING REMARKS
Recent efforts to analyze the moods embedded in Web 2.0 content have adopted various sentiment
analysis methods, which were originally developed in linguistics and psychology. Several of these
methods became widely used in their knowledge fields and have now been applied as tools to quan-
tify moods in the context of unstructured short messages in online social networks. In this article,
we present a thorough comparison of twenty-one popular sentence-level sentiment analysis methods
using gold-standard datasets that span different types of data sources.
To perform this comparison, we have made significant efforts to obtain the latest working versions of the various sentiment analysis tools and datasets, which we have put together in a single webpage 6. We are releasing this Web system so that other researchers can easily compare the results of these methods on their own datasets. With this system, one can easily test which method would be most suitable for a particular dataset and application. We hope that our tool will not only help researchers and practitioners access and compare a wide range of sentiment analysis techniques, but will also foster the development of new research in this area.
APPENDIX
In this appendix, we present the full results of prediction performance of all twenty one sentiment
analysis methods on all labeled datasets.
6 http://homepages.dcc.ufmg.br/fabricio/benchmark_sentiment_analysis.html
Table IX. 2-class experiment results with 4 datasets
Dataset Method Accur. Posit. Sentiment Negat. Sentiment MacroF1 Coverage
P R F1 P R F1
AFINN 66.56 23.08 81.08 35.93 96.32 64.66 77.38 56.65 85.11
Emolex 59.64 21.52 89.04 34.67 97.38 55.62 70.80 52.73 80.72
Emoticons 33.33 0.00 0.00 0.00 100.00 33.33 50.00 25.00 0.40
Emoticons DS 13.33 13.10 100.00 23.17 100.00 0.31 0.61 11.89 99.73
Happiness Index 41.81 15.65 95.52 26.89 98.41 35.03 51.67 39.28 79.52
NRC Hashtag 84.45 33.33 25.27 28.75 89.76 92.83 91.27 60.01 97.47
LIWC 50.10 15.38 58.33 24.35 88.00 48.78 62.77 43.56 69.55
Comments Opinion Finder 74.43 21.74 62.50 32.26 94.93 75.72 84.24 58.25 76.46
BBC Opinion Lexicon 74.14 29.81 84.93 44.13 97.24 72.66 83.17 63.65 80.72
PANAS-t 58.73 20.00 75.00 31.58 93.94 56.36 70.45 51.02 8.38
Pattern.en 61.09 20.00 70.73 31.18 93.48 59.72 72.88 52.03 87.50
SANN 54.34 19.09 88.06 31.38 96.88 49.80 65.78 48.58 75.13
SASA 61.61 23.50 66.20 34.69 90.80 60.77 72.81 53.75 61.30
SenticNet 36.21 16.27 94.62 27.76 97.18 27.52 42.89 35.33 95.48
Sentiment140 92.56 0.00 0.00 0.00 92.56 100.00 96.14 48.07 51.86
SentiStrength 93.93 64.29 78.26 70.59 97.72 95.54 96.61 83.60 32.85
SentiWordNet 57.49 20.00 88.06 32.60 97.13 53.45 68.96 50.78 76.33
SO-CAL 75.28 28.93 80.28 42.54 96.71 74.64 84.25 63.40 82.85
Stanford DM 89.45 63.16 40.91 49.66 91.81 96.52 94.11 71.88 92.02
Umigon 79.37 39.13 61.02 47.68 92.10 82.72 87.15 67.42 50.93
Vader 62.19 22.12 85.54 35.15 96.77 59.02 73.32 54.23 92.15
AFINN 70.94 47.01 81.82 59.72 91.17 67.05 77.27 68.49 74.81
Emolex 61.71 34.60 75.83 47.52 88.93 57.53 69.87 58.69 67.14
Emoticons 73.08 72.22 86.67 78.79 75.00 54.55 63.16 70.97 3.32
Emoticons DS 28.24 27.30 100.00 42.89 100.00 1.77 3.48 23.19 98.72
Happiness Index 42.32 27.44 91.45 42.21 91.53 27.62 42.44 42.32 64.96
NRC Hashtag 74.69 51.01 40.64 45.24 80.80 86.48 83.54 64.39 92.97
LIWC 46.15 27.44 58.40 37.34 72.49 41.52 52.79 45.07 58.18
Comments Opinion Finder 71.14 43.04 64.76 51.71 86.88 73.13 79.42 65.56 56.27
DIGG Opinion Lexicon 71.82 47.45 86.43 61.27 93.40 66.75 77.86 69.56 69.44
PANAS-t 68.00 12.50 50.00 20.00 94.12 69.57 80.00 50.00 3.20
Pattern.en 66.72 43.49 77.44 55.70 88.25 62.75 73.35 64.53 77.62
SANN 60.04 35.56 84.96 50.13 91.83 52.33 66.67 58.40 61.13
SASA 65.54 40.26 66.91 50.27 84.82 65.06 73.64 61.95 68.29
SenticNet 43.00 30.39 92.97 45.81 91.22 25.52 39.88 42.84 91.30
Sentiment140 85.45 0.00 0.00 0.00 85.45 100.00 92.15 46.08 54.48
SentiStrength 92.09 78.69 92.31 84.96 97.40 92.02 94.64 89.80 27.49
SentiWordNet 62.17 36.86 77.68 50.00 88.84 57.18 69.58 59.79 58.82
SO-CAL 76.55 52.86 77.08 62.71 90.65 76.37 82.90 72.81 71.99
Stanford DM 79.45 66.67 37.58 48.06 81.57 93.63 87.19 67.63 83.38
Umigon 83.37 66.22 75.38 70.50 90.72 86.23 88.42 79.46 63.04
Vader 68.65 45.29 85.63 59.24 92.31 62.50 74.53 66.89 83.63
AFINN 73.82 66.35 78.85 72.07 81.55 70.04 75.36 73.71 55.14
Emolex 64.57 57.20 81.71 67.29 77.52 50.78 61.36 64.33 65.69
Emoticons 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Emoticons DS 44.75 44.66 99.86 61.72 78.57 0.40 0.80 31.26 99.84
Happiness Index 55.28 50.59 88.13 64.28 73.49 27.66 40.19 52.24 55.56
NRC Hashtag 61.89 58.47 49.85 53.82 63.98 71.55 67.55 60.69 91.77
LIWC 49.52 44.96 59.96 51.39 56.18 41.15 47.50 49.44 46.30
Comments Opinion Finder 75.11 70.17 61.61 65.61 77.64 83.58 80.50 73.06 35.26
NYT Opinion Lexicon 74.61 67.54 77.95 72.37 81.46 72.12 76.50 74.44 57.80
PANAS-t 66.32 71.05 56.25 62.79 63.16 76.60 69.23 66.01 1.92
Pattern.en 61.78 57.17 60.24 58.67 65.95 63.04 64.46 61.57 73.43
SANN 67.11 58.38 80.90 67.82 79.87 56.78 66.38 67.10 37.81
SASA 56.47 51.55 58.92 54.99 61.70 54.45 57.85 56.42 50.49
SenticNet 55.93 50.64 91.05 65.08 78.65 27.08 40.29 52.68 92.22
Sentiment140 68.13 0.00 0.00 0.00 68.13 100.00 81.05 40.52 48.73
SentiStrength 81.42 79.50 62.71 70.11 82.15 91.39 86.52 78.32 17.63
SentiWordNet 65.08 59.13 73.17 65.41 72.59 58.42 64.74 65.07 46.60
SO-CAL 72.52 66.14 75.74 70.61 78.88 70.03 74.19 72.40 69.01
Stanford DM 63.85 75.77 26.03 38.75 61.73 93.48 74.36 56.56 82.39
Umigon 70.03 69.56 55.97 62.03 70.29 80.96 75.25 68.64 29.82
Vader 71.58 63.60 80.66 71.12 81.33 64.61 72.02 71.57 66.72
AFINN 75.28 68.85 87.70 77.14 85.17 64.03 73.10 75.12 72.90
Emolex 67.27 59.88 85.46 70.42 81.03 52.03 63.37 66.89 68.50
Emoticons 91.67 100.00 75.00 85.71 88.89 100.00 94.12 89.92 1.65
Emoticons DS 43.74 43.74 100.00 60.86 0.00 0.00 0.00 30.43 100.00
Happiness Index 63.86 58.86 93.78 72.32 84.15 33.50 47.92 60.12 57.08
NRC Hashtag 71.00 68.05 58.36 62.84 72.66 80.15 76.23 69.53 92.02
LIWC 52.96 47.67 65.78 55.28 61.21 42.80 50.37 52.83 58.18
Comments Opinion Finder 70.99 66.48 66.12 66.30 74.38 74.69 74.53 70.42 58.32
TED Opinion Lexicon 74.35 68.15 84.58 75.49 82.89 65.40 73.11 74.30 74.55
PANAS-t 82.35 100.00 75.00 85.71 62.50 100.00 76.92 81.32 2.34
Pattern.en 67.21 62.89 76.29 68.94 73.15 58.93 65.28 67.11 83.91
SANN 72.55 67.82 76.96 72.10 77.73 68.77 72.98 72.54 68.64
SASA 65.94 59.63 77.40 67.36 75.00 56.40 64.38 65.87 63.00
SenticNet 55.76 50.27 90.00 64.51 77.70 28.12 41.30 52.90 95.46
Sentiment140 72.35 0.00 0.00 0.00 72.35 100.00 83.96 41.98 40.30
SentiStrength 82.81 83.59 86.29 84.92 81.72 78.35 80.00 82.46 30.40
SentiWordNet 58.67 56.70 77.46 65.48 63.08 39.42 48.52 57.00 57.91
SO-CAL 73.87 73.36 77.97 75.59 74.49 69.43 71.88 73.73 75.79
Stanford DM 75.34 83.42 58.33 68.66 71.46 90.00 79.67 74.16 81.98
Umigon 70.86 74.09 75.81 74.94 66.23 64.15 65.18 70.06 51.44
Vader 74.01 67.14 84.95 75.00 83.53 64.74 72.95 73.97 83.63
Table X. 2-class experiment results with 4 datasets
Dataset Method Accur. Posit. Sentiment Negat. Sentiment MacroF1 Coverage
P R F1 P R F1
AFINN 85.92 86.93 94.08 90.36 82.72 66.73 73.87 82.12 74.18
Emolex 79.31 82.46 87.58 84.94 71.70 62.82 66.97 75.95 58.63
Emoticons 88.44 91.50 95.31 93.37 64.00 48.48 55.17 74.27 9.25
Emoticons DS 68.86 68.91 99.63 81.47 62.50 1.34 2.62 42.05 97.99
Happiness Index 74.44 74.62 94.86 83.53 73.26 30.38 42.95 63.24 58.55
NRC Hashtag 72.24 90.94 65.25 75.99 54.76 86.62 67.10 71.54 89.31
LIWC 64.86 72.42 80.68 76.33 37.69 27.55 31.83 54.08 63.65
Comments Opinion Finder 73.64 84.12 74.60 79.08 58.43 71.72 64.40 71.74 42.43
YTB Opinion Lexicon 84.89 87.88 91.06 89.44 76.84 70.26 73.40 81.42 68.01
PANAS-t 65.45 60.00 62.50 61.22 70.00 67.74 68.85 65.04 2.26
Pattern.en 83.62 87.90 89.52 88.70 71.96 68.60 70.24 79.47 78.08
SANN 79.08 82.22 89.64 85.77 68.75 54.05 60.52 73.15 56.41
SASA 69.17 84.55 67.96 75.36 49.81 71.91 58.85 67.10 71.63
SenticNet 75.24 75.75 93.90 83.85 72.38 34.70 46.91 65.38 85.69
Sentiment140 59.29 0.00 0.00 0.00 59.29 100.00 74.44 37.22 32.32
SentiStrength 95.27 97.48 96.40 96.94 87.96 91.35 89.62 93.28 38.24
SentiWordNet 75.26 83.05 82.00 82.52 56.81 58.60 57.69 70.10 59.00
SO-CAL 85.98 90.64 89.08 89.85 75.86 78.85 77.33 83.59 68.63
Stanford DM 69.04 93.56 58.90 72.29 50.41 91.15 64.92 68.60 79.81
Umigon 82.01 94.53 80.39 86.89 60.65 86.67 71.36 79.13 71.55
Vader 85.62 86.66 93.86 90.11 82.40 66.56 73.64 81.87 81.50
AFINN 65.93 63.56 79.10 70.48 70.15 51.99 59.72 65.10 72.59
Emolex 64.77 62.37 79.35 69.85 69.30 49.34 57.64 63.74 74.39
Emoticons 60.00 0.00 0.00 0.00 60.00 100.00 75.00 37.50 0.05
Emoticons DS 50.27 50.17 99.94 66.80 89.29 0.47 0.94 33.87 99.79
Happiness Index 54.25 53.22 85.59 65.63 58.96 21.59 31.61 48.62 63.62
NRC Hashtag 62.34 62.14 64.45 63.27 62.57 60.20 61.36 62.32 93.47
LIWC 63.00 61.37 82.45 70.36 67.08 40.81 50.75 60.56 66.08
Reviews I Opinion Finder 26.55 100.00 26.55 41.96 0.00 0.00 0.00 20.98 49.12
Opinion Lexicon 69.77 69.26 74.09 71.59 70.39 65.20 67.70 69.64 77.28
PANAS-t 66.30 75.44 61.72 67.89 58.12 72.55 64.53 66.21 3.40
Pattern.en 65.60 65.24 68.68 66.92 66.00 62.43 64.17 65.54 89.06
SANN 62.34 62.00 70.29 65.88 62.82 53.81 57.97 61.93 67.31
SASA 57.41 55.81 61.54 58.54 59.27 53.46 56.22 57.38 58.24
SenticNet 55.27 53.43 88.26 66.57 64.44 21.67 32.43 49.50 94.52
Sentiment140 69.49 0.00 0.00 0.00 69.49 100.00 82.00 41.00 30.96
SentiStrength 67.54 72.40 65.28 68.66 62.84 70.23 66.33 67.49 26.98
SentiWordNet 61.45 61.12 71.36 65.84 61.97 50.69 55.77 60.80 62.53
SO-CAL 71.65 72.09 72.82 72.46 71.18 70.43 70.80 71.63 89.10
Stanford DM 82.70 88.31 75.48 81.40 78.48 89.95 83.83 82.61 91.92
Umigon 63.44 66.36 55.62 60.52 61.30 71.37 65.95 63.24 53.95
Vader 64.62 62.19 79.65 69.84 69.33 48.71 57.22 63.53 84.76
AFINN 65.95 63.40 78.99 70.34 70.44 52.33 60.05 65.19 73.93
Emolex 64.96 62.04 80.00 69.88 70.52 49.41 58.11 64.00 75.15
Emoticons 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Emoticons DS 49.86 49.77 99.92 66.45 85.19 0.43 0.86 33.65 99.80
Happiness Index 54.45 53.16 86.25 65.78 60.44 21.65 31.88 48.83 66.30
NRC Hashtag 61.56 60.96 63.80 62.35 62.22 59.33 60.74 61.55 91.65
LIWC 61.77 60.08 81.98 69.34 66.11 39.22 49.24 59.29 66.61
Reviews II Opinion Finder 60.50 69.14 34.56 46.08 57.70 85.27 68.83 57.46 62.66
Opinion Lexicon 70.11 69.28 74.97 72.01 71.15 65.00 67.93 69.97 77.98
PANAS-t 66.85 74.16 63.16 68.22 60.10 71.60 65.35 66.78 3.51
Pattern.en 65.90 65.26 68.80 66.98 66.61 62.96 64.74 65.86 90.54
SANN 62.89 62.25 71.18 66.42 63.80 54.07 58.53 62.48 68.64
SASA 57.40 56.00 61.91 58.81 59.07 53.06 55.90 57.35 58.98
SenticNet 55.51 53.35 88.66 66.61 66.21 22.28 33.34 49.98 95.33
Sentiment140 68.20 0.00 0.00 0.00 68.20 100.00 81.09 40.55 34.45
SentiStrength 69.17 74.17 66.77 70.28 64.33 72.04 67.97 69.12 27.13
SentiWordNet 61.99 61.55 71.05 65.96 62.65 52.25 56.98 61.47 62.81
SO-CAL 72.18 72.42 73.69 73.05 71.92 70.61 71.26 72.15 88.99
Stanford DM 86.17 89.11 82.37 85.61 83.64 89.95 86.68 86.15 91.46
Umigon 63.96 67.38 55.94 61.13 61.46 72.19 66.40 63.76 56.42
Vader 65.06 62.19 80.38 70.12 70.61 49.09 57.92 64.02 86.14
AFINN 87.18 94.67 90.06 92.31 54.70 70.33 61.54 76.92 74.82
Emolex 83.62 93.30 87.05 90.07 45.79 63.64 53.26 71.67 62.95
Emoticons 90.59 97.30 92.31 94.74 45.45 71.43 55.56 75.15 10.19
Emoticons DS 83.94 84.24 99.57 91.27 0.00 0.00 0.00 45.63 99.28
Happiness Index 88.71 90.37 97.25 93.69 67.50 35.53 46.55 70.12 65.83
NRC Hashtag 55.67 95.89 49.47 65.27 24.77 88.71 38.73 52.00 94.12
LIWC 83.07 90.30 90.12 90.21 37.18 37.66 37.42 63.82 68.71
Myspace Opinion Finder 72.78 94.27 73.04 82.31 28.83 71.11 41.03 61.67 40.53
Opinion Lexicon 84.54 94.17 87.19 90.55 49.11 69.62 57.59 74.07 62.83
PANAS-t 96.25 100.00 96.05 97.99 57.14 100.00 72.73 85.36 9.59
Pattern.en 83.99 93.41 87.64 90.43 43.80 60.92 50.96 70.70 76.38
SANN 81.22 92.01 85.44 88.60 39.77 56.45 46.67 67.64 51.08
SASA 61.69 92.09 57.82 71.04 30.04 78.49 43.45 57.24 59.47
SenticNet 84.08 90.17 91.19 90.68 47.12 44.14 45.58 68.13 88.13
Sentiment140 41.75 0.00 0.00 0.00 41.75 100.00 58.90 29.45 24.70
SentiStrength 97.72 100.00 97.50 98.73 79.31 100.00 88.46 93.60 31.53
SentiWordNet 77.72 92.31 81.15 86.37 30.30 54.79 39.02 62.70 67.27
SO-CAL 84.40 94.84 86.16 90.29 50.40 75.00 60.29 75.29 63.79
Stanford DM 43.04 96.08 33.73 49.94 20.78 92.66 33.95 41.94 82.73
Umigon 76.71 96.72 75.78 84.98 34.00 82.93 48.23 66.60 75.18
Vader 88.05 94.04 91.94 92.97 56.48 64.21 60.10 76.54 81.29
Table XI. 2-class experiment results with 4 datasets
Dataset Method Accur. Posit. Sentiment Negat. Sentiment MacroF1 Coverage
P R F1 P R F1
AFINN 78.69 80.17 87.87 83.84 75.43 63.14 68.74 76.29 62.80
Emolex 66.59 71.89 72.91 72.40 58.31 57.08 57.69 65.04 58.86
Emoticons 50.00 0.00 0.00 0.00 100.00 50.00 66.67 33.33 0.06
Emoticons DS 59.95 59.61 100.00 74.70 100.00 2.04 4.00 39.35 99.53
Happiness Index 65.93 65.23 96.75 77.92 74.24 15.38 25.49 51.70 46.59
NRC Hashtag 64.13 74.30 60.52 66.70 54.60 69.40 61.12 63.91 89.81
LIWC 60.97 64.39 81.77 72.05 48.75 27.72 35.34 53.69 55.57
Amazon Opinion Finder 68.07 78.34 68.95 73.35 54.97 66.53 60.20 66.77 37.40
Opinion Lexicon 80.82 82.25 88.48 85.25 77.85 67.96 72.57 78.91 67.15
PANAS-t 74.07 87.18 79.07 82.93 40.00 54.55 46.15 64.54 1.50
Pattern.en 71.57 76.75 77.51 77.13 62.95 61.93 62.43 69.78 76.68
SANN 72.03 73.92 87.60 80.18 65.87 43.64 52.50 66.34 44.27
SASA 62.18 66.94 73.56 70.10 52.84 44.91 48.55 59.32 66.43
SenticNet 63.70 63.38 92.52 75.23 65.81 21.21 32.09 53.66 91.41
Sentiment140 53.04 0.00 0.00 0.00 53.04 100.00 69.31 34.66 56.09
SentiStrength 90.52 92.24 95.51 93.85 84.31 75.00 79.38 86.62 19.58
SentiWordNet 72.89 77.52 82.35 79.86 62.42 55.11 58.54 69.20 53.85
SO-CAL 78.23 81.45 84.66 83.02 72.18 67.36 69.69 76.35 71.52
Stanford DM 68.53 89.26 54.54 67.71 56.38 89.96 69.31 68.51 80.28
Umigon 72.26 85.42 68.89 76.27 57.90 78.44 66.62 71.45 51.33
Vader 76.63 76.85 88.59 82.31 76.11 57.67 65.62 73.96 72.44
AFINN 71.08 58.32 83.71 68.74 86.38 63.34 73.09 70.92 59.58
Emolex 63.28 48.43 75.92 59.14 81.33 56.48 66.67 62.90 58.77
Emoticons 73.91 66.67 90.91 76.92 87.50 58.33 70.00 73.46 1.16
Emoticons DS 37.41 37.16 100.00 54.18 100.00 0.64 1.28 27.73 99.55
Happiness Index 50.24 39.22 83.45 53.36 79.40 33.04 46.66 50.01 42.95
NRC Hashtag 65.31 52.53 23.60 32.57 67.73 88.26 76.64 54.61 94.09
LIWC 64.64 56.44 96.19 71.14 92.42 38.50 54.35 62.75 61.45
Tweets Opinion Finder 72.93 56.72 63.68 60.00 81.97 77.26 79.55 69.77 33.60
DBT Opinion Lexicon 73.98 62.13 82.05 70.71 86.09 68.97 76.59 73.65 58.06
PANAS-t 79.59 43.75 87.50 58.33 96.97 78.05 86.49 72.41 2.48
Pattern.en 67.90 53.79 62.68 57.89 77.72 70.73 74.06 65.98 78.07
SANN 64.38 46.61 80.78 59.11 86.31 56.70 68.44 63.77 40.42
SASA 63.17 50.93 73.15 60.06 77.74 57.08 65.83 62.94 59.68
SenticNet 48.85 38.91 81.90 52.75 76.27 31.16 44.24 48.50 85.65
Sentiment140 78.42 0.00 0.00 0.00 78.42 100.00 87.91 43.95 46.13
SentiStrength 77.88 62.76 69.47 65.94 85.71 81.63 83.62 74.78 21.48
SentiWordNet 61.98 46.83 72.84 57.01 79.64 56.24 65.93 61.47 48.91
SO-CAL 72.96 61.36 75.70 67.78 82.98 71.31 76.70 72.24 57.55
Stanford DM 71.31 75.89 28.52 41.46 70.60 94.99 81.00 61.23 84.54
Umigon 76.62 66.45 73.50 69.80 83.59 78.44 80.93 75.37 38.91
Vader 68.95 55.31 84.05 66.72 86.49 60.07 70.90 68.81 70.14
AFINN 76.74 63.64 87.50 73.68 90.48 70.37 79.17 76.43 66.15
Emolex 68.42 50.00 83.33 62.50 88.89 61.54 72.73 67.61 58.46
Emoticons 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Emoticons DS 35.94 34.92 100.00 51.76 100.00 2.38 4.65 28.21 98.46
Happiness Index 54.84 50.00 92.86 65.00 80.00 23.53 36.36 50.68 47.69
NRC Hashtag 66.10 55.00 50.00 52.38 71.79 75.68 73.68 63.03 90.77
LIWC 67.50 56.00 87.50 68.29 86.67 54.17 66.67 67.48 61.54
Irony Opinion Finder 78.95 70.00 87.50 77.78 88.89 72.73 80.00 78.89 29.23
Opinion Lexicon 66.67 52.38 84.62 64.71 86.67 56.52 68.42 66.56 55.38
PANAS-t 100.00 0.00 0.00 0.00 100.00 100.00 100.00 50.00 1.54
Pattern.en 73.17 62.96 94.44 75.56 92.86 56.52 70.27 72.91 63.08
SANN 63.33 45.00 100.00 62.07 100.00 47.62 64.52 63.29 46.15
SASA 77.42 66.67 83.33 74.07 87.50 73.68 80.00 77.04 47.69
SenticNet 44.26 37.25 90.48 52.78 80.00 20.00 32.00 42.39 93.85
Sentiment140 83.87 0.00 0.00 0.00 83.87 100.00 91.23 45.61 47.69
SentiStrength 88.89 87.50 87.50 87.50 90.00 90.00 90.00 88.75 27.69
SentiWordNet 63.89 52.17 85.71 64.86 84.62 50.00 62.86 63.86 55.38
SO-CAL 78.57 65.00 86.67 74.29 90.91 74.07 81.63 77.96 64.62
Stanford DM 79.31 76.92 52.63 62.50 80.00 92.31 85.71 74.11 89.23
Umigon 67.65 56.25 69.23 62.07 77.78 66.67 71.79 66.93 52.31
Vader 71.43 55.17 94.12 69.57 95.00 59.38 73.08 71.32 75.38
AFINN 64.81 58.14 96.15 72.46 90.91 35.71 51.28 61.87 76.06
Emolex 55.00 51.52 89.47 65.38 71.43 23.81 35.71 50.55 56.34
Emoticons 100.00 100.00 100.00 100.00 0.00 0.00 0.00 50.00 1.41
Emoticons DS 45.07 45.71 96.97 62.14 0.00 0.00 0.00 31.07 100.00
Happiness Index 54.05 48.28 87.50 62.22 75.00 28.57 41.38 51.80 52.11
NRC Hashtag 70.59 66.67 70.97 68.75 74.29 70.27 72.22 70.49 95.77
LIWC 69.81 63.16 92.31 75.00 86.67 48.15 61.90 68.45 74.65
Sarcasm Opinion Finder 64.10 59.26 84.21 69.57 75.00 45.00 56.25 62.91 54.93
Opinion Lexicon 69.39 61.11 95.65 74.58 92.31 46.15 61.54 68.06 69.01
PANAS-t 50.00 0.00 0.00 0.00 100.00 50.00 66.67 33.33 2.82
Pattern.en 59.38 56.82 78.12 65.79 65.00 40.62 50.00 57.89 90.14
SANN 61.36 52.78 100.00 69.09 100.00 32.00 48.48 58.79 61.97
SASA 53.06 44.83 65.00 53.06 65.00 44.83 53.06 53.06 69.01
SenticNet 53.73 50.88 90.62 65.17 70.00 20.00 31.11 48.14 94.37
Sentiment140 72.09 0.00 0.00 0.00 72.09 100.00 83.78 41.89 60.56
SentiStrength 82.76 70.59 100.00 82.76 100.00 70.59 82.76 82.76 40.85
SentiWordNet 56.86 56.76 77.78 65.62 57.14 33.33 42.11 53.87 71.83
SO-CAL 67.31 55.88 90.48 69.09 88.89 51.61 65.31 67.20 73.24
Stanford DM 68.85 82.35 46.67 59.57 63.64 90.32 74.67 67.12 85.92
Umigon 69.39 66.67 75.00 70.59 72.73 64.00 68.09 69.34 69.01
Vader 62.50 55.77 96.67 70.73 91.67 32.35 47.83 59.28 90.14
Table XII. 2-class experiment results with 4 datasets
Dataset Method Accur. Posit. Sentiment Negat. Sentiment MacroF1 Coverage
P R F1 P R F1
AFINN 80.32 79.38 90.84 84.72 82.39 64.47 72.34 78.53 71.47
Emolex 73.28 74.49 83.19 78.60 70.93 59.03 64.43 71.52 59.02
Emoticons 85.43 90.27 90.27 90.27 71.05 71.05 71.05 80.66 13.19
Emoticons DS 58.95 58.84 99.63 73.99 72.22 1.37 2.70 38.34 99.83
Happiness Index 68.28 66.75 93.60 77.93 76.23 30.58 43.65 60.79 60.46
NRC Hashtag 65.61 72.95 65.91 69.26 57.31 65.19 61.00 65.13 95.54
LIWC 59.92 62.41 79.37 69.87 52.66 32.43 40.14 55.01 54.61
Tweets Opinion Finder 77.16 82.14 78.04 80.04 70.90 75.92 73.32 76.68 40.37
RND I Opinion Lexicon 81.56 82.00 87.68 84.74 80.84 72.98 76.71 80.73 63.74
PANAS-t 85.45 91.18 86.11 88.57 76.19 84.21 80.00 84.29 4.81
Pattern.en 78.02 79.84 85.52 82.58 74.60 66.33 70.22 76.40 77.72
SANN 75.61 75.48 87.04 80.85 75.89 59.06 66.43 73.64 50.15
SASA 65.60 70.72 70.36 70.54 58.47 58.89 58.68 64.61 58.67
SenticNet 68.48 67.00 91.22 77.26 74.35 36.16 48.65 62.96 92.44
Sentiment140 70.19 0.00 0.00 0.00 70.19 100.00 82.49 41.24 40.45
SentiStrength 93.72 94.26 96.33 95.28 92.61 88.68 90.60 92.94 27.13
SentiWordNet 70.70 76.03 77.64 76.83 61.27 59.11 60.17 68.50 62.78
SO-CAL 80.85 82.08 85.98 83.98 78.92 73.66 76.20 80.09 64.57
Stanford DM 54.19 87.02 25.40 39.33 47.44 94.67 63.21 51.27 92.70
Umigon 82.07 89.22 80.71 84.76 73.02 84.26 78.24 81.50 67.50
Vader 80.12 78.73 91.76 84.75 83.39 62.52 71.46 78.10 81.08
AFINN 96.37 97.66 96.94 97.30 93.75 95.19 94.47 95.88 80.77
Emolex 86.06 89.82 89.11 89.47 78.77 80.00 79.38 84.42 63.58
Emoticons 97.75 97.90 99.42 98.65 96.97 89.72 93.20 95.93 14.82
Emoticons DS 71.04 70.61 99.90 82.74 95.83 5.43 10.28 46.51 99.09
Happiness Index 82.39 81.92 95.51 88.20 84.30 53.33 65.33 76.77 58.60
NRC Hashtag 67.37 83.76 65.43 73.47 48.17 71.69 57.62 65.55 91.94
LIWC 66.47 74.46 78.81 76.58 44.20 38.31 41.04 58.81 73.93
Tweets Opinion Finder 78.32 93.86 71.11 80.92 63.42 91.50 74.92 77.92 41.23
RND II Opinion Lexicon 93.45 97.03 93.14 95.04 86.93 94.11 90.38 92.71 70.64
PANAS-t 90.71 96.95 88.19 92.36 82.11 95.12 88.14 90.25 5.39
Pattern.en 87.11 93.13 88.42 90.72 74.49 83.83 78.89 84.80 80.03
SANN 83.80 89.89 86.50 88.16 71.39 77.58 74.36 81.26 52.67
SASA 70.06 82.81 72.81 77.49 49.05 63.39 55.30 66.40 63.04
SenticNet 83.28 82.92 95.28 88.67 84.60 56.92 68.05 78.36 89.63
Sentiment140 59.94 0.00 0.00 0.00 59.94 100.00 74.95 37.48 38.49
SentiStrength 96.97 98.92 96.43 97.66 93.54 98.01 95.72 96.69 34.65
SentiWordNet 78.57 87.88 80.91 84.25 61.09 72.87 66.46 75.36 61.49
SO-CAL 87.76 94.25 86.99 90.47 77.34 89.32 82.90 86.68 67.18
Stanford DM 60.46 94.48 44.87 60.84 44.06 94.30 60.06 60.45 88.89
Umigon 88.63 97.73 85.92 91.45 73.64 95.17 83.03 87.24 70.83
Vader 98.97 99.05 99.45 99.25 98.77 97.89 98.33 98.79 94.61
AFINN 86.66 87.38 91.11 89.21 85.43 79.84 82.54 85.87 78.81
Emolex 82.02 83.90 87.55 85.69 78.64 73.19 75.82 80.75 67.07
Emoticons 92.74 94.66 95.38 95.02 87.50 85.71 86.60 90.81 14.59
Emoticons DS 61.98 61.51 99.73 76.09 90.00 3.77 7.23 41.66 99.02
Happiness Index 75.26 73.05 95.82 82.90 85.40 40.91 55.32 69.11 62.27
NRC Hashtag 79.28 84.99 80.22 82.54 71.52 77.80 74.53 78.53 95.19
LIWC 60.65 63.82 77.81 70.12 52.35 35.60 42.38 56.25 50.12
Tweets Opinion Finder 81.71 88.64 79.87 84.03 73.48 84.50 78.60 81.32 40.99
RND III Opinion Lexicon 88.21 88.43 92.79 90.56 87.82 81.07 84.31 87.43 70.50
PANAS-t 94.05 95.38 96.88 96.12 89.47 85.00 87.18 91.65 6.85
Pattern.en 85.03 87.12 89.45 88.27 81.23 77.54 79.34 83.81 82.23
SANN 80.66 81.26 88.67 84.81 79.46 68.20 73.40 79.10 54.36
SASA 76.94 80.44 81.41 80.92 71.52 70.21 70.86 75.89 67.16
SenticNet 78.54 75.92 94.81 84.32 86.84 53.23 66.00 75.16 90.38
Sentiment140 72.71 0.00 0.00 0.00 72.71 100.00 84.20 42.10 44.50
SentiStrength 94.99 96.27 96.57 96.42 91.97 91.30 91.64 94.03 37.41
SentiWordNet 72.54 79.19 78.75 78.97 60.14 60.76 60.45 69.71 67.97
SO-CAL 88.27 90.73 90.25 90.49 84.33 85.06 84.69 87.59 74.33
Stanford DM 65.73 94.06 45.09 60.96 54.46 95.84 69.46 65.21 86.80
Umigon 86.86 92.84 85.36 88.95 78.96 89.30 83.81 86.38 80.03
Vader 86.79 86.66 92.64 89.55 87.03 77.59 82.04 85.79 86.96
AFINN 76.77 76.15 86.84 81.15 77.94 63.10 69.74 75.44 71.22
Emolex 72.56 75.93 81.19 78.47 66.07 58.73 62.18 70.33 58.99
Emoticons 94.52 98.18 91.53 94.74 90.83 98.02 94.29 94.51 78.78
Emoticons DS 57.82 57.72 99.37 73.02 66.67 1.71 3.33 38.18 98.92
Happiness Index 67.50 64.39 94.44 76.58 82.14 32.86 46.94 61.76 57.55
NRC Hashtag 61.83 67.36 64.67 65.99 55.08 58.04 56.52 61.25 94.24
LIWC 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Tweets Opinion Finder 70.34 75.00 73.91 74.45 64.00 65.31 64.65 69.55 42.45
RND IV Opinion Lexicon 79.35 79.17 87.96 83.33 79.69 67.11 72.86 78.10 66.19
PANAS-t 56.52 63.64 53.85 58.33 50.00 60.00 54.55 56.44 8.27
Pattern.en 91.76 92.76 92.76 92.76 90.43 90.43 90.43 91.60 96.04
SANN 72.14 73.12 82.93 77.71 70.21 56.90 62.86 70.29 50.36
SASA 64.16 67.96 70.71 69.31 58.57 55.41 56.94 63.13 62.23
SenticNet 66.53 65.05 87.68 74.69 71.19 39.25 50.60 62.65 88.13
Sentiment140 75.74 0.00 0.00 0.00 75.74 100.00 86.19 43.10 48.92
SentiStrength 89.77 92.06 93.55 92.80 84.00 80.77 82.35 87.58 31.65
SentiWordNet 66.67 72.73 74.23 73.47 56.14 54.24 55.17 64.32 56.12
SO-CAL 75.53 73.50 85.15 78.90 78.87 64.37 70.89 74.89 67.63
Stanford DM 61.51 83.82 39.86 54.03 53.26 89.91 66.89 60.46 90.65
Umigon 91.25 95.65 88.59 91.99 86.40 94.74 90.38 91.18 94.60
Vader 83.66 81.18 93.24 86.79 88.51 70.64 78.57 82.68 92.45
Table XIII. 2-class experiment results with 4 datasets
Dataset Method Accur. Posit. Sentiment Negat. Sentiment MacroF1 Coverage
P R F1 P R F1
AFINN 84.42 80.62 91.49 85.71 89.66 77.04 82.87 84.29 76.88
Emolex 79.65 76.09 88.98 82.03 85.23 69.44 76.53 79.28 62.95
Emoticons 85.42 80.65 96.15 87.72 94.12 72.73 82.05 84.89 13.37
Emoticons DS 51.96 51.41 100.00 67.91 100.00 2.27 4.44 36.18 99.72
Happiness Index 65.93 58.72 94.39 72.40 88.89 40.34 55.49 63.95 62.95
NRC Hashtag 71.30 73.05 70.93 71.98 69.51 71.70 70.59 71.28 92.20
LIWC 64.29 63.75 76.12 69.39 65.22 50.85 57.14 63.27 70.39
Tweets Opinion Finder 80.77 81.16 76.71 78.87 80.46 84.34 82.35 80.61 43.45
STF Opinion Lexicon 86.10 83.67 91.11 87.23 89.29 80.65 84.75 85.99 72.14
PANAS-t 94.12 88.89 100.00 94.12 100.00 88.89 94.12 94.12 4.74
Pattern.en 77.85 75.69 85.09 80.12 80.95 69.86 75.00 77.56 85.52
SANN 73.21 69.35 82.69 75.44 78.82 63.81 70.53 72.98 58.22
SASA 68.52 65.65 78.90 71.67 72.94 57.94 64.58 68.12 60.17
SenticNet 72.62 66.80 93.06 77.78 87.37 50.92 64.34 71.06 93.59
Sentiment140 75.53 0.00 0.00 0.00 75.53 100.00 86.06 43.03 52.37
SentiStrength 95.33 95.18 96.34 95.76 95.52 94.12 94.81 95.29 41.78
SentiWordNet 72.99 73.17 78.95 75.95 72.73 65.98 69.19 72.57 58.77
SO-CAL 87.36 82.89 93.33 87.80 92.80 81.69 86.89 87.35 77.16
Stanford DM 66.56 87.69 36.31 51.35 61.24 95.18 74.53 62.94 89.97
Umigon 86.99 91.73 81.88 86.52 83.02 92.31 87.42 86.97 81.34
Vader 94.12 100.00 90.48 95.00 86.67 100.00 92.86 93.93 9.47
AFINN 77.85 72.73 86.72 79.11 84.83 69.54 76.43 77.77 69.94
Emolex 72.06 65.42 85.67 74.19 82.61 60.06 69.55 71.87 60.04
Emoticons 85.71 87.50 89.74 88.61 82.61 79.17 80.85 84.73 5.77
Emoticons DS 47.20 47.13 99.21 63.90 55.56 0.88 1.74 32.82 98.08
Happiness Index 64.23 59.63 89.92 71.70 78.81 38.11 51.38 61.54 45.10
NRC Hashtag 69.77 72.01 58.71 64.69 68.36 79.63 73.57 69.13 93.68
LIWC 77.27 71.37 92.05 80.40 88.38 62.10 72.95 76.67 63.70
Tweets Opinion Finder 68.60 63.54 71.35 67.22 73.80 66.35 69.87 68.55 34.74
SAN Opinion Lexicon 81.77 78.18 86.74 82.24 85.98 77.05 81.27 81.75 65.35
PANAS-t 84.62 80.00 80.00 80.00 87.50 87.50 87.50 83.75 2.38
Pattern.en 74.62 68.46 90.59 77.98 86.40 58.90 70.04 74.01 72.59
SANN 70.02 64.40 85.95 73.63 80.46 54.90 65.27 69.45 45.55
SASA 61.18 61.74 74.71 67.61 60.12 45.16 51.58 59.59 43.45
SenticNet 61.12 55.40 91.21 68.93 81.22 34.13 48.06 58.50 94.78
Sentiment140 72.45 0.00 0.00 0.00 72.45 100.00 84.02 42.01 53.90
SentiStrength 89.47 88.70 91.81 90.23 90.41 86.84 88.59 89.41 29.61
SentiWordNet 67.49 64.25 75.63 69.48 71.90 59.70 65.23 67.35 59.21
SO-CAL 79.70 74.74 84.55 79.34 85.11 75.56 80.05 79.70 68.19
Stanford DM 62.72 87.60 22.60 35.93 59.35 97.25 73.71 54.82 92.94
Umigon 82.41 83.66 80.49 82.04 81.25 84.32 82.76 82.40 67.74
Vader 77.18 71.81 88.15 79.15 85.34 66.59 74.81 76.98 78.74
AFINN 86.05 90.95 90.01 90.48 72.88 75.00 73.93 82.20 76.83
Emolex 79.82 86.58 86.48 86.53 59.70 59.93 59.81 73.17 70.29
Emoticons 92.24 95.09 95.45 95.27 78.95 77.59 78.26 86.77 10.52
Emoticons DS 72.75 72.72 100.00 84.20 100.00 0.36 0.71 42.46 100.00
Happiness Index 77.08 78.14 95.34 85.88 68.30 27.37 39.08 62.48 68.01
NRC Hashtag 72.15 83.82 76.83 80.17 48.25 59.29 53.20 66.69 96.80
LIWC 64.54 74.33 78.88 76.54 30.19 25.12 27.42 51.98 53.17
Tweets Opinion Finder 75.95 90.13 74.02 81.28 56.40 80.57 66.35 73.82 38.86
Semeval Opinion Lexicon 86.20 92.07 88.90 90.46 71.75 78.65 75.04 82.75 69.61
PANAS-t 91.76 96.63 93.49 95.04 70.21 82.50 75.86 85.45 8.33
Pattern.en 77.94 89.27 80.02 84.39 55.14 71.85 62.39 73.39 83.40
SANN 79.33 84.91 87.37 86.12 62.13 57.18 59.55 72.84 53.92
SASA 75.63 81.44 87.26 84.25 52.31 41.26 46.13 65.19 53.24
SenticNet 75.59 79.11 90.33 84.35 58.22 36.10 44.57 64.46 95.59
Sentiment140 51.68 0.00 0.00 0.00 51.68 100.00 68.14 34.07 36.05
SentiStrength 91.11 96.17 91.78 93.93 78.40 89.09 83.40 88.66 28.66
SentiWordNet 69.93 86.53 72.04 78.62 40.52 62.93 49.29 63.96 70.20
SO-CAL 82.52 90.17 85.03 87.53 66.28 76.05 70.83 79.18 69.93
Stanford DM 41.13 94.17 19.78 32.70 31.64 96.81 47.69 40.19 92.32
Umigon 81.44 94.98 79.34 86.46 59.02 87.64 70.54 78.50 68.86
Vader 85.65 89.22 91.39 90.29 75.08 70.13 72.52 81.40 86.31
AFINN 82.25 83.16 93.98 88.24 78.63 53.80 63.89 76.06 83.12
Emolex 74.54 79.14 86.39 82.60 59.69 46.95 52.56 67.58 77.45
Emoticons 88.89 91.30 95.45 93.33 75.00 60.00 66.67 80.00 15.32
Emoticons DS 69.23 69.05 100.00 81.69 100.00 1.82 3.57 42.63 99.57
Happiness Index 75.14 76.04 94.56 84.30 68.66 28.57 40.35 62.32 77.59
NRC Hashtag 65.06 81.10 63.52 71.24 46.71 68.35 55.49 63.37 97.02
LIWC 83.18 83.12 96.77 89.43 83.54 45.52 58.93 74.18 77.59
RW Opinion Finder 61.30 79.91 58.81 67.75 42.04 66.90 51.63 59.69 65.25
Opinion Lexicon 80.25 83.89 89.03 86.39 69.50 59.39 64.05 75.22 79.01
PANAS-t 63.64 80.00 64.52 71.43 42.11 61.54 50.00 60.71 12.48
Pattern.en 71.57 84.20 73.86 78.69 50.64 65.92 57.28 67.99 87.80
SANN 75.75 82.43 84.25 83.33 57.14 53.90 55.47 69.40 71.35
SASA 66.00 78.04 71.33 74.53 44.83 53.72 48.87 61.70 56.74
SenticNet 72.84 73.82 94.18 82.77 65.38 24.76 35.92 59.34 95.04
Sentiment140 51.32 0.00 0.00 0.00 51.32 100.00 67.83 33.91 48.37
SentiStrength 85.89 93.69 86.67 90.04 69.23 83.72 75.79 82.92 23.12
SentiWordNet 68.94 80.90 74.21 77.41 45.92 55.56 50.28 63.85 81.28
SO-CAL 76.03 86.12 78.15 81.94 58.74 71.18 64.36 73.15 79.29
Stanford DM 43.10 96.25 17.46 29.56 35.58 98.53 52.28 40.92 91.49
Umigon 60.32 91.43 48.12 63.05 42.02 89.29 57.14 60.10 80.43
Vader 81.70 81.70 95.28 87.97 81.74 49.74 61.84 74.90 89.93
Table XIV. 3-class experiment results with 4 datasets
Dataset Method Accur. Posit. Sentiment Negat. Sentiment Neut. Sentiment MacroF1
P R F1 P R F1 P R F1
AFINN 50.10 16.22 60.61 25.59 82.62 54.14 65.42 40.11 30.24 34.48 41.83
Emolex 44.10 15.51 65.66 25.10 83.19 45.62 58.93 35.27 31.85 33.47 39.17
Emoticons 24.60 0.00 0.00 0.00 33.33 25.00 28.57 19.77 98.79 32.95 20.51
Emoticons DS 10.00 9.85 98.99 17.92 66.67 0.22 0.44 0.00 0.00 0.00 9.18
Happiness Index 33.60 11.83 64.65 20.00 84.93 28.05 42.18 26.46 34.68 30.02 30.73
NRC Hashtag 64.00 20.72 23.23 21.90 70.20 87.13 77.76 52.50 8.47 14.58 38.08
LIWC 33.00 11.11 42.42 17.61 67.69 39.57 49.94 22.90 27.42 24.95 30.84
Comments Opinion Finder 51.80 14.96 35.35 21.02 78.76 66.39 72.04 33.71 36.29 34.95 42.67
BBC Opinion Lexicon 55.00 20.67 62.63 31.08 85.27 61.98 71.79 40.82 40.32 40.57 47.81
PANAS-t 27.10 16.67 6.06 8.89 75.61 50.82 60.78 25.35 94.35 39.97 36.55
Pattern.en 46.00 14.39 58.59 23.11 77.30 49.93 60.67 38.16 23.39 29.00 37.59
SANN 40.10 14.50 59.60 23.32 79.49 41.61 54.63 33.45 37.90 35.54 37.83
SASA 38.20 17.03 47.47 25.07 70.75 50.86 59.18 25.19 39.52 30.77 38.34
SenticNet 27.90 11.91 88.89 21.00 82.69 20.90 33.37 26.39 7.66 11.88 22.08
Sentiment140 50.60 0.00 0.00 0.00 73.23 100.00 84.54 28.60 58.47 38.41 40.98
SentiStrength 44.20 47.37 18.18 26.28 86.64 91.45 88.98 29.37 84.68 43.61 52.96
SentiWordNet 42.40 14.90 59.60 23.84 81.63 44.57 57.66 34.56 37.90 36.15 39.22
SO-CAL 55.50 20.88 57.58 30.65 80.47 65.61 72.28 28.57 34.68 31.33 44.75
Stanford DM 65.50 43.37 36.36 39.56 71.01 92.54 80.36 37.50 14.52 20.93 46.95
Umigon 45.70 28.35 36.36 31.86 76.35 74.65 75.49 29.31 61.69 39.74 49.03
Vader 49.10 15.96 71.72 26.10 82.57 49.05 61.54 50.42 24.19 32.70 40.11
AFINN 52.09 33.78 60.00 43.22 80.06 53.92 64.44 42.57 49.49 45.77 51.14
Emolex 43.64 25.00 43.33 31.71 75.16 46.05 57.11 36.23 49.49 41.83 43.55
Emoticons 28.69 61.90 6.19 11.26 60.00 42.86 50.00 21.71 98.31 35.56 32.27
Emoticons DS 20.98 19.92 99.05 33.17 66.67 1.18 2.32 44.44 2.71 5.11 13.54
Happiness Index 34.54 21.15 50.95 29.89 77.14 21.30 33.38 26.70 53.22 35.56 32.94
NRC Hashtag 53.57 35.02 36.19 35.60 60.57 76.81 67.73 38.20 11.53 17.71 40.35
LIWC 31.10 20.45 34.76 25.75 51.12 32.54 39.77 27.65 42.37 33.47 32.99
Comments Opinion Finder 46.24 32.69 32.38 32.54 71.64 63.64 67.40 35.10 62.71 45.01 48.32
Digg Opinion Lexicon 49.77 34.67 57.62 43.29 78.43 54.12 64.05 37.92 49.49 42.94 50.09
PANAS-t 28.23 11.11 0.48 0.91 66.67 66.67 66.67 27.49 97.29 42.87 36.82
Pattern.en 49.03 33.42 60.48 43.05 69.67 52.35 59.78 41.28 41.69 41.48 48.11
SANN 41.88 26.89 45.71 33.86 75.79 42.26 54.26 35.04 55.59 42.99 43.70
SASA 43.36 29.81 44.29 35.63 64.25 53.99 58.68 32.05 39.66 35.45 43.25
SenticNet 34.35 22.57 81.90 35.39 73.37 18.62 29.70 32.47 21.36 25.77 30.29
Sentiment140 49.30 0.00 0.00 0.00 65.70 100.00 79.30 31.93 56.61 40.83 40.04
SentiStrength 42.53 64.00 22.86 33.68 85.71 84.75 85.23 31.44 88.14 46.35 55.09
SentiWordNet 42.15 27.53 41.43 33.08 73.43 46.50 56.94 34.29 56.95 42.80 44.27
SO-CAL 53.57 38.54 52.86 44.58 75.47 64.39 69.49 28.57 49.49 36.23 50.10
Stanford DM 56.73 45.93 29.52 35.94 63.42 86.20 73.08 41.70 31.53 35.91 48.31
Umigon 53.57 52.97 46.67 49.62 71.62 78.25 74.79 36.48 56.27 44.27 56.23
Vader 53.02 32.53 70.95 44.61 81.30 49.26 61.35 48.80 41.36 44.77 50.24
AFINN 42.45 64.81 41.79 50.81 80.29 68.59 73.98 7.89 77.87 14.32 46.37
Emolex 42.97 55.12 53.72 54.41 75.35 48.67 59.14 7.22 54.10 12.74 42.10
Emoticons 4.68 0.00 0.00 0.00 0.00 0.00 0.00 4.47 99.59 8.56 2.85
Emoticons DS 42.58 42.55 99.77 59.66 78.57 0.37 0.73 0.00 0.00 0.00 30.20
Happiness Index 31.81 48.42 50.18 49.29 71.70 25.96 38.12 5.36 54.10 9.76 32.39
NRC Hashtag 54.84 55.38 45.74 50.10 61.55 68.92 65.03 8.33 15.16 10.76 41.96
LIWC 24.35 42.88 27.72 33.67 53.42 39.12 45.16 4.67 53.28 8.58 29.14
Comments Opinion Finder 29.38 68.77 18.78 29.51 76.52 82.66 79.47 6.29 88.11 11.75 40.24
NYT Opinion Lexicon 44.57 65.95 43.15 52.17 79.81 70.65 74.95 7.94 73.77 14.34 47.15
PANAS-t 5.88 69.23 1.23 2.41 62.07 75.00 67.92 4.75 99.18 9.07 26.47
Pattern.en 45.39 55.15 44.69 49.37 63.65 61.12 62.36 7.85 45.90 13.41 41.71
SANN 27.92 56.74 29.40 38.73 78.02 55.13 64.61 5.93 79.51 11.04 38.13
SASA 30.04 49.92 30.13 37.58 59.11 52.83 55.80 5.74 61.07 10.49 34.62
SenticNet 50.06 48.30 84.98 61.59 77.05 25.27 38.06 9.81 19.26 13.00 37.55
Sentiment140 34.66 0.00 0.00 0.00 65.76 100.00 79.34 5.83 64.34 10.69 30.01
SentiStrength 18.17 78.51 8.62 15.54 81.12 90.91 85.74 5.41 95.49 10.24 37.17
SentiWordNet 32.20 57.35 34.53 43.10 70.31 56.63 62.73 6.08 70.08 11.19 39.01
SO-CAL 50.79 64.36 51.13 56.99 77.25 68.36 72.53 8.68 65.98 15.34 48.29
Stanford DM 51.93 73.39 21.14 32.83 59.48 92.67 72.46 9.65 38.11 15.40 40.23
Umigon 24.08 68.76 16.38 26.46 68.78 80.38 74.13 5.88 88.93 11.04 37.21
Vader 48.84 61.96 52.40 56.78 80.09 63.00 70.52 9.51 70.90 16.77 48.03
AFINN 54.11 62.78 69.50 65.97 75.74 57.61 65.44 21.83 49.11 30.22 53.88
Emolex 45.77 54.49 61.01 57.57 68.78 46.53 55.51 17.63 43.75 25.13 46.07
Emoticons 14.66 100.00 0.94 1.87 88.89 100.00 94.12 11.93 100.00 21.31 39.10
Emoticons DS 37.90 37.90 100.00 54.97 0.00 0.00 0.00 0.00 0.00 0.00 27.48
Happiness Index 37.90 51.85 61.64 56.32 71.88 27.49 39.77 12.68 47.32 20.00 38.70
NRC Hashtag 57.69 57.54 51.57 54.39 63.86 71.99 67.68 13.43 8.04 10.06 44.04
LIWC 32.66 41.00 38.68 39.81 54.59 36.33 43.63 14.12 44.64 21.46 34.96
Comments Opinion Finder 42.31 59.31 38.05 46.36 64.75 68.44 66.54 15.13 48.21 23.03 45.31
TED Opinion Lexicon 54.47 62.76 67.30 64.95 72.97 59.81 65.74 22.59 48.21 30.77 53.82
PANAS-t 14.78 90.00 2.83 5.49 55.56 83.33 66.67 13.41 98.21 23.61 31.92
Pattern.en 52.32 56.78 69.81 62.62 62.25 52.66 57.06 19.86 25.89 22.48 47.39
SANN 48.87 60.62 55.66 58.03 68.27 61.67 64.80 17.39 42.86 24.74 49.19
SASA 40.52 51.77 50.63 51.19 63.80 48.45 55.08 12.38 33.93 18.14 41.47
SenticNet 46.48 42.99 87.74 57.70 70.13 22.59 34.18 7.69 2.68 3.97 31.95
Sentiment140 34.21 0.00 0.00 0.00 64.24 100.00 78.23 14.73 66.96 24.15 34.13
SentiStrength 33.61 80.45 33.65 47.45 75.25 74.51 74.88 16.36 88.39 27.62 49.98
SentiWordNet 36.11 50.15 51.89 51.00 55.41 33.33 41.62 15.47 50.00 23.63 38.75
SO-CAL 52.68 65.01 70.13 67.47 64.56 60.53 62.48 14.23 31.25 19.55 49.84
Stanford DM 55.66 76.30 50.63 60.87 60.13 85.21 70.50 12.08 16.07 13.79 48.39
Umigon 40.52 68.49 51.26 58.63 58.96 57.63 58.29 17.52 66.96 27.78 48.23
Vader 57.57 60.61 74.53 66.85 71.96 58.04 64.25 21.71 29.46 25.00 52.04
Table XV. 3-class experiment results with 4 datasets
Dataset Method Accur. Posit. Sentiment Negat. Sentiment Neut. Sentiment MacroF1
P R F1 P R F1 P R F1
AFINN 59.52 70.31 71.53 70.91 59.14 41.65 48.88 43.22 49.03 45.94 55.24
Emolex 48.66 65.82 49.97 56.81 49.02 40.90 44.59 34.38 54.05 42.03 47.81
Emoticons 32.55 74.09 10.99 19.14 37.21 20.00 26.02 22.60 93.33 36.39 27.18
Emoticons DS 49.16 49.39 97.96 65.67 45.45 0.59 1.17 40.96 3.49 6.43 24.43
Happiness Index 45.82 56.73 55.44 56.08 50.55 16.29 24.64 24.94 51.38 33.58 38.10
NRC Hashtag 50.19 71.46 57.30 63.60 36.80 61.75 46.12 35.16 14.46 20.49 43.40
LIWC 40.97 52.93 52.67 52.80 26.79 14.00 18.39 30.72 40.21 34.83 35.34
Comments Opinion Finder 42.41 69.84 30.87 42.82 41.98 52.56 46.68 32.85 70.26 44.77 44.76
YTB Opinion Lexicon 56.74 72.29 63.60 67.67 54.33 45.94 49.78 40.47 54.26 46.36 54.60
PANAS-t 29.12 51.72 0.90 1.77 46.67 60.00 52.50 28.68 98.05 44.38 32.88
Pattern.en 57.73 70.62 73.33 71.95 47.91 41.94 44.73 41.56 38.87 40.17 52.28
SANN 49.46 67.21 51.95 58.60 47.83 34.27 39.93 36.14 61.54 45.54 48.02
SASA 46.61 66.86 49.31 56.76 34.72 48.55 40.48 35.69 39.28 37.40 44.88
SenticNet 53.01 56.83 80.48 66.62 49.24 18.30 26.68 28.88 24.41 26.46 39.92
Sentiment140 29.70 0.00 0.00 0.00 38.35 100.00 55.44 24.91 56.00 34.48 29.97
SentiStrength 51.01 90.27 41.80 57.14 67.38 71.70 69.47 36.19 87.38 51.19 59.27
SentiWordNet 47.93 67.20 50.33 57.55 39.67 37.17 38.38 35.68 56.72 43.80 46.58
SO-CAL 57.15 74.11 62.22 67.65 54.36 52.43 53.38 28.65 52.51 37.07 52.70
Stanford DM 47.05 81.84 47.09 59.78 32.80 76.16 45.86 34.88 26.97 30.42 45.35
Umigon 57.21 79.34 62.28 69.78 44.02 59.09 50.45 43.00 53.54 47.69 55.98
Vader 61.11 68.99 78.02 73.22 57.76 40.53 47.64 46.11 39.49 42.54 54.47
AFINN 62.25 83.92 68.38 75.35 41.29 41.03 41.16 33.12 50.24 39.92 52.14
Emolex 52.35 81.59 55.56 66.10 33.11 35.77 34.39 25.54 51.21 34.08 44.86
Emoticons 26.22 86.75 10.26 18.34 38.46 31.25 34.48 17.18 94.69 29.08 27.30
Emoticons DS 67.15 67.61 99.00 80.35 0.00 0.00 0.00 40.00 1.93 3.69 28.01
Happiness Index 56.87 77.44 65.53 70.99 47.37 16.77 24.77 21.21 50.72 29.91 41.89
NRC Hashtag 44.19 86.97 46.58 60.67 18.55 69.18 29.26 31.94 11.11 16.49 35.47
LIWC 54.85 77.20 63.68 69.79 27.36 18.01 21.72 26.69 45.89 33.75 41.75
Myspace Opinion Finder 38.71 85.94 30.48 45.01 23.02 47.76 31.07 24.04 75.85 36.51 37.53
Opinion Lexicon 52.74 83.98 55.27 66.67 33.74 42.64 37.67 25.48 51.21 34.03 46.12
PANAS-t 26.80 97.33 10.40 18.79 40.00 66.67 50.00 21.13 97.58 34.74 34.51
Pattern.en 60.33 82.82 68.66 75.08 31.36 34.64 32.92 32.07 44.93 37.42 48.47
SANN 45.82 80.99 44.30 57.27 29.66 32.41 30.97 24.30 63.29 35.12 41.12
SASA 39.39 82.33 33.19 47.31 23.10 59.35 33.26 23.53 50.24 32.05 37.54
SenticNet 64.27 76.17 81.05 78.54 34.03 21.59 26.42 25.37 24.64 25.00 43.32
Sentiment140 21.52 0.00 0.00 0.00 31.27 100.00 47.65 18.02 66.67 28.37 25.34
SentiStrength 43.61 96.69 33.33 49.58 74.19 74.19 74.19 25.65 95.17 40.41 54.73
SentiWordNet 52.45 82.50 56.41 67.01 22.47 32.26 26.49 28.72 53.14 37.29 43.59
SO-CAL 53.99 85.40 54.99 66.90 36.21 48.84 41.58 21.40 54.59 30.75 46.41
Stanford DM 35.35 89.50 27.92 42.56 16.64 81.45 27.63 33.02 34.30 33.65 34.62
Umigon 56.29 88.82 58.83 70.78 25.76 56.67 35.42 33.65 50.72 40.46 48.89
Vader 65.80 82.33 76.35 79.23 41.78 34.66 37.89 36.07 42.51 39.02 52.05
AFINN 45.06 35.38 51.37 41.90 61.16 40.33 48.61 43.70 49.32 46.34 45.62
Emolex 40.83 27.79 42.33 33.55 58.98 34.72 43.71 41.80 46.54 44.04 40.43
Emoticons 38.57 29.41 1.37 2.62 43.75 22.58 29.79 27.87 97.86 43.39 25.26
Emoticons DS 23.32 22.84 99.86 37.17 42.11 0.32 0.64 66.67 1.43 2.80 13.54
Happiness Index 36.87 23.84 33.15 27.74 56.57 19.31 28.79 28.80 60.92 39.11 31.88
NRC Hashtag 41.07 27.81 21.37 24.17 43.34 72.35 54.21 49.35 9.05 15.30 31.23
LIWC 33.45 28.48 72.60 40.91 80.76 16.13 26.89 28.02 23.59 25.61 31.14
Tweets Opinion Finder 43.33 34.62 18.49 24.11 56.82 57.85 57.33 41.13 72.92 52.59 44.68
DBT Opinion Lexicon 47.10 38.73 49.45 43.44 61.05 46.13 52.55 44.85 53.61 48.84 48.28
PANAS-t 39.28 23.33 0.96 1.84 71.11 58.18 64.00 38.98 97.93 55.77 40.54
Pattern.en 40.61 32.73 46.71 38.49 47.33 50.25 48.74 38.00 21.13 27.16 38.13
SANN 41.57 29.06 28.22 28.63 59.54 38.05 46.43 41.34 66.00 50.84 41.97
SASA 39.87 30.33 44.79 36.17 51.29 35.81 42.17 40.58 43.29 41.89 40.08
SenticNet 31.72 22.97 66.30 34.12 53.09 17.49 26.31 29.18 15.81 20.50 26.98
Sentiment140 44.84 0.00 0.00 0.00 49.86 100.00 66.54 40.84 58.46 48.09 38.21
SentiStrength 43.92 41.36 12.47 19.16 64.34 65.04 64.69 41.25 86.66 55.89 46.58
SentiWordNet 39.96 28.41 33.42 30.71 52.82 36.66 43.28 40.70 55.12 46.83 40.27
SO-CAL 47.25 38.76 44.38 41.38 58.75 49.75 53.88 31.23 55.52 39.98 45.08
Stanford DM 44.47 47.09 23.29 31.16 44.02 84.27 57.83 44.67 19.62 27.26 38.75
Umigon 44.66 40.47 28.49 33.44 57.97 55.52 56.72 41.45 67.99 51.50 47.22
Vader 44.75 33.23 59.18 42.56 61.40 37.69 46.71 45.43 39.08 42.02 43.76
AFINN 56.79 63.64 63.64 63.64 79.17 70.37 74.51 37.14 81.25 50.98 63.04
Emolex 45.68 41.67 45.45 43.48 84.21 53.33 65.31 28.95 68.75 40.74 49.84
Emoticons 19.75 0.00 0.00 0.00 0.00 0.00 0.00 16.49 100.00 28.32 9.44
Emoticons DS 30.86 28.57 100.00 44.44 100.00 1.79 3.51 66.67 12.50 21.05 23.00
Happiness Index 37.04 44.83 59.09 50.98 80.00 20.00 32.00 21.67 81.25 34.21 39.06
NRC Hashtag 53.09 47.83 50.00 48.89 58.33 70.00 63.64 40.00 25.00 30.77 47.76
LIWC 49.38 50.00 63.64 56.00 86.67 48.15 61.90 34.21 81.25 48.15 55.35
Irony Opinion Finder 38.27 70.00 31.82 43.75 88.89 72.73 80.00 25.81 100.00 41.03 54.93
Opinion Lexicon 46.91 47.83 50.00 48.89 86.67 52.00 65.00 32.56 87.50 47.46 53.78
PANAS-t 20.99 0.00 0.00 0.00 100.00 100.00 100.00 20.00 100.00 33.33 44.44
Pattern.en 53.09 62.96 77.27 69.39 76.47 56.52 65.00 35.14 81.25 49.06 61.15
SANN 40.74 40.91 40.91 40.91 100.00 43.48 60.61 28.57 87.50 43.08 48.20
SASA 39.51 55.56 45.45 50.00 66.67 63.64 65.12 19.05 50.00 27.59 47.57
SenticNet 41.98 33.33 86.36 48.10 61.54 17.39 27.12 38.89 43.75 41.18 38.80
Sentiment140 40.74 0.00 0.00 0.00 65.00 100.00 78.79 17.07 43.75 24.56 34.45
SentiStrength 39.51 87.50 31.82 46.67 90.00 90.00 90.00 25.40 100.00 40.51 59.06
SentiWordNet 46.91 52.17 54.55 53.33 78.57 50.00 61.11 34.09 93.75 50.00 54.81
SO-CAL 55.56 59.09 59.09 59.09 83.33 68.97 75.47 25.53 75.00 38.10 57.55
Stanford DM 62.96 76.92 45.45 57.14 64.29 92.31 75.79 41.67 31.25 35.71 56.22
Umigon 41.98 52.94 40.91 46.15 63.64 63.64 63.64 26.19 68.75 37.93 49.24
Vader 53.09 47.06 72.73 57.14 82.61 51.35 63.33 33.33 50.00 40.00 53.49
Table XVI. 3-class experiment results with 4 datasets
Dataset Method Accur. Posit. Sentiment Negat. Sentiment Neut. Sentiment MacroF1
P R F1 P R F1 P R F1
AFINN 52.63 51.02 75.76 60.98 71.43 29.41 41.67 46.88 62.50 53.57 52.07
Emolex 37.89 39.53 51.52 44.74 71.43 16.13 26.32 31.11 58.33 40.58 37.21
Emoticons 25.26 50.00 3.03 5.71 0.00 0.00 0.00 19.83 95.83 32.86 12.86
Emoticons DS 33.68 34.04 96.97 50.39 0.00 0.00 0.00 0.00 0.00 0.00 25.20
Happiness Index 35.79 35.90 42.42 38.89 75.00 19.35 30.77 22.58 58.33 32.56 34.07
NRC Hashtag 51.58 52.38 66.67 58.67 53.06 56.52 54.74 25.00 4.17 7.14 40.18
LIWC 49.47 48.98 72.73 58.54 72.22 34.21 46.43 35.71 41.67 38.46 47.81
Sarcasm Opinion Finder 48.42 55.17 48.48 51.61 69.23 40.91 51.43 39.62 87.50 54.55 52.53
Opinion Lexicon 51.58 52.38 66.67 58.67 75.00 37.50 50.00 40.54 62.50 49.18 52.62
PANAS-t 26.32 0.00 0.00 0.00 100.00 50.00 66.67 25.81 100.00 41.03 35.90
Pattern.en 49.47 46.30 75.76 57.47 52.00 30.95 38.81 56.25 37.50 45.00 47.09
SANN 42.11 43.18 57.58 49.35 72.73 24.24 36.36 32.50 54.17 40.62 42.11
SASA 36.84 35.14 39.39 37.14 48.15 35.14 40.62 29.03 37.50 32.73 36.83
SenticNet 42.11 38.16 87.88 53.21 63.64 12.96 21.54 33.33 16.67 22.22 32.32
Sentiment140 48.42 0.00 0.00 0.00 59.62 100.00 74.70 34.88 62.50 44.78 39.82
SentiStrength 45.26 63.16 36.36 46.15 80.00 63.16 70.59 31.15 79.17 44.71 53.82
SentiWordNet 45.26 50.00 63.64 56.00 42.11 27.59 33.33 41.18 58.33 48.28 45.87
SO-CAL 52.63 45.24 57.58 50.67 84.21 41.03 55.17 30.61 62.50 41.10 48.98
Stanford DM 49.47 70.00 42.42 52.83 46.67 82.35 59.57 33.33 20.83 25.64 46.02
Umigon 50.53 60.00 54.55 57.14 55.17 57.14 56.14 38.89 58.33 46.67 53.32
Vader 51.58 44.62 87.88 59.18 78.57 23.40 36.07 56.25 37.50 45.00 46.75
Dataset: Tweets RND I
AFINN 56.67 51.35 66.64 58.01 55.39 33.23 41.54 62.54 55.81 58.98 52.84
Emolex 48.21 43.45 49.48 46.27 45.23 27.48 34.19 52.94 54.02 53.47 44.64
Emoticons 49.67 68.92 15.22 24.94 49.09 36.99 42.19 32.52 94.67 48.42 38.51
Emoticons DS 32.93 32.03 99.55 48.47 56.52 0.46 0.91 92.59 2.56 4.98 18.12
Happiness Index 47.85 40.66 57.84 47.75 49.13 13.07 20.64 35.28 55.56 43.16 37.18
NRC Hashtag 38.64 39.51 63.21 48.62 32.81 31.19 31.98 66.67 10.45 18.06 32.89
LIWC 40.74 34.96 43.36 38.71 29.89 13.45 18.55 48.51 50.13 49.31 35.52
Opinion Finder 54.83 61.57 31.57 41.74 50.26 52.35 51.28 54.16 82.59 65.42 52.81
Opinion Lexicon 56.29 53.28 55.75 54.49 54.56 40.35 46.39 59.07 61.34 60.19 53.69
PANAS-t 47.34 72.09 4.63 8.70 50.79 57.14 53.78 46.76 98.00 63.31 41.93
Pattern.en 53.49 50.41 69.18 58.32 45.55 33.58 38.66 63.34 45.11 52.69 49.89
SANN 53.42 49.96 44.10 46.85 53.27 31.88 39.88 55.06 71.58 62.24 49.66
SASA 46.04 39.93 41.27 40.59 39.09 28.28 32.82 53.12 54.89 53.99 42.47
SenticNet 41.58 36.82 84.55 51.30 46.68 13.98 21.52 39.23 16.13 22.86 31.89
Sentiment140 47.48 0.00 0.00 0.00 42.90 100.00 60.05 50.02 69.84 58.29 39.45
SentiStrength 55.23 73.64 29.40 42.03 67.63 57.14 61.94 51.36 90.17 65.44 56.47
SentiWordNet 50.45 50.00 52.09 51.02 36.55 31.30 33.72 56.88 57.55 57.22 47.32
SO-CAL 57.57 54.60 55.37 54.98 54.91 42.34 47.81 37.73 63.85 47.43 50.08
Stanford DM 30.95 65.49 23.51 34.60 24.34 83.42 37.68 49.39 8.35 14.28 28.85
Umigon 60.66 63.95 57.46 60.53 50.35 53.43 51.85 63.69 66.82 65.22 59.20
Vader 56.72 49.52 76.49 60.12 56.34 30.66 39.71 67.97 47.06 55.61 51.81
Dataset: Tweets RND III
AFINN 64.41 40.81 72.12 52.13 49.67 28.29 36.05 85.95 62.54 72.40 53.53
Emolex 54.76 31.67 59.95 41.44 40.14 19.53 26.27 77.48 54.64 64.08 43.93
Emoticons 70.22 70.06 16.78 27.07 65.62 44.21 52.83 41.29 97.56 58.02 45.98
Emoticons DS 20.34 19.78 99.46 33.00 62.07 0.60 1.19 53.85 0.55 1.09 11.76
Happiness Index 55.16 29.13 61.98 39.64 50.65 9.50 16.01 43.35 59.16 50.03 35.23
NRC Hashtag 30.47 28.25 77.40 41.39 24.18 19.59 21.64 79.08 8.77 15.78 26.27
LIWC 46.88 21.85 38.43 27.86 19.18 8.05 11.34 69.51 54.83 61.31 33.50
Opinion Finder 71.55 57.48 32.75 41.72 49.85 48.56 49.20 75.95 89.90 82.34 57.75
Opinion Lexicon 63.86 40.65 66.17 50.36 48.84 27.73 35.38 81.96 64.66 72.29 52.68
PANAS-t 68.79 79.49 8.39 15.18 48.57 51.52 50.00 68.75 98.86 81.10 48.76
Pattern.en 53.57 36.25 76.86 49.26 35.19 22.50 27.45 84.20 45.68 59.23 45.31
SANN 66.88 42.70 48.71 45.51 46.35 26.93 34.07 77.99 77.99 77.99 52.52
SASA 55.37 29.42 54.53 38.22 42.46 19.28 26.52 78.30 57.15 66.08 43.60
SenticNet 33.47 23.66 86.60 37.17 41.47 10.06 16.19 43.44 15.37 22.71 25.36
Sentiment140 55.05 0.00 0.00 0.00 28.14 100.00 43.92 71.14 66.00 68.47 37.46
SentiStrength 73.80 70.94 41.95 52.72 57.53 49.80 53.39 75.35 92.26 82.95 63.02
SentiWordNet 55.85 37.42 58.19 45.55 24.04 19.57 21.58 79.25 59.00 67.64 44.92
SO-CAL 66.51 43.06 68.88 52.99 51.84 30.55 38.44 45.77 66.94 54.37 48.60
Stanford DM 31.90 64.48 38.57 48.26 15.58 72.55 25.65 75.64 19.77 31.35 35.09
Umigon 74.12 57.67 70.23 63.33 48.83 46.71 47.75 88.80 76.34 82.10 64.39
Vader 59.82 37.52 81.73 51.43 47.99 24.25 32.22 89.26 52.28 65.94 49.86
Dataset: Tweets RND IV
AFINN 50.60 46.05 62.26 52.94 50.96 31.36 38.83 55.80 45.50 50.12 47.30
Emolex 45.40 42.27 51.57 46.46 44.05 24.83 31.76 48.65 48.65 48.65 42.29
Emoticons 77.20 76.06 67.92 71.76 82.50 74.44 78.26 42.93 80.63 56.03 68.68
Emoticons DS 32.60 32.04 98.74 48.38 66.67 0.60 1.18 57.14 1.80 3.49 17.69
Happiness Index 43.20 36.96 53.46 43.70 52.27 13.69 21.70 32.34 48.65 38.85 34.75
NRC Hashtag 36.00 39.27 61.01 47.78 29.68 30.23 29.95 52.94 8.11 14.06 30.60
LIWC 44.40 0.00 0.00 0.00 0.00 0.00 0.00 44.40 100.00 61.50 20.50
Opinion Finder 49.00 54.26 32.08 40.32 38.10 42.67 40.25 50.31 72.97 59.56 46.71
Opinion Lexicon 51.40 50.53 59.75 54.76 47.66 35.42 40.64 54.15 50.00 51.99 49.13
PANAS-t 46.00 53.85 4.40 8.14 40.00 50.00 44.44 45.97 97.75 62.54 38.37
Pattern.en 64.20 58.02 88.68 70.15 61.18 50.49 55.32 87.36 34.23 49.19 58.22
SANN 46.60 43.59 42.77 43.17 44.59 27.27 33.85 48.89 59.46 53.66 43.56
SASA 44.40 39.11 44.03 41.42 39.05 27.33 32.16 51.39 50.00 50.68 41.42
SenticNet 39.00 34.77 76.10 47.73 48.28 15.61 23.60 32.99 14.41 20.06 30.46
Sentiment140 51.40 0.00 0.00 0.00 50.49 100.00 67.10 52.03 69.37 59.46 42.19
SentiStrength 54.20 72.50 36.48 48.54 55.26 48.84 51.85 50.26 86.49 63.58 54.65
SentiWordNet 45.80 43.11 45.28 44.17 37.21 25.20 30.05 50.61 56.31 53.30 42.51
SO-CAL 54.80 49.43 54.09 51.65 53.85 38.89 45.16 37.29 59.46 45.83 47.55
Stanford DM 35.20 67.86 35.85 46.91 26.56 78.40 39.68 44.68 9.46 15.61 34.07
Umigon 74.40 69.84 83.02 75.86 65.85 65.45 65.65 89.80 59.46 71.54 71.02
Vader 59.00 50.92 86.79 64.19 60.16 36.67 45.56 79.21 36.04 49.54 53.09
Table XVII. Results of the 3-class experiments on 4 datasets
Dataset | Method | Accur. | Positive sentiment (P, R, F1) | Negative sentiment (P, R, F1) | Neutral sentiment (P, R, F1) | MacroF1
Dataset: Tweets SAN
AFINN 58.70 29.77 61.66 40.15 45.29 26.63 33.54 81.19 60.69 69.46 47.72
Emolex 50.58 21.21 50.67 29.90 42.83 17.62 24.97 74.29 54.01 62.55 39.14
Emoticons 67.11 30.17 6.74 11.02 52.78 19.00 27.94 40.69 96.19 57.19 32.05
Emoticons DS 16.44 15.03 96.34 26.01 26.32 0.18 0.35 73.42 2.49 4.81 10.39
Happiness Index 55.52 21.61 42.97 28.76 44.71 10.31 16.76 42.05 67.94 51.95 32.49
NRC Hashtag 25.53 18.66 54.53 27.80 25.64 25.84 25.74 70.00 6.90 12.56 22.03
LIWC 58.27 28.45 62.43 39.08 49.42 20.72 29.20 78.64 62.49 69.64 45.98
Opinion Finder 65.30 31.04 23.51 26.75 40.23 33.74 36.70 73.51 84.70 78.71 47.39
Opinion Lexicon 59.96 31.85 58.00 41.12 44.69 30.45 36.22 79.55 63.01 70.32 49.22
PANAS-t 68.11 44.44 1.54 2.98 45.16 58.33 50.91 68.44 99.01 80.94 44.94
Pattern.en 53.12 25.00 68.59 36.64 49.68 18.04 26.46 44.57 52.64 48.27 37.13
SANN 60.16 27.26 40.08 32.45 39.44 20.14 26.67 74.24 73.38 73.81 44.31
SASA 51.11 20.67 36.99 26.52 23.44 11.74 15.64 70.29 62.58 66.21 36.12
SenticNet 28.50 17.10 85.93 28.53 44.82 7.92 13.46 46.17 14.74 22.35 21.45
Sentiment140 56.10 0.00 0.00 0.00 29.87 100.00 46.00 74.82 64.08 69.04 38.35
SentiStrength 69.63 46.87 30.25 36.77 58.41 42.58 49.25 73.17 89.80 80.64 55.55
SentiWordNet 54.38 24.61 46.05 32.08 33.85 21.21 26.08 76.22 61.12 67.84 42.00
SO-CAL 58.67 28.02 55.88 37.32 48.40 28.91 36.20 44.54 60.69 51.38 41.63
Stanford DM 25.09 43.09 20.42 27.71 18.42 79.10 29.88 74.33 9.56 16.94 24.84
Umigon 60.43 35.96 57.23 44.16 39.69 37.10 38.35 80.57 62.58 70.45 50.99
Vader 55.05 28.12 71.68 40.39 44.98 23.43 30.81 84.04 52.38 64.54 45.25
Dataset: Tweets Semeval
AFINN 62.36 61.10 70.09 65.28 44.08 31.91 37.02 71.43 58.57 64.37 55.56
Emolex 48.74 48.15 62.71 54.47 31.27 17.71 22.61 57.90 41.30 48.21 41.76
Emoticons 52.88 72.83 11.34 19.62 55.56 32.37 40.91 34.05 96.53 50.34 36.96
Emoticons DS 36.59 36.55 100.00 53.53 75.00 0.08 0.16 100.00 0.03 0.07 17.92
Happiness Index 48.81 43.61 65.27 52.29 36.96 7.54 12.53 36.82 45.16 40.56 35.13
NRC Hashtag 36.95 42.04 75.03 53.88 24.57 16.94 20.05 53.33 3.70 6.92 26.95
LIWC 39.54 36.52 42.33 39.21 15.14 6.25 8.84 48.64 44.83 46.66 31.57
Opinion Finder 57.63 67.57 27.94 39.53 40.75 48.62 44.34 58.20 86.06 69.44 51.10
Opinion Lexicon 60.37 62.09 62.71 62.40 41.19 34.18 37.36 66.41 60.75 63.46 54.41
PANAS-t 53.08 90.95 9.04 16.45 51.56 62.26 56.41 51.65 99.01 67.89 46.92
Pattern.en 50.19 58.07 68.47 62.84 24.68 29.82 27.01 67.73 35.22 46.34 45.40
SANN 54.77 52.72 47.59 50.02 38.91 20.92 27.21 58.95 66.90 62.67 46.64
SASA 50.63 46.34 47.77 47.04 33.07 12.14 17.76 56.39 61.12 58.66 41.15
SenticNet 39.90 39.81 86.55 54.54 31.85 8.98 14.01 38.18 7.20 12.12 26.89
Sentiment140 42.25 0.00 0.00 0.00 26.79 100.00 42.25 50.57 66.14 57.31 33.19
SentiStrength 57.83 78.01 27.13 40.25 47.80 53.55 50.52 55.49 89.89 68.62 53.13
SentiWordNet 48.33 55.54 53.44 54.47 19.67 24.82 21.95 61.22 47.57 53.54 43.32
SO-CAL 58.83 58.89 59.02 58.95 40.39 33.14 36.41 39.89 59.96 47.91 47.76
Stanford DM 22.54 72.14 18.17 29.03 14.92 82.93 25.28 47.19 6.94 12.10 22.14
Umigon 65.88 75.18 56.14 64.28 39.66 53.18 45.44 70.65 75.78 73.13 60.95
Vader 60.05 56.08 79.26 65.68 44.13 26.60 33.19 76.88 46.02 57.57 52.15
Dataset: RW
AFINN 55.07 59.82 80.58 68.66 50.83 25.99 34.39 44.13 27.57 33.94 45.66
Emolex 48.95 56.31 68.18 61.68 39.29 23.12 29.11 39.77 30.79 34.71 41.83
Emoticons 37.28 65.12 17.36 27.41 46.15 21.05 28.92 24.81 86.22 38.53 31.62
Emoticons DS 46.75 46.62 99.59 63.50 66.67 0.72 1.42 50.00 0.88 1.73 22.22
Happiness Index 47.80 51.99 75.41 61.55 47.42 12.01 19.17 26.49 26.10 26.29 35.67
NRC Hashtag 44.07 57.59 61.16 59.32 30.10 40.60 34.57 43.24 4.69 8.47 34.12
LIWC 54.02 59.03 80.37 68.07 55.46 19.64 29.01 41.04 32.26 36.12 44.40
Opinion Finder 39.39 56.67 38.64 45.95 27.86 39.92 32.82 34.67 38.12 36.31 38.36
Opinion Lexicon 52.10 59.86 72.11 65.42 45.16 29.52 35.70 39.84 28.74 33.39 44.84
PANAS-t 34.99 60.61 8.26 14.55 30.19 38.10 33.68 33.44 90.91 48.90 32.38
Pattern.en 47.61 59.74 67.15 63.23 32.69 35.01 33.81 39.01 16.13 22.82 39.95
SANN 46.94 57.01 63.02 59.86 38.19 24.84 30.10 35.26 32.26 33.69 41.22
SASA 41.01 55.28 41.12 47.16 30.09 28.76 29.41 35.11 48.39 40.69 39.09
SenticNet 49.62 50.81 90.29 65.03 42.50 10.76 17.17 31.96 9.09 14.16 32.12
Sentiment140 31.36 0.00 0.00 0.00 33.08 100.00 49.72 29.59 44.87 35.66 28.46
SentiStrength 41.11 77.04 21.49 33.60 45.57 53.73 49.32 34.86 85.04 49.45 44.12
SentiWordNet 46.27 57.01 63.02 59.86 31.03 28.12 29.51 40.27 26.10 31.67 40.35
SO-CAL 49.24 63.07 62.81 62.94 36.89 40.47 38.60 27.61 26.39 26.99 42.84
Stanford DM 31.84 78.57 15.91 26.46 24.13 90.54 38.10 47.83 16.13 24.12 29.56
Umigon 41.78 70.59 39.67 50.79 27.73 65.22 38.91 40.77 27.86 33.10 40.94
Vader 55.26 57.77 87.60 69.62 51.93 23.27 32.14 45.80 17.60 25.42 42.39
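As a reading aid for Tables XVI and XVII, each per-class F1 value is the harmonic mean of the corresponding precision (P) and recall (R), and the MacroF1 column is the unweighted average of the three per-class F1 scores. The short Python sketch below is purely illustrative (it is not the evaluation code used in the paper) and recomputes these quantities for the first AFINN row of Table XVI; small deviations from the printed values are due to rounding.

# Illustrative sketch (not the evaluation code used in the paper): recompute
# the per-class F1 and the MacroF1 column of Tables XVI and XVII from the
# precision (P) and recall (R) values printed in each row (all in percent).

def f1(precision, recall):
    """Harmonic mean of precision and recall; returns 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def macro_f1(f1_scores):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_scores) / len(f1_scores)

# Example: first AFINN row of Table XVI (P and R per class, in percent).
f1_pos = f1(51.02, 75.76)   # ~60.98
f1_neg = f1(71.43, 29.41)   # ~41.67
f1_neu = f1(46.88, 62.50)   # ~53.57

print(round(f1_pos, 2), round(f1_neg, 2), round(f1_neu, 2))
print(round(macro_f1([f1_pos, f1_neg, f1_neu]), 2))  # ~52.07, matching the MacroF1 column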
ACKNOWLEDGMENTS
This work is supported by grants from CAPES, Fapemig, and CNPq.
REFERENCES
Fotis Aisopos. 2014. Manually Annotated Sentiment Analysis Twitter Dataset NTUA. (2014). www.grid.ece.ntua.gr.
Matheus Araújo, Pollyanna Gonçalves, Fabrício Benevenuto, and Meeyoung Cha. 2014. iFeel: A System that Compares and
Combines Sentiment Analysis Methods. In WWW (Companion Volume). International World Wide Web Conference
(WWW’14), 4.
Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An Enhanced Lexical Resource for
Sentiment Analysis and Opinion Mining. In LREC, Nicoletta Calzolari, Khalid Choukri, Bente Maegaard,
Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias (Eds.). European Language Resources
Association. http://dblp.uni-trier.de/db/conf/lrec/lrec2010.html#BaccianellaES10
Mark L. Berenson, David M. Levine, and Kathryn A. Szabat. 2014. Basic Business Statistics - Concepts and Applications
(13 ed.). Pearson. 840 pages.
Celeste Biever. 2010. Twitter mood maps reveal emotional states of America. The New Scientist 207 (2010). Issue 2771.
Johan Bollen, Huina Mao, and Xiao-Jun Zeng. 2010. Twitter Mood Predicts the Stock Market. CoRR abs/1010.3003 (2010).
Johan Bollen, Alberto Pepe, and Huina Mao. 2009. Modeling Public Mood and Emotion: Twitter Sentiment and Socio-
Economic Phenomena. CoRR abs/0911.1583 (2009).
M. M. Bradley and P. J. Lang. 1999. Affective norms for English words (ANEW): Stimuli, instruction manual, and affective
ratings. Technical Report. Center for Research in Psychophysiology, University of Florida, Gainesville, Florida.
Erik Cambria, Robert Speer, Catherine Havasi, and Amir Hussain. 2010. SenticNet: A Publicly Available Semantic Resource
for Opinion Mining. In AAAI Fall Symposium Series.
Meeyoung Cha, Hamed Haddadi, Fabricio Benevenuto, and Krishna P. Gummadi. 2010. Measuring User Influence in Twitter:
The Million Follower Fallacy. In International AAAI Conference on Weblogs and Social Media (ICWSM).
Tom De Smedt and Walter Daelemans. 2012. Pattern for python. The Journal of Machine Learning Research 13, 1 (2012),
2063–2067.
N.A. Diakopoulos and D.A. Shamma. 2010. Characterizing debate performance via aggregated twitter sentiment. In Pro-
ceedings of the 28th international conference on Human factors in computing systems. ACM, 1195–1198.
Peter Sheridan Dodds, Eric M. Clark, Suma Desu, Morgan R. Frank, Andrew J. Reagan, Jake Ryland Williams, Lewis
Mitchell, Kameron Decker Harris, Isabel M. Kloumann, James P. Bagrow, Karine Megerdoomian, Matthew T. McMa-
hon, Brian F. Tivnan, and Christopher M. Danforth. 2015. Human language reveals a universal positivity bias. Proceed-
ings of the National Academy of Sciences 112, 8 (2015), 2389–2394. DOI:http://dx.doi.org/10.1073/pnas.1411678112
Peter Sheridan Dodds and Christopher M Danforth. 2009. Measuring the happiness of large-scale writ-
ten expression: songs, blogs, and presidents. Journal of Happiness Studies 11, 4 (2009), 441–456.
DOI:http://dx.doi.org/10.1007/s10902-009-9150-9
Andrea Esuli and Fabrizio Sebastiani. 2006. SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining. In International
Conference on Language Resources and Evaluation (LREC). 417–422.
Ronen Feldman. 2013. Techniques and Applications for Sentiment Analysis. Commun. ACM 56, 4 (April 2013), 82–89.
DOI:http://dx.doi.org/10.1145/2436256.2436274
Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter Sentiment Classification using Distant Supervision. Processing -
(2009), 1–6.
Namrata Godbole, Manjunath Srinivasaiah, and Steven Skiena. 2007. Large-Scale Sentiment Analysis for News and Blogs.
In Proceedings of the International Conference on Weblogs and Social Media (ICWSM).
Pollyanna Gonçalves, Matheus Araújo, Fabrício Benevenuto, and Meeyoung Cha. 2013a. Comparing and Combining Sentiment Analysis Methods. In Proceedings of the 1st ACM Conference on Online Social Networks (COSN'13). 12.
Pollyanna Gonçalves, Fabrício Benevenuto, and Meeyoung Cha. 2013b. PANAS-t: A Psychometric Scale for Measuring Sentiments on Twitter. CoRR abs/1308.1857v1 (2013).
Aniko Hannak, Eric Anderson, Lisa Feldman Barrett, Sune Lehmann, Alan Mislove, and Mirek Riedewald. 2012. Tweetin’
in the Rain: Exploring societal-scale effects of weather on mood. In Int’l AAAI Conference on Weblogs and Social
Media (ICWSM).
Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews (KDD '04). 168–177. http://doi.acm.org/10.1145/1014052.1014073
CJ Hutto and Eric Gilbert. 2014. Vader: A parsimonious rule-based model for sentiment analysis of social media text. In
Eighth International AAAI Conference on Weblogs and Social Media (ICWSM).
Efthymios Kouloumpis, Theresa Wilson, and Johanna Moore. 2011. Twitter Sentiment Analysis: The Good the Bad and the
OMG!. In Int’l AAAI Conference on Weblogs and Social Media (ICWSM).
Adam D I Kramer, Jamie E Guillory, and Jeffrey T Hancock. 2014. Experimental evidence of massive-scale emotional
contagion through social networks. Proceedings of the National Academy of Sciences of the United States of America
111, 24 (June 2014), 8788–90. DOI:http://dx.doi.org/10.1073/pnas.1320040111
Clement Levallois. 2013. Umigon: sentiment analysis for tweets based on terms lists and heuristics. In Second Joint Confer-
ence on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop
on Semantic Evaluation (SemEval 2013). Association for Computational Linguistics, Atlanta, Georgia, USA, 414–417.
http://www.aclweb.org/anthology/S13-2068
Lexalytics. 2015. Sentiment Extraction - Measuring the Emotional Tone of Content. Technical Report. Lexalytics.
Bing Liu. 2012. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies 5, 1 (May
2012), 1–167. DOI:http://dx.doi.org/10.2200/s00416ed1v01y201204hlt016
George A. Miller. 1995. WordNet: a lexical database for English. Commun. ACM 38, 11 (1995), 39–41.
Saif Mohammad. 2012. #Emotional Tweets. In *SEM 2012: The First Joint Conference on Lexical and Computational
Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth
International Workshop on Semantic Evaluation (SemEval 2012). Association for Computational Linguistics, Montréal, Canada, 246–255. http://www.aclweb.org/anthology/S12-1033
Saif Mohammad, Cody Dunne, and Bonnie Dorr. 2009. Generating High-coverage Semantic Orientation Lexicons from
Overtly Marked Words and a Thesaurus. In Proceedings of the 2009 Conference on Empirical Methods in Natural
Language Processing: Volume 2 - Volume 2 (EMNLP ’09). Association for Computational Linguistics, Stroudsburg,
PA, USA, 599–608. http://dl.acm.org/citation.cfm?id=1699571.1699591
Saif Mohammad and Peter D. Turney. 2013. Crowdsourcing a Word-Emotion Association Lexicon. Computational Intelli-
gence 29, 3 (2013), 436–465.
Saif M. Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu. 2013. NRC-Canada: Building the State-of-the-Art in Sentiment
Analysis of Tweets. In Proceedings of the seventh international workshop on Semantic Evaluation Exercises (SemEval-
2013). Atlanta, Georgia, USA.
Preslav Nakov, Zornitsa Kozareva, Alan Ritter, Sara Rosenthal, Veselin Stoyanov, and Theresa Wilson. 2013. SemEval-2013
Task 2: Sentiment Analysis in Twitter. (2013).
Sascha Narr, Michael Hülfenhaus, and Sahin Albayrak. 2012. Language-independent Twitter sentiment analysis. Knowledge
Discovery and Machine Learning (KDML) (2012), 12–14.
Finn Årup Nielsen. 2011. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. arXiv preprint
arXiv:1103.2903 (2011).
Nuno Oliveira, Paulo Cortez, and Nelson Areal. 2013. On the Predictability of Stock Market Behavior Using StockTwits
Sentiment and Posting Volume. In EPIA (Lecture Notes in Computer Science), Luís Correia, Luís Paulo Reis, and José
Cascalho (Eds.), Vol. 8154. Springer, 355–365.
Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on
minimum cuts. In Proceedings of the ACL. 271–278.
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up?: sentiment classification using machine learning
techniques. In ACL Conference on Empirical Methods in Natural Language Processing. 79–86.
Nikolaos Pappas and Andrei Popescu-Belis. 2013. Sentiment analysis of user comments for one-class collaborative filtering
over TED talks. In Proceedings of the 36th international ACM SIGIR conference on Research and development in
information retrieval. ACM, 773–776.
R. Plutchik. 1980. A general psychoevolutionary theory of emotion. Academic press, New York, 3–33.
Julio Reis, Fabricio Benevenuto, Pedro Vaz de Melo, Raquel Prates, Haewoon Kwak, and Jisun An. 2015. Breaking the News:
First Impressions Matter on Online News. In Proceedings of the 9th International AAAI Conference on Weblogs and
Social Media (ICWSM).
Julio Reis, Pollyanna Goncalves, Pedro Vaz de Melo, Raquel Prates, and Fabricio Benevenuto. 2014. Magnet News: You
Choose the Polarity of What you Read. In International AAAI Conference on Weblogs and Social Media.
Niek Sanders. 2011. Twitter Sentiment Corpus by Niek Sanders. (2011). http://www.sananalytics.com/lab/twitter-sentiment/.
Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and Fast—but is It Good?: Evaluating
Non-expert Annotations for Natural Language Tasks. In Proceedings of the Conference on Empirical Methods in Natu-
ral Language Processing (EMNLP ’08).
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts.
2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In 2013 Conference on
Empirical Methods in Natural Language Processing. 1631–1642.
Philip J. Stone, Dexter C. Dunphy, Marshall S. Smith, and Daniel M. Ogilvie. 1966. The General Inquirer: A Computer
Approach to Content Analysis. MIT Press. http://www.webuse.umd.edu:9090/
Carlo Strapparava and Rada Mihalcea. 2007. SemEval-2007 Task 14: Affective Text. In Proceedings of the 4th International
Workshop on Semantic Evaluations (SemEval ’07). Association for Computational Linguistics, Stroudsburg, PA, USA,
70–74. http://dl.acm.org/citation.cfm?id=1621474.1621487
Maite Taboada, Caroline Anthony, and Kimberly Voll. 2006a. Methods for Creating Semantic Orientation Dictionaries. In
Conference on Language Resources and Evaluation (LREC). 427–432.
Maite Taboada, Caroline Anthony, and Kimberly Voll. 2006b. Methods for Creating Semantic Orientation Dictionaries. In
Conference on Language Resources and Evaluation (LREC). 427–432.
Maite Taboada, Julian Brooke, Milan Tofiloski, Kimberly Voll, and Manfred Stede. 2011. Lexicon-based Methods for Sen-
timent Analysis. Comput. Linguist. 37, 2 (June 2011), 267–307. DOI:http://dx.doi.org/10.1162/COLI_a_00049
Acar Tamersoy, Munmun De Choudhury, and Duen Horng Chau. 2015. Characterizing Smoking and Drinking Abstinence
from Social Media. In Proceedings of the 26th ACM Conference on Hypertext and Social Media (HT).
Yla R. Tausczik and James W. Pennebaker. 2010. The Psychological Meaning of Words: LIWC and Computerized Text
Analysis Methods. Journal of Language and Social Psychology 29, 1 (2010), 24–54.
Mike Thelwall. 2013. Heart and soul: Sentiment strength detection in the social web with SentiStrength. (2013). http://sentistrength.wlv.ac.uk/documentation/SentiStrengthChapter.pdf.
Mikalai Tsytsarau and Themis Palpanas. 2012. Survey on Mining Subjective Data on the Web. Data Min. Knowl. Discov. 24,
3 (May 2012), 478–514. DOI:http://dx.doi.org/10.1007/s10618-011-0238-6
Andranik Tumasjan, Timm O. Sprenger, Philipp G. Sandner, and Isabell M. Welpe. 2010. Predicting Elections with Twitter:
What 140 Characters Reveal about Political Sentiment. In International AAAI Conference on Weblogs and Social Media
(ICWSM).
Alessandro Valitutti. 2004. WordNet-Affect: an Affective Extension of WordNet. In Proceedings of the 4th International Confer-
ence on Language Resources and Evaluation. 1083–1086.
Hao Wang, Dogan Can, Abe Kazemzadeh, François Bar, and Shrikanth Narayanan. 2012. A system for real-time Twitter
sentiment analysis of 2012 U.S. presidential election cycle. In ACL System Demonstrations. 115–120.
D. Watson and L. Clark. 1985. Development and validation of brief measures of positive and negative affect: the PANAS
scales. Journal of Personality and Social Psychology 54, 1 (1985), 1063–1070.
Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating Expressions of Opinions and Emotions in Language.
Language Resources and Evaluation 1, 2 (2005), 0. http://www.cs.pitt.edu/\
Theresa Wilson, Paul Hoffmann, Swapna Somasundaran, Jason Kessler, Janyce Wiebe, Yejin Choi, Claire Cardie, Ellen
Riloff, and Siddharth Patwardhan. 2005a. OpinionFinder: a system for subjectivity analysis. In HLT/EMNLP on Inter-
active Demonstrations. 34–35.
Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005b. Recognizing Contextual Polarity in Phrase-Level Sentiment
Analysis. In ACL Conference on Empirical Methods in Natural Language Processing. 347–354.
David H. Wolpert and William G. Macready. 1997. No free lunch theorems for optimization. IEEE Transactions on Evolu-
tionary Computation 1, 1 (1997), 67–82.