TASS: Detecting Sentiments in Spanish Tweets

Xabier Saralegi and Iñaki San Vicente
Elhuyar Fundazioa
Knowledge discovery is useful for decision making and market analysis.
The explosion of Web 2.0 is a very rich source of user-generated information.
Social media such as Twitter are a very valuable source for seeking opinions.
TASS: opinion mining or sentiment analysis over Spanish tweets.
State of the Art
Main approaches:
Knowledge/Lexicon/Rules based approach (Turney,
2002; Kim and Hovy, 2004).
Supervised approach (Pang et al., 2002).
Dealing with tweets:
POS and lemmas (Barbosa and Feng, 2010).
Emoticons (O'Connor et al., 2010).
Discourse (Somasundaran et al., 2009).
Follower graph (Speriosu et al., 2011).
Approaches for tweets:
Supervised combined with lexicons (Barbosa and Feng,
2010; Kouloumpis, Wilson, and Moore, 2011).
Semi-supervised (label propagation) combined with
lexicons (Sindhwani and Melville, 2008).
Training Data
Training data C_t consists of 7,219 tweets:
Polarity   # of tweets   % of tweets
P+         1,764         24.44%
P          1,019         14.12%
NEU        610           8.45%
N          1,221         16.91%
N+         903           12.51%
NONE       1,702         23.58%
Total      7,219         100%
Polarity Lexicon
A new polarity lexicon for Spanish, P_es, was created from
two different sources:
a) An existing English polarity lexicon, P_en (Projection).
b) The training corpus C_t (Extraction).
Projection::Polarity Lexicon
An English polarity lexicon P_en (Wilson et al., 2005) was
automatically translated into Spanish:
Translation by means of an English-Spanish dictionary, D_en→es.
Ambiguous translations were resolved manually:
Polarity was also revised.
Dictionary   # of headwords   # of pairs   Avg. # of translations
D_en→es      15,134           31,884       2.11
Translated dictionary:
Polarity   English words in P_en   Words translated by D_en→es   […]     […]
N          4,144                   2,416                         3,480   2,164
P          2,304                   2,057                         2,271   1,180
Total      6,878                   4,473                         5,751   3,344
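The projection step can be sketched roughly as follows. This is only an illustration under stated assumptions: the English lexicon is taken to be a word-to-polarity mapping, the bilingual dictionary a headword-to-translations mapping, and the function name `project_lexicon` and the toy entries are hypothetical rather than the authors' implementation. Headwords with several translations are set aside for manual resolution, as described above.

```python
def project_lexicon(english_lexicon, bilingual_dict):
    """Project an English polarity lexicon into Spanish via a bilingual dictionary.

    english_lexicon: dict mapping an English word to "P" or "N".
    bilingual_dict:  dict mapping an English headword to a list of translations.
    Returns (projected, ambiguous): unambiguous Spanish entries, and the
    English words whose translations need manual revision.
    """
    projected, ambiguous = {}, {}
    for word, polarity in english_lexicon.items():
        translations = bilingual_dict.get(word, [])
        if len(translations) == 1:
            # A single translation inherits the English polarity directly.
            projected[translations[0]] = polarity
        elif translations:
            # Several candidates: keep them for manual resolution.
            ambiguous[word] = (polarity, translations)
        # Words missing from the dictionary are simply not projected.
    return projected, ambiguous


# Toy data (hypothetical entries, for illustration only).
p_en = {"good": "P", "bad": "N", "awful": "N", "untranslated": "P"}
d_en_es = {"good": ["bueno"], "bad": ["malo", "mal"], "awful": ["horrible"]}
projected, ambiguous = project_lexicon(p_en, d_en_es)
```

Keeping the ambiguous entries separate mirrors the manual revision stage: only the reviewed translations would then be added to the Spanish lexicon.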
Extraction::Polarity Lexicon
Polarity words automatically extracted from the
training corpus C_t:
Extraction of the words most associated with a certain
polarity by using the log-likelihood ratio (LLR).
Top 1,000 negative and top 1,000 positive words
manually checked:
338 negative and 271 positive words selected.
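The LLR ranking described above can be sketched as follows. The slides do not give the exact formulation, so this assumes Dunning's log-likelihood ratio over a 2x2 contingency table (word vs. other words, target-polarity tweets vs. remaining tweets); the function names and the toy tweets are hypothetical.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: occurrences of the word in the target corpus
    k12: occurrences of the word in the other corpus
    k21: occurrences of other words in the target corpus
    k22: occurrences of other words in the other corpus
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def rank_polarity_candidates(target_docs, other_docs, top_n=1000):
    """Rank words by their association with the target corpus."""
    target = Counter(w for doc in target_docs for w in doc)
    other = Counter(w for doc in other_docs for w in doc)
    n_target, n_other = sum(target.values()), sum(other.values())
    scored = []
    for word, k11 in target.items():
        k12 = other.get(word, 0)
        rel_other = k12 / n_other if n_other else 0.0
        # Keep only words that are relatively more frequent in the target corpus.
        if k11 / n_target <= rel_other:
            continue
        scored.append((llr(k11, k12, n_target - k11, n_other - k12), word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_n]]


# Toy lemmatized tweets (hypothetical).
positive = [["bueno", "genial"], ["bueno", "feliz"], ["genial", "dia"]]
negative = [["malo", "dia"], ["malo", "triste"], ["dia", "horrible"]]
positive_candidates = rank_polarity_candidates(positive, negative)
```

The ranked candidates would then go through the manual check described above before entering the lexicon.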
Polarity Lexicon
Merging the projection-based and extraction-based dictionaries:
Polarity   Projection-based lexicon   Extraction-based lexicon   Final lexicon
N          2,164                      338                        2,435
P          1,180                      271                        1,518
Total      3,344                      609                        3,953
Supervised system
SMO implementation of the Support Vector
Machine algorithm (Weka).
All the classifiers were built over the training data.
All the classifiers were evaluated by 10-fold cross-validation.
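As a reminder of what 10-fold cross-validation does with the training data, the following sketch partitions example indices into ten folds and yields a train/test split for each. It does not reproduce Weka's procedure: the interleaved fold assignment (no shuffling, no stratification) and the name `k_fold_indices` are assumptions made for illustration.

```python
def k_fold_indices(n_examples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Interleaved assignment: example i goes to fold i % k.
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for held_out, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train_idx, test_idx


# Each example is used for testing exactly once and for training k - 1 times.
splits = list(k_fold_indices(20, k=10))
```

The reported figure is then the average of the scores obtained on the ten held-out folds.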
Pre-processing: heuristics for dealing with:
Replication of characters (e.g., “Sueñooo”):
Removed according to Freeling's dictionary.
Abbreviations (e.g., “q”, “dl”, …):
Expanded by using a list of equivalents.
Overuse of upper case (e.g., “MIRA QUE BUENO”):
If a sequence of two common words is in upper case, it is changed to lower case.
Normalization of URLs:
The complete URL is replaced by “URL”.
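The four heuristics above can be sketched as a small normalization function. This is a minimal sketch under assumptions: `DICTIONARY` and `ABBREVIATIONS` are tiny hypothetical stand-ins for Freeling's dictionary and the equivalents list, and the exact matching rules (for example, trying to collapse repeated characters to two and then to one) are guesses, not the authors' rules.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
REPEAT_RE = re.compile(r"(.)\1+")  # any run of a repeated character

# Hypothetical stand-ins for Freeling's dictionary and the equivalents list.
DICTIONARY = {"sueño", "mira", "que", "bueno", "llega"}
ABBREVIATIONS = {"q": "que", "dl": "del"}


def fix_repetitions(token):
    """Collapse repeated characters until a dictionary word is found."""
    low = token.lower()
    if low in DICTIONARY:
        return token
    for keep in (2, 1):  # try "oo" before "o"
        candidate = REPEAT_RE.sub(lambda m: m.group(1) * keep, low)
        if candidate in DICTIONARY:
            return candidate
    return token  # no dictionary match: leave the token untouched


def lowercase_shouting(tokens):
    """Lower-case pairs of adjacent upper-case dictionary words."""
    out = list(tokens)
    for i in range(len(tokens) - 1):
        a, b = tokens[i], tokens[i + 1]
        if (a.isupper() and b.isupper()
                and a.lower() in DICTIONARY and b.lower() in DICTIONARY):
            out[i], out[i + 1] = a.lower(), b.lower()
    return out


def normalize(tweet):
    """Apply URL, abbreviation, repetition and upper-case heuristics."""
    tokens = URL_RE.sub("URL", tweet).split()
    tokens = [ABBREVIATIONS.get(t.lower(), t) for t in tokens]
    tokens = [fix_repetitions(t) for t in tokens]
    return " ".join(lowercase_shouting(tokens))


print(normalize("MIRA QUE BUENO http://t.co/abc"))
print(normalize("Sueñooo q llega"))
```

Checking candidates against a dictionary keeps legitimate double letters (such as the "ll" in "llega") from being collapsed.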
Baseline::Supervised System
Unigram representation using all lemmas (Freeling) as
features (15,069).
Frequency of the lemmas as values.
System     Acc. (6 cat.)   P+      P       NEU     N       N+      NONE
Baseline   0.45            0.574   0.267   0.137   0.368   0.385   0.578
Selection of Polarity Words::Supervised System
Only lemmas included in the polarity lexicon P_es:
More precise features and lower computational cost (from
15,069 to 3,730 features).
System     Acc. (6 cat.)   P+      P       NEU     N       N+      NONE
Baseline   0.45            0.574   0.267   0.137   0.368   0.385   0.578
SP         0.484           0.594   0.254   0.098   0.397   0.422   0.598
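The difference between the baseline representation (all lemmas) and the SP representation (only lemmas found in the polarity lexicon) can be illustrated with a frequency-based feature extractor. This is a sketch under assumptions: the lemmas are taken as already produced by a lemmatizer such as Freeling, and `lemma_features` and the toy lexicon are hypothetical.

```python
from collections import Counter


def lemma_features(lemmas, vocabulary=None):
    """Map a lemmatized tweet to {lemma: frequency}.

    With vocabulary=None every lemma is a feature (baseline);
    otherwise only lemmas in the vocabulary are kept (SP).
    """
    counts = Counter(lemmas)
    if vocabulary is not None:
        counts = Counter({l: c for l, c in counts.items() if l in vocabulary})
    return dict(counts)


tweet = ["ser", "bueno", "muy", "bueno"]
polarity_lexicon = {"bueno", "malo"}  # toy stand-in for the lexicon

baseline_features = lemma_features(tweet)
sp_features = lemma_features(tweet, polarity_lexicon)
```

Restricting the vocabulary shrinks the feature space, which is where the reduction in the number of features reported above comes from.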
Emoticons and Interjections::Supervised System
Two new features: # of positive emoticons, # of negative emoticons:
A list of 23 positive and 34 negative emoticons.
Two new features: # of positive interjections, # of negative interjections:
A list of 28 positive and 54 negative interjections.
System     Acc. (6 cat.)   P+      P       NEU     N       N+      NONE
Baseline   0.45            0.574   0.267   0.137   0.368   0.385   0.578
SP         0.484           0.594   0.254   0.098   0.397   0.422   0.598
SP+EM      0.49            0.612   0.253   0.097   0.402   0.428   0.6
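The four count features introduced above could be computed along these lines. The emoticon and interjection sets here are short hypothetical samples, not the actual lists of 23/34 emoticons and 28/54 interjections, and whitespace tokenization is assumed.

```python
# Hypothetical sample lists; the real lists are larger.
POSITIVE_EMOTICONS = {":)", ":-)", ":D", ";)"}
NEGATIVE_EMOTICONS = {":(", ":-(", ":'("}
POSITIVE_INTERJECTIONS = {"jaja", "bravo", "ole"}
NEGATIVE_INTERJECTIONS = {"uf", "puaj", "ay"}


def count_features(tokens):
    """Count positive/negative emoticons and interjections in a tweet."""
    lowered = [t.lower() for t in tokens]
    return {
        # Emoticons are matched case-sensitively (":D" differs from ":d").
        "pos_emoticons": sum(t in POSITIVE_EMOTICONS for t in tokens),
        "neg_emoticons": sum(t in NEGATIVE_EMOTICONS for t in tokens),
        # Interjections are matched on the lower-cased token.
        "pos_interjections": sum(t in POSITIVE_INTERJECTIONS for t in lowered),
        "neg_interjections": sum(t in NEGATIVE_INTERJECTIONS for t in lowered),
    }


features = count_features("Bravo :) :) que dia :(".split())
```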
POS Information::Supervised System
POS tags as features.
Useful for the subjective/objective distinction.
System     Acc. (6 cat.)   P+      P       NEU     N       N+      NONE
Baseline   0.45            0.574   0.267   0.137   0.368   0.385   0.578
SP         0.484           0.594   0.254   0.098   0.397   0.422   0.598
SP+EM      0.49            0.612   0.253   0.097   0.402   0.428   0.6
SP+POS     0.496           0.596   0.245   0.093   0.414   0.438   0.634
Frequency of Polarity Words::Supervised System
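Two new features: a score of the positivity and a score of
the negativity of a tweet t:
$\mathrm{score}_{pos}(t) = \sum_{w_i \in t} \mathrm{positive}(P_{es}, w_i)$
$\mathrm{score}_{neg}(t) = \sum_{w_i \in t} \mathrm{negative}(P_{es}, w_i)$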
Treatment of negations and adverbs:
Reverse the polarity of a word if it falls within a negation.
Increase (e.g., “mucho”, “absolutamente”) or decrease (e.g.,
“poco”) the polarity of a word depending on the adverb.
Weight the polarity of words depending on the syntactic nesting level:
The importance of each word w is given by its relative syntactic nesting
level, 1/ln(w):
$\mathrm{score}_{pos}(t) = \sum_{w_i \in t} \mathrm{positive}(P_{es}, w_i) \cdot \frac{1}{\mathit{ln}(w_i)}$
$\mathrm{score}_{neg}(t) = \sum_{w_i \in t} \mathrm{negative}(P_{es}, w_i) \cdot \frac{1}{\mathit{ln}(w_i)}$
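The scoring scheme above can be sketched as follows. Several details are assumptions made for illustration only: ln(w) is read as the syntactic nesting level of w (an integer starting at 1, supplied by a parser), a negation is taken to affect polarity words within the next three tokens, an adverb modifies the word that immediately follows it, and the adverb weights are invented values. None of these specifics are given in the slides.

```python
NEGATORS = {"no", "nunca", "jamás"}
# Hypothetical multipliers; the slides only say adverbs increase or decrease polarity.
ADVERB_WEIGHTS = {"mucho": 1.5, "absolutamente": 2.0, "poco": 0.5}


def polarity_scores(tokens, lexicon, negation_window=3):
    """Return (positive_score, negative_score) for one tweet.

    tokens:  list of (lemma, nesting_level) pairs, nesting_level >= 1.
    lexicon: dict mapping a lemma to "P" or "N".
    """
    lemmas = [lemma for lemma, _ in tokens]
    pos = neg = 0.0
    for i, (lemma, level) in enumerate(tokens):
        polarity = lexicon.get(lemma)
        if polarity is None:
            continue
        # Words nested deeper in the syntactic tree contribute less.
        weight = 1.0 / level
        # An adverb right before the word strengthens or weakens it.
        if i > 0 and lemmas[i - 1] in ADVERB_WEIGHTS:
            weight *= ADVERB_WEIGHTS[lemmas[i - 1]]
        # A negator shortly before the word reverses its polarity.
        if any(l in NEGATORS for l in lemmas[max(0, i - negation_window):i]):
            polarity = "N" if polarity == "P" else "P"
        if polarity == "P":
            pos += weight
        else:
            neg += weight
    return pos, neg


lexicon = {"genial": "P", "bueno": "P", "malo": "N"}
print(polarity_scores([("película", 1), ("absolutamente", 2), ("genial", 2)], lexicon))
print(polarity_scores([("no", 1), ("ser", 1), ("bueno", 2)], lexicon))
```

A fixed token window is a crude approximation of negation scope; a syntactic parse would allow the scope to follow the clause structure instead.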
System     Acc. (6 cat.)   P+      P       NEU     N       N+      NONE
Baseline   0.45            0.574   0.267   0.137   0.368   0.385   0.578
SP         0.484           0.594   0.254   0.098   0.397   0.422   0.598
SP+EM      0.49            0.612   0.253   0.097   0.402   0.428   0.6
SP+POS     0.496           0.596   0.245   0.093   0.414   0.438   0.634
SP+FP      0.514           0.633   0.261   0.115   0.455   0.438   0.613
All features combined::Supervised System
System     Acc. (6 cat.)   P+      P       NEU     N       N+      NONE
Baseline   0.45            0.574   0.267   0.137   0.368   0.385   0.578
SP         0.484           0.594   0.254   0.098   0.397   0.422   0.598
SP+EM      0.49            0.612   0.253   0.097   0.402   0.428   0.6
SP+POS     0.496           0.596   0.245   0.093   0.414   0.438   0.634
SP+FP      0.514           0.633   0.261   0.115   0.455   0.438   0.613
All        0.523           0.648   0.246   0.111   0.463   0.452   0.657
Using Additional Corpora::Supervised System
Additional training data C_tw was retrieved using the attitude
feature of the Twitter search:
Search is based on emoticons, as in Go et al. (2009).
Retrieved tweets were classified according to their attitude (P
or N):
Corpus   P        N       Total
C_tw     11,363   9,865   21,228
The compiled corpus was used in two ways:
A) Finding new polarity words for the polarity lexicon P_es (AC1).
B) Adding C_tw to the training data (AC2).
A) Extraction of polarity words from C_tw (AC1):
Same methodology as used for building P_es:
LLR for extracting positive and negative candidates.
First 500 positive and first 500 negative candidates manually
revised (110 positive and 95 negative selected).
System    Acc. (6 cat.)   P+      P       NEU     N       N+      NONE
All       0.523           0.648   0.246   0.111   0.463   0.452   0.657
All+AC1   0.523           0.647   0.248   0.116   0.46    0.451   0.655
B) Adding examples from C_tw to the training data (AC2):
Original training data C_t divided into two parts:
C_t-test (15%) and C_t-train (85%).
Adding examples from C_tw to C_t-train:
All of the examples for training (All+AC2).
Only examples containing OOV words, $\{ w \in P_{es} : \mathrm{freq}(w, C_{t\text{-}train}) = 0 \}$ (All+AC2/OOV):
System        # of training examples   […]
All           6,137                    0.573
All+AC2       27,365                   0.507
All+AC2/OOV   7,807                    0.569
Evaluation & Results
Test Data
Test data Ct consists of 60,798 tweets:
Polarity   # of tweets   % of tweets
P+         20,745        34.12%
P          1,488         2.45%
NEU        1,305         2.15%
N          11,287        18.56%
N+         4,557         7.5%
NONE       21,416        35.22%
Total      60,798        100%
AC1 provides improvement.
Best performance over P+ and NONE.
Worst performance over NEU and P.
Better results than those achieved over the training data:
The best system (All+AC1): 0.653 vs. 0.523 accuracy (6 cat.).
System                    Acc. (4 cat.)   Acc. (6 cat.)   P+      P       NEU     N       N+      NONE
Baseline                  0.616           0.527           0.638   0.214   0.139   0.483   0.471   0.587
All                       0.702           0.641           0.752   0.323   0.166   0.563   0.564   0.683
All+AC1 (submitted run)   0.711           0.653           0.753   0.32    0.167   0.566   0.566   0.685
Results over the training data:
System     Acc. (6 cat.)   P+      P       NEU     N       N+      NONE
Baseline   0.45            0.574   0.267   0.137   0.368   0.385   0.578
All        0.523           0.648   0.246   0.111   0.463   0.452   0.657
All+AC1    0.523           0.647   0.248   0.116   0.46    0.451   0.655
The distribution difference between training and test data:
[Figure: class distribution of the training and test data, percentage axis from 0% to 40%]
Conclusions
Our system effectively combines several linguistically based features:
lemmas, POS tags, polarity words...
Good contribution of the semi-automatically built polarity dictionary.
Robust performance of the system.
The goal of sentiment prediction is to automatically identify whether a given piece of text expresses positive or negative opinion towards a topic of interest. One can pose sentiment prediction as a standard text categorization problem. However, gathering labeled data turns out to be a bottleneck in the process of building high quality text classifiers. Fortunately, background knowledge is often available in the form of prior information about the sentiment polarity of words in a lexicon. Moreover, in many applications abundant unlabeled data is also available. In this paper, we propose a novel semi-supervised sentiment prediction algorithm that utilizes lexical prior knowledge in conjunction with unlabeled examples. Our method is based on joint sentiment analysis of documents and words based on a bipartite graph representation of the data. We present an empirical study on a diverse collection of sentiment prediction problems which confirms that our semi-supervised lexical models significantly outperform purely supervised and competing semi-supervised techniques.