S CHANDRAKALA AND C SINDHU: OPINION MINING AND SENTIMENT CLASSIFICATION: A SURVEY
DOI: 10.21917/ijsc.2012.0065
OPINION MINING AND SENTIMENT CLASSIFICATION: A SURVEY
S. ChandraKala1 and C. Sindhu2
Department of Computer Science and Engineering, Velammal Engineering College, India
E-mail: 1sckala@gmail.com and 2csindhucse@gmail.com
Abstract
With the evolution of web technology, there is a huge amount of data
present in the web for the internet users. These users not only use the
available resources in the web, but also give their feedback, thus
generating additional useful information. Due to overwhelming
amount of user’s opinions, views, feedback and suggestions available
through the web resources, it’s very much essential to explore,
analyze and organize their views for better decision making. Opinion
Mining or Sentiment Analysis is a Natural Language Processing and
Information Extraction task that identifies the user’s views or
opinions explained in the form of positive, negative or neutral
comments and quotes underlying the text. Text categorization
generally classifies the documents by topic. This survey gives an
overview of the efficient techniques, recent advancements and the
future research directions in the field of Sentiment Analysis.
Keywords:
Sentiment Analysis, Opinion Mining, Information Extraction
1. INTRODUCTION
Opinion Mining or Sentiment analysis involves building a
system to explore user’s opinions made in blog posts, comments,
reviews or tweets, about the product, policy or a topic. It aims to
determine the attitude of a user about some topic. In recent
years, the exponential increase in the Internet usage and
exchange of user’s opinion is the motivation for Opinion
Mining. The Web is a huge repository of structured and
unstructured data. The analysis of this data to extract underlying
user’s opinion and sentiment is a challenging task. An opinion
can be described as a quadruple consisting of a Topic, Holder,
Claim and Sentiment [56]. Here the Holder believes a Claim
about the Topic and expresses it through an associated
Sentiment. To a machine, an opinion is a “quintuple” (O_j, f_jk, SO_ijkl, h_i, t_l) [Bing Liu, NLP Handbook], where O_j is the object on which the opinion is expressed, f_jk is a feature of O_j, SO_ijkl is the sentiment value of the opinion, h_i is the opinion holder, and t_l is the time at which the opinion is given.
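The quintuple above can be sketched as a small data structure. This is a hypothetical Python illustration; the class and field names are ours, not from the survey or from Liu's handbook.

```python
# Minimal sketch of Bing Liu's opinion quintuple as a record type.
from dataclasses import dataclass

@dataclass
class Opinion:
    obj: str        # O_j: the object the opinion is about
    feature: str    # f_jk: the feature of O_j being evaluated
    sentiment: str  # SO_ijkl: "positive", "negative" or "neutral"
    holder: str     # h_i: the opinion holder
    time: str       # t_l: when the opinion was expressed

op = Opinion("Supernova 7200", "battery life", "positive", "user123", "2012-10-01")
print(op.sentiment)  # positive
```

Keeping the five components separate makes it possible to aggregate opinions per object, per feature, per holder, or over time.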
There are several challenges in the field of sentiment
analysis. The most common challenges are given here. Firstly,
Word Sense Disambiguation (WSD), a classical NLP problem is
often encountered. For example, “an unpredictable plot in the
movie” is a positive phrase, while “an unpredictable steering
wheel” is a negative one. The opinion word unpredictable is
used in different senses. Secondly, addressing the problem of
sudden deviation from positive to negative polarity, as in “The
movie has a great cast, superb storyline and spectacular
photography; the director has managed to make a mess of the
whole thing”. Thirdly, negations, unless handled properly, can completely mislead: “Not only do I not approve Supernova 7200, but also hesitate to call it a phone” has the positive polarity word approve, but its effect is negated by multiple negations. Fourthly, keeping the target in focus can be a challenge, as in “my camera compares nothing to Jack’s camera which is sleek and light, produces life like pictures and is inexpensive”. All the positive words about Jack’s camera, being the constituents of the document vector, will produce an overall decision of positive polarity, which is wrong [7].
Table.1. Recent Papers on Sentiment Analysis and its related tasks

Topic                                      Paper   Year
Sentiment Analysis                         [66]    2012
                                           [70]    2012
                                           [73]    2011
Subjectivity Analysis                      [47]    2011
Sentiment Detection                        [48]    2009
Feature Selection for Opinion Mining       [49]    2011
                                           [62]    2008
                                           [63]    2011
Review Aggregation                         [61]    2008
Supervised Machine Learning
  Approaches for Opinion Mining            [64]    2009
Sentiment Classification                   [76]    2012
                                           [77]    2012
Active Learning for Opinion Mining         [65]    2012
The importance and popularity of Sentiment Analysis have led to several papers that describe and implement its variety of tasks using several different techniques; some of them are listed in Table.1, together with their topics and years of publication. In [73], methods to automatically generate a new sentiment lexicon, called SentiFul, are described. A subjective-objective sentence classifier that does not require annotated data as input
is built [47]. Such a classifier may then be used to improve
information extraction performance. Subjectivity classification
can prevent the sentiment classifier from considering irrelevant
or even potentially misleading text. Document sentiment
classification and opinion extraction have often involved word
sentiment classification techniques [48]. The Entropy Weighted
Genetic Algorithm (EWGA) is a hybridized genetic algorithm
that incorporates the information gain heuristic for feature
selection [62]. SVM and N-gram approaches outperformed the
Naïve Bayes approach when sentiment classification was applied
to the reviews on travel blogs for seven popular travel
destinations in the US and Europe [64].
ISSN: 2229-6956(ONLINE) ICTACT JOURNAL ON SOFT COMPUTING, OCTOBER 2012, VOLUME: 03, ISSUE: 01
Fig.1 depicts the major important steps in order to achieve an
opinion impact. The web users post their views, comments and
feedback about a particular product or a thing through various
blogs, forums and social networking sites. Data is collected from such opinion sources in such a way that only the reviews related to the searched topic are selected. The input document is then preprocessed. Preprocessing, in this context, is the removal of fact-based sentences, so that only the opinionated sentences are retained. Further refinements are made by handling negations and by resolving word sense ambiguity. Then, the
process of extracting relevant features is done. Feature selection
can potentially improve classification accuracy [58], narrow in
on a key feature subset of sentiment discriminators, and provide
greater insight into important class attributes. The extracted
features contribute to a document vector upon which various
machine learning techniques can be applied in order to classify
the polarity (positive and negative opinions) using the obtained
document vector and finally the opinion impact is obtained
based on the sentiment of the web users.
Fig.1. Systematic work flow of Sentiment Analysis
The organization of the paper includes Sentiment Analysis in
section 2 under which subjectivity detection, negation and
feature based sentiment classification are briefed in the sections
2.1 to 2.3. Feature extraction and Feature Reduction are
explained under 2.3.1 and 2.3.2. The Sentiment Classification is
explained under the section 3, under which the polarity and
intensity assignment is discussed under 3.1 and 3.2 respectively.
The Machine Learning Approaches are discussed in the section
4, under which the Naïve Bayes Classification, Maximum
Entropy and Support Vector Machines are briefed. Finally the
applications and future challenges of Opinion Mining and
Sentiment Classification are elaborated under the sections 5 and
6 respectively.
2. SENTIMENT ANALYSIS
Ordinary keyword search will not be suitable for mining all
kinds of opinions. Hence it becomes necessary that sophisticated opinion extraction methods are used. Sentiment analysis is a natural language processing technique that helps to identify and extract subjective information in source materials.
Sentiment analysis aims to determine the attitude of the writer
with respect to some topic or the overall contextual polarity of a
document. The attitude may be his or her judgment or
evaluation, affective state, or the intended emotional
communication. A basic task in sentiment analysis is classifying
the polarity of a given text at the document, sentence, or
feature/aspect level whether the expressed opinion in a
document, a sentence or an entity feature/aspect is positive,
negative, or neutral. Beyond polarity sentiment classification,
the emotional states such as “angry”, “sad” and “happy” are also
identified.
One of the challenges of Sentiment Analysis is to define the
opinions and subjectivity of the study [7]. Originally,
subjectivity was defined by linguists, most prominently,
Randolph Quirk (R. Quirk and Svartvik, 1985). Quirk defines
private state as something that is not open to objective
observation or verification. These private states include
emotions, opinions, and speculations; among others. The very
definition of a private state foreshadows difficulties in analyzing
sentiment. Subjectivity is highly context-sensitive, and its
expression is often peculiar to each person. Subjectivity
Detection and Negation are the most important preprocessing
steps in order to achieve efficient opinion impact. They are
discussed in the following sections.
2.1 SUBJECTIVITY DETECTION
Subjectivity detection can be defined as the process of selecting opinion-containing sentences [7]. For example: “India’s economy is heavily dependent on tourism and IT industry. It is an excellent place to live in.” The first sentence is a factual one and does not convey any sentiment towards India. Hence it should not play any role in deciding the polarity of the review, and should be filtered out. Hence, the Polarity Classifier
assumes that the incoming documents are opinionated. Joint
Topic-Sentiment Analysis is done by collecting only on-topic
documents (e.g., by executing the topic-based query using a
standard search engine). In Information extraction, both topic-
based text filtering and subjectivity filtering are complementary
as in [8]. If a document contains information on a variety of
topics that may attract the attention of the user, then it will be
useful to classify the topics and its related opinions. This type of
analysis can be useful for comparative search analysis of related
items and also to discuss on the texts that contains various
features and attributes.
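A subjectivity filter of the kind described above can be sketched with a tiny hand-made lexicon of subjective cues. This is an illustrative assumption on our part; real systems learn such cues from corpora (e.g., the extraction patterns of [8] or the classifier of [47]).

```python
# Toy subjectivity filter: a sentence is kept only if it contains a cue
# from a small, hand-made subjectivity lexicon (illustrative words only).
SUBJECTIVE_CUES = {"excellent", "terrible", "love", "hate", "amazing", "awful"}

def is_opinionated(sentence: str) -> bool:
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    return bool(words & SUBJECTIVE_CUES)

review = [
    "India's economy is heavily dependent on tourism and IT industry.",
    "It is an excellent place to live in.",
]
opinionated = [s for s in review if is_opinionated(s)]
print(opinionated)  # only the second, opinionated sentence survives
```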
The political orientation of websites can be determined by classifying the concatenation of all the documents found on a particular site, as in [9]. Analyzing sentiment and opinions in
political oriented text, generally focuses on the attitude
expressed via texts which are not targeted at a specific issue. In
order to mine opinion, the main concentration is on non-factual
information in text. There are various affect types; specifically, the concentration here is on the six “universal” emotions as in [10]: anger, disgust, fear, happiness, sadness and surprise. These
emotions could be easily associated with an interesting
application of a human-computer interaction, where when a
system identifies that the user is upset or annoyed, the system
could change the user interface to a different mode of interaction
as in [11].
2.2 NEGATION
Negation is a very common linguistic construction that
affects polarity and, therefore, needs to be taken into
consideration in sentiment analysis. When treating negation, one
must be able to correctly determine what part of the meaning
expressed is modified by the presence of the negation. Most of
the times, its expression is far from being simple, and does not only contain obvious negation words, such as not, neither or nor. Research in the field has shown that there are many other words that invert the polarity of an opinion expressed [50], such as
diminishers/valence shifters (e.g., I find the functionality of the
new phone less practical), connectives (Perhaps it is a great
phone, but I fail to see why), or even modals (In theory, the
phone should have worked even under water). As can be seen
from these examples, modeling negation is a difficult yet an
important aspect of sentiment analysis.
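A very simple way to model the effect of negation on polarity words is to flip the score of lexicon words that fall inside a negation scope. The word lists and the scope rule (negation ends at the next punctuation mark) below are toy assumptions for illustration, not the methods of [50].

```python
# Toy negation handling: polarity words inside a negation scope are flipped.
NEGATORS = {"not", "never", "neither", "nor", "don't", "fail"}
POLARITY = {"great": 1, "practical": 1, "approve": 1, "mess": -1}

def score(sentence: str) -> int:
    total, negated = 0, False
    for tok in sentence.lower().replace(",", " ,").split():
        if tok in {",", ".", ";"}:
            negated = False           # negation scope ends at punctuation
        elif tok in NEGATORS:
            negated = True
        elif tok in POLARITY:
            total += -POLARITY[tok] if negated else POLARITY[tok]
    return total

print(score("I do not approve this phone"))  # -1
print(score("it is a great phone"))          # 1
```

Real negation handling must also cover diminishers, connectives and modals, which this word-level sketch deliberately ignores.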
2.3 FEATURE BASED SENTIMENT
CLASSIFICATION
Feature engineering is a basic and essential task for Sentiment Analysis. Converting a piece of text to a feature vector is the first step in any data-driven approach, since it allows the text to be processed efficiently. In the text domain, effective feature selection is a must in order to make the
learning task effective and accurate. In text classification, with
the bag of words model, each position in the input feature vector
corresponds to a given word or phrase. In the bag of words
framework, the documents are often converted into vectors
based on predefined feature presentation including feature type
and features weighting mechanism, which is critical to
classification accuracy. The major feature types contain
unigrams, bigrams and the mixtures of them, etc. The features
weighting mechanism mainly includes presence, frequency,
tf*idf and its variants [57]. The commonly used features used in
Sentiment Analysis and their critiques [70] are Term Presence,
Term frequency, term position, Subsequence Kernels, Parts of
Speech, Adjective-Adverb Combination, Adjectives, n-gram
features etc.
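The presence, frequency and tf*idf weighting schemes mentioned above can be illustrated over a toy two-document corpus. The corpus and the plain tf·log(N/df) formula below are our own minimal assumptions; [57] discusses the scheme variants in detail.

```python
# tf*idf weighting over a toy corpus: each document becomes a vector
# indexed by the sorted vocabulary.
import math

docs = [["great", "cast", "great", "story"], ["boring", "story"]]
vocab = sorted({w for d in docs for w in d})

def tf_idf(doc):
    n = len(docs)
    vec = []
    for w in vocab:
        tf = doc.count(w)                        # term frequency
        df = sum(1 for d in docs if w in d)      # document frequency
        vec.append(tf * math.log(n / df))        # tf * idf
    return vec

print(vocab)
print(tf_idf(docs[0]))
```

Note how “story”, which occurs in every document, receives weight 0: tf*idf down-weights terms that do not discriminate between documents.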
2.3.1 Feature Extraction:
Let us consider n-gram features for feature extraction. An n-gram is a contiguous sequence of n items from a given sequence of text or speech; the items could be syllables, letters, words, part-of-speech (POS) tags, characters, or syntactic and semantic units [49].
The n-grams are typically collected from a text or speech corpus, and n-gram features capture sentiment cues in text. n-gram features can be classified into two categories: 1) fixed n-grams, which are exact sequences occurring at either the character or token level; 2) variable n-grams, which are extraction patterns capable of representing more sophisticated linguistic phenomena. A plethora of fixed and variable n-grams have been used for opinion mining [50]. Documents are often converted
into vectors according to predefined features together with
weighting mechanisms [57]. Correlation is a commonly used method for feature selection [58], [59]. The process of obtaining n-grams can be given as in the steps below:
1) Filtering: removing URL links
2) Tokenization: segmenting the text by splitting it at spaces and punctuation marks, forming a bag of words
3) Removing stop words: removing articles (“a”, “an”, “the”)
4) Constructing n-grams from consecutive words
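The four steps above can be sketched end-to-end. The URL pattern and the stop-word list below are minimal placeholders of our own; real pipelines use far larger stop-word lists.

```python
# n-gram extraction pipeline: filter URLs, tokenize, drop stop words,
# then build n-grams from consecutive tokens.
import re

STOP_WORDS = {"a", "an", "the"}

def extract_ngrams(text: str, n: int = 2):
    text = re.sub(r"https?://\S+", "", text)               # 1) filter URLs
    tokens = re.findall(r"[a-z']+", text.lower())          # 2) tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]    # 3) drop stop words
    return [tuple(tokens[i:i + n])                         # 4) build n-grams
            for i in range(len(tokens) - n + 1)]

print(extract_ngrams("The movie was great http://example.com"))
# [('movie', 'was'), ('was', 'great')]
```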
After the extraction of the features, it would be effective if
the features are reduced.
2.3.2 Feature Reduction:
Feature reduction is an important part of optimizing the
performance of a (linear) classifier by reducing the feature
vector to a size that does not exceed the number of training cases
as a starting point. Further reduction of vector size can lead to
more improvements if the features are noisy or redundant.
Reducing the number of features in the feature vector can be
done in two different ways [41]:
1) reduction to the top ranking n features based on some
criterion of “predictiveness”
2) reduction by elimination of sets of features (e.g.
elimination of linguistic analysis features etc.)
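Strategy 1) above can be sketched as ranking features by a “predictiveness” score and keeping only the top n. The toy corpus and the score (absolute difference of class frequencies) are illustrative assumptions; [41] and [58] discuss principled criteria such as information gain.

```python
# Toy top-n feature reduction: rank features by how unevenly they are
# distributed across the positive and negative classes.
from collections import Counter

pos_docs = [["great", "plot"], ["great", "cast"]]
neg_docs = [["boring", "plot"]]

def top_n_features(n):
    pos, neg = Counter(), Counter()
    for d in pos_docs:
        pos.update(d)
    for d in neg_docs:
        neg.update(d)
    # a crude "predictiveness" score: class-frequency difference
    scores = {w: abs(pos[w] - neg[w]) for w in set(pos) | set(neg)}
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(top_n_features(2))  # "great" ranks first
```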
Now the extracted features are reduced, and the classification
of the sentiment based on the polarity and intensity of the text
using any of the machine learning approaches is to be done.
3. SENTIMENT CLASSIFICATION
Sentiment Classification broadly refers to binary
categorization, multi-class categorization, regression and
ranking. Sentiment Classification mainly consists of two
important tasks, including sentiment polarity assignment and
sentiment intensity assignment [49]. Sentiment polarity
assignment deals with analyzing whether a text has a positive, negative, or neutral semantic orientation. Sentiment intensity assignment deals with analyzing whether the positive or negative sentiments are mild or strong. There are several tasks
in order to achieve the goals of Sentiment Analysis. These tasks
include sentiment or opinion detection, polarity classification
and discovery of the opinion’s target. A wide range of tools and
techniques are used to tackle the difficulties in order to achieve
Sentiment Analysis. The various methodologies used in order to
achieve Sentiment Classification are: 1) classification with respect to term frequency, n-grams, negations or parts of speech; 2) identification of the semantic orientation of words using lexicons, statistical techniques and training documents; 3) identification of the semantic orientation of sentences and phrases; 4) identification of the semantic orientation of documents; 5) object feature extraction; 6) comparative sentence identification.
3.1 POLARITY ASSIGNMENT
The Sentiment Polarity Classification is a binary
classification task where an opinionated document is labeled
with an overall positive or negative sentiment. Sentiment
Polarity Classification can also be termed as a binary decision
task. The input to the Sentiment Classifier may or may not be opinionated. When a news article is given as an input,
analyzing and classifying it as a good or bad news is considered
to be a text categorization task as in [5]. Furthermore, this piece
of information can be good or bad news, but not necessarily
subjective (i.e., without expressing the view of the author).
Summarizing reviews in order to collect information on why the reviewers liked or disliked the product is another way of mining opinion. Determining the polarity of the outcomes described in medical texts is yet another type of
categorization, related to the degree of positivity, as in [6]. A few other problems related to the determination of the degree of positivity involve the analysis of comparative sentences. Automated
opinion mining often uses machine learning, a component of
artificial intelligence.
3.2 INTENSITY ASSIGNMENT
While sentiment polarity assignment deals with analyzing whether a text has a positive, negative, or neutral semantic orientation, sentiment intensity assignment deals with analyzing whether the positive or negative sentiments are mild or strong. Consider the two phrases “I don’t like you” and “I hate you”: both would be assigned a negative semantic orientation, but the latter would be considered more intense than the former [49]. Effectively classifying sentiment polarities and
intensities entails the use of classification methods applied to
linguistic features. While several classification methods have
been employed for opinion mining, Support Vector Machine
(SVM) has outperformed various techniques including Naive
Bayes, Decision Trees, Winnow, etc. [67], [68] and [69].
4. MACHINE LEARNING APPROACHES
The aim of Machine Learning is to develop an algorithm so
as to optimize the performance of the system using example data
or past experience. The Machine Learning provides a solution to
the classification problem that involves two steps: 1) Learning
the model from a corpus of training data 2) Classifying the
unseen data based on the trained model. In general, classification
tasks are often divided into several sub-tasks:
1) Data preprocessing
2) Feature selection and/or feature reduction
3) Representation
4) Classification
5) Post processing
Feature selection and feature reduction attempt to reduce the
dimensionality (i.e. the number of features) for the remaining
steps of the task. The classification phase of the process finds the
actual mapping between patterns and labels (or targets). Active learning, a kind of machine learning, is a promising way for sentiment classification to reduce the annotation cost [65]. The
following are some of the Machine Learning approaches
commonly used for Sentiment Classification.
4.1 NAIVE BAYES CLASSIFICATION
It is an approach to text classification that assigns the class
c* = argmax_c P(c | d) to a given document d. A naive Bayes
classifier is a simple probabilistic classifier based on Bayes'
theorem and is particularly suited when the dimensionality of the
inputs is high. Its underlying probability model can be
described as an "independent feature model". The Naive Bayes
(NB) classifier uses Bayes’ rule as in Eq.(1),

    P(c \mid d) = \frac{P(c)\, P(d \mid c)}{P(d)}    (1)
where, P(d) plays no role in selecting c*. To estimate the term
P(d \mid c), Naive Bayes decomposes it by assuming the f_i’s are conditionally independent given d’s class, as in Eq.(2),

    P_{NB}(c \mid d) = \frac{P(c) \prod_{i=1}^{m} P(f_i \mid c)^{n_i(d)}}{P(d)}    (2)

where m is the number of features, f_i is the i-th feature, and n_i(d) is the number of times f_i occurs in document d.
Consider a training method consisting of relative-frequency estimation of P(c) and P(f_i \mid c). Despite its simplicity and the fact
that its conditional independence assumption clearly does not
hold in real-world situations, Naive Bayes-based text
categorization still tends to perform surprisingly well [13];
indeed, Naive Bayes is optimal for certain problem classes with highly dependent features [29].
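A bare-bones Naive Bayes sentiment classifier in the spirit of Eq.(1)–(2) can be sketched as follows. The four-document training set is invented for illustration, and we use add-one smoothing on top of relative-frequency estimates, a common assumption not spelled out in the text.

```python
# Minimal Naive Bayes sentiment classifier with add-one smoothing,
# computed in log space to avoid underflow.
import math
from collections import Counter

train = [(["great", "cast"], "pos"), (["superb", "plot"], "pos"),
         (["boring", "plot"], "neg"), (["awful", "mess"], "neg")]
vocab = {w for doc, _ in train for w in doc}

def classify(doc):
    best, best_lp = None, -math.inf
    for c in ("pos", "neg"):
        docs = [d for d, lbl in train if lbl == c]
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values())
        lp = math.log(len(docs) / len(train))          # log P(c)
        for w in doc:                                  # + sum of log P(f_i|c)
            lp += math.log((counts[w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

print(classify(["superb", "cast"]))  # pos
```

P(d) from Eq.(1) is omitted, since it plays no role in selecting c*.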
4.2 MAXIMUM ENTROPY
Maximum Entropy (ME) classification is yet another
technique, which has proven effective in a number of natural
language processing applications [26]. Sometimes, it
outperforms Naive Bayes at standard text classification [27]. Its
estimate of P(c \mid d) takes the exponential form as in Eq.(3),

    P_{ME}(c \mid d) = \frac{1}{Z(d)} \exp\Big( \sum_i \lambda_{i,c}\, F_{i,c}(d, c) \Big)    (3)

where Z(d) is a normalization function, the \lambda_{i,c} are feature-weight parameters, and F_{i,c} is a feature/class function for feature f_i and class c, as in Eq.(4),
    F_{i,c}(d, c') = \begin{cases} 1, & \text{if } f_i \text{ occurs in } d \text{ and } c' = c \\ 0, & \text{otherwise} \end{cases}    (4)
For instance, a particular feature/class function might fire if
and only if the bigram “still hate” appears and the document’s
sentiment is hypothesized to be negative. Importantly, unlike
Naive Bayes, Maximum Entropy makes no assumptions about
the relationships between features and so might potentially
perform better when conditional independence assumptions are
not met.
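The exponential form and the feature/class functions can be sketched numerically. The two lambda weights below are made-up numbers, and the feature test is a crude substring check; a real Maximum Entropy system learns the weights by maximum-likelihood training.

```python
# Hand-rolled sketch of the Maximum Entropy estimate: feature/class
# indicator functions plus an exponential form normalized over classes.
import math

CLASSES = ("pos", "neg")
LAMBDAS = {("still hate", "neg"): 2.0,   # lambda_{i,c}: toy weights
           ("great", "pos"): 1.5}

def F(feature, c, doc, c_hyp):
    # fires iff the feature occurs in the document (crude substring test)
    # AND the hypothesized class matches the function's class
    return 1.0 if feature in doc and c_hyp == c else 0.0

def p_me(c_hyp, doc):
    def score(c):
        return math.exp(sum(lam * F(f, fc, doc, c)
                            for (f, fc), lam in LAMBDAS.items()))
    z = sum(score(c) for c in CLASSES)   # Z(d): normalization over classes
    return score(c_hyp) / z

print(round(p_me("neg", "I still hate this phone"), 3))  # 0.881
```

The bigram “still hate” fires only for the negative class, so the model assigns that document a high probability of being negative.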
4.3 SUPPORT VECTOR MACHINES
Support vector machines (SVMs) have been shown to be
highly effective at traditional text categorization, generally
outperforming Naive Bayes [40]. They are large-margin, rather
than probabilistic, classifiers, in contrast to Naive Bayes and
Maximum Entropy. In the two-category case, the basic idea
behind the training procedure is to find a maximum margin
hyperplane, represented by vector \vec{w}, that not only separates the
document vectors in one class from those in the other, but for
which the separation, or margin, is as large as possible. This
corresponds to a constrained optimization problem; letting c_j \in \{1, -1\} (corresponding to positive and negative) be the correct class of document d_j, the solution can be written as in Eq.(5),

    \vec{w} := \sum_j \alpha_j c_j \vec{d}_j, \quad \alpha_j \ge 0    (5)
where the \alpha_j’s (Lagrange multipliers) are obtained by solving a dual optimization problem. Those \vec{d}_j such that \alpha_j is greater than zero are called support vectors, since they are the only document vectors contributing to \vec{w}. Classification of test
instances consists simply of determining which side of \vec{w}’s hyperplane they fall on.
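The construction of w from support vectors and the side-of-hyperplane decision rule can be sketched with two-dimensional made-up numbers; in practice the alphas come from solving the dual optimization problem, which is omitted here.

```python
# Numeric sketch of the SVM decision rule: w is a weighted sum of
# support vectors, and a test document is classified by the sign of w·x.
support = [([1.0, 0.0], 1, 0.6),    # (d_j, c_j, alpha_j) -- toy values
           ([0.0, 1.0], -1, 0.6)]

# w = sum_j alpha_j * c_j * d_j
w = [sum(a * c * d[k] for d, c, a in support) for k in range(2)]

def classify(doc):
    # which side of w's hyperplane does the document vector fall on?
    s = sum(wk * xk for wk, xk in zip(w, doc))
    return "positive" if s > 0 else "negative"

print(w)                     # [0.6, -0.6]
print(classify([3.0, 1.0]))  # positive
```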
5. APPLICATIONS
There are quite a large number of companies, big and small,
that have opinion mining and sentiment analysis as part of their
mission. Review-oriented search engines basically use sentiment
classification techniques. Opinion Mining proves itself to be an
important part of search engines. Topics need not be restricted to
product reviews, but could include opinions about candidates
running for office, political issues, and so forth. Summarizing
user reviews is an important problem. One could also imagine
that errors in user ratings could be fixed: there are cases where
users have clearly accidentally selected a low rating when their
review indicates a positive evaluation [3].
Sentiment-analysis and opinion-mining systems also have a
potential role in imparting sub-component technology for other
systems. Specifically, sentiment analysis system is an
augmentation to recommendation systems[14, 15]; since it might
recommend such a system not to suggest items that receive a lot
of negative feedback.
Detection of “flames” (overly-heated opposition) in email or
other types of communication is another possible use of
subjectivity detection [4]. In online systems that display ads as
sidebars, it is helpful to detect web pages that contain sensitive
content inappropriate for ads placement [16]; for more
sophisticated systems, it could be useful to bring up product ads
when relevant positive sentiments are detected. It has also been
argued that information extraction can be improved by
discarding information found in subjective sentences [17].
Question answering is another area where sentiment
analysis can prove useful [18, 19, and 20]. For example,
opinion-oriented questions may require different treatment. For
definitional questions, providing an answer that includes more
information about how an entity is viewed may better inform the
user [18]. Summarization may also benefit from accounting for
multiple viewpoints [21]. One effort seeks to use semantic
orientation to track literary reputation [22]. In general, the
computational treatment of affect has been motivated in part by
the desire to improve human-computer interaction [23, 54 and
55].
It is the breadth of opportunities, the promising ways text analytics can be applied to extract and analyze attitudinal information from sources as varied as articles, blog postings, e-mail, call-center notes and survey responses, together with the difficulty of the technical challenges, that make existing and emerging applications so interesting. Three other applications include influence networks, assessment of marketing response, and customer experience management/enterprise feedback management.
6. FUTURE CHALLENGES
There are several challenges in analyzing the sentiment of
the web user reviews. First, a word that is considered to be
positive in one situation may be considered negative in another
situation. Take the word "long" for instance. If a customer said a
laptop's battery life was long, that would be a positive opinion.
If the customer said that the laptop's start-up time was long, however, that would be a negative opinion [49]. These
differences mean that an opinion system trained to gather
opinions on one type of product or product feature may not
perform very well on another.
Another challenge arises because people don't always express opinions the same way. Most traditional text processing relies on the fact that small differences between two pieces of text don't change the meaning very much. In opinion mining, however, "the movie was great" is very different from "the movie was not great" [50].
People can be contradictory in their statements. Most reviews
will have both positive and negative comments, which is
somewhat manageable by analyzing sentences one at a time.
However, the more informal the medium (Twitter or blogs, for
example), the more likely people are to combine different
opinions in the same sentence. For example: "the movie bombed
even though the lead actor rocked it" is easy for a human to
understand, but more difficult for a computer to parse.
Sometimes even other people have difficulty understanding what
someone thought based on a short piece of text because it lacks
context. For example, "That movie was as good as his last one"
is entirely dependent on what the person expressing the opinion
thought of the previous film.
Pragmatics is a subfield of linguistics which studies the ways
in which context contributes to meaning. It is important to detect
the pragmatics of user opinion which may change the sentiment
thoroughly. Capitalization can be used with subtlety to denote
sentiment. In the examples given below, the first example
denotes a positive sentiment whereas the second denotes a
negative sentiment.
I just finished watching THE DESTROY.
That completely destroyed me.
Another challenge is in identifying the entity. A text or
sentence may have multiple entities associated with it. It is
extremely important to find out the entity towards which the
opinion is directed. Consider the following examples [70].
Sony is better than Samsung.
Raman defeated Ravanan in football.
The examples are positive for Sony and Raman respectively
but negative for Samsung and Ravanan.
7. CONCLUSION
This survey discusses various approaches to Opinion Mining
and Sentiment Analysis. It provides a detailed view of different
applications and potential challenges of Opinion Mining that
makes it a difficult task. Some of the machine learning techniques, like Naïve Bayes, Maximum Entropy and Support Vector Machines, have been discussed. Many of the applications of Opinion Mining are based on the bag-of-words model, which does not capture the context that is essential for Sentiment Analysis. The
recent developments in Sentiment Analysis and its related sub-
tasks are also presented. The state of the art of existing
approaches has been described with the focus on the following
tasks: Subjectivity detection, Word Sense Disambiguation,
Feature Extraction and Sentiment Classification using various
Machine learning techniques. Finally, the future challenges and
directions so as to further enhance the research in the field of
Opinion Mining and Sentiment Classification are discussed.
REFERENCES
[1] Alexander Pak and Patrick Paroubek, “Twitter as a corpus
for Sentiment Analysis and Opinion Mining”, Proceedings
of the Seventh conference on International Language
Resources and Evaluation, pp. 1320-1326, 2010.
[2] Bo Pang, Lillian Lee and Shivakumar Vaithyanathan, “Thumbs up?
Sentiment Classification using Machine Learning
Techniques”, Proceedings of the ACL-02 Conference on
Empirical Methods in Natural Language Processing, Vol.
10, pp. 79-86, 2002.
[3] Luis Cabral and Ali Hortacsu, “The dynamics of seller
reputation: Theory and evidence from eBay”, The Journal
of Industrial Economics, Vol. 58, No. 1, pp. 54-78, 2010.
[4] Ellen Spertus, “Smokey: Automatic recognition of hostile
messages”, Proceedings of Innovative Applications of
Artificial Intelligence, pp. 1058-1065, 1997.
[5] Sanjiv Das, Asis Martinez-Jerez and Peter Tufano, “e-
Information: A clinical study of investor discussion and
sentiment”, Financial Management, Vol. 34, No. 3, pp.
103-137, 2005.
[6] Yun Niu, Xiaodan Zhu, Jianhua Li and Graeme Hirst,
“Analysis of polarity information in medical text”,
Proceedings of the American Medical Informatics
Association, Annual Symposium, pp. 570-574, 2005.
[7] Shitanshu Verma and Pushpak Bhattacharyya,
“Incorporating Semantic Knowledge for Sentiment
Analysis”, Proceedings of International Conference on
Natural Language Processing, 2009.
[8] Ellen Riloff and Janyce Wiebe, “Learning extraction
patterns for subjective expressions”, Proceedings of the
Conference on Empirical Methods in Natural Language
Processing, 2003.
[9] Gregory Grefenstette, Yan Qu, James G. Shanahan and
David A. Evans, “Coupling niche browsers and affect
analysis for an opinion mining application”, Proceedings of
RIAO (Recherche d’Information Assist´ee par Ordinateur)
Conference, pp. 186-194, 2004.
[10] Paul Ekman, “Emotion in the Human Face”, Cambridge
University Press, second edition, 1982.
[11] Lisa Hankin, “The effects of user reviews on online
purchasing behavior across multiple product categories”,
Master’s final project report, UC Berkeley School of
Information, 2007.
[12] George Forman, “An Extensive Empirical study of feature
selection Metrics for Text Classification”, Journal of
Machine Learning Research, Vol. 3, pp. 1289-1305, 2003.
[13] David D. Lewis, “Naive Bayes at forty: The independence assumption in information retrieval”, Proceedings of the European Conference on Machine Learning, No. 1398, pp. 4-15, 1998.
[14] Junichi Tatemura, “Virtual reviewers for collaborative exploration of movie reviews”, Proceedings of Intelligent User Interfaces, pp. 272-275, 2000.
[15] Loren Terveen, Will Hill, Brian Amento, David McDonald and Josh Creter, “PHOAKS: A system for sharing recommendations”, Communications of the Association for Computing Machinery, Vol. 40, No. 3, pp. 59-62, 1997.
[16] Xin Jin, Ying Li, Teresa Mah and Jie Tong, “Sensitive
webpage classification for content advertising”,
Proceedings of the International Workshop on Data
Mining and Audience Intelligence for Advertising, pp. 28-
33, 2007.
[17] Ellen Riloff, Janyce Wiebe and William Phillips, “Exploiting subjectivity classification to improve information extraction”, Proceedings of Association for the Advancement of Artificial Intelligence, pp. 1106-1111, 2005.
[18] Lucian Vlad Lita, Andrew Hazen Schlaikjer, Wei-Chang Hong and Eric Nyberg, “Qualitative dimensions in question answering: Extending the definitional QA task”, Proceedings of Association for the Advancement of Artificial Intelligence, pp. 1616-1617, 2005.
[19] Swapna Somasundaran, Theresa Wilson, Janyce Wiebe and Veselin Stoyanov, “QA with attitude: Exploiting opinion type analysis for improving question answering in on-line discussions and the news”, Proceedings of the International Conference on Weblogs and Social Media, 2007.
[20] Veselin Stoyanov, Claire Cardie and Janyce Wiebe, “Multi-perspective question answering using the OpQA corpus”, Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing, pp. 923-930, 2005.
[21] Yohei Seki, Koji Eguchi, Noriko Kando and Masaki Aono, “Multi-document summarization with subjectivity analysis”, Proceedings of the Document Understanding Conference, 2005.
[22] Maite Taboada, Mary Ann Gillies and Paul McFetridge, “Sentiment classification techniques for tracking literary reputation”, Language Resources and Evaluation Workshop: Towards Computational Models of Literary Analysis, pp. 36-43, 2006.
[23] Jackson Liscombe, Giuseppe Riccardi and Dilek Hakkani-Tür, “Using context to improve emotion detection in spoken dialog systems”, Interspeech, pp. 1845-1848, 2005.
[24] Hugo Liu, Henry Lieberman and Ted Selker, “A model of textual affect sensing using real-world knowledge”, Proceedings of Intelligent User Interfaces, pp. 125-132, 2003.
[25] Matt Thomas, Bo Pang and Lillian Lee, “Get out the vote: Determining support or opposition from Congressional floor-debate transcripts”, Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 327-335, 2006.
[26] Adam L. Berger, Stephen A. Della Pietra and Vincent J.
Della Pietra, “A maximum entropy approach to natural
language processing”, Computational Linguistics, Vol. 22, No. 1, pp. 39-71, 1996.
[27] Kamal Nigam, John Lafferty and Andrew McCallum, “Using maximum entropy for text classification”, Proceedings of the International Joint Conference on Artificial Intelligence Workshop on Machine Learning for Information Filtering, pp. 61-67, 1999.
[28] Andrew McCallum and Kamal Nigam, “A comparison of event models for Naive Bayes text classification”, Proceedings of the Association for the Advancement of Artificial Intelligence Workshop on Learning for Text Categorization, pp. 41-48, 1998.
[29] Pedro Domingos and Michael J. Pazzani, “On the optimality of the simple Bayesian classifier under zero-one loss”, Machine Learning, Vol. 29, No. 2-3, pp. 103-130, 1997.
[30] Stephen Della Pietra, Vincent Della Pietra and John Lafferty, “Inducing features of random fields”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 4, pp. 380-393, 1997.
[31] Stanley Chen and Ronald Rosenfeld, “A survey of smoothing techniques for ME models”, IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 1, pp. 37-50, 2000.
[32] Sanjiv Das and Mike Chen, “Yahoo! for Amazon:
Extracting market sentiment from stock message boards”,
Proceedings of the 8th Asia Pacific Finance Association
Annual Conference, 2001.
[33] Douglas Biber, “Variation across Speech and Writing”, Cambridge University Press, 1988.
[34] Shlomo Argamon-Engelson, Moshe Koppel and Galit Avneri, “Style-based text categorization: What newspaper am I reading?”, Proceedings of the Association for the Advancement of Artificial Intelligence Workshop on Text Categorization, pp. 1-4, 1998.
[35] Aidan Finn, Nicholas Kushmerick and Barry Smyth, “Genre classification and domain transfer for information filtering”, Proceedings of the European Colloquium on Information Retrieval Research, pp. 353-362, 2002.
[36] Vasileios Hatzivassiloglou and Kathleen McKeown, “Predicting the semantic orientation of adjectives”, Proceedings of the 35th Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, pp. 174-181, 1997.
[37] Vasileios Hatzivassiloglou and Janyce Wiebe, “Effects of adjective orientation and gradability on sentence subjectivity”, Proceedings of the 18th Conference on Computational Linguistics, Vol. 1, pp. 299-305, 2000.
[38] Marti Hearst, “Direction-based text interpretation as an information access refinement”, Text-Based Intelligent Systems, pp. 257-274, 1992.
[39] Alison Huettner and Pero Subasic, “Fuzzy typing for document management”, Association for Computational Linguistics, Companion Volume: Tutorial Abstracts and Demonstration Notes, pp. 26-27, 2000.
[40] Thorsten Joachims, “Text categorization with support vector machines: Learning with many relevant features”, Proceedings of the European Conference on Machine Learning, pp. 137-142, 1998.
[41] Michael Gamon, “Sentiment classification on customer feedback data: noisy data, large feature vectors, and the role of linguistic analysis”, Proceedings of the 20th International Conference on Computational Linguistics, pp. 841-847, 2004.
[42] Hsinchun Chen and David Zimbra, “AI and Opinion
Mining”, IEEE Intelligent Systems, Vol. 25, No. 3, pp. 74-
80, 2010.
[43] Hsinchun Chen, “AI and Opinion Mining, Part 2”, IEEE
Intelligent Systems, pp. 72-79, 2010.
[44] S. Shivashankar and B. Ravindran, “Multi Grain Sentiment
Analysis using Collective Classification”, Proceedings of
the European Conference on Artificial Intelligence,
pp. 823-828, 2010.
[45] George Stylios, Dimitris Christodoulakis, Jeries Besharat,
Maria-Alexandra Vonitsanou, Ioanis Kotrotsos, Athanasia
Koumpouri and Sofia Stamou, “Public Opinion Mining for
Governmental Decisions”, Electronic Journal of e-
Government, Vol. 8, No. 2, pp. 203-214, 2010.
[46] Anindya Ghose and Panagiotis G. Ipeirotis, “Estimating the Helpfulness and Economic Impact of Product Reviews: Mining Text and Reviewer Characteristics”, IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 10, pp. 1498-1512, 2011.
[47] Janyce Wiebe and Ellen Riloff, “Finding Mutual Benefit between Subjectivity Analysis and Information Extraction”, IEEE Transactions on Affective Computing, Vol. 2, No. 4, pp. 175-191, 2011.
[48] Huifeng Tang, Songbo Tan and Xueqi Cheng, “A survey on sentiment detection of reviews”, Expert Systems with Applications, Vol. 36, pp. 10760-10773, 2009.
[49] Ahmed Abbasi, Stephen France, Zhu Zhang and Hsinchun
Chen, “Selecting Attributes for Sentiment Classification
Using Feature Relation Networks”, IEEE Transactions on
Knowledge and Data Engineering, Vol. 23, No. 3, pp. 447-
462, 2011.
[50] Michael Wiegand and Alexandra Balahur, “A Survey on
the Role of Negation in Sentiment Analysis”, Proceedings
of the Workshop on Negation and Speculation in Natural
Language Processing, 2010.
[51] Thorsten Joachims, “Making large-scale SVM learning practical”, in Bernhard Schölkopf and Alexander Smola, editors, Advances in Kernel Methods - Support Vector Learning, pp. 41-56, 1999.
[52] Jussi Karlgren and Douglass Cutting, “Recognizing text genres with simple metrics using discriminant analysis”, Proceedings of Conference on Computational Linguistics, Vol. 2, pp. 1071-1075, 1994.
[53] Brett Kessler, Geoffrey Nunberg and Hinrich Schütze, “Automatic detection of text genre”, Proceedings of the 35th Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, pp. 32-38, 1997.
[54] Hugo Liu, Henry Lieberman and Ted Selker, “A model of textual affect sensing using real-world knowledge”, Proceedings of Intelligent User Interfaces, pp. 125-132, 2003.
[55] Junichi Tatemura, “Virtual reviewers for collaborative exploration of movie reviews”, Proceedings of Intelligent User Interfaces, pp. 272-275, 2000.
[56] Soo-Min Kim and Eduard Hovy, “Determining the Sentiment of Opinions”, Proceedings of the Conference on Computational Linguistics, Article No. 1367, 2004.
[57] Yuming Lin, Jingwei Zhang, Xiaoling Wang and Aoying
Zhou, “Sentiment Classification via Integrating Multiple
Feature Presentations”, WWW 2012 Poster Presentation,
pp. 569-570, 2012.
[58] M. Hall and L.A. Smith, “Feature Subset Selection: A Correlation Based Filter Approach”, Proceedings of the Fourth International Conference on Neural Information Processing and Intelligent Information Systems, pp. 855-858, 1997.
[59] G. Forman, “An Extensive Empirical Study of Feature Selection Metrics for Text Classification”, Journal of Machine Learning Research, Vol. 3, pp. 1289-1305, 2003.
[60] Bo Pang and Lillian Lee, “Opinion mining and sentiment analysis”, Foundations and Trends in Information Retrieval, Vol. 2, No. 1-2, pp. 1-135, 2008.
[61] Z. Zhang, “Weighing Stars: Aggregating Online Product Reviews for Intelligent E-Commerce Applications”, IEEE Intelligent Systems, Vol. 23, No. 5, pp. 42-49, 2008.
[62] A. Abbasi, H. Chen and A. Salem, “Sentiment Analysis in Multiple Languages: Feature Selection for Opinion Classification in Web Forums”, ACM Transactions on Information Systems, Vol. 26, No. 3, Article No. 12, 2008.
[63] Suge Wang, Deyu Li, Xiaolei Song, Yingjie Wei and Hongxia Li, “A feature selection method based on improved Fisher’s discriminant ratio for text Sentiment Classification”, Expert Systems with Applications, Vol. 38, No. 7, pp. 8696-8702, 2011.
[64] Q. Ye, Z. Zhang and R. Law, “Sentiment classification of online reviews to travel destinations by supervised machine learning approaches”, Expert Systems with Applications, Vol. 36, No. 3, part 2, pp. 6527-6535, 2009.
[65] Shoushan Li, Shengfeng Ju, Guodong Zhou and Xiaojun Li, “Active Learning for Imbalanced Sentiment Classification”, Proceedings of the International Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 139-148, 2012.
[66] Claudiu Cristian Musat, Alireza Ghasemi and Boi Faltings, “Sentiment Analysis Using a Novel Human Computation Game”, Proceedings of the 3rd Workshop on the People’s Web Meets NLP, pp. 1-9, 2012.
[67] A. Abbasi and H. Chen, “CyberGate: A System and Design Framework for Text Analysis of Computer Mediated Communication”, MIS Quarterly, Vol. 32, No. 4, pp. 811-837, 2008.
[68] A. Abbasi, H. Chen, S. Thoms and T. Fu, “Affect Analysis of Web Forums and Blogs Using Correlation Ensembles”, IEEE Transactions on Knowledge and Data Engineering, Vol. 20, No. 9, pp. 1168-1180, 2008.
[69] B. Pang and L. Lee, “A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts”, Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pp. 271-278, 2004.
[70] Subhabrata Mukherjee, “Sentiment Analysis-A Literature
Survey”, Indian Institute of Technology, Bombay,
Department of Computer Science and Engineering, 2012.
[71] G.A. Van Kleef, A.C. Homan, B. Beersma, D. Van Knippenberg, B. Van Knippenberg and F. Damen, “Searing sentiment or cold calculation? The effects of leader emotional displays on team performance depend on follower epistemic motivation”, IEEE Engineering Management Review, Vol. 40, No. 1, pp. 73-94, 2012.
[72] Chenghua Lin, Yulan He, R. Everson and S. Ruger, “Weakly Supervised Joint Sentiment-Topic Detection from Text”, IEEE Transactions on Knowledge and Data Engineering, Vol. 24, No. 6, 2012.
[73] A. Neviarouskaya, H. Prendinger and M. Ishizuka, “SentiFul: A Lexicon for Sentiment Analysis”, IEEE Transactions on Affective Computing, Vol. 2, No. 1, pp. 22-36, 2011.
[74] X. Yu, Y. Liu, X. Huang and A. An, “Mining Online Reviews for Predicting Sales Performance: A Case Study in the Movie Domain”, IEEE Transactions on Knowledge and Data Engineering, Vol. 24, No. 4, pp. 720-734, 2012.
[75] C.L. Liu, W.H. Hsaio, C.H. Lee, G.C. Lu and E. Jou, “Movie Rating and Review Summarization in Mobile Environment”, IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews, Vol. 42, No. 3, pp. 397-407, 2012.
[76] D. Bollegala, D. Weir and J. Carroll, “Cross-Domain Sentiment Classification using a Sentiment Sensitive Thesaurus”, IEEE Transactions on Knowledge and Data Engineering, Vol. PP, No. 99, 2012.
[77] A.R. Naradhipa and A. Purwarianti, “Sentiment
classification for Indonesian message in social media”,
International Conference on Cloud Computing and Social
Networking, pp. 1-5, 2012.