S CHANDRAKALA AND C SINDHU: OPINION MINING AND SENTIMENT CLASSIFICATION: A SURVEY
DOI: 10.21917/ijsc.2012.0065
OPINION MINING AND SENTIMENT CLASSIFICATION: A SURVEY
S. ChandraKala1 and C. Sindhu2
Department of Computer Science and Engineering, Velammal Engineering College, India
E-mail: 1sckala@gmail.com and 2csindhucse@gmail.com
Abstract
With the evolution of web technology, there is a huge amount of data
present in the web for the internet users. These users not only use the
available resources in the web, but also give their feedback, thus
generating additional useful information. Due to overwhelming
amount of user’s opinions, views, feedback and suggestions available
through the web resources, it’s very much essential to explore,
analyze and organize their views for better decision making. Opinion
Mining or Sentiment Analysis is a Natural Language Processing and
Information Extraction task that identifies the user’s views or
opinions explained in the form of positive, negative or neutral
comments and quotes underlying the text. Text categorization
generally classifies the documents by topic. This survey gives an
overview of the efficient techniques, recent advancements and the
future research directions in the field of Sentiment Analysis.
Keywords:
Sentiment Analysis, Opinion Mining, Information Extraction
1. INTRODUCTION
Opinion Mining or Sentiment analysis involves building a
system to explore user’s opinions made in blog posts, comments,
reviews or tweets, about the product, policy or a topic. It aims to
determine the attitude of a user about some topic. In recent
years, the exponential increase in the Internet usage and
exchange of user’s opinion is the motivation for Opinion
Mining. The Web is a huge repository of structured and
unstructured data. The analysis of this data to extract underlying
user’s opinion and sentiment is a challenging task. An opinion
can be described as a quadruple consisting of a Topic, Holder,
Claim and Sentiment [56]. Here the Holder believes a Claim
about the Topic and expresses it through an associated
Sentiment. To a machine, an opinion is a “quintuple” (O_j, f_jk, SO_ijkl, h_i, t_l) [Bing Liu, NLP Handbook], where O_j is the object on which the opinion is expressed, f_jk is a feature of O_j, SO_ijkl is the sentiment value of the opinion, h_i is the opinion holder, and t_l is the time at which the opinion is given.
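The quintuple above can be sketched as a small data structure. This is a hypothetical Python illustration; the class and field names are ours, not from the survey or from Liu's handbook.

```python
# Minimal sketch of Bing Liu's opinion quintuple as a record type.
from dataclasses import dataclass

@dataclass
class Opinion:
    obj: str        # O_j: the object the opinion is about
    feature: str    # f_jk: the feature of O_j being evaluated
    sentiment: str  # SO_ijkl: "positive", "negative" or "neutral"
    holder: str     # h_i: the opinion holder
    time: str       # t_l: when the opinion was expressed

op = Opinion("Supernova 7200", "battery life", "positive", "user123", "2012-10-01")
print(op.sentiment)  # positive
```

Keeping the five components separate makes it possible to aggregate opinions per object, per feature, per holder, or over time.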
There are several challenges in the field of sentiment
analysis. The most common challenges are given here. Firstly,
Word Sense Disambiguation (WSD), a classical NLP problem is
often encountered. For example, “an unpredictable plot in the
movie” is a positive phrase, while “an unpredictable steering
wheel” is a negative one. The opinion word unpredictable is
used in different senses. Secondly, addressing the problem of
sudden deviation from positive to negative polarity, as in “The
movie has a great cast, superb storyline and spectacular
photography; the director has managed to make a mess of the
whole thing”. Thirdly, negations, unless handled properly, can completely mislead: “Not only do I not approve Supernova 7200, but also hesitate to call it a phone” has the positive polarity word approve, but its effect is negated by multiple negations. Fourthly, keeping the target in focus can be a challenge, as in “my camera compares nothing to Jack’s camera which is sleek and light, produces life like pictures and is inexpensive”. All the positive words about Jack’s camera, being the constituents of the document vector, will produce an overall decision of positive polarity, which is wrong [7].
Table.1. Recent Papers on Sentiment Analysis and its related tasks

Topic                                      Paper   Year
Sentiment Analysis                         [66]    2012
                                           [70]    2012
                                           [73]    2011
Subjectivity Analysis                      [47]    2011
Sentiment Detection                        [48]    2009
Feature Selection for Opinion Mining       [49]    2011
                                           [62]    2008
                                           [63]    2011
Review Aggregation                         [61]    2008
Supervised Machine Learning
  Approaches for Opinion Mining            [64]    2009
Sentiment Classification                   [76]    2012
                                           [77]    2012
Active Learning for Opinion Mining         [65]    2012
The importance and popularity of Sentiment Analysis have led to several papers that describe and implement its variety of tasks using several different techniques; some of them are listed in Table.1, together with their topics and years of publication. In [73], methods to automatically generate a new sentiment lexicon, called SentiFul, are described. A subjective-objective sentence classifier that does not require annotated data as input
is built [47]. Such a classifier may then be used to improve
information extraction performance. Subjectivity classification
can prevent the sentiment classifier from considering irrelevant
or even potentially misleading text. Document sentiment
classification and opinion extraction have often involved word
sentiment classification techniques [48]. The Entropy Weighted
Genetic Algorithm (EWGA) is a hybridized genetic algorithm
that incorporates the information gain heuristic for feature
selection [62]. SVM and N-gram approaches outperformed the
Naïve Bayes approach when sentiment classification was applied
to the reviews on travel blogs for seven popular travel
destinations in the US and Europe [64].
ISSN: 2229-6956(ONLINE) ICTACT JOURNAL ON SOFT COMPUTING, OCTOBER 2012, VOLUME: 03, ISSUE: 01
Fig.1 depicts the major important steps in order to achieve an
opinion impact. The web users post their views, comments and
feedback about a particular product or a thing through various
blogs, forums and social networking sites. Data is collected from such opinion sources in such a way that only the reviews related to the searched topic are selected. The input document is then preprocessed. Preprocessing, in this context, is the removal of fact-based sentences, so that only the opinionated sentences are retained. Further refinements are made by handling negations and by resolving word sense ambiguity. Then, the
process of extracting relevant features is done. Feature selection
can potentially improve classification accuracy [58], narrow in
on a key feature subset of sentiment discriminators, and provide
greater insight into important class attributes. The extracted
features contribute to a document vector upon which various
machine learning techniques can be applied in order to classify
the polarity (positive and negative opinions) using the obtained
document vector and finally the opinion impact is obtained
based on the sentiment of the web users.
Fig.1. Systematic work flow of Sentiment Analysis
The organization of the paper includes Sentiment Analysis in
section 2 under which subjectivity detection, negation and
feature based sentiment classification are briefed in the sections
2.1 to 2.3. Feature extraction and Feature Reduction are
explained under 2.3.1 and 2.3.2. The Sentiment Classification is
explained under the section 3, under which the polarity and
intensity assignment is discussed under 3.1 and 3.2 respectively.
The Machine Learning Approaches are discussed in the section
4, under which the Naïve Bayes Classification, Maximum
Entropy and Support Vector Machines are briefed. Finally the
applications and future challenges of Opinion Mining and
Sentiment Classification are elaborated under the sections 5 and
6 respectively.
2. SENTIMENT ANALYSIS
Ordinary keyword search will not be suitable for mining all
kinds of opinions. Hence it becomes necessary that sophisticated opinion extraction methods are used. Sentiment analysis is a natural language processing technique that helps to identify and extract subjective information in source materials.
Sentiment analysis aims to determine the attitude of the writer
with respect to some topic or the overall contextual polarity of a
document. The attitude may be his or her judgment or
evaluation, affective state, or the intended emotional
communication. A basic task in sentiment analysis is classifying
the polarity of a given text at the document, sentence, or
feature/aspect level whether the expressed opinion in a
document, a sentence or an entity feature/aspect is positive,
negative, or neutral. Beyond polarity sentiment classification,
the emotional states such as “angry”, “sad” and “happy” are also
identified.
One of the challenges of Sentiment Analysis is to define the
opinions and subjectivity of the study [7]. Originally,
subjectivity was defined by linguists, most prominently,
Randolph Quirk (R. Quirk and Svartvik, 1985). Quirk defines
private state as something that is not open to objective
observation or verification. These private states include
emotions, opinions, and speculations; among others. The very
definition of a private state foreshadows difficulties in analyzing
sentiment. Subjectivity is highly context-sensitive, and its
expression is often peculiar to each person. Subjectivity
Detection and Negation are the most important preprocessing
steps in order to achieve efficient opinion impact. They are
discussed in the following sections.
2.1 SUBJECTIVITY DETECTION
Subjectivity detection can be defined as the process of selecting opinion-containing sentences [7]. For example: “India’s economy is heavily dependent on tourism and IT industry. It is an excellent place to live in.” The first sentence is a factual one and does not convey any sentiment towards India. Hence it should not play any role in deciding the polarity of the review, and should be filtered out. Hence, the Polarity Classifier
assumes that the incoming documents are opinionated. Joint
Topic-Sentiment Analysis is done by collecting only on-topic
documents (e.g., by executing the topic-based query using a
standard search engine). In Information extraction, both topic-
based text filtering and subjectivity filtering are complementary
as in [8]. If a document contains information on a variety of
topics that may attract the attention of the user, then it will be
useful to classify the topics and its related opinions. This type of
analysis can be useful for comparative search analysis of related
items and also to discuss on the texts that contains various
features and attributes.
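A subjectivity filter of the kind described above can be sketched with a tiny hand-made lexicon of subjective cues. This is an illustrative assumption on our part; real systems learn such cues from corpora (e.g., the extraction patterns of [8] or the classifier of [47]).

```python
# Toy subjectivity filter: a sentence is kept only if it contains a cue
# from a small, hand-made subjectivity lexicon (illustrative words only).
SUBJECTIVE_CUES = {"excellent", "terrible", "love", "hate", "amazing", "awful"}

def is_opinionated(sentence: str) -> bool:
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    return bool(words & SUBJECTIVE_CUES)

review = [
    "India's economy is heavily dependent on tourism and IT industry.",
    "It is an excellent place to live in.",
]
opinionated = [s for s in review if is_opinionated(s)]
print(opinionated)  # only the second, opinionated sentence survives
```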
The political orientation of websites can be determined by classifying the concatenation of all the documents found on a particular site, as in [9]. Analyzing sentiment and opinions in
political oriented text, generally focuses on the attitude
expressed via texts which are not targeted at a specific issue. In
order to mine opinion, the main concentration is on non-factual
information in text. There are various affect types; specifically, the concentration here is on the six “universal” emotions as in [10]: anger, disgust, fear, happiness, sadness and surprise. These
emotions could be easily associated with an interesting
application of a human-computer interaction, where when a
system identifies that the user is upset or annoyed, the system
could change the user interface to a different mode of interaction
as in [11].
2.2 NEGATION
Negation is a very common linguistic construction that
affects polarity and, therefore, needs to be taken into
consideration in sentiment analysis. When treating negation, one
must be able to correctly determine what part of the meaning
expressed is modified by the presence of the negation. Most of
the times, its expression is far from being simple, and does not only contain obvious negation words, such as not, neither or nor. Research in the field has shown that there are many other words that invert the polarity of an opinion expressed [50], such as
diminishers/valence shifters (e.g., I find the functionality of the
new phone less practical), connectives (Perhaps it is a great
phone, but I fail to see why), or even modals (In theory, the
phone should have worked even under water). As can be seen
from these examples, modeling negation is a difficult yet an
important aspect of sentiment analysis.
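A very simple way to model the effect of negation on polarity words is to flip the score of lexicon words that fall inside a negation scope. The word lists and the scope rule (negation ends at the next punctuation mark) below are toy assumptions for illustration, not the methods of [50].

```python
# Toy negation handling: polarity words inside a negation scope are flipped.
NEGATORS = {"not", "never", "neither", "nor", "don't", "fail"}
POLARITY = {"great": 1, "practical": 1, "approve": 1, "mess": -1}

def score(sentence: str) -> int:
    total, negated = 0, False
    for tok in sentence.lower().replace(",", " ,").split():
        if tok in {",", ".", ";"}:
            negated = False           # negation scope ends at punctuation
        elif tok in NEGATORS:
            negated = True
        elif tok in POLARITY:
            total += -POLARITY[tok] if negated else POLARITY[tok]
    return total

print(score("I do not approve this phone"))  # -1
print(score("it is a great phone"))          # 1
```

Real negation handling must also cover diminishers, connectives and modals, which this word-level sketch deliberately ignores.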
2.3 FEATURE BASED SENTIMENT
CLASSIFICATION
Feature engineering is a basic and essential task for Sentiment Analysis. Converting a piece of text to a feature vector is the first step in any data-driven approach, since it allows the text to be processed efficiently. In the text domain, effective feature selection is a must in order to make the
learning task effective and accurate. In text classification, with
the bag of words model, each position in the input feature vector
corresponds to a given word or phrase. In the bag of words
framework, the documents are often converted into vectors
based on predefined feature presentation including feature type
and features weighting mechanism, which is critical to
classification accuracy. The major feature types contain
unigrams, bigrams and the mixtures of them, etc. The features
weighting mechanism mainly includes presence, frequency,
tf*idf and its variants [57]. The commonly used features used in
Sentiment Analysis and their critiques [70] are Term Presence,
Term frequency, term position, Subsequence Kernels, Parts of
Speech, Adjective-Adverb Combination, Adjectives, n-gram
features etc.
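The presence, frequency and tf*idf weighting schemes mentioned above can be illustrated over a toy two-document corpus. The corpus and the plain tf·log(N/df) formula below are our own minimal assumptions; [57] discusses the scheme variants in detail.

```python
# tf*idf weighting over a toy corpus: each document becomes a vector
# indexed by the sorted vocabulary.
import math

docs = [["great", "cast", "great", "story"], ["boring", "story"]]
vocab = sorted({w for d in docs for w in d})

def tf_idf(doc):
    n = len(docs)
    vec = []
    for w in vocab:
        tf = doc.count(w)                        # term frequency
        df = sum(1 for d in docs if w in d)      # document frequency
        vec.append(tf * math.log(n / df))        # tf * idf
    return vec

print(vocab)
print(tf_idf(docs[0]))
```

Note how “story”, which occurs in every document, receives weight 0: tf*idf down-weights terms that do not discriminate between documents.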
2.3.1 Feature Extraction:
Let us consider n-gram features for feature extraction. An n-gram is a contiguous sequence of n items from a given sequence of text or speech; the items could be syllables, letters, words, part-of-speech (POS) tags, characters, or syntactic and semantic units [49].
The n-grams are typically collected from a text or speech corpus, and n-gram features capture sentiment cues in text. n-gram features can be classified into two categories: 1) fixed n-grams, which are exact sequences occurring at either the character or token level; 2) variable n-grams, which are extraction patterns capable of representing more sophisticated linguistic phenomena. A plethora of fixed and variable n-grams have been used for opinion mining [50]. Documents are often converted
into vectors according to predefined features together with
weighting mechanisms [57]. Correlation is a commonly used method for feature selection [58], [59]. The process of obtaining n-grams can be given as in the steps below:
1) Filtering: removing URL links
2) Tokenization: segmenting the text by splitting it at spaces and punctuation marks, forming a bag of words
3) Removing stop words: removing articles (“a”, “an”, “the”)
4) Constructing n-grams from consecutive words
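The four steps above can be sketched end-to-end. The URL pattern and the stop-word list below are minimal placeholders of our own; real pipelines use far larger stop-word lists.

```python
# n-gram extraction pipeline: filter URLs, tokenize, drop stop words,
# then build n-grams from consecutive tokens.
import re

STOP_WORDS = {"a", "an", "the"}

def extract_ngrams(text: str, n: int = 2):
    text = re.sub(r"https?://\S+", "", text)               # 1) filter URLs
    tokens = re.findall(r"[a-z']+", text.lower())          # 2) tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]    # 3) drop stop words
    return [tuple(tokens[i:i + n])                         # 4) build n-grams
            for i in range(len(tokens) - n + 1)]

print(extract_ngrams("The movie was great http://example.com"))
# [('movie', 'was'), ('was', 'great')]
```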
After the extraction of the features, it would be effective if
the features are reduced.
2.3.2 Feature Reduction:
Feature reduction is an important part of optimizing the
performance of a (linear) classifier by reducing the feature
vector to a size that does not exceed the number of training cases
as a starting point. Further reduction of vector size can lead to
more improvements if the features are noisy or redundant.
Reducing the number of features in the feature vector can be
done in two different ways [41]:
1) reduction to the top ranking n features based on some
criterion of “predictiveness”
2) reduction by elimination of sets of features (e.g.
elimination of linguistic analysis features etc.)
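Strategy 1) above can be sketched as ranking features by a “predictiveness” score and keeping only the top n. The toy corpus and the score (absolute difference of class frequencies) are illustrative assumptions; [41] and [58] discuss principled criteria such as information gain.

```python
# Toy top-n feature reduction: rank features by how unevenly they are
# distributed across the positive and negative classes.
from collections import Counter

pos_docs = [["great", "plot"], ["great", "cast"]]
neg_docs = [["boring", "plot"]]

def top_n_features(n):
    pos, neg = Counter(), Counter()
    for d in pos_docs:
        pos.update(d)
    for d in neg_docs:
        neg.update(d)
    # a crude "predictiveness" score: class-frequency difference
    scores = {w: abs(pos[w] - neg[w]) for w in set(pos) | set(neg)}
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(top_n_features(2))  # "great" ranks first
```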
Now the extracted features are reduced, and the classification
of the sentiment based on the polarity and intensity of the text
using any of the machine learning approaches is to be done.
3. SENTIMENT CLASSIFICATION
Sentiment Classification broadly refers to binary
categorization, multi-class categorization, regression and
ranking. Sentiment Classification mainly consists of two
important tasks, including sentiment polarity assignment and
sentiment intensity assignment [49]. Sentiment polarity
assignment deals with analyzing whether a text has a positive, negative, or neutral semantic orientation. Sentiment intensity assignment deals with analyzing whether the positive or negative sentiments are mild or strong. There are several tasks
in order to achieve the goals of Sentiment Analysis. These tasks
include sentiment or opinion detection, polarity classification
and discovery of the opinion’s target. A wide range of tools and
techniques are used to tackle the difficulties in order to achieve
Sentiment Analysis. The various methodologies used in order to
achieve Sentiment Classification are: 1) classification with respect to term frequency, n-grams, negations or parts of speech; 2) identification of the semantic orientation of words using lexicons, statistical techniques and training documents; 3) identification of the semantic orientation of sentences and phrases; 4) identification of the semantic orientation of documents; 5) object feature extraction; 6) comparative sentence identification.
3.1 POLARITY ASSIGNMENT
The Sentiment Polarity Classification is a binary
classification task where an opinionated document is labeled
with an overall positive or negative sentiment. Sentiment
Polarity Classification can also be termed as a binary decision
task. The input to the Sentiment Classifier may or may not be opinionated. When a news article is given as an input,
analyzing and classifying it as a good or bad news is considered
to be a text categorization task as in [5]. Furthermore, this piece
of information can be good or bad news, but not necessarily
subjective (i.e., without expressing the view of the author).
Summarizing reviews in order to collect information on why the reviewers liked or disliked the product is another way of mining opinion. Determining the polarity of the outcomes described in medical texts is yet another type of
categorization, related to the degree of positivity, as in [6]. A few other problems related to the determination of the degree of positivity involve the analysis of comparative sentences. Automated
opinion mining often uses machine learning, a component of
artificial intelligence.
3.2 INTENSITY ASSIGNMENT
While sentiment polarity assignment deals with analyzing whether a text has a positive, negative, or neutral semantic orientation, sentiment intensity assignment deals with analyzing whether the positive or negative sentiments are mild or strong. Consider the two phrases “I don’t like you” and “I hate you”: both would be assigned a negative semantic orientation, but the latter would be considered more intense than the former [49]. Effectively classifying sentiment polarities and
intensities entails the use of classification methods applied to
linguistic features. While several classification methods have
been employed for opinion mining, Support Vector Machine
(SVM) has outperformed various techniques including Naive
Bayes, Decision Trees, Winnow, etc. [67], [68] and [69].
4. MACHINE LEARNING APPROACHES
The aim of Machine Learning is to develop an algorithm so
as to optimize the performance of the system using example data
or past experience. The Machine Learning provides a solution to
the classification problem that involves two steps: 1) Learning
the model from a corpus of training data 2) Classifying the
unseen data based on the trained model. In general, classification
tasks are often divided into several sub-tasks:
1) Data preprocessing
2) Feature selection and/or feature reduction
3) Representation
4) Classification
5) Post processing
Feature selection and feature reduction attempt to reduce the
dimensionality (i.e. the number of features) for the remaining
steps of the task. The classification phase of the process finds the
actual mapping between patterns and labels (or targets). Active learning, a kind of machine learning, is a promising way for sentiment classification to reduce the annotation cost [65]. The
following are some of the Machine Learning approaches
commonly used for Sentiment Classification.
4.1 NAIVE BAYES CLASSIFICATION
It is an approach to text classification that assigns the class
c* = argmax_c P(c | d) to a given document d. A naive Bayes
classifier is a simple probabilistic classifier based on Bayes'
theorem and is particularly suited when the dimensionality of the
inputs is high. Its underlying probability model can be
described as an "independent feature model". The Naive Bayes
(NB) classifier uses Bayes’ rule as in Eq.(1),

    P(c \mid d) = \frac{P(c)\, P(d \mid c)}{P(d)}    (1)
where, P(d) plays no role in selecting c*. To estimate the term
P(d \mid c), Naive Bayes decomposes it by assuming the f_i’s are conditionally independent given d’s class, as in Eq.(2),

    P_{NB}(c \mid d) = \frac{P(c) \prod_{i=1}^{m} P(f_i \mid c)^{n_i(d)}}{P(d)}    (2)

where m is the number of features, f_i is the i-th feature, and n_i(d) is the number of times f_i occurs in document d.
Consider a training method consisting of relative-frequency estimation of P(c) and P(f_i \mid c). Despite its simplicity and the fact
that its conditional independence assumption clearly does not
hold in real-world situations, Naive Bayes-based text
categorization still tends to perform surprisingly well [13];
indeed, Naive Bayes is optimal for certain problem classes with highly dependent features [29].
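A bare-bones Naive Bayes sentiment classifier in the spirit of Eq.(1)–(2) can be sketched as follows. The four-document training set is invented for illustration, and we use add-one smoothing on top of relative-frequency estimates, a common assumption not spelled out in the text.

```python
# Minimal Naive Bayes sentiment classifier with add-one smoothing,
# computed in log space to avoid underflow.
import math
from collections import Counter

train = [(["great", "cast"], "pos"), (["superb", "plot"], "pos"),
         (["boring", "plot"], "neg"), (["awful", "mess"], "neg")]
vocab = {w for doc, _ in train for w in doc}

def classify(doc):
    best, best_lp = None, -math.inf
    for c in ("pos", "neg"):
        docs = [d for d, lbl in train if lbl == c]
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values())
        lp = math.log(len(docs) / len(train))          # log P(c)
        for w in doc:                                  # + sum of log P(f_i|c)
            lp += math.log((counts[w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

print(classify(["superb", "cast"]))  # pos
```

P(d) from Eq.(1) is omitted, since it plays no role in selecting c*.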
4.2 MAXIMUM ENTROPY
Maximum Entropy (ME) classification is yet another
technique, which has proven effective in a number of natural
language processing applications [26]. Sometimes, it
outperforms Naive Bayes at standard text classification [27]. Its
estimate of P(c \mid d) takes the exponential form as in Eq.(3),

    P_{ME}(c \mid d) = \frac{1}{Z(d)} \exp\Big( \sum_i \lambda_{i,c}\, F_{i,c}(d, c) \Big)    (3)

where Z(d) is a normalization function, the \lambda_{i,c} are feature-weight parameters, and F_{i,c} is a feature/class function for feature f_i and class c, as in Eq.(4),
    F_{i,c}(d, c') = \begin{cases} 1, & \text{if } f_i \text{ occurs in } d \text{ and } c' = c \\ 0, & \text{otherwise} \end{cases}    (4)
For instance, a particular feature/class function might fire if
and only if the bigram “still hate” appears and the document’s
sentiment is hypothesized to be negative. Importantly, unlike
Naive Bayes, Maximum Entropy makes no assumptions about
the relationships between features and so might potentially
perform better when conditional independence assumptions are
not met.
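The exponential form and the feature/class functions can be sketched numerically. The two lambda weights below are made-up numbers, and the feature test is a crude substring check; a real Maximum Entropy system learns the weights by maximum-likelihood training.

```python
# Hand-rolled sketch of the Maximum Entropy estimate: feature/class
# indicator functions plus an exponential form normalized over classes.
import math

CLASSES = ("pos", "neg")
LAMBDAS = {("still hate", "neg"): 2.0,   # lambda_{i,c}: toy weights
           ("great", "pos"): 1.5}

def F(feature, c, doc, c_hyp):
    # fires iff the feature occurs in the document (crude substring test)
    # AND the hypothesized class matches the function's class
    return 1.0 if feature in doc and c_hyp == c else 0.0

def p_me(c_hyp, doc):
    def score(c):
        return math.exp(sum(lam * F(f, fc, doc, c)
                            for (f, fc), lam in LAMBDAS.items()))
    z = sum(score(c) for c in CLASSES)   # Z(d): normalization over classes
    return score(c_hyp) / z

print(round(p_me("neg", "I still hate this phone"), 3))  # 0.881
```

The bigram “still hate” fires only for the negative class, so the model assigns that document a high probability of being negative.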
4.3 SUPPORT VECTOR MACHINES
Support vector machines (SVMs) have been shown to be
highly effective at traditional text categorization, generally
outperforming Naive Bayes [40]. They are large-margin, rather
than probabilistic, classifiers, in contrast to Naive Bayes and
Maximum Entropy. In the two-category case, the basic idea
behind the training procedure is to find a maximum margin
hyperplane, represented by vector \vec{w}, that not only separates the
document vectors in one class from those in the other, but for
which the separation, or margin, is as large as possible. This
corresponds to a constrained optimization problem; letting c_j \in \{1, -1\} (corresponding to positive and negative) be the correct class of document d_j, the solution can be written as in Eq.(5),

    \vec{w} := \sum_j \alpha_j c_j \vec{d}_j, \quad \alpha_j \ge 0    (5)
where the \alpha_j’s (Lagrange multipliers) are obtained by solving a dual optimization problem. Those \vec{d}_j such that \alpha_j is greater than zero are called support vectors, since they are the only document vectors contributing to \vec{w}. Classification of test
instances consists simply of determining which side of \vec{w}’s hyperplane they fall on.
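The construction of w from support vectors and the side-of-hyperplane decision rule can be sketched with two-dimensional made-up numbers; in practice the alphas come from solving the dual optimization problem, which is omitted here.

```python
# Numeric sketch of the SVM decision rule: w is a weighted sum of
# support vectors, and a test document is classified by the sign of w·x.
support = [([1.0, 0.0], 1, 0.6),    # (d_j, c_j, alpha_j) -- toy values
           ([0.0, 1.0], -1, 0.6)]

# w = sum_j alpha_j * c_j * d_j
w = [sum(a * c * d[k] for d, c, a in support) for k in range(2)]

def classify(doc):
    # which side of w's hyperplane does the document vector fall on?
    s = sum(wk * xk for wk, xk in zip(w, doc))
    return "positive" if s > 0 else "negative"

print(w)                     # [0.6, -0.6]
print(classify([3.0, 1.0]))  # positive
```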
5. APPLICATIONS
There are quite a large number of companies, big and small,
that have opinion mining and sentiment analysis as part of their
mission. Review-oriented search engines basically use sentiment
classification techniques. Opinion Mining proves itself to be an
important part of search engines. Topics need not be restricted to
product reviews, but could include opinions about candidates
running for office, political issues, and so forth. Summarizing
user reviews is an important problem. One could also imagine
that errors in user ratings could be fixed: there are cases where
users have clearly accidentally selected a low rating when their
review indicates a positive evaluation [3].
Sentiment-analysis and opinion-mining systems also have a
potential role in imparting sub-component technology for other
systems. Specifically, sentiment analysis system is an
augmentation to recommendation systems[14, 15]; since it might
recommend such a system not to suggest items that receive a lot
of negative feedback.
Detection of “flames” (overly-heated opposition) in email or
other types of communication is another possible use of
subjectivity detection [4]. In online systems that display ads as
sidebars, it is helpful to detect web pages that contain sensitive
content inappropriate for ads placement [16]; for more
sophisticated systems, it could be useful to bring up product ads
when relevant positive sentiments are detected. It has also been
argued that information extraction can be improved by
discarding information found in subjective sentences [17].
Question answering is another area where sentiment
analysis can prove useful [18, 19, and 20]. For example,
opinion-oriented questions may require different treatment. For
definitional questions, providing an answer that includes more
information about how an entity is viewed may better inform the
user [18]. Summarization may also benefit from accounting for
multiple viewpoints [21]. One effort seeks to use semantic
orientation to track literary reputation [22]. In general, the
computational treatment of affect has been motivated in part by
the desire to improve human-computer interaction [23, 54 and
55].
It is the breadth of opportunities, the promising ways text analytics can be applied to extract and analyze attitudinal information from sources as varied as articles, blog postings, e-mail, call-center notes and survey responses, together with the difficulty of the technical challenges, that make existing and emerging applications so interesting. Three other applications include influence networks, assessment of marketing response, and customer experience management/enterprise feedback management.
6. FUTURE CHALLENGES
There are several challenges in analyzing the sentiment of
the web user reviews. First, a word that is considered to be
positive in one situation may be considered negative in another
situation. Take the word "long" for instance. If a customer said a
laptop's battery life was long, that would be a positive opinion.
If the customer said that the laptop's start-up time was long, however, that would be a negative opinion [49]. These
differences mean that an opinion system trained to gather
opinions on one type of product or product feature may not
perform very well on another.
Another challenge arises because people don't always express opinions the same way. Most traditional text processing relies on the fact that small differences between two pieces of text don't change the meaning very much. In opinion mining, however, "the movie was great" is very different from "the movie was not great" [50].
People can be contradictory in their statements. Most reviews
will have both positive and negative comments, which is
somewhat manageable by analyzing sentences one at a time.
However, the more informal the medium (Twitter or blogs, for
example), the more likely people are to combine different
opinions in the same sentence. For example: "the movie bombed
even though the lead actor rocked it" is easy for a human to
understand, but more difficult for a computer to parse.
Sometimes even other people have difficulty understanding what
someone thought based on a short piece of text because it lacks
context. For example, "That movie was as good as his last one"
is entirely dependent on what the person expressing the opinion
thought of the previous film.
Pragmatics is a subfield of linguistics which studies the ways
in which context contributes to meaning. It is important to detect
the pragmatics of user opinion which may change the sentiment
thoroughly. Capitalization can be used with subtlety to denote
sentiment. In the examples given below, the first example
denotes a positive sentiment whereas the second denotes a
negative sentiment.
I just finished watching THE DESTROY.
That completely destroyed me.
Another challenge is in identifying the entity. A text or
sentence may have multiple entities associated with it. It is
extremely important to find out the entity towards which the
opinion is directed. Consider the following examples [70].
Sony is better than Samsung.
Raman defeated Ravanan in football.
The examples are positive for Sony and Raman respectively
but negative for Samsung and Ravanan.
7. CONCLUSION
This survey discusses various approaches to Opinion Mining
and Sentiment Analysis. It provides a detailed view of different
applications and potential challenges of Opinion Mining that
makes it a difficult task. Some of the machine learning techniques, like Naïve Bayes, Maximum Entropy and Support Vector Machines, have been discussed. Many of the applications of Opinion Mining are based on the bag-of-words model, which does not capture the context that is essential for Sentiment Analysis. The
recent developments in Sentiment Analysis and its related sub-
tasks are also presented. The state of the art of existing
approaches has been described with the focus on the following
tasks: Subjectivity detection, Word Sense Disambiguation,
Feature Extraction and Sentiment Classification using various
Machine learning techniques. Finally, the future challenges and
directions so as to further enhance the research in the field of
Opinion Mining and Sentiment Classification are discussed.
REFERENCES
[1] Alexander Pak and Patrick Paroubek, “Twitter as a corpus
for Sentiment Analysis and Opinion Mining”, Proceedings
of the Seventh conference on International Language
Resources and Evaluation, pp. 1320-1326, 2010.
[2] Bo Pang, Lillian Lee and Shivakumar Vaithyanathan, “Thumbs up?
Sentiment Classification using Machine Learning
Techniques”, Proceedings of the ACL-02 Conference on
Empirical Methods in Natural Language Processing, Vol.
10, pp. 79-86, 2002.
[3] Luis Cabral and Ali Hortacsu, “The dynamics of seller
reputation: Theory and evidence from eBay”, The Journal
of Industrial Economics, Vol. 58, No. 1, pp. 54-78, 2010.
[4] Ellen Spertus, “Smokey: Automatic recognition of hostile
messages”, Proceedings of Innovative Applications of
Artificial Intelligence, pp. 1058-1065, 1997.
[5] Sanjiv Das, Asis Martinez-Jerez and Peter Tufano, “e-
Information: A clinical study of investor discussion and
sentiment”, Financial Management, Vol. 34, No. 3, pp.
103-137, 2005.
[6] Yun Niu, Xiaodan Zhu, Jianhua Li and Graeme Hirst,
“Analysis of polarity information in medical text”,
Proceedings of the American Medical Informatics
Association, Annual Symposium, pp. 570-574, 2005.
[7] Shitanshu Verma and Pushpak Bhattacharyya,
“Incorporating Semantic Knowledge for Sentiment
Analysis”, Proceedings of International Conference on
Natural Language Processing, 2009.
[8] Ellen Riloff and Janyce Wiebe, “Learning extraction
patterns for subjective expressions”, Proceedings of the
Conference on Empirical Methods in Natural Language
Processing, 2003.
[9] Gregory Grefenstette, Yan Qu, James G. Shanahan and
David A. Evans, “Coupling niche browsers and affect
analysis for an opinion mining application”, Proceedings of
RIAO (Recherche d’Information Assist´ee par Ordinateur)
Conference, pp. 186-194, 2004.
[10] Paul Ekman, “Emotion in the Human Face”, Cambridge
University Press, second edition, 1982.
[11] Lisa Hankin, “The effects of user reviews on online
purchasing behavior across multiple product categories”,
Master’s final project report, UC Berkeley School of
Information, 2007.
[12] George Forman, “An Extensive Empirical study of feature
selection Metrics for Text Classification”, Journal of
Machine Learning Research, Vol. 3, pp. 1289-1305, 2003.
[13] David D. Lewis, “Naive Bayes at forty: The independence assumption in information retrieval”, Proceedings of the European Conference on Machine Learning, No. 1398, pp. 4-15, 1998.
[14] Junichi Tatemura, “Virtual reviewers for collaborative exploration of movie reviews”, Proceedings of Intelligent User Interfaces, pp. 272-275, 2000.
[15] Loren Terveen, Will Hill, Brian Amento, David McDonald and Josh Creter, “PHOAKS: A system for sharing recommendations”, Communications of the Association for Computing Machinery, Vol. 40, No. 3, pp. 59-62, 1997.
[16] Xin Jin, Ying Li, Teresa Mah and Jie Tong, “Sensitive
webpage classification for content advertising”,
Proceedings of the International Workshop on Data
Mining and Audience Intelligence for Advertising, pp. 28-
33, 2007.
[17] Ellen Riloff, Janyce Wiebe and William Phillips, “Exploiting subjectivity classification to improve information extraction”, Proceedings of Association for the Advancement of Artificial Intelligence, pp. 1106-1111, 2005.
[18] Lucian Vlad Lita, Andrew Hazen Schlaikjer, Wei-Chang Hong and Eric Nyberg, “Qualitative dimensions in question answering: Extending the definitional QA task”, Proceedings of Association for the Advancement of Artificial Intelligence, pp. 1616-1617, 2005.
[19] Swapna Somasundaran, Theresa Wilson, Janyce Wiebe and Veselin Stoyanov, “QA with attitude: Exploiting opinion type analysis for improving question answering in on-line discussions and the news”, Proceedings of the International Conference on Weblogs and Social Media, 2007.
[20] Veselin Stoyanov, Claire Cardie and Janyce Wiebe, “Multi-perspective question answering using the OpQA corpus”, Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing, pp. 923-930, 2005.
[21] Yohei Seki, Koji Eguchi, Noriko Kando and Masaki Aono, “Multi-document summarization with subjectivity analysis”, Proceedings of the Document Understanding Conference, 2005.
[22] Maite Taboada, Mary Ann Gillies and Paul McFetridge, “Sentiment classification techniques for tracking literary reputation”, Language Resources and Evaluation Workshop: Towards Computational Models of Literary Analysis, pp. 36-43, 2006.
[23] Jackson Liscombe, Giuseppe Riccardi and Dilek Hakkani-Tür, “Using context to improve emotion detection in spoken dialog systems”, Interspeech, pp. 1845-1848, 2005.
[24] Hugo Liu, Henry Lieberman and Ted Selker, “A model of textual affect sensing using real-world knowledge”, Proceedings of Intelligent User Interfaces, pp. 125-132, 2003.
[25] Matt Thomas, Bo Pang and Lillian Lee, “Get out the vote: Determining support or opposition from Congressional floor-debate transcripts”, Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 327-335, 2006.
[26] Adam L. Berger, Stephen A. Della Pietra and Vincent J.
Della Pietra, “A maximum entropy approach to natural
language processing”, Computational Linguistics, Vol. 22, No. 1, pp. 39-71, 1996.
[27] Kamal Nigam, John Lafferty and Andrew McCallum, “Using maximum entropy for text classification”, Proceedings of the International Joint Conference on Artificial Intelligence Workshop on Machine Learning for Information Filtering, pp. 61-67, 1999.
[28] Andrew McCallum and Kamal Nigam, “A comparison of event models for Naive Bayes text classification”, Proceedings of the Association for the Advancement of Artificial Intelligence Workshop on Learning for Text Categorization, pp. 41-48, 1998.
[29] Pedro Domingos and Michael J. Pazzani, “On the optimality of the simple Bayesian classifier under zero-one loss”, Machine Learning, Vol. 29, No. 2-3, pp. 103-130, 1997.
[30] Stephen Della Pietra, Vincent Della Pietra and John Lafferty, “Inducing features of random fields”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 4, pp. 380-393, 1997.
[31] Stanley Chen and Ronald Rosenfeld, “A survey of smoothing techniques for ME models”, IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 1, pp. 37-50, 2000.
[32] Sanjiv Das and Mike Chen, “Yahoo! for Amazon:
Extracting market sentiment from stock message boards”,
Proceedings of the 8th Asia Pacific Finance Association
Annual Conference, 2001.
[33] Douglas Biber, “Variation across Speech and Writing”, Cambridge University Press, 1988.
[34] Shlomo Argamon-Engelson, Moshe Koppel and Galit Avneri, “Style-based text categorization: What newspaper am I reading?”, Proceedings of the Association for the Advancement of Artificial Intelligence Workshop on Text Categorization, pp. 1-4, 1998.
[35] Aidan Finn, Nicholas Kushmerick and Barry Smyth, “Genre classification and domain transfer for information filtering”, Proceedings of the European Colloquium on Information Retrieval Research, pp. 353-362, 2002.
[36] Vasileios Hatzivassiloglou and Kathleen McKeown, “Predicting the semantic orientation of adjectives”, Proceedings of the 35th Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, pp. 174-181, 1997.
[37] Vasileios Hatzivassiloglou and Janyce Wiebe, “Effects of adjective orientation and gradability on sentence subjectivity”, Proceedings of the 18th Conference on Computational Linguistics, Vol. 1, pp. 299-305, 2000.
[38] Marti Hearst, “Direction-based text interpretation as an information access refinement”, Text-Based Intelligent Systems, pp. 257-274, 1992.
[39] Alison Huettner and Pero Subasic, “Fuzzy typing for document management”, Association for Computational Linguistics, Companion Volume: Tutorial Abstracts and Demonstration Notes, pp. 26-27, 2000.
[40] Thorsten Joachims, “Text categorization with support vector machines: Learning with many relevant features”, Proceedings of the European Conference on Machine Learning, pp. 137-142, 1998.
[41] Michael Gamon, “Sentiment classification on customer feedback data: noisy data, large feature vectors, and the role of linguistic analysis”, Proceedings of the 20th International Conference on Computational Linguistics, pp. 841-847, 2004.
[42] Hsinchun Chen and David Zimbra, “AI and Opinion
Mining”, IEEE Intelligent Systems, Vol. 25, No. 3, pp. 74-
80, 2010.
[43] Hsinchun Chen, “AI and Opinion Mining, Part 2”, IEEE
Intelligent Systems, pp. 72-79, 2010.
[44] S. Shivashankar and B. Ravindran, “Multi Grain Sentiment
Analysis using Collective Classification”, Proceedings of
the European Conference on Artificial Intelligence,
pp. 823-828, 2010.
[45] George Stylios, Dimitris Christodoulakis, Jeries Besharat,
Maria-Alexandra Vonitsanou, Ioanis Kotrotsos, Athanasia
Koumpouri and Sofia Stamou, “Public Opinion Mining for
Governmental Decisions”, Electronic Journal of e-
Government, Vol. 8, No. 2, pp. 203-214, 2010.
[46] Anindya Ghose and Panagiotis G. Ipeirotis, “Estimating the Helpfulness and Economic Impact of Product Reviews: Mining Text and Reviewer Characteristics”, IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 10, pp. 1498-1512, 2011.
[47] Janyce Wiebe and Ellen Riloff, “Finding Mutual Benefit between Subjectivity Analysis and Information Extraction”, IEEE Transactions on Affective Computing, Vol. 2, No. 4, pp. 175-191, 2011.
[48] Huifeng Tang, Songbo Tan and Xueqi Cheng, “A survey on sentiment detection of reviews”, Expert Systems with Applications, Vol. 36, pp. 10760-10773, 2009.
[49] Ahmed Abbasi, Stephen France, Zhu Zhang and Hsinchun
Chen, “Selecting Attributes for Sentiment Classification
Using Feature Relation Networks”, IEEE Transactions on
Knowledge and Data Engineering, Vol. 23, No. 3, pp. 447-
462, 2011.
[50] Michael Wiegand and Alexandra Balahur, “A Survey on
the Role of Negation in Sentiment Analysis”, Proceedings
of the Workshop on Negation and Speculation in Natural
Language Processing, 2010.
[51] Thorsten Joachims, “Making large-scale SVM learning practical”, in Bernhard Schölkopf and Alexander Smola, editors, Advances in Kernel Methods - Support Vector Learning, pp. 41-56, 1999.
[52] Jussi Karlgren and Douglass Cutting, “Recognizing text genres with simple metrics using discriminant analysis”, Proceedings of Conference on Computational Linguistics, Vol. 2, pp. 1071-1075, 1994.
[53] Brett Kessler, Geoffrey Nunberg and Hinrich Schütze, “Automatic detection of text genre”, Proceedings of the 35th Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, pp. 32-38, 1997.
[54] Hugo Liu, Henry Lieberman and Ted Selker, “A model of textual affect sensing using real-world knowledge”, Proceedings of Intelligent User Interfaces, pp. 125-132, 2003.
[55] Junichi Tatemura, “Virtual reviewers for collaborative exploration of movie reviews”, Proceedings of Intelligent User Interfaces, pp. 272-275, 2000.
[56] Soo-Min Kim and Eduard Hovy, “Determining the Sentiment of Opinions”, Proceedings of the Conference on Computational Linguistics, Article No. 1367, 2004.
[57] Yuming Lin, Jingwei Zhang, Xiaoling Wang and Aoying
Zhou, “Sentiment Classification via Integrating Multiple
Feature Presentations”, WWW 2012 Poster Presentation,
pp. 569-570, 2012.
[58] M. Hall and L.A. Smith, “Feature Subset Selection: A Correlation Based Filter Approach”, Proceedings of the Fourth International Conference on Neural Information Processing and Intelligent Information Systems, pp. 855-858, 1997.
[59] G. Forman, “An Extensive Empirical Study of Feature Selection Metrics for Text Classification”, Journal of Machine Learning Research, Vol. 3, pp. 1289-1305, 2003.
[60] Bo Pang and Lillian Lee, “Opinion mining and sentiment analysis”, Foundations and Trends in Information Retrieval, Vol. 2, No. 1-2, pp. 1-135, 2008.
[61] Z. Zhang, “Weighing Stars: Aggregating Online Product Reviews for Intelligent E-Commerce Applications”, IEEE Intelligent Systems, Vol. 23, No. 5, pp. 42-49, 2008.
[62] A. Abbasi, H. Chen and A. Salem, “Sentiment Analysis in Multiple Languages: Feature Selection for Opinion Classification in Web Forums”, ACM Transactions on Information Systems, Vol. 26, No. 3, Article No. 12, 2008.
[63] Suge Wang, Deyu Li, Xiaolei Song, Yingjie Wei and Hongxia Li, “A feature selection method based on improved Fisher’s discriminant ratio for text Sentiment Classification”, Expert Systems with Applications, Vol. 38, No. 7, pp. 8696-8702, 2011.
[64] Q. Ye, Z. Zhang and R. Law, “Sentiment classification of online reviews to travel destinations by supervised machine learning approaches”, Expert Systems with Applications, Vol. 36, No. 3, part 2, pp. 6527-6535, 2009.
[65] Shoushan Li, Shengfeng Ju, Guodong Zhou and Xiaojun Li, “Active Learning for Imbalanced Sentiment Classification”, Proceedings of the International Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 139-148, 2012.
[66] Claudiu Cristian Musat, Alireza Ghasemi and Boi Faltings, “Sentiment Analysis Using a Novel Human Computation Game”, Proceedings of the 3rd Workshop on the People’s Web Meets NLP, pp. 1-9, 2012.
[67] A. Abbasi and H. Chen, “CyberGate: A System and Design Framework for Text Analysis of Computer Mediated Communication”, MIS Quarterly, Vol. 32, No. 4, pp. 811-837, 2008.
[68] A. Abbasi, H. Chen, S. Thoms and T. Fu, “Affect Analysis of Web Forums and Blogs Using Correlation Ensembles”, IEEE Transactions on Knowledge and Data Engineering, Vol. 20, No. 9, pp. 1168-1180, 2008.
[69] B. Pang and L. Lee, “A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts”, Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pp. 271-278, 2004.
[70] Subhabrata Mukherjee, “Sentiment Analysis-A Literature
Survey”, Indian Institute of Technology, Bombay,
Department of Computer Science and Engineering, 2012.
[71] G.A. Van Kleef, A.C. Homan, B. Beersma, D. Van Knippenberg, B. Van Knippenberg and F. Damen, “Searing sentiment or cold calculation? The effects of leader emotional displays on team performance depend on follower epistemic motivation”, IEEE Engineering Management Review, Vol. 40, No. 1, pp. 73-94, 2012.
[72] Chenghua Lin, Yulan He, R. Everson and S. Ruger, “Weakly Supervised Joint Sentiment-Topic Detection from Text”, IEEE Transactions on Knowledge and Data Engineering, Vol. 24, No. 6, 2012.
[73] A. Neviarouskaya, H. Prendinger and M. Ishizuka, “SentiFul: A Lexicon for Sentiment Analysis”, IEEE Transactions on Affective Computing, Vol. 2, No. 1, pp. 22-36, 2011.
[74] X. Yu, Y. Liu, X. Huang and A. An, “Mining Online Reviews for Predicting Sales Performance: A Case Study in the Movie Domain”, IEEE Transactions on Knowledge and Data Engineering, Vol. 24, No. 4, pp. 720-734, 2012.
[75] C.L. Liu, W.H. Hsaio, C.H. Lee, G.C. Lu and E. Jou, “Movie Rating and Review Summarization in Mobile Environment”, IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews, Vol. 42, No. 3, pp. 397-407, 2012.
[76] D. Bollegala, D. Weir and J. Carroll, “Cross-Domain Sentiment Classification using a Sentiment Sensitive Thesaurus”, IEEE Transactions on Knowledge and Data Engineering, Vol. PP, No. 99, 2012.
[77] A.R. Naradhipa and A. Purwarianti, “Sentiment
classification for Indonesian message in social media”,
International Conference on Cloud Computing and Social
Networking, pp. 1-5, 2012.