OpinionFinder: A system for subjectivity analysis
Theresa Wilson, Paul Hoffmann, Swapna Somasundaran, Jason Kessler,
Janyce Wiebe†‡, Yejin Choi§, Claire Cardie§, Ellen Riloff, Siddharth Patwardhan
Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15260
Department of Computer Science, University of Pittsburgh, Pittsburgh, PA 15260
§Department of Computer Science, Cornell University, Ithaca, NY 14853
School of Computing, University of Utah, Salt Lake City, UT 84112
{twilson,hoffmanp,swapna,wiebe}@cs.pitt.edu,
{ychoi,cardie}@cs.cornell.edu,{riloff,sidd}@cs.utah.edu
1 Introduction
OpinionFinder is a system that performs subjectivity analysis, automatically identifying when opinions, sentiments, speculations, and other private states are present in text. Specifically, OpinionFinder aims to identify subjective sentences and to mark various aspects of the subjectivity in these sentences, including the source (holder) of the subjectivity and words that are included in phrases expressing positive or negative sentiments.
Our goal with OpinionFinder is to develop a system capable of supporting other Natural Language Processing (NLP) applications by providing them with information about the subjectivity in documents. Of particular interest are question answering systems that focus on being able to answer opinion-oriented questions, such as the following:

Was the election in Iran regarded as fair?
Is support diminishing for the war in Iraq?

To answer these types of questions, a system needs to be able to identify when opinions are expressed in text and who is expressing them. Other applications that would benefit from knowledge of subjective language include systems that summarize the various viewpoints in a document or that mine product reviews. Even typical fact-oriented applications, such as information extraction, can benefit from subjectivity analysis by filtering out opinionated sentences (Riloff et al., 2005).
2 OpinionFinder
OpinionFinder runs in two modes, batch and interactive. Document processing is largely the same for both modes. In batch mode, OpinionFinder takes a list of documents to process. Interactive mode provides a front-end that allows a user to query on-line news sources for documents to process.
2.1 System Architecture Overview
OpinionFinder operates as one large pipeline. Conceptually, the pipeline can be divided into two parts. The first part performs mostly general-purpose document processing (e.g., tokenization and part-of-speech tagging). The second part performs the subjectivity analysis. The results of the subjectivity analysis are returned to the user in the form of SGML/XML markup of the original documents.
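To make the two-part design concrete, the following Python sketch mimics the overall flow: a general-purpose processing stage, a subjectivity-analysis stage, and serialization of the results as inline markup. The function names, the toy clue list, and the tag names are illustrative assumptions only, not OpinionFinder's actual components or output schema.

# Illustrative sketch of the two-part pipeline described above.
# All names and tags are hypothetical; the real system's markup differs.

def document_processing(text):
    """Stage 1: general-purpose processing (here, naive sentence splitting and tokenization)."""
    sentences = text.split(". ")
    return [{"text": s, "tokens": s.split()} for s in sentences if s]

def subjectivity_analysis(sentences):
    """Stage 2: add subjectivity annotations to each processed sentence (toy clue lookup)."""
    clues = {"fear", "fears", "hopes"}
    for s in sentences:
        s["subjective"] = any(tok.lower() in clues for tok in s["tokens"])
    return sentences

def to_markup(sentences):
    """Serialize results as inline markup over the original sentences (illustrative tags)."""
    out = []
    for s in sentences:
        tag = "subj" if s["subjective"] else "obj"
        out.append('<sentence type="%s">%s</sentence>' % (tag, s["text"]))
    return "\n".join(out)

print(to_markup(subjectivity_analysis(document_processing(
    "The report was released today. Officials fear the numbers will worsen."))))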
2.2 Document Processing
For general document processing, OpinionFinder first runs the Sundance partial parser (Riloff and Phillips, 2004) to provide semantic class tags, identify Named Entities, and match extraction patterns that correspond to subjective language (Riloff and Wiebe, 2003). Next, OpenNLP 1.1.0 (http://opennlp.sourceforge.net/) is used to tokenize, sentence-split, and part-of-speech tag the data, and the Abney stemmer in SCOL version 1g (http://www.vinartus.net/spa/) is used to stem. In batch mode, OpinionFinder parses the data again, this time to obtain constituency parse trees (Collins, 1997), which are then converted to dependency parse trees (Xia and Palmer, 2001). Currently, this stage is only included for batch mode processing due to the time required for parsing. Finally, a clue-finder is run to identify words and phrases from a large subjective language lexicon.
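The Python sketch below illustrates the ordering of these preprocessing steps (sentence splitting, tokenization, part-of-speech tagging, and stemming) using NLTK as a stand-in; the actual system uses Sundance, OpenNLP, and the Abney stemmer rather than the substitutes shown here.

# Stand-in for the preprocessing order described above, using NLTK in place
# of Sundance/OpenNLP/SCOL. Requires the 'punkt' and
# 'averaged_perceptron_tagger' NLTK data packages.
import nltk
from nltk.stem.porter import PorterStemmer  # Porter stemmer stands in for the Abney stemmer

def preprocess(document_text):
    stemmer = PorterStemmer()
    processed = []
    for sent in nltk.sent_tokenize(document_text):       # sentence splitting
        tokens = nltk.word_tokenize(sent)                 # tokenization
        tagged = nltk.pos_tag(tokens)                     # part-of-speech tagging
        stems = [stemmer.stem(tok) for tok in tokens]     # stemming
        processed.append({"tokens": tokens, "pos": tagged, "stems": stems})
    return processed

print(preprocess("The senator said the plan was a disaster. Aides fear a backlash."))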
2.3 Subjectivity Analysis
The subjectivity analysis has four components.
2.3.1 Subjective Sentence Classification
The first component is a Naive Bayes classifier that distinguishes between subjective and objective sentences using a variety of lexical and contextual features (Wiebe and Riloff, 2005; Riloff and Wiebe, 2003). The classifier is trained using subjective and objective sentences, which are automatically generated from a large corpus of unannotated data by two high-precision, rule-based classifiers.
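A minimal sketch of this training setup follows, assuming toy clue lists and scikit-learn in place of the actual feature set and implementation: high-precision rules label a handful of unannotated sentences (abstaining when unsure), and a Naive Bayes classifier is trained on the resulting labels.

# Sketch of training a subjective/objective Naive Bayes classifier on
# automatically labeled sentences. The clue lists are toy placeholders.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

SUBJ_CLUES = {"outrageous", "fears", "terrible", "hopes"}
OBJ_CLUES = {"said", "percent", "reported"}

def rule_label(sentence):
    """High-precision rule: label only when strong clues appear; otherwise abstain."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    if words & SUBJ_CLUES and not words & OBJ_CLUES:
        return "subjective"
    if words & OBJ_CLUES and not words & SUBJ_CLUES:
        return "objective"
    return None  # abstain on unclear cases

unannotated = [
    "The decision was outrageous and terrible.",
    "The company reported earnings of 3 percent.",
    "Officials said the vote was held on Tuesday.",
    "She fears the policy hopes will collapse.",
]
labeled = [(s, y) for s in unannotated if (y := rule_label(s)) is not None]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform([s for s, _ in labeled])
clf = MultinomialNB().fit(X, [y for _, y in labeled])
print(clf.predict(vectorizer.transform(["The verdict was terrible."])))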
2.3.2 Speech Event and Direct Subjective Expression Classification
The second component identifies speech events (e.g., “said,” “according to”) and direct subjective expressions (e.g., “fears,” “is happy”). Speech events include both speaking and writing events. Direct subjective expressions are words or phrases where an opinion, emotion, sentiment, etc. is directly described. A high-precision, rule-based classifier is used to identify these expressions.
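A rule-based matcher of this kind can be approximated as below; the tiny phrase lists are placeholders for the much larger sets of expressions the actual classifier covers.

# Illustrative high-precision matcher for speech events and direct
# subjective expressions; patterns here are toy examples.
import re

SPEECH_EVENTS = [r"\bsaid\b", r"\baccording to\b", r"\bannounced\b"]
DIRECT_SUBJECTIVE = [r"\bfears?\b", r"\bis happy\b", r"\bpraised\b"]

def find_expressions(sentence):
    hits = []
    for pat in SPEECH_EVENTS:
        for m in re.finditer(pat, sentence, re.IGNORECASE):
            hits.append(("speech-event", m.group(), m.start()))
    for pat in DIRECT_SUBJECTIVE:
        for m in re.finditer(pat, sentence, re.IGNORECASE):
            hits.append(("direct-subjective", m.group(), m.start()))
    return hits

print(find_expressions("According to the minister, she fears the talks will fail."))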
2.3.3 Opinion Source Identification
The third component is a source identifier that combines a Conditional Random Field sequence tagging model (Lafferty et al., 2001) and extraction pattern learning (Riloff, 1996) to identify the sources of speech events and direct subjective expressions (Choi et al., 2005). The source of a speech event is the speaker; the source of a subjective expression is the experiencer of the private state. The source identifier is trained on the MPQA Opinion Corpus (freely available at http://nrrc.mitre.org/NRRC/publications.htm) using a variety of features, including those obtained from the dependency parse. Because the source identifier relies on dependency parse information, it is currently only included in batch mode.
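The following sketch casts source identification as BIO sequence tagging with a CRF, using the sklearn-crfsuite package; the toy training data and the small feature set (word form, POS tag, dependency relation, previous POS tag) only approximate the features described in Choi et al. (2005).

# Source identification as sequence tagging with a CRF (sklearn-crfsuite).
import sklearn_crfsuite

def token_features(sent, i):
    word, pos, deprel = sent[i]
    return {"word.lower": word.lower(), "pos": pos, "deprel": deprel,
            "prev.pos": sent[i - 1][1] if i > 0 else "BOS"}

# Toy training data: (word, POS, dependency relation) with BIO source labels.
train_sents = [
    [("The", "DT", "det"), ("minister", "NN", "nsubj"), ("said", "VBD", "root"),
     ("taxes", "NNS", "dobj"), ("will", "MD", "aux"), ("rise", "VB", "xcomp")],
    [("Analysts", "NNS", "nsubj"), ("fear", "VBP", "root"),
     ("a", "DT", "det"), ("recession", "NN", "dobj")],
]
train_labels = [["B-SOURCE" if tok[2] == "nsubj" else "O" for tok in sent]
                for sent in train_sents]

X = [[token_features(sent, i) for i in range(len(sent))] for sent in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, train_labels)
print(crf.predict(X[:1]))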
2.3.4 Sentiment Expression Classification
The final component uses two classifiers to identify words contained in phrases that express positive or negative sentiments (Wilson et al., 2005). The first classifier focuses on identifying sentiment expressions. The second classifier takes the sentiment expressions and identifies those that are positive and negative. Both classifiers were developed using BoosTexter (Schapire and Singer, 2000) and trained on the MPQA Corpus.
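A rough sketch of the two-stage setup follows, with AdaBoost over word-count features standing in for BoosTexter and a few hand-made phrases standing in for MPQA training data: the first classifier flags candidate sentiment expressions, and the second assigns polarity only to the flagged ones.

# Two-stage sentiment expression classification; AdaBoost is a stand-in
# for BoosTexter, and the phrase lists are placeholders for MPQA data.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

expressions = ["great success", "bitter failure", "annual meeting",
               "strongly condemned", "warmly welcomed", "quarterly report"]
is_sentiment = [1, 1, 0, 1, 1, 0]                      # stage 1 labels
polarity = {"great success": "positive", "bitter failure": "negative",
            "strongly condemned": "negative", "warmly welcomed": "positive"}

stage1 = make_pipeline(CountVectorizer(), AdaBoostClassifier(n_estimators=50))
stage1.fit(expressions, is_sentiment)

sentiment_exprs = [e for e in expressions if e in polarity]
stage2 = make_pipeline(CountVectorizer(), AdaBoostClassifier(n_estimators=50))
stage2.fit(sentiment_exprs, [polarity[e] for e in sentiment_exprs])

for phrase in ["bitter dispute", "warmly praised"]:
    if stage1.predict([phrase])[0] == 1:
        print(phrase, "->", stage2.predict([phrase])[0])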
3 Related Work
Please see (Wiebe and Riloff, 2005; Choi et al., 2005; Wilson et al., 2005) for related work in automatic opinion and sentiment analysis.
4 Acknowledgments
This work was supported by the Advanced Research and Development Activity (ARDA), by the National Science Foundation under grants IIS-0208028, IIS-0208798, and IIS-0208985, and by the Xerox Foundation.
References
Y. Choi, C. Cardie, E. Riloff, and S. Patwardhan. 2005. Identifying sources of opinions with conditional random fields and extraction patterns. In HLT/EMNLP 2005.
M. Collins. 1997. Three generative, lexicalised models for statistical parsing. In ACL-97.
J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML-2001.
E. Riloff and W. Phillips. 2004. An Introduction to the Sundance and AutoSlog Systems. Technical Report UUCS-04-015, School of Computing, University of Utah.
E. Riloff and J. Wiebe. 2003. Learning extraction patterns for subjective expressions. In EMNLP-2003.
E. Riloff, J. Wiebe, and W. Phillips. 2005. Exploiting subjectivity classification to improve information extraction. In AAAI-2005.
E. Riloff. 1996. Automatically generating extraction patterns from untagged text. In AAAI/IAAI, Vol. 2.
R. E. Schapire and Y. Singer. 2000. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168.
J. Wiebe and E. Riloff. 2005. Creating subjective and objective sentence classifiers from unannotated texts. In CICLing-2005.
T. Wilson, J. Wiebe, and P. Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In HLT/EMNLP 2005.
F. Xia and M. Palmer. 2001. Converting dependency structures to phrase structures. In HLT-2001.