Leveraging the Crowd to Improve Feature-Sentiment
Analysis of User Reviews
Shih-Wen Huang1, Pei-Fen Tu1, Wai-Tat Fu1, Mohammad Amanzadeh2
Department of Computer Science1, Industrial & Enterprise System Engineering2
University of Illinois at Urbana-Champaign
{shuang51, ptu3, wfu, amanzad2}@illinois.edu
ABSTRACT
Crowdsourcing and machine learning are both useful tech-
niques for solving difficult problems (e.g., computer vision
and natural language processing). In this paper, we propose
a novel method that harnesses and combines the strength of
these two techniques to better analyze the features and the
sentiments toward them in user reviews. To strike a good
balance between reducing information overload and provid-
ing the original context expressed by review writers, the pro-
posed system (1) allows users to interactively rank the enti-
ties based on feature-rating, (2) automatically highlights sen-
tences that are related to relevant features, and (3) utilizes im-
plicit crowdsourcing by encouraging users to provide correct
labels of their own reviews to improve the feature-sentiment
classifier. The proposed system not only helps users to save
time and effort to digest the often massive amount of user re-
views, but also provides real-time suggestions on relevant fea-
tures and ratings as users generate their own reviews. Results
from a simulation experiment show that leveraging the
crowd can significantly improve the feature-sentiment anal-
ysis of user reviews. Furthermore, results from a user study
show that the proposed interface was preferred by more par-
ticipants than interfaces that use traditional noun-adjective
pairs summarization, as the current interface allows users to
view feature-related information in the original context.
Author Keywords
Human computation, crowdsourcing, interactive machine
learning, sentiment analysis, user generated content
ACM Classification Keywords
H.5.2 [Information interfaces and presentation]: User
Interfaces.
INTRODUCTION
With the rapid success of Web 2.0 technologies, user-
generated content has become a major source of online infor-
mation. One of the most notable examples is online reviews.
Millions of people now write reviews on websites like Yelp or
Amazon to express their opinions regarding different restau-
rants or products. These user-generated reviews can be very
helpful for others to make wiser decisions.
However, extensive amounts of user-generated reviews are
difficult for people to digest, creating a typical problem of
information overload. There are two potentially conflicting
goals when designing systems that leverage user-generated
reviews. First, the system should mitigate information over-
load by summarizing important information for the users;
second, the system should allow users to express their ex-
periences in their own words while at the same time allowing
others to read their reviews in context. With the proposed
system, we aim to provide a balance between the two.
Summarizing the information in user reviews based on re-
lated features and sentiment has proven to be an effective
way to help users to digest the massive amount of informa-
tion more efficiently. For example, Review Spotlight [33]
performs feature-sentiment analysis and presents the results
using noun-adjective pairs. Yatani et al. [33] showed that this
interface can help users make decisions significantly faster,
which demonstrates that feature-sentiment information can
help users digest user-generated reviews more efficiently.
The feature-sentiment analysis in existing intelligent inter-
faces [13, 18, 33] typically incorporates two steps. First, it
finds the features by identifying nouns with high frequency
counts in the text. Second, after determining the features, it
uses the adjective near each feature and a predefined glossary
(e.g., SentiWordNet [9]) to determine the feature’s orienta-
tion. However, researchers have pointed out that this unsuper-
vised learning approach has two major disadvantages, which
we will summarize below.
First, as mentioned in [12], this kind of analysis typically can
discover only features that are explicitly discussed in the con-
tent. For example, consider the following sentence:
“While light, it will not easily fit in pockets.”
This sentence is related to the size of a product; however, it is
very difficult for this unsupervised learning approach to dis-
cover the feature because the word “size” does not appear in
the sentence [12]. This greatly undermines the utility of the
algorithm since many opinions expressed in user-generated
reviews are implicit, and they tend to elude discovery by un-
supervised learning methods.
The second disadvantage is that this approach often makes
mistakes in selecting useful features and deciding the senti-
ment orientation of the features. For instance, as mentioned
in [33], “impeccable”, which should be a positive word, has
a very high negative score in their system. Illustrating simi-
lar mistakes, some users of that system noted that the features
it presented did not make much sense [33]. These errors may lower
the motivation to use a system as users perceive it to be unre-
liable.
One way to address these problems is to use labeled data
and supervised learning to find the hidden features in the re-
views. Supervised learning has demonstrated better perfor-
mance than unsupervised methods in general [3]. Moreover,
this approach can identify sentences related to a feature even
if the feature itself is not in the sentence, because it uses more
than a single term in each sentence for classification. However, su-
pervised methods are difficult, if not impossible, to imple-
ment in existing interfaces because it is difficult to motivate
users to create a label for each sentence they write. Therefore,
one big challenge for this approach is collecting labeled data.
Finding a way to motivate users to provide labels is key to the
success of this supervised approach.
To address the questions discussed above, we propose a
novel intelligent interface that collects training data directly
from users as they generate reviews, a concept often called
implicit crowdsourcing. The system we built can perform
feature-sentiment analysis in nearly real-time. As a result,
it can provide predictions while a user is writing the review.
If the user sees the prediction is wrong, the user can sim-
ply click the icons on the interface to correct the erroneous
prediction. Therefore, instead of providing labels for every
sentence in the review, users need only to correct some mis-
takes made by the system, which greatly reduces the effort
necessary to provide feature-sentiment information. In ad-
dition, users are more motivated to provide labels because
these labels are related to the accuracy of their own reviews.
The collected data can be used as new training instances for
the classifier. Moreover, increasing the number of training
instances raises the coverage of the supervised learning algo-
rithms [14]. Therefore, leveraging the crowd to collect user-
generated labels allows the classifiers to provide more accu-
rate predictions as the number of users in the system grows.
To preview our result, the experiment shows that our super-
vised classifiers can achieve much higher F1 scores than base-
line models that discover feature-sentiment information using
unsupervised methods.
Another drawback of many existing intelligent interfaces is
that the feature-related information is summarized in a very
compressed form (e.g., noun-adjective pairs). Although this
has the advantage of allowing users to retrieve the feature-
related information in the reviews more quickly, it also de-
stroys the original context of the reviews, which often contain
more than pure information, such as social cues, personal ex-
pression, etc. In contrast, the proposed system provides a
highlighting function that highlights the feature-related infor-
mation in its original context. This function integrates into
the traditional review reading experience and creates a good
balance between focusing on feature-related information and
understanding the full reviews. In this study, we compared
the two designs (i.e., noun-adjective pairs summarization and
highlighting) to observe what users liked or disliked about
these two types of systems. To preview our results, we did
find that most users preferred to see the original context rather
than the purely summarized features, thus providing support
to the design.
Overview of the paper
In the rest of the paper, we first will review related work on
how crowdsourcing can be used to assist supervised learning
and feature-sentiment analysis of user reviews. We will then
describe the current system and how it differs from previous
ones. Then we will perform two sets of evaluation. First, we
will present results from a simulation experiment to demon-
strate how the system can outperform previous systems that
utilize unsupervised learning, and how the system can im-
prove over time as more user-generated labels are collected
to improve the classifier in the current system. Second, we
will present results from a user study that tested whether re-
view readers and writers would like the features of the current
system. Specifically, we tested the extent to which review
readers would like to see the context of user reviews instead
of merely summarized reviews, and we tested whether the
real-time feedback would encourage review writers to pro-
vide labels for their own reviews. Finally, we will discuss the
implication of our results for the design of systems that rely
on user-generated content in general, as well as the future di-
rection of the current research.
RELATED WORK
Enhancing machine learning algorithms by collecting la-
beled data from crowdsourcing
Crowdsourcing has been proven as an effective way to solve
various kinds of problems [11]. Individuals can easily recruit
online workers from crowdsourcing platforms like Amazon
Mechanical Turk (AMT)1 to solve problems that are diffi-
cult for digital computers (e.g., text editing [1] and answer-
ing visual queries [2]) at a very low cost. These examples have
merely begun to demonstrate the potential of crowdsourcing
as a social computing technique that can be applied in a wide
range of situations.
One of the most notable applications of crowdsourcing is col-
lecting labeled data to improve the performance of machine
learning algorithms. Von Ahn and Dabbish [29] pioneered
the field by developing the ESP game, which recruits people
to generate image labels while playing an online game. By
2008, this game had recruited 200,000 players and collected
more than 50 million labels [30]. The collected labels then
were used to improve Google image search. Other games
[10, 31] and methods [22, 25, 26] also have been proposed
to collect high-quality image labels to train computer vision
algorithms.
1https://www.mturk.com
Figure 1. When reading reviews, users can highlight sentences that are
related to the aspect that interests them with a single click.
In addition to image labels, crowdsourcing also has been used
to collect training data for natural language processing. Snow
et al. [24] studied how labeled data generated by non-experts
on AMT can be more cost-effective than data generated by
experts. They also showed that the labels col-
lected from crowdsourcing could successfully improve ma-
chine learning algorithms. There are also various workshops
in NAACL [4], SIGIR [17], and WSDM [16] that aim at uti-
lizing crowdsourcing to generate labeled data that are useful
to data mining and information retrieval.
Instead of explicitly recruiting crowd workers to generate la-
beled data to enhance machine learning, Nichols et al. [19]
proposed implicit crowdsourcing, a method that directly col-
lects data generated by the users, which is different from tra-
ditional crowdsourcing that pays money to recruit workers
from AMT to create labeled data. This allows the system to
collect more data as the number of users increases. For exam-
ple, they [19] found that by collecting status updates posted
to Twitter, the system can successfully generate meaningful
summaries of sporting events. This design is especially use-
ful for supervised learning because the size of training data
is essential to the performance of the algorithms [14]. Be-
sides summarizing data that have already been posted, the
intelligent interface we built can generate the predictions in
real time and involve users to correct the errors made by the
system. As a result, our interface further uses artificial intelli-
gence to assist users to generate labeled data more easily. To
the best of our knowledge, this is a novel concept that has not
yet been explored.
Intelligent interfaces for user reviews
Since user reviews contain much valuable information, many
researchers have proposed different methods to analyze the
features and sentiments expressed in user reviews [20]. Hu
and Liu [12] used the minimum support of association rules
to identify frequent terms and phrases in user reviews as fea-
tures. Many researchers [7, 21, 28, 32] also studied how to
use machine learning algorithms to classify the sentiments
expressed in user-generated reviews.
Figure 2. When browsing entities, the system allows users to sort them
(e.g. restaurants) using their ratings of different features.

Recently, many intelligent interfaces have been developed to
help users to understand feature-sentiment information in a
huge amount of user reviews. Liu et al. [18] implemented
Opinion Observer, an interface that uses bar charts to present
the positive and negative sentiments of each feature. Carenini
et al. [5] constructed a treemap interface for users to inter-
actively explore the information that interests them. In ad-
dition, they also designed a novel visualization for users to
compare the feature-sentiment information between different
entities [6]. Yatani et al. [33] developed Review Spotlight
to present feature-sentiment information in user reviews us-
ing noun-adjective pairs in a tag cloud. Huang et al. [13]
further grouped similar features together to display feature-
sentiment information in a more concise format. The biggest
difference between our system's intelligent interface and
the existing ones is that, instead of summarizing the informa-
tion using noun-adjective pairs, our system presents the infor-
mation by highlighting the feature-related sentences in their
original context. This allows users to focus on feature-
related information while still having the opportunity to explore
other information in the reviews.
Moreover, intelligent interfaces also have been used to assist
review writing. Dong et al. [8] developed Reviewer’s Assis-
tant, a browser plug-in that identifies sentences written by
previous users that might also be useful to the current user
and recommends them. Their study showed that
the system could suggest sentences that were actually written
by the users. The current system also has an intelligent in-
terface for review writing. Nevertheless, we aim at collecting
feature-sentiment information that is helpful to readers rather
than at assisting reviewers in generating user reviews.
SYSTEM DESIGN AND IMPLEMENTATION
The proposed system incorporates two functions that help
readers digest user reviews: first, it allows its users to high-
light sentences based on the features they are interested in by
a single click; this greatly reduces the amount of information
that a user must read. (Figure 1) Second, users of the system
can rank the entities based on their feature ratings, which are
inferred directly from the contents. (Figure 2)
Figure 3. Users can click the icons on the interface to correct the er-
roneous predictions made by the system while composing reviews. For
example, the cross sign near the sentences allows users to cancel the pre-
dictions made by the system. In addition, users also can click the stars
to change feature-ratings.
To collect the information needed to perform the two func-
tions mentioned above, the system also has a novel intelligent
interface that conducts feature-sentiment analysis in real time.
When a user is writing a review using the system, whenever
the user finishes a sentence, the system provides the user with
real-time predictions about the feature(s) related to that sentence
and the star ratings of those features. If the user feels that
the predictions made by the system are wrong, they can sim-
ply click the icons on the interface to correct the errors. (Fig-
ure 3) A graphical representation of the design of our system
is shown in Figure 4.
Data and features of the current system
The data used in the current system was retrieved from Yelp’s
Academic Dataset 2, which consists of 87,173 reviews of
restaurants near 30 schools. In this study, we used three
pre-defined features: food, service, and price. However, one
should note that the data and features of the system can easily
be altered or expanded and are not limited to the current
settings.
Supervised two-layer feature-sentiment analysis
To discover the related features of each sentence and the in-
ferred ratings of the features, a supervised two-layer feature-
sentiment analysis was conducted on each review in the cor-
pus. A graphical representation of the flow of the two-layer
analysis is shown in Figure 5.
The first layer of the analysis is the sentence feature classifi-
cation. In this layer, the system decides if one sentence is re-
lated to a target feature or not. To initiate the classifiers of the
system, we collected 5,000 labeled sentences by recruiting
194 workers from Amazon Mechanical Turk at a cost of $9.70
(from 4/11/2012 to 4/21/2012). The workers were asked to
label whether a sentence was related to any of the three pre-
defined features or to none of them (A sentence can be related
to more than one feature). This is used to simulate the data
generated by the initial users of our system. We preprocessed
the text by stemming and removing stop words, and we con-
verted these labeled sentences into unigram feature vectors.
2http://www.yelp.com/academic dataset
Figure 4. When a user is writing a review using our novel intelligent in-
terface, the system performs a real-time feature-sentiment analysis. The
user can easily correct the erroneous predictions made by the system.
The corrected data then is used to improve the classifiers to provide use-
ful information for other users to digest the reviews.
Then, we used the SVMlight package3 to train the classifiers
on these feature vectors. If a sentence was related to a cer-
tain feature, its associated feature vector would be treated as
a positive training instance for the classifier of that feature.
In contrast, if the sentence was not related to that feature, its
feature vector was used as a negative training instance. Fi-
nally, sentences that did not have labels in the corpus were
converted into unigram feature vectors, and the system used
the classifiers trained on the labeled sentences to classify the
features related to the unlabeled sentences.
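To make the first layer concrete, the sketch below shows one possible implementation of the sentence-feature classifiers described above: one binary classifier per feature, trained on unigram vectors of preprocessed sentences. The paper used the SVMlight package; this sketch substitutes scikit-learn's LinearSVC and NLTK's Porter stemmer and stop-word list, and all function and variable names are our own illustrations rather than part of the original system.

```python
# Sketch of the first layer: one binary classifier per feature (food,
# service, price), trained on binary unigram vectors of labeled sentences.
# LinearSVC stands in for SVMlight; NLTK supplies stemming and stop words.
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))  # requires the NLTK stopwords corpus

def preprocess(sentence):
    """Lowercase, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in stop_words)

def train_feature_classifiers(sentences, labels, features=("food", "service", "price")):
    """Train one binary unigram SVM per feature.

    `labels` is a list of sets, e.g. {"food", "service"} for a sentence
    related to both food and service (a sentence can carry several labels).
    """
    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(preprocess(s) for s in sentences)
    classifiers = {}
    for feature in features:
        y = [1 if feature in label_set else 0 for label_set in labels]
        classifiers[feature] = LinearSVC().fit(X, y)
    return vectorizer, classifiers

def classify_sentence(sentence, vectorizer, classifiers):
    """Return the set of features predicted for one (possibly unlabeled) sentence."""
    x = vectorizer.transform([preprocess(sentence)])
    return {f for f, clf in classifiers.items() if clf.predict(x)[0] == 1}
```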
The second layer of the analysis is the feature-rating pre-
diction, which predicts the star ratings of different features
in each review. To construct the classifiers, we utilized the
reviews and their associated star ratings in Yelp’s academic
dataset. First, the reviews with 4 or more stars were used as
positive training instances and reviews with less than 4 stars
were used as negative training instances. These reviews were
converted to feature vectors and were used to train a positive-
negative classifier using SVMlight. By a similar procedure,
we built a 4-5-star classifier and a 2-3-star classifier.4 These
classifiers allowed us to predict the overall ratings of each
review. For example, if a review was predicted as positive
(more than 3 stars) by the positive-negative classifier, the sys-
tem then would run the 4-5-star classifier to see if it should
be classified as a 4-star review or a 5-star review.
Equipped with the rating classifiers, the system then predicts
3http://svmlight.joachims.org/
4 We grouped 1-star reviews into the 2-star category because only a
very small portion of the reviews are 1-star.
Figure 5. The flow of the two-layer feature-sentiment analysis. The sys-
tem first classifies the sentences by their related features. The sentences
that are related to the same feature then are grouped together and are
used to predict the rating of the feature.
the feature ratings of each review by classifying the star rating
using only the contents that are related to a feature (based on
the first layer analysis). For instance, when predicting the ser-
vice rating of a review, the system would first find out which
sentences related to service using the service-feature classi-
fier. Then, these sentences would be grouped together and
classified by the star-rating classifiers. Finally, the output of
the classifier would be the service rating of the review.
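The sketch below illustrates this second layer, reusing the helpers from the previous sketch. A positive-negative classifier first decides whether a block of text reads like a review with 4 or more stars, and a 4-5-star or a 2-3-star classifier then refines the prediction; the sentences related to a feature are grouped and rated together. The mapping of classifier outputs to star values here is an assumption made for illustration, not something specified in the paper.

```python
# Sketch of the second layer: cascaded star-rating prediction over the text
# related to each feature. The three rating classifiers are assumed to share
# the unigram vector space of the first layer and to output 1 for the
# "higher" class (positive, 5-star, and 3-star, respectively).
def predict_star_rating(text, vectorizer, pos_neg_clf, four_five_clf, two_three_clf):
    """Cascaded star-rating prediction for a block of text."""
    x = vectorizer.transform([preprocess(text)])
    if pos_neg_clf.predict(x)[0] == 1:      # reads like a 4- or 5-star review
        return 5 if four_five_clf.predict(x)[0] == 1 else 4
    else:                                   # reads like a 1-, 2-, or 3-star review
        return 3 if two_three_clf.predict(x)[0] == 1 else 2

def predict_feature_ratings(review_sentences, vectorizer, feature_clfs, rating_clfs):
    """Group the sentences related to each feature and rate each group.

    `rating_clfs` is the tuple (pos_neg_clf, four_five_clf, two_three_clf).
    Features with no related sentences receive no rating.
    """
    ratings = {}
    for feature, clf in feature_clfs.items():
        related = [s for s in review_sentences
                   if clf.predict(vectorizer.transform([preprocess(s)]))[0] == 1]
        if related:
            ratings[feature] = predict_star_rating(
                " ".join(related), vectorizer, *rating_clfs)
    return ratings
```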
Collecting corrected feature-sentiment information from
users
To ensure the accuracy of the analysis made by the system and
collect more training data to improve the performance of the
classifier, the system provides an interface that can perform
real-time feature-sentiment analysis while a user composes a
review. Whenever the user finishes a sentence, the web-based
system sends it to the server using AJAX. When the server
receives the sentence, a Python script performs the text pre-
processing and converts the sentence into a unigram feature
vector that can be processed by the classifiers. Then the sys-
tem conducts the two-layer feature-sentiment analysis using
the SVMlight package. Finally, the result of the analysis is
sent back to the interface on the client side. The whole analy-
sis can be performed within one second, including the latency
of the Internet, so for the user, the analysis seems to occur in
real time.
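As a rough illustration of this server-side flow, the sketch below exposes the two-layer analysis behind a single HTTP endpoint. The paper states only that the client sends each finished sentence via AJAX and that a Python script on the server performs the analysis; the Flask framework, the /classify route, and the JSON response shape are our assumptions, and the helper functions and model objects are the ones sketched in the previous section.

```python
# Hypothetical server endpoint for the real-time analysis. It assumes the
# helpers from the earlier sketches (preprocess, classify_sentence,
# predict_star_rating) are defined in this module and that the models have
# been trained once at startup, e.g.:
#   vectorizer, feature_clfs = train_feature_classifiers(sentences, labels)
#   rating_clfs = (pos_neg_clf, four_five_clf, two_three_clf)
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/classify", methods=["POST"])
def classify():
    """Classify one finished sentence and return its features and star ratings."""
    sentence = request.get_json()["sentence"]
    features = classify_sentence(sentence, vectorizer, feature_clfs)
    ratings = {f: predict_star_rating(sentence, vectorizer, *rating_clfs)
               for f in features}
    return jsonify({"features": sorted(features), "ratings": ratings})

if __name__ == "__main__":
    app.run()
```

The client can then POST each finished sentence to /classify and render the returned features and ratings next to the sentence, letting the writer correct them with a click.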
If the predictions are wrong, the user can simply click the
icons next to the predictions on the interface to correct them.
For example, if a sentence is mistakenly classified as a sen-
tence that is related to price but the user judges that it is not,
the user can use the cross sign near the sentence to cancel
this prediction and assign the sentence to other categories, or
not to any existing category. If, for example, the sentence is
judged to be related to food, the user can click the icon that
represents food near the sentence to label it. Furthermore, if
the star-rating predictions are wrong, the user can click the
stars on the interface to change the ratings.
EXPERIMENTS
To evaluate whether our proposed method really improves
feature-sentiment analysis, we manually labeled the related
features (i.e., food, service, and price) of 1,000 sentences in
the dataset and used these as a gold-standard test dataset. The
ability of our proposed design and the baseline models to
discover related features was evaluated on this
dataset.5 In this study, we focused only on the proposed
method's ability to classify feature-related sentences and left
evaluation of its ability to generate accurate feature ratings for
future work. Specifically, we wanted to test two hypotheses:
H1: The supervised classifier in our design can achieve
higher performance than a traditional unsupervised approach
can.
H2: More training data improves the performance of the clas-
sifier.
Two experiments were performed to test the hypotheses. The
details of these experiments are described below.
Experiment I: Comparisons between supervised and un-
supervised methods
In this experiment, we compared the proposed supervised
method to a baseline model with unsupervised learning to test
if H1 is true.
Method
Three sentence-feature classifiers (food, service, and price)
were trained on 5,000 sentences labeled by AMT workers.
We performed text preprocessing, which included stemming
and removing stop words from the sentences. The sentences
were then converted to feature vectors using a unigram
model. Finally, we used SVMlight to train the classifiers on
these feature vectors.
In addition, we built a baseline model similar to the ones in
[13, 18, 33]. This unsupervised model used 434,664 sen-
tences in the full data set. The data size is much larger than
the 5,000 labeled sentences used in the supervised method. To
identify the frequent features in the sentences, we first used
the part-of-speech tagging function in NLTK6 to find all the
nouns and adjectives in the sentences. Then, we performed
the same text preprocessing as in the supervised method. Af-
ter that, the nouns (after stemming) that appeared in more
than 1% of the total sentences were selected as the features.
This threshold is the same as the minimum support used in
[18]. We also tried thresholds of 0.1%, 0.5%, and
2%, but there were no significant differences in the results
across thresholds. Therefore, only the results of the
1% threshold were reported. After the features were selected,
the closest adjective to each feature was considered as the one
5 We did not use the labels retrieved from AMT to evaluate the clas-
sifiers because we found that they contained many low-quality labels.
6http://nltk.org/
Figure 6. The comparison between the F1 scores of the supervised
method and the best performance of the unsupervised methods. The
results show that the supervised method can achieve much better F1
scores than the unsupervised methods for all three features tested.
that described the feature. We then grouped the features us-
ing the Kullback-Leibler divergence [15] between the adjec-
tives that described the features, which is the feature grouping
method used in [13]. Finally, the top (5, 10, 20, 30) closest
features to food, service, and price were assigned to them as
sub-features. If one of the sub-features occurred in a target
sentence, the sentence was classified as related to the main
feature (food, service, or price).
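A partial sketch of this baseline appears below: it extracts frequent nouns with NLTK's part-of-speech tagger and classifies a sentence as related to a main feature whenever one of that feature's sub-features occurs in it. The adjective extraction and KL-divergence grouping steps are omitted for brevity, so the mapping from main features to sub-features is assumed to be given, and the names here are illustrative rather than taken from the original implementation.

```python
# Sketch of the unsupervised baseline: frequent stemmed nouns are treated as
# candidate features, and a sentence is tagged with a main feature if one of
# that feature's sub-feature nouns occurs in it.
from collections import Counter

import nltk  # requires the punkt tokenizer and POS tagger models
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def frequent_nouns(sentences, min_fraction=0.01):
    """Return stemmed nouns that appear in more than `min_fraction` of sentences."""
    doc_freq = Counter()
    for sentence in sentences:
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        nouns = {stemmer.stem(w.lower()) for w, tag in tagged if tag.startswith("NN")}
        doc_freq.update(nouns)  # each noun counted at most once per sentence
    threshold = min_fraction * len(sentences)
    return {noun for noun, count in doc_freq.items() if count > threshold}

def classify_by_subfeatures(sentence, subfeatures):
    """`subfeatures` maps each main feature (food, service, price) to a set of
    stemmed sub-feature nouns, e.g. the output of the grouping step."""
    tokens = {stemmer.stem(t.lower()) for t in nltk.word_tokenize(sentence)}
    return {main for main, subs in subfeatures.items() if tokens & subs}
```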
Evaluation
We evaluated the systems on a 1,000-sentence test set man-
ually labeled by the authors. The precision, recall, and F1
score were calculated using the formulas below:

precision = (# feature-related sentences classified correctly) /
            (# sentences classified as related to the feature)

recall = (# feature-related sentences classified correctly) /
         (# feature-related sentences)

F1 = (2 · precision · recall) / (precision + recall)
We used the F1 scores to evaluate the performance of the
classifiers because the F1 score is the harmonic mean of precision and
recall. Since there is a trade-off between precision and recall,
F1 scores can evaluate the results more fairly [23]. The pre-
cision, recall, and F1 scores of the supervised method and the
unsupervised methods with various numbers of sub-features
are summarized in Tables 1, 2, and 3.
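For reference, the sketch below computes these metrics for one feature classifier; the encoding of the gold and predicted labels as parallel boolean lists is our own choice and not prescribed by the paper.

```python
# Precision, recall, and F1 for one feature classifier, following the
# formulas above. `gold` and `predicted` are parallel lists of booleans,
# one entry per test sentence, indicating whether the sentence is related
# to the feature.
def precision_recall_f1(gold, predicted):
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```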
The results show that the proposed supervised method
achieved much higher F1 scores in classifying feature-related
sentences (Figure 6). When looking more closely at the
precision and recall of each classifier, we see that the super-
vised method can achieve much higher precision than the un-
supervised methods can. In addition, although adding sub-fea-
tures to the unsupervised methods allows them to outperform
the supervised method in recall for two of the three features,
5 feat. 10 feat. 20 feat. 30 feat. supervised
Food 47.67% 51.69% 53.78% 51.99% 85.75%
Service 18.47% 18.39% 16.59% 17.09% 90.00%
Price 24.55% 16.29% 13.83% 11.91% 80.95%
Table 1. Precision of supervised method and unsupervised methods with
different numbers of sub-features
5 feat. 10 feat. 20 feat. 30 feat. supervised
Food 30.09% 41.63% 61.09% 68.10% 70.50%
Service 34.33% 41.04% 50.75% 65.67% 47.01%
Price 31.68% 35.64% 49.50% 63.28% 40.50%
Table 2. Recall of supervised method and unsupervised methods with
different numbers of sub-features
5 feat. 10 feat. 20 feat. 30 feat. supervised
Food 36.89% 46.12% 57.20% 58.96% 77.38%
Service 24.02% 25.40% 25.01% 27.12% 61.76%
Price 30.60% 24.51% 22.22% 20.00% 53.99%
Table 3. F1 scores of supervised method and unsupervised methods with
different numbers of sub-features
the precision becomes unacceptably low (around 15%). The
reason is that unsupervised methods would include many gen-
eral terms as sub-features, which are not very helpful to the
classification task. In contrast, the supervised method uses
the unigram feature vector to determine whether one sentence
is related to a feature. Therefore, the classification result is
not dominated by any single term. In addition, the supervised
method can discover some hidden patterns in sentences. For
instance, consider the following:
“The food is superb and it comes out pretty fast.”
This sentence is related to both food and service. By using an
unsupervised method, the only feature that can be discovered
is food because it is mentioned explicitly in the sentence. On
the other hand, the supervised method used in our system can
successfully capture both features because “fast” and “come”
both carry meanings that are related to service, which is the
hidden feature of this sentence. The supervised method can
learn these implicit concepts (e.g., fast and come) to discover
the hidden feature if it is trained on a massive amount of la-
beled data. As a result, the supervised method can achieve
higher F1 scores. The results therefore provide support for H1:
using supervised learning to train the classifier can result in
better performance than that achieved by traditional unsuper-
vised methods. In particular, the current method can signifi-
cantly improve precision because many hidden
variables that define the categories in user reviews cannot be
easily identified by unsupervised methods.
Experiment II: Comparisons between the supervised
methods with various training data size
In this second experiment, we varied the size of the data
used to train the supervised classifiers to see if H2 is sup-
ported.
Method
We trained the classifiers on 0.5K, 1K, 2K, 3K, 4K, and 5K
labeled sentences. These subsets of labeled sentences were
selected randomly from the 5,000 labeled sentences collected
from AMT as described in experiment I.
Figure 7. The F1 scores of the feature classifiers trained on various sizes of data. The results show that there was a very high positive correlation between
the size of the training data and the performance of the classifiers.
Evaluation
The precision, recall, and F1 scores of the feature classifiers
trained on various sizes of labeled sentences are summarized
in Tables 4, 5, and 6.
500 1000 2000 3000 4000 5000
Food 85.02% 82.13% 81.37% 82.90% 84.58% 85.75%
Service 95.00% 68.83% 70.93% 86.67% 85.53% 90.00%
Price 92.59% 96.15% 87.10% 78.38% 79.49% 80.95%
Table 4. Precision of supervised method with various sizes of training data
500 1000 2000 3000 4000 5000
Food 51.13% 64.19% 59.01% 72.07% 74.10% 70.50%
Service 28.36% 39.55% 45.52% 48.51% 48.51% 47.01%
Price 24.75% 24.75% 26.73% 28.71% 30.69% 40.50%
Table 5. Recall of supervised method with various sizes of training data
500 1000 2000 3000 4000 5000
Food 63.86% 72.06% 68.41% 77.11% 78.99% 77.38%
Service 43.68% 50.23% 55.45% 62.20% 61.91% 61.76%
Price 39.06% 39.37% 40.91% 42.03% 44.28% 53.99%
Table 6. F1 scores of supervised method with various sizes of training data
The results show that the performance of the classifiers
is clearly positively correlated with the size of the training data.
The Pearson correlations between the size of training data and
the F1 scores of the feature classifiers for food, service, and
price are 0.85, 0.90, and 0.89, respectively. We also found
that the improvements of the classifiers come mostly from re-
call. The Pearson correlations between the size of training
data and the recall of the feature classifiers for food, service,
and price are 0.81, 0.79, and 0.91, respectively. This shows
that more training data can help the classifiers discover more
hidden patterns in the data, which supports H2, that more
training instances can enhance the performance of the feature
classifiers. This also implies that when more users use the
system, the classifiers behind the system can generate more
accurate predictions because more training data can be col-
lected from users.
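As a quick sanity check, these correlations can be recomputed from the F1 scores in Table 6; the snippet below does so with SciPy's pearsonr, which is not part of the original system, and prints values matching the 0.85, 0.90, and 0.89 reported above.

```python
# Recompute the correlation between training-set size and F1 score
# from the values in Table 6.
from scipy.stats import pearsonr

sizes = [500, 1000, 2000, 3000, 4000, 5000]
f1_scores = {
    "food":    [63.86, 72.06, 68.41, 77.11, 78.99, 77.38],
    "service": [43.68, 50.23, 55.45, 62.20, 61.91, 61.76],
    "price":   [39.06, 39.37, 40.91, 42.03, 44.28, 53.99],
}

for feature, scores in f1_scores.items():
    r, _ = pearsonr(sizes, scores)
    print(f"{feature}: r = {r:.2f}")  # approximately 0.85, 0.90, 0.89
```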
Summary
To summarize, our experiments show that the proposed super-
vised method is a better mechanism for classifying sen-
tences by their related features than traditional unsupervised
methods. Moreover, increasing the training data can further
improve the performance of the supervised classifiers, which
means that collecting labeled data from users effectively en-
hances the feature-sentiment analysis of the system.
USER STUDY
We conducted a user study to gain more insight into how
our interface impacts users and to answer the following three
questions that are important to our theses:
1. Does the feature-sentiment analysis and highlighting func-
tion of our system help users digest the information in user
reviews?
2. Is the highlighting function better than existing intelli-
gent interfaces that summarize feature-sentiment by noun-
adjective pairs?
3. Will users correct the erroneous predictions made by the
system?
The first two questions address how our proposed interface
helps readers digest user reviews, and the last question lets us
know whether the system can really collect more corrected feature-
sentiment information when it is deployed in the
wild.
Procedure
At the beginning of the study, the participants were asked to
fill out a questionnaire collecting demographic information
and asking how often they read and write user reviews. Then,
we introduced our interface and its functions to the partici-
pants. After that, the participants were asked to read reviews
about a restaurant using both our interface and Yelp's web-
site.7 When they finished reading, they were asked if the
highlighting function of our interface helped them get useful
information about the restaurant from the reviews.
Then, we introduced RevMiner8 [13], an interface that uses
noun-adjective pairs to summarize feature-sentiment infor-
mation in the reviews for its users (Figure 8). After the partic-
ipants became familiar with the interfaces, we let them com-
pare the noun-adjective pairs summarization with our high-
lighting function. We chose to use the RevMiner interface
7http://www.yelp.com/
8http://revminer.com/
Figure 8. Noun-adjective pairs visualization in RevMiner. We used
RevMiner as an example to represent the interfaces that summarize
feature-sentiment information using noun-adjective pairs.
because it was one of the most recently developed interfaces at the time
this article was written, and because it was publicly available
on the Web. We should point out that our intention was not
to directly compare our interface to this particular interface.
Instead, we were interested in comparing the general designs of
summarization using noun-adjective pairs (which are
used primarily in RevMiner) and highlighting (which is used
in our system).
Finally, we demonstrated the review-writing interface of our
system, and asked the participants to write a review about a
restaurant they recently visited using the interface. After they
completed their reviews, we asked the participants whether
they would use the interface to correct the feature-sentiment
predictions when they were wrong.9
Participants
Thirteen college and graduate students (7 males and 6 females
between the ages of 22 and 34) participated in this study. All
of the participants read user reviews online at least once a
month, and 8 of the 13 participants (62%) had experience in
writing user reviews online. The study lasted approximately
30 minutes.
USER STUDY RESULTS
Highlighting helped participants digest user reviews
When comparing our interface with traditional review web-
sites, 11 of the 13 participants (85%) suggested that the high-
lighting function helped them understand the information
contained in a massive amount of online reviews more quickly.
One participant mentioned:
The highlighting function really helps me focus on the
information I am interested in. I can get the informa-
tion without spending time on the murmur of the re-
viewers.
Although two participants thought this function didn't make
a significant difference, this result demonstrated that the
feature-sentiment analysis and highlighting function were
perceived as helpful by the majority of the participants in di-
gesting the large number of user reviews.
9 The exact wordings of our questions are listed in the appendix at
the end of the paper.
More participants preferred highlighting over noun-
adjective pairs summarization
When participants were asked to compare our interface (high-
lighting) to RevMiner (noun-adjective pair), 6 of 13 partici-
pants (46%) thought our interface was better, 3 of them (23%)
thought RevMiner was better, and 4 of them (31%) thought it
was a tie. The result suggests that about twice as many partic-
ipants preferred the highlighting function, compared to those
who preferred noun-adjective pairs summarization. The rea-
son was that the highlighting function allowed people to fo-
cus on the feature-sentiment information in its original con-
text, which creates a good balance between focusing on specific
feature-sentiment information and reading the whole review. In con-
trast, the noun-adjective pairs summarization allowed the par-
ticipants to see only compressed and fragmented informa-
tion, which was not easy for them to interpret. One of our
participants noted:
I really like the first one (our interface) because it let
me focus on some parts that I am interested in and
I can also see its context. However, when reading
reviews using the second one (RevMiner), I only see
some short phrases. I have to first put them together to
guess the meanings, so it takes more time and is hard
to get the original context. This doesn’t help me learn
the experience of the previous users.
The participants who favored noun-adjective pairs summa-
rization preferred it mainly because it offered more features
than the three pre-defined features in our interface. However,
this issue can be addressed by including more features in our
system since the cost of adding new features is low.
Participants were motivated to correct erroneous predic-
tions made by the system
After using our review-writing inter-
face with real-time feature-sentiment analysis, 9 of the 13
participants (69%) expressed that they would correct errors
made by the system. One of the participants mentioned,
When I write a review, I want to let others get my com-
ments as clear as possible, so I care about the correct-
ness of the information in the review and will correct
the mistakes made by the system.
Another participant said,
Because I read reviews using this system before, I know
that my effort can help others understand my review,
so I will provide the information even if it causes some
extra work for me.
This promising result demonstrates that the real-time feature-
sentiment analysis does motivate users to correct the erro-
neous predictions of their own reviews. This is important as
the system can collect more corrected labels to improve the
classifier over time. We believe that the user’s ability to see
immediately how their reviews will be classified is an impor-
tant feature that motivates users to provide low-cost (one-
click) corrections of the automatic classification.
Of course, the current study cannot directly prove that most
users really would correct the mistakes made by the system.
Nevertheless, when the number of users increases, even if
only a small portion of them provide feedback, the system
still can benefit from the feedback and enhance classification
accuracy.
Summary
To summarize, our user study answered the three questions
related to our theses. First, the highlighting function pro-
vided by the proposed system can improve the user’s read-
ing experience. Second, highlighting is at least as good as,
if not better than, noun-adjective pair summarization because
it preserves the original context of the feature-sentiment in-
formation as expressed by the review writers. Finally, the
interface with real-time feature-sentiment analysis can suc-
cessfully motivate users to correct errors made by the system,
so the classifiers behind the system can be improved as the
number of users increases.
DISCUSSION
Reducing effort in order to motivate users to correct erro-
neous predictions
Our interface reduces the effort needed to provide feature-
sentiment information by performing real-time analysis,
which requires only that users correct some mistakes made by
the system instead of segmenting, labeling, and rating their
reviews themselves. As a result, our user study shows that
around 70% of the participants expressed that they would
provide corrected feature-sentiment information. However,
about 30% of the participants did say that they would not cor-
rect the erroneous predictions made by the system. When
asked, they said the effort required needed to be further re-
duced. As one of the participants explained:
I know that correcting the errors can be helpful to my
readers, but I think it’s just too much work for me.
I need to click many icons there to make them right.
That’s why I choose to just ignore the errors.
Therefore, it is important for us to design interfaces that al-
low users to correct the errors more easily. In the future, we
would like to experiment with different interface designs to
determine how to motivate more users to provide the cor-
rected data. One possible way is to design an interface that
allows users to drag the icons and sentences directly. On the
other hand, given that most users said they would provide the
label and the classifier could benefit from the input, fewer
and fewer corrections would be needed for future users as the
classifier became more accurate.
Including more features in the proposed design
Our user study shows that about 23% of the participants pre-
ferred the intelligent interface that used noun-adjective pairs
summarization (RevMiner). As suggested by the participants,
the biggest advantage of the interface is that it has more fea-
tures that interest them. In contrast, our system has only three
pre-defined features. This problem can be solved easily by
including more features, so it is not an inherent limitation of
the system. Since the efficiency of the classifiers is quite
high, adding more features will not cause any technical prob-
lems. However, additional fea-
tures may introduce a different problem. As mentioned ear-
lier, maintaining a low level of effort required is important
for motivating users to provide the correct labels. However,
adding more features may increase the effort, as users need
to remember what the possible categories are. This also may
make the interface more complex. Therefore, there is clearly
a tradeoff between providing more detailed categories and in-
formation and maximizing usability.
Real-time feature-sentiment analysis can encourage
users to generate more structured reviews
Although our system was not intended to help users write
reviews with higher quality, we did see that the real-time
feature-sentiment analysis affected the reviews generated by
the users. As one of the participants mentioned after she
wrote a review using our interface:
The results of the (feature-sentiment) analysis let me
know which part I haven’t mentioned in my review, so
I will try to write some words that are related to that
part.
In our user study, we also found that participants would try
to write something related to the three pre-defined features
before they finished. This shows that the real-time analysis
can encourage users to generate reviews that cover more fea-
tures, which can improve the quality of reviews collected by
the system. A controlled experiment that shows how the real-
time analysis affects review quality can be done in the future.
The upper bounds of classification accuracy
In our experiment, the food-classifier reached the highest F1
score at 4000 training instances, and the service-classifier
reached the highest at 3000 training instances; however, the
F1 score of the price classifier continued to grow even after
5000 training instances. The intuitive explanation for this is
that there were more positive training instances for food and
service in our randomly chosen training dataset, so the clas-
sifiers learned faster initially.
In the future, it would be valuable to perform a study to de-
termine the upper bounds of classification accuracy. Once
a classifier reaches its upper bound, the system could stop
asking users to provide feedback for that classifier. This can
reduce the workload of the users of the system.
CONCLUSIONS AND FUTURE WORK
In this paper, we presented a novel intelligent system that per-
forms a two-layer feature-sentiment analysis in real time. The
system can provide real-time predictions to users who are
writing user reviews, which makes it very easy for them to
provide feature-sentiment information by simply correcting
the erroneous predictions. Our user study shows that about
70% of the participants were willing to correct the mistakes
made by the system, which means that the proposed interface
can successfully utilize the power of the crowd to collect a
massive amount of labeled data that can be used to train the
supervised classifiers. Moreover, our experiment shows that
the size of the training data is positively correlated with the per-
formance of the feature-sentiment analysis. As a result, we
can expect that the analysis performed by the system can be-
come more and more accurate as the number of system users
increases.
In addition, we compared our system to existing intelligent
interfaces with similar purposes. The results of our exper-
iment show that the supervised method of our system can
achieve much higher F1 scores than traditional unsupervised
methods can achieve. Moreover, our user study also shows
that 46% of the participants preferred the highlighting func-
tion of our interface over the noun-adjective pairs summa-
rization, while only 23% of them preferred the summariza-
tion. This indicates that our system can provide more accu-
rate feature-sentiment information and help users understand
the information better than traditional interfaces with similar
goals can.
The results of our experiment show that implicit crowdsourc-
ing can be useful for improving supervised learning algorithms
by collecting a huge amount of training data at no cost.
The mechanism used in the proposed design also can be ap-
plied to other domains, like status updates in social media or
content in Q&A forums, and is not limited to user reviews.
However, there are still some limitations to the current de-
sign. One of the most essential issues is to find ways to fur-
ther reduce the effort necessary for users to provide useful
information. Moreover, it is also important to find a good
way to include more features or even to let users input un-
specified features themselves. We believe the work presented
in this paper offers a good first step for more future studies
that combine the strengths of intelligent interfaces and implicit
crowdsourcing.
In the future, we would like to deploy our system in the wild
to see if it really can help users and study how users interact
with the system on a large scale. Furthermore, since the sys-
tem involves its users in providing training data interactively, it
is possible for us to include active learning [27] in our system
design to further improve the performance of the supervised
learning classifiers.
ACKNOWLEDGEMENTS
Many thanks to the anonymous reviewers for their valuable
comments, which helped us improve an earlier draft of this work.
We also would like to thank Siddharth Gupta for helping us
implement part of the system used in this study.
REFERENCES
1. Bernstein, M. S., Little, G., Miller, R. C., Hartmann, B.,
Ackerman, M. S., Karger, D. R., Crowell, D., and
Panovich, K. Soylent: a word processor with a crowd
inside. In Proceedings of the 23rd annual ACM
symposium on User interface software and technology,
UIST ’10, ACM (New York, NY, USA, 2010), 313–322.
2. Bigham, J. P., Jayant, C., Ji, H., Little, G., Miller, A.,
Miller, R. C., Miller, R., Tatarowicz, A., White, B.,
White, S., and Yeh, T. Vizwiz: nearly real-time answers
to visual questions. In Proceedings of the 23rd annual
ACM symposium on User interface software and
technology, UIST ’10, ACM (New York, NY, USA,
2010), 333–342.
3. Blei, D., and McAuliffe, J. Supervised topic models. In
Advances in Neural Information Processing Systems 20,
J. Platt, D. Koller, Y. Singer, and S. Roweis, Eds. MIT
Press, Cambridge, MA, 2008, 121–128.
4. Callison-Burch, C., and Dredze, M. Creating speech and
language data with amazon’s mechanical turk. In
Proceedings of the NAACL HLT 2010 Workshop on
Creating Speech and Language Data with Amazon’s
Mechanical Turk, CSLDAMT ’10, Association for
Computational Linguistics (Stroudsburg, PA, USA,
2010), 1–12.
5. Carenini, G., Ng, R. T., and Pauls, A. Interactive
multimedia summaries of evaluative text. In
Proceedings of the 11th international conference on
Intelligent user interfaces, IUI ’06, ACM (New York,
NY, USA, 2006), 124–131.
6. Carenini, G., and Rizoli, L. A multimedia interface for
facilitating comparisons of opinions. In Proceedings of
the 14th international conference on Intelligent user
interfaces, IUI ’09, ACM (New York, NY, USA, 2009),
325–334.
7. Dave, K., Lawrence, S., and Pennock, D. M. Mining the
peanut gallery: opinion extraction and semantic
classification of product reviews. In Proceedings of the
12th international conference on World Wide Web,
WWW ’03, ACM (New York, NY, USA, 2003),
519–528.
8. Dong, R., McCarthy, K., O’Mahony, M., Schaal, M.,
and Smyth, B. Towards an intelligent reviewer’s
assistant: recommending topics to help users to write
better product reviews. In Proceedings of the 2012 ACM
international conference on Intelligent User Interfaces,
IUI ’12, ACM (New York, NY, USA, 2012), 159–168.
9. Esuli, A., and Sebastiani, F. Sentiwordnet: A publicly
available lexical resource for opinion mining. In In
Proceedings of the 5th Conference on Language
Resources and Evaluation, LREC '06 (2006), 417–422.
10. Ho, C.-J., Chang, T.-H., Lee, J.-C., Hsu, J. Y.-j., and
Chen, K.-T. Kisskissban: a competitive human
computation game for image annotation. SIGKDD
Explor. Newsl. 12, 1 (Nov. 2010), 21–24.
11. Howe, J. The rise of crowdsourcing. Wired Magazine
(06 2006).
12. Hu, M., and Liu, B. Mining and summarizing customer
reviews. In Proceedings of the tenth ACM SIGKDD
international conference on Knowledge discovery and
data mining, KDD ’04, ACM (New York, NY, USA,
2004), 168–177.
13. Huang, J., Etzioni, O., Zettlemoyer, L., Clark, K., and
Lee, C. Revminer: an extractive interface for navigating
reviews on a smartphone. In Proceedings of the 25th
annual ACM symposium on User interface software and
technology, UIST ’12 (2012), 3–12.
14. Kearns, M. J., and Vazirani, U. V. An introduction to
computational learning theory. MIT Press, Cambridge,
MA, USA, 1994.
15. Kullback, S., and Leibler, R. On information and
sufficiency. The Annals of Mathematical Statistics 22, 1
(1951), 79–86.
16. Lease, M., Carvalho, V., and Yilmaz, E., Eds.
Proceedings of the Workshop on Crowdsourcing for
Search and Data Mining (CSDM) at the Fourth ACM
International Conference on Web Search and Data
Mining (WSDM).
17. Lease, M., Carvalho, V., and Yilmaz, E., Eds.
Proceedings of the ACM SIGIR 2010 Workshop on
Crowdsourcing for Search Evaluation (CSE 2010).
Geneva, Switzerland, July 2010.
18. Liu, B., Hu, M., and Cheng, J. Opinion observer:
analyzing and comparing opinions on the web. In
Proceedings of the 14th international conference on
World Wide Web, WWW ’05 (2005), 342–351.
19. Nichols, J., Mahmud, J., and Drews, C. Summarizing
sporting events using twitter. In Proceedings of the 2012
ACM international conference on Intelligent User
Interfaces, IUI ’12, ACM (New York, NY, USA, 2012),
189–198.
20. Pang, B., and Lee, L. Opinion mining and sentiment
analysis. Foundations and Trends in Information
Retrieval 2, 1–2 (2008), 1–135.
21. Pang, B., Lee, L., and Vaithyanathan, S. Thumbs up?:
sentiment classification using machine learning
techniques. In Proceedings of the ACL-02 conference on
Empirical methods in natural language processing -
Volume 10, EMNLP ’02, Association for Computational
Linguistics (Stroudsburg, PA, USA, 2002), 79–86.
22. Rashtchian, C., Young, P., Hodosh, M., and
Hockenmaier, J. Collecting image annotations using
amazon’s mechanical turk. In Proceedings of the
NAACL HLT 2010 Workshop on Creating Speech and
Language Data with Amazon’s Mechanical Turk,
CSLDAMT ’10, Association for Computational
Linguistics (Stroudsburg, PA, USA, 2010), 139–147.
23. Rijsbergen, C. J. V. Information Retrieval, 2nd ed.
Butterworth-Heinemann, Newton, MA, USA, 1979.
24. Snow, R., O’Connor, B., Jurafsky, D., and Ng, A. Y.
Cheap and fast—but is it good?: evaluating non-expert
annotations for natural language tasks. In Proceedings of
the Conference on Empirical Methods in Natural
Language Processing, EMNLP ’08, Association for
Computational Linguistics (Stroudsburg, PA, USA,
2008), 254–263.
25. Sorokin, A., and Forsyth, D. Utility data annotation with
amazon mechanical turk. In Computer Vision and
Pattern Recognition Workshops, 2008. CVPRW ’08.
IEEE Computer Society Conference on (June 2008),
1–8.
26. Su, H., Deng, J., and Fei-Fei, L. Crowdsourcing
annotations for visual object detection. In Workshops at
the Twenty-Sixth AAAI Conference on Artificial
Intelligence (2012).
27. Tong, S., and Koller, D. Support vector machine active
learning with applications to text classification. J. Mach.
Learn. Res. 2 (Mar. 2002), 45–66.
28. Turney, P. D. Thumbs up or thumbs down?: semantic
orientation applied to unsupervised classification of
reviews. In Proceedings of the 40th Annual Meeting on
Association for Computational Linguistics, ACL ’02,
Association for Computational Linguistics (Stroudsburg,
PA, USA, 2002), 417–424.
29. von Ahn, L., and Dabbish, L. Labeling images with a
computer game. In Proceedings of the SIGCHI
conference on Human factors in computing systems,
CHI ’04, ACM (New York, NY, USA, 2004), 319–326.
30. von Ahn, L., and Dabbish, L. Designing games with a
purpose. Commun. ACM 51 (Aug. 2008), 58–67.
31. von Ahn, L., Liu, R., and Blum, M. Peekaboom: a game
for locating objects in images. In Proceedings of the
SIGCHI Conference on Human Factors in Computing
Systems, CHI ’06, ACM (New York, NY, USA, 2006),
55–64.
32. Wang, H., Lu, Y., and Zhai, C. Latent aspect rating
analysis on review text data: a rating regression
approach. In Proceedings of the 16th ACM SIGKDD
international conference on Knowledge discovery and
data mining, KDD ’10, ACM (New York, NY, USA,
2010), 783–792.
33. Yatani, K., Novati, M., Trusty, A., and Truong, K. N.
Review spotlight: a user interface for summarizing
user-generated reviews using adjective-noun word pairs.
In Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems, CHI ’11, ACM (New
York, NY, USA, 2011), 1541–1550.
APPENDIX: QUESTIONNAIRE
Q1: When reading user reviews, do you prefer the first
interface or the second interface? Why?
Q2: When reading user reviews, do you prefer the highlighting
function of the second interface or the noun-adjective
pair representation of the last interface? Why?
Q3: When writing user reviews on the review writing
interface you just used, would you correct the erroneous
predictions made by the system? Why or why not?