
Leveraging the crowd to improve feature-sentiment analysis of user reviews

March 2013


DOI:10.1145/2449396.2449400

Conference: Proceedings of the 2013 international conference on Intelligent user interfaces

Authors:

Mohammad Amanzadeh

University of Illinois, Urbana-Champaign



Leveraging the Crowd to Improve Feature-Sentiment Analysis of User Reviews

Shih-Wen Huang¹, Pei-Fen Tu¹, Wai-Tat Fu¹, Mohammad Amanzadeh²

¹Department of Computer Science, ²Industrial & Enterprise System Engineering

University of Illinois at Urbana-Champaign

{shuang51, ptu3, wfu, amanzad2}@illinois.edu

ABSTRACT

Crowdsourcing and machine learning are both useful techniques for solving difficult problems (e.g., computer vision and natural language processing). In this paper, we propose a novel method that harnesses and combines the strengths of these two techniques to better analyze the features and the sentiments toward them in user reviews. To strike a good balance between reducing information overload and providing the original context expressed by review writers, the proposed system (1) allows users to interactively rank the entities based on feature ratings, (2) automatically highlights sentences that are related to relevant features, and (3) utilizes implicit crowdsourcing by encouraging users to provide correct labels for their own reviews to improve the feature-sentiment classifier. The proposed system not only helps users save the time and effort needed to digest the often massive amount of user reviews, but also provides real-time suggestions on relevant features and ratings as users generate their own reviews. Results from a simulation experiment show that leveraging the crowd can significantly improve the feature-sentiment analysis of user reviews. Furthermore, results from a user study show that the proposed interface was preferred by more participants than interfaces that use traditional noun-adjective pair summarization, as the current interface allows users to view feature-related information in the original context.

Author Keywords

Human computation, crowdsourcing, interactive machine

learning, sentiment analysis, user generated content

ACM Classiﬁcation Keywords

H.5.2 [Information interfaces and presentation]: User

Interfaces.

INTRODUCTION

With the rapid success of Web 2.0 technologies, user-generated content has become a major source of online information. One of the most notable examples is online reviews. Millions of people now write reviews on websites like Yelp or Amazon to express their opinions regarding different restaurants or products. These user-generated reviews can help others make wiser decisions.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

IUI’13, March 19-22, 2013, Santa Monica, CA, USA

However, extensive amounts of user-generated reviews are difficult for people to digest, creating a typical problem of information overload. There are two potentially conflicting goals when designing systems that leverage user-generated reviews. First, the system should mitigate information overload by summarizing important information for the users; second, the system should allow users to express their experiences in their own words while at the same time allowing others to read their reviews in context. With the proposed system, we aim to provide a balance between the two.

Summarizing the information in user reviews based on re-

lated features and sentiment has proven to be an effective

way to help users to digest the massive amount of informa-

tion more efﬁciently. For example, Review Spotlight [33]

performs feature-sentiment analysis and presents the results

using noun-adjective pairs. Yatani et al. [33] showed that this

interface can help users make decisions signiﬁcantly faster,

which demonstrates that feature-sentiment information can

help users digest user-generated reviews more efﬁciently.

The feature-sentiment analysis in existing intelligent inter-

faces [13, 18, 33] typically incorporates two steps. First, it

ﬁnds the features by identifying nouns with high frequency

counts in the text. Second, after determining the features, it

uses the adjective near each feature and a predeﬁned glossary

(e.g., SentiWordNet [9]) to determine the feature’s orienta-

tion. However, researchers have pointed out that this unsuper-

vised learning approach has two major disadvantages, which

we will summarize below.

First, as mentioned in [12], this kind of analysis typically can

discover only features that are explicitly discussed in the con-

tent. For example, consider the following sentence:

“While light, it will not easily ﬁt in pockets.”

This sentence is related to the size of a product; however, it is very difficult for this unsupervised learning approach to discover the feature because the word “size” does not appear in the sentence [12]. This greatly undermines the utility of the algorithm, since many opinions expressed in user-generated reviews are implicit and tend to elude discovery by unsupervised learning methods.

The second disadvantage is that this approach often makes mistakes in selecting useful features and deciding the sentiment orientation of the features. For instance, as mentioned in [33], “impeccable”, which should be a positive word, has a very high negative score in their system. Illustrating similar mistakes, some users of that system noted that the features it presented did not make much sense [33]. Such errors may lower users' motivation to use a system because they perceive it to be unreliable.

One way to address these problems is to use labeled data

and supervised learning to ﬁnd the hidden features in the re-

views. Supervised learning has demonstrated better perfor-

mance than unsupervised methods in general [3]. Moreover,

this approach can identify sentences related to a feature even

if the feature itself is not in the sentence because it uses more

than a single term in the sentences to classify. However, su-

pervised methods are difﬁcult, if not impossible, to imple-

ment in existing interfaces because it is difﬁcult to motivate

users to create a label for each sentence they write. Therefore,

one big challenge for this approach is collecting labeled data.

Finding a way to motivate users to provide labels is key to the

success of this supervised approach.

To address the questions discussed above, we propose a

novel intelligent interface that collects training data directly

from users as they generate reviews – a concept often called

implicit crowdsourcing. The system we built can perform

feature-sentiment analysis in nearly real-time. As a result,

it can provide predictions while a user is writing the review.

If the user sees the prediction is wrong, the user can sim-

ply click the icons on the interface to correct the erroneous

prediction. Therefore, instead of providing labels for every

sentence in the review, users need only to correct some mis-

takes made by the system, which greatly reduces the effort

necessary to provide feature-sentiment information. In ad-

dition, users are more motivated to provide labels because

these labels are related to the accuracy of their own reviews.

The collected data can be used as new training instances for

the classiﬁer. Moreover, increasing the number of training

instances raises the coverage of the supervised learning algo-

rithms [14]. Therefore, leveraging the crowd to collect user-

generated labels allows the classiﬁers to provide more accu-

rate predictions as the number of users in the system grows.

To preview our results, the experiment shows that our supervised classifiers can achieve much higher F1 scores than baseline models that discover feature-sentiment information using unsupervised methods.

Another drawback of many existing intelligent interfaces is

that the feature-related information is summarized in a very

compressed form (e.g., noun-adjective pairs). Although this

has the advantage of allowing users to retrieve the feature-

related information in the reviews more quickly, it also de-

stroys the original context of the reviews, which often contain

more than pure information, such as social cues, personal ex-

pression, etc. In contrast, the proposed system provides a

highlighting function that highlights the feature-related infor-

mation in its original context. This function integrates into

the traditional review reading experience and creates a good

balance between focusing on feature-related information and

understanding the full reviews. In this study, we compared

the two designs (i.e., noun-adjective pairs summarization and

highlighting) to observe what users liked or disliked about

these two types of systems. To preview our results, we did

ﬁnd that most users preferred to see the original context rather

than the purely summarized features, thus providing support

to the design.

Overview of the paper

In the rest of the paper, we ﬁrst will review related work on

how crowdsourcing can be used to assist supervised learning

and feature-sentiment analysis of user reviews. We will then

describe the current system and how it differs from previous

ones. Then we will perform two sets of evaluation. First, we

will present results from a simulation experiment to demon-

strate how the system can outperform previous systems that

utilize unsupervised learning, and how the system can im-

prove over time as more user-generated labels are collected

to improve the classiﬁer in the current system. Second, we

will present results from a user study that tested whether re-

view readers and writers would like the features of the current

system. Speciﬁcally, we tested the extent to which review

readers would like to see the context of user reviews instead

of merely summarized reviews, and we tested whether the

real-time feedback would encourage review writers to pro-

vide labels for their own reviews. Finally, we will discuss the

implication of our results for the design of systems that rely

on user-generated content in general, as well as the future di-

rection of the current research.

RELATED WORK

Enhancing machine learning algorithms by collecting labeled data from crowdsourcing

Crowdsourcing has proven to be an effective way to solve various kinds of problems [11]. Individuals can easily recruit online workers from crowdsourcing platforms like Amazon Mechanical Turk (AMT)¹ to solve problems that are difficult for digital computers (e.g., text editing [1] and answering visual queries [2]) at a very low cost. These examples have merely begun to demonstrate the potential of crowdsourcing as a social computing technique that can be applied in a wide range of situations.

One of the most notable applications of crowdsourcing is collecting labeled data to improve the performance of machine learning algorithms. Von Ahn and Dabbish [29] pioneered the field by developing the ESP game, which recruits people to generate image labels while playing an online game. By 2008, this game had recruited 200,000 players and collected more than 50 million labels [30]. The collected labels were then used to improve Google image search. Other games [10, 31] and methods [22, 25, 26] also have been proposed to collect high-quality image labels to train computer vision algorithms.

¹ https://www.mturk.com

Figure 1. When reading reviews, users can highlight sentences that are related to the aspect that interests them with a single click.

In addition to image labels, crowdsourcing also has been used

to collect training data for natural language processing. Snow

et al. [24] studied how the labeled data generated by the

non-experts from AMT can be more cost-effective than those

generated by experts. They also showed that the labels col-

lected from crowdsourcing could successfully improve ma-

chine learning algorithms. There are also various workshops

in NAACL [4], SIGIR [17], and WSDM [16] that aim at uti-

lizing crowdsourcing to generate labeled data that are useful

to data mining and information retrieval.

Instead of explicitly recruiting crowd workers to generate labeled data to enhance machine learning, Nichols et al. [19] proposed implicit crowdsourcing, a method that directly collects data generated by the users. This differs from traditional crowdsourcing, which pays workers recruited from AMT to create labeled data, and it allows the system to collect more data as the number of users increases. For example, they [19] found that by collecting status updates posted to Twitter, the system can successfully generate meaningful summaries of sporting events. This design is especially useful for supervised learning because the size of the training data is essential to the performance of the algorithms [14]. Besides summarizing data that have already been posted, the intelligent interface we built can generate predictions in real time and involve users in correcting the errors made by the system. As a result, our interface further uses artificial intelligence to help users generate labeled data more easily. To the best of our knowledge, this is a novel concept that has not yet been explored.

Intelligent interfaces for user reviews

Since user reviews contain much valuable information, many researchers have proposed different methods to analyze the features and sentiments expressed in user reviews [20]. Hu and Liu [12] used the minimum support of association rules to identify frequent terms and phrases in user reviews as features. Many researchers [7, 21, 28, 32] also studied how to use machine learning algorithms to classify the sentiments expressed in user-generated reviews.

Recently, many intelligent interfaces have been developed to help users understand feature-sentiment information in a huge amount of user reviews. Liu et al. [18] implemented Opinion Observer, an interface that uses bar charts to present the positive and negative sentiments of each feature. Carenini et al. [5] constructed a treemap interface for users to interactively explore the information that interests them. In addition, they also designed a novel visualization for users to compare the feature-sentiment information between different entities [6]. Yatani et al. [33] developed Review Spotlight to present feature-sentiment information in user reviews using noun-adjective pairs in a tag cloud. Huang et al. [13] further grouped similar features together to display feature-sentiment information in a more concise format. The biggest difference between the intelligent interface of our system and the existing ones is that, instead of summarizing the information using noun-adjective pairs, our system presents the information by highlighting the feature-related sentences in their original context. This allows the users to focus on feature-related information while still having the opportunity to explore other information in the reviews.

Figure 2. When browsing entities, the system allows users to sort them (e.g., restaurants) using their ratings of different features.

Moreover, intelligent interfaces also have been used to assist review writing. Dong et al. [8] developed Reviewer’s Assistant, a browser plug-in that identifies sentences written by previous users that the current user might also want to use and recommends them to the user. Their study showed that the system could suggest sentences that were actually written by the users. The current system also has an intelligent interface for review writing. Nevertheless, we aim at collecting feature-sentiment information that can be helpful to the readers instead of assisting reviewers in generating user reviews.

SYSTEM DESIGN AND IMPLEMENTATION

The proposed system incorporates two functions that help readers digest user reviews. First, it allows its users to highlight sentences based on the features they are interested in with a single click, which greatly reduces the amount of information that a user must read (Figure 1). Second, users of the system can rank the entities based on their feature ratings, which are inferred directly from the contents (Figure 2).

Figure 3. Users can click the icons on the interface to correct the erroneous predictions made by the system while composing reviews. For example, the cross sign near each sentence allows users to cancel the predictions made by the system. In addition, users can also click the stars to change feature ratings.

To collect the information needed to perform the two functions mentioned above, the system also has a novel intelligent interface that conducts feature-sentiment analysis in real time. When a user is writing a review using the system, whenever the user finishes a sentence, the system provides real-time predictions about the feature(s) related to the sentence and the star ratings of those features. If the user feels that the predictions made by the system are wrong, the user can simply click the icons on the interface to correct the errors (Figure 3). A graphical representation of the design of our system is shown in Figure 4.

Data and features of the current system

The data used in the current system were retrieved from Yelp’s Academic Dataset², which consists of 87,173 reviews of restaurants near 30 schools. In this study, we used three pre-defined features: food, service, and price. However, the data and features of the system can easily be altered or expanded and are not limited to the current settings.

Supervised two-layer feature-sentiment analysis

To discover the related features of each sentence and the in-

ferred ratings of the features, a supervised two-layer feature-

sentiment analysis was conducted on each review in the cor-

pus. A graphical representation of the ﬂow of the two-layer

analysis is shown in Figure 5.

² http://www.yelp.com/academic_dataset

Figure 4. When a user is writing a review using our novel intelligent interface, the system performs a real-time feature-sentiment analysis. The user can easily correct the erroneous predictions made by the system. The corrected data are then used to improve the classifiers, which provide useful information that helps other users digest the reviews.

The first layer of the analysis is the sentence feature classification. In this layer, the system decides whether a sentence is related to a target feature. To initialize the classifiers of the system, we collected 5,000 labeled sentences by recruiting 194 workers from Amazon Mechanical Turk at a cost of $9.70 (from 4/11/2012 to 4/21/2012). The workers were asked to label whether a sentence was related to any of the three pre-defined features or to none of them (a sentence can be related to more than one feature). These labels are used to simulate the data generated by the initial users of our system. We preprocessed the text by stemming and removing stop words, and we converted these labeled sentences into unigram feature vectors. Then, we used the SVMlight package³ to train the classifiers on these feature vectors. If a sentence was related to a certain feature, its associated feature vector was treated as a positive training instance for the classifier of that feature. In contrast, if the sentence was not related to that feature, its feature vector was used as a negative training instance. Finally, sentences that did not have labels in the corpus were converted into unigram feature vectors, and the system used the classifiers trained on the labeled sentences to classify the features related to the unlabeled sentences.

The second layer of the analysis is the feature-rating prediction, which predicts the star ratings of different features in each review. To construct the classifiers, we utilized the reviews and their associated star ratings in Yelp’s academic dataset. First, the reviews with 4 or more stars were used as positive training instances and reviews with fewer than 4 stars were used as negative training instances. These reviews were converted to feature vectors and were used to train a positive-negative classifier using SVMlight. By a similar procedure, we built a 4-5-star classifier and a 2-3-star classifier⁴. These classifiers allowed us to predict the overall rating of each review. For example, if a review was predicted as positive (more than 3 stars) by the positive-negative classifier, the system would then run the 4-5-star classifier to decide whether it should be classified as a 4-star review or a 5-star review.
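The cascade of binary classifiers described above can be sketched as a small decision function. The three classifier arguments are hypothetical callables standing in for the trained SVMs; how each SVM's output is thresholded is an assumption here, not something the paper states.

```python
def predict_stars(features, is_positive, is_five_star, is_three_star):
    """Cascade of three binary classifiers returning a rating in {2, 3, 4, 5}.

    is_positive   : stand-in for the positive-negative classifier (>= 4 stars)
    is_five_star  : stand-in for the 4-5-star classifier
    is_three_star : stand-in for the 2-3-star classifier
    """
    if is_positive(features):
        return 5 if is_five_star(features) else 4
    # The 2-star category also covers 1-star reviews, as in the paper.
    return 3 if is_three_star(features) else 2


# Toy stand-ins that read a single hypothetical score in [0, 1].
rating_high = predict_stars(0.9, lambda x: x >= 0.5, lambda x: x >= 0.8, lambda x: x >= 0.3)
rating_low = predict_stars(0.1, lambda x: x >= 0.5, lambda x: x >= 0.8, lambda x: x >= 0.3)
```

The second-stage classifier that runs depends on the first-stage decision, so each prediction costs two classifier evaluations rather than three.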

³ http://svmlight.joachims.org/

⁴ We grouped 1-star reviews into the 2-star category because only a very small portion of the reviews are 1-star.

Figure 5. The flow of the two-layer feature-sentiment analysis. The system first classifies the sentences by their related features. The sentences that are related to the same feature are then grouped together and used to predict the rating of the feature.

Equipped with the rating classifiers, the system then predicts the feature ratings of each review by classifying the star rating using only the contents that are related to a feature (based on the first-layer analysis). For instance, when predicting the service rating of a review, the system first finds the sentences related to service using the service-feature classifier. Then, these sentences are grouped together and classified by the star-rating classifiers. Finally, the output of the classifiers is the service rating of the review.
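A minimal sketch of how the two layers fit together is given below. The keyword-based feature detectors and the toy rating function are purely hypothetical stand-ins for the trained sentence-feature classifiers and the star-rating cascade; only the grouping logic reflects the procedure described in the text.

```python
def predict_feature_ratings(sentences, feature_classifiers, rate_text):
    """Layer 1: pick the sentences related to each feature.
    Layer 2: rate the grouped sentences of each feature as one text."""
    ratings = {}
    for feature, is_related in feature_classifiers.items():
        related = [s for s in sentences if is_related(s)]
        if related:  # features never mentioned get no rating
            ratings[feature] = rate_text(" ".join(related))
    return ratings


# Hypothetical stand-ins for the trained classifiers.
toy_feature_classifiers = {
    "food": lambda s: "food" in s.lower() or "pasta" in s.lower(),
    "service": lambda s: "waiter" in s.lower() or "fast" in s.lower(),
    "price": lambda s: "price" in s.lower() or "cheap" in s.lower(),
}


def toy_rate_text(text):
    return 5 if "superb" in text.lower() else 3


review = [
    "The food is superb and it comes out pretty fast.",
    "Parking was hard to find.",
]
ratings = predict_feature_ratings(review, toy_feature_classifiers, toy_rate_text)
```

Because a sentence can be related to more than one feature, the first sentence contributes to both the food and the service rating, while price receives no rating for this review.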

Collecting corrected feature-sentiment information from

users

To ensure the accuracy of the analysis made by the system and to collect more training data to improve the performance of the classifier, the system provides an interface that performs real-time feature-sentiment analysis while a user composes a review. Whenever the user finishes a sentence, the web-based system sends it to the server using AJAX. When the server receives the sentence, a Python script performs the text preprocessing and converts the sentence into a unigram feature vector that can be processed by the classifiers. Then the system conducts the two-layer feature-sentiment analysis using the SVMlight package. Finally, the result of the analysis is sent back to the interface on the client side. The whole analysis can be performed within one second, including network latency, so to the user the analysis seems to occur in real time.

If the predictions are wrong, the user can simply click the

icons next to the predictions on the interface to correct them.

For example, if a sentence is mistakenly classiﬁed as a sen-

tence that is related to price but the user judges that it is not,

the user can use the cross sign near the sentence to cancel

this prediction and assign the sentence to other categories, or

not to any existing category. If, for example, the sentence is

judged to be related to food, the user can click the icon that

represents food near the sentence to label it. Furthermore, if

the star-rating predictions are wrong, the user can click the

stars on the interface to change the ratings.
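One way the corrections described above could be turned into new training instances is sketched below. The function names and the data layout are assumptions made for illustration; the paper does not describe how corrected labels are stored.

```python
FEATURES = ("food", "service", "price")


def apply_corrections(predicted, cancelled, added):
    """Final feature set for a sentence after the user's clicks.

    predicted : features the system predicted
    cancelled : predictions the user removed with the cross sign
    added     : features the user assigned by clicking their icons
    """
    return (set(predicted) - set(cancelled)) | set(added)


def to_training_instances(sentence, final_features):
    """One labeled instance per feature classifier: +1 if related, -1 otherwise."""
    return {f: (sentence, 1 if f in final_features else -1) for f in FEATURES}


# The system predicted "price"; the user cancels it and labels the sentence as "food".
sentence = "The pasta was delicious."
final = apply_corrections(predicted={"price"}, cancelled={"price"}, added={"food"})
instances = to_training_instances(sentence, final)
```

Predictions the user leaves untouched are kept as they are, so a single click yields a full set of labels for the sentence.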

EXPERIMENTS

To evaluate whether our proposed method really improves feature-sentiment analysis, we manually labeled the related features (i.e., food, service, and price) of 1,000 sentences in the dataset and used these as a gold-standard test dataset. The ability of our proposed design and of the baseline models to discover the related features was evaluated on this dataset.⁵ In this study, we focused only on the ability of the proposed method to classify feature-related sentences and left evaluation of its ability to generate accurate feature ratings for future work. Specifically, we wanted to test two hypotheses:

H1: The supervised classiﬁer in our design can achieve

higher performance than a traditional unsupervised approach

can.

H2: More training data improves the performance of the clas-

siﬁer.

Two experiments were performed to test the hypotheses. The

details of these experiments are described below.

Experiment I: Comparisons between supervised and unsupervised methods

In this experiment, we compared the proposed supervised

method to a baseline model with unsupervised learning to test

if H1 is true.

Method

Three sentence-feature classifiers (food, service, and price) were trained on 5,000 sentences labeled by AMT workers. We performed text preprocessing, which included stemming and removing stop words from the sentences. The sentences were then converted to feature vectors using a unigram model. Finally, we used SVMlight to train the classifiers on these feature vectors.

In addition, we built a baseline model similar to the ones in [13, 18, 33]. This unsupervised model used the 434,664 sentences in the full data set, which is much larger than the 5,000 labeled sentences used in the supervised method. To identify the frequent features in the sentences, we first used the part-of-speech tagging function in NLTK⁶ to find all the nouns and adjectives in the sentences. Then, we performed the same text preprocessing as in the supervised method. After that, the nouns (after stemming) that appeared in more than 1% of the total sentences were selected as the features. This threshold is the same as the minimum support used in [18]. We also tried thresholds of 0.1%, 0.5%, and 2%, but there were no significant differences in the results. Therefore, only the results of the 1% threshold are reported. After the features were selected, the closest adjective to each feature was considered the one that described the feature. We then grouped the features using the Kullback-Leibler divergence [15] between the adjectives that described the features, which is the feature grouping method used in [13]. Finally, the top (5, 10, 20, 30) closest features to food, service, and price were assigned to them as sub-features. If one of the sub-features occurred in a target sentence, the sentence was classified as related to the main feature (food, service, or price).

⁵ We did not use the labels retrieved from AMT to evaluate the classifiers because we found that they contain many low-quality labels.

⁶ http://nltk.org/

Figure 6. The comparisons between the F1 scores of the supervised method and the best performance of the unsupervised methods. The results show that the supervised method can achieve much better F1 scores than the unsupervised methods in all three features tested.
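Two steps of the unsupervised baseline described above, selecting frequent nouns by a minimum-support threshold and comparing features by the Kullback-Leibler divergence of their adjective distributions, can be sketched as follows. The input is assumed to be already POS-tagged with Penn Treebank tags (noun tags start with "NN"), and the add-alpha smoothing is an assumption introduced so that the divergence stays finite; the paper does not say how zero counts were handled.

```python
import math
from collections import Counter


def frequent_nouns(tagged_sentences, min_support=0.01):
    """Nouns that occur in more than `min_support` of all sentences."""
    sentence_freq = Counter()
    for sentence in tagged_sentences:
        nouns = {word.lower() for word, tag in sentence if tag.startswith("NN")}
        sentence_freq.update(nouns)  # count each noun once per sentence
    total = len(tagged_sentences)
    return {w for w, c in sentence_freq.items() if c / total > min_support}


def kl_divergence(p_counts, q_counts, vocab, alpha=1.0):
    """KL(P || Q) over `vocab`, with add-alpha smoothing (an assumption)."""
    p_total = sum(p_counts.get(w, 0) for w in vocab) + alpha * len(vocab)
    q_total = sum(q_counts.get(w, 0) for w in vocab) + alpha * len(vocab)
    divergence = 0.0
    for w in vocab:
        p = (p_counts.get(w, 0) + alpha) / p_total
        q = (q_counts.get(w, 0) + alpha) / q_total
        divergence += p * math.log(p / q)
    return divergence


# Hypothetical tagged sentences and adjective counts for illustration.
tagged = [
    [("food", "NN"), ("great", "JJ")],
    [("service", "NN"), ("slow", "JJ")],
    [("food", "NN"), ("cold", "JJ")],
]
features = frequent_nouns(tagged, min_support=0.5)

adjectives = ["great", "cold", "slow"]
food_adj = Counter({"great": 1, "cold": 1})
service_adj = Counter({"slow": 1})
distance = kl_divergence(food_adj, service_adj, adjectives)
```

A smaller divergence means two nouns are described by similar adjectives, so candidates with the smallest divergence to food, service, or price would be taken as their sub-features.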

Evaluation

We evaluated the systems on a 1,000-sentence test set manually labeled by the authors. The precision, recall, and F1 score were calculated using the formulas below:

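\[
\text{precision} = \frac{\#\ \text{feature-related sentences classified correctly}}{\#\ \text{sentences classified as related to the feature}}
\]

\[
\text{recall} = \frac{\#\ \text{feature-related sentences classified correctly}}{\#\ \text{feature-related sentences}}
\]

\[
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]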
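The three formulas above translate directly into code. The sketch below assumes that predictions and gold labels for one feature are given as equal-length lists of booleans, one entry per test sentence; that representation is an assumption made for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(predicted, gold):
    """Score one feature classifier against the gold-standard labels."""
    true_pos = sum(1 for p, g in zip(predicted, gold) if p and g)
    n_predicted = sum(1 for p in predicted if p)
    n_gold = sum(1 for g in gold if g)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy example: two sentences predicted as related, one of them correctly.
scores = precision_recall_f1(
    predicted=[True, True, False, False],
    gold=[True, False, True, False],
)

# F1 recomputed from the supervised precision and recall reported for food.
food_f1 = f1_score(0.8575, 0.7050)
```

Plugging in the supervised precision and recall reported for food (85.75% and 70.50%) gives an F1 of about 77.38%, which agrees with the reported F1 score.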

We used the F1 scores to evaluate the performance of the classifiers because F1 is the harmonic mean of precision and recall. Since there is a trade-off between precision and recall, F1 scores can evaluate the results more fairly [23]. The precision, recall, and F1 scores of the supervised method and the unsupervised methods with various numbers of sub-features are summarized in Tables 1, 2, and 3.

          5 feat.   10 feat.  20 feat.  30 feat.  supervised
Food      47.67%    51.69%    53.78%    51.99%    85.75%
Service   18.47%    18.39%    16.59%    17.09%    90.00%
Price     24.55%    16.29%    13.83%    11.91%    80.95%

Table 1. Precision of the supervised method and the unsupervised methods with different numbers of sub-features

          5 feat.   10 feat.  20 feat.  30 feat.  supervised
Food      30.09%    41.63%    61.09%    68.10%    70.50%
Service   34.33%    41.04%    50.75%    65.67%    47.01%
Price     31.68%    35.64%    49.50%    63.28%    40.50%

Table 2. Recall of the supervised method and the unsupervised methods with different numbers of sub-features

          5 feat.   10 feat.  20 feat.  30 feat.  supervised
Food      36.89%    46.12%    57.20%    58.96%    77.38%
Service   24.02%    25.40%    25.01%    27.12%    61.76%
Price     30.60%    24.51%    22.22%    20.00%    53.99%

Table 3. F1 scores of the supervised method and the unsupervised methods with different numbers of sub-features

The results show that the proposed supervised method achieved much higher F1 scores in classifying feature-related sentences (Figure 6). Looking more carefully at the precision and recall of each classifier, we see that the supervised method achieves much higher precision than the unsupervised methods. In addition, although adding sub-features to the unsupervised methods allows them to outperform the supervised method in recall for two of the three features, their precision becomes unacceptably low (around 15%). The reason is that the unsupervised methods include many general terms as sub-features, which are not very helpful for the classification task. In contrast, the supervised method uses the unigram feature vector to determine whether a sentence is related to a feature. Therefore, the classification result is not dominated by any single term. In addition, the supervised method can discover some hidden patterns in sentences. For instance, consider the following:

“The food is superb and it comes out pretty fast.”

This sentence is related to both food and service. An unsupervised method can discover only food, because it is the only feature mentioned explicitly in the sentence. The supervised method used in our system, on the other hand, can capture both features because “fast” and “come” both carry meanings related to service, which is the hidden feature of this sentence. The supervised method can learn these implicit cues (e.g., fast and come) and discover the hidden feature if it is trained on a sufficiently large amount of labeled data. As a result, the supervised method achieves higher F1 scores. The results therefore support H1: using supervised learning to train the classifier results in better performance than that achieved by traditional unsupervised methods. In particular, the current method substantially improves precision because many hidden variables that define the categories in user reviews cannot be easily identified by unsupervised methods.
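The contrast described above can be sketched in a few lines of Python. This section does not restate which learning algorithm the system uses, so the sketch substitutes a small multinomial naive Bayes model over unigrams purely for illustration. The keyword lists, the six training sentences, and all function names are made-up assumptions.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into unigram tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


# Hypothetical sub-feature keywords for a keyword-matching baseline.
KEYWORDS = {"food": {"food", "pasta", "dessert"},
            "service": {"service", "waiter", "staff"}}


def keyword_features(sentence):
    """Return the features whose keywords appear explicitly in the sentence."""
    tokens = set(tokenize(sentence))
    return {feature for feature, words in KEYWORDS.items() if tokens & words}


# Tiny made-up training set for a "service" classifier (True = related).
TRAIN = [
    ("our waiter was fast and friendly", True),
    ("the food comes out quickly", True),
    ("slow service and rude staff", True),
    ("the pasta was delicious", False),
    ("prices are too high", False),
    ("great food and tasty dessert", False),
]


def train_naive_bayes(examples):
    """Count unigrams per class for a multinomial naive Bayes model."""
    counts = {True: Counter(), False: Counter()}
    n_docs = Counter()
    for sentence, label in examples:
        counts[label].update(tokenize(sentence))
        n_docs[label] += 1
    vocab = set(counts[True]) | set(counts[False])
    return counts, n_docs, vocab


def log_odds(sentence, model):
    """Log-odds that the sentence is related to the feature (add-one smoothing)."""
    counts, n_docs, vocab = model
    score = math.log(n_docs[True] / n_docs[False])
    for token in tokenize(sentence):
        if token not in vocab:
            continue  # ignore words never seen in training
        for label, sign in ((True, 1), (False, -1)):
            total = sum(counts[label].values()) + len(vocab)
            score += sign * math.log((counts[label][token] + 1) / total)
    return score


sentence = "The food is superb and it comes out pretty fast."
print(keyword_features(sentence))      # features named explicitly in the sentence
model = train_naive_bayes(TRAIN)
print(log_odds(sentence, model) > 0)   # whether unigram evidence points to service
```

In this toy setting the keyword matcher reports only food, while the unigram model assigns positive log-odds to service because words such as "comes", "out", and "fast" occurred only in the service-related training sentences.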

Experiment II: Comparisons between the supervised methods with various training data sizes

In this second experiment, we varied the size of the data used to train the supervised classifiers to see whether H2 is supported.

Method

We trained the classifiers on 0.5K, 1K, 2K, 3K, 4K, and 5K labeled sentences. These subsets of labeled sentences were selected randomly from the 5,000 labeled sentences collected from AMT, as described in Experiment I.
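The sampling step can be sketched as follows. The text does not say whether the smaller subsets were nested inside the larger ones, so this sketch assumes each subset is drawn independently. The placeholder pool, the seed, and the function name are illustrative assumptions.

```python
import random

# Stand-in for the 5,000 crowd-labeled sentences (hypothetical placeholder data).
labeled_pool = [(f"sentence {i}", i % 2 == 0) for i in range(5000)]

SIZES = [500, 1000, 2000, 3000, 4000, 5000]


def draw_training_sets(pool, sizes, seed=0):
    """Draw one random subset of the pool for each requested size."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    return {size: rng.sample(pool, size) for size in sizes}


training_sets = draw_training_sets(labeled_pool, SIZES)
for size, subset in training_sets.items():
    print(size, len(subset))
```

Each subset would then be used to train one classifier per feature, and every classifier would be scored on the same held-out test set.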

Figure 7. F1 scores of the feature classifiers trained on various sizes of data. The results show a very high positive correlation between the size of the training data and the performance of the classifiers.

Evaluation

The precision, recall, and F1 scores of the feature classifiers trained on various numbers of labeled sentences are summarized in Tables 4, 5, and 6.

| Feature | 500 | 1000 | 2000 | 3000 | 4000 | 5000 |
|---|---|---|---|---|---|---|
| Food | 85.02% | 82.13% | 81.37% | 82.90% | 84.58% | 85.75% |
| Service | 95.00% | 68.83% | 70.93% | 86.67% | 85.53% | 90.00% |
| Price | 92.59% | 96.15% | 87.10% | 78.38% | 79.49% | 80.95% |

Table 4. Precision of the supervised method for various sizes of training data

| Feature | 500 | 1000 | 2000 | 3000 | 4000 | 5000 |
|---|---|---|---|---|---|---|
| Food | 51.13% | 64.19% | 59.01% | 72.07% | 74.10% | 70.50% |
| Service | 28.36% | 39.55% | 45.52% | 48.51% | 48.51% | 47.01% |
| Price | 24.75% | 24.75% | 26.73% | 28.71% | 30.69% | 40.50% |

Table 5. Recall of the supervised method for various sizes of training data

| Feature | 500 | 1000 | 2000 | 3000 | 4000 | 5000 |
|---|---|---|---|---|---|---|
| Food | 63.86% | 72.06% | 68.41% | 77.11% | 78.99% | 77.38% |
| Service | 43.68% | 50.23% | 55.45% | 62.20% | 61.91% | 61.76% |
| Price | 39.06% | 39.37% | 40.91% | 42.03% | 44.28% | 53.99% |

Table 6. F1 scores of the supervised method for various sizes of training data

The results show that the performance of the classifiers is clearly positively correlated with the size of the training data. The Pearson correlations between the size of the training data and the F1 scores of the feature classifiers for food, service, and price are 0.85, 0.90, and 0.89, respectively. We also found that the improvements of the classifiers come mostly from recall: the Pearson correlations between the size of the training data and the recall of the feature classifiers for food, service, and price are 0.81, 0.79, and 0.91, respectively. This shows that more training data helps the classifiers discover more hidden patterns in the data, which supports H2: more training instances enhance the performance of the feature classifiers. It also implies that as more users use the system, the classifiers behind it can generate more accurate predictions, because more training data can be collected from users.
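The F1 correlations can be checked directly against Table 6. The short Python sketch below computes the Pearson correlation between training-set size and F1 score for each feature. The helper name `pearson` is an illustrative choice, and the numbers are copied from the table.

```python
import math

SIZES = [500, 1000, 2000, 3000, 4000, 5000]
F1_BY_FEATURE = {  # F1 scores (%) copied from Table 6
    "food": [63.86, 72.06, 68.41, 77.11, 78.99, 77.38],
    "service": [43.68, 50.23, 55.45, 62.20, 61.91, 61.76],
    "price": [39.06, 39.37, 40.91, 42.03, 44.28, 53.99],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


correlations = {name: pearson(SIZES, scores)
                for name, scores in F1_BY_FEATURE.items()}
for name, r in correlations.items():
    print(f"{name}: r = {r:.2f}")
```

With only six training sizes per feature, these coefficients describe the trend in this experiment and should not be read as a precise estimate of how performance scales.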

Summary

To summarize, our experiments show that the proposed supervised method is a better mechanism for classifying sentences by their related features than traditional unsupervised methods. Moreover, increasing the amount of training data further improves the performance of the supervised classifiers, which means that collecting labeled data from users effectively enhances the feature-sentiment analysis of the system.

USER STUDY

We conducted a user study to gain more insight into how our interface affects users and to answer the following three questions, which are important to our theses:

1. Do the feature-sentiment analysis and highlighting function of our system help users digest the information in user reviews?

2. Is the highlighting function better than existing intelligent interfaces that summarize feature-sentiment information by noun-adjective pairs?

3. Will users correct the erroneous predictions made by the system?

The first two questions address how our proposed interface helps readers digest user reviews, and the last question tells us whether the system can really collect more corrected feature-sentiment information when it is deployed in the wild.

Procedure

At the beginning of the study, the participants were asked to fill out a questionnaire that collected demographic information and asked how often they read and write user reviews. Then, we introduced our interface and its functions to the participants. After that, the participants were asked to read reviews about a restaurant using both our interface and Yelp’s website (http://www.yelp.com/). When they finished reading, they were asked whether the highlighting function of our interface helped them get useful information about the restaurant from the reviews.

Then, we introduced RevMiner (http://revminer.com/) [13], an interface that uses noun-adjective pairs to summarize the feature-sentiment information in the reviews for its users (Figure 8). After the participants became familiar with the interfaces, we let them compare the noun-adjective pair summarization with our highlighting function. We chose the RevMiner interface because it was one of the most recently developed at the time this article was written, and because it was publicly available on the Web. We should point out that our intention was not to directly compare our interface to this particular interface. Instead, we are interested in comparing the general designs of summarization using noun-adjective pairs (which is used primarily in RevMiner) and highlighting (which is used in our system).

Figure 8. Noun-adjective pairs visualization in RevMiner. We used RevMiner as an example to represent the interfaces that summarize feature-sentiment information using noun-adjective pairs.

Finally, we demonstrated the review-writing interface of our system and asked the participants to write a review about a restaurant they had recently visited using the interface. After they completed their reviews, we asked the participants whether they would use the interface to correct the feature-sentiment predictions when they were wrong (see footnote 9).

Participants

Thirteen college and graduate students (7 males and 6 females between the ages of 22 and 34) participated in this study. All of the participants read user reviews online at least once a month, and 8 of the 13 participants (62%) had experience writing user reviews online. The study lasted approximately 30 minutes.

USER STUDY RESULTS

Highlighting helped participants digest user reviews

When comparing our interface with traditional review websites, 11 of the 13 participants (85%) said that the highlighting function helped them understand more quickly the information contained in a massive amount of online reviews. One participant mentioned:

The highlighting function really helps me focus on the information I am interested in. I can get the information without spending time on the murmur of the reviewers.

Although two participants thought this function didn’t make a significant difference, this result demonstrates that the feature-sentiment analysis and highlighting function were perceived as helpful by the majority of the participants in digesting the large number of user reviews.

Footnote 9: The exact wording of our questions is listed in the appendix at the end of the paper.

More participants preferred highlighting over noun-adjective pairs summarization

When participants were asked to compare our interface (highlighting) to RevMiner (noun-adjective pairs), 6 of the 13 participants (46%) thought our interface was better, 3 of them (23%) thought RevMiner was better, and 4 of them (31%) thought it was a tie. The result shows that twice as many participants preferred the highlighting function as preferred noun-adjective pairs summarization. The reason was that the highlighting function allowed people to focus on the feature-sentiment information in its original context, which creates a good balance between focusing on specific feature-sentiment information and reading the whole review. In contrast, the noun-adjective pairs summarization allowed the participants to see only compressed and fragmented information, which was not easy for them to interpret. One of our participants noted:

I really like the first one (our interface) because it let me focus on some parts that I am interested in and I can also see its context. However, when reading reviews using the second one (RevMiner), I only see some short phrases. I have to first put them together to guess the meanings, so it takes more time and is hard to get the original context. This doesn’t help me learn the experience of the previous users.

The participants who favored noun-adjective pairs summarization preferred it mainly because it offered more features than the three pre-defined features in our interface. However, this issue can be addressed by including more features in our system, since the cost of adding new features is low.

Participants were motivated to correct erroneous predictions made by the system

After using our review-writing interface with real-time feature-sentiment analysis, 9 of the 13 participants (69%) said that they would correct errors made by the system. One of the participants mentioned,

When I write a review, I want to let others get my comments as clear as possible, so I care about the correctness of the information in the review and will correct the mistakes made by the system.

Another participant said,

Because I read reviews using this system before, I know that my effort can help others understand my review, so I will provide the information even if it causes some extra work for me.

This promising result suggests that the real-time feature-sentiment analysis does motivate users to correct erroneous predictions about their own reviews. This is important because the system can then collect more corrected labels to improve the classifier over time. We believe that users’ ability to see immediately how their reviews will be classified is an important feature that motivates them to provide a low-cost (one-click) correction of the automatic classification.

Of course, the current study cannot directly prove that most users really would correct the mistakes made by the system. Nevertheless, as the number of users increases, even if only a small portion of them provide feedback, the system can still benefit from the feedback and improve its classification accuracy.

Summary

To summarize, our user study answered the three questions related to our theses. First, the highlighting function provided by the proposed system can improve the user’s reading experience. Second, highlighting is at least as good as, if not better than, noun-adjective pair summarization because it preserves the original context of the feature-sentiment information as expressed by the review writers. Finally, the interface with real-time feature-sentiment analysis can motivate users to correct errors made by the system, so the classifiers behind the system can be improved as the number of users increases.

DISCUSSION

Reducing effort in order to motivate users to correct erroneous predictions

Our interface reduces the effort needed to provide feature-sentiment information by performing real-time analysis, so users only need to correct the mistakes made by the system instead of segmenting, labeling, and rating their reviews themselves. As a result, around 70% of the participants in our user study said that they would provide corrected feature-sentiment information. However, about 30% of the participants said that they would not correct the erroneous predictions made by the system. When asked why, they said the required effort needed to be reduced further. As one of the participants explained:

I know that correcting the errors can be helpful to my readers, but I think it’s just too much work for me. I need to click many icons there to make them right. That’s why I choose to just ignore the errors.

Therefore, it is important for us to design interfaces that allow users to correct errors more easily. In the future, we would like to experiment with different interface designs to determine how to motivate more users to provide corrected data. One possible way is to design an interface that allows users to drag the icons and sentences directly. On the other hand, given that most users said they would provide the labels and that the classifier benefits from this input, fewer and fewer corrections should be needed from future users as the classifier becomes more accurate.

Including more features in the proposed design

Our user study shows that about 23% of the participants preferred the intelligent interface that used noun-adjective pairs summarization (RevMiner). According to these participants, the biggest advantage of that interface is that it covers more features that interest them, whereas our system has only three pre-defined features. This problem can be solved easily by including more features, so it is not an inherent limitation of the system. Since the classifiers are efficient, adding more features will not cause technical problems. However, additional features may introduce a different problem. As mentioned earlier, keeping the required effort low is important for motivating users to provide correct labels, and adding more features may increase that effort, as users need to remember what the possible categories are. It may also make the interface more complex. Therefore, there is clearly a trade-off between providing more detailed categories and information and maximizing usability.

Real-time feature-sentiment analysis can encourage users to generate more structured reviews

Although our system was not intended to help users write higher-quality reviews, we did see that the real-time feature-sentiment analysis affected the reviews generated by the users. As one of the participants mentioned after she wrote a review using our interface:

The results of the (feature-sentiment) analysis let me know which part I haven’t mentioned in my review, so I will try to write some words that are related to that part.

In our user study, we also found that participants tried to write something related to each of the three pre-defined features before they finished. This suggests that the real-time analysis can encourage users to generate reviews that cover more features, which can improve the quality of the reviews collected by the system. A controlled experiment showing how the real-time analysis affects review quality is left for future work.

The upper bounds of classification accuracy

In our experiment, the food classifier reached its highest F1 score at 4,000 training instances and the service classifier reached its highest at 3,000 training instances; the F1 score of the price classifier, however, was still growing at 5,000 training instances, the largest size we tested. The intuitive explanation is that there were more positive training instances for food and service in our randomly chosen training dataset, so those classifiers learned faster initially.

In the future, it would be valuable to perform a study to determine the upper bounds of classification accuracy. Once a classifier reaches its upper bound, the system could stop asking users to provide feedback for that classifier. This can reduce the workload of the users of the system.
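A minimal sketch of such a stopping check, using the F1 scores from Table 6, is shown below. The rule itself (stop when the latest score no longer beats the best earlier score by a margin) and the `min_gain` threshold are hypothetical choices for illustration, not something evaluated in this paper.

```python
SIZES = [500, 1000, 2000, 3000, 4000, 5000]
F1_BY_FEATURE = {  # F1 scores (%) copied from Table 6
    "food": [63.86, 72.06, 68.41, 77.11, 78.99, 77.38],
    "service": [43.68, 50.23, 55.45, 62.20, 61.91, 61.76],
    "price": [39.06, 39.37, 40.91, 42.03, 44.28, 53.99],
}


def peak_size(sizes, scores):
    """Training-set size at which the highest score was observed."""
    return sizes[scores.index(max(scores))]


def has_plateaued(scores, min_gain=0.5):
    """Hypothetical stopping rule.

    True if the latest score improves on the best earlier score by
    less than `min_gain` percentage points.
    """
    if len(scores) < 2:
        return False
    return scores[-1] - max(scores[:-1]) < min_gain


for name, scores in F1_BY_FEATURE.items():
    print(name, peak_size(SIZES, scores), has_plateaued(scores))
```

On these numbers the rule flags the food and service classifiers and leaves the price classifier open for further feedback. A deployed system would likely need a longer window or repeated runs, since a single noisy measurement could trigger the rule too early.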

CONCLUSIONS AND FUTURE WORK

In this paper, we presented a novel intelligent system that performs a two-layer feature-sentiment analysis in real time. The system provides real-time predictions to users who are writing reviews, which makes it very easy for them to provide feature-sentiment information by simply correcting the erroneous predictions. Our user study shows that about 70% of the participants were willing to correct the mistakes made by the system, which suggests that the proposed interface can use the power of the crowd to collect a large amount of labeled data for training the supervised classifiers. Moreover, our experiment shows that the size of the training data is positively correlated with the performance of the feature-sentiment analysis. As a result, we can expect the analysis performed by the system to become more and more accurate as the number of system users increases.

In addition, we compared our system to existing intelligent interfaces with similar purposes. The results of our experiment show that the supervised method of our system achieves much higher F1 scores than traditional unsupervised methods. Moreover, our user study shows that 46% of the participants preferred the highlighting function of our interface over the noun-adjective pairs summarization, while only 23% of them preferred the summarization. This indicates that our system can provide more accurate feature-sentiment information and help users understand the information better than traditional interfaces with similar goals.

The results of our experiment show that implicit crowdsourcing can help improve supervised learning algorithms by collecting a large amount of training data at no cost. The mechanism used in the proposed design is not limited to user reviews; it can also be applied to other domains, such as status updates in social media or content in Q&A forums.

However, there are still some limitations to the current design. One of the most essential issues is to find ways to further reduce the effort necessary for users to provide useful information. Moreover, it is also important to find a good way to include more features, or even to let users input unspecified features themselves. We believe the work presented in this paper offers a good first step toward future studies that combine the strengths of intelligent interfaces and implicit crowdsourcing.

In the future, we would like to deploy our system in the wild to see whether it really can help users and to study how users interact with the system on a large scale. Furthermore, since the system involves its users in providing training data interactively, it is possible for us to include active learning [27] in our system design to further improve the performance of the supervised learning classifiers.

ACKNOWLEDGEMENTS

Many thanks to the anonymous reviewers for their valuable comments, which helped us improve the earlier draft of this work. We also would like to thank Siddharth Gupta for helping us implement part of the system used in this study.

REFERENCES

1. Bernstein, M. S., Little, G., Miller, R. C., Hartmann, B., Ackerman, M. S., Karger, D. R., Crowell, D., and Panovich, K. Soylent: a word processor with a crowd inside. In Proceedings of the 23rd annual ACM symposium on User interface software and technology, UIST ’10, ACM (New York, NY, USA, 2010), 313–322.

2. Bigham, J. P., Jayant, C., Ji, H., Little, G., Miller, A., Miller, R. C., Miller, R., Tatarowicz, A., White, B., White, S., and Yeh, T. Vizwiz: nearly real-time answers to visual questions. In Proceedings of the 23rd annual ACM symposium on User interface software and technology, UIST ’10, ACM (New York, NY, USA, 2010), 333–342.

3. Blei, D., and McAuliffe, J. Supervised topic models. In Advances in Neural Information Processing Systems 20, J. Platt, D. Koller, Y. Singer, and S. Roweis, Eds. MIT Press, Cambridge, MA, 2008, 121–128.

4. Callison-Burch, C., and Dredze, M. Creating speech and language data with amazon’s mechanical turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, CSLDAMT ’10, Association for Computational Linguistics (Stroudsburg, PA, USA, 2010), 1–12.

5. Carenini, G., Ng, R. T., and Pauls, A. Interactive multimedia summaries of evaluative text. In Proceedings of the 11th international conference on Intelligent user interfaces, IUI ’06, ACM (New York, NY, USA, 2006), 124–131.

6. Carenini, G., and Rizoli, L. A multimedia interface for facilitating comparisons of opinions. In Proceedings of the 14th international conference on Intelligent user interfaces, IUI ’09, ACM (New York, NY, USA, 2009), 325–334.

7. Dave, K., Lawrence, S., and Pennock, D. M. Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In Proceedings of the 12th international conference on World Wide Web, WWW ’03, ACM (New York, NY, USA, 2003), 519–528.

8. Dong, R., McCarthy, K., O’Mahony, M., Schaal, M., and Smyth, B. Towards an intelligent reviewer’s assistant: recommending topics to help users to write better product reviews. In Proceedings of the 2012 ACM international conference on Intelligent User Interfaces, IUI ’12, ACM (New York, NY, USA, 2012), 159–168.

9. Esuli, A., and Sebastiani, F. Sentiwordnet: A publicly available lexical resource for opinion mining. In Proceedings of the 5th Conference on Language Resources and Evaluation (LREC ’06) (2006), 417–422.

10. Ho, C.-J., Chang, T.-H., Lee, J.-C., Hsu, J. Y.-j., and Chen, K.-T. Kisskissban: a competitive human computation game for image annotation. SIGKDD Explor. Newsl. 12, 1 (Nov. 2010), 21–24.

11. Howe, J. The rise of crowdsourcing. Wired Magazine (06 2006).

12. Hu, M., and Liu, B. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’04, ACM (New York, NY, USA, 2004), 168–177.

13. Huang, J., Etzioni, O., Zettlemoyer, L., Clark, K., and Lee, C. Revminer: an extractive interface for navigating reviews on a smartphone. In Proceedings of the 25th annual ACM symposium on User interface software and technology, UIST ’12 (2012), 3–12.

14. Kearns, M. J., and Vazirani, U. V. An introduction to computational learning theory. MIT Press, Cambridge, MA, USA, 1994.

15. Kullback, S., and Leibler, R. On information and sufficiency. The Annals of Mathematical Statistics 22, 1 (1951), 79–86.

16. Lease, M., Carvalho, V., and Yilmaz, E., Eds. Proceedings of the Workshop on Crowdsourcing for Search and Data Mining (CSDM) at the Fourth ACM International Conference on Web Search and Data Mining (WSDM).

17. Lease, M., Carvalho, V., and Yilmaz, E., Eds. Proceedings of the ACM SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation (CSE 2010). Geneva, Switzerland, July 2010.

18. Liu, B., Hu, M., and Cheng, J. Opinion observer: analyzing and comparing opinions on the web. In Proceedings of the 14th international conference on World Wide Web, WWW ’05 (2005), 342–351.

19. Nichols, J., Mahmud, J., and Drews, C. Summarizing sporting events using twitter. In Proceedings of the 2012 ACM international conference on Intelligent User Interfaces, IUI ’12, ACM (New York, NY, USA, 2012), 189–198.

20. Pang, B., and Lee, L. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2, 1–2 (2008), 1–135.

21. Pang, B., Lee, L., and Vaithyanathan, S. Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10, EMNLP ’02, Association for Computational Linguistics (Stroudsburg, PA, USA, 2002), 79–86.

22. Rashtchian, C., Young, P., Hodosh, M., and Hockenmaier, J. Collecting image annotations using amazon’s mechanical turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, CSLDAMT ’10, Association for Computational Linguistics (Stroudsburg, PA, USA, 2010), 139–147.

23. Rijsbergen, C. J. V. Information Retrieval, 2nd ed. Butterworth-Heinemann, Newton, MA, USA, 1979.

24. Snow, R., O’Connor, B., Jurafsky, D., and Ng, A. Y. Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, Association for Computational Linguistics (Stroudsburg, PA, USA, 2008), 254–263.

25. Sorokin, A., and Forsyth, D. Utility data annotation with amazon mechanical turk. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW ’08. IEEE Computer Society Conference on (June 2008), 1–8.

26. Su, H., Deng, J., and Fei-Fei, L. Crowdsourcing annotations for visual object detection. In Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence (2012).

27. Tong, S., and Koller, D. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res. 2 (Mar. 2002), 45–66.

28. Turney, P. D. Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, Association for Computational Linguistics (Stroudsburg, PA, USA, 2002), 417–424.

29. von Ahn, L., and Dabbish, L. Labeling images with a computer game. In Proceedings of the SIGCHI conference on Human factors in computing systems, CHI ’04, ACM (New York, NY, USA, 2004), 319–326.

30. von Ahn, L., and Dabbish, L. Designing games with a purpose. Commun. ACM 51 (Aug. 2008), 58–67.

31. von Ahn, L., Liu, R., and Blum, M. Peekaboom: a game for locating objects in images. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’06, ACM (New York, NY, USA, 2006), 55–64.

32. Wang, H., Lu, Y., and Zhai, C. Latent aspect rating analysis on review text data: a rating regression approach. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’10, ACM (New York, NY, USA, 2010), 783–792.

33. Yatani, K., Novati, M., Trusty, A., and Truong, K. N. Review spotlight: a user interface for summarizing user-generated reviews using adjective-noun word pairs. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’11, ACM (New York, NY, USA, 2011), 1541–1550.

APPENDIX: QUESTIONNAIRE

Q1: When reading user reviews, do you prefer the first interface or the second interface? Why?

Q2: When reading user reviews, do you prefer the highlighting function of the second interface or the noun-adjective pair representation of the last interface? Why?

Q3: When writing user reviews on the review writing interface you just used, would you correct the erroneous predictions made by the system? Why or why not?

A Review of User Interface Design for Interactive Machine Learning

Article

Jun 2018

Interactive Machine Learning (IML) seeks to complement human perception and intelligence by tightly integrating these strengths with the computational power and speed of computers. The interactive process is designed to involve input from the user but does not require the background knowledge or experience that might be necessary to work with more traditional machine learning techniques. Under the IML process, non-experts can apply their domain knowledge and insight over otherwise unwieldy datasets to find patterns of interest or develop complex data-driven applications. This process is co-adaptive in nature and relies on careful management of the interaction between human and machine. User interface design is fundamental to the success of this approach, yet there is a lack of consolidated principles on how such an interface should be implemented. This article presents a detailed review and characterisation of Interactive Machine Learning from an interactive systems perspective. We propose and describe a structural and behavioural model of a generalised IML system and identify solution principles for building effective interfaces for IML. Where possible, these emergent solution principles are contextualised by reference to the broader human-computer interaction literature. Finally, we identify strands of user interface research key to unlocking more efficient and productive non-expert interactive machine learning applications.

Towards an Integrative Theoretical Framework of Interactive Machine Learning Systems

Conference Paper

Apr 2019

Interactive machine learning (IML) is a learning process in which a user interacts with a system to iteratively define and optimize a model. Although recent years have illustrated the proliferation of IML systems in the fields of Human-Computer Interaction (HCI), Information Systems (IS), and Computer Science (CS), current research results are scattered leading to a lack of integration of existing work on IML. Furthermore, due to diverging functionalities and purposes IML systems can refer to, an uncertainty exists regarding the underlying distinct capabilities that constitute this class of systems. By reviewing extensive IML literature, this paper suggests an integrative theoretical framework for IML systems to address these current impediments. Reviewing 2,879 studies in leading journals and conferences during the years 1966-2018, we found an extensive range of applications areas that have implemented IML systems and the necessity to standardize the evaluation of those systems. Our framework offers an essential step to provide a theoretical foundation to integrate concepts and findings across different fields of research. The main contribution of this paper is organizing and structuring the body of knowledge in IML for the advancement of the field. Furthermore, we suggest three opportunities for future IML research. From a practical point of view, our integrative theoretical framework can serve as a reference guide to inform the design and implementation of IML systems.

Human-in-the-loop machine learning: Reconceptualizing the role of the user in interactive approaches

Article

Full-text available

Apr 2024

Investigating audio data visualization for interactive sound recognition

Conference Paper

Mar 2020

Interactive machine learning techniques have a great potential to personalize media recognition models for each individual user by letting them browse and annotate a large amount of training data. However, graphical user interfaces (GUIs) for interactive machine learning have been mainly investigated in image and text recognition scenarios, not in other data modalities such as sound. In a scenario where users browse a large amount of audio files to search and annotate target samples corresponding to their own sound recognition classes, it is difficult for them to easily navigate through the overall sample structure due to the non-visual nature of audio data. In this work, we investigate the design issue for interactive sound recognition by comparing different visualization techniques ranging from audio spectrograms to deep learning-based audio-to-image retrieval. Based on an analysis of the user study, we clarify the advantages and disadvantages of audio visualization techniques, and provide design implications for interactive sound recognition GUIs using a massive amount of audio samples.

Utilizing crowdsourcing and machine learning in education: Literature review

Article

Jul 2020
Educ Inform Tech

Learning has long been a vital, developing field, since it is a key measure of the world's civilization and evolution and has an enormous effect on both individuals and societies. Enhancing existing learning activities in general will have a significant impact on literacy rates around the world. One of the crucial activities in education is assessment, because it is the primary way of evaluating students during their studies. The main purpose of this review is to examine the existing learning and e-learning approaches that use crowdsourcing, machine learning, or both in their proposed solutions. This review also investigates the addressed applications to identify the existing research related to assessment. Identifying all existing applications will assist in finding unexplored gaps and limitations. This study presents a systematic literature review investigating 30 papers from the following databases: IEEE and ACM Digital Library. After performing the analysis, we found that crowdsourcing is used in 47.8% of the investigated learning activities, while machine learning and hybrid solutions are each used in 26% of the investigated learning activities. Furthermore, all the existing approaches to the exam assessment problem that use machine learning or crowdsourcing were identified. Some of the existing assessment systems use crowdsourcing and others use machine learning; however, none of the approaches provides a hybrid assessment system that uses both. Finally, it is found that using either crowdsourcing or machine learning in online courses will enhance the interactions between students. It is concluded that current learning activities need to be enhanced, since they directly affect student performance. Moreover, merging machine learning with crowd wisdom will increase the accuracy and efficiency of education.

Generating Requirements Out of Thin Air: Towards Automated Feature Identification for New Apps

Preprint

Sep 2019

App store mining has proven to be a promising technique for requirements elicitation, as companies can gain valuable knowledge to maintain and evolve existing apps. However, despite first advancements in using mining techniques for requirements elicitation, little is yet known about how to distill requirements for new apps based on existing (similar) solutions and how exactly practitioners would benefit from such a technique. In the proposed work, we focus on exploring information (e.g. app store data) provided by the crowd about existing solutions to identify key features of applications in a particular domain. We argue that these discovered features and other related influential aspects (e.g. ratings) can help practitioners (e.g. software developers) to identify potential key features for new applications. To support this argument, we first conducted an interview study with practitioners to understand the extent to which such an approach would find champions in practice. In this paper, we present the first results of our ongoing research in the context of a larger road-map. Our interview study confirms that practitioners see the need for our envisioned approach. Furthermore, we present an early conceptual solution to discuss the feasibility of our approach. However, this manuscript is also intended to foster discussions on the extent to which machine learning can and should be applied to automatically elicit requirements from crowd-generated data on different forums, and to identify further collaborations in this endeavor.

Teachable Facets: A Framework of Interactive Machine Teaching for Information Filtering

Conference Paper

Mar 2024

Interaction Proxy Manager: Semantic Model Generation and Run-time Support for Reconstructing Ubiquitous User Interfaces of Mobile Services

Article

Sep 2023

Emerging terminals, such as smartwatches, true wireless earphones, in-vehicle computers, etc., are complementing our portals to ubiquitous information services. However, the current ecology of information services, encapsulated into millions of mobile apps, is largely restricted to smartphones; accommodating them to new devices requires tremendous and almost unbearable engineering efforts. Interaction Proxy, firstly proposed as an accessible technique, is a potential solution to mitigate this problem. Rather than re-building an entire application, Interaction Proxy constructs an alternative user interface that intercepts and translates interaction events and states between users and the original app's interface. However, in such a system, one key challenge is how to robustly and efficiently "communicate" with the original interface given the instability and dynamicity of mobile apps (e.g., dynamic application status and unstable layout). To handle this, we first define UI-Independent Application Description (UIAD), a reverse-engineered semantic model of mobile services, and then propose Interaction Proxy Manager (IPManager), which is responsible for synchronizing and managing the original apps' interface, and providing a concise programming interface that exposes information and method entries of the concerned mobile services. In this way, developers can build alternative interfaces without dealing with the complexity of communicating with the original app's interfaces. In this paper, we elaborate the design and implementation of our IPManager, and demonstrate its effectiveness by developing three typical proxies, mobile-smartwatch, mobile-vehicle and mobile-voice. We conclude by discussing the value of our approach to promote ubiquitous computing, as well as its limitations.

Generating Requirements Out of Thin Air: Towards Automated Feature Identification for New Apps

Conference Paper

Sep 2019

Inferring Consumers’ Motivations for Writing Reviews

Chapter

Jun 2018

In recent years, product reviews have taken an important role in helping consumers make online purchasing decisions, but only a small proportion of consumers post their reviews online. Researchers have pointed out that consumers can be divided into distinct groups in terms of their motivations for eWOM (electronic word-of-mouth), and that different strategies should be developed for different motivation groups. However, the effort of explicitly acquiring motivations via questionnaire is unavoidably high, which may impede developing different strategies for different motivation groups. In this paper, we identified a set of consumers' motivations and behavior data. Then, we performed a user survey to validate whether the behavioral features are significantly correlated with consumers' motivations. These findings lay a solid foundation for developing more adaptive design solutions that encourage eWOM participation.

Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews

Article

Dec 2002

Peter David Turney

This paper presents a simple unsupervised learning algorithm for classifying reviews as recommended (thumbs up) or not recommended (thumbs down). The classification of a review is predicted by the average semantic orientation of the phrases in the review that contain adjectives or adverbs. A phrase has a positive semantic orientation when it has good associations (e.g., "subtle nuances") and a negative semantic orientation when it has bad associations (e.g., "very cavalier"). In this paper, the semantic orientation of a phrase is calculated as the mutual information between the given phrase and the word "excellent" minus the mutual information between the given phrase and the word "poor". A review is classified as recommended if the average semantic orientation of its phrases is positive. The algorithm achieves an average accuracy of 74% when evaluated on 410 reviews from Epinions, sampled from four different domains (reviews of automobiles, banks, movies, and travel destinations). The accuracy ranges from 84% for automobile reviews to 66% for movie reviews.
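
As a rough illustration of the scoring rule described in this abstract, the difference of the two mutual-information terms collapses into a single log ratio of co-occurrence counts. The sketch below is a minimal version under stated assumptions: the function names, the count arguments, and the smoothing constant are hypothetical and are not taken from the paper's implementation, and the counts would have to come from some corpus or search index that is not shown here.

```python
import math


def semantic_orientation(near_excellent, near_poor,
                         hits_excellent, hits_poor, smoothing=0.01):
    """PMI(phrase, "excellent") - PMI(phrase, "poor") as one log ratio.

    near_excellent / near_poor: co-occurrence counts of the phrase with
    each reference word (hypothetical inputs).
    hits_excellent / hits_poor: overall counts of the reference words.
    smoothing: small constant (an assumption) to avoid division by zero.
    """
    numerator = (near_excellent + smoothing) * hits_poor
    denominator = (near_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)


def classify_review(phrase_scores):
    """Label a review by the average orientation of its phrases."""
    if not phrase_scores:
        raise ValueError("a review needs at least one scored phrase")
    average = sum(phrase_scores) / len(phrase_scores)
    return "recommended" if average > 0 else "not recommended"
```

The phrase's own frequency cancels out when the two mutual-information terms are subtracted, which is why it does not appear as an argument.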

KissKissBan

Article

Nov 2010

In this paper, we propose a competitive human computation game, KissKissBan (KKB), for image annotation. KKB is different from other human computation games since it integrates both collaborative and competitive elements in the game design. In a KKB game, one player, the blocker, competes with the other two collaborative players, the couples; while the couples try to find consensual descriptions about an image, the blocker's mission is to prevent the couples from reaching consensus. Because of its design, KKB possesses two nice properties over the traditional human computation game. First, since the blocker is encouraged to stop the couples from reaching consensual descriptions, he will try to detect and prevent coalition between the couples; therefore, these efforts naturally form a player-level cheating-proof mechanism. Second, to evade the restrictions set by the blocker, the couples would endeavor to bring up a more diverse set of image annotations. Experiments hosted on Amazon Mechanical Turk and a gameplay survey involving 17 participants have shown that KKB is a fun and efficient game for collecting diverse image annotations.

Crowdsourcing annotations for visual object detection

Article

Jan 2012

A large number of images with ground truth object bounding boxes are critical for learning object detectors, which is a fundamental task in computer vision. In this paper, we study strategies to crowd-source bounding box annotations. The core challenge of building such a system is to effectively control the data quality with minimal cost. Our key observation is that drawing a bounding box is significantly more difficult and time consuming than giving answers to multiple choice questions. Thus quality control through additional verification tasks is more cost effective than consensus based algorithms. In particular, we present a system that consists of three simple sub-tasks - a drawing task, a quality verification task and a coverage verification task. Experimental results demonstrate that our system is scalable, accurate, and cost-effective.

Semantic Information Retrieval

Chapter

Jan 1998

Semantic Information Theory (SIT) is concerned with studies in Logic and Philosophy on the use of the term information, “in the sense in which it is used of whatever it is that meaningful sentences and other comparable combinations of symbols convey to one who understands them” (Hintikka, 1970). Notwithstanding the large scope of this description, SIT has primarily to do with the question of how to weigh sentences according to their informative content. The main difference with conventional information theory is that information is not conveyed by an ordered sequence of binary symbols, but by means of a formal language in which logical statements are defined and explained by a semantics. The investigation of SIT concerns two research directions: the axiomatisation of the logical principles for assigning probabilities or similar weighting functions to logical sentences and the relationship between information content of a sentence and its probability.

VizWiz

Conference Paper

Oct 2010

Collecting image annotations using Amazon's Mechanical Turk

Conference Paper

Jun 2010

Crowd-sourcing approaches such as Amazon's Mechanical Turk (MTurk) make it possible to annotate or collect large amounts of linguistic data at a relatively low cost and high speed. However, MTurk offers only limited control over who is allowed to participate in a particular task. This is particularly problematic for tasks requiring free-form text entry. Unlike multiple-choice tasks there is no correct answer, and therefore control items for which the correct answer is known cannot be used. Furthermore, MTurk has no effective built-in mechanism to guarantee workers are proficient English writers. We describe our experience in creating corpora of images annotated with multiple one-sentence descriptions on MTurk and explore the effectiveness of different quality control strategies for collecting linguistic data using MTurk. We find that the use of a qualification test provides the highest improvement of quality, whereas refining the annotations through follow-up tasks works rather poorly. Using our best setup, we construct two image corpora, totaling more than 40,000 descriptive captions for 9000 images.

REVMINER: An extractive interface for navigating reviews on a smartphone

Conference Paper

Oct 2012

Smartphones are convenient, but their small screens make searching, clicking, and reading awkward. Thus, perusing product reviews on a smartphone is difficult. In response, we introduce RevMiner - a novel smartphone interface that utilizes Natural Language Processing techniques to analyze and navigate reviews. RevMiner was run over 300K Yelp restaurant reviews extracting attribute-value pairs, where attributes represent restaurant attributes such as sushi and service, and values represent opinions about the attributes such as fresh or fast. These pairs were aggregated and used to: 1) answer queries such as "cheap Indian food", 2) concisely present information about each restaurant, and 3) identify similar restaurants. Our user studies demonstrate that on a smartphone, participants preferred RevMiner's interface to tag clouds and color bars, and that they preferred RevMiner's results to Yelp's, particularly for conjunctive queries (e.g., "great food and huge portions"). Demonstrations of RevMiner are available at revminer.com.

An Introduction to Computational Learning Theory

Article

Jan 1994

Summarizing sporting events using Twitter

Article

Feb 2012

The status updates posted to social networks, such as Twitter and Facebook, contain a myriad of information about what people are doing and watching. During events, such as sports games, many updates are sent describing and expressing opinions about the event. In this paper, we describe an algorithm that generates a journalistic summary of an event using only status updates from Twitter as a source. Temporal cues, such as spikes in the volume of status updates, are used to identify the important moments within an event, and a sentence ranking method is used to extract relevant sentences from the corpus of status updates describing each important moment within an event. We evaluate our algorithm compared to human-generated summaries and the previous best summarization algorithm, and find that the results of our method are superior to the previous algorithm and approach the readability and grammaticality of the human-generated summaries.

Support Vector Machine Active Learning With Applications To Text Classification

Article

Dec 2001
J MACH LEARN RES

Support vector machines have met with significant success in numerous real-world learning tasks. However, like most machine learning algorithms, they are generally applied using a randomly selected training set classified in advance. In many settings, we also have the option of using pool-based active learning. Instead of using a randomly selected training set, the learner has access to a pool of unlabeled instances and can request the labels for some number of them. We introduce a new algorithm for performing active learning with support vector machines, i.e., an algorithm for choosing which instances to request next. We provide a theoretical motivation for the algorithm using the notion of a version space. We present experimental results showing that employing our active learning method can significantly reduce the need for labeled training instances in both the standard inductive and transductive settings.
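
The pool-based setting described in this abstract can be sketched as a query-selection step. The abstract does not spell out the selection rule, so the example below assumes one common margin-based heuristic: given an already-trained linear classifier, request the label of the pool instance closest to the decision boundary. The names `decision_value` and `pick_query`, and the weights `w` and bias `b`, are hypothetical placeholders rather than the paper's notation.

```python
def decision_value(w, b, x):
    """Signed score of a linear classifier: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def pick_query(w, b, pool):
    """Return the index of the unlabeled instance nearest the boundary.

    Assumption: the instance with the smallest |w . x + b| is the one the
    current classifier is least certain about, so its label is requested next.
    """
    if not pool:
        raise ValueError("the unlabeled pool is empty")
    return min(range(len(pool)),
               key=lambda i: abs(decision_value(w, b, pool[i])))


# Usage: one round of selection from a toy pool of 2-D points.
weights, bias = [1.0, -1.0], 0.0
unlabeled_pool = [[3.0, 0.0], [1.0, 0.9], [0.0, 2.0]]
next_index = pick_query(weights, bias, unlabeled_pool)
```

In a full loop, the selected instance would be labeled, moved into the training set, and the classifier retrained before the next query is chosen.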
