Proceedings of NTCIR-6 Workshop Meeting, May 15-18, 2007, Tokyo, Japan
Overview of Opinion Analysis Pilot Task at NTCIR-6
Yohei Seki, David Kirk Evans, Lun-Wei Ku§,
Hsin-Hsi Chen§, Noriko Kando, Chin-Yew Lin
Dept. of Information and Computer Sciences, Toyohashi University of Technology
Aichi 441-8580, Japan
seki@ics.tut.ac.jp
National Institute of Informatics
Tokyo 101-8430, Japan
{devans, kando}@nii.ac.jp
§Dept. of Computer Science and Information Engineering, National Taiwan University
Taipei 10617, Taiwan
{lwku, hhchen}@csie.ntu.edu.tw
Microsoft Research Asia
Beijing 100080, P.R. China
cyl@microsoft.com
Abstract
This paper describes an overview of the Opinion Analysis Pilot Task from 2006 to 2007 at the Sixth NTCIR Workshop. We created a test collection of 32, 30, and 28 topics (11,907, 15,279, and 8,379 sentences) in Chinese, Japanese, and English. Using this test collection, we conducted an opinion extraction subtask. The subtask was defined from four perspectives: (a) opinionated sentence judgment, (b) opinion holder extraction, (c) relevance sentence judgment, and (d) polarity judgment. In total, 21 runs were submitted by 14 participants, with five results submitted by the organizers. We show the evaluation results of the groups participating in the opinion extraction subtask.
Keywords: Opinion Extraction, Opinion Holder, Relevance, Polarity, and NTCIR.
1 Introduction
This paper describes an overview of the Opinion
Analysis Pilot Task [5] from 2006 to 2007 at the Sixth
NTCIR Workshop [4] (NTCIR-6 Opinion). This was
the first effort to produce a multi-lingual test collection
for evaluating opinion extraction at NTCIR.
Opinion and sentiment analysis has been receiving
a lot of attention in the natural language processing re-
search community recently [2, 9, 7]. With the broad
range of information sources available on the web,
and rapid increase in the uptake of social community-
oriented websites that foster user-generated content
there has been further interest by both commercial and
governmental parties in trying to automatically ana-
lyze and monitor the tide of prevalent attitudes on the
web. As a result, interest in automatically detecting
sentences in which an opinion is expressed ([12] etc.),
the polarity of the expression ([13] etc.), targets, and
opinion holders ([1] etc.) has been receiving more at-
tention in the research community. Applications include tracking responses to and opinions about commercial products and governmental policies, tracking blog entries for potential political scandals, and so on.
In the Sixth NTCIR Workshop, a new pilot task for opinion analysis was introduced. The pilot task
has tracks in three languages: Chinese, English, and
Japanese. In this paper, we present an overview of the
test collection, task design, and evaluation results us-
ing the test collection across the Chinese, Japanese,
and English data.
We believe that this pilot task presents a unique op-
portunity to expand the study of opinionated text anal-
ysis across languages due to the comparable nature of
the corpus. The documents have been carefully se-
lected based on the manual relevance judgments as-
signed in a cross-lingual Information Retrieval task,
ensuring a high quality corpus that is relevant in all
three languages.
This paper is organized as follows. In Section 2,
we explain the task design for the opinion extraction
subtask. In Section 3, we briefly introduce the test collection used in the NTCIR-6 Opinion Analysis Pilot Task. Section 4 presents the annotation methodology. Section 5 details the evaluation methodology used and explains the differences in the approaches taken, with examples. Section 6 describes the participant systems. Section 7 presents evaluation results for the
opinion extraction subtask in Chinese, Japanese and
English. Finally, we present our conclusions in Sec-
tion 8.
2 Task Design
2.1 Schedule
The time schedule for the NTCIR-6 Opinion Analy-
sis Pilot Task is as follows.
2006-08-01 : Start of Registration
2006-10-30 : Registration Due
2006-11-21 : Testing Sets Release
(Chinese and Japanese opinion extraction)
2006-11-30 : Submission of Results
(Chinese and Japanese opinion extraction)
2006-12-11 : Testing Sets Release
(English opinion extraction)
2006-12-20 : Submission of Results
(English opinion extraction)
2007-02-08 : Delivery of Evaluation Results
2007-03-08 : Paper Due (for Proceedings)
2007-05-15 : NTCIR Workshop-6 (Conference in Tokyo)
2.2 Participants
Results for the opinion extraction subtask were collected. Five, three, and six teams participated in the Chinese, Japanese, and English opinion extraction subtasks, respectively. At most two runs were accepted from each participant.
2.3 Opinion extraction subtask
Four Evaluation Categories
The opinion extraction subtask has four evaluation categories, two of which are mandatory and two of which are optional. As shown in Table 3, the two mandatory categories are opinionated sentence judgment and opinion holder extraction. The two optional categories are judging whether each sentence is relevant to the set topic and deciding the polarity of the opinionated sentences.
1. Opinionated sentences
The opinionated sentence judgment is a binary decision. In the case of opinion holders, we allow multiple opinion holders to be recorded for a sentence when multiple opinions are expressed.
2. Opinion holders
For the Chinese data, all potential opinion holders are annotated, regardless of whether the sentence in which the entity occurs is an opinionated sentence or not. In Japanese and English, opinion holders are only annotated for sentences that express an opinion; however, the opinion holder for a sentence can occur anywhere in the document. The assessors performed a kind of co-reference resolution by marking the opinion holder for the sentence and, if the opinion holder is an anaphoric reference, noting the target of the anaphora.
3. Relevant sentences
Each set contains documents that were found to
be relevant to a particular topic, such as the one
shown in Figure 1. For those participating in
the relevance category evaluation, each sentence
should be judged as either relevant (Y) or non-
relevant (N) to the topic.
4. Opinion polarities
Polarity is determined for each opinionated sen-
tence, and for sentences where more than one
opinion is expressed the assessors were in-
structed to determine the polarity of the main
opinion expressed. In addition, the polarity is to
be determined with respect to the set topic de-
scription if the sentence is relevant to the topic,
and based on the attitude of the opinion if the
sentence is not relevant to the topic.
Sample (Training) Data
Of the 32 and 30 topics in the NTCIR-6 Opinion Analysis Pilot Task test collection, four topics were provided as sample (training) data to participants in Chinese and Japanese, respectively. For English, only one topic was provided as sample data because the MPQA opinion corpus [11] was already available to opinion extraction researchers in English.
Evaluation Metrics
Results for precision, recall, and F-measure are presented for opinion detection and opinion holders, and optionally for sentence relevance and polarity for those participants who elected to submit results for those optional portions. Since all sentences in Chinese, Japanese, and English were annotated by three assessors, there is both a strict standard (all three assessors must have the same annotation) and a lenient standard (two of three assessors have the same annotation) for evaluation. Both are computed for all but the opinion holder evaluation, which requires some manual judgment and is only performed once for each participating group. The formal definitions provided for evaluation are as follows.
1. Mandatory evaluation
(a) Precision, Recall and F-measure of Opinion
Holder using lenient gold standard.
(b) Precision, Recall and F-measure of Opinion
Holder using strict gold standard.
(c) Precision, Recall and F-measure of Opinion us-
ing lenient gold standard.
(d) Precision, Recall and F-measure of Opinion us-
ing strict gold standard.
2. Option 1 evaluation
If Relevance information is provided, extra informa-
tion will be reported including:
(a) Precision, Recall and F-measure of Relevance
using lenient gold standard.
(b) Precision, Recall and F-measure of Relevance
using strict gold standard.
3. Option 2 evaluation
If Polarity information is provided, extra information
will be reported including:
(a) Precision, Recall and F-measure of Polarity us-
ing lenient gold standard.
(b) Precision, Recall and F-measure of Polarity us-
ing strict gold standard.
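As a concrete illustration of the strict and lenient standards, the sketch below (in Python; the official evaluation scripts were distributed in Perl) shows one way a gold label could be derived from three annotator judgments. The function name and return convention are our own and not part of the task definition.

from collections import Counter

def gold_label(annotations, standard="lenient"):
    # Derive a gold label (e.g. "YES"/"NO" for opinionated) from three
    # annotator judgments: "strict" requires all three to agree, "lenient"
    # requires at least two of the three. Returns None if no gold label exists.
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    needed = 3 if standard == "strict" else 2
    return label if votes >= needed else None

# Two of three assessors marked the sentence as opinionated:
print(gold_label(["YES", "YES", "NO"], "lenient"))  # YES
print(gold_label(["YES", "YES", "NO"], "strict"))   # None (not in the strict gold standard)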
3 Test collection
3.1 Document Sets
The test collection is based on the NTCIR-3, 4, and
5 CLIR test collection [6] documents and relevance
judgments.
It consists of Japanese data from 1998 to 1999
from the Yomiuri and Mainichi newspapers.
The Chinese data contains data from 1998 to 1999 from the United Daily News, China Times, China Times Express, Commercial Times, China Daily News, and Central Daily News.
The English data also covers from 1998 to 1999
with text from the Mainichi Daily News, Korea
Times, and some data from Xinhua.
The test collection was created using about thirty
queries over data from the NTCIR Cross-Lingual In-
formation Retrieval test collection covering docu-
ments from 1998 to 2001. Document relevance for
each set (query) had already been computed for the IR
evaluation, so relevant documents for each language
were selected based on the relevance judgments. For
the Japanese and English portion of the test collection,
a maximum of twenty documents were selected for
each topic, while the Chinese portion might contain
more than twenty documents for a topic. As an ex-
ample of the topics in the NTCIR-6 opinion analysis
pilot task, please see Figure 1, which shows topic 010,
“History Textbook Controversies, World War II”.
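As a hypothetical illustration of how such a topic could be consumed programmatically, the Python snippet below extracts the fields shown in Figure 1, assuming each topic is stored as well-formed XML; the file name and exact tag layout are assumptions, not part of the released data specification.

import xml.etree.ElementTree as ET

# Parse a topic file shaped like Figure 1 (assumed well-formed XML).
topic = ET.parse("topic-010.xml").getroot()
fields = {
    "num": topic.findtext("NUM"),
    "title": topic.findtext("TITLE"),
    "description": topic.findtext("DESC"),
    "background": topic.findtext("NARR/BACK"),
    "relevance_criteria": topic.findtext("NARR/REL"),
    "concepts": topic.findtext("CONC"),
}
print(fields["title"])  # History Textbook Controversies, World War II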
Table 1 shows the number of topics, the number of
documents, and the number of sentences for each lan-
guage. The percentages of sentences that are opinionated and relevant are also computed for both the strict and lenient standards.
3.2 Topics
Table 2 lists the titles of all the topics in the data
set. While only the English title is given, the topics
and related meta-data as shown in Figure 1 have all
been translated into each language.
4 Annotation
The NTCIR-6 Opinion Analysis Pilot Task extends
previous work in opinion analysis [3, 8, 10] to a multi-
lingual corpus. The initial category focuses on a sim-
plified sentence-level binary opinionated or not opin-
ionated classification as opposed to more complicated
contextual formulations, but we feel that starting with
a simpler task will allow for wider participation from
groups that may not have existing experience in opin-
ion analysis. Table 3 summarizes the annotation cate-
gories, which are all being performed for all three lan-
guages. All categories were annotated by three annota-
tors in each language: Chinese, Japanese, and English.
One sample topic was used for an inter-coder session to improve agreement between assessors.
4.1 Chinese Annotation Strategy
In the Chinese annotation effort, a pool of seven annotators was used to annotate the documents, with
three annotators per document. Prior to annotation, the
annotators underwent an hour-long orientation period
where the purpose of the annotation was explained,
and examples of sentences and their annotation were
given. After the hour-long orientation session, the an-
notators were free to ask the annotation coordinator
questions about specific sentences if they were unsure
of the labelling, but no special care was taken to ensure
consistency between the annotators in those cases.
4.2 Japanese Annotation Strategy
The Japanese data was annotated by three annota-
tors, who were given basic instructions about the an-
notation task, and then annotated a sample topic. They
held a meeting about six hours afterwards to discuss discrepancies, with the explicit goal of trying to improve agreement between annotators. General or common knowledge and future plans were not counted as opinions. The Japanese annotators agreed on a specific format for writing out opinion holder description strings. The three annotators were magazine or newspaper editors or translators.
4.3 English Annotation Strategy
Three annotators were used to mark the English
data. One of the annotators was a journalist, another
Table 1. Test collection size at NTCIR-6 Opinion Analysis Pilot Task
Language Topics Documents Sentences Opinionated (Lenient / Strict) Relevant (Lenient / Strict)
Chinese 32 843 11,907 62% / 25% 39% / 16%
Japanese 30 490 15,279 29% / 22% 64% / 49%
English 28 439 8,528 30% / 7% 69% / 37%
Table 2. Opinion Analysis Task Topic Titles
Number Title
001 Time Warner, American Online (AOL), Merger, Impact
002 President of Peru, Alberto Fujimori, scandal, bribe
003 Kim Dae Jun, Kim Jong Il, Inter-Korea Summit
004 the US Secretary of Defense, William Sebastian Cohen, Beijing
005 G8 Okinawa Summit
006 Wen Ho Lee Case, classified information, national security
007 Ichiro, Rookie of the Year, Major League
008 Jennifer Capriati, tennis
009 EP-3 surveillance aircraft, F-8 fighter, aircraft collision
010 History Textbook Controversies, World War II
011 Tobacco business, accusation, compensation
012 Tiger Woods, sports star
013 ”Chiutou” (Autumn Struggle), Appeal, Laborer, Protest, Taiwan
014 Expert, Opinion, International Monetary Fund (IMF), Asian countries
015 Find articles dealing with a teenage social problem
016 Divorce, Family Discord, Criticisms
017 China, Reaction, Taiwan, Diplomatic Relations
018 China, Stationing, Weapons, Taiwan
019 Animal Cloning Technique
020 Sexual Harassment, Lawsuits
021 Olympic, Bribe, Suspicion
022 North Korea, Daepodong, Asia, Response
023 Joining WTO
024 China Airlines Crash
025 Province-refining
026 Economic influence of the European monetary union
027 President Kim Dae-Jung’s policy toward Asia
028 Clinton scandals
029 War crimes lawsuits
030 Nuclear power protests
031 College Admission Policy
032 Counseling for Youths
Table 3. Four annotation categories at NTCIR-6 Opinion Analysis Pilot Task
Categories Values Req’d?
Opinionated Sentences YES, NO Yes
Opinion Holders String, multiple Yes
Relevant Sentences YES, NO No
Opinionated Polarities POS, NEG, NEU No
<TOPIC>
<NUM>010</NUM>
<TITLE>History Textbook Controversies, World War II</TITLE>
<DESC>Find reports on the controversial history textbook about the Second World War approved
by the Japanese Ministry of Education.</DESC>
<NARR>
<BACK>The Japanese Ministry of Education approved a controversial high school history text-
book that allegedly glosses over Japan’s atrocities during World War Two such as the Nanjing
Massacre, the use of millions of Asia women as ”comfort women” and the history of the annex-
ations and colonization before the war. It was condemned by other Asian nations and Japan was
asked to revise this textbook.</BACK>
<REL>Reports on the fact that the Japanese Ministry of Education approved the history textbook
or its content are relevant. Reports on reflections or reactions to this issue around the world are
partially relevant. Content on victims, ”comfort women”, or Nanjing Massacre or other wars and
colonization are irrelevant. Reports on the reflections and reactions of the Japanese government
and people are also irrelevant.</REL>
</NARR>
<CONC>Ministry of Education, Japan, Junichiro Koizumi, textbook, comfort women, sex-
ual slavery, Nanjing Massacre, annexation, colonization, protest, right-wing group, Lee Den
Hui</CONC>
</TOPIC>
Figure 1. Topic title, description, and relevance fields for set 010
was a translator, and the profession of the third annota-
tor was unknown. Prior to annotation, there was a two
hour meeting between the annotators and the English
coordinator explaining the purpose of the annotation,
and introducing them to the task with some sample an-
notations. Afterwards all three annotators annotated a
sample topic, and later a four-hour meeting was held to
discuss discrepancies and general approaches to anno-
tation. By consensus, expressions of common or general knowledge were not labeled as opinions, nor were statements from officials or companies about future plans or schedules.
4.4 Inter-annotator agreement
For the Japanese and English corpora all topics
were annotated by the same three annotators, so it
was possible to compute Cohen’s Kappa for agreement
over all topics between the annotators. Table 5 lists the
pairwise agreement for annotators for the opinionated
tagging subtasks.
Table 4 gives a summary of the Kappa agreement for annotators in each language. More specific agreement values for each language are given below. The task of determining whether a sentence contains opinionated language is open to individual interpretation regardless of the language.
In general, Japanese has the highest average agree-
ment numbers for opinionated sentence detection. As
the annotators for each language underwent different
training and instruction, and come from different back-
grounds, it is likely that much of the variation in agree-
ment is not due to differences inherent in the language,
Table 4. Kappa summary
Language Minimum Maximum Average
CH Opinionated 0.0537 0.4065 0.2328
JA Opinionated 0.5997 0.7681 0.6740
EN Opinionated 0.1704 0.4806 0.2947
Table 5. Pairwise Inter-annotator agree-
ment using Cohen’s Kappa for Japanese
and English
Language Annotator Pair Task Kappa
J 1-2 Opinionated 0.6541
J 1-3 Opinionated 0.5997
J 2-3 Opinionated 0.7681
E 1-2 Opinionated 0.4806
E 1-3 Opinionated 0.1704
E 2-3 Opinionated 0.2332
but instead is due to differences in the annotators.
In Chinese, since there is a total set of seven anno-
tators, not all topics were annotated by the same three
annotators. It was thus not possible to compute agree-
ment of three annotators over all topics, since the an-
notators change for each topic. Instead, for each topic
the agreement between the three annotators was com-
puted, and the average for each topic is shown in Ta-
ble 6.
Table 6. Inter-annotator agreement using
Cohen’s Kappa for Chinese
Topic Opinionated Topic Opinionated
1 0.4009 17 0.1608
2 0.3772 18 0.2747
3 0.2327 19 0.3166
4 0.2210 20 0.0938
5 0.4065 21 0.2617
6 0.1046 22 0.1956
7 0.2355 23 0.3663
8 0.0706 24 0.1427
9 0.2254 25 0.3634
10 0.1228 26 0.2698
11 0.2367 27 0.3667
12 0.2351 28 0.1285
13 0.1942 29 0.3207
14 0.2714 30 0.0537
15 0.1344 31 0.2009
16 0.3119 32 0.1523
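For reference, the pairwise Cohen's Kappa values in Tables 4-6 follow the standard formulation sketched below in Python; this is a textbook implementation for two annotators over the same sentences, not the script actually used by the organizers.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    # Observed agreement minus chance agreement, normalized by (1 - chance).
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c]
                   for c in set(labels_a) | set(labels_b)) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Toy example with binary opinionated judgments from two annotators:
a = ["YES", "YES", "NO", "NO", "YES", "NO"]
b = ["YES", "NO",  "NO", "NO", "YES", "YES"]
print(round(cohens_kappa(a, b), 4))  # 0.3333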
5 Evaluation Approach
In each language we tried to take a similar approach
to evaluation, using precision, recall, and f-measure to
report results. Each language had slight differences in
how those measures were computed though. Details
for each language are given below, but as a quick sum-
mary:
1. Opinionated and Relevant: Precision, recall, and F-measure were computed in the same manner for all languages.
2. Polarity: Three different approaches were used. We present results from all three approaches in each language in this overview paper.
3. Opinion Holder: The English and Japanese evaluations were similar, with a semi-automatic evaluation that relied on human judgments and estimated recall. The Chinese evaluators also used a semi-automatic approach but manually examined all instances where opinion holders were not exact string matches, possibly skipping some sentences, similar to the polarity evaluation.
In the following sections, we provide a description of the three evaluation approaches taken. The Chi-
nese evaluation followed the LWK approach, the En-
glish evaluation followed the DKE approach, and the
Japanese evaluation followed the YS approach.
5.1 LWK Evaluation Approach
5.1.1 Opinionated / Relevance
Under the strict evaluation, all three annotators must
agree on the classification of the sentence to be
counted as either an opinionated or relevant sen-
tence. Under the lenient evaluation, two of the three
annotators must agree on the classification of the
sentence for it to be counted. Precision is computed as #system correct / #system proposed. Recall is computed as #system correct / #sentences, where the number of sentences is either the number of opinionated or relevant sentences according to the strict or lenient criteria.
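Expressed in code, the formulas above amount to the following Python sketch; the argument names are illustrative, and this is not the Perl script that was distributed to participants.

def precision_recall_f1(system_yes, gold_yes):
    # system_yes: sentence IDs the system marked (opinionated or relevant).
    # gold_yes: sentence IDs in the lenient or strict gold standard.
    correct = len(system_yes & gold_yes)                     # #system correct
    precision = correct / len(system_yes) if system_yes else 0.0
    recall = correct / len(gold_yes) if gold_yes else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 3 of 4 proposed sentences appear in a gold standard of 6 sentences:
print(precision_recall_f1({1, 2, 3, 9}, {1, 2, 3, 4, 5, 6}))  # (0.75, 0.5, 0.6)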
5.1.2 Polarity
The LWK approach evaluates only opinionated sentences that match the definitions for either the strict or lenient gold standards. For the strict standard, only sentences on which all three annotators agree about the polarity are evaluated: either all POS, all NEU, all NEG, or all "not opinionated". For the lenient evaluation, two of the three annotators must agree that the sentence is opinionated for it to be included in the evaluation. All other sentences are not included in the evaluation.
The polarity for the sentence is the polarity with the largest number of votes by the annotators. In ambiguous cases: for POS + NEU the gold standard is POS, for NEG + NEU the gold standard is NEG, for POS + NEG the gold standard is NEU, and for POS + NEU + NEG the gold standard is NEU.
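The tie-breaking rules above can be summarized in the following Python sketch; the helper name is ours, and the actual evaluation script may differ in details such as how missing votes are handled.

from collections import Counter

def lwk_polarity_gold(votes):
    # votes: polarity labels (POS/NEU/NEG) from the annotators who marked the
    # sentence as opinionated. A clear majority wins; otherwise apply the
    # heuristics: POS+NEU -> POS, NEG+NEU -> NEG, POS+NEG -> NEU, all three -> NEU.
    counts = Counter(votes).most_common()
    if len(counts) == 1 or counts[0][1] > counts[1][1]:
        return counts[0][0]
    present = {label for label, _ in counts}
    if present == {"POS", "NEU"}:
        return "POS"
    if present == {"NEG", "NEU"}:
        return "NEG"
    return "NEU"  # POS+NEG, or POS+NEU+NEG

print(lwk_polarity_gold(["POS", "POS", "NEU"]))  # POS (majority)
print(lwk_polarity_gold(["POS", "NEG", "NEU"]))  # NEU (three-way tie)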
5.1.3 Opinion Holder
The LWK evaluation approach for opinion holders is
semi-automatic. All possible aliases of each opinion
holder are generated manually first, for example, the
names of holders with or without their titles. The re-
sults then are evaluated according to this information
by keyword matching. At last, to ensure the correct-
ness of the evaluation, every record which is different
from all aliases of the correct holders is checked man-
ually again.
Notice that if we are not sure the proposed answer
is the same entity as the correct answer, it is treated as
a wrong answer. For example, if the correct holder is
”the president of America” but the participant reports
”the president”, there will not be a match. And also
the resolution of the anaphor or the correference has
not been evaluated yet, as we have mentioned earlier.
That is, the holders of the sentence proposed by the
participant should be the same as the form it appears
in this sentence.
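A minimal sketch of this alias-based matching is shown below; the holder name and aliases are invented examples, and the second-pass manual check described above is not shown.

def match_holder(system_holder, gold_aliases):
    # gold_aliases: manually generated surface forms of the correct holder(s),
    # e.g. with or without titles. Non-matching answers go to a manual check.
    return system_holder.strip() in gold_aliases

aliases = {"Kim Dae-Jung", "President Kim Dae-Jung"}
print(match_holder("President Kim Dae-Jung", aliases))  # True
print(match_holder("the president", aliases))           # False (manual check or wrong)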
5.2 DKE Evaluation Approach
5.2.1 Opinionated / Relevance
The DKE evaluation approach for opinionated and
relevant sentences is the same as described in Sec-
tion 5.1.1.
5.2.2 Polarity
The general idea of the DKE approach is to weight
system scores based on their agreement with the an-
notated data. The sentences that are evaluated for po-
larity are determined according to the lenient or strict
standard: for lenient, only sentences in which at least
two annotators have marked the sentence as opinion-
ated are evaluated, for strict only sentences in which
all three annotators marked the sentence as opinion-
ated are evaluated.
The evaluation script creates a contingency ta-
ble for the categories POS, NEU, NEG, and NONE,
where NONE is category that is used when a sen-
tence is not opinionated. For each sentence, the in-
dividual votes for each annotator are added to the
appropriate cell based on the system’s categoriza-
tion. If, for example, the system assigned a sen-
tence a polarity of NEG, and the annotators assign
polarities of NONE (not an opinionated sentence),
NEG, and NEU, the contingency table will be updated
with 1 added to t[GOLDNEG][SY STEMNEG],
t[GOLDNEU][SY ST EMNEG],
and t[GOLDNONE][SY STEMNEG]. Precision and
recall is then calculated in the normal way over the
contingency table.
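The following Python sketch illustrates the contingency-table bookkeeping described above; function and variable names are ours, not those of the original evaluation script.

from collections import defaultdict

def build_contingency(sentences):
    # sentences: iterable of (system_label, annotator_votes) pairs, with labels
    # drawn from POS, NEU, NEG, and NONE (not opinionated). Every annotator
    # vote adds 1 to the cell t[gold_vote][system_label].
    t = defaultdict(lambda: defaultdict(int))
    for system_label, votes in sentences:
        for vote in votes:
            t[vote][system_label] += 1
    return t

def precision_recall(t, label):
    proposed = sum(row[label] for row in t.values())   # column total
    gold = sum(t[label].values())                      # row total
    correct = t[label][label]
    return (correct / proposed if proposed else 0.0,
            correct / gold if gold else 0.0)

# The example from the text: system says NEG; annotators say NONE, NEG, NEU.
table = build_contingency([("NEG", ["NONE", "NEG", "NEU"])])
print(precision_recall(table, "NEG"))  # (0.333..., 1.0) for the NEG category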
One advantage of this approach is that all of the an-
notations are taken into account, and the method scales
well to any number of annotators. In addition, for sen-
tences which are truly ambiguous for human annota-
tors, the systems are partially rewarded based on how
ambiguous a sentence is. For example, if one hundred
annotators marked a sentence with 50 POS polarity an-
notations, and 50 NEU polarity annotations, a system
that labels the sentence POS or NEU would benefit by
agreeing with half of the annotators. Other schemes
run the risk of declaring one of either POS or NEU
to be correct, penalizing the system when the sentence
is clearly difficult for humans to label one way or the
other.
5.2.3 Opinion Holder
Opinion Holder evaluation under the DKE strategy used a Perl script to implement a semi-automatic evaluation. For each document, an equivalence class is created for each opinion holder, and system opinion holders for a given sentence are matched using exact string matches to the opinion holders in the equivalence class. Matches are counted as correct opinion holders. If no matches are found, a human judge1 is asked to determine whether the system opinion holder matches one of the opinion holders in the equivalence class for the sentence, given the opinion holders in the equivalence class and the sentence text. If there is a match, the system opinion holder is added to the equivalence class; otherwise it is marked as a known incorrect opinion holder.
1For this evaluation, the co-author David Kirk Evans.
The initial database of opinion holder equivalence
classes is created by adding the opinion holders
marked by the annotators. The database grows with
each evaluated system, and after the first run for each
system subsequent runs can be done automatically us-
ing the opinion holder database to match opinion hold-
ers.
Precision is computed as the number of correctly
matched opinion holders divided by the number of of-
fered opinion holders. Recall is only an approxima-
tion though: the evaluation script assumes one opinion
holder for each opinionated sentence. While the spec-
ification allows for multiple opinion holders per sen-
tence, only 3.5% of English annotations actually had
more than one opinion holder annotated in the gold
standard.
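A minimal sketch of this equivalence-class bookkeeping, with the human judgment abstracted as a callback, is given below; in the actual evaluation a person made this decision and the answer was stored for later runs.

def judge_holder(system_holder, equiv_class, known_incorrect, ask_human):
    # equiv_class: known correct surface forms for the sentence's opinion holder.
    # known_incorrect: previously rejected forms, so repeat runs need no human.
    if system_holder in equiv_class:
        return True
    if system_holder in known_incorrect:
        return False
    if ask_human(system_holder, equiv_class):   # manual decision, memoized below
        equiv_class.add(system_holder)
        return True
    known_incorrect.add(system_holder)
    return False

equiv = {"the Japanese Ministry of Education"}   # invented example holder
wrong = set()
always_yes = lambda holder, ec: True             # stand-in for the human judge
print(judge_holder("Japan's Ministry of Education", equiv, wrong, always_yes))  # True
print("Japan's Ministry of Education" in equiv)  # True: now part of the class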
5.3 YS Evaluation Approach
5.3.1 Opinionated / Relevance
The YS evaluation approach for opinionated and rel-
evant sentences is the same as described in Sec-
tion 5.1.1. We provided the evaluation script in Perl in December 2006, and participants could conduct post-submission analysis using this script.
5.3.2 Polarity
The most important point of the YS approach is consistency in evaluation strategies across the four categories (opinionated sentence, relevance, polarity, and opinion holder). In the polarity evaluation, recall, precision, and F-value were computed based on leniently or strictly agreed results between assessors for positive, negative, and neutral values. Therefore, the evaluation results were slightly stricter than those of the other two evaluation approaches. This evaluation script was also provided in Perl to participants in January 2007.
5.3.3 Opinion Holder
The opinion holder evaluation strategy was also consistent with the evaluation strategies of the other three categories: opinion holders were evaluated based on leniently or strictly agreed opinion holders between assessors.
We only applied a sentence-based evaluation to
evaluate the opinion holders. If multiple holders ex-
isted in one sentence, and the system detected one
of them, then we regarded the system’s extraction as
valid.
In addition, we also applied a five-grade evaluation of the agreement between the system's and the assessor's detection, as follows. This strategy was useful for estimating the effectiveness of a coreference resolution approach.
1. Agreed semantically and strings were matched
almost completely.
2. Agreed semantically and strings were matched
partially, but a proper name was not detected.
3. Agreed semantically but strings were not
matched.
4. Agreed partially in some aspect, but proper en-
tity could not be specified.
5. Not agreed.
We counted results with the top three grades above as valid extractions and computed the precision, recall, and F-measure values. Opinion holder evaluation was conducted semi-automatically by combining a perfect strict matching approach with manually conducted five-grade estimation.
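The sentence-based counting can be sketched as follows; the grades themselves were assigned manually, so here they are represented by a precomputed mapping, and all names are invented examples.

def sentence_is_valid(system_holders, grades):
    # A sentence counts as a valid extraction if at least one proposed holder
    # received one of the top three agreement grades (1-3); ungraded or
    # grade 4-5 holders do not count.
    return any(grades.get(h, 5) <= 3 for h in system_holders)

grades = {"Prime Minister Koizumi": 1, "he": 3, "the ministry": 5}
print(sentence_is_valid(["he"], grades))            # True (grade 3: semantic match)
print(sentence_is_valid(["the ministry"], grades))  # False (grade 5: not agreed)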
5.4 Comparison
Table 7 and Table 8 list a number of cases and the
behavior of the three evaluation approaches.
Note that in the last example in Table 8 the Chinese
score actually increases. This is due to the heuristic
that says that under a lenient evaluation, a POS and
NEU score by two annotators is treated as a POS sen-
tence.
6 Participant System Descriptions
6.1 Chinese (in alphabetic order)
Five teams participated on the Chinese side. The Chinese University of Hong Kong (CUHK) implemented a system with five modules based on knowledge learned from unsupervised web data. The University of Sheffield (GATE) implemented an SVM-based Chinese and English opinion extraction system based on the four sample topics and the MPQA corpus. The Chinese Academy of Sciences (ISCAS) applied Conditional Random Fields (CRF) to find opinion holders as a sequential labeling task. National Taiwan University (NTU) calculated polarity scores to decide the opinion polarities and strengths of words from their composing characters. The University of Maryland (UMCP) implemented a system based on sentiment lexicons and explored the effect of the lexicon size, etc.
6.2 English (in alphabetic order)
Six teams participated on the English side. Cornell University (Cornell) developed a system using components and features from their previous work. The University of Sheffield (GATE) used an SVM system to train a classifier over the MPQA corpus and compared the differences between the MPQA corpus and the NTCIR-6 English corpus. The Information and Communication University (ICU-IR) system was a hybrid machine-learning and rule-based system; they took a semi-supervised learning method based on fourteen strong clue words and six seed rules. The Illinois Institute of Technology (IIT) system uses a lexicon of words and phrases used to express appraisal attitudes. For opinion holders, they determine the subject or agent of verbs in a communication verb list and combine that with evidence from quote positions. The National Institute of Informatics (NII) uses a machine-learning approach with shallow parsing to generate features used to train classifiers in WEKA. The Toyohashi University of Technology (TUT) system was based on an SVM classifier trained over surface features and semantic primitives for predicates and subjects from a thesaurus.
6.3 Japanese (in alphabetic order)
Three teams participated on the Japanese side. NEC Internet Systems Research Laboratory (NEC) took an SVM machine learning approach with four types of features and related author and non-author opinion holder candidates to opinionated sentences. The National Institute of Information and Communications Technology (NICT) implemented SVM-based opinion sentence classification and applied pairwise classification with majority voting to polarity classification. Toyohashi University of Technology (TUT) implemented two-way opinion classification systems: an author and an authority opinion classification system, cross-lingually in Japanese and English.
7 Evaluation Results
7.1 Chinese
Table 9 lists the evaluation results in Chinese opin-
ion analysis based on lenient and strict standards.
Though the CFP states that the evaluation results of opinions and opinion holders should be listed together, they are evaluated separately because of the partial correctness issue of opinion holders.
For opinion holder evaluation, we applied both the
sentence-based evaluation and the holder-based eval-
uation, as shown in Table 12 and Table 13. In the
sentence-based evaluation, because there may be mul-
tiple opinion holders in one opinion sentence, the num-
ber of Correct (With holder), Correct (Without holder),
Partial Correct, Incorrect, Miss, False-alarm, preci-
sion, recall and f-measure are listed. The field ”Par-
tial Correct” shows the number of sentences in which
participants did not find all holders, while the field ”In-
correct” shows the number of sentences in which par-
ticipants propose wrong holders. In the holder-based
evaluation, the evaluation unit is one holder. The num-
ber of Correct, Incorrect, Miss, False-alarm, Proposed
Table 7. Comparison of Polarity Evaluation Approaches (Strict)
Annotation System Behavior
POS NEU NEG NOT
3 0 0 0 POS LWK +, DKE +, YS +
2 0 1 0 POS LWK sent. skipped, DKE -, YS -
0 0 0 3 POS LWK -, DKE -, YS -
0 0 1 2 POS LWK sent. skipped, DKE -, YS -
Table 8. Comparison of Polarity Evaluation Approaches (Lenient)
Annotation System Behavior
POS NEU NEG NOT
3 0 0 0 POS LWK +, DKE +, YS +
2 0 1 0 POS LWK +, DKE + by 2/3, YS +
0 0 0 3 POS LWK -, DKE -, YS -
0 0 1 2 POS LWK -, DKE -, YS -
1 0 2 0 POS LWK -, DKE + by 1/3, YS -
1 1 0 1 POS LWK +, DKE + by 1/3, YS P. down R. no change
Table 9. Chinese Opinion Analysis LWK Approach results
Group L/S Opinionated Relevance OpAndPolarity
P R F P R F P R F
CUHK L 0.818 0.519 0.635 0.797 0.828 0.812 0.522 0.331 0.405
ISCAS L 0.590 0.664 0.625 0.232 0.261 0.246
Gate-1 L 0.643 0.933 0.762 — — — — — —
Gate-2 L 0.746 0.591 0.659 — — — — — —
UMCP-1 L 0.645 0.974 0.776 0.683 0.516 0.588 0.292 0.441 0.351
UMCP-2 L 0.630 0.984 0.768 0.644 0.936 0.763 0.286 0.446 0.348
NTU L 0.664 0.890 0.761 0.636 1.000 0.778 0.335 0.448 0.383
CUHK S 0.341 0.575 0.428 0.468 0.900 0.616 0.197 0.596 0.296
ISCAS S 0.221 0.662 0.331 0.059 0.314 0.099
Gate-1 S 0.253 0.979 0.402 — — — — — —
Gate-2 S 0.330 0.696 0.448 — — — — — —
UMCP-1 S 0.245 0.986 0.393 0.404 0.565 0.471 0.085 0.615 0.150
UMCP-2 S 0.239 0.993 0.385 0.354 0.953 0.516 0.081 0.604 0.143
NTU S 0.258 0.921 0.404 0.343 1.000 0.511 0.104 0.662 0.180
Table 10. Chinese Opinion Analysis YS Approach results
Group L/S Opinionated Relevance Polarity
P R F P R F P R F
CUHK L 0.819 0.520 0.636 0.797 0.828 0.813 0.480 0.431 0.454
NTU L 0.630 0.890 0.738 0.603 1.000 0.752 0.269 0.537 0.358
UMCP-1 L 0.645 0.974 0.776 0.683 0.516 0.588 0.256 0.547 0.349
UMCP-2 L 0.630 0.984 0.768 0.644 0.936 0.763 0.248 0.548 0.341
ISCAS L 0.590 0.664 0.625 0.170 0.271 0.209
GATE-1 L 0.643 0.933 0.761 — — — — — —
GATE-2 L 0.747 0.591 0.660 — — — — — —
CUHK S 0.340 0.575 0.428 0.468 0.900 0.616 0.197 0.596 0.296
NTU S 0.245 0.921 0.387 0.326 1.000 0.491 0.099 0.662 0.172
UMCP-1 S 0.245 0.987 0.393 0.404 0.565 0.471 0.086 0.615 0.150
UMCP-2 S 0.239 0.993 0.385 0.354 0.953 0.517 0.081 0.603 0.143
ISCAS S 0.221 0.662 0.331 0.059 0.314 0.099
GATE-1 S 0.253 0.979 0.402 — — — — — —
GATE-2 S 0.330 0.696 0.448 — — — — — —
Table 11. Chinese Opinion Analysis DKE Approach results
Group L/S Opinionated Relevance Polarity
P R F P R F P R F
CUHK L 0.819 0.520 0.636 0.797 0.828 0.813 0.480 0.431 0.454
NTU L 0.667 0.888 0.762 0.636 1.000 0.777 0.286 0.538 0.374
UMCP-1 L 0.645 0.976 0.777 0.683 0.519 0.590 0.256 0.549 0.349
UMCP-2 L 0.630 0.986 0.769 0.644 0.943 0.765 0.248 0.549 0.341
ISCAS L 0.590 0.664 0.625 0.170 0.271 0.209
GATE-1 L 0.643 0.933 0.761 — — — — — —
GATE-2 L 0.747 0.591 0.660 — — — — — —
CUHK S 0.340 0.575 0.428 0.468 0.901 0.616 0.197 0.595 0.296
NTU S 0.265 0.922 0.412 0.342 1.000 0.509 0.108 0.666 0.186
UMCP-1 S 0.245 0.988 0.393 0.404 0.570 0.473 0.086 0.615 0.150
UMCP-2 S 0.239 0.994 0.385 0.354 0.963 0.518 0.081 0.603 0.143
ISCAS S 0.221 0.662 0.331 0.059 0.314 0.099
GATE-1 S 0.253 0.979 0.402 — — — — — —
GATE-2 S 0.330 0.695 0.448 — — — — — —
Holders, Correct Number (the total number of hold-
ers in the correct opinion sentences which participants
proposed), precision, recall, and f-measure are listed.
Opinion holders are only meaningful and extracted in opinion sentences. Therefore, to avoid propagating errors from the opinion sentence extraction, only the holders reported in correct opinion sentences proposed by participants are further evaluated.
7.2 English
Table 14 lists results using the lenient and strict
standards. Of the nine submitted runs, six contained
relevance information (four of the six groups) and
seven contained polarity information (five of the six
groups.) While there is no difference in the GATE
runs reported in these results, the two runs took dif-
ferent strategies for opinion holder identification, but
only the first priority run was evaluated for opinion
holders. The polarity results differ slightly for the two
TUT runs.
For the opinion holder analysis, the English co-
organizer determined whether the system-predicted
opinion holder matched one of the annotated opinion
holders given the context of the sentence. The process
was automated to some extent by looking for exact
string matches, quite common with the -author- opin-
ion holder, and memoization of previous human-made
decisions.
Table 17 lists the precision, recall, and F-measure
for both the lenient and strict evaluations of opinion
holders. The script used to compute the results lists
both precision and recall over all sentences — penal-
izing systems for suggesting opinion holders on non-
opinionated sentences — and over only the sentences
that are marked as opinionated according to the gold
standard data. Table 17 lists results over all opinionated sentences to conform more closely with how the Chinese and Japanese evaluation was performed. Of the 6,319 sentences marked with opinion holders, only 208 have more than one opinion holder, so we felt that this was a reasonable approximation.
7.3 Japanese
Table 18 lists the evaluation results of a Japanese
opinion analysis based on lenient and strict standards.
For opinionated sentence classification, the NICT system performed best in precision and TUT performed best in recall. For opinion holder extraction, EHBN-2 performed best in precision and TUT performed best in recall. For relevance judgment, NICT-2 performed best in precision and NICT-1 performed best in recall. For polarity classification, NICT performed best in precision and TUT performed best in recall.
In summary, the EHBN system had an advantage in opinion holder extraction. NICT implemented a balanced, precision-focused system. TUT implemented a recall-focused system and attained the best F-values.
8 Discussions and Conclusions
8.1 Overview of Results in NTCIR-6
Performance across languages varies greatly and, due to both corpus and annotator differences, is difficult to compare directly. In this pilot task, each language was evaluated independently, and in fact different formulations of precision and recall were used for each language. The task overview paper
Table 12. Chinese Opinion Holders Analysis: Sentence-Based Results
Group L/S CRT-w CRT-wo P-CRT InCRT Miss F-A P R F
CUHK L 1086 1070 189 84 81 319 0.647 0.754 0.697
ISCAS L 665 1724 175 354 447 257 0.458 0.405 0.430
GATE-1 L 364 2685 100 345 1551 44 0.427 0.154 0.227
GATE-2 L 76 1554 5 112 1463 11 0.373 0.046 0.082
UMCP-1 L 1000 916 232 964 243 1955 0.241 0.410 0.303
UMCP-2 L 471 317 103 405 96 628 0.221 0.376 0.278
NTU L 388 2564 57 120 1692 30 0.652 0.172 0.272
CUHK S 550 371 81 41 29 106 0.707 0.785 0.744
ISCAS S 293 544 84 157 188 89 0.470 0.406 0.436
GATE-1 S 165 933 47 171 677 11 0.419 0.156 0.227
GATE-2 S 42 617 3 66 694 3 0.368 0.052 0.091
UMCP-1 S 917 950 213 1051 257 1976 0.293 0.438 0.351
UMCP-2 S 441 327 95 442 97 631 0.274 0.410 0.329
NTU S 179 863 27 53 753 12 0.661 0.177 0.279
Table 13. Chinese Opinion Holders Analysis: Holder-Based Results
Group L/S CRT InCRT Miss F-A P-H CRT-NUM P R F
CUHK L 1375 92 1 386 1854 1476 0.742 0.932 0.826
ISCAS L 871 422 0 396 1689 1958 0.516 0.445 0.478
GATE-1 L 475 363 0 66 904 2774 0.525 0.171 0.258
GATE-2 L 82 112 0 12 206 1943 0.398 0.042 0.076
UMCP-1 L 1232 964 0 1955 4151 2875 0.297 0.429 0.351
UMCP-2 L 1130 1051 0 1976 4157 2874 0.272 0.393 0.321
NTU L 452 121 0 34 607 2672 0.745 0.169 0.276
CUHK S 678 48 1 127 854 841 0.794 0.806 0.800
ISCAS S 391 189 0 162 742 857 0.527 0.456 0.489
GATE-1 S 218 182 0 22 422 1244 0.517 0.175 0.262
GATE-2 S 46 66 0 4 116 952 0.397 0.048 0.086
UMCP-1 S 574 405 0 628 1607 1266 0.357 0.453 0.400
UMCP-2 S 536 442 0 631 1609 1266 0.333 0.423 0.373
NTU S 209 53 0 13 275 1197 0.760 0.175 0.284
Table 14. English Opinion Analysis DKE Approach results
Group L/S Opinionated Relevance Polarity
P R F P R F P R F
IIT-1 L 0.325 0.588 0.419 0.120 0.287 0.169
IIT-2 L 0.259 0.854 0.397 0.086 0.376 0.140
TUT-1 L 0.310 0.575 0.403 0.392 0.597 0.473 0.088 0.215 0.125
TUT-2 L 0.310 0.575 0.403 0.392 0.597 0.473 0.094 0.230 0.134
Cornell L 0.317 0.651 0.427 0.073 0.197 0.107
NII L 0.325 0.624 0.427 0.510 0.322 0.395 0.077 0.194 0.110
GATE-1 L 0.324 0.905 0.477 0.286 0.632 0.393
GATE-2 L 0.324 0.905 0.477 0.286 0.632 0.393
ICU-IR L 0.396 0.524 0.451 0.409 0.263 0.320 0.151 0.264 0.192
IIT-1 S 0.070 0.578 0.125 0.027 0.322 0.049
IIT-2 S 0.056 0.840 0.105 0.016 0.359 0.031
TUT-1 S 0.065 0.553 0.117 0.171 0.605 0.266 0.016 0.195 0.029
TUT-2 S 0.065 0.553 0.117 0.171 0.605 0.266 0.019 0.229 0.034
Cornell S 0.069 0.662 0.125 0.010 0.135 0.018
NII S 0.073 0.642 0.131 0.242 0.355 0.287 0.014 0.185 0.027
GATE-1 S 0.070 0.940 0.130 0.112 0.579 0.188
GATE-2 S 0.070 0.940 0.130 0.112 0.579 0.188
ICU-IR S 0.102 0.616 0.175 0.177 0.266 0.213 0.034 0.301 0.061
Table 15. English Opinion Analysis LWK Approach results
Group L/S Opinionated Relevance Polarity
P R F P R F P R F
IIT-1 L 0.326 0.585 0.419 0.136 0.238 0.173
IIT-2 L 0.260 0.844 0.397 0.108 0.343 0.164
TUT-1 L 0.311 0.572 0.402 0.395 0.595 0.475 0.129 0.232 0.166
TUT-2 L 0.311 0.572 0.402 0.395 0.595 0.475 0.125 0.226 0.161
Cornell L 0.326 0.524 0.402 0.128 0.200 0.156
NII L 0.327 0.625 0.429 0.511 0.321 0.395 0.122 0.228 0.159
GATE-1 L 0.324 0.821 0.465 0.291 0.609 0.394
GATE-2 L 0.324 0.821 0.465 0.291 0.609 0.394
ICU-IR L 0.397 0.532 0.454 0.408 0.262 0.319 0.189 0.247 0.214
IIT-1 S 0.073 0.579 0.129 0.028 0.321 0.051
IIT-2 S 0.058 0.832 0.108 0.017 0.348 0.032
TUT-1 S 0.067 0.551 0.120 0.173 0.603 0.268 0.016 0.195 0.030
TUT-2 S 0.067 0.551 0.120 0.173 0.603 0.268 0.019 0.225 0.035
Cornell S 0.072 0.516 0.127 0.010 0.106 0.019
NII S 0.075 0.638 0.135 0.242 0.353 0.288 0.015 0.181 0.027
GATE-1 S 0.071 0.804 0.131 0.115 0.558 0.191
GATE-2 S 0.071 0.804 0.131 0.115 0.558 0.191
ICU-IR S 0.103 0.615 0.177 0.178 0.265 0.213 0.035 0.300 0.062
Table 16. English Opinion Analysis YS Approach results
Group L/S Opinionated Relevance Polarity
P R F P R F P R F
IIT-1 L 0.326 0.583 0.418 0.120 0.284 0.169
IIT-2 L 0.260 0.842 0.397 0.086 0.370 0.140
TUT-1 L 0.311 0.571 0.403 0.393 0.598 0.474 0.088 0.214 0.125
TUT-2 L 0.311 0.571 0.403 0.393 0.598 0.474 0.095 0.229 0.134
Cornell L 0.317 0.500 0.388 0.073 0.152 0.098
NII L 0.326 0.619 0.427 0.512 0.322 0.395 0.077 0.193 0.110
GATE-1 L 0.327 0.792 0.463 0.287 0.593 0.387
GATE-2 L 0.325 0.813 0.464 0.286 0.612 0.390
ICU-IR L 0.392 0.493 0.437 0.409 0.261 0.318 0.149 0.247 0.186
IIT-1 S 0.070 0.578 0.126 0.027 0.321 0.049
IIT-2 S 0.056 0.835 0.105 0.016 0.355 0.031
TUT-1 S 0.066 0.555 0.118 0.171 0.605 0.267 0.016 0.194 0.029
TUT-2 S 0.066 0.555 0.118 0.171 0.605 0.267 0.019 0.229 0.034
Cornell S 0.069 0.499 0.121 0.001 0.102 0.018
NII S 0.074 0.641 0.132 0.242 0.353 0.287 0.014 0.184 0.027
GATE-1 S 0.071 0.788 0.130 0.113 0.541 0.186
GATE-2 S 0.070 0.804 0.129 0.113 0.561 0.188
ICU-IR S 0.100 0.576 0.170 0.178 0.263 0.212 0.032 0.270 0.057
Table 17. English Opinion Holders Analysis results
Group Lenient Strict
P R F P R F
IIT-1 0.198 0.409 0.266 0.054 0.461 0.097
TUT-1 0.117 0.218 0.153 0.029 0.241 0.051
Cornell 0.163 0.346 0.222 0.041 0.392 0.074
NII 0.066 0.166 0.094 0.018 0.169 0.032
GATE-1 0.121 0.349 0.180 0.029 0.398 0.055
ICU-IR 0.303 0.404 0.346 0.085 0.515 0.146
Table 18. Japanese Opinion Analysis YS Approach results
Group L/S Opinionated Holder
(S/A/B/C/D/OE/LE) Relevance Polarity
P R F P R F P R F P R F
EHBN-1 L 0.531 0.453 0.489 0.138 0.085 0.105 - - - - - -
(224/46/6/34/806/880/2129)
EHBN-2 L 0.531 0.453 0.489 0.314 0.097 0.149 - - - - - -
(236/39/41/77/321/293/2531)
NICT-1 L 0.671 0.315 0.429 0.238 0.102 0.143 0.598 0.669 0.632 0.299 0.149 0.199
(86/0/246/224/378/462/2311)
NICT-2 L 0.671 0.315 0.429 0.238 0.102 0.143 0.644 0.417 0.506 0.299 0.149 0.199
(86/0/246/224/378/462/2311)
TUT L 0.552 0.609 0.579 0.226 0.224 0.225 0.630 0.646 0.638 0.274 0.322 0.296
(472/137/118/134/1006/1354/1378)
EHBN-1 S 0.414 0.479 0.444 0.079 0.094 0.086 - - - - - -
(128/28/2/22/405/1411/1095)
EHBN-2 S 0.414 0.479 0.444 0.183 0.110 0.137 - - - - - -
(130/25/29/31/166/626/1299)
NICT-1 S 0.546 0.348 0.425 0.133 0.110 0.120 0.470 0.693 0.560 0.168 0.150 0.158
(73/0/112/104/214/893/1177)
NICT-2 S 0.546 0.348 0.425 0.133 0.110 0.120 0.525 0.446 0.482 0.168 0.150 0.158
(73/0/112/104/214/893/1177)
TUT S 0.414 0.620 0.497 0.131 0.251 0.172 0.505 0.681 0.580 0.161 0.339 0.218
(292/68/61/63/501/2236/695)
S/A/B/C/D = Five graded evaluation
OE = Over Estimation
LE = Lack of Estimation
Table 19. Japanese Opinion Analysis DKE Approach results
Group L/S Opinionated Relevance Polarity
P R F P R F P R F
EHBN-1 L 0.531 0.453 0.488 - - - - - -
EHBN-2 L 0.531 0.452 0.488 - - - - - -
NICT-1 L 0.671 0.315 0.429 0.598 0.669 0.632 0.298 0.149 0.199
NICT-2 L 0.671 0.315 0.429 0.644 0.417 0.506 0.298 0.149 0.199
TUT-1 L 0.552 0.609 0.589 0.630 0.645 0.638 0.274 0.322 0.296
EHBN-1 S 0.414 0.479 0.444 - - - - - -
EHBN-2 S 0.414 0.479 0.444 - - - - - -
NICT-1 S 0.545 0.348 0.425 0.470 0.693 0.560 0.168 0.150 0.158
NICT-2 S 0.545 0.348 0.425 0.525 0.446 0.482 0.168 0.150 0.158
TUT-1 S 0.414 0.620 0.497 0.505 0.681 0.580 0.161 0.339 0.218
Table 20. Japanese Opinion Analysis LWK Approach results
Group L/S Opinionated Relevance Polarity
P R F P R F P R F
EHBN-1 L 0.531 0.453 0.489 - - - - - -
EHBN-2 L 0.531 0.452 0.489 - - - - - -
NICT-1 L 0.669 0.313 0.426 0.596 0.666 0.629 0.308 0.140 0.192
NICT-2 L 0.669 0.313 0.426 0.644 0.420 0.509 0.308 0.140 0.192
TUT-1 L 0.550 0.614 0.580 0.628 0.646 0.637 0.287 0.311 0.298
EHBN-1 S 0.412 0.476 0.442 - - - - - -
EHBN-2 S 0.412 0.476 0.442 - - - - - -
NICT-1 S 0.542 0.343 0.420 0.475 0.690 0.563 0.165 0.143 0.154
NICT-2 S 0.542 0.343 0.420 0.527 0.446 0.483 0.165 0.143 0.154
TUT-1 S 0.411 0.621 0.495 0.510 0.680 0.583 0.160 0.331 0.216
presents the differences between the evaluation ap-
proaches, and also presents evaluations for each lan-
guage using each approach, but the numbers reported
here are the official results. Opinion Holder evaluation
for English was performed semi-automatically, but
due to the manual effort involved only the first priority
run from each participant was evaluated. The Chinese
and Japanese evaluation also used semi-automatic ap-
proaches to opinion holder evaluation, but were able
to evaluate all submitted runs.
Of the groups that participated, one group (GATE)
participated in both the Chinese and English task, and
one group (TUT) participated in both the English and
Japanese task. Despite using similar approaches, their
results differ in each language in part due to the differ-
ence in annotation between the languages. An inter-
esting question for future work is whether these differ-
ences stem more from annotator training, differences
in the documents that make up the corpus, or cultural
and language differences.
8.2 Directions for NTCIR-7 Opinion Analysis Task
We plan to conduct the Opinion Analysis Task
again in NTCIR-7 and NTCIR-8. The NTCIR meet-
ings are held every year and a half. For NTCIR-7 we
plan to add a new genre to the task, reviews, in ad-
dition to the news genre used in NTCIR-6. We are
currently exploring using review web sites as a source
of data. NTCIR-7 and 8 will both continue to use Chi-
nese, English, and Japanese, and while no further lan-
guages are slated for addition at this time, Korean is a
possible candidate since relevance judgments for some
of the topic already exist. NTCIR-7 will also add a
strength of opinion and stakeholder evaluation in ad-
dition to the subjectivity, polarity, and opinion holder
evaluation performed in NTCIR-6. NTCIR-8 will add
a temporal evaluation, and possibly expand to clause-
level subjectivity.
Acknowledgements
We greatly appreciate the efforts of all the partici-
pants in the Opinion Analysis Pilot Task at the Sixth
NTCIR Workshop. We also greatly appreciate Prof.
Janyce Wiebe at the University of Pittsburgh for her
advisory comments.
References
[1] Y. Choi, C. Cardie, E. Riloff, and S. Patwardhan.
Identifying sources of opinions with conditional ran-
dom fields and extraction patterns. In Proc. of the
2005 Human Language Technology Conf. and Conf.
on Empirical Methods in Natural Language Process-
ing (HLT/EMNLP 2005), Vancouver, B. C., 2005.
[2] M. Gamon and A. Aue. Proc. of Wksp. on Sentiment
and Subjectivity in Text at the 21st Int'l Conf. on Computational Linguistics / the 44th Ann. Meeting of the Assoc. for Computational Linguistics (COLING/ACL 2006). The Association for Computational Linguistics, Sydney, Australia, July 2006.
[3] L. W. Ku, T. H. Wu, L. Y. Lee, and H. H. Chen. Con-
struction of an Evaluation Corpus for Opinion Extrac-
tion. In Proc. of the Fifth NTCIR Wksp. on Evaluation
of Information Access Technologies: Information Re-
trieval, Question Answering, and Cross-Lingual Infor-
mation Access, pages 513–520, December 2005.
[4] National Institute of Informatics. NTCIR (NII-
NACSIS Test Collection for IR Systems) Project [on-
line]. In NTCIR (NII-NACSIS Test Collection for IR
Systems) Project website, 1998-2007. [cited 2007-01-
26]. Available from: <http://research.nii.ac.jp/ntcir/>.
[5] National Institute of Informatics. NTCIR-6 Opin-
ion Analysis Pilot Task [online]. In NTCIR web-
site, 2006. [cited 2007-1-26]. Available from:
<http://research.nii.ac.jp/ntcir/ntcir-ws6/opinion/index-en.html>.
[6] National Institute of Informatics. NTCIR CLIR Task
[online]. In NTCIR, 2006. [cited 2007-1-26]. Available
from: <http://homepage3.nifty.com/kz 401/>.
[7] National Institute of Standards and Technology. TREC
(Text REtrieval Conference) 2006-2007: BLOG Track
[online]. In TREC website, 2006. [cited 2007-1-26].
Available from: <http://trec.nist.gov/tracks.html>.
[8] Y. Seki, K. Eguchi, and N. Kando. Multi-document
viewpoint summarization focused on facts, opinion
and knowledge. In J. G. Shanahan, Y. Qu, and
J. Wiebe, editors, Computing Attitude and Affect in
Text: Theory and Applications, volume 20 of The In-
formation Retrieval Series, chapter 24, pages 317–
336. Springer-Verlag, New York, December 2005.
[9] J. G. Shanahan, Y. Qu, and J. Wiebe. Computing Atti-
tude and Affect in Text: Theory and Applications, vol-
ume 20 of The Information Retrieval Series. Springer-
Verlag, New York, December 2005.
[10] J. Wiebe, T. Wilson, and C. Cardie. Annotating
Expressions of Opinions and Emotions in Language.
Language Resources and Evaluation, 39(2-3):165–
210, 2005.
[11] J. M. Wiebe, E. Breck, C. Buckley, C. Cardie, P. Davis,
B. Fraser, D. Litman, D. Pierce, E. Riloff, and T. Wil-
son. MPQA: Multi-Perspective Question Answering
Opinion Corpus
Version 1.2, 2006. [cited 2007-1-26]. Available from:
<http://www.cs.pitt.edu/mpqa/databaserelease/>.
[12] J. M. Wiebe, T. Wilson, R. F. Bruce, M. Bell, and
M. Martin. Learning subjective language. Compu-
tational Linguistics, 30(3):277–308, 2004.
[13] T. Wilson, J. Wiebe, and P. Hoffmann. Recognizing
contextual polarity in phrase-level sentiment analysis.
In Proc. of the 2005 Human Language Technology
Conf. and Conf. on Empirical Methods in Natural Lan-
guage Processing (HLT/EMNLP 2005), Vancouver, B.
C., 2005.
... Sentiment analysis (sometimes called "opinion mining") is a research topic that has been actively discussed and developed for some 20 years, particularly in the fields of natural language processing (NLP) and information retrieval (IR) (Pang and Lee 2008). In this paper, we introduce the multilingual opinion analysis task (MOAT ) (Seki et al. 2010(Seki et al. , 2008(Seki et al. , 2007, which was included in NTCIR-6, 7, and8 (2006-2010). We then discuss the role and novelty of the task in sentiment analysis research. ...
... NTCIR MOAT was held at NTCIR-6 (Seki et al. 2007), NTCIR-7 (Seki et al. 2008), and NTCIR-8 (Seki et al. 2010. The task definition evolved through the three sessions, as shown in Table 6.1. ...
Chapter
Full-text available
At NTCIR-6, 7, and 8, we included a new multilingual opinion analysis task (MOAT) that involved Japanese, English, and Chinese newspapers. This was the first task that compared the performance of sentiment retrieval strategies with common subtasks across languages. In this paper, we introduce the research question posed by NTCIR MOAT and present what has been achieved to date. We then describe the types of tasks and research that have involved our test collection both previously and in current research. Finally, we summarize our contributions and discuss future research directions.
... Different from the Blog track, the MOAT task focused on opinionated text analysis across languages, including English, Chinese and Japanese. [6,7,8] In NTCIR8 MOAT, there are five subtasks: opinionated/relevance sentence judgment, polarity judgment, and opinion holder/target identification [8]. We only participate in the opinion sentence judgment subtask, in both English side and Simplified Chinese side. ...
... This can be seen from the low inter-annotator agreement Kappa values reported both in NTCIR6 [6] and NTCIR7 [7]. Therefore, it makes us have to give up deeply analyzing the samples and designing elaborate solutions. ...
Conference Paper
Full-text available
In this paper, we briefly describe our machine-learning based method used in the NITCIR8 MOAT task, particularly, the opinioned sentence judgment subtask on both English side and Chinese side. We view this subtask as a binary classification problem and build a supervised-learning based framework. To extract meaningful sentiment features, we propose several n-gram patterns to assemble basic words and part-of-speech tags. Meanwhile, our basic classifiers are trained merely on the previous NTICR annotated corpus, in which samples are inadequate and unbalanced. Thus, we adopt a few self-learning strategies to utilize the NTCIR8 testing corpus to adjust our basic classifiers. Using the same learning framework in both language sides, we get similar performances.
... Sentiment analysis and opinion mining is an important task that has gained a lot of attention through a multitude of related shared tasks organized as part of NT-CIR (Seki et al., 2007(Seki et al., , 2008(Seki et al., , 2010 and SemEval workshops (Nakov et al., 2013(Nakov et al., , 2019Rosenthal et al., 2014Rosenthal et al., , 2015Rosenthal et al., , 2017Pontiki et al., 2014Pontiki et al., , 2015Pontiki et al., , 2016Ghosh et al., 2015). Customer sentiment analysis has gained traction in commercial product and brand name monitoring (Liu et al., 2017). ...
Article
Full-text available
Sentiment analysis and opinion mining are essential tasks with many prominent application areas, e.g., when researching popular opinions on products or brands. Sentiments expressed in social media can be used in brand name monitoring and indicating fake news. In our survey of previous work, we note that there is no large-scale social media data set with sentiment polarity annotations for Finnish. This publication aims to remedy this shortcoming by introducing a 27,000-sentence data set annotated independently with sentiment polarity by three native annotators. We had three annotators annotate the whole data set, which provides a unique opportunity for further studies of annotator behavior over the sample annotation order. We analyze their inter-annotator agreement and provide two baselines to validate the usefulness of the data set.
... Then, we list some other multilingual data sets in chronological order grouped by the domain of the texts. Seki et al. (2007, 2008, 2010) give overviews of the Opinion Analysis Pilot Tasks at the NTCIR Workshops. Nakov et al. (2013, 2019) and Rosenthal et al. (2014, 2015, 2017) describe the data sets used in the SemEval Sentiment Analysis in Twitter tasks 2013-2017. ...
Preprint
Full-text available
Sentiment analysis and opinion mining are important tasks with obvious application areas in social media, e.g. when indicating hate speech and fake news. In our survey of previous work, we note that there is no large-scale social media data set with sentiment polarity annotations for Finnish. This publication aims to remedy this shortcoming by introducing a 27,000-sentence data set annotated independently with sentiment polarity by three native annotators. We had the same three annotators for the whole data set, which provides a unique opportunity for further studies of annotator behaviour over time. We analyse their inter-annotator agreement and provide two baselines to validate the usefulness of the data set.
... We used both the independent t-test and the Mann–Whitney U test to determine whether there was a statistically significant difference between the patient and control groups [27]. ...
Article
Full-text available
Background: This study aimed to examine social-cognitive impairments in patients with schizophrenia using the Eyes test. In contrast to previous methods using the correct answers, we developed the Taiwanese version of the Eyes test and constructed the response network to explore impairments in the emotional aspects of theory of mind in patients with schizophrenia. Methods: Eighteen patients with schizophrenia and 18 healthy controls were recruited to examine their performance on the Eyes test. To explore the internal structures of mental states, we used network analysis to construct the networks of choice patterns (i.e. participants' answers) by using two network indicators, including density (an index of structure diversity of a network) and centrality (an index of the choice patterns within a network). Moreover, we divided all the choices into negative, positive, and neutral item groups based on emotion polarity. Results: The patient group was slower and less accurate than the control group. Moreover, there was a negative correlation between accuracy and blunted affect, and there were positive correlations between reaction time and emotional withdrawal and apathetic social withdrawal. Patients with schizophrenia showed greater density in the network structure and higher centrality than healthy controls. Also, patients showed poorer performance on negative words than healthy controls. Conclusion: Our results demonstrated more diversity in recognizing negative emotions in patients' choice patterns compared with those of the control group. These findings suggest that deficits in recognizing negative emotions might be associated with the dysfunctions of mental states in schizophrenia.
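The density and centrality indicators used in that study are standard graph measures. The sketch below computes them with networkx over a small invented choice-pattern graph; the graph is purely illustrative and does not reflect the study's data.

# Minimal sketch: density and degree centrality of a toy choice-pattern network.
# Nodes are answer choices; edges link choices picked by the same participant
# for the same item. The graph is invented purely for illustration.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("worried", "upset"), ("worried", "nervous"),
    ("upset", "nervous"), ("friendly", "relaxed"),
])

density = nx.density(G)                    # ratio of actual to possible edges
centrality = nx.degree_centrality(G)       # per-node normalized degree

print(f"density = {density:.2f}")
for node, c in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {c:.2f}")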
... The NTCIR-6 dataset (Seki et al., 2007) contains newspaper articles on 32 topics in Chinese, Japanese, and English. We extracted only the sentences annotated with three-class sentiment labels. ...
... In addition, there are related approaches that describe themselves as opinion analysis. For instance, the NTCIR shared tasks OPINION and MOAT on opinion analysis also annotate newspaper articles with a polarity towards sets of up to 32 topics (Seki et al., 2007, 2008, 2010). ...
Thesis
Full-text available
Stance can be defined as positively or negatively evaluating persons, things, or ideas (Du Bois, 2007). Understanding the stance that people express through social media has several applications: It allows governments, companies, or other information seekers to gain insights into how people evaluate their ideas or products. Being aware of the stance of others also enables social media users to engage in discussions more efficiently, which may ultimately lead to better collective decisions. Since the volume of social media posts is too large to be analyzed manually, computer-aided methods for understanding stance are necessary. In this thesis, we study three major aspects of such computer-aided methods: (i) abstract formalizations of stance which we can quantify across multiple social media posts, (ii) the creation of suitable datasets that correspond to a certain formalization, and (iii) stance detection systems that can automatically assign stance labels to social media posts. We examine four different formalizations that differ in how specific the insights and supported use-cases are: Stance on Single Targets defines stance as a tuple consisting of a single target (e.g. Atheism) and a polarity (e.g. being in favor of the target), Stance on Multiple Targets models a polarity expressed towards an overall target and several logically linked targets, and Stance on Nuanced Targets is defined as a stance towards all texts in a given dataset. Moreover, we study Hateful Stance, which models whether a post expresses hatefulness towards a single target (e.g. women or refugees). Machine learning-based systems require training data that is annotated with stance labels. Since annotated data is not readily available for every formalization, we create our own datasets. On these datasets, we perform quantitative analyses, which provide insights into how reliable the data is, and into how social media users express stance. Our analysis shows that the reliability of datasets is affected by subjective interpretations and by the frequency with which targets occur. Additionally, we show that the perception of hatefulness correlates with the personal stance of the annotators. We conclude that stance annotations are, to a certain extent, subjective and that future attempts at data creation should account for this subjectivity. We present a novel process for creating datasets that contain subjective stances towards nuanced assertions and which provide comprehensive insights into debates on controversial issues. To investigate the state of the art in stance detection methods, we organized and participated in relevant shared tasks, and conducted experiments on our own datasets. Across all datasets, we find that comparatively simple methods yield competitive performance. Furthermore, we find that neural approaches are competitive, but not clearly superior to more traditional text classification approaches. We show that approaches based on judgment similarity – the degree to which texts are judged similarly by a large number of people – outperform reference approaches by a large margin. We conclude that judgment similarity is a promising direction to achieve improvements beyond the state of the art in automatic stance detection and related tasks such as sentiment analysis or argument mining.
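The "Stance on Single Targets" formalization described above, a tuple of one target and a polarity, can be captured with a very small data structure. The sketch below is an illustrative assumption of how such a label might be represented, not code from the thesis.

# Minimal sketch of the "Stance on Single Targets" formalization described in
# the thesis abstract above: a post is labeled with one (target, polarity) pair.
# Names and labels here are illustrative assumptions, not the thesis' code.
from dataclasses import dataclass
from enum import Enum

class Polarity(Enum):
    FAVOR = "favor"
    AGAINST = "against"
    NONE = "none"

@dataclass
class Stance:
    target: str         # e.g. "Atheism"
    polarity: Polarity  # evaluation of the target expressed in the post

post = "Science explains the world well enough without extra assumptions."
label = Stance(target="Atheism", polarity=Polarity.FAVOR)
print(label)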
... The interest generated by these tasks is reflected in the appearance of resources such as a variant of WordNet that captures sentiment-related aspects of words [12] (Esuli & Sebastiani, 2006), or corpora intended to serve as a basis for experimentation with opinion mining and sentiment analysis techniques, such as the multilingual NTCIR corpus [13] ...
Article
Full-text available
Personalized recommendation systems have become ubiquitous in our digital lives, shaping our interactions with online platforms and services. This paper explores the advancements in recommendation algorithms, delving into their efficacy, and examines the ethical considerations surrounding their implementation. Through a comprehensive review of literature and case studies, this research paper aims to provide insights into the evolving landscape of personalized recommendation systems and the ethical dilemmas they pose.
Article
Full-text available
Opinion retrieval aims to tell if a document is positive, neutral, or negative on a given topic. Opinion extraction further identifies the document's supportive and non-supportive evidence. This paper defines the annotation of opinionated material. The algorithm employs opinion holders, a topic's conceptual words, sentiment words, opinion operators, and negation operators to recognize opinions. An opinion extraction system is developed to reflect the major views of selected information sources. The extracted text-based evidence is ready for opinion summarization and opinionated question answering.
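As a rough illustration of how such resources can be combined, the sketch below scores a sentence with small hand-made lists of sentiment words, opinion operators, and negation operators. The word lists and the scoring rule are assumptions for illustration, not the algorithm described in the abstract above.

# Minimal sketch: lexicon-based opinion recognition with negation handling.
# All word lists and the scoring rule are illustrative assumptions.
SENTIMENT = {"good": 1, "wonderful": 1, "bad": -1, "terrible": -1}
OPINION_OPERATORS = {"believe", "think", "argue", "claim"}   # mark subjectivity
NEGATION = {"not", "never", "no"}

def score_sentence(sentence: str) -> tuple[bool, int]:
    """Return (is_opinionated, polarity) for a whitespace-tokenized sentence."""
    tokens = sentence.lower().split()
    opinionated = any(t in OPINION_OPERATORS or t in SENTIMENT for t in tokens)
    polarity = 0
    for i, tok in enumerate(tokens):
        if tok in SENTIMENT:
            value = SENTIMENT[tok]
            # Flip polarity if a negation word appears just before the clue.
            if i > 0 and tokens[i - 1] in NEGATION:
                value = -value
            polarity += value
    return opinionated, polarity

print(score_sentence("I believe this policy is not good"))   # (True, -1)
print(score_sentence("The train departs at noon"))           # (False, 0)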
Article
Full-text available
Subjectivity in natural language refers to aspects of language used to express opinions, evaluations, and speculations. There are numerous natural language processing applications for which subjectivity analysis is relevant, including information extraction and text categorization. The goal of this work is learning subjective language from corpora. Clues of subjectivity are generated and tested, including low-frequency words, collocations, and adjectives and verbs identified using distributional similarity. The features are also examined working together in concert. The features, generated from different data sets using different procedures, exhibit consistency in performance in that they all do better and worse on the same data sets. In addition, this article shows that the density of subjectivity clues in the surrounding context strongly affects how likely it is that a word is subjective, and it provides the results of an annotation study assessing the subjectivity of sentences with high-density features. Finally, the clues are used to perform opinion piece recognition (a type of text categorization and genre detection) to demonstrate the utility of the knowledge acquired in this article.
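One concrete reading of the density observation above is to measure how many known subjectivity clues fall within a token window around a candidate word. The sketch below is a minimal illustration under an assumed clue list and window size, not the procedure used in that work.

# Minimal sketch: density of subjectivity clues in a token window around a
# target position. Clue list and window size are illustrative assumptions.
CLUES = {"amazing", "awful", "unfortunately", "brilliant", "claims"}

def clue_density(tokens: list[str], center: int, window: int = 3) -> float:
    """Fraction of tokens within +/-window of `center` that are clue words."""
    lo = max(0, center - window)
    hi = min(len(tokens), center + window + 1)
    context = tokens[lo:hi]
    return sum(t.lower() in CLUES for t in context) / len(context)

tokens = "Unfortunately the otherwise brilliant plan collapsed awful quickly".split()
print(clue_density(tokens, center=4))   # density around the word "plan"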
Conference Paper
Full-text available
Recent systems have been developed for sentiment classification, opinion recognition, and opinion analysis (e.g., detecting polarity and strength). We pursue another aspect of opinion analysis: identifying the sources of opinions, emotions, and sentiments. We view this problem as an information extraction task and adopt a hybrid approach that combines Conditional Random Fields (Lafferty et al., 2001) and a variation of AutoSlog (Riloff, 1996a). While CRFs model source identification as a sequence tagging task, AutoSlog learns extraction patterns. Our results show that the combination of these two methods performs better than either one alone. The resulting system identifies opinion sources with 79.3% precision and 59.5% recall using a head noun matching measure, and 81.2% precision and 60.6% recall using an overlap measure.
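The overlap measure quoted above scores an extracted source as correct if its span overlaps a gold span. A minimal sketch of overlap-based precision and recall over character offsets follows; the matching rule and toy spans are illustrative assumptions rather than the cited paper's exact protocol.

# Minimal sketch: precision/recall for extracted spans, where a predicted span
# counts as correct if it overlaps any gold span. Spans are (start, end) offsets.
# The matching rule and toy spans below are illustrative assumptions.
def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    return a[0] < b[1] and b[0] < a[1]

def precision_recall(predicted, gold):
    tp_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    tp_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    precision = tp_pred / len(predicted) if predicted else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    return precision, recall

gold_sources = [(0, 12), (40, 55)]        # gold opinion-holder spans
predicted_sources = [(2, 10), (70, 80)]   # system output
print(precision_recall(predicted_sources, gold_sources))  # (0.5, 0.5)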
Book
Human Language Technology (HLT) and Natural Language Processing (NLP) systems have typically focused on the “factual” aspect of content analysis. Other aspects, including pragmatics, opinion, and style, have received much less attention. However, to achieve an adequate understanding of a text, these aspects cannot be ignored. The chapters in this book address the aspect of subjective opinion, which includes identifying different points of view, identifying different emotive dimensions, and classifying text by opinion. Various conceptual models and computational methods are presented. The models explored in this book include the following: distinguishing attitudes from simple factual assertions; distinguishing the author’s reports from reports of other people’s opinions; and distinguishing between explicitly and implicitly stated attitudes. In addition, many applications are described that promise to benefit from the ability to understand attitudes and affect, including indexing and retrieval of documents by opinion; automatic question answering about opinions; analysis of sentiment in the media and in discussion groups about consumer products, political issues, etc.; brand and reputation management; discovering and predicting consumer and voting trends; analyzing client discourse in therapy and counseling; determining relations between scientific texts by finding reasons for citations; generating more appropriate texts and making agents more believable; and creating writers’ aids. The studies reported here are carried out on different languages such as English, French, Japanese, and Portuguese. Difficult challenges remain, however. It can be argued that analyzing attitude and affect in text is an “NLP”-complete problem.
Chapter
An interactive information retrieval system that provides different types of summaries of retrieved documents according to each user’s information needs, situation, or purpose of search can be effective for understanding document content. The purpose of this study is to build a multi-document summarizer, “Viewpoint Summarizer With Interactive clustering on Multidocuments (v-SWIM)”, which produces summaries according to such viewpoints. We tested its effectiveness on a new test collection, ViewSumm30, which contains human-made reference summaries of three different summary types for each of the 30 document sets. Once a set of documents on a topic (e.g., documents retrieved by a search engine) is provided to v-SWIM, it returns a list of topics discussed in the given document set, so that the user can select a topic or topics of interest as well as the summary type, such as fact-reporting, opinion-oriented, or knowledge-focused; v-SWIM then produces a summary from the viewpoints of the topics and summary type selected by the user. We assume that sentence types and document genres are related to the types of information included in the source documents and are useful for selecting appropriate information for each of the summary types. “Sentence type” defines the type of information in a sentence. “Document genre” defines the type of information in a document. The results of the experiments showed that the proposed system, using automatically identified sentence types and document genres of the source documents, improved the coverage of the system-produced fact-reporting, opinion-oriented, and knowledge-focused summaries by 13.14%, 34.23%, and 15.89%, respectively, compared with our baseline system, which did not differentiate sentence types or document genres.
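The core selection idea, keeping sentences whose sentence type matches the requested summary type, can be sketched in a few lines. The sentence-type labels, the mapping, and the toy data below are assumptions for illustration, not the v-SWIM implementation.

# Minimal sketch: pick sentences whose sentence-type label matches the requested
# summary type. Sentence types, the mapping, and the toy data are assumptions.
SUMMARY_TYPE_TO_SENTENCE_TYPES = {
    "fact-reporting": {"fact"},
    "opinion-oriented": {"opinion"},
    "knowledge-focused": {"fact", "definition"},
}

sentences = [
    ("The council approved the budget on Monday.", "fact"),
    ("In my view, the budget favors large firms.", "opinion"),
    ("A budget is an estimate of income and expenditure.", "definition"),
]

def select(sentences, summary_type):
    wanted = SUMMARY_TYPE_TO_SENTENCE_TYPES[summary_type]
    return [text for text, stype in sentences if stype in wanted]

print(select(sentences, "opinion-oriented"))
print(select(sentences, "knowledge-focused"))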
Article
This paper describes a corpus annotation project to study issues in the manual annotation of opinions, emotions, sentiments, speculations, evaluations and other private states in language. The resulting corpus annotation scheme is described, as well as examples of its use. In addition, the manual annotation process and the results of an inter-annotator agreement study on a 10,000-sentence corpus of articles drawn from the world press are presented.
Article
This article describes a spatial model for matching semantic values between two languages, French and English. Based on semantic similarity links, the model constructs a map that represents a word in the source language. The algorithm then projects the map values onto a space in the target language. The new space abides by the semantic similarity links specific to the second language. Finally, the two maps are projected onto the same plane in order to detect overlapping values. For instructional purposes, the different steps are presented here using a few examples. The entire set of results is available at the following address: http://dico.isc.cnrs.fr
J. M. Wiebe, E. Breck, C. Buckley, C. Cardie, P. Davis, B. Fraser, D. Litman, D. Pierce, E. Riloff, and T. Wilson. MPQA: Multi-Perspective Question Answering Opinion Corpus Version 1.2, 2006. [cited 2007-1-26]. Available from: <http://www.cs.pitt.edu/mpqa/databaserelease/>.