Unsupervised Event Clustering and Aggregation from Newswire and Web Articles

January 2017

DOI:10.18653/v1/W17-4211

Conference: Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism

Authors:

Swen Ribeiro

Computer Sciences Laboratory for Mechanics and Engineering Sciences

Olivier Ferret

Atomic Energy and Alternative Energies Commission

Xavier Tannier

Sorbonne Université

Proceedings of the 2017 EMNLP Workshop on Natural Language Processing meets Journalism, pages 62–67

Copenhagen, Denmark, September 7, 2017. © 2017 Association for Computational Linguistics

Unsupervised Event Clustering and Aggregation from Newswire

and Web Articles

Swen Ribeiro

LIMSI, CNRS

Univ. Paris-Sud

Université Paris-Saclay

swen.ribeiro@limsi.fr

Olivier Ferret

CEA, LIST,

Gif-sur-Yvette,

F-91191 France.

olivier.ferret@cea.fr

Xavier Tannier

LIMSI, CNRS

Univ. Paris-Sud

Université Paris-Saclay

xavier.tannier@limsi.fr

Abstract

In this paper, we present an unsupervised pipeline approach for clustering news articles based on identified event instances in their content. We leverage press agency newswire and monolingual word alignment techniques to build meaningful and linguistically varied clusters of articles from the Web in the perspective of a broader event type detection task. We validate our approach on a manually annotated corpus of Web articles.

1 Introduction

In the context of news production, an event is the characterization of a significant enough change in a space-time context to be reported as newsworthy content. This definition fits with definitions proposed in other contexts such as the ACE 2005 and TAC KBP Event evaluations or work such as (Cybulska and Vossen, 2014; Mitamura et al., 2015), which generally view each event as “something that happens at a particular place and time”, implying changes in the state of the world and involving participants. In accordance with ontologies about events such as the Simple Event Model (SEM) ontology (van Hage et al., 2011), events can be categorized into different types, for example “elections” or “earthquakes”, gathering multiple real-life instances, for example the “2017 UK General Election” or the “2012 French Presidential Election”. These instances are reported by journalists through varying textual mentions. Event extraction is a challenging task that has received increasing interest in the past years through many formulations such as event identification or event detection. It is also an important subtask of larger NLP applications such as document summarization and event schema induction. Several approaches have been used to tackle the different aspects of this task, particularly in an unsupervised fashion, from linguistic pipelines (Filatova et al., 2006; Huang et al., 2016) to topic modeling approaches (Chambers and Jurafsky, 2011; Cheung et al., 2013) and more recently neural networks (Nguyen et al., 2016). While the definition and granularity of an event vary with the task and objectives at hand, most event identification systems exploit mentions to produce type-level representations.

We propose to address the unsupervised event extraction task through two subtasks: first, unsupervised event instance extraction and second, event type extraction. This paper focuses on our efforts regarding the first step, i.e. unsupervised event instance extraction. In this perspective, we present a method based on clustering algorithms leveraging news data from different sources. We believe that this first step might act as a bridge between the surface forms that are mentions and the more abstract concept of instances and types of events. Moreover, the context of this work is the ASRAEL project, which aims at providing operational tools for journalists, and this instance/type segmentation seems relevant in the perspective of further event-driven processing developments.

Our clustering approach considers three dimensions: time, space and content. A content alignment system is adapted from Sultan et al. (2014) and a time and space-aware similarity function is proposed in order to aggregate articles about the same event.

We work with a large collection of English news and Web articles, where each article describes an event: the main topic of the article is a specific event, and other older events are mentioned in order to put it into perspective. Thus, we consider that one event is associated with each article.

Figure 1: Overview of the system.

Our system’s objective is to build clusters of articles describing the same exact real-life event, i.e. the same event instance. We adopt two definitions of the relation “same event” (strict and loose) and evaluate through these two definitions.

2 Two-step Clustering

Our approach is structured as a pipeline including a two-step clustering with an additional filtering step at the end. The first step leverages a homogeneous corpus of news articles for building focused and “clean” clusters corresponding to event instances. The second step exploits these focused clusters for clustering documents coming from the Web that are noisier but also more likely to bring new information about the considered events. Figure 1 illustrates this pipeline.

2.1 Corpora

The first clustering step (represented in blue in Figure 1) is performed on a corpus from the Agence France-Presse (AFP) news agency. Each news article comes with metadata providing additional information about its time-space context of creation, such as its UTC time-stamp, and about its content, through International Press Telecommunications Council (IPTC) NewsCodes. NewsCodes are a standard subject taxonomy created and maintained by the IPTC, with a focus on text.

From the 1,400+ existing NewsCodes, we selected 72 that can be viewed as event types¹, covering as many event types as possible without overlapping with one another, and retrieved all news articles tagged with at least one of these NewsCodes. This resulted in a corpus of about 52,000 documents for the year 2015.

¹ A user-friendly tree visualization of all the NewsCodes is available at http://show.newscodes.org/index.html?newscodes=subj.

The second clustering step (in orange in Figure 1) takes as input news articles crawled from a list of Web news feeds in English. We used a corpus of 1.3 million Web news articles published in 2015, from about 20 different Web news sites (3,700 documents/day on average), including the RSS feeds of the New York Times, the BBC and the Wall Street Journal.

In both corpora, we process only the title and first paragraph (usually one or two sentences) of the documents, under the assumption that they follow the journalistic rule of the 5Ws: the lead of an article must provide information about what, when, where, who and why.

2.2 Approach

2.2.1 Press Agency Clustering

The first clustering step computes the similarity matrix of the AFP news by means of the All Pairs Similarity Search (APSS) algorithm (Bayardo et al., 2007) and applies the Markov Clustering (MCL) algorithm (van Dongen, 2000) to it. News articles are represented as bags of words containing the lemmatized forms of their nouns, adjectives and verbs.

The similarity function between two documents $d_1$ and $d_2$ is the following:

$$\mathrm{sim}(d_1, d_2) = \frac{\cos(d_1, d_2)}{e^{\delta/24}}$$

where $\cos(d_1, d_2)$ is the cosine similarity and $\delta$ is the difference between the documents' creation times (in hours). This time decay ensures that two similar but different events, occurring at different moments, will not be grouped together. Only similarities above a threshold $\tau$ have been considered².
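
The time-decayed similarity above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, documents are assumed to be bags of lemmas with raw term counts (the paper does not state a term weighting), and creation times are assumed to be expressed in hours.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def time_decayed_similarity(bow1, bow2, hours1, hours2, tau=0.5):
    """sim(d1, d2) = cos(d1, d2) / e^(delta / 24), with delta in hours.

    Similarities that are not above the threshold tau are dropped (returned
    as 0.0). tau=0.5 is the value reported in the paper's footnote.
    """
    delta = abs(hours1 - hours2)
    sim = cosine(bow1, bow2) / math.exp(delta / 24.0)
    return sim if sim > tau else 0.0
```

With this decay, two articles with identical wording published 24 hours apart score about 1/e ≈ 0.37, which already falls below τ = 0.5.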

This first step yields small and instance-focused clusters of press agency news articles only. While they can be considered high quality content, they are quite homogeneous and lack variety in their wording, and could not be used for broader tasks such as event type-level detection. An example of output for this step is provided in Figure 2.

² A grid search led to τ = 0.5.

Hundreds dead in Nepal quake, avalanche triggered on Everest. A massive 7.8 magnitude earthquake killed hundreds of people Saturday as it ripped through large parts of Nepal, toppling office blocks and towers in Kathmandu and triggering an avalanche that hit Everest base camp.

Nepal quake kills 1,200, sparks deadly Everest avalanche. A massive earthquake killed more than 1,200 people Saturday as it tore through large parts of Nepal, toppling office blocks and towers in Kathmandu and triggering a deadly avalanche at Everest base camp.

Hundreds dead in Nepal quake, deadly avalanche on Everest. A massive 7.8 magnitude earthquake killed more than 900 people Saturday as it ripped through large parts of Nepal, toppling office blocks and towers in Kathmandu and triggering a deadly avalanche that hit Everest base camp.

Figure 2: 3 of 5 AFP news articles clustered together. While they indeed cover the same event instance, there are few wording variations between them, limiting their interest for broader event detection and assimilated tasks.

2.2.2 Web Article Extension

In this step, we aim to alleviate the lack of variability of our AFP news article clusters by leveraging their high focus to aggregate Web documents about the same event instances.

To do so, we identify the first article published in each AFP cluster (using the time-stamp) and retrieve all Web articles published in the next 24 hours. We call this first AFP article the “reference”. This choice is based on the assumption that press agencies are a primary source of trustworthy information for most news feeds, so it would be rare to find mentions of an event instance before an agency article was released, especially in an international context.

We first perform a “coarse-grain” agglomeration through low-threshold cosine similarity-based clustering between the AFP reference and all Web articles in the given 24-hour timespan. This results in smaller subsets of data to feed the next module in the pipeline.
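
The coarse-grain step can be pictured as a simple filter around the reference. The sketch below is only an approximation of what is described: `coarse_candidates` and its tuple layout are hypothetical, the clustering is reduced to a comparison against the reference alone, and the threshold `min_cos=0.2` is a placeholder, since the paper does not report the value used.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def coarse_candidates(ref_bow, ref_hours, web_articles,
                      window_hours=24.0, min_cos=0.2):
    """Keep Web articles published within the window after the reference
    and whose cosine similarity to it reaches a low threshold.

    web_articles: iterable of (article_id, bag_of_words, hours) tuples.
    min_cos is a placeholder value, not taken from the paper.
    Returns (article_id, cosine) pairs sorted by descending cosine.
    """
    kept = []
    for article_id, bow, hours in web_articles:
        if 0.0 <= hours - ref_hours <= window_hours:
            score = cosine(ref_bow, bow)
            if score >= min_cos:
                kept.append((article_id, score))
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept
```

Sorting by descending cosine mirrors the ordering that the evaluation later uses as its baseline.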

We then use the monolingual word alignment system described in Sultan et al. (2014). This system performs a word-to-word alignment between two sentences by applying a series of alignment modules, each focusing on a specific type of linguistic unit. The alignment process starts with n-grams of words (with n > 2) including at least one content word. Then, named entities are considered, followed by content words and finally, stopwords. While the alignment of n-grams of words and named entities is based only on string matching (exact match for n-grams, partial for named entities, as the system uses Stanford NER to resolve acronyms and match partial mentions), the system also relies on contextual evidence for the other linguistic units, e.g. syntactic dependencies and textual neighborhood. Textual neighborhood is defined as a window of the next and previous 3 content words surrounding each word being considered for an alignment. The system then computes a similarity score for each available candidate pair based on this evidence, and selects the highest-scoring pair for a given word as the chosen alignment. We adapted the system to better fit our needs by extending the stopword list, aligning unigram exact matches first, and using the absence of matching content words or named entities as an early stopping condition of the alignment process.

For each AFP cluster, we perform alignment between the reference (earliest article) and each Web article from the subset. This allows us to build a word alignment matrix where each column contains the words in a document and each line shows how each word of the reference has aligned across all documents.

We then compute a score for each document, taking into account how many words in a document have been aligned with the reference, and how many times a reference word has found an alignment across all documents.

Figure 3 illustrates how this score is computed. We first build the binary alignment matrix $B$ where columns represent documents and rows represent term alignments. If a term $i$ (out of $M$ aligned terms) from document $j$ (out of $N$ documents) has been aligned with a term from the reference, then $B_{i,j} = 1$, otherwise $B_{i,j} = 0$. We then compute a weight for each alignment, leading to a vector $\mathrm{Align}$ such that for each term $i$:

$$\mathrm{Align}_i = \sum_{j=0}^{N-1} B_{i,j}$$

The absolute alignment score of each document $j$ is then:

$$s_j = \sum_{i=0}^{M-1} W_{i,j}$$

where $W = B \times \mathrm{Align}$, that is, each row $i$ of $B$ is scaled by its weight: $W_{i,j} = B_{i,j} \cdot \mathrm{Align}_i$. Finally, we normalize these scores by the score that the reference itself would have obtained.
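
As a worked illustration of this scoring, the sketch below computes the normalized scores from a small binary matrix. It rests on two assumptions that the text does not spell out: the product of B and Align is read as a row-wise scaling (each row i of B multiplied by Align_i), and the reference's own score is taken to be the sum of all weights, as if the reference aligned with every term. The function name is hypothetical.

```python
def alignment_scores(B):
    """Normalized alignment score of each document.

    B is a list of M rows (term alignments) by N columns (documents), with
    B[i][j] = 1 if term i of document j aligned with the reference, else 0.
    """
    M = len(B)
    N = len(B[0]) if M else 0
    # Align_i: number of documents in which term i found an alignment.
    align = [sum(row) for row in B]
    # s_j: sum over terms of W[i][j] = B[i][j] * Align_i.
    absolute = [sum(B[i][j] * align[i] for i in range(M)) for j in range(N)]
    # Assumption: the reference would align on every term, so it scores sum(Align).
    reference_score = sum(align)
    if reference_score == 0:
        return [0.0] * N
    return [s / reference_score for s in absolute]


# Three aligned terms, three documents.
B = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
]
scores = alignment_scores(B)
```

Here Align = [2, 1, 3], the absolute scores are 6, 5 and 3, and the assumed reference score is 6, so a document is rewarded both for aligning many terms and for aligning terms that many other documents also align.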

Once we have scored the documents of a cluster, we sort them and find the greatest gap between two consecutive scores (scree test). Only the best-ranked documents before this elbow value are kept as event instance driven document clusters.

Figure 3: Document scoring.
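
The elbow selection can be sketched in a few lines. This is a minimal reading of the scree test as described (largest gap between consecutive sorted scores); the function name and the tie-breaking rule (the first largest gap wins) are assumptions.

```python
def elbow_cut(scored_docs):
    """Keep the documents ranked before the largest gap between two
    consecutive scores.

    scored_docs: list of (doc_id, score) pairs.
    """
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    if len(ranked) < 2:
        return ranked
    # gaps[k] is the drop between the documents at ranks k and k + 1.
    gaps = [ranked[k][1] - ranked[k + 1][1] for k in range(len(ranked) - 1)]
    # Index of the first largest gap (assumed tie-breaking rule).
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    return ranked[:cut + 1]


docs = [("a", 0.9), ("b", 0.85), ("c", 0.8), ("d", 0.3), ("e", 0.25)]
kept = elbow_cut(docs)
```

In this example the largest drop is between "c" and "d", so only "a", "b" and "c" are kept. When scores decrease smoothly, no gap stands out and the cut point becomes arbitrary.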

3 Evaluation and Results

In our evaluation, we focus on assessing the quality of the clusters produced at the end of the alignment filtering step. We performed our experiments on the AFP and Web data for the whole year 2015. Considering that the AFP corpus sometimes develops more “France-” and “Europe-centric” content while our Web corpus is more “Anglo-Saxon-centered”, we need to ensure that we evaluate on event instances that are covered in both corpora, which is the case in the resulting outputs of the coarse-grain agglomeration phase, by construction. We therefore selected 12 of these “pre-clusters” of event instances, based on the notable events of the year 2015 as per Wikipedia³. This selection is described in Table 1. The Web articles in these intermediary outputs are sorted by descending order of their cosine similarity to the AFP reference. This ordering will serve as a baseline to evaluate the capacity of the alignment module to produce more relevant clusters, the documents processed at both steps being the same.

We ran AFP clustering and “coarse-grain” agglomeration, and identified the resulting intermediary outputs that corresponded to our 12 selected event instances (content and time-stamp wise). We then ran the alignment phase, picked the 50 best-ranked Web articles in each cluster obtained from the selected outputs and tagged them manually with a relevance attribute as follows:

• 0: The document is not related to the reference event considered;

• 1: The document has a loose relation to the reference event;

• 2: The document has a strict relation to the reference event.

³ https://en.wikipedia.org/wiki/2015

Event                                                     Date
France seizes passports of would-be jihadists             February 23rd
Protesters clash with police in St Louis, Mo., USA        August 20th
Cyclone Pam hit Vanuatu archipelago                       March 15th
Facebook vows to combat racist content on German platform September 14th
UK General Election campaign start                        March 30th
Wildfires rampage across northern California              September 14th
Magnitude 7.9 earthquake hits Nepal                       April 25th
Paris Attacks                                             November 13th
Pakistan police kill head of anti-Shiite group            July 7th
Swedish police arrest man for plotting terror attack      November 20th
ISIS Truck bombing in Baghdad market                      August 13th
Typhoon Melor causes heavy flooding in Philippines        December 16th

Table 1: The 12 events of our gold standard.

We define strict and loose relations as follows: a strict relation means that the document is focused on the event and differs from the reference news article only by its wording or additional/missing information; a loose relation designates a document that is not focused on the event, but provides news that is so specific to this event that its mention is core to the overall information provided. Examples of strict and loose relations are provided in Figure 4.

This distinction was introduced when facing two particular types of documents: death toll updates and responsibility claims for terrorist attacks. In both cases, the causal events (attack or natural disaster) are reported first, as they are news in their own right. Afterwards, death tolls and claims become stand-alone newsworthy content and are updated independently, yet remain tightly connected to their causal event.

Magnitude 7.5 earthquake hits Nepal: USGS. A powerful 7.5 magnitude earthquake struck Nepal on Saturday, the United States Geological Survey said, with strong tremors felt across the Himalayan nation and parts of India.

101 dead as 7.8 quake hits Nepal, causing big damage. A powerful earthquake struck Nepal Saturday, killing at least 71 people as the violently shaking earth, collapsed houses, leveled centuries-old temples and triggered avalanches in the Himalayas.

Nepal quake toll reaches 688: government. KATHMANDU (Reuters) - The death toll from a powerful earthquake that struck Nepal on Saturday has risen to 688, a senior home ministry official told Reuters, with 181 people killed in the capital Kathmandu.

Figure 4: Examples of strict and loose relations. The first text is from the reference news article, the second one is assessed as a “strict” relation, the third one as a “loose” relation.

We use the same metrics as described in Glavaš and Šnajder (2013): mean R-precision (R-prec.) and mean average precision (MAP) are computed over the complete ordering of all the documents in the cluster with:

$$\text{R-prec} = \frac{r}{R}$$

where $r$ is the number of relevant retrieved documents and $R$ is the total number of relevant documents to retrieve. Average Precision ($AP$) is given by:

$$AP = \frac{\sum_{k=1}^{n} P(k) \cdot rel(k)}{R}$$

where $k$ is the rank of the document, $n$ is the number of ranked documents, $P(k)$ is the precision at cut-off $k$ and $rel(k) = 1$ if the document at rank $k$ is relevant, 0 otherwise. We also compute precision, recall and F-score after applying the elbow splitting to evaluate it separately.
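
Both ranking metrics are short to compute from a ranked list of binary relevance judgments. The sketch below follows the standard definitions (R-precision counts the relevant documents among the top R ranks; AP divides by R), which are assumed to match the ones used here. The `binarize` helper is a hypothetical reading of how the 0/1/2 labels map to the strict and loose references.

```python
def binarize(labels, mode):
    """Map graded labels (0, 1, 2) to binary relevance.

    Assumption: under the strict reference only label 2 counts as relevant;
    under the loose reference labels 1 and 2 both count.
    """
    floor = 2 if mode == "strict" else 1
    return [1 if label >= floor else 0 for label in labels]


def r_precision(rels, total_relevant):
    """r / R, with r the number of relevant documents in the top R ranks."""
    if total_relevant == 0:
        return 0.0
    return sum(rels[:total_relevant]) / total_relevant


def average_precision(rels, total_relevant):
    """Sum of P(k) * rel(k) over all ranks k, divided by R."""
    if total_relevant == 0:
        return 0.0
    hits = 0
    acc = 0.0
    for k, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            acc += hits / k  # P(k) at a rank holding a relevant document
    return acc / total_relevant


# Ranked relevance judgments for five documents, three of them relevant.
rels = [1, 0, 1, 1, 0]
rp = r_precision(rels, 3)
ap = average_precision(rels, 3)
```

For this ranking, R-precision is 2/3 (two relevant documents in the top three) and AP is (1 + 2/3 + 3/4) / 3. The mean values reported in the paper would be the averages of these per-cluster figures over the 12 events.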

Our results are detailed in Table 2 by distinguishing for each reference (strict or loose) the figures with (align) and without (no align) the use of our final alignment algorithm. From that perspective, Table 2 clearly shows the interest of this last step, with an increase of both MAP (from 58.6 to 62.2 for the strict reference and from 63.7 to 66.9 for the loose one) and R-precision (from 50.2 to 60 and from 56.5 to 63.5) when the final alignment algorithm is applied. This increase is particularly noticeable for R-precision, which emphasizes the ability of this last step to rerank the Web documents in a relevant way. Unsurprisingly, the strict reference is globally more difficult than the loose one, especially for precision: as loose documents are close to strict documents, the overall system tends to select more false positives with the strict reference. Logically, the loose reference makes recall decrease, but only slightly (from 80.3 to 76.3).

                 Strict                Loose
             no align   align      no align   align
MAP            58.6     62.2         63.7     66.9
R-prec.        50.2     60           56.5     63.5
Precision       –       70.7          –       77.1
Recall          –       80.3          –       76.3
F-score         –       75.2          –       77.7

Table 2: Performance of our event instance clustering system. Average values for the 12 events.

From a qualitative perspective, we observed several phenomena. Sometimes, the journalistic coverage of an event extends well beyond the time-space context of the mentioned instance, which tends to have a negative impact on precision. For example, in our corpus, the 13 November terrorist attacks in Paris caused many official reactions worldwide as well as actions taken through social media that were covered on their own, all in a very short period of time. Moreover, the event itself might be complex in nature: while the event “Paris Attacks” can be restricted to the city of Paris on one particular night (unified time-space context), it is in fact composite, consisting of multiple attacks of different natures (shootings and bombings). For our system, this results in clusters of abnormal sizes (700+ documents clustered in this case, against a usual maximum of 100+). In such cases, the number of annotated documents in the gold standard can be too low, which is an obstacle to the correct evaluation of the output. These abnormal clusters also have another characteristic: being composed of significantly more documents, the distribution of their alignment scores tends to be smoother, making the scree test less reliable.

4 Conclusion and Perspectives

In this paper, we introduced an unsupervised pipeline aiming at producing event instance driven clusters of news articles. To do so, we leverage homogeneous high-quality news agency articles to identify event instances and find linguistic variations in their expression from Web news articles. Our experimental results validate our approach as a groundwork for future extensions in the broader task of grouping events according to their type and inducing a shared representation of each type of event by identifying and generalizing the participants of events.

5 Acknowledgment

This work has been partially funded by the French National Research Agency (ANR) under project ASRAEL (ANR-15-CE23-0018). We would like to thank the French News Agency (AFP) for providing us with the corpus.

References

Roberto J. Bayardo, Yiming Ma, and Ramakrishnan Srikant. 2007. Scaling up all pairs similarity search. In 16th International World Wide Web Conference (WWW’07). pages 131–140.

Nathanael Chambers and Dan Jurafsky. 2011. Template-based information extraction without the templates. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011). Portland, Oregon, USA, pages 976–986.

Jackie Chi Kit Cheung, Hoifung Poon, and Lucy Vanderwenden. 2013. Probabilistic frame induction. In Proceedings of NAACL-HLT 2013. Atlanta, Georgia, USA, pages 837–846.

Agata Cybulska and Piek Vossen. 2014. Using a Sledgehammer to Crack a Nut? Lexical Diversity and Event Coreference Resolution. In Ninth International Conference on Language Resources and Evaluation (LREC’14). Reykjavik, Iceland.

Elena Filatova, Vasileios Hatzivassiloglou, and Kathleen McKeown. 2006. Automatic creation of domain templates. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions. pages 207–214.

Goran Glavaš and Jan Šnajder. 2013. Recognizing identical events with graph kernels. In 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013). Sofia, Bulgaria, pages 797–803.

Lifu Huang, Taylor Cassidy, Feng Xiaocheng, Heng Ji, Clare R. Voss, Jiawei Han, and Avirup Sil. 2016. Liberal event extraction and event schema induction. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany, pages 258–268.

Teruko Mitamura, Yukari Yamakawa, Susan Holm, Zhiyi Song, Ann Bies, Seth Kulick, and Stephanie Strassel. 2015. Event Nugget Annotation: Processes and Issues. In 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation. Denver, Colorado, pages 66–76.

Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016. Joint event extraction via recurrent neural networks. In Proceedings of NAACL-HLT 2016. San Diego, California, USA, pages 300–309.

Md Arafat Sultan, Steven Bethard, and Tamara Sumner. 2014. Back to Basics for Monolingual Alignment: Exploiting Word Similarity and Contextual Evidence. Transactions of the Association for Computational Linguistics (TACL) 2:219–230.

Stijn van Dongen. 2000. Graph Clustering by Flow Simulation. Ph.D. thesis, University of Utrecht.

Willem Robert van Hage, Véronique Malaisé, Roxane Segers, Laura Hollink, and Guus Schreiber. 2011. Design and use of the Simple Event Model (SEM). Web Semantics: Science, Services and Agents on the World Wide Web 9(2):128–136.

A Survey on Event Extraction for Natural Language Understanding: Riding the Biomedical Literature Wave

Article

Full-text available

Nov 2021

Motivation: The scientific literature embeds an enormous amount of relational knowledge, encompassing interactions between biomedical entities, like proteins, drugs, and symptoms. To cope with the ever-increasing number of publications, researchers are experiencing a surge of interest in extracting valuable, structured, concise, and unambiguous information from plain texts. With the development of deep learning, the granularity of information extraction is evolving from entities and pairwise relations to events. Events can model complex interactions involving multiple participants having a specific semantic role, also handling nested and overlapping definitions. After being studied for years, automatic event extraction is on the road to significantly impact biology in a wide range of applications, from knowledge base enrichment to the formulation of new research hypotheses. Results: This paper provides a comprehensive and up-to-date survey on the link between event extraction and natural language understanding, focusing on the biomedical domain. First, we establish a flexible event definition, summarizing the terminological efforts conducted in various areas. Second, we present the event extraction task, the related challenges, and the available annotated corpora. Third, we deeply explore the most representative methods and present an analysis of the current state-of-the-art, accompanied by performance discussion. To help researchers navigate the avalanche of event extraction works, we provide a detailed taxonomy for classifying the contributions proposed by the community. Fourth, we compare solutions applied in biomedicine with those evaluated in other domains, identifying research opportunities and providing insights for strategies not yet explored. Finally, we discuss applications and our envisions about future perspectives, moving the needle on explainability and knowledge injection.

Task as Context: A Sensemaking Perspective on Annotating Inter-Dependent Event Attributes with Non-Experts

Article

Nov 2023

This paper explores the application of sensemaking theory to support non-expert crowds in intricate data annotation tasks. We investigate the influence of procedural context and data context on the annotation quality of novice crowds, defining procedural context as completing multiple related annotation tasks on the same data point, and data context as annotating multiple data points with semantic relevance. We conducted a controlled experiment involving 140 non-expert crowd workers, who generated 1400 event annotations across various procedural and data context levels. Assessments of annotations demonstrate that high procedural context positively impacts annotation quality, although this effect diminishes with lower data context. Notably, assigning multiple related tasks to novice annotators yields comparable quality to expert annotations, without costing additional time or effort. We discuss the trade-offs associated with procedural and data contexts and draw design implications for engaging non-experts in crowdsourcing complex annotation tasks.

General fine-grained event detection based on fusion of multi-information representation and attention mechanism

Article

Full-text available

Jun 2023

Event extraction is an important field in information extraction, which aims to extract key information from unstructured text automatically. Event extraction is mainly divided into trigger identification and classification. The existing models are deficient in sentence representation in the initial word embeddings training process, which makes it difficult to capture the deep bidirectional representation and can’t handle the semantic information of the context well, thus affecting the performance of event detection. In this paper, a model BMRMC (BERT + Mean pooling layer + Relative position in multi-head attention + CRF) based on multi-information representation and attention mechanism is proposed. Firstly, the BERT pre-training model based on a bidirectional training transformer is used to embed words and extract word-level features. Then the sentence-level semantic representation is fused by mean pooling layer. In addition, relative position is combined with multi-head attention, which can strengthen the connection of contents. Finally, the sequences are labeled by CRF based on the BIO-labeling mechanism. The experimental results show that the proposed model BMRMC improves the performance of event detection, and the F value on the MAVEN dataset is 67.74%, which achieves state-of-the-art performance in the general fine-grained event detection task.

Unsupervised Key Event Detection from Massive Text Corpora

Preprint

Full-text available

Jun 2022

Automated event detection from news corpora is a crucial task towards mining fast-evolving structured knowledge. As real-world events have different granularities, from the top-level themes to key events and then to event mentions corresponding to concrete actions, there are generally two lines of research: (1) theme detection identifies from a news corpus major themes (e.g., "2019 Hong Kong Protests" vs. "2020 U.S. Presidential Election") that have very distinct semantics; and (2) action extraction extracts from one document mention-level actions (e.g., "the police hit the left arm of the protester") that are too fine-grained for comprehending the event. In this paper, we propose a new task, key event detection at the intermediate level, aiming to detect from a news corpus key events (e.g., "HK Airport Protest on Aug. 12-14"), each happening at a particular time/location and focusing on the same topic. This task can bridge event understanding and structuring and is inherently challenging because of the thematic and temporal closeness of key events and the scarcity of labeled data due to the fast-evolving nature of news articles. To address these challenges, we develop an unsupervised key event detection framework, EvMine, that (1) extracts temporally frequent peak phrases using a novel ttf-itf score, (2) merges peak phrases into event-indicative feature sets by detecting communities from our designed peak phrase graph that captures document co-occurrences, semantic similarities, and temporal closeness signals, and (3) iteratively retrieves documents related to each key event by training a classifier with automatically generated pseudo labels from the event-indicative feature sets and refining the detected key events using the retrieved documents. Extensive experiments and case studies show EvMine outperforms all the baseline methods and its ablations on two real-world news corpora.

What is Event Knowledge Graph: A Survey

Preprint

Full-text available

Dec 2021

Besides entity-centric knowledge, usually organized as Knowledge Graph (KG), events are also an essential kind of knowledge in the world, which trigger the spring up of event-centric knowledge representation form like Event KG (EKG). It plays an increasingly important role in many machine learning and artificial intelligence applications, such as intelligent search, question-answering, recommendation, and text generation. This paper provides a comprehensive survey of EKG from history, ontology, instance, and application views. Specifically, to characterize EKG thoroughly, we focus on its history, definitions, schema induction, acquisition, related representative graphs/systems, and applications. The development processes and trends are studied therein. We further summarize perspective directions to facilitate future research on EKG.

Topic Detection and Tracking with Time-Aware Document Embeddings

Preprint

Dec 2021

The time at which a message is communicated is a vital piece of metadata in many real-world natural language processing tasks such as Topic Detection and Tracking (TDT). TDT systems aim to cluster a corpus of news articles by event, and in that context, stories that describe the same event are likely to have been written at around the same time. Prior work on time modeling for TDT takes this into account, but does not well capture how time interacts with the semantic nature of the event. For example, stories about a tropical storm are likely to be written within a short time interval, while stories about a movie release may appear over weeks or months. In our work, we design a neural method that fuses temporal and textual information into a single representation of news documents for event detection. We fine-tune these time-aware document embeddings with a triplet loss architecture, integrate the model into downstream TDT systems, and evaluate the systems on two benchmark TDT data sets in English. In the retrospective setting, we apply clustering algorithms to the time-aware embeddings and show substantial improvements over baselines on the News2013 data set. In the online streaming setting, we add our document encoder to an existing state-of-the-art TDT pipeline and demonstrate that it can benefit the overall performance. We conduct ablation studies on the time representation and fusion algorithm strategies, showing that our proposed model outperforms alternative strategies. Finally, we probe the model to examine how it handles recurring events more effectively than previous TDT systems.

Detecting and Classifying Typhoon Information from Chinese News Based on a Neural Network Model

Article

Jun 2021

Typhoons are major natural disasters in China. Much typhoon information is contained in a large number of network media resources, such as news reports and volunteered geographic information (VGI) data, and these are the implicit data sources for typhoon research. However, two problems arise when using typhoon information from Chinese news reports. First, because the Chinese language lacks natural delimiters, word segmentation errors result in trigger mismatches, and the polysemy of Chinese affects the classification of triggers. Second, there is no authoritative classification system for typhoon events. This paper defines a classification system for typhoon events, and then uses the system in a neural network model, lattice-structured bidirectional long short-term memory with a conditional random field (BiLSTM-CRF), to detect these events in Chinese online news. A typhoon dataset is created using texts from the China Weather Typhoon Network. Three other datasets are generated from general Chinese web pages. Experiments on these four datasets show that the model can tackle the problems mentioned above and accurately detect typhoon events in Chinese news reports.

What is Event Knowledge Graph: A Survey

Article

Jan 2022

Besides entity-centric knowledge, usually organized as a Knowledge Graph (KG), events are also an essential kind of knowledge in the world, which has prompted the emergence of event-centric knowledge representations such as the Event KG (EKG). The EKG plays an increasingly important role in many downstream applications, such as search, question-answering, recommendation, financial quantitative investments, and text generation. This paper provides a comprehensive survey of EKG from history, ontology, instance, and application views. Specifically, to characterize EKG thoroughly, we focus on its history, definitions, schema induction, acquisition, related representative graphs/systems, and applications. The development processes and trends are studied therein. We further summarize prospective directions to facilitate future research on EKG.

Design Event Extraction Model from Amharic Texts Using Deep Learning Approach

Chapter

Jan 2022

Every day, a massive amount of information is reported in the form of video, audio, or text through various media such as television, radio, social media, and web blogs. As the number of unstructured documents on those media has grown, finding relevant information has become more difficult. As a result, extracting relevant events from large amounts of unstructured text data is essential. We propose an event extraction model that aims to detect, classify, and extract various types of events along with their arguments from Amharic text documents. We first identify Amharic language-specific issues and then propose a Bidirectional Long Short-Term Memory (BiLSTM) network with a Word2vec model to detect and classify Amharic events from unstructured documents. For this study, 9,050 Amharic documents were used for event detection and extraction. In addition to event detection and classification, the model also extracts event arguments that contain additional information about events, such as Time and Place. The experimental results show that the BiLSTM approach with Word2vec word embeddings gives promising results for Amharic event detection and event classification, with 94% and 89% accuracy, respectively.

Identifying Events from Streams of RDF-Graphs Representing News and Social Media Messages

Conference Paper

Jul 2021

Marc Gallofré Ocaña

Identifying news events and relating current news to past events or already identified ones is an open challenge for news agencies. In this paper, I propose a study to identify events from semantic RDF graph representations of real-time and big data streams of news and pre-news. The proposed solution must provide acceptable accuracy over time and consider the requirements of incremental clustering, big data and real-time streams. To design a solution for identifying events, I want to study which clustering approaches are best for this purpose including methods for clustering RDF graphs using machine learning and “classical” algorithmic approaches. I also present three different evaluation approaches.

Back to Basics for Monolingual Alignment: Exploiting Word Similarity and Contextual Evidence

Article

Dec 2014

We present a simple, easy-to-replicate monolingual aligner that demonstrates state-of-the-art performance while relying on almost no supervision and a very small number of external resources. Based on the hypothesis that words with similar meanings represent potential pairs for alignment if located in similar contexts, we propose a system that operates by finding such pairs. In two intrinsic evaluations on alignment test data, our system achieves F1 scores of 88–92%, demonstrating 1–3% absolute improvement over the previous best system. Moreover, in two extrinsic evaluations our aligner outperforms existing aligners, and even a naive application of the aligner approaches state-of-the-art performance in each extrinsic task.

Liberal Event Extraction and Event Schema Induction

Conference Paper

Jan 2016

Joint Event Extraction via Recurrent Neural Networks

Conference Paper

Jan 2016

Event Nugget Annotation: Processes and Issues

Conference Paper

Jan 2015

This paper describes the processes and issues of annotating event nuggets based on DEFT ERE Annotation Guidelines v1.3 and TAC KBP Event Detection Annotation Guidelines 1.7. Using Brat Rapid Annotation Tool (brat), newswire and discussion forum documents were annotated. One of the challenges arising from human annotation of documents is annotators’ disagreement about the way of tagging events. We propose using Event Nuggets to help meet the definitions of the specific type/subtypes which are part of this project. We present case studies of several examples of event annotation issues, including discontinuous multi-word events representing single events. Annotation statistics and consistency analysis are provided to characterize the interannotator agreement, considering single term events and multi-word events which are both continuous and discontinuous. Consistency analysis is conducted using a scorer to compare first pass annotated files against adjudicated files.

Recognizing Identical Events with Graph Kernels

Conference Paper

Aug 2013

Design and use of the Simple Event Model (SEM)

Article

Jul 2011
J WEB SEMANT

Events have become central elements in the representation of data from domains such as history, cultural heritage, multimedia and geography. The Simple Event Model (SEM) is created to model events in these various domains, without making assumptions about the domain-specific vocabularies used. SEM is designed with a minimum of semantic commitment to guarantee maximal interoperability. In this paper, we discuss the general requirements of an event model for Web data and give examples from two use cases: historic events and events in the maritime safety and security domain. The advantages and disadvantages of several existing event models are discussed in the context of the historic example. We discuss the design decisions underlying SEM. SEM is coupled with a Prolog API that enables users to create instances of events without going into the details of the implementation of the model. By a tight coupling to existing Prolog packages, the API facilitates easy integration of event instances to Linked Open Data. We illustrate use of the API with examples from the maritime domain.

Automatic Creation of Domain Templates.

Conference Paper

Jan 2006

Recently, many Natural Language Processing (NLP) applications have improved the quality of their output by using various machine learning techniques to mine Information Extraction (IE) patterns for capturing information from the input text. Currently, to mine IE patterns one should know in advance the type of the information that should be captured by these patterns. In this work we propose a novel methodology for corpus analysis based on cross-examination of several document collections representing different instances of the same domain. We show that this methodology can be used for automatic domain template creation. As the problem of automatic domain template creation is rather new, there is no well-defined procedure for the evaluation of the domain template quality. Thus, we propose a methodology for identifying what information should be present in the template. Using this information we evaluate the automatically created domain templates through the text snippets retrieved according to the created templates.

Template-Based Information Extraction without the Templates.

Conference Paper

Jan 2011

Standard algorithms for template-based information extraction (IE) require predefined template schemas, and often labeled data, to learn to extract their slot fillers (e.g., an embassy is the Target of a Bombing template). This paper describes an approach to template-based IE that removes this requirement and performs extraction without knowing the template structure in advance. Our algorithm instead learns the template structure automatically from raw text, inducing template schemas as sets of linked events (e.g., bombings include detonate, set off, and destroy events) associated with semantic roles. We also solve the standard IE task, using the induced syntactic patterns to extract role fillers from specific documents. We evaluate on the MUC-4 terrorism dataset and show that we induce template structure very similar to hand-created gold structure, and we extract role fillers with an F1 score of .40, approaching the performance of algorithms that require full knowledge of the templates.

Scaling up all pairs similarity search

Conference Paper

May 2007

Given a large collection of sparse vector data in a high dimensional space, we investigate the problem of finding all pairs of vectors whose similarity score (as determined by a function such as cosine distance) is above a given threshold. We propose a simple algorithm based on novel indexing and optimization strategies that solves this problem without relying on approximation methods or extensive parameter tuning. We show the approach efficiently handles a variety of datasets across a wide setting of similarity thresholds, with large speedups over previous state-of-the-art approaches.
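To make the problem statement in this abstract concrete, here is a minimal brute-force sketch in Python. It is not the indexing algorithm the paper proposes; it only illustrates the task (report every pair of sparse vectors whose cosine similarity reaches a threshold). The function names and the dictionary-based sparse representation are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight} dicts."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def all_pairs(vectors, threshold):
    """Naive O(n^2) baseline: compare every pair and keep those at or above the threshold."""
    results = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            score = cosine(vectors[i], vectors[j])
            if score >= threshold:
                results.append((i, j, score))
    return results

# Hypothetical toy documents as bag-of-words vectors.
docs = [
    {"protest": 1.0, "airport": 1.0},
    {"protest": 1.0, "airport": 1.0},
    {"election": 1.0},
]
pairs = all_pairs(docs, threshold=0.9)
```

The quadratic loop is exactly what exact indexing strategies aim to avoid on large collections; the sketch is useful only as a correctness reference on small inputs.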

Graph Clustering by Flow Simulation

Article

May 2000

Stijn Marinus van Dongen

This thesis is concerned with the clustering of graphs by means of flow simulation, a problem that in its generality belongs to the field of cluster analysis. In this branch of science one designs and studies methods that, given certain data, generate a subdivision into groups, the aim being to find a subdivision that is natural. That is, different data elements in the same group ideally resemble each other closely, and data elements from different groups ideally differ greatly from one another. Sometimes such groups are entirely absent; then there is little pattern to be recognized in the data. The idea is that the presence of natural groups makes it possible to categorize the data. An example is the clustering of data (about symptoms or bodily characteristics) of patients suffering from the same disease. If there are clear groups in these data, this can lead to additional insight into the disease. Cluster analysis can thus be used for exploratory research. Further examples come from chemistry, taxonomy, psychiatry, archaeology, market research and many other disciplines. Taxonomy, the study of the classification of organisms, has a rich history beginning with Aristotle and culminating in the works of Linnaeus. In fact, cluster analysis can be seen as the result of an increasingly systematic and abstract study of the various methods designed in different application areas, in which method is separated both from data and application area and from the manner of computation.

In cluster analysis roughly two directions can be distinguished, according to the type of data to be classified. The data elements in the example above are described by vectors (lists of scores or measurements), and the difference between two elements is determined by the difference of the vectors. This dissertation concerns cluster analysis applied to data of the type 'graph'. Examples come from pattern recognition, computer-aided design, databases equipped with hyperlinks, and the World Wide Web. In all these cases there are 'points' that are either connected or not. A set of points together with their connections is called a graph. A good clustering of a graph divides the points into groups such that few connections run between (points from) different groups and there are many connections within each group. The first part of the dissertation, consisting of chapters 2 and 3, deals with the position of cluster analysis in general and the position of graph clustering within cluster analysis in particular, as well as the relation of graph clustering to the related problem of graph partitioning. In the clustering problem one seeks a 'natural' subdivision into groups, and the number and size of the groups are not prescribed. In the partitioning problem the number and sizes are prescribed, and one seeks, given these restrictions, an assignment of the elements to the groups such that there is a minimal number of connections between the groups.

The dissertation further describes the theory, implementation and abstract testing of a powerful new cluster algorithm for graphs called the Markov Cluster algorithm or MCL algorithm. The algorithm makes use of (and is in fact no more than a shell around) an algebraic process (called the MCL process) defined for Markov graphs, i.e. graphs for which the associated matrix is stochastic. In this process the initial graph is successively transformed by alternation of the two operators expansion and inflation. Expansion is taking the power of a matrix according to the classical matrix product. Stochastically speaking, this means computing the transition probabilities belonging to a multi-step relation. Inflation coincides with taking the power of a matrix according to the entrywise Hadamard-Schur product, followed by a column-wise rescaling such that the final result is again a (column) stochastic matrix. This is an unusual operator in the world of stochastics; its introduction is entirely motivated by its intended effect on graphs in which cluster structure is present. It is to be expected that multi-step relations corresponding to pairs of points lying within a natural cluster will have larger transition probabilities than pairs whose points lie in different clusters. The inflation operator favours multi-step relations with a large associated probability and disfavours multi-step relations with a small associated probability. The expectation is thus that the MCL process will create and perpetuate multi-step relations belonging to relations lying within one cluster, and that it will decimate all multi-step relations belonging to relations between different clusters. This indeed turns out to be the case. The MCL process generally converges to an idempotent matrix that is very sparse and consists of several components. The components are interpreted as a clustering of the initial graph. Because the inflation operator is parametrized, clusterings at different levels of granularity can be discovered. The MCL algorithm consists first of a transformation step from a given graph to a stochastic initial graph, using the standard concept of a random walk on a graph. Second, it requires the specification of two sequences of values that define the successive expansion and inflation parametrizations. Finally, the algorithm computes the associated process and interprets the resulting limit.

The idea of using random walks to discover cluster structure is not new, but the manner of execution is. The idea is introduced as the 'graph cluster paradigm' in chapter 5, followed by some combinatorial proposals for clustering graphs. It is shown that there is a connection between the combinatorial and probabilistic cluster methods, and that an important distinction is the localization step that probabilistic methods generally introduce. The chapter concludes with an example of an MCL process and the formal definition of both process and algorithm. Notations and definitions have by then already been introduced in chapter 4. In chapter 6 the interpretation function from idempotent matrices to clusterings is formalized, simple properties of the inflation operator are described, and the stability of MCL limits and the associated clusterings is analysed. The phenomenon of overlapping clusters is in principle possible (the overlap observed so far always corresponded to a graph automorphism mapping the overlapping part of the clusters onto itself) and forms an intrinsic part of the interpretation function, but turns out to be unstable. Chapter 7 introduces the classes of diagonally symmetric and diagonally positive semi-definite matrices (matrices that are diagonally similar to a symmetric, respectively positive semi-definite, matrix). Both classes are mapped into themselves by both expansion and inflation (for diagonally positive semi-definite matrices this holds for only part of the parametrization space of the inflation operator). It is shown that diagonally positive semi-definite matrices contain structure that generalizes the interpretation function from idempotent matrices to clusterings. From this follows a more precise characterization of the inflationary effect of the inflation operator on the spectrum of the argument matrix. Decoupling aspects of graphs and matrices are always closely connected with characteristics of the associated spectra. Chapter 8 describes a number of known results underlying the most commonly used techniques for graph partitioning. Chapters 4 through 8 form the second part of the dissertation.

The third part reports on experiments with the MCL algorithm. Chapter 9 is theoretical in nature and introduces functions that can be used as a measure of the quality of a graph clustering. Among other things, a generic measure is derived that expresses how well a characteristic vector represents the mass of another (non-negative) vector. Elementwise or column-wise application of the measure gives an expression for the extent to which a clustering represents the mass of a weighted graph or matrix. A metric on the space of clusterings or partitions is also derived, which is used to test the continuity properties and the discriminating power of the MCL algorithm in chapter 12. Chapter 10 reports on experiments on small symmetric graphs with well-defined density characteristics, such as grid-like graphs. The MCL algorithm turns out experimentally to have strong separating power. Experiments with neighbourhood graphs (grid-like graphs defined on points in Euclidean space) show that the algorithm is not suitable if the diameter of the natural clusters is large. This phenomenon can be understood in terms of the (stochastic) flow properties of the algorithm.

Chapter 11 addresses the scalability of the algorithm. It is crucial that the limit of the MCL process is generally very sparse and that the iterands of the process are sparse in a weighted interpretation of the notion of sparsity. That is, the inflation operator ensures that most new nonzero elements (corresponding to multi-step relations) remain very small and eventually vanish again. This is all the more true as the diameter of the natural clusters is smaller and the connectivity of the whole graph is lower. This suggests that during each expansion step that causes the matrix to fill up, the columns of the newly computed matrix can be pruned by simply taking the k largest elements of a newly computed (stochastic) column and rescaling these elements to sum to 1, where k depends on the available computing capacity. Because computing the k largest values of a vector cannot in principle be done in linear time, it turns out to be necessary in practice to use a more refined scheme in which the vector is first pruned by means of thresholds that depend on homogeneity properties of the vector. This leads in principle to a complexity of order O(Nk^2), where N is the dimension of the matrix. Chapter 12 reports on experiments on test graphs with ten thousand points whose connections have been generated (randomly) in such a way that an a priori best clustering is known. These graphs have natural clusters with small diameter but as a whole have high to very high connectivity. The scaled MCL algorithm turns out to generate very good clusterings that lie close to the a priori known clustering. The parameter k can be chosen low, but the performance of the algorithm decreases more strongly as k is lower and the total connectivity of the input graph is higher. The appendix 'A cluster miscellany', beginning on page 149, is written for a general audience and contains short expositions on various aspects of cluster analysis, such as the history of the field and the role of the computer.
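The alternation of expansion and inflation described in this summary can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names, the default parameters (expansion power 2, inflation exponent 2.0), the added self-loops, the convergence tolerance, and the simplified way clusters are read off the limit matrix (grouping the nonzero columns of each nonzero row) are all choices made here for illustration. The pruning scheme discussed for scalability is omitted.

```python
def normalize_columns(m):
    """Rescale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        s = sum(m[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = m[i][j] / s if s else 0.0
    return out

def expand(m, power=2):
    """Expansion: ordinary matrix power (multi-step transition probabilities)."""
    n = len(m)
    result = m
    for _ in range(power - 1):
        result = [[sum(result[i][k] * m[k][j] for k in range(n))
                   for j in range(n)] for i in range(n)]
    return result

def inflate(m, r=2.0):
    """Inflation: entrywise (Hadamard) power, then column-wise rescaling."""
    return normalize_columns([[x ** r for x in row] for row in m])

def mcl(adjacency, expansion=2, inflation=2.0, iterations=50, tol=1e-9):
    """Run the expansion/inflation alternation and read clusters off the limit."""
    n = len(adjacency)
    # Assumption: add self-loops before building the random-walk matrix.
    m = normalize_columns([[adjacency[i][j] + (1.0 if i == j else 0.0)
                            for j in range(n)] for i in range(n)])
    for _ in range(iterations):
        nxt = inflate(expand(m, expansion), inflation)
        delta = max(abs(nxt[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Simplified interpretation: each nonzero row names one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)

# Hypothetical toy graph: two disconnected triangles (nodes 0-2 and 3-5).
two_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(mcl(two_triangles))
```

Raising the `inflation` exponent generally yields finer-grained clusterings, which corresponds to the parametrized granularity mentioned in the summary.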
