
Unsupervised Event Clustering and Aggregation from Newswire and Web Articles

Proceedings of the 2017 EMNLP Workshop on Natural Language Processing meets Journalism, pages 62–67, Copenhagen, Denmark, September 7, 2017. © 2017 Association for Computational Linguistics
Swen Ribeiro
LIMSI, CNRS, Univ. Paris-Sud, Université Paris-Saclay
swen.ribeiro@limsi.fr

Olivier Ferret
CEA, LIST, Gif-sur-Yvette, F-91191 France
olivier.ferret@cea.fr

Xavier Tannier
LIMSI, CNRS, Univ. Paris-Sud, Université Paris-Saclay
xavier.tannier@limsi.fr
Abstract

In this paper, we present an unsupervised pipeline approach for clustering news articles based on identified event instances in their content. We leverage press agency newswire and monolingual word alignment techniques to build meaningful and linguistically varied clusters of articles from the Web, in the perspective of a broader event type detection task. We validate our approach on a manually annotated corpus of Web articles.
1 Introduction
In the context of news production, an event is the characterization of a change in a space-time context significant enough to be reported as newsworthy content. This definition fits with definitions proposed in other contexts such as the ACE 2005 and TAC KBP Event evaluations or work such as (Cybulska and Vossen, 2014; Mitamura et al., 2015), which generally view each event as "something that happens at a particular place and time", implying changes in the state of the world and involving participants. In accordance with ontologies about events such as the Simple Event Model (SEM) ontology (van Hage et al., 2011), events can be categorized into different types, for example "elections" or "earthquakes", each gathering multiple real-life instances, for example the "2017 UK General Election" or the "2012 French Presidential Election". These instances are reported by journalists through varying textual mentions.

Event extraction is a challenging task that has received increasing interest in recent years under many formulations, such as event identification or event detection. It is also an important subtask of larger NLP applications such as document summarization and event schema induction. Several approaches have been used to tackle the different aspects of this task, particularly in an unsupervised fashion, from linguistic pipelines (Filatova et al., 2006; Huang et al., 2016) to topic modeling approaches (Chambers and Jurafsky, 2011; Cheung et al., 2013) and, more recently, neural networks (Nguyen et al., 2016). While the definition and granularity of an event vary with the task and objectives at hand, most event identification systems exploit mentions to produce type-level representations.
We propose to address the unsupervised event extraction task through two subtasks: first, unsupervised event instance extraction, and second, event type extraction. This paper focuses on our efforts regarding the first step, i.e. unsupervised event instance extraction. In this perspective, we present a method based on clustering algorithms leveraging news data from different sources. We believe that this first step can act as a bridge between the surface forms that are mentions and the more abstract concepts of event instances and event types. Moreover, the context of this work is the ASRAEL project, which aims at providing operational tools for journalists, and this instance/type segmentation seems relevant in the perspective of further event-driven processing developments.

Our clustering approach considers three dimensions: time, space and content. A content alignment system is adapted from Sultan et al. (2014), and a time- and space-aware similarity function is proposed in order to aggregate articles about the same event.
We work with a large collection of English newswire and Web articles, where each article describes an event: the main topic of the article is a specific event, and older events are mentioned only to put it into perspective. Thus, we associate one event with each article.

Figure 1: Overview of the system.

Our system's objective is to build clusters of articles describing the same exact real-life event, i.e. the same event instance. We adopt two definitions of the relation "same event" (strict and loose) and evaluate through both definitions.
2 Two-step Clustering
Our approach is structured as a pipeline including a two-step clustering with an additional filtering step at the end. The first step leverages a homogeneous corpus of news articles to build focused and "clean" clusters corresponding to event instances. The second step exploits these focused clusters to cluster documents coming from the Web, which are noisier but also more likely to bring new information about the considered events. Figure 1 illustrates this pipeline.
2.1 Corpora
The first clustering step (represented in blue in Figure 1) is performed on a corpus from the Agence France-Presse (AFP) news agency. Each news article comes with several metadata fields providing additional information about its time-space context of creation, such as its UTC time-stamp, and about its content, through International Press Telecommunications Council (IPTC) NewsCodes. NewsCodes are a standard subject taxonomy created and maintained by the IPTC, with a focus on text.
From the 1,400+ existing NewsCodes, we selected 72 that can be viewed as event types¹, covering as many event types as possible without overlapping with one another, and retrieved all news articles tagged with at least one of these NewsCodes. This resulted in a corpus of about 52,000 documents for the year 2015.

¹A user-friendly tree visualization of all the NewsCodes is available at http://show.newscodes.org/index.html?newscodes=subj.
The second clustering step (in orange in Figure 1) takes as input news articles crawled from a list of Web news feeds in English. We used a corpus of 1.3 million Web news articles published in 2015, from about 20 different Web news sites (3,700 documents/day on average), including the RSS feeds of the New York Times, the BBC and the Wall Street Journal.
In both corpora, we process only the title and first paragraph (usually one or two sentences) of the documents, under the assumption that they follow the journalistic rule of the 5Ws: the lead of an article must provide information about what, when, where, who and why.
2.2 Approach
2.2.1 Press Agency Clustering
The first clustering step computes the similarity matrix of the AFP news by means of the All Pairs Similarity Search (APSS) algorithm (Bayardo et al., 2007) and applies to it the Markov Clustering (MCL) algorithm (van Dongen, 2000). News items are represented as bags of words including the lemmatized forms of their nouns, adjectives and verbs.
The similarity function between two documents $d_1$ and $d_2$ is the following:

$$\mathrm{sim}(d_1, d_2) = \frac{\cos(d_1, d_2)}{e^{\delta/24}}$$

where $\cos(d_1, d_2)$ is the cosine similarity and $\delta$ is the difference between the documents' creation times (in hours). This time decay ensures that two similar but different events, occurring at different moments, will not be grouped together. Only similarities above a threshold $\tau$ have been considered².
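For concreteness, here is a minimal sketch of this thresholded, time-decayed similarity in Python. The helper names and the use of scikit-learn are our assumptions; the paper itself relies on APSS for the all-pairs computation and on MCL for the clustering of the resulting matrix.

```python
"""Sketch of the time-decayed similarity of the first clustering step.

Assumption: documents arrive as already-lemmatized strings of
nouns/adjectives/verbs, each with a creation datetime.
"""
import math
from datetime import datetime

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

TAU = 0.5  # similarity threshold found by grid search in the paper


def similarity_matrix(texts: list[str], times: list[datetime]) -> np.ndarray:
    """Pairwise sim(d1, d2) = cos(d1, d2) / exp(delta / 24)."""
    bow = CountVectorizer().fit_transform(texts)  # bag-of-words vectors
    cos = cosine_similarity(bow)                  # pairwise cosine matrix
    sim = np.zeros_like(cos)
    for i in range(len(texts)):
        for j in range(len(texts)):
            delta = abs((times[i] - times[j]).total_seconds()) / 3600.0
            s = cos[i, j] / math.exp(delta / 24.0)  # decay in ~days
            sim[i, j] = s if s >= TAU else 0.0      # keep only sims >= tau
    return sim  # this matrix would then be fed to Markov Clustering (MCL)
```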
This first step yields small, instance-focused clusters of press agency news articles only. While their content can be considered high quality, they are quite homogeneous and lack variety in their wording, so they could not be used directly for broader tasks such as event type-level detection. An example of output for this step is provided in Figure 2.

²A grid search led to τ = 0.5.
Hundreds dead in Nepal quake, avalanche triggered on Everest. A massive 7.8 magnitude earthquake killed hundreds of people Saturday as it ripped through large parts of Nepal, toppling office blocks and towers in Kathmandu and triggering an avalanche that hit Everest base camp.

Nepal quake kills 1,200, sparks deadly Everest avalanche. A massive earthquake killed more than 1,200 people Saturday as it tore through large parts of Nepal, toppling office blocks and towers in Kathmandu and triggering a deadly avalanche at Everest base camp.

Hundreds dead in Nepal quake, deadly avalanche on Everest. A massive 7.8 magnitude earthquake killed more than 900 people Saturday as it ripped through large parts of Nepal, toppling office blocks and towers in Kathmandu and triggering a deadly avalanche that hit Everest base camp.

Figure 2: 3 of 5 AFP news articles clustered together. While they indeed cover the same event instance, there are few wording variations between them, limiting their interest for broader event detection and similar tasks.
2.2.2 Web Article Extension
In this step, we aim to alleviate the lack of variability of our AFP news article clusters by leveraging their high focus to aggregate Web documents about the same event instances.

To do so, we identify the first article published in each AFP cluster (using the time-stamp) and retrieve all Web articles published in the next 24 hours. This is based on the assumption that press agencies are a primary source of trustworthy information for most news feeds, so it would be rare to find mentions of an event instance before an agency article was released, especially in an international context. We call this first article the "reference".
We first perform a "coarse-grain" agglomeration by applying low-threshold cosine-similarity clustering between the AFP reference and all Web articles in the given 24-hour timespan. This results in smaller subsets of data to feed the next module in the pipeline, as sketched below.
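A minimal sketch of this coarse-grain filter follows, under stated assumptions: `reference` is the earliest AFP article of a cluster, `web_articles` is a list of (text, datetime) pairs, and the value of `LOW_THRESHOLD` is illustrative, since the paper only says "low-threshold".

```python
"""Sketch of the coarse-grain agglomeration step (assumed helpers)."""
from datetime import datetime, timedelta

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

LOW_THRESHOLD = 0.2  # illustrative value, not from the paper


def coarse_grain(reference: tuple[str, datetime],
                 web_articles: list[tuple[str, datetime]]) -> list[str]:
    ref_text, ref_time = reference
    # keep only Web articles published within 24 hours of the reference
    window = [text for text, ts in web_articles
              if ref_time <= ts <= ref_time + timedelta(hours=24)]
    if not window:
        return []
    vec = CountVectorizer().fit(window + [ref_text])
    sims = cosine_similarity(vec.transform(window),
                             vec.transform([ref_text])).ravel()
    # loosely similar articles go on to the alignment module
    return [text for text, s in zip(window, sims) if s >= LOW_THRESHOLD]
```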
We then use the monolingual word alignment system described in Sultan et al. (2014). This system performs a word-to-word alignment between two sentences by applying a series of alignment modules, each focusing on a specific type of linguistic unit. The alignment process starts with word n-grams (with n > 2) including at least one content word. Then, named entities are considered, followed by content words and, finally, stopwords. While the alignment of word n-grams and named entities is based only on string matching (exact match for n-grams, partial for named entities, as the system uses Stanford NER to resolve acronyms and match partial mentions), the system also relies on contextual evidence for the other linguistic units, namely syntactic dependencies and textual neighborhood. Textual neighborhood is defined as the window of the 3 content words preceding and following each word being considered for an alignment. The system then computes a similarity score between each available candidate pair based on this evidence, and selects the highest-scored pair for a given word as the chosen alignment. We adapted the system to better fit our needs by extending the stopword list, first aligning unigram exact matches, and using the absence of matching content words or named entities as an early stopping condition of the alignment process.
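As an illustration of the early-stopping adaptation, here is a minimal sketch; the inputs stand in for the aligner's own preprocessing and are hypothetical, not part of the original system.

```python
"""Sketch of the early-stopping check added to the Sultan et al. (2014)
aligner: if the reference and a Web article share no content word and
no named entity, the costly full alignment pass is skipped."""


def worth_aligning(ref_words: set[str], web_words: set[str],
                   ref_entities: set[str], web_entities: set[str]) -> bool:
    # no shared content word and no shared named entity -> stop early
    return bool(ref_words & web_words) or bool(ref_entities & web_entities)
```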
For each AFP cluster, we perform alignment between the reference (earliest article) and each Web article from the subset. This allows us to build a word alignment matrix where each column contains the words of a document and each row shows how each word of the reference has been aligned across all documents.
We then compute a score for each document, taking into account how many words in a document have been aligned with the reference, and how many times a reference word has found an alignment across all documents.
Figure 3 illustrates how this score is computed. We first build the binary alignment matrix $B$ where columns represent documents and rows represent term alignments. If a term $i$ (out of $M$ aligned terms) from document $j$ (out of $N$ documents) has been aligned with a term from the reference, then $B_{i,j} = 1$, otherwise $B_{i,j} = 0$. We then compute a weight for each alignment, leading to a vector $\mathrm{Align}$ such that for each term $i$:

$$\mathrm{Align}_i = \sum_{j=1}^{N} B_{i,j}$$

The absolute alignment score of each document $j$ is then:

$$s_j = \sum_{i=1}^{M} W_{i,j}$$

where $W_{i,j} = B_{i,j} \cdot \mathrm{Align}_i$, i.e. each alignment is weighted by how often the corresponding reference term was aligned across all documents. Finally, we normalize these scores by the score that the reference itself would have obtained.
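A minimal numpy sketch of this scoring follows. The normalization step reflects our reading of the paper: aligned with itself, the reference would have a column of all ones in $B$, hence a score of $\sum_i \mathrm{Align}_i$; this interpretation is an assumption.

```python
"""Sketch of document scoring from the binary alignment matrix B
(M reference terms x N Web documents)."""
import numpy as np


def document_scores(B: np.ndarray) -> np.ndarray:
    align = B.sum(axis=1)       # Align_i: how often term i was aligned
    W = B * align[:, None]      # W_ij = B_ij * Align_i
    s = W.sum(axis=0)           # absolute score s_j of each document
    # Assumed normalization: score the reference would give itself,
    # i.e. a column of ones in B, yielding sum_i Align_i.
    ref_score = max(align.sum(), 1)  # guard against an empty matrix
    return s / ref_score
```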
Figure 3: Document scoring.

Once we have scored the documents of a cluster, we sort them and find the greatest gap between two consecutive scores (scree test). Only the best-ranked documents before this elbow value are kept as event instance driven document clusters.
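The scree-test cut can be sketched in a few lines; the function name is ours, and the procedure simply follows the description above (sort descending, cut at the largest gap between consecutive scores).

```python
"""Sketch of the elbow (scree-test) cut over document scores."""


def elbow_cut(scores: list[float]) -> list[int]:
    """Return the indices of the documents kept in the cluster."""
    if len(scores) < 2:
        return list(range(len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [scores[i] for i in order]
    gaps = [ranked[k] - ranked[k + 1] for k in range(len(ranked) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)  # largest gap
    return order[: cut + 1]  # best-ranked documents before the elbow
```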
3 Evaluation and Results
In our evaluation, we focus on assessing the quality of the clusters produced at the end of the alignment filtering step. We performed our experiments on the AFP and Web data for the whole year 2015. Considering that the AFP corpus sometimes develops more "France-" and "Europe-centric" content while our Web corpus is more "Anglo-Saxon-centered", we need to ensure that we evaluate on event instances that are covered in both corpora, which is the case, by construction, for the outputs of the coarse-grain agglomeration phase. We therefore selected 12 of these "pre-clusters" of event instances, based on the notable events of the year 2015 as per Wikipedia³. This selection is described in Table 1. The Web articles in these intermediary outputs are sorted in descending order of their cosine similarity to the AFP reference. This ordering serves as a baseline to evaluate the capacity of the alignment module to produce more relevant clusters, the documents processed at both steps being the same.
We ran the AFP clustering and "coarse-grain" agglomeration, and identified the resulting intermediary outputs that corresponded to our 12 selected event instances (content- and time-stamp-wise). We then ran the alignment phase, picked the 50 best-ranked Web articles in each cluster obtained from the selected outputs and tagged them manually with a relevance attribute as follows:
0: The document is not related to the reference event considered;

1: The document has a loose relation to the reference event;

2: The document has a strict relation to the reference event.

³https://en.wikipedia.org/wiki/2015

France seizes passports of would-be jihadists. February 23rd
Protesters clash with police in St Louis, Mo., USA. August 20th
Cyclone Pam hits Vanuatu archipelago. March 15th
Facebook vows to combat racist content on German platform. September 14th
UK General Election campaign starts. March 30th
Wildfires rampage across northern California. September 14th
Magnitude 7.9 earthquake hits Nepal. April 25th
Paris Attacks. November 13th
Pakistan police kill head of anti-Shiite group. July 7th
Swedish police arrest man for plotting terror attack. November 20th
ISIS truck bombing in Baghdad market. August 13th
Typhoon Melor causes heavy flooding in Philippines. December 16th

Table 1: The 12 events of our gold standard.
We define strict and loose relations as follows: a strict relation means that the document is focused on the event and differs from the reference news article only by its wording or by additional/missing information; a loose relation designates a document that is not focused on the event, but provides news that is so specific to this event that its mention is core to the overall information provided. Examples of strict and loose relations are provided in Figure 4.
This distinction was introduced when facing two particular types of documents: death toll updates and responsibility claims for terrorist attacks. In both cases, the causal events (attack or natural disaster) are first released as information of their own. Afterwards, death tolls and claims become stand-alone newsworthy content and are updated independently, while remaining tightly connected to their causal event.

Magnitude 7.5 earthquake hits Nepal: USGS. A powerful 7.5 magnitude earthquake struck Nepal on Saturday, the United States Geological Survey said, with strong tremors felt across the Himalayan nation and parts of India.

101 dead as 7.8 quake hits Nepal, causing big damage. A powerful earthquake struck Nepal Saturday, killing at least 71 people as the violently shaking earth collapsed houses, leveled centuries-old temples and triggered avalanches in the Himalayas.

Nepal quake toll reaches 688: government. KATHMANDU (Reuters) - The death toll from a powerful earthquake that struck Nepal on Saturday has risen to 688, a senior home ministry official told Reuters, with 181 people killed in the capital Kathmandu.

Figure 4: Examples of strict and loose relations. The first text is from the reference news article; the second one is assessed as a "strict" relation, the third one as a "loose" relation.
We use the same metrics as described in Glavaš and Šnajder (2013): mean R-precision (R-prec.) and mean average precision (MAP) are computed over the complete ordering of all the documents in the cluster with:

$$\text{R-prec} = \frac{r}{R}$$

where $r$ is the number of relevant retrieved documents and $R$ is the total number of relevant documents to retrieve. Average Precision ($AP$) is given by:

$$AP = \frac{\sum_{k=1}^{n} P(k) \cdot \mathrm{rel}(k)}{R}$$

where $k$ is the rank of the document, $P(k)$ is the precision at cut-off $k$, and $\mathrm{rel}(k) = 1$ if document $k$ is relevant, 0 otherwise. We also compute precision, recall and F-score after applying the elbow splitting, in order to evaluate it separately.
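These two metrics translate directly into code; the following sketch (our own, with illustrative function names) takes a ranked list of binary relevance judgments, e.g. the 0/1/2 annotations binarized according to the strict or loose reference.

```python
"""Sketch of R-precision and Average Precision over a ranked list,
where rel[k] is 1 if the document at rank k+1 is relevant, else 0."""


def r_precision(rel: list[int], R: int) -> float:
    return sum(rel[:R]) / R  # relevant documents among the top R


def average_precision(rel: list[int], R: int) -> float:
    hits, ap = 0, 0.0
    for k, r in enumerate(rel, start=1):
        if r:
            hits += 1
            ap += hits / k   # P(k) at each relevant rank
    return ap / R
```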
Our results are detailed in Table 2, distinguishing for each reference (strict or loose) the figures with (align) and without (no align) the use of our final alignment algorithm. From that perspective, Table 2 clearly shows the interest of this last step, with a significant increase of both MAP and R-precision when the final alignment algorithm is applied. This increase is particularly noticeable for R-precision, which emphasizes the ability of this last step to rerank the Web documents in a relevant way. Unsurprisingly, the strict reference is globally more difficult than the loose one, especially for precision: as loose documents are close to strict documents, the overall system tends to select more false positives with the strict reference. Logically, the loose reference makes recall decrease, but only very slightly.

            Strict             Loose
            no align   align   no align   align
MAP         58.6       62.2    63.7       66.9
R-prec.     50.2       60.0    56.5       63.5
Precision   –          70.7    –          77.1
Recall      –          80.3    –          76.3
F-score     –          75.2    –          77.7

Table 2: Performance of our event instance clustering system. Average values over the 12 events.
From a qualitative perspective, we observed several phenomena. Sometimes, the journalistic coverage of an event extends well beyond the time-space context of the mentioned instance, which tends to have a negative impact on precision. For example, in our corpus, the 13 November terrorist attacks in Paris caused many official reactions worldwide, as well as actions taken through social media that were covered on their own, all within a very short period of time. Moreover, the event itself may be complex in nature: while the event "Paris Attacks" can be restricted to the city of Paris on one particular night (unified time-space context), it is in fact composite, consisting of multiple attacks of different natures (shootings and bombings). For our system, this results in clusters of abnormal size (700+ documents clustered in this case, against a usual maximum of around 100). In such cases, the number of annotated documents in the gold standard can be too low, which is an obstacle to the correct evaluation of the output. These abnormal clusters also have another characteristic: being composed of significantly more documents, the distribution of their alignment scores tends to be smoother, making the scree test less reliable.
4 Conclusion and Perspectives
In this paper, we introduced an unsupervised pipeline aiming at producing event instance driven clusters of news articles. To do so, we leverage homogeneous, high-quality news agency articles to identify event instances and find linguistic variations in their expression in Web news articles. Our experimental results validate our approach as groundwork for future extensions towards the broader task of grouping events according to their type and inducing a shared representation of each type of event by identifying and generalizing the participants of events.
5 Acknowledgment
This work has been partially funded by the French National Research Agency (ANR) under the ASRAEL project (ANR-15-CE23-0018). We would like to thank the Agence France-Presse (AFP) for providing us with the corpus.
References

Roberto J. Bayardo, Yiming Ma, and Ramakrishnan Srikant. 2007. Scaling up all pairs similarity search. In 16th International World Wide Web Conference (WWW'07), pages 131–140.

Nathanael Chambers and Dan Jurafsky. 2011. Template-based information extraction without the templates. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011), Portland, Oregon, USA, pages 976–986.

Jackie Chi Kit Cheung, Hoifung Poon, and Lucy Vanderwende. 2013. Probabilistic frame induction. In Proceedings of NAACL-HLT 2013, Atlanta, Georgia, USA, pages 837–846.

Agata Cybulska and Piek Vossen. 2014. Using a Sledgehammer to Crack a Nut? Lexical Diversity and Event Coreference Resolution. In Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland.

Elena Filatova, Vasileios Hatzivassiloglou, and Kathleen McKeown. 2006. Automatic creation of domain templates. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 207–214.

Goran Glavaš and Jan Šnajder. 2013. Recognizing identical events with graph kernels. In 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, pages 797–803.

Lifu Huang, Taylor Cassidy, Xiaocheng Feng, Heng Ji, Clare R. Voss, Jiawei Han, and Avirup Sil. 2016. Liberal event extraction and event schema induction. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany, pages 258–268.

Teruko Mitamura, Yukari Yamakawa, Susan Holm, Zhiyi Song, Ann Bies, Seth Kulick, and Stephanie Strassel. 2015. Event Nugget Annotation: Processes and Issues. In 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation, Denver, Colorado, pages 66–76.

Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016. Joint event extraction via recurrent neural networks. In Proceedings of NAACL-HLT 2016, San Diego, California, USA, pages 300–309.

Md Arafat Sultan, Steven Bethard, and Tamara Sumner. 2014. Back to Basics for Monolingual Alignment: Exploiting Word Similarity and Contextual Evidence. Transactions of the Association for Computational Linguistics (TACL) 2:219–230.

Stijn van Dongen. 2000. Graph Clustering by Flow Simulation. Ph.D. thesis, University of Utrecht.

Willem Robert van Hage, Véronique Malaisé, Roxane Segers, Laura Hollink, and Guus Schreiber. 2011. Design and use of the Simple Event Model (SEM). Web Semantics: Science, Services and Agents on the World Wide Web 9(2):128–136.