TDV-based Filter for Novelty and Diversity in a Real-time
Pub/Sub System
[Extended Abstract]
Zeinab Hmedeh
University Paris X
Nanterre, France
z.hmedeh@u-paris10.fr
Cedric du Mouza
CEDRIC Lab. - CNAM
Paris, France
dumouza@cnam.fr
Nicolas Travers
CEDRIC Lab. - CNAM
Paris, France
nicolas.travers@cnam.fr
ABSTRACT
Publish/Subscribe (Pub/Sub) systems have been designed to face the exponential growth of information published on the Web by subscribing to sources of interest which produce flows of items. However, users may receive the same information several times, or information that does not contain any new content, and conversely miss information of interest hidden in the mass of items received. Pub/Sub systems consequently face a real challenge to efficiently filter relevant information. We propose in this paper a scalable approach for filtering news (items) which match user interests (expressed as subscriptions). Introducing for the first time the Term Discrimination Value (TDV) in this context, which measures how well a term discriminates an item, we filter out in real time items whose content has already been notified recently to the user, either in another item (filtering by novelty) or globally in his recent history (filtering by diversity). Our experiments illustrate the impact of our different parameters and confirm the scalability of our approach and the relevance of the notified results.
Categories and Subject Descriptors
H.2.4 [Database Management]: Systems
Keywords
Pub/Sub, Novelty & Diversity, Web Syndication, TDV
1. INTRODUCTION
Sources of information have been multiplying on the Web for several years, especially due to the success of news portals and social networks, which have become the most popular means of being informed in real time of published information. These sources produce more and more items [19] containing small pieces of information. Nowadays, the amount of data which has to be analyzed daily is so large that a user may miss information of interest.
(c) 2016, Copyright is with the authors. Published in the Proceedings of the
BDA 2016 Conference (15-18 November, 2016, Poitiers, France). Distri-
bution of this paper is permitted under the terms of the Creative Commons
license CC-by-nc-nd 4.0.
BDA 2016, November 15-18, 2016, Poitiers, France.
Thus, a given user can be lost on the Web, given both the number of available sources and the amount of data they provide [28]. Publish/Subscribe (Pub/Sub) systems (Redis [24], Scribe [25], Siena [9], Echo [15]) have been designed to deliver information of interest to end-users and to avoid wasting time on Web searches.
In Pub/Sub systems, the user defines interests (subscriptions) by means of topics, keywords, bookmarks, etc. These systems deliver items (notifications) according to the subscribers' criteria. Even if information is filtered through the matching process, users remain flooded by notifications [19]. Some proposals enhance the filtering process by removing redundant information (i.e., novelty [32, 10]) and/or by taking into account information diversification in the delivered items (i.e., diversity [14, 11, 23]), which is generally presented as a top-k issue. The Pub/Sub context however discards traditional top-k approaches, due to real-time notifications and the impossibility of removing an already-notified item from the past.
Very few works take relevance, novelty and diversity into consideration together in a real-time Pub/Sub context. Our Pub/Sub system [17] has a two-step process: matching and filtering. For matching keyword-based subscriptions, we assume the existence of an index [18] providing matched items to the corresponding subscriptions. The second step, presented in this work, is a filter by novelty and diversity. The difficulty in keeping a Pub/Sub system real-time is to evaluate novelty and diversity on the fly for every incoming item.
We propose in this paper an efficient real-time filtering approach for Pub/Sub systems based on the items' content, where information already delivered to a user is used to filter incoming items. Our contributions in this paper are:

- definitions for novelty and diversity in this particular context, along with a proposal for a weighting score adapted to the characteristics of items and subscriptions;
- an efficient filtering algorithm for real-time Pub/Sub systems based on novelty and diversity, which exploits redundancy between subscription histories;
- a validation which highlights the complementarity of novelty and diversity.
The paper is organized as follows. Section 2 defines items, novelty and diversity, which are used in Section 3 to present our system and its different optimizations. Section 4 discusses specific implementation choices for our system. Section 5 experimentally validates our approach. We compare our approach with existing systems in Section 6 and conclude in Section 7.
2. OVERVIEW OF OUR APPROACH
While the matching process relies on a subscription, the filtering of items is based on a set of notified items. Unlike top-k approaches, which are computed on the whole set of items to be notified and thus delay item delivery, each subscription in the Pub/Sub context is associated with a set of notified items that we call the subscription history. The decision to notify an item is made in real time, just after the matching process. This section presents an overview of our approach and the definitions adopted; its instantiation is presented in Section 4.
2.1 Items and Histories
In our context, we define an item as a set of terms. Each term is associated with a term weight, denoted $w_i$, used to compute distances and similarities. Weights are discussed in Section 4.1.

To compute novelty and diversity, a Pub/Sub system must keep the already notified items, in what we call a subscription history $H$. Each history is a time-ordered set of items linked to a subscription. Each time an item is notified for a subscription, it is added to its history.
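To make this data model concrete, the following minimal sketch (ours, not the paper's Java implementation; all names are illustrative) represents items as weighted term sets and histories as time-ordered lists. The `sum` and `info` fields anticipate the optimizations of Section 3.2.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    terms: dict            # term -> weight w_i (the TDV score of Section 4.1)
    tau: float = 0.0       # publication timestamp
    sum: float = 0.0       # sum of distances to later items (Section 3.2)
    info: dict = field(default_factory=dict)  # cached (novelty, dist) per pair

@dataclass
class History:
    """Time-ordered set of items already notified for one subscription."""
    items: list = field(default_factory=list)

    def add(self, item):
        self.items.append(item)   # items are appended in notification order

    def oldest(self):
        return self.items[0]      # I_o, the oldest notified item
```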
2.2 Novelty
The objective when filtering by novelty is to discard an item that does not contain new information with respect to the items in the subscription history, i.e., an item $I$ whose content is truncated from or similar to a previous item $I'$. Since, in our context, the history is time dependent, the novelty measure $new(I, I')$ should be asymmetric [32]: it tests how new an incoming item is w.r.t. an existing one, and not conversely. Finally, we can define the novelty of an incoming item $I$ with respect to an existing history $H$ by comparing $I$ with all items in $H$, one by one.

Definition 1 (Novelty item-history). Given a history of items $H$ and an item $I$, $I$ is said to be new with respect to $H$ iff:
$$\forall I' \in H,\quad new(I, I') \geq \alpha$$

We assume that the novelty threshold $\alpha$ is a parameter fixed by the user for his subscription, according to the desired or required item output rate. We study in Section 5 its default value and its impact on the filtering rate and quality, on histories, and on the performance of our system.
2.3 Diversity
Diversity captures a complementary kind of redundancy, since it measures whether the information contained in a given item is globally present in the set of recently notified items (segmented information). Filtering by diversity is complementary to filtering by novelty. The objective is to detect whether an incoming item conveys new information with regard to the whole set of notified items (the history) for a given subscription. The user's objective is to receive only items carrying different information; items are therefore filtered by their content. An item is thus worth sending to the user if the information it contains is not present in the set of items already notified. The degree of diversity of an item for a user w.r.t. his subscription history is measured as how much it can increase the average pairwise distance $dist(I, I')$ between the history's items [11]. Observe that to keep $D(H)$ and $D(H')$ (with $H'$ including $I$) comparable, we must remove an old item $I_o$ from $H$ before adding $I$. To satisfy the diversity criterion, $I$ must be on average more distant from all items in $H$ than at least one of the items in $H$. We chose $I_o$ to be the oldest item in $H$, assuming that $I_o$ is the most likely to be the most distant item, since its information is older and deprecated. Focusing on only one item of the history allows us to avoid a quadratic complexity and to scale the system up.

Definition 2 (Diversity of items). Assume a history of items $H$ where $D(H)$ is the average pairwise distance between its items. An item $I$ improves the diversity of $H$ if and only if:
$$D(H \cup \{I\} \setminus \{I_o\}) > D(H)$$
with $I_o$ the oldest item of $H$ and:
$$D(H) = \frac{1}{|H| \times (|H| - 1)} \sum_{I \in H} \sum_{I' \in H,\ I' \neq I} dist(I, I')$$

Observe that the two average distances must be comparable, so the number of items in the histories must be identical. Otherwise, if we compared $H$ to $H \cup \{I\}$, the new item $I$ would have to be far more distant from the items in $H$ to make it more diverse than in our proposition. This justifies our choice to interchange $I$ with the oldest item $I_o$, the most likely distant one, since the time elapsed between the two items naturally makes their information more distant.
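Definition 2 translates into the following sketch (quadratic on purpose, to mirror the definition; `dist` is any item distance, e.g. the TDV-weighted Euclidean distance of Section 4.3; Section 3 shows how to avoid recomputing $D(H)$ from scratch):

```python
from itertools import combinations

def avg_pairwise_distance(items, dist):
    """D(H): average distance over all ordered pairs of distinct items."""
    n = len(items)
    if n < 2:
        return 0.0
    # each unordered pair counts twice in the double sum of Definition 2
    return 2.0 * sum(dist(a, b) for a, b in combinations(items, 2)) / (n * (n - 1))

def improves_diversity(item, history, dist):
    """Definition 2: I improves H iff D(H U {I} \\ {I_o}) > D(H)."""
    h = history.items
    h_swapped = h[1:] + [item]   # replace the oldest item I_o by I
    return avg_pairwise_distance(h_swapped, dist) > avg_pairwise_distance(h, dist)
```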
Filtering process overview
To summarize our time-dependent filtering process, an incoming item $I$ which matches a subscription must verify novelty and diversity over the subscription history $H$. First, the novelty of $I$ is checked by comparing it with each item in $H$: if its novelty w.r.t. at least one item falls below the threshold $\alpha$, it is discarded for $H$. Second, the diversity of $H$ is compared to the diversity of $H \cup \{I\} \setminus \{I_o\}$: if $I$ increases the average distance, it is notified and added to $H$.

A subscription is said to be satisfied by an item only if both the matching and filtering processes are validated. For the matching process, this means that all the subscription's terms are contained in the item. For the filtering process, the item must pass through both novelty and diversity.
3. FILTERING IN REAL-TIME
In this section, we present our solution to quickly filter out items based on the novelty and diversity criteria. It also allows item histories to be efficiently stored and managed for all subscriptions.
3.1 Shared history
Since an item can belong to several histories, we must avoid storing each item once per history. A simple solution consists in storing the last $N$ notified items [32] for each subscription and factorizing histories by storing each item only once. However, the publication rate strongly differs from one source to another, and this approach impacts the filtering quality: important items can be removed too quickly (for a very active source), or a highly-filtering item could never disappear (for a rarely notified source). We conclude that the relevance of filtering would be impacted by such item-count-based histories.

To optimize memory consumption, we adopt a shared history, which is basically a time-based sliding window $W$ containing all the items notified at least once during the last period $p$. Subscription histories are stored as ordered sets of pointers to the related items in $W$. Figure 1 presents an example of a sliding window $W$ and two subscriptions $S_1$ and $S_2$ with their corresponding histories and pointers to the shared history.

Figure 1: Example of a sliding window
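A minimal sketch of this layout (our illustration of Figure 1; names are hypothetical): the window stores each item exactly once, and each subscription history only holds references into it.

```python
from collections import deque

class SlidingWindow:
    """Shared history W: every item notified at least once during the
    last `period` time units, stored exactly once."""
    def __init__(self, period):
        self.period = period
        self.items = deque()              # ordered by publication time

    def add(self, item):
        self.items.append(item)

    def expire(self, now, histories):
        """Drop items older than the period, together with the pointers
        that subscription histories hold on them."""
        while self.items and self.items[0].tau <= now - self.period:
            old = self.items.popleft()
            for h in histories:
                # a history is time-ordered, so an expired item, if
                # present, is necessarily at its front
                if h.items and h.items[0] is old:
                    h.items.pop(0)
```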
3.2 Shared-history Filtering algorithm
Filtering by novelty and diversity with a large number of subscriptions which share common items raises a real optimization challenge. Indeed, a naïve algorithm which checks an incoming item for novelty and then for diversity against the histories of all the subscriptions it matches has the following cost (Definitions 1 and 2):
$$C_{filter}(I, S) = \sum_{s \in S} \sum_{I' \in H_s} new(I, I') + \sum_{s \in \rho(S)} D(H_s \cup \{I_n\} \setminus \{I_o\})$$
where $S$ corresponds to the set of subscriptions matched by the incoming item $I$ and $\rho(S)$ represents the fraction of $S$ for which $I$ satisfied the novelty threshold. Assume that the term weights are computed and considered constant, that the average history size is $N_H$ items (number of computations per history), and that the average item size is $S_I$ (the time for each computation is based on the item size). Since the cost depends on the number of computations per history and the time for each computation is based on the item size, the average complexity of this algorithm is:
$$C_{filter}(I, S) = O(|S| \cdot N_H \cdot S_I) + O(|\rho(S)| \cdot (N_H \cdot S_I)^2)$$
The experiments in Figure 2 show that novelty has a filtering rate proportional to the chosen threshold $\alpha$. This results in $|S| \sim |\rho(S)|$ and in a globally quadratic complexity:
$$C_{filter}(I, S) = O(|S| \cdot (N_H \cdot S_I)^2)$$
To achieve Web scaling, we propose to optimize the filtering of incoming items by novelty and diversity by sharing the filtering process across all subscriptions. As explained previously, the quadratic complexity of the diversity computation makes it preferable to compute novelty first. However, browsing $H$ several times to process similarities can be costly, especially as novelty alone does not filter enough (see Section 5.2.1). To avoid scanning a history twice, both filters are applied in one pass. Algorithm 1 presents the processing with shared histories and optimized computations for novelty and diversity.
Algorithm 1: Novelty and diversity filtering on a history
Require: an item I, a history H, and a novelty threshold α ∈ [0, 1]
 1: sum_H ← 0;
 2: sum_I ← 0;
 3: I_o ← H[0]; // oldest item
 4: for all I' ∈ H do
 5:   if I.getInfo(I') = null then
 6:     N ← novelty(I, I');
 7:     d ← dist(I, I');
 8:     I.putInfo(I', N, d);
 9:   else
10:     N ← I.getInfo(I').N;
11:     d ← I.getInfo(I').d;
12:   end if
13:   if N < α then
14:     return;
15:   end if
16:   if I' ≠ I_o then
17:     I.sum ← I.sum + d;
18:   end if
19: end for
20: if I.sum > I_o.sum then
21:   for all I' ∈ H do
22:     I'.sum ← I'.sum + I.getInfo(I').d
23:   end for
24:   H ← H ∪ {I};
25:   Notify I;
26: end if

The algorithm processes each item $I'$ in $H$. Since $I$ must be compared to $I'$ each time it appears in a history, we compute $new(I, I')$ and $dist(I, I')$ only once, to benefit from the $(I, I')$ co-occurrences. Thus we check whether this value has already been computed for another subscription. If not, we compute
and register it (lines 6-7-8); otherwise we just retrieve the stored value (lines 10-11). If $I$ is not new, the algorithm stops (lines 13-14). Remember that, as explained previously, the quadratic complexity of the diversity computation makes it preferable to compute novelty first, and that diversity requires computing the average pairwise distances between the items of $H$. We therefore accumulate the distance $dist(I, I')$ with the others from $H$ (line 17), but only if $I'$ is not the oldest item $I_o$ (line 16). Secondly, as explained below, the diversity computation can be simplified to the comparison between the sums of distances from $I_o$ and from $I$ (line 20). In that case, the sums of distances are updated (lines 21-22), and $I$ is added to the history and notified (lines 24-25).
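For readability, here is our Python transcription of Algorithm 1, building on the Item/History sketches above (a sketch: the `info` dictionary plays the role of getInfo/putInfo and caches the novelty and distance of each (I, I') pair across the subscriptions that share it; per the stored-value invariant of I.sum below, the newly inserted item starts at zero, since no later item exists yet):

```python
def filter_on_history(item, history, alpha, novelty, dist):
    """Novelty then diversity filtering of `item` over one history.
    Returns True iff the item must be notified for this subscription."""
    oldest = history.items[0]                    # I_o
    item_sum = 0.0                               # distances from I to H \ {I_o}
    for old in history.items:
        cached = item.info.get(id(old))          # already computed for another
        if cached is None:                       # subscription?
            cached = (novelty(item, old), dist(item, old))
            item.info[id(old)] = cached
        n, d = cached
        if n < alpha:                            # not new w.r.t. one item: stop
            return False
        if old is not oldest:
            item_sum += d
    if item_sum > oldest.sum:                    # diversity test: I.sum > I_o.sum
        for old in history.items:
            old.sum += item.info[id(old)][1]     # maintain the stored sums
        item.sum = 0.0    # I is now the latest item: no later distances yet
        history.items.append(item)               # H <- H U {I}; notify I
        return True
    return False
```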
To summarize, our algorithm integrates two main optimizations. The first one exploits the high probability of computing the similarity and the distance of each pair of items $(I, I')$ several times. Subsequent computations on a pair are constant-time and no longer depend on the item size $S_I$. These values are stored during the filtering process of $I$ and deleted when there is no more subscription to check. The gain depends on the co-occurrence ratio $\sigma \in [0, 1]$ of items in subscription histories, defined from the number of co-occurrences of item pairs checked during the filtering step over the total number of pairs required:
$$\sigma = 1 - \frac{\#cooccurrences}{\#pairs}$$
The second optimization deals with the computation of the density, which changes for each notified item. To avoid the quadratic complexity of computing the sum of pairwise distances in $H$, we propose in our algorithm to store, for each item of the history $H$, the computed sum of distances $I.sum$ to all items received later. The density of $H$ is then the sum of the $I.sum$ values. Since the oldest items are removed along with their stored values, no other update has to be done for the remaining items. Furthermore, the computation of distances also benefits from the $\sigma$ co-occurrence gain. Formally, $I.sum$ is a stored value equal to:
$$I.sum = \sum_{I' \in H \,\wedge\, I'.\tau > I.\tau} dist(I, I')$$
Since diversity is the comparison between $D(H')$ and $D(H)$ with $H' = H \cup \{I_n\} \setminus \{I_o\}$, checking whether an incoming item increases the diversity or not may be simplified as follows:
$$\frac{2 \times \sum_{I' \in H'} I'.sum}{|H'| \times (|H'| - 1)} > \frac{2 \times \sum_{I' \in H} I'.sum}{|H| \times (|H| - 1)}$$
$$|H| = |H'| \;\Rightarrow\; \sum_{I_k \in H'} I_k.sum > \sum_{I_k \in H} I_k.sum$$
$$H' = H \cup \{I_n\} \setminus \{I_o\} \;\Rightarrow\; \sum_{I_k \in H} I_k.sum + I.sum - I_o.sum > \sum_{I_k \in H} I_k.sum$$
So the diversity test consists in checking whether:
$$I.sum > I_o.sum$$
To conclude, the complexity of our algorithm benefits from the co-occurrence between items for the novelty and density computations, which results in the following linear complexity.

Proposition 1 (Shared-history complexity). Algorithm 1 has a linear complexity w.r.t. the number of subscriptions matched by the incoming item, the average history size, and the item size.

Proof. Let $\sigma$ denote the average co-occurrence ratio between items, $|S|$ the number of subscriptions matched by the incoming item, $N_H$ the average history size, and $S_I$ the average item size. Then, with the shared-history management, the filtering cost of Algorithm 1 is:
$$C_{filter}(I, S) = C_{nov}(I, S) + C_{div}(I, \rho(S)) = O(|S| \cdot \sigma \cdot N_H \cdot S_I) + O(\alpha |S| \cdot \sigma \cdot N_H \cdot S_I) = O(|S| \cdot N_H \cdot S_I)$$

The complexity of computing $D(H)$ and $D(H \cup I)$ is about $O(|H|)$, while that of computing the average pairwise distance between the items of $H$ with the classical method is about $O(|H|^2)$. This optimization in computing the density reduces the processing time.
4. DISCUSSION ON METRICS
4.1 Weighting terms with TDV
The definitions of item novelty and diversity are both based on the weights of their terms. Several term weighting models have been proposed in the literature, like Term Frequency (TF) combined with Inverse Document Frequency (IDF) [3], the Term Discrimination Value (TDV) [26], or Term Precision [6].

In our context, where items are short sets of terms (social network items), weights based on term frequencies, like the widespread TF-IDF function, are inappropriate, since it is very unlikely to have multiple occurrences of a term within an item. Consequently, for an item $I$, the TF is generally equal to $\frac{1}{|I|}$ and TF-IDF turns into an IDF score, so the less frequent terms get the highest scores. This motivates our choice to weight terms independently of their number of occurrences in items, using the TDV score. This choice is validated by our experiments (Table 3).

The TDV function measures how much a term helps to distinguish a set of documents (i.e., its impact on the global entropy). Consequently, neither a frequent term nor an uncommon one has an important TDV value [30]. This is, as far as we know, the first time TDV is used in such a context. The TDV value represents the capacity of a term to make items more similar globally. The discrimination value of a term $t_k$ is then the difference of density between the item set with $t_k$ and without $t_k$. We compute the density as the average pairwise similarity between distinct items:
$$\Delta(\mathcal{I}) = \frac{1}{|\mathcal{I}| \times (|\mathcal{I}| - 1)} \sum_{I \in \mathcal{I}} \sum_{I' \in \mathcal{I},\ I' \neq I} sim(I, I')$$
where $sim(I, I')$ corresponds to a similarity function between items, for instance the cosine similarity. Finally, the TDV value of a term $t_k$ is:
$$tdv(\mathcal{I}, t_k) = \Delta(\mathcal{I} \setminus \{t_k\}) - \Delta(\mathcal{I})$$
For simplicity, we write $tdv(t_k)$ instead of $tdv(\mathcal{I}, t_k)$.
4.2 Novelty
Novelty checks whether the information in $I$ has already been delivered. For example, if $I'$ contains $I$ and appends additional information, then $I$ is not new compared to $I'$, but $I'$ is new compared to $I$. Therefore, symmetric measures like the Jaccard measure are not suitable. Consequently, we adopt the following measure, inspired by Newsjunkie [16], for the novelty of an item compared to another one from the history:

Definition 3 (Novelty item-item). Let $\alpha \in [0, 1]$ be a novelty threshold, and $I$ and $I'$ two items. $I$ is said to be new compared to $I'$ if and only if:
$$new(I, I') = \frac{\sum_{t \in I \setminus (I \cap I')} tdv(t)}{\sum_{t \in I} tdv(t)} \geq \alpha$$

This measure computes the weighted coverage of the terms of $I$ that are not present in $I'$, relative to the cumulative weight of all the terms of $I$. Note that we chose the TDV value as term weight, according to the discussion above.
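Definition 3 in code (a sketch; items are sets of terms and `tdv` is a precomputed term-to-weight mapping). Note the asymmetry: if $I'$ contains all of $I$, then $new(I, I') = 0$ while $new(I', I)$ may be large.

```python
def novelty(I, I_prime, tdv):
    """new(I, I'): weighted coverage of the terms of I absent from I'."""
    total = sum(tdv.get(t, 0.0) for t in I)
    if total == 0.0:
        return 0.0
    fresh = sum(tdv.get(t, 0.0) for t in I - I_prime)  # I \ (I ∩ I')
    return fresh / total
```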
4.3 Similarity in Diversity
To measure diversity, we need to compute the distance between items. Several distance measures have been proposed in the literature to compute diversity on a set of documents. The most frequently used are Cosine [32], Euclidean [12, 23] and Jaccard [13], but we can also cite Pearson (derived from Cosine), Dice (derived from Jaccard), or Levenshtein. For short items, the Euclidean distance is known to produce more relevant results [4]. Thus we consider in our system a diversity function based on a Euclidean distance weighted by TDV values. The comparison between these different measures will be studied in future work.
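One natural instantiation, which we sketch here under the assumption that items are viewed as binary term vectors scaled by their TDV weights, is:

```python
def tdv_euclidean(I, I_prime, tdv):
    """Euclidean distance between two items viewed as TDV-weighted binary
    term vectors: only terms present in exactly one item contribute."""
    return sum(tdv.get(t, 0.0) ** 2 for t in I ^ I_prime) ** 0.5
```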
5. EXPERIMENTS
In this section, we study the impact of several parameters (i.e., the novelty threshold, diversity, and the size of the sliding window) on a real dataset of items. We measure in particular their impact on the filtering rate, on the size of the histories, and on performance. Finally, thanks to a user validation, we study the quality of our system under different settings and against a periodic filtering based on a top-k approach.
5.1 Implementation and description of datasets
We implemented the system in standard Java v1.6.0_20. All experiments were run on a 3.60 GHz quad-core processor with 16 GB of JVM memory.
Figure 2: Filtering rate by varying novelty threshold
For our experiments, we used a subset of a real dataset of items acquired during an 8-month campaign from March to October 2010 [28]. The set of items considered corresponds to the first week of October (258,480 items).

The alias sampling method [29] was used to generate 10M subscriptions which follow the distribution of term occurrences on the Web and the Web query sizes reported in [5]. The vocabulary of 1.5M distinct terms extracted from the items is used to generate the subscriptions. Subscriptions have a maximal size of 12 terms and an average size of 2.2 terms. Note that only 5.28 million subscriptions are satisfied at least once during the studied week, which means that 5.28 million subscriptions pass through both the matching and filtering processes.
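For reproducibility, here is a sketch of Walker's alias method [29], the sampler behind our subscription generation (variable names are ours): O(n) preprocessing, then O(1) per draw from an arbitrary discrete distribution.

```python
import random

def build_alias(probs):
    """Preprocess a discrete distribution (probs sums to 1) into the
    accept/alias tables of Walker's method."""
    n = len(probs)
    scaled = [p * n for p in probs]
    accept, alias = [1.0] * n, [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        accept[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]       # l donates mass to s's bucket
        (small if scaled[l] < 1.0 else large).append(l)
    return accept, alias

def draw(accept, alias):
    """O(1) sampling: pick a bucket, then accept it or take its alias."""
    i = random.randrange(len(accept))
    return i if random.random() < accept[i] else alias[i]
```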
5.2 Filtering rate
We study in this section the impact of the novelty threshold and of diversity on the filtering rate, as well as that of the number of subscriptions and the window size. The results presented in this section correspond to the average filtering rate (number of notified items over the number of items that match the subscription) of the subscriptions satisfied at least once during the last day of the studied week.
5.2.1 Impact of the novelty threshold
Figure 2 shows the novelty filtering rate when varying the novelty threshold, for a window size of 24 hours (dashed line). We observe that the filtering rate increases linearly with the novelty threshold. We notice that 38% of items are filtered when the novelty threshold is set to 50%, i.e., when half of the information is non-redundant. We recall that an item's novelty is based on its weighted coverage (Definition 3). On average, only 20% of the items that satisfy a subscription do not contain redundant information: 80% of the items are filtered out when the novelty threshold is equal to 100%.
5.2.2 Impact of diversity
Figure 2 also illustrates that filtering by diversity reduces the number of items to notify (solid line). Diversity acts as a strong filter, since the filtering rate when considering only diversity (i.e., a novelty threshold of 0%) is equal to 64.34%. Figure 2 shows that novelty and diversity are complementary filters. Observe that the filtering rate increases only slightly with the value of the novelty threshold when diversity is also considered (from 64.34% for a novelty threshold of 0% up to 82% for a threshold of 100%). However, the benefit of having both filters is twofold, since filtering by novelty additionally decreases the number of items to consider for the costly diversity computation.

Table 1: Filtering rate by varying window size
  Window size | Filtering rate
  12 H        | 61.06%
  24 H        | 70.71%
  48 H        | 75.93%

Table 2: Number of subscriptions & notifications w.r.t. subscription size
  |s| | # of subscriptions satisfied | Average # of items | Filtering rate
  1   | 2,030,375                    | 505.31             | 86.79%
  2   | 1,804,265                    | 21.94              | 57.71%
  3   | 293,666                      | 4.98               | 42.88%
  >3  | 28,776                       | 2.45               | 35.52%

For the following experiments, we set the novelty threshold to 50% (best quality in Table 3) and take diversity into account in the filtering process.
5.2.3 Impact of window size
Table 1 shows the filtering rate for different sliding window sizes. The size of the window impacts the filtering rate: with a larger window, items stay longer in the histories and are used to filter new incoming items. Although larger sliding windows increase history length (see next section) and items notified in them stay longer in the histories, the information remains diverse enough to generate new notifications. For the following experiments, the sliding window size is set to 24 hours.
5.2.4 Impact of subscriptions size
Table 2 presents the distribution of subscriptions by size, which follows that of Web queries [5]. Most subscriptions are short (fewer than 4 terms). We can also note that the number of notified items per subscription decreases drastically with the subscription size: while short subscriptions are often matched (>500 items/day), large subscriptions are rarely notified (<5 items/day).

According to our results, the filtering rate is highly dependent on diversity (present or not) as well as on the novelty threshold, but subscription size also has a significant impact.
5.2.5 Histories size
We capture the variation of the history size over time: we measure the average size every six hours over the studied week, with three different sliding window sizes. Figure 3 shows this variation for a novelty threshold of 50%; the values presented are the average history sizes of the subscriptions satisfied at least once in the first six hours of the week (3.35 million subscriptions). Note that the size of a history at time $\tau$ is the number of its items within the window of the considered period $p$ (i.e., items published after $\tau - p$).

Figure 3: Variation of histories size over time

During the initialization phase, histories become larger with large sliding windows. The peak of each sliding window corresponds to the accumulation of items, which ends after one window-size period (12/24/48 H). The accumulation is due to the fact that empty histories do not play their filtering role: since the density keeps growing during this initialization step, there is almost no filtering by diversity and most items are notified. After this initialization period, the items which were greedily added to the histories at the beginning leave the sliding window, which leads to a drastic decrease of the history sizes during one window-size period. The first items disappear, which explains the gap between the peak and the trough exactly one period later. The same effect occurs with greater magnitude for the 12 H and 48 H sliding windows: a small sliding window empties quickly and must restart the density computation, while a large sliding window empties slowly and filters a lot. The 24 H sliding window has a more stable behavior, since it fills up and empties at an appropriate rate. The history then fulfills its filtering role for novelty and diversity, and its size stabilizes. We also measured this variation with different novelty thresholds and confirmed these conclusions. The initialization phase corresponds to the diversification of the histories.

Another conclusion from Figure 3 is that the history size depends on the window size. For 12 and 24 H, the number of items in a history is globally proportional to the window size (10/20 items), while for 48 H it reaches 70, more than proportionally. Even if old items contribute to diversifying the information, the growth of the filtering rate (Table 1) is not proportional to the window size; the history, on the other hand, needs a greater number of items to filter by diversity.
5.3 Performances evaluation
As presented in Section 3.2, we evaluate here three different implementations of our system: a Naïve approach without optimization, a Co-occurrence approach exploiting the co-occurrence ratio $\sigma$ of items, and the Diversity approach, which additionally pre-computes and stores densities in every history.
5.3.1 Memory requirements
Since the Co-occurrence approach stores extra values only during the filtering process, the amount of space used by this implementation is equal to that of the Naïve implementation. Consequently, we only compare the Naïve and Diversity implementations.

Figure 4: Memory space vs novelty threshold

Figure 4 shows the memory space used by the sliding window and the subscription histories for various novelty thresholds. When the filtering rate increases, fewer items are stored in the sliding window, which reduces the memory consumption of both the optimized and the normal implementations. The Diversity implementation requires more memory space, since the sums of distance scores are precomputed and stored for each history. Observe that it consequently requires a memory space proportional to the size of the histories, hence inversely proportional to the filtering rate observed in Figure 2. For instance, for a novelty threshold of 50% we require 2,866 MB of memory, while for a threshold of 100% (+16% filtering rate) we require only 2,387 MB (-16.68%).
Figure 5: Memory space vs number of subscriptions

Figure 5 illustrates the variation of the memory consumption when varying the number of subscriptions. For this experiment, the filtering rate and the average sliding window size are fixed. We observe that the memory space increases linearly w.r.t. the number of indexed subscriptions in both implementations, since each history stores information linked to the sliding window. The Diversity implementation requires more space than the Naïve version to store extra information, but the ratio remains constant at 2.4: the Naïve implementation uses 399 MB (resp. 1,009 MB) while the Diversity optimization uses 1,227 MB (resp. 2,866 MB) for 2M (resp. 10M) subscriptions.
5.3.2 Processing time
We now study the gain in processing time obtained with the optimizations of our system.

Figure 6: Processing time when varying the novelty threshold

Figure 6 shows that the average time (in log scale) decreases with the novelty threshold, and therefore with the history size. The Naïve implementation requires much more computing time, especially for low novelty thresholds. The rationale lies in the co-occurrence optimization, which reduces the number of similarity and distance computations: the Naïve implementation is on average 5 times more costly than the optimized ones, except for high thresholds, where histories are short and few similarities/distances are computed. Moreover, the difference between the Co-occurrence and Diversity results decreases with the size of the histories, which depends on the novelty threshold: the gain is 68% for a novelty threshold of 0% and 13% for a novelty threshold of 80%, thanks to the $O(1)$ complexity (finding $I_o.sum$) of the diversity computation.

Figure 7: Processing time for different sizes of sliding windows

Since the processing time mainly depends on the history size, it also depends on the sliding window size, especially for the Naïve and Co-occurrence implementations, whose computation time grows faster, as shown in Figure 7. In fact, the computation of diversity depends on the sliding window size. In contrast, the processing time of the Diversity implementation exhibits a moderate increase, except for large window sizes (48 H), where histories are larger, which means more distance computations and updates of sums. Indeed, Diversity stores the sums of distances between items, while the Co-occurrence implementation has to recompute them. Since larger windows filter more (Table 1), the number of distance computations per history grows accordingly, except for the Diversity implementation, which adds those values the first time an item is notified, whatever the number of history updates. In contrast, the Naïve implementation computes distances every time, even when histories are updated; it requires 21 to 31 times more time than the optimized solutions.
Figure 8: Processing time when varying the number of subscriptions

As we can see in Figure 8, the processing time increases linearly with the number of subscriptions for both optimizations, while the Naïve implementation increases very fast, since no co-occurrence between subscriptions is exploited. For the Co-occurrence and Diversity implementations, the growth was expected to be sub-linear, since similarity and distance computations between items are stored during the process to avoid their re-computation; with the growth of the number of subscriptions, the probability of finding the same pair of items in different subscriptions grows. The gain for Co-occurrence is far more significant (-93%) than for Diversity (-63%), since the similarity and distance functions are very costly (compared to the sums for diversity). Nevertheless, the Diversity implementation needs on average 2.7 times less time than the Co-occurrence one.
5.4 Quality of Filtering
In this section, we study the quality of our filtering step against users' behavior. To compute the relevance of our system, we compare the items chosen by users with those selected by our system. To validate our choices, we compare the quality of filtering when changing the weighting score, the novelty similarity, and its threshold. We also compare our real-time filtering with a top-k algorithm [14].

To this end, we extracted 10 subscriptions¹ for which we gathered the matched items. We then asked users to manually filter the items according to the novelty and diversity of their information. Users had to read the texts and decide whether an item was new or whether its information was globally contained in previous items. In order to preserve our real-time filtering context, items were displayed in sequence, so as to be filtered in chronological order, and the histories were shown to the users. 60 users, academics and PhD students in computer science, performed 106 validations on our subscriptions. Since filtering by novelty is easier than filtering by diversity, we kept items in the result set only if they were chosen by more than 60% of the users (75% for novelty), giving more weight to diversity.
The top-k algorithm [14] determines the $k$ most distant items from a set of items satisfying subscriptions, in order to achieve diversity. A result set is initialized with the two most distant items among the items satisfying the subscription, and extended with the next most diversifying items. Each subscription has its own value of $k$, equal to the history size generated by our approach; having the same size makes the result sets comparable for the quality measurement. Moreover, this algorithm cannot take novelty into account, since novelty is an asymmetric measure based on time. Recall that our window-based approach relies on the time assumption, which means that none of the notified items can be removed from the result set, while the top-k algorithm may remove a previously chosen item to choose another one in a following snapshot.
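For comparison, here is a sketch of our simplified reading of this top-k diversification baseline [14]: seed with the most distant pair, then greedily add the candidate that maximizes its minimum distance to the chosen set (assumes k >= 2).

```python
from itertools import combinations

def topk_diverse(candidates, k, dist):
    """Greedy diversification: seed with the most distant pair, then
    repeatedly add the candidate farthest from the already chosen set."""
    if len(candidates) <= k:
        return list(candidates)
    seed = max(combinations(candidates, 2), key=lambda p: dist(p[0], p[1]))
    chosen = list(seed)
    remaining = [c for c in candidates if c not in chosen]
    while len(chosen) < k:
        best = max(remaining, key=lambda c: min(dist(c, s) for s in chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```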
Table 3 shows the average precision, recall and F-measure over all the subscriptions, compared to the user result set.

Table 3: Filtering relevance with various techniques, thresholds and metrics
              Diversity  top-k  coverage  coverage  coverage  TF-IDF  TF-IDF  Jaccard
              only              25%       50%       75%               50%     50%
  Precision   0.782      0.711  0.930     0.939     0.944     0.764   0.884   0.916
  Recall      0.698      0.634  0.652     0.652     0.610     0.618   0.545   0.652
  F-Measure   0.726      0.660  0.732     0.736     0.710     0.626   0.646   0.729

¹Subscription items and user results are available at:
http://cedric.cnam.fr/traversn/research/FiND/userset/

Different settings of our system were tested to find the most relevant measure for our filtering step: the diversity step without novelty, the top-k approach result set, different thresholds for the weighted coverage with diversity, replacing TDV with the standard TF-IDF with and without novelty, and finally novelty computed with the Jaccard distance. In particular, we compare our weighted-coverage novelty measure (Definition 3) with the standard Jaccard similarity for different thresholds, as well as the relevance of our TDV term weights versus TF-IDF, both for diversity and for novelty. We also study the behavior of the top-k algorithm.

We can see that the combination of diversity and novelty produces better results than diversity alone, especially for the precision of the result. However, the recall of the result set decreases when using novelty, which can be too selective where diversity is not selective enough. As expected, TF-IDF weights do not perform well, since items are short, so the TF is low and only the IDF is really taken into account; with a low precision (0.884) and recall (0.545), it gives among the lowest F-measures of our tests. Regarding novelty, the lack of asymmetry and of term weights makes the Jaccard measure less relevant for the precision of the result set. Finally, the top-k technique is not as relevant as our solution, since its interchange algorithm for choosing the most diverse items does not rely on the real-time assumption used for the user validation. Our technique, real-time filtering using a TDV-weighted coverage measure for novelty with a threshold of 50%, thus gives a good accuracy.
6. RELATED WORK
When searching for a document on the Web, we generally assume that the set of queried documents is static and already known. The objective of search engines is to present to users a ranked list of the $k$ most relevant and diverse documents matching a query. To achieve this, some models are based on probabilities for matching and diversity [2], or on graphs to compute distances [12] between items and select their minimum representative set. Others propose to modify diversity measures by focusing on uncommon attributes between items based on user-defined filters [31], by defining a trade-off between similarity and diversity [27], by integrating entities and sentiment in a greedy Max-Min algorithm [1], by defining time-based distances with a Gaussian similarity for blog retrieval [20], or by comparing an item with the compression of all previous texts, as with the NCD distance [8]. [7] proposes the Maximal Marginal Relevance (MMR) method, which combines query relevance and document diversity to compute the scores used to rank the result set: a document is ranked high if it is similar to the query and dissimilar to the previously selected documents.
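The MMR selection of [7] can be sketched as follows (our illustration; λ trades off query relevance against redundancy with the already-selected documents):

```python
def mmr_select(candidates, query, sim, k, lam=0.5):
    """Iteratively pick documents that are relevant to the query (weight
    lam) yet dissimilar to the already selected ones (weight 1 - lam)."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda d: lam * sim(d, query)
                   - (1 - lam) * max((sim(d, s) for s in selected), default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected
```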
Also, to solve the problem of query ambiguity in IR, [10] proposes a probabilistic model to rank documents by taking into account their novelty and diversity. Globally, these techniques process large texts in a static top-k evaluation and cannot be adapted to our context, since we consider small items, which changes the relevance of the previous methods. Moreover, real-time delivery of information is an important constraint that cannot be ignored.
Some approaches focus on continuous filtering, as in the Pub/Sub context, combined with top-k techniques. They may be based on fixed-size windows, in order to guarantee the number of items kept in the system, like [13], which uses a dynamic index to quickly decide whether an item is diverse or not on a frequently updated snapshot of items, [16], which focuses only on novelty with entities extracted from items, [21], which presents an incremental approach to diversification while integrating time to weight items, or [22], which reduces items to a small set of topics allowing a simple coverage distance with a set of items (not adapted to high-dimensional comparison). However, fixed-size windows hardly manage the different notification rates of subscriptions: low rates will keep very old items to filter out incoming items, while high rates will remove the recent items that should filter duplicates.
The closest approach to our solution is [14], which uses top-k windows to compute diversity with real-time delivery. It is based on an interchange algorithm which notifies an item if exchanging it with one from the previous top-k increases the diversity. However, this solution can deliver items from previous windows if they are considered non-diverse, or remove items from the past used by future filtering steps. As we saw in our experiments, this approach tends to diversify information locally, but not over time. Moreover, keeping all items leads to scalability issues.
7. CONCLUSION AND FUTURE WORK
In this paper, we present a Pub/Sub system which filters by novelty and diversity on the fly. The filtering is based on the items already notified to a user, and we choose a time-based sliding window to manage the subscription histories. Our main contributions are (a) the proposition of TDV to weight terms, combined with (b) a weighted coverage measure for novelty which is asymmetric and adapted to small items, (c) the design of an optimized system which factorizes similarities and distances and reduces the cost of diversity computations, and (d) a quality measurement of our propositions through a user validation based on real-time filtering with novelty and diversity.

Our experimental study shows that novelty and diversity are complementary filters. Moreover, we observe that the filtering rate depends on the novelty threshold and on the window size, and that diversity has less effect for large window sizes. We also studied the performance of our system and obtained an average gain of 97% in processing time with our optimizations for factorizing co-occurrences and computing the history density. We compared the quality of our system under different settings and against a top-k approach, and showed that real-time delivery is a strong constraint which our system guarantees with a TDV-weighted coverage combined with diversity.

As future work, we aim to tune the quality of the diversity measure, since cosine and Euclidean distances do not focus on the same kind of filtering. Another necessity is to address the problem of rarely notified subscriptions, by extending the set of item terms to be matched with subscription terms, based on the TDV values and our item distance. We also intend to propose a distributed version of our algorithm and of our history management in a NoSQL environment, focusing the computation of measures on items instead of subscriptions in order to preserve factorization and scalability.
8. REFERENCES
[1] S. Abbar, S. Amer-Yahia, P. Indyk, and S. Mahabadi.
Real-time Recommendation of Diverse Related Articles.
In World Wide Web Conference (WWW), pages 1--12,
2013.
[2] A. Angel and N. Koudas. Efficient Diversity-Aware
Search. In International Conference on Management of
Data (SIGMOD), pages 781--792, 2011.
[3] R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern
Information Retrieval. ACM Press / Addison-Wesley,
1999.
[4] V. Bavi, T. Beirne, N. Bone, J. Mohr, and B. Neal.
Comparison of Document Similarity Metrics, 2010.
Computer Science Department, Western Washington
University Information Retrieval.
[5] S. M. Beitzel, E. C. Jensen, A. Chowdhury, D. A.
Grossman, and O. Frieder. Hourly Analysis of a Very
Large Topically Categorized Web Query Log. In ACM
Conference on Research and Development in
Information Retrieval (SIGIR), pages 321--328, 2004.
[6] A. Bookstein and D. Swanson. Probabilistic Models for
Automatic Indexing. Journal of the American Society
for Information Science, 25(5):312--318, 1974.
[7] J. Carbonell and J. Goldstein. The Use of MMR,
Diversity-based Reranking for Reordering Documents
and Producing Summaries. In ACM Conference on
Research and Development in Information Retrieval
(SIGIR), pages 335--336, 1998.
[8] D. Carmel, H. Roitman, and E. Yom-Tov. On the
Relationship Between Novelty and Popularity of
User-generated Content. In ACM International
Conference on Information and Knowledge
Management (CIKM), pages 1509--1512, 2010.
[9] A. Carzaniga, D. S. Rosenblum, and A. L. Wolf.
Design and evaluation of a wide-area event notification
service. ACM Transactions on Computer Systems
(TOCS), 19(3):332--383, Aug. 2001.
[10] C. L. Clarke, M. Kolla, G. V. Cormack,
O. Vechtomova, A. Ashkan, S. Büttcher, and
I. MacKinnon. Novelty and Diversity in Information
Retrieval Evaluation. In ACM Conference on Research
and Development in Information Retrieval (SIGIR),
pages 659--666, 2008.
[11] M. Drosou and E. Pitoura. Diversity over Continuous
Data. IEEE Data Engineering Bulletin, 32(4):49--56,
2009.
[12] M. Drosou and E. Pitoura. DisC Diversity: Result
Diversification Based on Dissimilarity and Coverage.
Very Large Data Bases (PVLDB), 6(1):13--24, 2012.
[13] M. Drosou and E. Pitoura. Dynamic Diversification of
Continuous Data. In Proceeding of the ACM
International Conference on Extending Database
Technology - EDBT, pages 216--227, 2012.
[14] M. Drosou, K. Stefanidis, and E. Pitoura.
Preference-Aware Publish/Subscribe Delivery with
Diversity. In ACM International Conference on
Distributed Event-Based Systems (DEBS), pages
6:1--6:12, 2009.
[15] G. Eisenhauer, F. Bustamante, and K. Schwan. Event
services for high performance computing. In
High-Performance Distributed Computing, 2000.
Proceedings. The Ninth International Symposium on,
pages 113--120, 2000.
[16] E. Gabrilovich, S. Dumais, and E. Horvitz. Newsjunkie:
Providing Personalized Newsfeeds via Analysis of
Information Novelty. In World Wide Web Conference
(WWW), pages 482--490, 2004.
[17] Z. Hmedeh, C. du Mouza, and N. Travers. A Real-time
Filtering by Novelty and Diversity for
Publish/Subscribe Systems. In International
Conference on Scientific and Statistical Database
Management (SSDBM), San Diego, USA, June 2015.
[18] Z. Hmedeh, H. Kourdounakis, V. Christophides,
C. du Mouza, M. Scholl, and N. Travers. Subscription
Indexes for Web Syndication Systems. In Proceeding of
the ACM International Conference on Extending
Database Technology - EDBT, pages 311--322, 2012.
[19] Z. Hmedeh, N. Vouzoukidou, N. Travers,
V. Christophides, C. du Mouza, and M. Scholl.
Characterizing Web Syndication Behavior and Content.
In Web Information System Engineering (WISE),
pages 29--42, 2011.
[20] M. Keikha, F. Crestani, and W. B. Croft. Diversity in
Blog Feed Retrieval. In ACM International Conference
on Information and Knowledge Management (CIKM),
pages 525--534, 2012.
[21] E. Minack, W. Siberski, and W. Nejdl. Incremental
Diversification for Very Large Sets: a Streaming-based
Approach. In ACM Conference on Research and
Development in Information Retrieval (SIGIR), pages
585--594, 2011.
[22] D. Panigrahi, A. Das Sarma, G. Aggarwal, and
A. Tomkins. Online Selection of Diverse Results. In
Web Search and Data Mining (WSDM), pages 263--272,
2012.
[23] K. Pripužić, I. P. Žarko, and K. Aberer. Top-k
Publish/Subscribe: Finding k Most Relevant
Publications in Sliding Time Window. In ACM
International Conference on Distributed Event-Based
Systems (DEBS), pages 127--138, 2008.
[24] Redis: Pub/Sub. http://redis.io/topics/pubsub.
[25] A. Rowstron, A.-M. Kermarrec, M. Castro, and
P. Druschel. Scribe: The design of a large-scale event
notification infrastructure. In J. Crowcroft and
M. Hofmann, editors, Networked Group
Communication (NGC), volume 2233 of Lecture Notes
in Computer Science, pages 30--43. Springer Berlin
Heidelberg, 2001.
[26] G. Salton, A. Wong, and C. S. Yang. A Vector Space
Model for Automatic Indexing. Commun. ACM,
18(11):613--620, 1975.
[27] B. Smyth and P. McClave. Similarity vs. Diversity. In
International Conference on Case-based Reasoning
(ICCBR), pages 347--361, 2001.
[28] N. Travers, Z. Hmedeh, N. Vouzoukidou, C. du Mouza,
V. Christophides, and M. Scholl. RSS feeds behavior
analysis, structure and vocabulary. International
Journal of Web Information Systems (IJWIS),
10(3):291--320, 2014.
[29] A. Walker. An Efficient Method for Generating
Discrete Random Variables with General Distributions.
ACM Transactions on Mathematical Software (TOMS),
3:253--256, 1977.
[30] P. Willett. An Algorithm for the Calculation of Exact
Term Discrimination Values. Information Processing
Management, 21(3):225--232, 1985.
[31] C. Yu, L. Lakshmanan, and S. Amer-Yahia. It Takes
Variety to Make a World: Diversification in
Recommender Systems. In Proceeding of the ACM
International Conference on Extending Database
Technology - EDBT, pages 368--378, 2009.
[32] Y. Zhang, J. Callan, and T. Minka. Novelty and
Redundancy Detection in Adaptive Filtering. In ACM
Conference on Research and Development in
Information Retrieval (SIGIR), pages 81--88, 2002.
... Sophisticated Pub/Sub Systems: Traditional Pub/Sub may have focused on single events; nevertheless, several approaches have tried to extend them to satisfy more expressive subscriptions and data information since then. Approaches involve Diversity in Pub/Sub [54][55][56] that produce diverse noti cations to tackle redundant information, Approximate Semantic Matching in Pub/Sub [39,57,58] that try to resolve the rigidness of subscription models by proposing approximate subscriptions and matchers, and Semantic Engines in Pub/Sub [59,60] that integrate data from heterogeneous sources or semantically enrich data with information deriving from external sources. These approaches have a range of advantages and disadvantages related to data expressiveness, usability, dependency on ontologies, thesauri and taxonomies, and they do not apply to entity-based publications that contain rich conceptual and contextual information. ...
... Hmedeh et al. [56] Description: This work is based on novel and diverse items in Web syndication. ...
Thesis
Full-text available
The Internet of Things (IoT) has contributed to physical devices generating entity-centric data (e.g. smart buildings). To bridge the gap between the devices’ data and the users’ interests, Publish/Subscribe systems (Pub/Sub) are suitable middleware to deal with dynamic large-scale IoT applications due to their decoupling traits. However, the IoT contains more challenges than dynamism related to data and users. Specifically, data can be voluminous and heterogeneous due to integration or enrichment as well as redundant or semantically similar due to the sensors’ spatial proximity. Existing approaches tackle semantic interoperability through ontologies and taxonomies resulting in rigidness, non-scalability, and domain-dependency. At the same time, users can either create representationally-coupled queries that could be complex (e.g. SPARQL), independent of their data knowledge and expertise, or simple queries that lead to redundant information, which can overwhelm them. Existing approaches either use complex queries or create high-level data abstractions that are either not usable or complex for dynamic environments and suffer from representational coupling. This thesis addresses these problems and analyses two research questions involving the formulation of a new Pub/Sub scheme; the Entity-centric Publish/Subscribe Summarisation System that involves user-friendly and contextually-aware subscriptions as well as extractive and abstractive summarisation approaches for the publications. Its goal is to address usability, user expressibility, data expressiveness, user and data effectiveness, and system efficiency. Three approaches are proposed; PubSum, IoTSAX, and PoSSUM. PubSum is a dynamic diverse entity summarisation of heterogeneous Linked Data streams through windowing policies, embedding-based DBSCAN clustering, and geometric-based top-k ranking. IoTSAX is a dynamic abstractive summarisation of heterogeneous numerical entity graph streams through enhanced Symbolic Aggregate approximation (SAX) and approximate rule-based reasoning. PoSSUM is an extractive and abstractive diverse summarisation of heterogeneous numerical and Linked Data streams through novel partly-incremental conceptual clustering based on embedding models and variance as well as contextual-based top-k ranking. As an example, doctors are not experts in query languages and are unaware of the content and representations of patient data in a system. The proposed system will require a simple patient-centric subscription that will create a summary as a notification. This summary will be abstractive by interpreting the shape of real-time health sensor readings and providing a high-level inference as well as extractive by including the most important and conceptually/contextually diverse information coming from external sources (e.g. personal information). The proposed system has been extensively evaluated by synthetic and real-world data from the domains of Healthcare and Smart Cities achieving comparable results in correctness and system performance. Specifically, PubSum, involving DBpedia data, achieves up to 92% reduction of forwarded messages, 69.3% duplication reduction, and 0.95 redundancy-aware F-score compared to traditional Pub/Sub, but at the expense of 4 times more latency, while achieving 6 times less latency and 3 times less memory compared to the state-of-the-art diverse entity summarisation with throughput ranging from 833 to 1,005 events/second. 
IoTSAX, involving real-world heterogeneous data related to Healthcare and Smart Cities, achieves up to 0.87 reasoning F-score, 98% reduction of forwarded messages, and outperforms the original SAX in approximation error (2 to 3 times less) and compression space-saving percentage when data redundancy occurs (from 71.75% to 94.99%) while maintaining similar or better latency and throughput. The latency is 2 to 3 times more compared to traditional Pub/Sub and the throughput ranges from 13.231 to 97.393 events/second. PoSSUM, involving real-world heterogeneous data, discovers up to 80% data diversity desire by users and achieves the best summary quality for more than half of the entities as well as the best conceptual clustering F-score from 0.69 to 0.83 compared to traditional Pub/Sub and the state-of-the-art diverse entity summarisation. Also, up to 0.95 redundancy-aware F-score and 99% message reduction compared to traditional Pub/Sub. Finally, it has less clustering processing time, scoring and memory consumption, and comparable latency and throughput.
... As future work, we propose to introduce a filtering system to reduce the number of delivered items to users. Based on the history of the item notified for each subscription, we want to filter out incoming item that does not satisfy novelty and diversity criteria as introduced in [34][35]. ...
Article
Full-text available
Content syndication has become a popular way for timely delivery of frequently updated information on the Web. Today, web syndication technologies such as RSS or Atom are used in a wide variety of applications spreading from large-scale news broadcasting to medium-scale information sharing in scientific and professional communities. However, they exhibit serious limitations for dealing with information overload in Web 2.0. There is a vital need for efficient real-time filtering methods across feeds, to allow users to effectively follow personally interesting information. We investigate in this paper three indexing techniques for users' subscriptions based on inverted lists or on an ordered trie for exact and partial matching. We present analytical models for memory requirements and matching time and we conduct a thorough experimental evaluation to exhibit the impact of critical parameters of realistic web syndication workloads.
Article
Full-text available
Purpose – The purpose of this paper is to present a thorough analysis of three complementary features of real-scale Really Simple Syndication (RSS)/Atom feeds, namely publication activity, item characteristics, and their textual vocabulary, which the authors believe are crucial for emerging Web 2.0 applications. Previous work on RSS/Atom statistical characteristics does not provide a precise and up-to-date characterization of feeds' behavior and content, a characterization that can be used to benchmark the effectiveness and efficiency of Web syndication processing/analysis techniques.
Design/methodology/approach – The authors' empirical study relies on a large-scale testbed acquired over an eight-month campaign in 2010, collecting a total of 10,794,285 items originating from 8,155 productive feeds. The authors deeply analyze feed productivity (types and bandwidth), content (XML, text and duplicates), and textual content (vocabulary and buzz-words).
Findings – The findings of the study are as follows: 17 per cent of feeds produce 97 per cent of the items; feeds' publication rate is formally characterized using a modified power law; the most popular textual elements are the title and description, with an average size of 52 terms; cumulative item size follows a lognormal distribution, varying greatly with feed type; 47 per cent of the feed-published items share the same description; the vocabulary largely does not belong to WordNet terms (4 per cent); vocabulary growth is characterized using Heaps' law and the number of term occurrences by a stretched exponential distribution; and the ranking of frequent terms does not vary significantly.
Research limitations/implications – Modeling the capacities of dedicated Web applications, defining benchmarks, and optimizing Publish/Subscribe index structures.
Practical implications – The study especially opens many possibilities for tuning Web applications, such as: an RSS crawler designed with a resource allocator and a refreshing strategy based on Gini values and their evolution, to predict bursts for each feed according to the category and class of targeted feeds; an indexing structure for matching items' textual content that takes into account item size for the targeted feeds, vocabulary size and term occurrences, vocabulary updates and the evolution of term ranks, and typo/misspelling correction; and filtering that prunes content duplicates across feeds and exploits term correlation to easily detect replicates.
Originality/value – A content-oriented analysis of dynamic Web information.
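For reference, the two growth laws invoked in the findings have standard functional forms; the study's fitted parameter values are not reproduced here.

```latex
% Heaps' law: vocabulary size V after n tokens have been observed
V(n) = K\,n^{\beta}, \qquad 0 < \beta < 1

% Stretched exponential tail for the number of occurrences x of a term
P(X \geq x) \;\propto\; \exp\!\big(-(x/x_0)^{c}\big), \qquad 0 < c < 1
```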
Conference Paper
Full-text available
News articles typically drive a lot of traffic in the form of comments posted by users on a news site. Such user-generated content tends to carry additional information such as entities and sentiment. In general, when articles are recommended to users, only popularity (e.g., most shared and most commented), recency, and sometimes (manual) editors' picks (based on daily hot topics) are considered. We formalize a novel recommendation problem where the goal is to find the closest yet most diverse articles to the one the user is currently browsing. Our diversity measure incorporates entities and sentiment extracted from comments. Given the real-time nature of our recommendations, we explore the applicability of nearest-neighbor algorithms to solve the problem. Our user study on real opinion articles from aljazeera.net and reuters.com validates the use of entities and sentiment extracted from articles and their comments to achieve news diversity, when compared to content-based diversity. Finally, our performance experiments show the real-time feasibility of our solution.
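The closest-yet-most-diverse trade-off can be sketched as a linear combination of content closeness and comment-level redundancy. The features, weight, and Jaccard choice below are illustrative assumptions, not the paper's measure.

```python
def recommend(current, candidates, k=3, lam=0.5):
    """Rank candidates by closeness to the browsed article minus overlap
    in (entity, sentiment) pairs mined from comments. Articles are dicts:
    {'terms': set, 'entity_sentiment': set}; lam trades closeness
    against comment-level diversity."""
    def jac(a, b):
        return len(a & b) / len(a | b) if a or b else 0.0
    def score(c):
        closeness = jac(current["terms"], c["terms"])
        redundancy = jac(current["entity_sentiment"], c["entity_sentiment"])
        return lam * closeness - (1 - lam) * redundancy
    return sorted(candidates, key=score, reverse=True)[:k]

cur = {"terms": {"budget", "vote"}, "entity_sentiment": {("gov", "+")}}
cands = [
    {"terms": {"budget", "tax"},  "entity_sentiment": {("gov", "-")}},
    {"terms": {"budget", "vote"}, "entity_sentiment": {("gov", "+")}},
]
print(recommend(cur, cands, k=1))  # prefers the close but less redundant article
```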
Article
Full-text available
The explosion of published information on the Web leads to the emergence of a Web syndication paradigm, which transforms the passive reader into an active information collector. Information consumers subscribe to RSS/Atom feeds and are notified whenever a piece of news (item) is published. The success of this Web syndication, now offered on Web sites, blogs, and social media, however raises scalability issues. There is a vital need for efficient real-time filtering methods across feeds, to allow users to effectively follow personally interesting information. We investigate in this paper three indexing techniques for users' subscriptions, based on inverted lists or on an ordered trie. We present analytical models for memory requirements and matching time, and we conduct a thorough experimental evaluation to exhibit the impact of critical workload parameters on these structures.
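To complement the inverted-list sketch above, here is a simplified ordered-trie variant: each subscription is stored as a lexicographically sorted path of terms, and matching walks the trie over the item's sorted terms. The paper's partial-matching variants are not reproduced.

```python
class TrieNode:
    __slots__ = ("children", "subs")
    def __init__(self):
        self.children = {}
        self.subs = []          # subscriptions ending at this node

class OrderedTrie:
    """Subscriptions stored as sorted term paths; an item matches every
    path it can realize with its own sorted terms (subset matching)."""
    def __init__(self):
        self.root = TrieNode()

    def subscribe(self, sub_id, terms):
        node = self.root
        for t in sorted(set(terms)):
            node = node.children.setdefault(t, TrieNode())
        node.subs.append(sub_id)

    def match(self, item_terms):
        terms, out = sorted(set(item_terms)), []
        def walk(node, i):
            out.extend(node.subs)
            for j in range(i, len(terms)):       # try each remaining item term
                child = node.children.get(terms[j])
                if child:
                    walk(child, j + 1)
        walk(self.root, 0)
        return out

trie = OrderedTrie()
trie.subscribe("s1", ["france", "election"])
print(trie.match(["election", "france", "poll"]))  # ['s1']
```

Ordering the terms ensures each subscription has a single canonical path, which keeps the trie compact and lets matching prune whole subtrees as soon as a required term is absent from the item.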
Conference Paper
Content syndication has become a popular way for timely delivery of frequently updated information on the Web. It essentially enhances traditional pull-oriented searching and browsing of web pages with push-oriented protocols. However, many Web syndication applications imply a tight coupling between feed producers and consumers and do not help users find, among all the information they receive, items with interesting and new content. We present the FiND Pub/Sub system, which integrates an in-memory filtering process based on keyword subscriptions. Unlike existing proposals, FiND is designed for real-time notifications on item streams. This demonstration illustrates the main features of the FiND system, namely (i) a scalable real-time notification process that fires when the most important terms of the subscription are matched, and (ii) tunable filtering by novelty and diversity to reduce user flooding.
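Notifying "when the most important terms of the subscription are matched" suggests a weighted partial-matching test. A toy sketch follows, where the per-term weights (e.g. TDV-like scores) and the threshold are assumptions, not FiND's actual mechanism.

```python
def important_terms_match(sub_weights, item_terms, theta=0.8):
    """Partial matching on the most important terms: notify when the
    matched terms carry at least a fraction theta of the subscription's
    total term weight. Weights and theta are illustrative."""
    total = sum(sub_weights.values())
    matched = sum(w for t, w in sub_weights.items() if t in item_terms)
    return total > 0 and matched / total >= theta

sub = {"pandemic": 0.7, "vaccine": 0.5, "europe": 0.1}
# 1.2 of 1.3 total weight is matched (~0.92 >= 0.8), so the user is notified
print(important_terms_match(sub, {"pandemic", "vaccine", "stocks"}))  # True
```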
Article
Recently, result diversification has attracted a lot of attention as a means to improve the quality of results retrieved by user queries. In this paper, we propose a new, intuitive definition of diversity called DisC diversity. A DisC diverse subset of a query result contains objects such that each object in the result is represented by a similar object in the diverse subset and the objects in the diverse subset are dissimilar to each other. We show that locating a minimum DisC diverse subset is an NP-hard problem and provide heuristics for its approximation. We also propose adapting DisC diverse subsets to a different degree of diversification. We call this operation zooming. We present efficient implementations of our algorithms based on the M-tree, a spatial index structure, and experimentally evaluate their performance.
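Since finding a minimum DisC diverse subset is NP-hard, a natural greedy heuristic builds an independent dominating set of the similarity graph: every object ends up within distance r of some chosen object, and chosen objects are pairwise more than r apart. The sketch below works over a plain list with a distance function rather than the paper's M-tree index, with r playing the role of the zooming radius.

```python
def greedy_disc(points, r, dist):
    """Greedy heuristic in the spirit of DisC diversity: pick an
    uncovered point, discard everything within distance r of it,
    and repeat until all points are covered."""
    uncovered = set(range(len(points)))
    chosen = []
    while uncovered:
        i = uncovered.pop()               # arbitrary uncovered point
        chosen.append(i)
        uncovered = {j for j in uncovered if dist(points[i], points[j]) > r}
    return [points[i] for i in chosen]

pts = [0.0, 0.1, 0.15, 1.0, 1.05, 2.0]
print(greedy_disc(pts, r=0.2, dist=lambda a, b: abs(a - b)))  # e.g. [0.0, 1.0, 2.0]
```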
Conference Paper
Blog distillation (blog feed retrieval) is a task in blog retrieval where the goal is to rank blogs according to their recurrent relevance to a query topic. One of the main properties of blog feed retrieval is that the unit of retrieval is a collection of documents as opposed to a single document as in other IR tasks. This collection retrieval nature of blog distillation introduces new challenges and requires new investigations specific to this problem. Researchers have addressed this problem by considering a wide range of evidence and information resources. However, previous work has not studied the effect of on-topic diversity of blog posts in blog relevance. By on-topic diversity of blog posts we mean that those posts that are about the query topic need to have high diversity and cover different sub-topics of the query. In this study, we investigate three types of on-topic diversity and their effect on retrieval performance: topical diversity, temporal diversity and hybrid diversity. Our experiments over different blog collections and different baseline methods show that on-topic diversity can improve the performance of the retrieval system. Among the three types of diversity, hybrid diversity, that considers both topical and temporal diversities, achieves the best performance.
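One plausible reading of the hybrid measure is a linear combination of the two components; the abstract does not specify the exact form, so the following is an assumption:

```latex
\mathrm{Div}_{\mathit{hybrid}}(B) \;=\; \lambda\,\mathrm{Div}_{\mathit{topical}}(B)
\;+\; (1-\lambda)\,\mathrm{Div}_{\mathit{temporal}}(B),
\qquad \lambda \in [0,1]
```

where Div_topical rewards coverage of different sub-topics among the on-topic posts of blog B and Div_temporal rewards their spread over time.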
Article
Evaluation measures act as objective functions to be optimized by information retrieval systems. Such objective functions must accurately reflect user requirements, particularly when tuning IR systems and learning ranking functions. Ambiguity in queries and redundancy in retrieved documents are poorly reflected by current evaluation measures. In this paper, we present a framework for evaluation that systematically rewards novelty and diversity. We develop this framework into a specific evaluation measure, based on cumulative gain. We demonstrate the feasibility of our approach using a test collection based on the TREC question answering track.
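The cumulative-gain measure developed in this line of work is commonly known as α-nDCG; in its standard formulation, the gain of the document at rank k is discounted for information nuggets already returned higher in the ranking:

```latex
G[k] \;=\; \sum_{i=1}^{m} J(d_k, i)\,(1-\alpha)^{r_{i,k-1}},
\qquad
\mathrm{DCG}[k] \;=\; \sum_{j=1}^{k} \frac{G[j]}{\log_2(1+j)}
```

Here J(d_k, i) = 1 if document d_k is judged to contain nugget i, r_{i,k-1} counts the documents ranked before k that contain nugget i, and α ∈ (0,1] penalizes redundancy; dividing by the DCG of an ideal ranking yields the normalized measure.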
Article
This paper considers the pattern of occurrences of words in text as part of an attempt to develop formal rules for identifying words indicative of content and thereby suitable for use as index terms. A probabilistic model is proposed which, with a suitable fitting of parameters, can account for the occupancy distribution of most words, both index terms and non-index terms; the parameters take quite different values for the two classes. In this model, each abstract receives word occurrences in a Poisson process. Abstracts can then be divided into classes such that all abstracts within a given class receive word occurrences at the same average rate. The appearance of a particular number of occurrences of some word within an abstract then gives information, in a Bayesian sense, on the class membership of that abstract. It is of central interest to determine the minimum number of classes that can account for the occupancy distribution of each word. Though more testing needs to be done, it may be concluded that the distribution of a very large majority of words can be accounted for by assuming three or fewer classes.
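The model described amounts to a finite Poisson mixture over abstract classes: with class proportions π_j and per-class occurrence rates λ_j, the probability that a word occurs k times in an abstract is

```latex
P(X = k) \;=\; \sum_{j=1}^{J} \pi_j\,\frac{\lambda_j^{\,k}\,e^{-\lambda_j}}{k!},
\qquad \sum_{j=1}^{J} \pi_j = 1
```

The central question is then the minimum J needed per word; the paper concludes that J ≤ 3 accounts for the distribution of the vast majority of words, with the fitted parameters separating index terms from non-index terms.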