TDV-based Filter for Novelty and Diversity in a Real-time
Pub/Sub System
[Extended Abstract]
Zeinab Hmedeh
University Paris X
Nanterre, France
z.hmedeh@u-paris10.fr
Cedric du Mouza
CEDRIC Lab. - CNAM
Paris, France
dumouza@cnam.fr
Nicolas Travers
CEDRIC Lab. - CNAM
Paris, France
nicolas.travers@cnam.fr
ABSTRACT
Publish/Subscribe (Pub/Sub) systems have been designed to face the exponential growth of information published on the Web by subscribing to sources of interest which produce flows of items. However, users may receive the same information several times, or information that does not contain any new content, and conversely miss information of interest hidden in the mass of items received. Pub/Sub systems consequently face a real challenge to efficiently filter relevant information. We propose in this paper a scalable approach for filtering news (items) which match user interests (expressed as subscriptions). Introducing for the first time the Term Discrimination Value (TDV) in this context, which measures how well a term discriminates an item, we filter out in real time items whose content has already been notified recently to the user, either in another item (filtering by novelty) or globally in his recent history (filtering by diversity). Our experiments illustrate the impact of our different parameters and confirm the scalability of our approach and the relevance of the notified results.
Categories and Subject Descriptors
H.2.4 [Database Management]: Systems
Keywords
Pub/Sub, Novelty & Diversity, Web Syndication, TDV
1. INTRODUCTION
Sources of information have been multiplying on the Web for several years, especially due to the success of news portals and social networks, which have become the most popular means of being informed in real time of published information. These sources produce more and more items [19] containing small pieces of information. Nowadays, the amount of data which has to be analyzed daily is so large that a user may miss information of interest.
(c) 2016, Copyright is with the authors. Published in the Proceedings of the
BDA 2016 Conference (15-18 November, 2016, Poitiers, France). Distri-
bution of this paper is permitted under the terms of the Creative Commons
license CC-by-nc-nd 4.0.
BDA 2016, November 15-18, 2016, Poitiers, France.
Thus, a given user can be lost on the Web, given both the number of available sources and the amount of data they provide [28]. Publish/Subscribe (Pub/Sub) systems (Redis [24], Scribe [25], Siena [9], Echo [15]) have been designed to deliver information of interest to end-users and to avoid wasting time on Web searches.
In Pub/Sub systems, the user defines interests (subscriptions) by means of topics, keywords, bookmarks, etc. These systems deliver items (notifications) according to the subscribers' criteria. Even if information is filtered through the matching process, users remain flooded by notifications [19]. Some proposals enhance the filtering process by removing redundant information (i.e., novelty [32, 10]) and/or by taking into account information diversification in the delivered items (i.e., diversity [14, 11, 23]), which is generally presented as a top-k issue. The Pub/Sub context however discards traditional top-k approaches, due to real-time notifications and the impossibility of removing an already-notified item from the past.
Very few works take relevance, novelty and diversity into consideration together in a real-time Pub/Sub context. Our Pub/Sub system [17] has a two-step process: matching and filtering. For matching keyword-based subscriptions, we assume the existence of an index [18] providing matched items to the corresponding subscriptions. The second step, presented in this work, is a filter by novelty and diversity. The difficulty in keeping a Pub/Sub system real-time is to evaluate novelty and diversity on the fly for every incoming item.
We propose in this paper an efficient real-time filtering approach for Pub/Sub systems based on the items' content, where information already delivered to a user is used to filter incoming items. Our contributions in this paper are:

- definitions for novelty and diversity in this particular context, along with a proposal for a weighting score adapted to the characteristics of items and subscriptions;
- an efficient filtering algorithm for real-time Pub/Sub systems based on novelty and diversity, which exploits redundancy between subscription histories;
- a validation which highlights the complementarity of novelty and diversity.
The paper is organized as follows. Section 2 defines items, novelty and diversity, which are used in Section 3 to present our system and its different optimizations. Section 4 discusses specific implementation choices for our system. Section 5 experimentally validates our approach. We compare our approach with existing systems in Section 6 and conclude in Section 7.
2. OVERVIEW OF OUR APPROACH
While the matching process relies on a subscription, the filtering of items is based on a set of notified items. Unlike top-k approaches, which are computed on the whole set of items to be notified and thus delay item delivery, each subscription in the Pub/Sub context is associated with a set of notified items that we call the subscription history. The decision to notify an item is made in real time, just after the matching process. This section presents an overview of our approach and the definitions adopted; its instantiation is presented in Section 4.
2.1 Items and Histories
In our context, we define an item as a set of terms. Each term is associated with a term weight, denoted $w_i$, used to compute distances and similarities. Weights are discussed in Section 4.1.

To compute novelty and diversity, a Pub/Sub system must keep the already notified items, in what we call a subscription history $H$. Each history is a time-ordered set of items linked to a subscription. Each time an item is notified for a subscription, it is added to its history.
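To make this data model concrete, the following minimal sketch (ours, not the paper's Java implementation; all names are illustrative) represents items as weighted term sets and histories as time-ordered lists. The `sum` and `info` fields anticipate the optimizations of Section 3.2.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    terms: dict            # term -> weight w_i (the TDV score of Section 4.1)
    tau: float = 0.0       # publication timestamp
    sum: float = 0.0       # sum of distances to later items (Section 3.2)
    info: dict = field(default_factory=dict)  # cached (novelty, dist) per pair

@dataclass
class History:
    """Time-ordered set of items already notified for one subscription."""
    items: list = field(default_factory=list)

    def add(self, item):
        self.items.append(item)   # items are appended in notification order

    def oldest(self):
        return self.items[0]      # I_o, the oldest notified item
```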
2.2 Novelty
The objective when filtering by novelty is to discard an item that does not contain new information with respect to the items in the subscription history, i.e., an item $I$ whose content is truncated from or similar to a previous item $I'$. Since, in our context, the history is time dependent, the novelty measure $new(I, I')$ should be asymmetric [32]: it tests how new an incoming item is w.r.t. an existing one, and not conversely. Finally, we can define the novelty of an incoming item $I$ with respect to an existing history $H$ by comparing $I$ with all items in $H$, one by one.

Definition 1 (Novelty item-history). Given a history of items $H$ and an item $I$, $I$ is said to be new with respect to $H$ iff:
$$\forall I' \in H,\quad new(I, I') \geq \alpha$$

We assume that the novelty threshold $\alpha$ is a parameter fixed by the user for his subscription, according to the desired or required item output rate. We study in Section 5 its default value and its impact on the filtering rate and quality, on histories, and on the performance of our system.
2.3 Diversity
Diversity captures a complementary kind of redundancy, since it measures whether the information contained in a given item is globally present in the set of recently notified items (segmented information). Filtering by diversity is complementary to filtering by novelty. The objective is to detect whether an incoming item conveys new information with regard to the whole set of notified items (the history) for a given subscription. The user's objective is to receive only items carrying different information; items are therefore filtered by their content. An item is thus worth sending to the user if the information it contains is not present in the set of items already notified. The degree of diversity of an item for a user w.r.t. his subscription history is measured as how much it can increase the average pairwise distance $dist(I, I')$ between the history's items [11]. Observe that to keep $D(H)$ and $D(H')$ (with $H'$ including $I$) comparable, we must remove an old item $I_o$ from $H$ before adding $I$. To satisfy the diversity criterion, $I$ must be on average more distant from all items in $H$ than at least one of the items in $H$. We chose $I_o$ to be the oldest item in $H$, assuming that $I_o$ is the most likely to be the most distant item, since its information is older and deprecated. Focusing on only one item of the history allows us to avoid a quadratic complexity and to scale the system up.

Definition 2 (Diversity of items). Assume a history of items $H$ where $D(H)$ is the average pairwise distance between its items. An item $I$ improves the diversity of $H$ if and only if:
$$D(H \cup \{I\} \setminus \{I_o\}) > D(H)$$
with $I_o$ the oldest item of $H$ and:
$$D(H) = \frac{1}{|H| \times (|H| - 1)} \sum_{I \in H} \sum_{I' \in H,\ I' \neq I} dist(I, I')$$

Observe that the two average distances must be comparable, so the number of items in the histories must be identical. Otherwise, if we compared $H$ to $H \cup \{I\}$, the new item $I$ would have to be far more distant from the items in $H$ to make it more diverse than in our proposition. This justifies our choice to interchange $I$ with the oldest item $I_o$, the most likely distant one, since the time elapsed between the two items naturally makes their information more distant.
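Definition 2 translates into the following sketch (quadratic on purpose, to mirror the definition; `dist` is any item distance, e.g. the TDV-weighted Euclidean distance of Section 4.3; Section 3 shows how to avoid recomputing $D(H)$ from scratch):

```python
from itertools import combinations

def avg_pairwise_distance(items, dist):
    """D(H): average distance over all ordered pairs of distinct items."""
    n = len(items)
    if n < 2:
        return 0.0
    # each unordered pair counts twice in the double sum of Definition 2
    return 2.0 * sum(dist(a, b) for a, b in combinations(items, 2)) / (n * (n - 1))

def improves_diversity(item, history, dist):
    """Definition 2: I improves H iff D(H U {I} \\ {I_o}) > D(H)."""
    h = history.items
    h_swapped = h[1:] + [item]   # replace the oldest item I_o by I
    return avg_pairwise_distance(h_swapped, dist) > avg_pairwise_distance(h, dist)
```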
Filtering process overview
To summarize our time-dependent filtering process, an incoming item $I$ which matches a subscription must verify novelty and diversity over the subscription history $H$. First, the novelty of $I$ is checked by comparing it with each item in $H$: if its novelty w.r.t. at least one item falls below the threshold $\alpha$, it is discarded for $H$. Second, the diversity of $H$ is compared to the diversity of $H \cup \{I\} \setminus \{I_o\}$: if $I$ increases the average distance, it is notified and added to $H$.

A subscription is said to be satisfied by an item only if both the matching and filtering processes are validated. For the matching process, this means that all the subscription's terms are contained in the item. For the filtering process, the item must pass through both novelty and diversity.
3. FILTERING IN REAL-TIME
In this section, we present our solution to quickly filter out items based on the novelty and diversity criteria. It also allows item histories to be efficiently stored and managed for all subscriptions.
3.1 Shared history
Since an item can belong to several histories, we must avoid storing each item once per history. A simple solution consists in storing the last $N$ notified items [32] for each subscription and factorizing histories by storing each item only once. However, the publication rate strongly differs from one source to another, and this approach impacts the filtering quality: important items can be removed too quickly (for a very active source), or a highly-filtering item could never disappear (for a rarely notified source). We conclude that the relevance of filtering would be impacted by such item-count-based histories.

To optimize memory consumption, we adopt a shared history, which is basically a time-based sliding window $W$ containing all the items notified at least once during the last period $p$. Subscription histories are stored as ordered sets of pointers to the related items in $W$. Figure 1 presents an example of a sliding window $W$ and two subscriptions $S_1$ and $S_2$ with their corresponding histories and pointers to the shared history.

Figure 1: Example of a sliding window
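A minimal sketch of this layout (our illustration of Figure 1; names are hypothetical): the window stores each item exactly once, and each subscription history only holds references into it.

```python
from collections import deque

class SlidingWindow:
    """Shared history W: every item notified at least once during the
    last `period` time units, stored exactly once."""
    def __init__(self, period):
        self.period = period
        self.items = deque()              # ordered by publication time

    def add(self, item):
        self.items.append(item)

    def expire(self, now, histories):
        """Drop items older than the period, together with the pointers
        that subscription histories hold on them."""
        while self.items and self.items[0].tau <= now - self.period:
            old = self.items.popleft()
            for h in histories:
                # a history is time-ordered, so an expired item, if
                # present, is necessarily at its front
                if h.items and h.items[0] is old:
                    h.items.pop(0)
```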
3.2 Shared-history Filtering algorithm
Filtering by novelty and diversity with a large number of subscriptions which share common items raises a real optimization challenge. Indeed, a naïve algorithm which checks an incoming item for novelty and then for diversity against the histories of all the subscriptions it matches has the following cost (Definitions 1 and 2):
$$C_{filter}(I, S) = \sum_{s \in S} \sum_{I' \in H_s} new(I, I') + \sum_{s \in \rho(S)} D(H_s \cup \{I_n\} \setminus \{I_o\})$$
where $S$ corresponds to the set of subscriptions matched by the incoming item $I$ and $\rho(S)$ represents the fraction of $S$ for which $I$ satisfied the novelty threshold. Assume that the term weights are computed and considered constant, that the average history size is $N_H$ items (number of computations per history), and that the average item size is $S_I$ (the time for each computation is based on the item size). Since the cost depends on the number of computations per history and the time for each computation is based on the item size, the average complexity of this algorithm is:
$$C_{filter}(I, S) = O(|S| \cdot N_H \cdot S_I) + O(|\rho(S)| \cdot (N_H \cdot S_I)^2)$$
The experiments in Figure 2 show that novelty has a filtering rate proportional to the chosen threshold $\alpha$. This results in $|S| \sim |\rho(S)|$ and in a globally quadratic complexity:
$$C_{filter}(I, S) = O(|S| \cdot (N_H \cdot S_I)^2)$$
To achieve Web scaling, we propose to optimize the filtering of incoming items by novelty and diversity by sharing the filtering process across all subscriptions. As explained previously, the quadratic complexity of the diversity computation makes it preferable to compute novelty first. However, browsing $H$ several times to process similarities can be costly, especially as novelty alone does not filter enough (see Section 5.2.1). To avoid scanning a history twice, both filters are applied in one pass. Algorithm 1 presents the processing with shared histories and optimized computations for novelty and diversity.
Algorithm 1: Novelty and diversity filtering on a history
Require: an item I, a history H, and a novelty threshold α ∈ [0, 1]
 1: sum_H ← 0;
 2: sum_I ← 0;
 3: I_o ← H[0]; // oldest item
 4: for all I' ∈ H do
 5:   if I.getInfo(I') = null then
 6:     N ← novelty(I, I');
 7:     d ← dist(I, I');
 8:     I.putInfo(I', N, d);
 9:   else
10:     N ← I.getInfo(I').N;
11:     d ← I.getInfo(I').d;
12:   end if
13:   if N < α then
14:     return;
15:   end if
16:   if I' ≠ I_o then
17:     I.sum ← I.sum + d;
18:   end if
19: end for
20: if I.sum > I_o.sum then
21:   for all I' ∈ H do
22:     I'.sum ← I'.sum + I.getInfo(I').d
23:   end for
24:   H ← H ∪ {I};
25:   Notify I;
26: end if

The algorithm processes each item $I'$ in $H$. Since $I$ must be compared to $I'$ each time it appears in a history, we compute $new(I, I')$ and $dist(I, I')$ only once, to benefit from the $(I, I')$ co-occurrences. Thus we check whether this value has already been computed for another subscription. If not, we compute
and register it (lines 6-7-8); otherwise we just retrieve the stored value (lines 10-11). If $I$ is not new, the algorithm stops (lines 13-14). Remember that, as explained previously, the quadratic complexity of the diversity computation makes it preferable to compute novelty first, and that diversity requires computing the average pairwise distances between the items of $H$. We therefore accumulate the distance $dist(I, I')$ with the others from $H$ (line 17), but only if $I'$ is not the oldest item $I_o$ (line 16). Secondly, as explained below, the diversity computation can be simplified to the comparison between the sums of distances from $I_o$ and from $I$ (line 20). In that case, the sums of distances are updated (lines 21-22), and $I$ is added to the history and notified (lines 24-25).
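For readability, here is our Python transcription of Algorithm 1, building on the Item/History sketches above (a sketch: the `info` dictionary plays the role of getInfo/putInfo and caches the novelty and distance of each (I, I') pair across the subscriptions that share it; per the stored-value invariant of I.sum below, the newly inserted item starts at zero, since no later item exists yet):

```python
def filter_on_history(item, history, alpha, novelty, dist):
    """Novelty then diversity filtering of `item` over one history.
    Returns True iff the item must be notified for this subscription."""
    oldest = history.items[0]                    # I_o
    item_sum = 0.0                               # distances from I to H \ {I_o}
    for old in history.items:
        cached = item.info.get(id(old))          # already computed for another
        if cached is None:                       # subscription?
            cached = (novelty(item, old), dist(item, old))
            item.info[id(old)] = cached
        n, d = cached
        if n < alpha:                            # not new w.r.t. one item: stop
            return False
        if old is not oldest:
            item_sum += d
    if item_sum > oldest.sum:                    # diversity test: I.sum > I_o.sum
        for old in history.items:
            old.sum += item.info[id(old)][1]     # maintain the stored sums
        item.sum = 0.0    # I is now the latest item: no later distances yet
        history.items.append(item)               # H <- H U {I}; notify I
        return True
    return False
```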
To summarize, our algorithm integrates two main optimizations. The first one exploits the high probability of computing the similarity and the distance of each pair of items $(I, I')$ several times. Subsequent computations on a pair are constant-time and no longer depend on the item size $S_I$. These values are stored during the filtering process of $I$ and deleted when there is no more subscription to check. The gain depends on the co-occurrence ratio $\sigma \in [0, 1]$ of items in subscription histories, defined from the number of co-occurrences of item pairs checked during the filtering step over the total number of pairs required:
$$\sigma = 1 - \frac{\#cooccurrences}{\#pairs}$$
The second optimization deals with the computation of the density, which changes for each notified item. To avoid the quadratic complexity of computing the sum of pairwise distances in $H$, we propose in our algorithm to store, for each item of the history $H$, the computed sum of distances $I.sum$ to all items received later. The density of $H$ is then the sum of the $I.sum$ values. Since the oldest items are removed along with their stored values, no other update has to be done for the remaining items. Furthermore, the computation of distances also benefits from the $\sigma$ co-occurrence gain. Formally, $I.sum$ is a stored value equal to:
$$I.sum = \sum_{I' \in H \,\wedge\, I'.\tau > I.\tau} dist(I, I')$$
Since diversity is the comparison between $D(H')$ and $D(H)$ with $H' = H \cup \{I_n\} \setminus \{I_o\}$, checking whether an incoming item increases the diversity or not may be simplified as follows:
$$\frac{2 \times \sum_{I' \in H'} I'.sum}{|H'| \times (|H'| - 1)} > \frac{2 \times \sum_{I' \in H} I'.sum}{|H| \times (|H| - 1)}$$
$$|H| = |H'| \;\Rightarrow\; \sum_{I_k \in H'} I_k.sum > \sum_{I_k \in H} I_k.sum$$
$$H' = H \cup \{I_n\} \setminus \{I_o\} \;\Rightarrow\; \sum_{I_k \in H} I_k.sum + I.sum - I_o.sum > \sum_{I_k \in H} I_k.sum$$
So the diversity test consists in checking whether:
$$I.sum > I_o.sum$$
To conclude, the complexity of our algorithm benefits from the co-occurrence between items for the novelty and density computations, which results in the following linear complexity.

Proposition 1 (Shared-history complexity). Algorithm 1 has a linear complexity w.r.t. the number of subscriptions matched by the incoming item, the average history size, and the item size.

Proof. Let $\sigma$ denote the average co-occurrence ratio between items, $|S|$ the number of subscriptions matched by the incoming item, $N_H$ the average history size, and $S_I$ the average item size. Then, with the shared-history management, the filtering cost of Algorithm 1 is:
$$C_{filter}(I, S) = C_{nov}(I, S) + C_{div}(I, \rho(S)) = O(|S| \cdot \sigma \cdot N_H \cdot S_I) + O(\alpha |S| \cdot \sigma \cdot N_H \cdot S_I) = O(|S| \cdot N_H \cdot S_I)$$

The complexity of computing $D(H)$ and $D(H \cup I)$ is about $O(|H|)$, while that of computing the average pairwise distance between the items of $H$ with the classical method is about $O(|H|^2)$. This optimization in computing the density reduces the processing time.
4. DISCUSSION ON METRICS
4.1 Weighting terms with TDV
The definitions of item novelty and diversity are both based on the weights of their terms. Several term weighting models have been proposed in the literature, like Term Frequency (TF) combined with Inverse Document Frequency (IDF) [3], the Term Discrimination Value (TDV) [26], or Term Precision [6].

In our context, where items are short sets of terms (social network items), weights based on term frequencies, like the widespread TF-IDF function, are inappropriate, since it is very unlikely to have multiple occurrences of a term within an item. Consequently, for an item $I$, the TF is generally equal to $\frac{1}{|I|}$ and TF-IDF turns into an IDF score, so the less frequent terms get the highest scores. This motivates our choice to weight terms independently of their number of occurrences in items, using the TDV score. This choice is validated by our experiments (Table 3).

The TDV function measures how much a term helps to distinguish a set of documents (i.e., its impact on the global entropy). Consequently, neither a frequent term nor an uncommon one has an important TDV value [30]. This is, as far as we know, the first time TDV is used in such a context. The TDV value represents the capacity of a term to make items more similar globally. The discrimination value of a term $t_k$ is then the difference of density between the item set with $t_k$ and without $t_k$. We compute the density as the average pairwise similarity between distinct items:
$$\Delta(\mathcal{I}) = \frac{1}{|\mathcal{I}| \times (|\mathcal{I}| - 1)} \sum_{I \in \mathcal{I}} \sum_{I' \in \mathcal{I},\ I' \neq I} sim(I, I')$$
where $sim(I, I')$ corresponds to a similarity function between items, for instance the cosine similarity. Finally, the TDV value of a term $t_k$ is:
$$tdv(\mathcal{I}, t_k) = \Delta(\mathcal{I} \setminus \{t_k\}) - \Delta(\mathcal{I})$$
For simplicity, we write $tdv(t_k)$ instead of $tdv(\mathcal{I}, t_k)$.
4.2 Novelty
Novelty checks whether the information in $I$ has already been delivered. For example, if $I'$ contains $I$ and appends additional information, then $I$ is not new compared to $I'$, but $I'$ is new compared to $I$. Therefore, symmetric measures like the Jaccard measure are not suitable. Consequently, we adopt the following measure, inspired by Newsjunkie [16], for the novelty of an item compared to another one from the history:

Definition 3 (Novelty item-item). Let $\alpha \in [0, 1]$ be a novelty threshold, and $I$ and $I'$ two items. $I$ is said to be new compared to $I'$ if and only if:
$$new(I, I') = \frac{\sum_{t \in I \setminus (I \cap I')} tdv(t)}{\sum_{t \in I} tdv(t)} \geq \alpha$$

This measure computes the weighted coverage of the terms of $I$ that are not present in $I'$, relative to the cumulative weight of all the terms of $I$. Note that we chose the TDV value as term weight, according to the discussion above.
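Definition 3 in code (a sketch; items are sets of terms and `tdv` is a precomputed term-to-weight mapping). Note the asymmetry: if $I'$ contains all of $I$, then $new(I, I') = 0$ while $new(I', I)$ may be large.

```python
def novelty(I, I_prime, tdv):
    """new(I, I'): weighted coverage of the terms of I absent from I'."""
    total = sum(tdv.get(t, 0.0) for t in I)
    if total == 0.0:
        return 0.0
    fresh = sum(tdv.get(t, 0.0) for t in I - I_prime)  # I \ (I ∩ I')
    return fresh / total
```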
4.3 Similarity in Diversity
To measure diversity, we need to compute the distance between items. Several distance measures have been proposed in the literature to compute diversity on a set of documents. The most frequently used are Cosine [32], Euclidean [12, 23] and Jaccard [13], but we can also cite Pearson (derived from Cosine), Dice (derived from Jaccard), or Levenshtein. For short items, the Euclidean distance is known to produce more relevant results [4]. Thus we consider in our system a diversity function based on a Euclidean distance weighted by TDV values. The comparison between these different measures will be studied in future work.
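One natural instantiation, which we sketch here under the assumption that items are viewed as binary term vectors scaled by their TDV weights, is:

```python
def tdv_euclidean(I, I_prime, tdv):
    """Euclidean distance between two items viewed as TDV-weighted binary
    term vectors: only terms present in exactly one item contribute."""
    return sum(tdv.get(t, 0.0) ** 2 for t in I ^ I_prime) ** 0.5
```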
5. EXPERIMENTS
In this section, we study the impact of several parameters (i.e., the novelty threshold, diversity, and the size of the sliding window) on a real dataset of items. We measure in particular their impact on the filtering rate, on the size of the histories, and on performance. Finally, thanks to a user validation, we study the quality of our system under different settings and against a periodic filtering based on a top-k approach.
5.1 Implementation and description of datasets
We implemented the system in standard Java v1.6.0_20. All experiments were run on a 3.60 GHz quad-core processor with 16 GB of JVM memory.
Figure 2: Filtering rate by varying novelty threshold
For our experiments, we used a subset of a real dataset of items acquired during an 8-month campaign from March to October 2010 [28]. The set of items considered corresponds to the first week of October (258,480 items).

The alias sampling method [29] was used to generate 10M subscriptions which follow the distribution of term occurrences on the Web and the Web query sizes reported in [5]. The vocabulary of 1.5M distinct terms extracted from the items is used to generate the subscriptions. Subscriptions have a maximal size of 12 terms and an average size of 2.2 terms. Note that only 5.28 million subscriptions are satisfied at least once during the studied week, which means that 5.28 million subscriptions pass through both the matching and filtering processes.
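For reproducibility, here is a sketch of Walker's alias method [29], the sampler behind our subscription generation (variable names are ours): O(n) preprocessing, then O(1) per draw from an arbitrary discrete distribution.

```python
import random

def build_alias(probs):
    """Preprocess a discrete distribution (probs sums to 1) into the
    accept/alias tables of Walker's method."""
    n = len(probs)
    scaled = [p * n for p in probs]
    accept, alias = [1.0] * n, [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        accept[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]       # l donates mass to s's bucket
        (small if scaled[l] < 1.0 else large).append(l)
    return accept, alias

def draw(accept, alias):
    """O(1) sampling: pick a bucket, then accept it or take its alias."""
    i = random.randrange(len(accept))
    return i if random.random() < accept[i] else alias[i]
```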
5.2 Filtering rate
We study in this section the impact of the novelty threshold and of diversity on the filtering rate, as well as that of the number of subscriptions and the window size. The results presented in this section correspond to the average filtering rate (number of notified items over the number of items that match the subscription) of the subscriptions satisfied at least once during the last day of the studied week.
5.2.1 Impact of the novelty threshold
Figure 2 shows the novelty filtering rate when varying the novelty threshold, for a window size of 24 hours (dashed line). We observe that the filtering rate increases linearly with the novelty threshold. We notice that 38% of items are filtered when the novelty threshold is set to 50%, i.e., when half of the information is non-redundant. We recall that an item's novelty is based on its weighted coverage (Definition 3). On average, only 20% of the items that satisfy a subscription do not contain redundant information: 80% of the items are filtered out when the novelty threshold is equal to 100%.
5.2.2 Impact of diversity
Figure 2 also illustrates that filtering by diversity reduces the number of items to notify (solid line). Diversity acts as a strong filter, since the filtering rate when considering only diversity (i.e., a novelty threshold of 0%) is equal to 64.34%. Figure 2 shows that novelty and diversity are complementary filters. Observe that the filtering rate increases only slightly with the value of the novelty threshold when diversity is also considered (from 64.34% for a novelty threshold of 0% up to 82% for a threshold of 100%). However, the benefit of having both filters is twofold, since filtering by novelty additionally decreases the number of items to consider for the costly diversity computation.

Table 1: Filtering rate by varying window size
  Window size | Filtering rate
  12 H        | 61.06%
  24 H        | 70.71%
  48 H        | 75.93%

Table 2: Number of subscriptions & notifications w.r.t. subscription size
  |s| | # of subscriptions satisfied | Average # of items | Filtering rate
  1   | 2,030,375                    | 505.31             | 86.79%
  2   | 1,804,265                    | 21.94              | 57.71%
  3   | 293,666                      | 4.98               | 42.88%
  >3  | 28,776                       | 2.45               | 35.52%

For the following experiments, we set the novelty threshold to 50% (best quality in Table 3) and take diversity into account in the filtering process.
5.2.3 Impact of window size
Table 1 shows the filtering rate for different sliding window sizes. The size of the window impacts the filtering rate: with a larger window, items stay longer in the histories and are used to filter new incoming items. Although larger sliding windows increase history length (see next section) and items notified in them stay longer in the histories, the information remains diverse enough to generate new notifications. For the following experiments, the sliding window size is set to 24 hours.
5.2.4 Impact of subscriptions size
Table 2 presents the distribution of subscriptions by size, which follows that of Web queries [5]. Most subscriptions are short (fewer than 4 terms). We can also note that the number of notified items per subscription decreases drastically with the subscription size: while short subscriptions are often matched (>500 items/day), large subscriptions are rarely notified (<5 items/day).

According to our results, the filtering rate is highly dependent on diversity (present or not) as well as on the novelty threshold, but subscription size also has a significant impact.
5.2.5 Histories size
We capture the variation of the history size over time: we measure the average size every six hours over the studied week, with three different sliding window sizes. Figure 3 shows this variation for a novelty threshold of 50%; the values presented are the average history sizes of the subscriptions satisfied at least once in the first six hours of the week (3.35 million subscriptions). Note that the size of a history at time $\tau$ is the number of its items within the window of the considered period $p$ (i.e., items published after $\tau - p$).

Figure 3: Variation of histories size over time

During the initialization phase, histories become larger with large sliding windows. The peak of each sliding window corresponds to the accumulation of items, which ends after one window-size period (12/24/48 H). The accumulation is due to the fact that empty histories do not play their filtering role: since the density keeps growing during this initialization step, there is almost no filtering by diversity and most items are notified. After this initialization period, the items which were greedily added to the histories at the beginning leave the sliding window, which leads to a drastic decrease of the history sizes during one window-size period. The first items disappear, which explains the gap between the peak and the trough exactly one period later. The same effect occurs with greater magnitude for the 12 H and 48 H sliding windows: a small sliding window empties quickly and must restart the density computation, while a large sliding window empties slowly and filters a lot. The 24 H sliding window has a more stable behavior, since it fills up and empties at an appropriate rate. The history then fulfills its filtering role for novelty and diversity, and its size stabilizes. We also measured this variation with different novelty thresholds and confirmed these conclusions. The initialization phase corresponds to the diversification of the histories.

Another conclusion from Figure 3 is that the history size depends on the window size. For 12 and 24 H, the number of items in a history is globally proportional to the window size (10/20 items), while for 48 H it reaches 70, more than proportionally. Even if old items contribute to diversifying the information, the growth of the filtering rate (Table 1) is not proportional to the window size; the history, on the other hand, needs a greater number of items to filter by diversity.
5.3 Performances evaluation
As presented in Section 3.2, we evaluate here three different implementations of our system: a Naïve approach without optimization, a Co-occurrence approach exploiting the co-occurrence ratio $\sigma$ of items, and the Diversity approach, which additionally pre-computes and stores densities in every history.
5.3.1 Memory requirements
Since the Co-occurrence approach stores extra values only during the filtering process, the amount of space used by this implementation is equal to that of the Naïve implementation. Consequently, we only compare the Naïve and Diversity implementations.

Figure 4: Memory space vs novelty threshold

Figure 4 shows the memory space used by the sliding window and the subscription histories for various novelty thresholds. When the filtering rate increases, fewer items are stored in the sliding window, which reduces the memory consumption of both the optimized and the normal implementations. The Diversity implementation requires more memory space, since the sums of distance scores are precomputed and stored for each history. Observe that it consequently requires a memory space proportional to the size of the histories, hence inversely proportional to the filtering rate observed in Figure 2. For instance, for a novelty threshold of 50% we require 2,866 MB of memory, while for a threshold of 100% (+16% filtering rate) we require only 2,387 MB (-16.68%).
Figure 5: Memory space vs number of subscriptions

Figure 5 illustrates the variation of the memory consumption when varying the number of subscriptions. For this experiment, the filtering rate and the average sliding window size are fixed. We observe that the memory space increases linearly w.r.t. the number of indexed subscriptions in both implementations, since each history stores information linked to the sliding window. The Diversity implementation requires more space than the Naïve version to store extra information, but the ratio remains constant at 2.4: the Naïve implementation uses 399 MB (resp. 1,009 MB) while the Diversity optimization uses 1,227 MB (resp. 2,866 MB) for 2M (resp. 10M) subscriptions.
5.3.2 Processing time
We now study the gain in processing time obtained with the optimizations of our system.

Figure 6: Processing time when varying the novelty threshold

Figure 6 shows that the average time (in log scale) decreases with the novelty threshold, and therefore with the history size. The Naïve implementation requires much more computing time, especially for low novelty thresholds. The rationale lies in the co-occurrence optimization, which reduces the number of similarity and distance computations: the Naïve implementation is on average 5 times more costly than the optimized ones, except for high thresholds, where histories are short and few similarities/distances are computed. Moreover, the difference between the Co-occurrence and Diversity results decreases with the size of the histories, which depends on the novelty threshold: the gain is 68% for a novelty threshold of 0% and 13% for a novelty threshold of 80%, thanks to the $O(1)$ complexity (finding $I_o.sum$) of the diversity computation.

Figure 7: Processing time for different sizes of sliding windows

Since the processing time mainly depends on the history size, it also depends on the sliding window size, especially for the Naïve and Co-occurrence implementations, whose computation time grows faster, as shown in Figure 7. In fact, the computation of diversity depends on the sliding window size. In contrast, the processing time of the Diversity implementation exhibits a moderate increase, except for large window sizes (48 H), where histories are larger, which means more distance computations and updates of sums. Indeed, Diversity stores the sums of distances between items, while the Co-occurrence implementation has to recompute them. Since larger windows filter more (Table 1), the number of distance computations per history grows accordingly, except for the Diversity implementation, which adds those values the first time an item is notified, whatever the number of history updates. In contrast, the Naïve implementation computes distances every time, even when histories are updated; it requires 21 to 31 times more time than the optimized solutions.
Figure 8: Processing time when varying the number of subscriptions

As we can see in Figure 8, the processing time increases linearly with the number of subscriptions for both optimizations, while the Naïve implementation increases very fast, since no co-occurrence between subscriptions is exploited. For the Co-occurrence and Diversity implementations, the growth was expected to be sub-linear, since similarity and distance computations between items are stored during the process to avoid their re-computation; with the growth of the number of subscriptions, the probability of finding the same pair of items in different subscriptions grows. The gain for Co-occurrence is far more significant (-93%) than for Diversity (-63%), since the similarity and distance functions are very costly (compared to the sums for diversity). Nevertheless, the Diversity implementation needs on average 2.7 times less time than the Co-occurrence one.
5.4 Quality of Filtering
In this section, we study the quality of our filtering step against users' behavior. To compute the relevance of our system, we compare the items chosen by users with those selected by our system. To validate our choices, we compare the quality of filtering when changing the weighting score, the novelty similarity, and its threshold. We also compare our real-time filtering with a top-k algorithm [14].

To this end, we extracted 10 subscriptions¹ for which we gathered the matched items. We then asked users to manually filter the items according to the novelty and diversity of their information. Users had to read the texts and decide whether an item was new or whether its information was globally contained in previous items. In order to preserve our real-time filtering context, items were displayed in sequence, so as to be filtered in chronological order, and the histories were shown to the users. 60 users, academics and PhD students in computer science, performed 106 validations on our subscriptions. Since filtering by novelty is easier than filtering by diversity, we kept items in the result set only if they were chosen by more than 60% of the users (75% for novelty), giving more weight to diversity.
The top-k algorithm [14] determines the $k$ most distant items from a set of items satisfying subscriptions, in order to achieve diversity. A result set is initialized with the two most distant items among the items satisfying the subscription, and extended with the next most diversifying items. Each subscription has its own value of $k$, equal to the history size generated by our approach; having the same size makes the result sets comparable for the quality measurement. Moreover, this algorithm cannot take novelty into account, since novelty is an asymmetric measure based on time. Recall that our window-based approach relies on the time assumption, which means that none of the notified items can be removed from the result set, while the top-k algorithm may remove a previously chosen item to choose another one in a following snapshot.
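For comparison, here is a sketch of our simplified reading of this top-k diversification baseline [14]: seed with the most distant pair, then greedily add the candidate that maximizes its minimum distance to the chosen set (assumes k >= 2).

```python
from itertools import combinations

def topk_diverse(candidates, k, dist):
    """Greedy diversification: seed with the most distant pair, then
    repeatedly add the candidate farthest from the already chosen set."""
    if len(candidates) <= k:
        return list(candidates)
    seed = max(combinations(candidates, 2), key=lambda p: dist(p[0], p[1]))
    chosen = list(seed)
    remaining = [c for c in candidates if c not in chosen]
    while len(chosen) < k:
        best = max(remaining, key=lambda c: min(dist(c, s) for s in chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```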
Table 3 shows the average precision, recall and F-measure over all the subscriptions, compared to the user result set.

Table 3: Filtering relevance with various techniques, thresholds and metrics
              Diversity  top-k  coverage  coverage  coverage  TF-IDF  TF-IDF  Jaccard
              only              25%       50%       75%               50%     50%
  Precision   0.782      0.711  0.930     0.939     0.944     0.764   0.884   0.916
  Recall      0.698      0.634  0.652     0.652     0.610     0.618   0.545   0.652
  F-Measure   0.726      0.660  0.732     0.736     0.710     0.626   0.646   0.729

¹Subscription items and user results are available at:
http://cedric.cnam.fr/traversn/research/FiND/userset/

Different settings of our system were tested to find the most relevant measure for our filtering step: the diversity step without novelty, the top-k approach result set, different thresholds for the weighted coverage with diversity, replacing TDV with the standard TF-IDF with and without novelty, and finally novelty computed with the Jaccard distance. In particular, we compare our weighted-coverage novelty measure (Definition 3) with the standard Jaccard similarity for different thresholds, as well as the relevance of our TDV term weights versus TF-IDF, both for diversity and for novelty. We also study the behavior of the top-k algorithm.

We can see that the combination of diversity and novelty produces better results than diversity alone, especially for the precision of the result. However, the recall of the result set decreases when using novelty, which can be too selective where diversity is not selective enough. As expected, TF-IDF weights do not perform well, since items are short, so the TF is low and only the IDF is really taken into account; with a low precision (0.884) and recall (0.545), it gives among the lowest F-measures of our tests. Regarding novelty, the lack of asymmetry and of term weights makes the Jaccard measure less relevant for the precision of the result set. Finally, the top-k technique is not as relevant as our solution, since its interchange algorithm for choosing the most diverse items does not rely on the real-time assumption used for the user validation. Our technique, real-time filtering using a TDV-weighted coverage measure for novelty with a threshold of 50%, thus gives a good accuracy.
6. RELATED WORK
When searching for a document on the Web, we generally assume that the set of queried documents is static and already known. The objective of search engines is to present to users a ranked list of the $k$ most relevant and diverse documents matching a query. To achieve this, some models are based on probabilities for matching and diversity [2], or on graphs to compute distances [12] between items and select their minimum representative set. Others propose to modify diversity measures by focusing on uncommon attributes between items based on user-defined filters [31], by defining a trade-off between similarity and diversity [27], by integrating entities and sentiment in a greedy Max-Min algorithm [1], by defining time-based distances with a Gaussian similarity for blog retrieval [20], or by comparing an item with the compression of all previous texts, as with the NCD distance [8]. [7] proposes the Maximal Marginal Relevance (MMR) method, which combines query relevance and document diversity to compute the scores used to rank the result set: a document is ranked high if it is similar to the query and dissimilar to the previously selected documents.
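The MMR selection of [7] can be sketched as follows (our illustration; λ trades off query relevance against redundancy with the already-selected documents):

```python
def mmr_select(candidates, query, sim, k, lam=0.5):
    """Iteratively pick documents that are relevant to the query (weight
    lam) yet dissimilar to the already selected ones (weight 1 - lam)."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda d: lam * sim(d, query)
                   - (1 - lam) * max((sim(d, s) for s in selected), default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected
```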
Also, to solve the problem of query ambiguity in IR, [10] proposes a probabilistic model to rank documents by taking into account their novelty and diversity. Globally, these techniques process large texts in a static top-k evaluation and cannot be adapted to our context, since we consider small items, which changes the relevance of the previous methods. Moreover, real-time delivery of information is an important constraint that cannot be ignored.
Some approaches focus on continuous filtering, as in the Pub/Sub context, combined with top-k techniques. They may be based on fixed-size windows, in order to guarantee the number of items kept in the system, like [13], which uses a dynamic index to quickly decide whether an item is diverse or not on a frequently updated snapshot of items, [16], which focuses only on novelty with entities extracted from items, [21], which presents an incremental approach to diversification while integrating time to weight items, or [22], which reduces items to a small set of topics allowing a simple coverage distance with a set of items (not adapted to high-dimensional comparison). However, fixed-size windows hardly manage the different notification rates of subscriptions: low rates will keep very old items to filter out incoming items, while high rates will remove the recent items that should filter duplicates.
The closest approach to our solution is [14], which uses top-k windows to compute diversity with real-time delivery. It is based on an interchange algorithm which notifies an item if exchanging it with one from the previous top-k increases the diversity. However, this solution can deliver items from previous windows if they are considered non-diverse, or remove items from the past used by future filtering steps. As we saw in our experiments, this approach tends to diversify information locally, but not over time. Moreover, keeping all items leads to scalability issues.
7. CONCLUSION AND FUTURE WORK
In this paper, we present a Pub/Sub system which filters by novelty and diversity on the fly. The filtering is based on the items already notified to a user, and we choose a time-based sliding window to manage the subscription histories. Our main contributions are (a) the proposition of TDV to weight terms, combined with (b) a weighted coverage measure for novelty which is asymmetric and adapted to small items, (c) the design of an optimized system which factorizes similarities and distances and reduces the cost of diversity computations, and (d) a quality measurement of our propositions through a user validation based on real-time filtering with novelty and diversity.

Our experimental study shows that novelty and diversity are complementary filters. Moreover, we observe that the filtering rate depends on the novelty threshold and on the window size, and that diversity has less effect for large window sizes. We also studied the performance of our system and obtained an average gain of 97% in processing time with our optimizations for factorizing co-occurrences and computing the history density. We compared the quality of our system under different settings and against a top-k approach, and showed that real-time delivery is a strong constraint which our system guarantees with a TDV-weighted coverage combined with diversity.

As future work, we aim to tune the quality of the diversity measure, since cosine and Euclidean distances do not focus on the same kind of filtering. Another necessity is to address the problem of rarely notified subscriptions, by extending the set of item terms to be matched with subscription terms, based on the TDV values and our item distance. We also intend to propose a distributed version of our algorithm and of our history management in a NoSQL environment, focusing the computation of measures on items instead of subscriptions in order to preserve factorization and scalability.
8. REFERENCES
[1] S. Abbar, S. Amer-Yahia, P. Indyk, and S. Mahabadi.
Real-time Recommendation of Diverse Related Articles.
In World Wide Web Conference (WWW), pages 1--12,
2013.
[2] A. Angel and N. Koudas. Efficient Diversity-Aware
Search. In International Conference on Management of
Data (SIGMOD), pages 781--792, 2011.
[3] R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern
Information Retrieval. ACM Press / Addison-Wesley,
1999.
[4] V. Bavi, T. Beirne, N. Bone, J. Mohr, and B. Neal.
Comparison of Document Similarity Metrics, 2010.
Computer Science Department, Western Washington
University Information Retrieval.
[5] S. M. Beitzel, E. C. Jensen, A. Chowdhury, D. A.
Grossman, and O. Frieder. Hourly Analysis of a Very
Large Topically Categorized Web Query Log. In ACM
Conference on Research and Development in
Information Retrieval (SIGIR), pages 321--328, 2004.
[6] A. Bookstein and D. Swanson. Probabilistic Models for
Automatic Indexing. Journal of the American Society
for Information Science, 25(5):312--318, 1974.
[7] J. Carbonell and J. Goldstein. The Use of MMR,
Diversity-based Reranking for Reordering Documents
and Producing Summaries. In ACM Conference on
Research and Development in Information Retrieval
(SIGIR), pages 335--336, 1998.
[8] D. Carmel, H. Roitman, and E. Yom-Tov. On the
Relationship Between Novelty and Popularity of
User-generated Content. In ACM International
Conference on Information and Knowledge
Management (CIKM), pages 1509--1512, 2010.
[9] A. Carzaniga, D. S. Rosenblum, and A. L. Wolf.
Design and evaluation of a wide-area event notification
service. ACM Transactions on Computer Systems
(TOCS), 19(3):332--383, Aug. 2001.
[10] C. L. Clarke, M. Kolla, G. V. Cormack,
O. Vechtomova, A. Ashkan, S. Büttcher, and
I. MacKinnon. Novelty and Diversity in Information
Retrieval Evaluation. In ACM Conference on Research
and Development in Information Retrieval (SIGIR),
pages 659--666, 2008.
[11] M. Drosou and E. Pitoura. Diversity over Continuous
Data. IEEE Data Engineering Bulletin, 32(4):49--56,
2009.
[12] M. Drosou and E. Pitoura. DisC Diversity: Result
Diversification Based on Dissimilarity and Coverage.
Very Large Data Bases (PVLDB), 6(1):13--24, 2012.
[13] M. Drosou and E. Pitoura. Dynamic Diversification of
Continuous Data. In Proceeding of the ACM
International Conference on Extending Database
Technology - EDBT, pages 216--227, 2012.
[14] M. Drosou, K. Stefanidis, and E. Pitoura.
Preference-Aware Publish/Subscribe Delivery with
Diversity. In ACM International Conference on
Distributed Event-Based Systems (DEBS), pages
6:1--6:12, 2009.
[15] G. Eisenhauer, F. Bustamante, and K. Schwan. Event
services for high performance computing. In
High-Performance Distributed Computing, 2000.
Proceedings. The Ninth International Symposium on,
pages 113--120, 2000.
[16] E. Gabrilovich, S. Dumais, and E. Horvitz. Newsjunkie:
Providing Personalized Newsfeeds via Analysis of
Information Novelty. In World Wide Web Conference
(WWW), pages 482--490, 2004.
[17] Z. Hmedeh, C. du Mouza, and N. Travers. A Real-time
Filtering by Novelty and Diversity for
Publish/Subscribe Systems. In International
Conference on Scientific and Statistical Database
Management (SSDBM), San Diego, USA, June 2015.
[18] Z. Hmedeh, H. Kourdounakis, V. Christophides,
C. du Mouza, M. Scholl, and N. Travers. Subscription
Indexes for Web Syndication Systems. In Proceeding of
the ACM International Conference on Extending
Database Technology - EDBT, pages 311--322, 2012.
[19] Z. Hmedeh, N. Vouzoukidou, N. Travers,
V. Christophides, C. du Mouza, and M. Scholl.
Characterizing Web Syndication Behavior and Content.
In Web Information System Engineering (WISE),
pages 29--42, 2011.
[20] M. Keikha, F. Crestani, and W. B. Croft. Diversity in
Blog Feed Retrieval. In ACM International Conference
on Information and Knowledge Management (CIKM),
pages 525--534, 2012.
[21] E. Minack, W. Siberski, and W. Nejdl. Incremental
Diversification for Very Large Sets: a Streaming-based
Approach. In ACM Conference on Research and
Development in Information Retrieval (SIGIR), pages
585--594, 2011.
[22] D. Panigrahi, A. Das Sarma, G. Aggarwal, and
A. Tomkins. Online Selection of Diverse Results. In
Web Search and Data Mining (WSDM), pages 263--272,
2012.
[23] K. Pripužić, I. P. Žarko, and K. Aberer. Top-k
Publish/Subscribe: Finding k Most Relevant
Publications in Sliding Time Window. In ACM
International Conference on Distributed Event-Based
Systems (DEBS), pages 127--138, 2008.
[24] Redis: Pub/Sub. http://redis.io/topics/pubsub.
[25] A. Rowstron, A.-M. Kermarrec, M. Castro, and
P. Druschel. Scribe: The design of a large-scale event
notification infrastructure. In J. Crowcroft and
M. Hofmann, editors, Networked Group
Communication (NGC), volume 2233 of Lecture Notes
in Computer Science, pages 30--43. Springer Berlin
Heidelberg, 2001.
[26] G. Salton, A. Wong, and C. S. Yang. A Vector Space
Model for Automatic Indexing. Commun. ACM,
18(11):613--620, 1975.
[27] B. Smyth and P. McClave. Similarity vs. Diversity. In
International Conference on Case-based Reasoning
(ICCBR), pages 347--361, 2001.
[28] N. Travers, Z. Hmedeh, N. Vouzoukidou, C. du Mouza,
V. Christophides, and M. Scholl. RSS feeds behavior
analysis, structure and vocabulary. International
Journal of Web Information Systems (IJWIS),
10(3):291--320, 2014.
[29] A. Walker. An Efficient Method for Generating
Discrete Random Variables with General Distributions.
ACM Transactions on Mathematical Software (TOMS),
3:253--256, 1977.
[30] P. Willett. An Algorithm for the Calculation of Exact
Term Discrimination Values. Information Processing
Management, 21(3):225--232, 1985.
[31] C. Yu, L. Lakshmanan, and S. Amer-Yahia. It Takes
Variety to Make a World: Diversification in
Recommender Systems. In Proceeding of the ACM
International Conference on Extending Database
Technology - EDBT, pages 368--378, 2009.
[32] Y. Zhang, J. Callan, and T. Minka. Novelty and
Redundancy Detection in Adaptive Filtering. In ACM
Conference on Research and Development in
Information Retrieval (SIGIR), pages 81--88, 2002.
... Sophisticated Pub/Sub Systems: Traditional Pub/Sub may have focused on single events; nevertheless, several approaches have tried to extend them to satisfy more expressive subscriptions and data information since then. Approaches involve Diversity in Pub/Sub [54][55][56] that produce diverse noti cations to tackle redundant information, Approximate Semantic Matching in Pub/Sub [39,57,58] that try to resolve the rigidness of subscription models by proposing approximate subscriptions and matchers, and Semantic Engines in Pub/Sub [59,60] that integrate data from heterogeneous sources or semantically enrich data with information deriving from external sources. These approaches have a range of advantages and disadvantages related to data expressiveness, usability, dependency on ontologies, thesauri and taxonomies, and they do not apply to entity-based publications that contain rich conceptual and contextual information. ...
... Hmedeh et al. [56] Description: This work is based on novel and diverse items in Web syndication. ...
Thesis
Full-text available
The Internet of Things (IoT) has contributed to physical devices generating entity-centric data (e.g. smart buildings). To bridge the gap between the devices’ data and the users’ interests, Publish/Subscribe systems (Pub/Sub) are suitable middleware to deal with dynamic large-scale IoT applications due to their decoupling traits. However, the IoT contains more challenges than dynamism related to data and users. Specifically, data can be voluminous and heterogeneous due to integration or enrichment as well as redundant or semantically similar due to the sensors’ spatial proximity. Existing approaches tackle semantic interoperability through ontologies and taxonomies resulting in rigidness, non-scalability, and domain-dependency. At the same time, users can either create representationally-coupled queries that could be complex (e.g. SPARQL), independent of their data knowledge and expertise, or simple queries that lead to redundant information, which can overwhelm them. Existing approaches either use complex queries or create high-level data abstractions that are either not usable or complex for dynamic environments and suffer from representational coupling. This thesis addresses these problems and analyses two research questions involving the formulation of a new Pub/Sub scheme; the Entity-centric Publish/Subscribe Summarisation System that involves user-friendly and contextually-aware subscriptions as well as extractive and abstractive summarisation approaches for the publications. Its goal is to address usability, user expressibility, data expressiveness, user and data effectiveness, and system efficiency. Three approaches are proposed; PubSum, IoTSAX, and PoSSUM. PubSum is a dynamic diverse entity summarisation of heterogeneous Linked Data streams through windowing policies, embedding-based DBSCAN clustering, and geometric-based top-k ranking. IoTSAX is a dynamic abstractive summarisation of heterogeneous numerical entity graph streams through enhanced Symbolic Aggregate approximation (SAX) and approximate rule-based reasoning. PoSSUM is an extractive and abstractive diverse summarisation of heterogeneous numerical and Linked Data streams through novel partly-incremental conceptual clustering based on embedding models and variance as well as contextual-based top-k ranking. As an example, doctors are not experts in query languages and are unaware of the content and representations of patient data in a system. The proposed system will require a simple patient-centric subscription that will create a summary as a notification. This summary will be abstractive by interpreting the shape of real-time health sensor readings and providing a high-level inference as well as extractive by including the most important and conceptually/contextually diverse information coming from external sources (e.g. personal information). The proposed system has been extensively evaluated by synthetic and real-world data from the domains of Healthcare and Smart Cities achieving comparable results in correctness and system performance. Specifically, PubSum, involving DBpedia data, achieves up to 92% reduction of forwarded messages, 69.3% duplication reduction, and 0.95 redundancy-aware F-score compared to traditional Pub/Sub, but at the expense of 4 times more latency, while achieving 6 times less latency and 3 times less memory compared to the state-of-the-art diverse entity summarisation with throughput ranging from 833 to 1,005 events/second. 
IoTSAX, involving real-world heterogeneous data related to Healthcare and Smart Cities, achieves up to 0.87 reasoning F-score, 98% reduction of forwarded messages, and outperforms the original SAX in approximation error (2 to 3 times less) and compression space-saving percentage when data redundancy occurs (from 71.75% to 94.99%) while maintaining similar or better latency and throughput. The latency is 2 to 3 times more compared to traditional Pub/Sub and the throughput ranges from 13.231 to 97.393 events/second. PoSSUM, involving real-world heterogeneous data, discovers up to 80% data diversity desire by users and achieves the best summary quality for more than half of the entities as well as the best conceptual clustering F-score from 0.69 to 0.83 compared to traditional Pub/Sub and the state-of-the-art diverse entity summarisation. Also, up to 0.95 redundancy-aware F-score and 99% message reduction compared to traditional Pub/Sub. Finally, it has less clustering processing time, scoring and memory consumption, and comparable latency and throughput.
... As future work, we propose to introduce a filtering system to reduce the number of delivered items to users. Based on the history of the item notified for each subscription, we want to filter out incoming item that does not satisfy novelty and diversity criteria as introduced in [34][35]. ...
Article
Full-text available
Content syndication has become a popular way for timely delivery of frequently updated information on the Web. Today, web syndication technologies such as RSS or Atom are used in a wide variety of applications spreading from large-scale news broadcasting to medium-scale information sharing in scientific and professional communities. However, they exhibit serious limitations for dealing with information overload in Web 2.0. There is a vital need for efficient real-time filtering methods across feeds, to allow users to effectively follow personally interesting information. We investigate in this paper three indexing techniques for users' subscriptions based on inverted lists or on an ordered trie for exact and partial matching. We present analytical models for memory requirements and matching time and we conduct a thorough experimental evaluation to exhibit the impact of critical parameters of realistic web syndication workloads.
Article
Full-text available
Purpose – The purpose of this paper is to present a thorough analysis of three complementary features of real-scale Really Simple Syndication (RSS)/Atom feeds, namely publication activity, item characteristics, and their textual vocabulary, which the authors believe are crucial for emerging Web 2.0 applications. Previous work on RSS/Atom statistical characteristics does not provide a precise and up-to-date characterization of feeds' behavior and content, a characterization that can be used to benchmark the effectiveness and efficiency of Web syndication processing/analysis techniques.
Design/methodology/approach – The authors' empirical study relies on a large-scale testbed acquired over an eight-month campaign in 2010, collecting a total of 10,794,285 items originating from 8,155 productive feeds. The authors deeply analyze feed productivity (types and bandwidth), content (XML, text and duplicates), and textual content (vocabulary and buzz-words).
Findings – The findings of the study are as follows: 17 per cent of feeds produce 97 per cent of the items; feeds' publication rate is formally characterized using a modified power law; the most popular textual elements are the title and description, with an average size of 52 terms; cumulative item size follows a lognormal distribution, varying greatly with feed type; 47 per cent of the feed-published items share the same description; the vocabulary largely does not belong to WordNet terms (4 per cent); vocabulary growth is characterized using Heaps' law and the number of term occurrences by a stretched exponential distribution; and the ranking of frequent terms does not vary significantly.
Research limitations/implications – Modeling the capacities of dedicated Web applications, defining benchmarks, and optimizing Publish/Subscribe index structures.
Practical implications – The study especially opens many possibilities for tuning Web applications, such as: an RSS crawler designed with a resource allocator and a refreshing strategy based on Gini values and their evolution, to predict bursts for each feed according to the category and class of targeted feeds; an indexing structure for matching items' textual content that takes into account item size for the targeted feeds, vocabulary size and term occurrences, vocabulary updates and the evolution of term ranks, and typo/misspelling correction; and filtering that prunes content duplicates across feeds and exploits term correlation to easily detect replicates.
Originality/value – A content-oriented analysis of dynamic Web information.
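For reference, the two growth laws invoked in the findings have standard functional forms; the study's fitted parameter values are not reproduced here.

```latex
% Heaps' law: vocabulary size V after n tokens have been observed
V(n) = K\,n^{\beta}, \qquad 0 < \beta < 1

% Stretched exponential tail for the number of occurrences x of a term
P(X \geq x) \;\propto\; \exp\!\big(-(x/x_0)^{c}\big), \qquad 0 < c < 1
```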
Conference Paper
Full-text available
News articles typically drive a lot of traffic in the form of comments posted by users on a news site. Such user-generated content tends to carry additional information such as entities and sentiment. In general, when articles are recommended to users, only popularity (e.g., most shared and most commented), recency, and sometimes (manual) editors' picks (based on daily hot topics) are considered. We formalize a novel recommendation problem where the goal is to find the closest yet most diverse articles to the one the user is currently browsing. Our diversity measure incorporates entities and sentiment extracted from comments. Given the real-time nature of our recommendations, we explore the applicability of nearest-neighbor algorithms to solve the problem. Our user study on real opinion articles from aljazeera.net and reuters.com validates the use of entities and sentiment extracted from articles and their comments to achieve news diversity, when compared to content-based diversity. Finally, our performance experiments show the real-time feasibility of our solution.
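The closest-yet-most-diverse trade-off can be sketched as a linear combination of content closeness and comment-level redundancy. The features, weight, and Jaccard choice below are illustrative assumptions, not the paper's measure.

```python
def recommend(current, candidates, k=3, lam=0.5):
    """Rank candidates by closeness to the browsed article minus overlap
    in (entity, sentiment) pairs mined from comments. Articles are dicts:
    {'terms': set, 'entity_sentiment': set}; lam trades closeness
    against comment-level diversity."""
    def jac(a, b):
        return len(a & b) / len(a | b) if a or b else 0.0
    def score(c):
        closeness = jac(current["terms"], c["terms"])
        redundancy = jac(current["entity_sentiment"], c["entity_sentiment"])
        return lam * closeness - (1 - lam) * redundancy
    return sorted(candidates, key=score, reverse=True)[:k]

cur = {"terms": {"budget", "vote"}, "entity_sentiment": {("gov", "+")}}
cands = [
    {"terms": {"budget", "tax"},  "entity_sentiment": {("gov", "-")}},
    {"terms": {"budget", "vote"}, "entity_sentiment": {("gov", "+")}},
]
print(recommend(cur, cands, k=1))  # prefers the close but less redundant article
```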
Article
Full-text available
The explosion of published information on the Web leads to the emergence of a Web syndication paradigm, which transforms the passive reader into an active information collector. Information consumers subscribe to RSS/Atom feeds and are notified whenever a piece of news (item) is published. The success of this Web syndication, now offered on Web sites, blogs, and social media, however raises scalability issues. There is a vital need for efficient real-time filtering methods across feeds, to allow users to effectively follow personally interesting information. We investigate in this paper three indexing techniques for users' subscriptions, based on inverted lists or on an ordered trie. We present analytical models for memory requirements and matching time, and we conduct a thorough experimental evaluation to exhibit the impact of critical workload parameters on these structures.
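To complement the inverted-list sketch above, here is a simplified ordered-trie variant: each subscription is stored as a lexicographically sorted path of terms, and matching walks the trie over the item's sorted terms. The paper's partial-matching variants are not reproduced.

```python
class TrieNode:
    __slots__ = ("children", "subs")
    def __init__(self):
        self.children = {}
        self.subs = []          # subscriptions ending at this node

class OrderedTrie:
    """Subscriptions stored as sorted term paths; an item matches every
    path it can realize with its own sorted terms (subset matching)."""
    def __init__(self):
        self.root = TrieNode()

    def subscribe(self, sub_id, terms):
        node = self.root
        for t in sorted(set(terms)):
            node = node.children.setdefault(t, TrieNode())
        node.subs.append(sub_id)

    def match(self, item_terms):
        terms, out = sorted(set(item_terms)), []
        def walk(node, i):
            out.extend(node.subs)
            for j in range(i, len(terms)):       # try each remaining item term
                child = node.children.get(terms[j])
                if child:
                    walk(child, j + 1)
        walk(self.root, 0)
        return out

trie = OrderedTrie()
trie.subscribe("s1", ["france", "election"])
print(trie.match(["election", "france", "poll"]))  # ['s1']
```

Ordering the terms ensures each subscription has a single canonical path, which keeps the trie compact and lets matching prune whole subtrees as soon as a required term is absent from the item.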
Conference Paper
Content syndication has become a popular way for timely delivery of frequently updated information on the Web. It essentially enhances traditional pull-oriented searching and browsing of web pages with push-oriented protocols. However, many Web syndication applications imply a tight coupling between feed producers and consumers and do not help users find, among all the information they receive, items with interesting and new content. We present the FiND Pub/Sub system, which integrates an in-memory filtering process based on keyword subscriptions. Unlike existing proposals, FiND is designed for real-time notifications on item streams. This demonstration illustrates the main features of the FiND system, namely (i) a scalable real-time notification process that fires when the most important terms of the subscription are matched, and (ii) tunable filtering by novelty and diversity to reduce user flooding.
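Notifying "when the most important terms of the subscription are matched" suggests a weighted partial-matching test. A toy sketch follows, where the per-term weights (e.g. TDV-like scores) and the threshold are assumptions, not FiND's actual mechanism.

```python
def important_terms_match(sub_weights, item_terms, theta=0.8):
    """Partial matching on the most important terms: notify when the
    matched terms carry at least a fraction theta of the subscription's
    total term weight. Weights and theta are illustrative."""
    total = sum(sub_weights.values())
    matched = sum(w for t, w in sub_weights.items() if t in item_terms)
    return total > 0 and matched / total >= theta

sub = {"pandemic": 0.7, "vaccine": 0.5, "europe": 0.1}
# 1.2 of 1.3 total weight is matched (~0.92 >= 0.8), so the user is notified
print(important_terms_match(sub, {"pandemic", "vaccine", "stocks"}))  # True
```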
Article
Recently, result diversification has attracted a lot of attention as a means to improve the quality of results retrieved by user queries. In this paper, we propose a new, intuitive definition of diversity called DisC diversity. A DisC diverse subset of a query result contains objects such that each object in the result is represented by a similar object in the diverse subset and the objects in the diverse subset are dissimilar to each other. We show that locating a minimum DisC diverse subset is an NP-hard problem and provide heuristics for its approximation. We also propose adapting DisC diverse subsets to a different degree of diversification. We call this operation zooming. We present efficient implementations of our algorithms based on the M-tree, a spatial index structure, and experimentally evaluate their performance.
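Since finding a minimum DisC diverse subset is NP-hard, a natural greedy heuristic builds an independent dominating set of the similarity graph: every object ends up within distance r of some chosen object, and chosen objects are pairwise more than r apart. The sketch below works over a plain list with a distance function rather than the paper's M-tree index, with r playing the role of the zooming radius.

```python
def greedy_disc(points, r, dist):
    """Greedy heuristic in the spirit of DisC diversity: pick an
    uncovered point, discard everything within distance r of it,
    and repeat until all points are covered."""
    uncovered = set(range(len(points)))
    chosen = []
    while uncovered:
        i = uncovered.pop()               # arbitrary uncovered point
        chosen.append(i)
        uncovered = {j for j in uncovered if dist(points[i], points[j]) > r}
    return [points[i] for i in chosen]

pts = [0.0, 0.1, 0.15, 1.0, 1.05, 2.0]
print(greedy_disc(pts, r=0.2, dist=lambda a, b: abs(a - b)))  # e.g. [0.0, 1.0, 2.0]
```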
Conference Paper
Blog distillation (blog feed retrieval) is a task in blog retrieval where the goal is to rank blogs according to their recurrent relevance to a query topic. One of the main properties of blog feed retrieval is that the unit of retrieval is a collection of documents as opposed to a single document as in other IR tasks. This collection retrieval nature of blog distillation introduces new challenges and requires new investigations specific to this problem. Researchers have addressed this problem by considering a wide range of evidence and information resources. However, previous work has not studied the effect of on-topic diversity of blog posts in blog relevance. By on-topic diversity of blog posts we mean that those posts that are about the query topic need to have high diversity and cover different sub-topics of the query. In this study, we investigate three types of on-topic diversity and their effect on retrieval performance: topical diversity, temporal diversity and hybrid diversity. Our experiments over different blog collections and different baseline methods show that on-topic diversity can improve the performance of the retrieval system. Among the three types of diversity, hybrid diversity, that considers both topical and temporal diversities, achieves the best performance.
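One plausible reading of the hybrid measure is a linear combination of the two components; the abstract does not specify the exact form, so the following is an assumption:

```latex
\mathrm{Div}_{\mathit{hybrid}}(B) \;=\; \lambda\,\mathrm{Div}_{\mathit{topical}}(B)
\;+\; (1-\lambda)\,\mathrm{Div}_{\mathit{temporal}}(B),
\qquad \lambda \in [0,1]
```

where Div_topical rewards coverage of different sub-topics among the on-topic posts of blog B and Div_temporal rewards their spread over time.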
Article
Evaluation measures act as objective functions to be optimized by information retrieval systems. Such objective functions must accurately reflect user requirements, particularly when tuning IR systems and learning ranking functions. Ambiguity in queries and redundancy in retrieved documents are poorly reflected by current evaluation measures. In this paper, we present a framework for evaluation that systematically rewards novelty and diversity. We develop this framework into a specific evaluation measure, based on cumulative gain. We demonstrate the feasibility of our approach using a test collection based on the TREC question answering track.
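The cumulative-gain measure developed in this line of work is commonly known as α-nDCG; in its standard formulation, the gain of the document at rank k is discounted for information nuggets already returned higher in the ranking:

```latex
G[k] \;=\; \sum_{i=1}^{m} J(d_k, i)\,(1-\alpha)^{r_{i,k-1}},
\qquad
\mathrm{DCG}[k] \;=\; \sum_{j=1}^{k} \frac{G[j]}{\log_2(1+j)}
```

Here J(d_k, i) = 1 if document d_k is judged to contain nugget i, r_{i,k-1} counts the documents ranked before k that contain nugget i, and α ∈ (0,1] penalizes redundancy; dividing by the DCG of an ideal ranking yields the normalized measure.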
Article
This paper considers the pattern of occurrences of words in text as part of an attempt to develop formal rules for identifying words indicative of content and thereby suitable for use as index terms. A probabilistic model is proposed which, with a suitable fitting of parameters, can account for the occupancy distribution of most words, both index terms and non-index terms; the parameters take quite different values for the two classes. In this model, each abstract receives word occurrences in a Poisson process. Abstracts can then be divided into classes such that all abstracts within a given class receive word occurrences at the same average rate. The appearance of a particular number of occurrences of some word within an abstract then gives information, in a Bayesian sense, on the class membership of that abstract. It is of central interest to determine the minimum number of classes that can account for the occupancy distribution of each word. Though more testing needs to be done, it may be concluded that the distribution of a very large majority of words can be accounted for by assuming three or fewer classes.
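The model described amounts to a finite Poisson mixture over abstract classes: with class proportions π_j and per-class occurrence rates λ_j, the probability that a word occurs k times in an abstract is

```latex
P(X = k) \;=\; \sum_{j=1}^{J} \pi_j\,\frac{\lambda_j^{\,k}\,e^{-\lambda_j}}{k!},
\qquad \sum_{j=1}^{J} \pi_j = 1
```

The central question is then the minimum J needed per word; the paper concludes that J ≤ 3 accounts for the distribution of the vast majority of words, with the fitted parameters separating index terms from non-index terms.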