Wiki-MID: a Very Large Multi-domain Interests Dataset of Twitter Users with Mappings to Wikipedia
Giorgia Di Tommaso, Stefano Faralli*, Giovanni Stilo, and Paola Velardi
Department of Computer Science, University of Rome,
*Unitelma-Sapienza, Italy
{ditommaso,stilo,velardi}@di.uniroma1.it
stefano.faralli@unitelmasapienza.it
Abstract. This paper presents Wiki-MID, a LOD-compliant multi-domain interests dataset to train and test Recommender Systems, and the methodology to create the dataset from Twitter messages in English and Italian. Our English dataset includes an average of 90 multi-domain preferences per user on music, books, movies, celebrities, sport, politics and much more, for about half a million users traced during six months in 2017. Preferences are either extracted from messages of users who use Spotify, Goodreads and other similar content sharing platforms, or induced from their "topical" friends, i.e., followees representing an interest rather than a social relation between peers. In addition, preferred items are matched with Wikipedia articles describing them. This unique feature of our dataset provides a means to categorize preferred items, exploiting available semantic resources linked to Wikipedia such as the Wikipedia Category Graph, DBpedia, BabelNet and others.
Keywords: semantic recommenders, Twitter, Wikipedia, users' interests
Permanent URL: https://doi.org/10.6084/m9.figshare.6231326
1 Introduction
Recommender systems are widely integrated in online services to provide suggestions and personalize the on-line store for each customer. Recommenders identify
preferred items for individual users based on their past behaviors or on other
similar users. Popular examples are Amazon [1] and Youtube [2]. Other sites
that incorporate recommendation engines include Facebook, Netflix, Goodreads,
Pandora and many others.
Despite the vast amount of proposed algorithms, the evaluation of recommender systems is very difficult [3]. In particular, if the system is not operational and no real users are available, the quality of recommendations must be evaluated on existing datasets, whose number is limited and which, what is more, are focused on specific domains (e.g., music, movies, etc.). Since different algorithms
may be better or worse depending on the specific purpose of the recommender,
the availability of multi-domain datasets could be greatly beneficial. Unfortunately, real-life cross-domain datasets are quite scarce, mostly gathered by "big players" such as Amazon and eBay, and they are not available to the research community¹.
In this paper we present a methodology for extracting from Twitter a large
dataset of user preferences - that we call Wiki-MID - in multiple domains and in
two languages, Italian and English. To reliably extract preferences from users’
messages, we exploit popular services such as Spotify, Goodreads and others. Furthermore, we infer many other preferences from users' friendship lists, identifying
those followees representing an interest rather than a peer friendship relation.
In this way we learn, for any user, several interests concerning books, movies,
music, actors, politics, sport, etc. The other unique feature of our dataset, in
addition to multiple languages and domains, is that preferred items are matched
with corresponding Wikipedia pages, thus providing the possibility to generalize
users’ interests exploiting available semantic resources linked to Wikipedia, such
as the Wikipedia Category Graph, BabelNet, DBpedia, and others.
The paper is organized as follows: Section 2 summarizes previous research
on creating datasets for recommender systems, Sections 3, 4 and 5 present the
methodology to create Wiki-MID, Section 6 is dedicated to dataset statistics
and evaluation, and Section 7 describes the released resource, which has been
designed on top of the Semantically-Interlinked Online Communities (SIOC)
core ontology. Finally, in Section 8 we provide a summary of distinctive features
of our resource and some directions for future work.
2 Related work
Recommender systems are based on one of three basic approaches [4]: collaborative filtering [5] generates recommendations by collecting the preferences of many users, content-based filtering [6] suggests items similar to those already chosen by the user, and knowledge-based recommendation [7] identifies a semantic correlation between the user's preferences and existing items. Hybrid approaches are also widely adopted, e.g., [8]. All approaches share the need for sufficiently large datasets to learn preferences and to evaluate the system, a problem that is one of the main obstacles to a wider diffusion of recommenders [9], since only a small number of researchers can access real user data, due to privacy issues.
To overcome the lack of datasets, challenges such as RecSys have been launched², and dedicated web sites have been created (e.g., SNAP³ or Kaggle⁴), where researchers can upload their datasets and make them available to the community. However, it is still difficult to find appropriate data for novel types of recommenders, as the majority are focused on a single topic, like music ([10], [11]), food ([12], [13]), travel ([14], [15]) and more [16]. Furthermore, while a small number
of large datasets are available, such as MovieLens [17], the Million Song Dataset [18] and the Netflix Prize dataset [19], many others are quite small and based on very focused experiments.

¹ https://recsys.acm.org/wp-content/uploads/2014/10/recsys2014-tutorial-cross_domain.pdf
² https://recsys.acm.org/
³ http://snap.stanford.edu/data
⁴ https://www.kaggle.com/datasets/?sortBy=hottest&group=all
Concerning the source of data for extracting preferences, social networks are
often used (mainly Twitter, Facebook, Google+, LinkedIn or a combination of
sources, such as in [20]), since their content is freely available with more or less
severe restrictions. The interested reader can refer to [21] for a detailed survey
of methods adopted in the literature to collect social data for the purpose of inferring and enhancing users' interest profiles. Preferences are induced from users' profiles (e.g., [22]), authoritative (topical) friendship relations [23], followee biographies [24], and messages ([25], [26], [27] and many others).
Data extraction from Twitter messages is a popular strategy; however, it is also computationally expensive and error-prone, since it requires natural language processing techniques to analyze the text. To overcome this difficulty, a number of studies exploited platforms (e.g., Youtube, Spotify) that integrate among their services the ability to post the user's personal content, such as the movies he/she is watching, on the most popular social network sites. Sharing this information is done in a simple and predefined way. Depending on the social network chosen, the content, for example a Youtube video, is shared with a pre-formatted message formed by the video name, a link, a self-generated text and, if provided, a numerical rating (e.g., "How It's Made: Bread" https://youtu.be/3UjUWfwWAC4 via YouTube). The message can also be enriched and personalized by the user. In [25] these types of messages are extracted from Twitter to detect music interests. The dataset is based on 100,000,000 tweets with the #nowplaying hashtag. Tweets are extracted via the Twitter APIs over 3 years, and then MusicBrainz and Spotify are used to add more details. Other studies extract data about music [27] or sport [28] events. However, all the datasets generated in this way concern only one domain of interest.
To the best of our knowledge, the only truly multi-domain dataset is presented in [29], where pre-structured tweets about three domains - movies, books and video clips - are extracted respectively from IMDb (Internet Movie Database), Youtube and Goodreads. With respect to this work, we collect a much wider number of interests, since in addition to pre-formatted messages based on a number of available services, we reliably extract many additional types of interests exploiting users' followee lists. Furthermore, as shown in Section 6, we collected many interest types for each user, while the dataset released in [29] includes only 7 users with at least 3 types of interests.
3 Workflow
This section summarizes the data sources and workflow to create the Wiki-
MID multi-domain resource. We extract preferences (with unary ratings) from a user's messages and from his/her friendship list, identifying those followees who represent an interest rather than a peer friendship relationship. The process consists of three steps:
1. Extracting interests from users' textual communications. Using textual features extracted from users' communications, profiles or lists seems a natural way for modeling their interests. However, this information source has several drawbacks when applied to large data streams, such as the set of Twitter users. First, it is computationally very demanding to process millions of daily tweets in real time; secondly, the extraction process is error-prone, given the highly ungrammatical nature of micro-blogs. To reliably extract preferences from Twitter users' messages, in line with other works surveyed in Section 2, we use a number of available services, described hereafter, that allow users to share activities and preferences in different domains - movies, books etc. - using pre-formatted expressions (e.g., for Spotify: #NowPlaying) followed by the URL of a web site, from which we can extract information without errors. The drawback is that a relatively small number of users access these services and, in addition, preferences are extracted only in a few domains.
2. Extracting interests from users' friendship lists. In [30] the authors argue that users' interests can also be implicitly represented by the authoritative (topical) friends they are linked to. This information is available in users' profiles and does not require additional textual processing. Furthermore, interests inferred from topical friends are less volatile since, as shown in [31], "common" users tend to be rather stable in their relationships. Topical friends are therefore both relatively stable and readily accessible indicators of a user's interests. Another advantage is that average Twitter users have hundreds of followees, many of which, rather than genuine friends, are indicators of a variety of interests in different domains, such as entertainment, sport, art and culture, politics, etc.
3. Mapping interests onto Wikipedia pages. The final step is to associate each interest, either extracted from messages or inferred from friendship relations, with a corresponding Wikipedia page, e.g.:
@nytimes WIKI:EN:The New York Times
(in this example, @nytimes is a Twitter account extracted from a user's friendship list). Although not all interests can be mapped to Wikipedia, our experiments show that this is possible in a large number of cases, since Wikipedia articles are created almost in real time for virtually any popular entity, be it a book, song, actor, event, etc.
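To make the output of these three steps concrete, here is a minimal Python sketch of the kind of per-user record the workflow produces; the class and field names are illustrative only, not the released schema (the actual data model, based on SIOC and SKOS, is described in Section 7):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Interest:
    # Hypothetical record for one preference.
    source: str              # e.g. "Spotify", "Goodreads", "IMDb" or "topical_friend"
    item: str                # e.g. "The Beach - Alex Garland" or "@nytimes"
    wikipage: Optional[str]  # e.g. "WIKI:EN:The New York Times"; None if unmapped

@dataclass
class UserProfile:
    user_id: str
    interests: List[Interest] = field(default_factory=list)

# The @nytimes example from step 3 above:
profile = UserProfile(user_id="example_user")
profile.interests.append(
    Interest(source="topical_friend", item="@nytimes",
             wikipage="WIKI:EN:The New York Times"))
```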
We applied this workflow to two Twitter streams in two languages, English
and Italian, as explained in the next sections.
4 Extraction of users’ interests
4.1 Extracting interests from messages
Every day a huge number of people use on-line platforms (e.g., Yelp, Foursquare, Spotify, etc.) that allow them to share activities and preferences in different domains on a social network in a standard way. Among the most popular services accessed by Twitter users, we selected those providing pre-formatted messages:
– Spotify: Spotify is a music service offering on-demand streaming of music, on both desktop and mobile. Users can also create playlists, and share and edit them in collaboration with other users. By accessing the Spotify web site, users can retrieve additional information such as the record label, song releases, date of release, etc. Since 2014, Spotify has been widely used in America, Europe and Australia. Spotify is among the services that allow users to share self-generated content on Twitter. An example of these tweets is: "#NowPlaying The Sound Of Silence by Disturbed https://t.co/d8Sib5EDVf". The standard form of these tweets is:
#NowPlaying <title> by <artist> <URL>
By filtering the tweet stream and using the Twitter APIs for hashtag detection, we generated a stream of all the users who listened to music using Spotify.
– Goodreads and aNobii: Similarly to Spotify, a number of platforms allow users to share opinions and reviews on books. In these platforms, users can share both titles and ratings. As with Spotify, the generated tweets have a predefined structure and point to a URL. In the book domain, we use Goodreads (10 million users and 300 million books in the database) and, for Italian, the more popular aNobii service.
– IMDb and TVShowTime: In the domain of movies, currently there are no dominant services. Popular platforms in this area are Flixter, themoviedb.org and iCheckMovies. However, many of these platforms use the IMDb database, owned by Amazon, which handles information about movies, actors, directors, TV shows, and video games. We also use the TVShowTime service for Italian users.
In order to extract users' preferences from these services, we first collect in a Twitter stream TS all messages including a hashtag related to one of the above-mentioned services (#NowPlaying, #IMDb, etc.). Then, we extract from TS the music, movie and book preferences for the set of users U who accessed these services. Unlike [29], we avoid parsing tweets using specific regular expressions, since users are free to insert additional text in the pre-formatted message. Rather, in line with [32], we exploit an element that most pre-formatted tweets have: the URL, e.g., #NowPlaying High by James Blunt https://t.co/7EiepE2Bvz
Accordingly, we collect all tweets containing the selected hashtags and discard those which do not include a URL. The reason for extracting the information from the URL (which is computationally more demanding) rather than from the tweet itself is twofold: i) tweets can be ambiguous or malformed, and furthermore, users can insert additional text in the pre-formatted message, e.g., "#NowPlaying Marty. This guy is amazing. http://t.co/jwxvLiNenW". Scraping the html page at the URL address ensures that we extract data without errors, even for complex items such as book and movie titles; ii) the URL includes additional information (e.g., not only the title of a song, but also the singer and the record label), which provides us with a context to reliably match the extracted entity (song, book, movie) with a Wikipedia article, as detailed in Section 5.1. Since the URL in tweets is a short URL, we first expand the original URL so that all URLs belonging to a given platform can be identified (for
example, https://t.co/oShYDc6DeL → http://spoti.fi/2cTPn0U). Next, we access the web site and scrape its content. For each platform we obtain the following data:
Music: <Title, Author (e.g. singer, band)>
Books: <Title, Author>
Movie: <Title, Year of production, Type (e.g. movie, TV series)>
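The following is a minimal sketch of this expand-and-scrape step, assuming the requests and BeautifulSoup libraries; the domain list is an illustrative subset, and real extraction uses platform-specific page parsers rather than the generic <title> lookup shown here:

```python
import requests
from bs4 import BeautifulSoup

# Illustrative subset; the real list covers all services selected above.
SERVICE_DOMAINS = {
    "open.spotify.com": "music",
    "www.goodreads.com": "books",
    "www.imdb.com": "movies",
}

def expand_short_url(short_url: str) -> str:
    """Follow redirects from a t.co short URL to the final platform URL."""
    resp = requests.head(short_url, allow_redirects=True, timeout=10)
    return resp.url

def extract_preference(short_url: str):
    """Expand the URL, keep it only if it belongs to a known platform,
    and scrape coarse metadata from the page."""
    final_url = expand_short_url(short_url)
    domain = final_url.split("/")[2] if "://" in final_url else ""
    if domain not in SERVICE_DOMAINS:
        return None  # discard tweets whose URL is not from a selected service
    html = requests.get(final_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    return {"domain": SERVICE_DOMAINS[domain], "url": final_url,
            "page_title": title}
```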
4.2 Extracting interests from users' "topical" friends
In addition to preferences extracted from users' messages, we also induce interests from their topical friends, a notion that we first introduced in [23]. We denote as topical friends those Twitter accounts in a user's followee list representing popular entities (celebrities, products, locations, events, ...). For example, if a user follows @David Lynch, this means that he/she likes his movies, rather than being a genuine friend of the director. There are several clues to identify topical friends in a friendship list: first, topical relations are mostly not reciprocated; second, popular users have a high in-degree. However, these two clues alone do not allow one to distinguish, e.g., bloggers or very social users from truly popular entities.
To learn a classification model to distinguish between topical and peer friends, we first collected a network of Verified Twitter Accounts. Verified accounts⁵ are authentic accounts of public interest. We started from a set of seed verified contemporary accounts in 2016, and we then crawled the network following only verified friends, until no more verified accounts could be found. This left us with a network of 107,018 accounts of verified contemporary users (V), representing a "training set" to identify authoritative users' profiles. To learn a model of authoritativeness, we used the set V and a random balanced set of ¬V users. For each account in V and ¬V, we extracted three structural features (in-degree, out-degree and their ratio) and one binary textual feature (the presence in the user's account profile of role words such as singer, artist, musician, writer, ...). Then, we used 80% of these accounts to train an SVM classifier with Laplacian kernel and the remaining 20% for testing with cross-validation, obtaining a total accuracy of 0.88 (true positive rate 0.95 and true negative rate 0.82).
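Below is a synthetic-data sketch of this classification setup, assuming scikit-learn; the feature generator is invented for illustration, whereas the real features are computed from the crawled verified-account network:

```python
import numpy as np
from sklearn.metrics.pairwise import laplacian_kernel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 500

def make_accounts(in_mu, role_p):
    """Synthetic stand-in for the four features: in-degree, out-degree,
    in/out ratio, and a binary 'role word in bio' flag."""
    in_deg = rng.lognormal(in_mu, 1.0, n)
    out_deg = rng.lognormal(5.0, 1.0, n)
    role = (rng.random(n) < role_p).astype(float)
    return np.column_stack([in_deg, out_deg, in_deg / out_deg, role])

# Topical (verified) accounts get a much higher in-degree and more role words.
X = np.vstack([make_accounts(10.0, 0.8),    # verified accounts (V)
               make_accounts(5.0, 0.1)])    # random non-verified accounts
y = np.concatenate([np.ones(n), np.zeros(n)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y)

# Laplacian kernel k(x, x') = exp(-gamma * ||x - x'||_1), passed to the SVM
# as a callable that precomputes the Gram matrix.
clf = make_pipeline(StandardScaler(), SVC(kernel=laplacian_kernel))
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```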
Next, from the set U of users in our Twitter datasets (separately for the English and Italian streams), we collected the set F of Twitter accounts such that, for any f ∈ F there is at least one u ∈ U such that u follows f. The previously learned classifier was used to select a subset F_t ⊆ F of authoritative users representing "candidate" topical friends.
Finally, an additional filtering step is applied to identify "true" topical friends in F_t, i.e., genuine users' interests, which consists in determining which members of the set F_t have a matching Wikipedia page. This step is described in Section 5. The intuition is that, if one such match exists, the entity to which the Twitter account belongs is indeed "topical"⁶. Although this filtering step may affect the recall of the method, it provides high accuracy, as demonstrated in Section 6.
⁵ https://developer.twitter.com/en/docs/api-reference-index
⁶ We do not directly attempt a match of all f ∈ F with Wikipedia, since it is computationally very demanding and has a reduced precision.
5 Mapping interests to Wikipages
The last step of our methodology consists in mapping the collected users' interests to Wikipages. This step has the advantage of both improving the precision of the detected users' interests and providing a means to categorize them. We use different mapping methodologies for interests extracted from messages and those induced from users' friendship lists.
5.1 Mapping movies, songs and books
Mapping interests extracted from users' messages to Wikipedia pages is a very reliable process, given the additional contextual information extracted from the URL (see Section 4.1). The Wikipedia mapping is obtained by a cascade of weighted boolean queries on a Lucene index, as in the example below, used to search the Wikipage of an item:

⟨TITLE ∈ WikiTitle⟩_{w1} ∧ ⟨AUTHOR ∈ WikiGloss⟩_{w2} ∨ ((⟨WORDS ∈ WikiTitle⟩_{w3} ∧ ⟨AUTHOR ∈ WikiTitle⟩_{w4}) ∨ ¬(⟨WORDS ∈ WikiTitle⟩_{w3} ∧ ⟨AUTHOR ∈ WikiTitle⟩_{w4} ∧ ⟨WORDS ∈ WikiText⟩_{w5})) ∧ ⟨WORDS ∈ WikiText⟩_{w5}

with WORDS for music = {"song"}, WORDS for books = {"books", "novel", "saga", ...}, WORDS for movies = {"film", "series", "TV series", "episode", ...},

where w_i is a weight assigned to a query. When the page doesn't exist or is not available, we search for the page of the item's author, using similar queries.
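As a hedged illustration, one step of such a cascade can be rendered as a boosted Lucene query string; the index field names (wikititle, wikigloss, wikitext) and the weight values below are hypothetical, with Lucene's standard ^ boost operator standing in for the weights w1..w5:

```python
def build_query(title: str, author: str, words: list,
                w1: float = 4.0, w2: float = 3.0,
                w4: float = 2.0, w5: float = 1.0) -> str:
    """Render one step of the weighted boolean cascade as a Lucene query
    string over hypothetical fields; ^w is Lucene's boost operator."""
    words_clause = " OR ".join(f'wikitext:"{w}"^{w5}' for w in words)
    return (f'(wikititle:"{title}"^{w1} AND wikigloss:"{author}"^{w2})'
            f' OR (wikititle:"{author}"^{w4} AND ({words_clause}))')

# Example: search for the Wikipage of the novel "The Beach" by Alex Garland.
print(build_query("The Beach", "Alex Garland", ["novel", "book", "saga"]))
```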
5.2 Mapping topical friends
Matching interests extracted from a user's friendship list with corresponding Wikipedia pages is far more complex, because of homonymy, polysemy and ambiguity. Furthermore, the information included in a user's Twitter profile is very sketchy and in some cases misleading, and therefore it may not provide sufficient context to detect a similarity with the corresponding Wikipedia article. For example, Bill Gates's description field⁷ in his Twitter profile is: "Sharing things I'm learning through my foundation work and other interests...", which has little in common with his Wikipedia page: "William Henry Gates III (born October 28, 1955) is an American business magnate, investor, author, philanthropist, humanitarian and co-founder of the Microsoft Corporation along with Paul Allen."
⁷ As retrieved in January 2018.
We note that other studies have considered this task. For example, the authors of [33] use a heuristic based on the overlap coefficient between a topical followee's last 20 tweets and the Wikipedia article summary, which is rather data-demanding. In [34] the authors use a methodology which is similar to the one we first presented in [23], based on a comparison between Twitter description fields and the content of a Wikipage. As previously noted (see the Bill Gates example), this might not be sufficient in many cases. In the present work, to reliably assign a Wikipedia page to a large fraction of the users in the set F_t of U's authoritative
friends, we use an ensemble of methods, with adjudication by majority voting.
The methodology is described in what follows.
1. Task Description and data - Given a set F_t = {f_1, f_2, ..., f_n} of candidate "topical" Twitter profiles and a set of Wikipages W = {w_1, w_2, ..., w_m}, we define a mapping function M : F_t → W ∪ {λ}, where the value of the function M for a given Twitter profile f_i is either a Wikipage w_j, namely the Wikipage corresponding to the entity owning the Twitter profile f_i, or λ, where λ means "no match".
We define an ensemble of three mappers exploiting the information included in Twitter profiles and in the DBpedia entities associated to Wikipedia.
Profiles of Twitter users provide, among others, the following information:
– profile address: e.g., https://twitter.com/katyperry;
– user ID: a numeric value that uniquely identifies a user (not visible on the rendered web page);
– screen name: a string that can be used to refer to a user when posting a message (e.g., @katyperry);
– name: the extensive name of the owner of the profile (e.g., "Katy Perry");
– url: the link to a profile-relevant homepage (e.g., "katyperry.com"). Only a fraction of profiles have a URL to a homepage;
– description: a short text describing the user and welcoming profile visitors.
Furthermore, from each Wikipage w_j (e.g., Figure 1, upper right, shows the Wikipedia page of the singer "Katy Perry") it is possible - thanks to DBpedia - to collect additional information; here is a small subset:
– title: the title of the page (e.g., "Katy Perry");
– content: the textual content of the page;
– homepage: a property (collected and included in DBpedia from infoboxes) which (when present) links to a web page (homepage) related to the main entity described in w_j (e.g., "katyperry.com");
– links extracted from the homepage: the links included in the source html of the above-mentioned homepage; e.g., in the html of the web page at katyperry.com we find: https://facebook.com/katyperry, https://twitter.com/katyperry, ...
2. Mapping methods - We rely on an ensemble of three different methodologies (M1, M2 and M3) for the association between the set F_t of Twitter profiles and Wikipages. The first is based on text mining and structural properties of the social network; the other two are based on finding direct correspondences between the field url in a Twitter profile and the property homepage in a DBpedia entity.
1. M1 - Context Based mapping: We use the methodology that we first presented in [35], summarized in what follows:
[Fig. 1. Example of Twitter2Wikipedia and Wikipedia2Twitter mapping: the fields of a Twitter profile f_i (profile_address, screen_name, name, url, description) are matched against those of a Wikipage w_j (address, title, homepage, links extracted from the homepage), e.g.: does f_i.url equal w_j.homepage or the Wikipage address? do the links extracted from the homepage contain f_i.profile_address?]
a) Selection of candidate senses: For any f_i in F_t, find a (possibly empty) list
of candidate Wikipages, using BabelNet [36] synonym sets (in BabelNet, each "BabelSynset" points to a unique Wikipedia entry). For example, @katyperry has the candidates Katy Perry and Katy Perry discography, but there are cases with dozens of candidates (e.g., https://en.wikipedia.org/wiki/John_Williams_(disambiguation));
b) BoW Disambiguation: Compute the bag-of-words (BoW) similarity between the user description in f_i's Twitter account and each candidate Wikipage. The BoW representation of each Wikipage is obtained from its associated BabelNet relations (relations are described in [37]);
c) Structural Similarity: If no Wikipage can be found with a sufficient level of similarity (as for the previous example of Bill Gates's description field), select from f_i's friendship list those friends already mapped to a Wikipage - if any - and compute the similarity between those Wikipages and the candidate Wikipages. For example, to correctly map the Twitter account of Bill Gates to Wikipedia, the profile information of the following Twitter users in his friendship list is used: Paul Allen, Melinda Gates, TechCrunch, Microsoft Foundation, and more. Note that Paul Allen is explicitly mentioned in the first sentence of Bill Gates's Wikipage.
2. M2 - Twitter2Wikipedia: as sketched in Figure 1, we first collect a set of URLs from a given profile f_i, including the link (if any) in the field url and all the links extracted from the profile description field (in the example of Figure 1, since the profile description is empty, we collect only the link katyperry.com). Second, we search for a Wikipage w_j (if any) for which one of the links collected for f_i matches either the link provided in the homepage property (in our example, the Wikipage with title "Katy Perry" has a homepage property whose value matches exactly the link katyperry.com), or directly the address of the page itself (e.g., https://en.wikipedia.org/wiki/Katy_Perry). Note that this mapping method is error-prone: for example, from the Twitter profile of Paul Gilmour we extract the url skysports.com, which matches the homepage property of the Wikipage https://en.wikipedia.org/wiki/Sky_Sports. Although related, this is not Paul Gilmour's page (https://en.wikipedia.org/wiki/Paul_Gilmour).
3. M3 - Wikipedia2Twitter: M3 is symmetric to M2. As shown in Figure 1, we map a given f_i to a Wikipage w_j if the homepage property, or one of the links extracted from the source html of the homepage in w_j, matches the Twitter profile address. As with Twitter2Wikipedia, this mapping method is error-prone.
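A minimal sketch of the URL comparison underlying M2 (and, symmetrically, M3) follows; the normalization rules are our own assumption, since the exact matching details are not spelled out above:

```python
from urllib.parse import urlparse

def normalize(url: str) -> str:
    """Reduce a URL to a comparable key: drop the scheme, a leading
    'www.' and trailing slashes, and lowercase the rest."""
    if "://" not in url:
        url = "http://" + url
    parsed = urlparse(url)
    host = parsed.netloc.lower().removeprefix("www.")
    return (host + parsed.path.rstrip("/")).lower()

def m2_match(profile_urls: list, homepage: str, wiki_address: str) -> bool:
    """Twitter2Wikipedia: does any URL found in the profile (url field or
    description) match the DBpedia homepage property or the Wikipage address?"""
    targets = {normalize(homepage), normalize(wiki_address)}
    return any(normalize(u) in targets for u in profile_urls)

# e.g. m2_match(["katyperry.com"], "http://www.katyperry.com",
#               "https://en.wikipedia.org/wiki/Katy_Perry")  ->  True
```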
For each of the above three approaches we add three additional mapping functions ESM1, ESM2, ESM3, where each mapping function is defined as:

ESM_k(f_i) = w_j, if M_k(f_i) = w_j and f_i.name = w_j.title;
ESM_k(f_i) = λ, otherwise.

In other words, ESM_k "reinforces" the result of M_k if the name field in the Twitter profile perfectly matches the title of the Wikipedia page. Note that this is often not the case, as for @realDonaldTrump.
3. Ensemble Voting - For a given Twitter profile f_i, the ensemble voting mechanism selects the Wikipage w_j for which there is maximum agreement among the 6 mapping functions (M1, M2, M3, ESM1, ESM2, ESM3), with at least 2 mapping functions M_j, M_k in agreement (j ≠ k). The threshold 2 has been empirically selected to obtain the best compromise between the number of mapped interests and precision, as detailed in Section 6.
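A minimal sketch of this voting scheme, with λ represented as None:

```python
from collections import Counter

def ensemble_vote(candidates: list, min_agreement: int = 2):
    """Majority voting over the six mapper outputs (M1..M3, ESM1..ESM3).
    `candidates` holds one proposed Wikipage (or None for lambda) per mapper;
    a page is accepted only if at least `min_agreement` mappers agree."""
    votes = Counter(c for c in candidates if c is not None)
    if not votes:
        return None
    page, count = votes.most_common(1)[0]
    return page if count >= min_agreement else None

# Example: four mappers abstain, M2 and ESM2 agree on the same page.
print(ensemble_vote([None, "Katy Perry", None, None, "Katy Perry", None]))
# -> "Katy Perry"
```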
6 Wiki-MID statistics and evaluation
The outlined process has been applied to two streams of Twitter data, in English and Italian, extracted during 6 months (April-September 2017) using the Twitter APIs. We collected the maximum allowed Twitter traffic of English users mentioning service-related hashtags (e.g., #NowPlaying for Spotify), and the full stream of messages in Italian, since they do not exceed the maximum. As a final result, we obtained for a large number of users a variety of interests along with their corresponding Wikipedia pages. An excerpt of a Twitter user's interests is shown in Table 1. In the example, we selected two interests from each of the four sources from which they have been induced: IMDb (movies), Goodreads (books), Spotify (music) and the user's topical friends.
USER ID: 787930***

| Source | Interest | Wikipage |
| IMDb | Eyes Wide Open - 2009 - movie | WIKI:EN:Eyes Wide Open (2009 film) |
| IMDb | Okja - 2017 - movie | WIKI:EN:Okja |
| Goodreads | The Beautifull Cassandra - Jane Austen | WIKI:EN:Jane Austen |
| Goodreads | The Beach - Alex Garland | WIKI:EN:The Beach (novel) |
| Spotify | I Don't Know What I Can Save You From - Kings of Convenience | WIKI:EN:Kings of Convenience |
| Spotify | Nothing Matters When We're Dancing - The Magnetic Fields | WIKI:EN:The Magnetic Fields |
| Topical friends | @IMDb | WIKI:EN:IMDb |
| Topical friends | @UNICEF uk | WIKI:EN:UNICEF UK |
| Topical friends | @TheMagFields | WIKI:EN:The Magnetic Fields |
| Topical friends | @BarackObama | WIKI:EN:Barack Obama |
| Topical friends | @Spotify | WIKI:EN:Spotify |

Table 1. Excerpt of a Twitter user's interests
Although a detailed analysis of interest categories is deferred to further studies, the example shows the common trend that a user's interests, whether extracted from his/her messages or from topical friends, are strongly related, and in some cases identical. For example, the user in Table 1 frequently accesses the IMDb and Spotify services, and he/she is also a follower of the IMDb and Spotify Twitter accounts. Furthermore, his/her interest in the band The Magnetic Fields emerges from both source types.
Overall, we followed 444,744 English-speaking and 25,135 Italian-speaking users (the set U) who accessed at least one of the services mentioned in Section 4.1. Tables 2 and 3 show general statistics of the interests extracted from users' messages, for English and Italian speaking users respectively. In the English dataset we crawled more than 20M tweets from these users, of which about 2.7M could be associated to the URL of a corresponding book, movie or song. On average, we collected 6 interests per user. What is more, several users have interests in at least two of the three domains. Figure 2 compares the Venn diagram of interest types in our dataset (left) with that reported in [29] (right), to demonstrate the superior coverage of our dataset, even when considering only preferences extracted from users' messages. The last line of Tables 2 and 3 (precision) shows that the methodology to extract and map preferences from messages is very reliable. We evaluated the precision (two judges with adjudication) on a randomly selected balanced sample of 1,200 songs, books, and movies in English, obtaining a precision of 96% with a k-Fleiss Inter-Annotator Agreement (IAA) of 1⁸. For the Italian dataset, we evaluated 750 songs, books, and movies, obtaining a precision of 98%, and a k-Fleiss of 0.97.
⁸ The evaluation is rather straightforward, as readers may verify by inspecting the released dataset and mappings.
The number and variety of extracted preferences is mostly determined by the interests induced from users' topical friends, as shown in Table 4 (Table 5 for the Italian dataset). The average number of interests induced for each user is as high as 82, and the distribution is shown in Figure 3, left (English stream), and right (Italian stream). Figure 3 (left) shows, e.g., that there are 100,000 users in U with 100 interests induced from their topical friends. As far as the topical interests mapping performance is concerned, in [35] we estimated that inducing interests from topical friends and subsequently mapping them to Wikipedia with mapping method M1 has an accuracy of 84%. Since our aim in this work
message-based interests (|U| = 444,744 English-speaking users)

| | Music | Books | Movie | Total |
| Platform | Spotify | Goodreads | IMDb | All |
| #crawled tweets (tweets with selected hashtags) | 19,941,046 | 693,975 | 97,772 | 20,732,793 |
| #cleaned tweets (tweets for which a URL was extracted) | 2,519,166 | 139,882 | 88,355 | 2,747,403 |
| # of unique interests with a mapping to a Wikipage | 253,311 | 20,710 | 8,282 | 282,303 |
| average #interests per user | 6 | 8 | 6 | 6 |
| average #users per interest | 7 | 3 | 7 | 6 |
| precision of Wikipedia mapping (on 3 samples of 400 items each) | 94% | 96% | 97% | 96% |

Table 2. 6-month (April-September 2017) statistics on message-based interests extracted from English-speaking users
message-based interests (|U| = 25,135 Italian-speaking users)

| | Music | Books | Movie | Total |
| Platform | Spotify | aNobii | IMDb + TVShowTime | All |
| #crawled tweets (tweets with selected hashtags) | 273,256 | 12,198 | 2,229 | 287,683 |
| #cleaned tweets (tweets for which a URL was extracted) | 70,330 | 12,193 | 2,119 | 84,642 |
| # of unique interests with a mapping to a Wikipage | 9,926 | 4,690 | 279 | 14,895 |
| average #interests per user | 3 | 9 | 7 | 6 |
| average #users per interest | 5 | 2 | 5 | 4 |
| precision of Wikipedia mapping (on 3 samples of 250 items each) | 96% | 98% | 100% | 98% |

Table 3. 6-month (April-September 2017) statistics on message-based interests extracted from Italian-speaking users
[Fig. 2. Venn diagram of message-based interest types for our English dataset (left) and the dataset in Dooms et al. [29] (right)]
is to generate a highly accurate dataset, first, we used an ensemble of methods, as detailed in Section 5.2, and furthermore, we considered only the subset F′_t of F_t with indegree (with respect to our population U) higher than 40. In fact, we noted that less popular topical friends may still include bloggers or Twitter
Interests induced from topical friends (|U| = 444,744 English-speaking users)

| # of topical friends F′_t with indegree ≥ 40 in U | 409,743 |
| # of unique interests with a mapping to a Wikipage | 58,789 |
| average #interests per user | 82 |
| precision of Wikipedia mapping (tested on a sample of 1,250 items in F′_t) | 90% |

Table 4. 6-month (April-September 2017) statistics on interests induced from topical friends of English-speaking users
Interests induced from topical friends (|U| = 25,135 Italian-speaking users)

| # of topical friends F′_t with indegree ≥ 42 in U | 29,075 |
| # of unique interests with a mapping to a Wikipage | 4,580 |
| average #interests per user | 41.96 |
| precision of Wikipedia mapping (tested on a sample of 1,250 items in F′_t) | 90% |

Table 5. 6-month (April-September 2017) statistics on interests induced from topical friends of Italian-speaking users
users for whom, despite some popularity, a Wikipage does not exist. In these cases, our methodology may suggest false positives. When applying the indegree filter, the precision - manually evaluated with adjudication on 1,250 accounts randomly chosen in this restricted population F′_t - is as high as 90%, as shown in the last line of Tables 4 and 5. The k-Fleiss IAA are 0.95 and 0.92, respectively. We remark that we are not concerned here with measuring the recall, since the objective is to release a dataset with high precision and high coverage, in terms of the number of interests per user, over the considered populations. To this end, the indegree threshold 40 was selected upon repeated experiments to obtain the best trade-off between the distribution of interests in the population U and the precision of the Wikipedia mapping.
Concerning coverage, when merging the two sources of information, our English dataset includes an average of 90 interests per user for about 450k users, and a total of 282,303 + 58,789 = 341,092 unique interests in a large variety of domains. As a comparison, even when considering single domains, the largest available datasets⁹, like MovieLens and BookCrossing, do not exceed 150,000 users and 250,000 items, with a much lower density in terms of interests per user - although these resources provide ranked preferences rather than unary ones, as in Wiki-MID. Even the popular Million Song Dataset Challenge [18] consists of a larger set of users (1.2 million users) but a comparable number of unique interests in a single domain (380,000 songs). To the best of our knowledge, this is the largest freely available multi-domain interest dataset reported in the literature, and furthermore, we provide the unique feature of a reliable mapping to Wikipedia.
[Fig. 3. Distribution of interests induced from users' topical friends: English dataset (left) and Italian dataset (right)]
[Fig. 4. The data model adopted for the design of our resource: a sioc:UserAccount sioc:follows an interest (a topical friend) and sioc:likes a resource (a message-based item); both are linked to the corresponding Wikipedia page via skos:relatedMatch.]
7 The Wiki-MID resource
Our resource is designed on top of the Semantically-Interlinked Online Communities (SIOC) core ontology¹⁰. The SIOC ontology favors the inclusion of data mined from social network communities into the Linked Open Data (LOD) cloud. As shown in Figure 4, we represent Twitter users as instances of the SIOC UserAccount class. Topical friends and message-based user interests are then associated, through the Simple Knowledge Organization System (SKOS)¹¹ predicate relatedMatch, with the corresponding Wikipedia page, as a result of our automated mapping methodology. We release both the dataset and the related software at http://wikimid.tweets.di.uniroma1.it/wikimid/ under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License.
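As a hedged illustration of this data model, the following rdflib sketch builds the triples of Figure 4 for the user of Table 1; the entity URIs minted under the wikimid namespace are illustrative, not the identifiers actually used in the released dataset:

```python
from rdflib import Graph, Namespace, URIRef

SIOC = Namespace("http://rdfs.org/sioc/ns#")
SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")
WMID = Namespace("http://wikimid.tweets.di.uniroma1.it/wikimid/")  # illustrative base

g = Graph()
g.bind("sioc", SIOC)
g.bind("skos", SKOS)

user = WMID["user/787930xxx"]                       # a sioc:UserAccount
friend = WMID["account/TheMagFields"]               # a topical friend
song = WMID["item/NothingMattersWhenWereDancing"]   # a message-based interest
wiki = URIRef("https://en.wikipedia.org/wiki/The_Magnetic_Fields")

g.add((user, SIOC.follows, friend))       # interest induced from a followee
g.add((user, SIOC.likes, song))           # interest extracted from a message
g.add((friend, SKOS.relatedMatch, wiki))  # both resolve to the same Wikipage
g.add((song, SKOS.relatedMatch, wiki))

print(g.serialize(format="turtle"))
```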
8 Concluding Remarks
In this paper we presented Wiki-MID, a LOD-compliant resource that captures Twitter users' interests in multiple domains. With respect to other available datasets for Recommender Systems, our resource has several unique features:
1) users' interests are induced from their messages and authoritative ("topical") friends, and associated with corresponding Wikipedia articles, thus providing a means to derive a semantic categorization of interests through the exploitation of available resources linked to Wikipedia, such as the Wikipedia Category Graph, DBpedia, BabelNet, and others;
2) for every user, we are hence able to extract in two languages (English and Italian) a variety of interests in multiple categories, such as art, science, entertainment, politics, sport and more;
3) the dimension of the dataset is comparable with the largest single-domain interest datasets in the literature, and the average number of multi-domain interests per user is far larger than in other multi-domain datasets.

⁹ https://www.kdnuggets.com/2016/02/nine-datasets-investigating-recommender-systems.html
¹⁰ http://rdfs.org/sioc/spec/sioc.html
¹¹ http://www.w3.org/2004/02/skos/core.html
Note further that, as shown in Section 6, extracting interests from messages and topical friends, and subsequently mapping them to Wikipedia, is a very reliable process (4% error rate for message-induced interests and 10% for friendship-induced ones). In addition, the availability of semantic resources linked to Wikipedia offers the possibility to identify for each user the "dominant" interest categories, on which recommenders could rely when suggesting new items. We leave the exploitation of these features to future research.
Acknowledgments. This work has been supported by the IBM Faculty Award
#2305895190 and by the MIUR under the grant "Dipartimenti di eccellenza 2018-2022" of the Department of Computer Science of Sapienza University.
References
1. Linden, G., Smith, B., York, J.: Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing 7(1) (2003) 76–80
2. Davidson, J., Liebald, B., Liu, J., et al.: The youtube video recommendation
system. In: Proc. of the 4th RecSys, ACM (2010) 293–296
3. Fouss, F., Saerens, M.: Evaluating performance of recommender systems: An experimental comparison. In: Proc. of WI-IAT 2008, Volume 1, IEEE (2008) 735–738
4. Felfernig, A., Jeran, M., Ninaus, G., Reinfrank, F., Reiterer, S., Stettinger, M.:
Basic approaches in recommendation systems. In: RSSE. Springer (2014) 15–37
5. Schafer, J.B., Frankowski, D., Herlocker, J., Sen, S.: Collaborative filtering recom-
mender systems. In: The adaptive web. Springer (2007) 291–324
6. Pazzani, M.J., Billsus, D.: Content-based recommendation systems. In: The adap-
tive web. Springer (2007) 325–341
7. Trewin, S.: Knowledge-based recommender systems. Encyclopedia of Library and Information Science 69(Supplement 32) (2000) 180
8. Burke, R.: Hybrid recommender systems: Survey and experiments. User modeling
and user-adapted interaction 12(4) (2002) 331–370
9. Gunawardana, A., Shani, G.: A survey of accuracy evaluation metrics of recom-
mendation tasks. JMLR 10(Dec) (2009) 2935–2962
10. Dror, G., Koenigstein, N., Koren, Y., Weimer, M.: The Yahoo! Music dataset and KDD-Cup'11. In: Proc. of KDD Cup 2011. (2012) 3–18
11. Shepitsen, A., Gemmell, J., Mobasher, B., Burke, R.: Personalized recommendation
in social tagging systems using hierarchical clustering. In: RecSys, 2008, ACM
12. Kamishima, T., Akaho, S.: Nantonac collaborative filtering: A model-based ap-
proach. In: Proc. of the 4th RecSys, ACM (2010) 273–276
13. Sawant, S., Pai, G.: Yelp food recommendation system (2013)
14. Wang, H., Lu, Y., Zhai, C.: Latent aspect rating analysis on review text data: a
rating regression approach. In: Proc. of the 16th ACM SIGKDD. (2010) 783–792
15. Mavalankar, A.A., et al.: Hotel recommendation system. Internal Report (2017)
16. Çano, E., Morisio, M.: Characterization of public datasets for recommender systems. In: RTSI, IEEE 1st International Forum on, IEEE (2015) 249–257
17. Harper, F.M., Konstan, J.A.: The MovieLens datasets: History and context. ACM TiiS (2016)
18. McFee, B., Bertin-Mahieux, T., Ellis, D.P., Lanckriet, G.R.: The million song
dataset challenge. In: Proc. of the 21st WWW, ACM (2012) 909–916
19. Bennett, J., Lanning, S., et al.: The netflix prize. In: Proc. of KDD, NY (2007)
20. Yan, M., Sang, J., Xu, C.: Mining cross-network association for youtube video
promotion. In: Proc. of the 22nd ACMMM, ACM (2014) 557–566
21. Piao, G., Breslin, J.G.: Inferring user interests in microblogging social networks:
A survey. In: arXiv:1712.07691v3. (2017)
22. Chaabane, A., Acs, G., Kaafar, M.A., et al.: You are what you like! Information leakage through users' interests. In: Proc. of the 19th NDSS Symposium. (2012)
23. Faralli, S., Stilo, G., Velardi, P.: Large scale homophily analysis in Twitter using a Twixonomy. In: Proc. of the 24th IJCAI, Buenos Aires, Jul. 25-31, 2015. (2015) 2334–2340
24. Piao, G., Breslin, J.G.: Inferring user interests for passive users on twitter by
leveraging followee biographies. In: Proc. of ECIR. (2017)
25. Pichl, M., Zangerle, E., Specht, G.: #nowplaying on #Spotify: Leveraging Spotify information on Twitter for artist recommendations. In: ICWE. (2015) 163–174
26. Kapanipathi, P., Jain, P., Venkataramani, C., Sheth, A.: User interests identification on Twitter using a hierarchical knowledge base. In: The Semantic Web: Trends and Challenges, LNCS (2014)
27. Schinas, E., Papadopoulos, S., Diplaris, S., Kompatsiaris, Y., Mass, Y., Herzig, J.,
Boudakidis, L.: Eventsense: Capturing the pulse of large-scale events by mining
social media streams. In: Proc. of the 17th PCI, ACM (2013) 17–24
28. Nichols, J., Mahmud, J., Drews, C.: Summarizing sporting events using twitter.
In: Proc. of the 2012 Int. Conf. on Intelligent User Interfaces, ACM (2012) 189–198
29. Dooms, S., De Pessemier, T., Martens, L.: Mining cross-domain rating datasets
from structured data on twitter. In: Proc. of the 23rd WWW, ACM (2014) 621–624
30. Barbieri, N., Bonchi, F., Manco, G.: Who to follow and why: link prediction with
explanations. In: Proc. of the 20th ACM SIGKDD, ACM (2014) 1266–1275
31. Myers, S.A., Leskovec, J.: The bursty dynamics of the twitter information network.
In: Proc. of the 23rd WWW, ACM (2014) 913–924
32. Pichl, M., Zangerle, E., Specht, G.: Combining spotify and twitter data for gener-
ating a recent and public dataset for music recommendation. In: Grundlagen von
Datenbanken. (2014) 35–40
33. Besel, C., Schlötterer, J., Granitzer, M.: Inferring semantic interest profiles from Twitter followees: Does Twitter know better than your friends? In: Proc. of SAC '16 (2016)
34. Nechaev, Y., Corcoglioniti, F., Giuliano, C.: Sociallink: Linking dbpedia entities
to corresponding twitter accounts. In: The Semantic Web – ISWC 2017. (2017)
35. Faralli, S., Stilo, G., Velardi, P.: Automatic acquisition of a taxonomy of microblogs
users’ interests. Journal of Web Semantics (2017)
36. Navigli, R., Ponzetto, S.P.: BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence 193 (2012) 217–250
37. Delli Bovi, C., Telesca, L., Navigli, R.: Large-scale information extraction from textual definitions through deep syntactic and semantic analysis. TACL 3 (2015)
The MovieLens datasets are widely used in education, research, and industry. They are downloaded hundreds of thousands of times each year, reflecting their use in popular press programming books, traditional and online courses, and software. These datasets are a product of member activity in the MovieLens movie recommendation system, an active research platform that has hosted many experiments since its launch in 1997. This article documents the history of MovieLens and the MovieLens datasets. We include a discussion of lessons learned from running a long-standing, live research platform from the perspective of a research organization. We document best practices and limitations of using the MovieLens datasets in new research.