Methods for User Profiling Across Social Networks

April 2020

Authors:

Rishabh Kaushal

Indira Gandhi Delhi Technical University for Women

Vasundhara Ghose

Indira Gandhi Delhi Technical University for Women

Ponnurangam Kumaraguru

International Institute of Information Technology, Hyderabad


Methods for User Profiling Across Social Networks

Rishabh Kaushal (1,2), Vasundhara Ghose (1), and Ponnurangam Kumaraguru (2)

(1) Department of Information Technology, Indira Gandhi Delhi Technical University for Women, Delhi, India

(2) Precog Research Group, Indraprastha Institute of Information Technology, Delhi, India

Abstract—Users have accounts on multiple Online Social Networks (OSNs) to access a variety of content and connect with their friends. Consequently, user behavior is distributed across many OSNs. The collection of comprehensive user information is referred to as user profiling; an essential first step is to link user accounts (identities) belonging to the same individual across OSNs. To this end, we provide a detailed methodology for five methods useful for user profiling, which we refer to as Advanced Search Operator (ASO), Social Aggregator (SA), Cross-Platform Sharing (CPS), Self-Disclosure (SD), and Friend Finding Feature (FFF). Taken together, these methods yield linked identities of 208,120 individuals distributed across 43 different OSNs. We compare the methods quantitatively based on social network coverage and the number of linked identities obtained per individual. We also perform a qualitative assessment of the linked user data obtained by these methods on the criteria of completeness, validity, consistency, accuracy, and timeliness.

Index Terms—User Profiling, Social Media Analysis, Online Social Networks.

I. INTRODUCTION

The popularity of Online Social Networks (OSNs) is increasing by the day, with more and more people joining multiple OSNs to share information about themselves, connect to other users, and receive updates from them. Each OSN offers a unique service or ecosystem, which attracts users to join more than one of them: Facebook for personal friends, LinkedIn for the professional network, YouTube for viewing and sharing videos, and Twitter for quick updates. The average number of social media accounts per online user rose from 4.3 in 2013 to 7.6 in 2017 (footnote 1). To collect user information comprehensively, an essential first step is to gather the user accounts (identities) of the same individual across multiple OSNs, which we refer to as linked identities. The systematic approach to performing a large-scale collection of user behaviors across OSNs is referred to as user profiling [1].

User profiling has many advantages and applications. Users tend to provide incomplete information on a single social network, either on purpose or otherwise. Knowing the same user's identity on other social networks helps build a comprehensive profile of the user in terms of the user's profile, content, behavior, preferences, and friends. In the advertising world, it enables targeted advertisement [2] and improved recommendations. Researchers have studied most problems in the domain of social networks, such as information propagation, link prediction, algorithmic biases, discrimination studies, and community detection, within a single social network; with linked identities, these problems can be studied across multiple social networks. In social media crimes and cybersecurity problems such as cyberbullying, fake accounts, and spamming, investigators often look for user footprints only within the social network in which the incident occurred. Knowing the user's identities on other social networks would help the investigation [3]. From a user's privacy standpoint, individuals can be shown their comprehensive profiles and the likelihood of linkage of their identities, and nudged to control their online behavior so that their digital footprint decreases [4]. Lastly, there is no agreed benchmark dataset in the problem domain of identity resolution, so a large-scale collection of linked identities would help researchers compare and evaluate their proposed solutions.

Footnote 1: https://www.statista.com/statistics/788084/number-of-social-media-accounts/

Given the significance of user profiling, the research community has placed considerable emphasis on the first step in user profiling, which involves linking user identities belonging to the same person and is referred to as identity resolution (or identity linkage) (footnote 2). A data-driven approach to the identity resolution problem has two key steps. First, we collect a large number of user identity pairs belonging to linked identities and non-linked identities. Second, we construct a machine-learning model over the behavioral features extracted from the user identities. In this paper, we focus on the first step, the collection of linked identities across OSNs. Figure 1 depicts the before and after stages involved in linked identity collection for user profiling. We explain five methods to obtain linked identities, namely Advanced Search Operator (ASO), Social Aggregator (SA), Cross-Platform Sharing (CPS), Self-Disclosure (SD), and Friend Finding Feature (FFF). Taking all these methods together, we collect linked identities of 208,120 individuals across 43 different OSNs, which is, to the best of our knowledge, the most comprehensive OSN coverage for user profiling; see http://precog.iiitd.edu.in/resources.html for dataset details. Subsequently, we present a detailed quantitative and qualitative assessment of these methods. For quantitative assessment, we evaluate the number of social networks covered by a method and the number of linked identities obtained per individual across OSNs. For qualitative assessment, we leverage standard parameters from ISO 9000:2015 (footnote 3), namely data completeness, consistency, accuracy, validity, availability, and timeliness.

Footnote 2: This problem is known in the literature by multiple names, such as Social Identity Linkage [5], User Identity Linkage [6], User Identity Resolution, Social Network Reconciliation [7], User Account Linkage Inference [8], Profile Linkage [9], Anchor Link Prediction [10], and Detecting "me" edges [11].

Fig. 1. Visual depiction of the progressive stages in which linked identities are collected, starting from no linked identities and gradually progressing to collect as many of them as possible by applying methods for user profiling.

Collecting user data from online social networks has always been a challenge, and because the data relates to users, there are privacy issues as well. The capabilities of the Application Programming Interfaces (APIs) offered by OSNs have dwindled over the years owing to data privacy concerns. The recent data breach (footnote 4) involving Facebook and Cambridge Analytica is likely to have adverse implications for data collection by academics for research purposes (footnote 5). With all this happening, users are becoming even more privacy-aware, which dissuades them from mentioning all details in their accounts and results in missing values when data is collected. To make things worse, social network platforms such as Twitter allow users to change their account handles, thereby complicating the data collection process.

Regardless, to the best of our knowledge, this is the first work that focuses exclusively on the methods for collecting linked identities, which is the de facto first step for user profiling. The key contributions of our work are:

• Detailed description of data collection methods to retrieve linked identities, thereby facilitating user profiling.
• Comprehensive evaluation of the data collection methods, both qualitatively and quantitatively.
• Creation of a comprehensive dataset that can be used as a benchmark dataset for identity resolution research.

II. RELATED WORK

One of the earliest works is that of Perito et al. [12], who applied various data collection methods to obtain usernames belonging to the same individual across many sites. The prominent sites used were Google profiles and eBay, from which they collected 3.5 million and 6.5 million usernames, respectively. For ground truth, they relied on Google profile users who had listed their usernames on other sites. Malhotra et al. [13] used the Google API to retrieve the list of user accounts declared by users on their Google profiles. They collected ground truth from Twitter, YouTube, and Flickr; however, there were many missing fields. Twitter and LinkedIn were the other platforms used, yielding 29,129 pairs of user accounts. Zafarani et al. [14] collected usernames of the same real-world user across 32 different sites. The three primary sources of these matching usernames were social network sites such as Facebook or Google+ (where users mention their usernames on other websites), blogs, and blog advertisement portals. Goga et al. [15] collected identical usernames on three social networks, namely Twitter, Yelp, and Flickr, using the 'Friend Finder' mechanism present on these sites. Further, in [16], they focused on Twitter, Facebook, Google+, Flickr, and Myspace. User features from these social networks were obtained using their APIs. For ground truth, however, 3 million Google+ profiles were randomly crawled to exploit the fact that users list their social network accounts on their Google+ profile pages. Iofciu et al. [17] used the Social Graph API to crawl 421,188 public user profiles on Flickr, Delicious, and StumbleUpon (FDS dataset). Kong et al. [10] considered Twitter and Foursquare as the two real-world networks for their investigation. For ground truth, they used users who had mentioned their Twitter IDs on their Foursquare account pages.

Mu et al. [6] considered the Chinese social networks Weibo, Renren, 36.cn, and Zhaopin for studying identity resolution. For ground truth, they annotated 2,186 pairs of user accounts across these social network pairs. Man et al. [18] built their first dataset by crawling Facebook users; after deleting users with fewer than five friends, 40,710 users with 766,519 connections remained. Their second dataset was a co-author network formed from papers published in conferences in the domains of Data Mining and Artificial Intelligence. Peled et al. [19] collected data by web crawling from two social networks, Xing and Facebook. They manually obtained seed profiles for crawling pairs of user profiles, one from each network, that belong to the same individual. A tool was developed to extract user data such as gender, name, education, professional experience, and friend list. Labitzke et al. [20] collected 110,000 Facebook profiles, more than 43,000 StudiVZ profiles, more than 25,000 MySpace profiles, and more than 10,000 Xing profiles. Liu et al. [21] collected their dataset through a survey of 153 respondents and an analysis of 75,472 users on the About.me website. For ground truth, they hired one human annotator who examined user profile pages, with details related to users and their posts, to mark true positive identities. Riederer et al. [22] collected datasets of location-based data extracted from OSNs. Most interestingly, they also included Call Data Records (CDRs) to make use of cell phone location tracking. The datasets considered were Foursquare-Twitter-Instagram and cellphone-credit card records. They collected ground truth from the publicly available dataset and cell-bank. Zhang et al. [9] used a synthetic network based on Renren, Facebook, Sima, WeChat, and Twitter. They collected ground truth using page crawling and open APIs wherever possible, depending on the platform chosen.

Our work differs from prior work in that we focus on methods to collect linked identities to enable user profiling, and we present a comparative assessment of these methods on qualitative and quantitative parameters.

Footnote 3: International Standards Organization: https://www.iso.org/standard/45481.html

Footnote 4: https://www.theguardian.com/news/2018/mar/17/cambridge-analytica-facebook-influence-us-election

Footnote 5: https://scroll.in/article/872770/cambridge-analytica-scandal-could-hurt-legitimate-researchers-using-facebook-data

Fig. 2. Generic Framework for User Profiling.

III. METHODOLOGY

A generic framework for user profiling (Figure 2) comprises three steps, namely data collection, data integration, and data extraction and indexing. The first step is data collection, in which we identify a source of data and then select data collection methods. The second step is data integration, in which we store the user identities collected from all methods in a single data store, which we refer to as the Linked Identity Data Store (LIDS). Finally, data extraction and indexing involves collecting the three components of a user identity, namely profile, content, and network. Next, we present the detailed methodology adopted to perform data collection using the five methods, which is the focus of this paper.
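As a minimal sketch of the data integration step, LIDS can be pictured as a mapping from an individual to the identities found on each OSN, with the originating method kept for provenance. The class name, the method names, and the choice of an in-memory dictionary are illustrative assumptions; the paper does not specify how LIDS is implemented.

```python
from collections import defaultdict


class LinkedIdentityDataStore:
    """Hypothetical in-memory stand-in for LIDS: one record per individual."""

    def __init__(self):
        # individual_id -> {osn_name: set of (handle, method) pairs}
        self._records = defaultdict(lambda: defaultdict(set))

    def add(self, individual_id, osn, handle, method):
        """Store one identity; `method` (e.g. "ASO") records where it came from."""
        self._records[individual_id][osn.lower()].add((handle, method))

    def linked_identities(self, individual_id):
        """Return {osn: sorted unique handles} for one individual."""
        record = self._records.get(individual_id, {})
        return {
            osn: sorted({handle for handle, _method in pairs})
            for osn, pairs in record.items()
        }
```

Keeping the method alongside each handle would allow the per-method counts reported later to be recomputed from the integrated store.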

A. Advanced Search Operator (ASO)

Search engines typically provide advanced search operators with which users can obtain more detailed and specific information. In this work, we leverage Google's advanced search operators, a technique also referred to as Google hacking or Google dorking (Figure 3). For instance, the search query intext:facebook.com,twitter.com filetype:xlsx would locate all web documents that have facebook.com or twitter.com written as text anywhere in the document, with the additional constraint that these documents must be of the xlsx file type. As per Figure 3, we first run a script that searches a specific search engine using pre-configured search queries. The downloaded files are filtered and subsequently read by automated scripts, which store the linked identities in LIDS.

Fig. 3. Pipeline for Advanced Search Operator (ASO) method.
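The query construction and the reading of downloaded files can be sketched as follows. This is an illustration under stated assumptions rather than the authors' scripts: the function names, the URL pattern, and the rule that a row must mention at least two OSNs to count as a linked identity are all hypothetical, and each spreadsheet row is assumed to have already been read into a list of cell values.

```python
import re

# Hypothetical pattern: a profile URL on one of three well-known OSNs.
PROFILE_URL = re.compile(
    r"(?:https?://)?(?:www\.)?(facebook|twitter|instagram)\.com/([A-Za-z0-9_.\-]+)",
    re.IGNORECASE,
)


def build_query(domains, filetype):
    """Compose a query in the form shown in the text."""
    return f"intext:{','.join(domains)} filetype:{filetype}"


def linked_identities_from_row(cells):
    """Return {osn: handle} when one row names accounts on two or more OSNs."""
    found = {}
    for cell in cells:
        for osn, handle in PROFILE_URL.findall(str(cell)):
            # Keep the first handle seen for each OSN in this row.
            found.setdefault(osn.lower(), handle)
    # A single account is not a link, so such rows are discarded.
    return found if len(found) >= 2 else {}
```

For example, `build_query(["facebook.com", "twitter.com"], "xlsx")` reproduces the query quoted above.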

B. Social Aggregator (SA)

There are several social aggregator websites on which users create an account and provide details of their multiple OSN accounts. One such site that we investigate is about.me (footnote 6), which offers its users a platform to list numerous user identities, external websites, and accounts on well-known social networking websites such as Facebook, Flickr, Google+, Pinterest, LinkedIn, Twitter, Tumblr, and YouTube. Users create one-page descriptions giving details of their social media profiles along with a background image and an abbreviated biography. When we started data collection using this method, about.me provided an option to search user profiles by topic (referred to as the discovery feature). Given an interest topic as input, it would return all user profiles having that interest. After one month of data collection, in March 2018, we found that this discovery feature of about.me had been discontinued. Subsequently, on exploring other options, we found a public dataset (footnote 7) containing about.me profiles, which we use in this work.

Footnote 6: About.me: https://about.me/

Fig. 4. Pipeline for Social Aggregator (SA) Method.

In addition, we leverage the ASO method described earlier, using interests as the intext operator and about.me as the site operator, to obtain more user profiles. Figure 4 shows the three data sources employed in this method.
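Whichever of the three sources a profile comes from, the links listed on it have to be mapped to OSN identities. A minimal sketch of that mapping is given below; the domain table, the function name, and the assumption that the handle is the first path segment of the link are illustrative and would need extending for sites that encode the handle differently.

```python
from urllib.parse import urlparse

# Hypothetical domain-to-OSN table; a real run would cover many more sites.
OSN_BY_DOMAIN = {
    "facebook.com": "Facebook",
    "twitter.com": "Twitter",
    "pinterest.com": "Pinterest",
}


def identities_from_links(links):
    """Map the links on one aggregator profile to {osn: handle}."""
    identities = {}
    for link in links:
        # Tolerate links written without a scheme.
        parsed = urlparse(link if "://" in link else "https://" + link)
        domain = parsed.netloc.lower()
        if domain.startswith("www."):
            domain = domain[4:]
        # Assumption: the handle is the first segment of the path.
        handle = parsed.path.strip("/").split("/")[0]
        osn = OSN_BY_DOMAIN.get(domain)
        if osn and handle:
            identities[osn] = handle
    return identities
```

Links to domains outside the table, such as personal blogs, are simply ignored in this sketch.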

C. Cross-Platform Sharing (CPS)

Many OSNs provide an option to share content on other (target) OSNs, which we refer to as cross-platform sharing (CPS). As depicted in Figure 5, a user makes a post on the source network (say Zomato, Facebook, or Instagram) and subsequently shares the same post on the target network (Twitter). Such shared content appears on the target network with a specific pattern. In our work, with Twitter as the target network and Instagram as the source network, the pattern that appears in the shared post is instagram.com/p/. Using the API provided by the target network, we search for posts that contain such patterns. Besides this pattern, we also check the source field present in the Tweet JSON object and make sure that it contains the name of the source network (in our case, Instagram). This check filters out cases in which a user might have copy-pasted the URL pattern, because such cases are not guaranteed to link to the same individual across the two networks. We parse the collected posts from the target network, identify the URL, and expand it to reach the desired content on the source social network. There, we either use the source social network's API or scrape the post page to obtain the tagged user (mentioned user) in the post on the source social network. In this way, we obtain a linked identity pair between the source and target social networks.

Footnote 7: http://scholarbank.nus.edu.sg/bitstream/10635/137403/2/about me.sql

Fig. 5. Pipeline for Cross Platform Sharing (CPS) method.
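The two checks on the target network (URL pattern and source field) can be sketched as a filter over already-collected tweets. The sketch assumes each tweet is a dictionary shaped like a Twitter API v1.1 Tweet object, with `source`, `entities.urls[].expanded_url`, and `user.screen_name`; the function name and the exact regular expression are hypothetical, and resolving the Instagram username from the post page is left out.

```python
import re

# Pattern of a shared Instagram post URL, as described in the text.
INSTAGRAM_POST = re.compile(r"instagram\.com/p/[A-Za-z0-9_\-]+", re.IGNORECASE)


def shared_instagram_posts(tweets):
    """Return (twitter_handle, instagram_post_url) pairs for genuine cross-posts."""
    pairs = []
    for tweet in tweets:
        # The source field names the posting client; skip copy-pasted links
        # that were tweeted from any client other than Instagram.
        if "instagram" not in tweet.get("source", "").lower():
            continue
        for url in tweet.get("entities", {}).get("urls", []):
            expanded = url.get("expanded_url", "")
            if INSTAGRAM_POST.search(expanded):
                pairs.append((tweet["user"]["screen_name"], expanded))
    return pairs
```

Each returned pair still needs the final step described above, in which the post URL is visited to obtain the account on the source network.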

D. Self-Disclosure (SD)

Whenever a user signs up on an OSN, there is an option to provide a user description. At times, users provide details of their identities on other OSNs in this description, which we refer to as self-disclosure. More specifically, we focus on the user's bio field on Twitter (Figure 6). We first use the Twiangulate web tool (footnote 8) to collect all Twitter profiles that have at least one social network mentioned in their bio field. We then observe various patterns in the bio field, because a user can specify other OSN details in multiple ways. For instance, one user may write "TV Host and Media Trainer - Instagram: @NeshanTVxyz Snapchat: @Neshaxyz", while another may use acronyms, as in "TV Host and Media Trainer - IG: @NeshanTVxyz SP: @Neshaxyz FB: nashbin123". To address these variations, we tokenize all text and check for the occurrence of URLs that could lead to other OSNs.
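To illustrate how the two bio variants above can be normalized, the sketch below extracts labelled handles and maps full names and acronyms to the same OSN. It is a swapped-in illustration covering only the "label: handle" patterns from the examples, not the URL-based check described in the text; the alias table, the regular expression, and the function name are assumptions.

```python
import re

# Assumed alias table; "SP" follows the acronym used in the example bio.
ALIASES = {
    "instagram": "Instagram", "ig": "Instagram",
    "snapchat": "Snapchat", "sp": "Snapchat",
    "facebook": "Facebook", "fb": "Facebook",
}

# A label, a colon, and a handle with an optional leading "@".
LABELLED_HANDLE = re.compile(r"\b([A-Za-z]+)\s*:\s*@?([A-Za-z0-9_.]+)")


def identities_from_bio(bio):
    """Return {osn: handle} for every recognized 'label: handle' pair in a bio."""
    identities = {}
    for label, handle in LABELLED_HANDLE.findall(bio):
        osn = ALIASES.get(label.lower())
        if osn:
            identities[osn] = handle
    return identities
```

Both example bios resolve to the same Instagram and Snapchat handles, and the second additionally yields a Facebook handle.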

E. Friend Finder Feature (FFF)

Whenever a user joins a new OSN, the user signs up with a unique identifier, such as an email address or phone number. The OSN uses this information to find the user's friends among their email or phone contacts and offers a friend finder option to help connect with those friends who already have an account on the OSN. Figure 7 depicts the entire sequence of steps that we followed in this method. In the first step, we use a search engine such as DuckDuckGo (footnote 9) to retrieve email addresses present on the web. Next, we create an email account and add the extracted email addresses to its contact list. We then sign up on a social network with the created account to exploit its friend finder feature. Finally, we use string matching on the display names of users to find identities belonging to the same user.
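The final string-matching step can be sketched with a similarity ratio over normalized names. The text does not say which matching technique was used, so the use of Python's `difflib.SequenceMatcher`, the normalization rule, the 0.8 threshold, and the function names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case a name and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def best_match(contact_name, display_names, threshold=0.8):
    """Return the display name most similar to the contact, or None if too dissimilar."""
    target = normalize(contact_name)
    best, best_score = None, 0.0
    for candidate in display_names:
        score = SequenceMatcher(None, target, normalize(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    # Reject weak matches rather than linking unrelated accounts.
    return best if best_score >= threshold else None
```

A stricter threshold lowers the risk of linking two different people with similar names, at the cost of missing some true matches.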

IV. RESULTS AND EVALUATION

In this section, we compare the five methods by performing a quantitative and qualitative assessment of the linked identities obtained by them. Table I lists the total identities collected by prior works along with the OSNs covered by them. A few of them have a higher number of identities; however, their coverage in terms of the number of OSNs reached is less than that of our dataset. Additionally, the datasets of the prior works listed in Table I are not publicly available, whereas we release our dataset; more details are at http://precog.iiitd.edu.in/resources.html. Table II summarizes the number of linked identities collected using each of the data collection methods. Among the five methods, Cross-Platform Sharing (CPS) yielded the maximum number of linked identities (104,233), with Twitter as the target network and Zomato, Facebook, and Instagram as the source networks from which posts were shared on Twitter. The Social Aggregator (SA) method using about.me gave 53,692 linked identities across the three approaches followed in it: the discovery feature contributed 15,973, the public dataset added 15,620, and the search-engine-based approach yielded 22,099. The Self-Disclosure (SD) method, which extracts identities by parsing the bio field on Twitter, gave 40,000 linked identities. We collected 9,695 identities using Advanced Search Operator (ASO) queries on Google. Lastly, the Friend Finder Feature (FFF) gave 500 linked identities.

Footnote 8: Twiangulate: http://twiangulate.com/search/

Footnote 9: DuckDuckGo: www.duckduckgo.com

Fig. 6. Pipeline for Self-Disclosure (SD) Method.

Fig. 7. Pipeline for Friend-Finder Feature method.

TABLE I
IDENTITIES COLLECTED IN PREVIOUS WORKS.

OSNs Covered with Reference                               Identities Collected
Twitter, Foursquare and Yelp [15]                         17,276
Twitter, Flickr, Facebook, Google+, Myspace, Yelp [16]    655,079
Social Graph API [17]                                     421,188
Facebook and Twitter [10]                                 500
StudiVZ, Facebook, Myspace and Xing [7]                   89,000
Facebook and Myspace [23]                                 5,296
Twitter, Flickr, LinkedIn and 12 more [21]                75,472
Weibo, Renren, 36.cn and Zhaopin [6]                      25,647
Facebook and Xing [24]                                    158
Twitter, Flickr [19]                                      27,000
32 Social Network sites [14]                              100,179
Foursquare, Twitter and Instagram [22]                    2,579
Twitter, YouTube and Flickr [13]                          41,336
Twitter and BlogCatalog [25]                              3,000
Facebook and MAG [18]                                     1,154

TABLE II
RESULTS OF DATA COLLECTION METHODS IMPLEMENTED IN THIS WORK. DATA COLLECTION IN EACH OF THEM IS CONTINUING AND NUMBERS ARE INCREASING BY THE DAY.

Data Collection Method              Linked Identities Collected So Far
Advanced Search Operator (ASO)      9,695
Social Aggregator (SA)              53,692
Cross-Platform Sharing (CPS)        104,233
Self-Disclosure (SD)                40,000
Friend Finder Feature (FFF)         500
Total Linked Identities             208,120

Fig. 8. Distribution of the OSNs on which linked identities were collected using the Advanced Search Operator (ASO) and Self-Disclosure (SD) methods. Values on the Y-axis are on a log scale (base 10).

A. Quantitative Evaluation

For quantitative evaluation, we evaluate the data collection methods based on two metrics explained below.

1) Social Network Coverage: A data collection method is intended to collect linked identities across as many OSNs as possible. Social network coverage refers to the number of OSNs on which the given data collection method was able to collect linked identities. As depicted in Figure 8, the Advanced Search Operator (ASO) method covers nine social networks (Facebook, Twitter, YouTube, LinkedIn, Google+, Pinterest, Instagram, SoundCloud, and Twiplomacy), while the Self-Disclosure (SD) method covers four social networks (Twitter, Facebook, Instagram, and Snapchat), with the "others" category comprising LinkedIn and users' blogs or websites. In terms of OSN coverage, the Social Aggregator (SA) method performs the best. As depicted in Figure 10, a total of 43 OSNs were covered using this method. Among the three approaches employed in the SA method, the one that leverages a search engine (DuckDuckGo) gives the best results.
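Given the definition above, social network coverage reduces to counting the distinct OSNs seen per method. The sketch below assumes the collected data is available as (method, OSN, handle) records; that record layout and the function name are hypothetical.

```python
def social_network_coverage(linked_identities):
    """Count the distinct OSNs on which each method collected identities."""
    osns_by_method = {}
    for method, osn, _handle in linked_identities:
        # Compare OSN names case-insensitively so "Twitter" and "twitter" count once.
        osns_by_method.setdefault(method, set()).add(osn.lower())
    return {method: len(osns) for method, osns in osns_by_method.items()}
```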

2) Per-user linked identity count: The number of linked identities found for a given user is referred to as the per-user linked identity count. Figure 9 depicts the number of linked identities per user obtained using the Advanced Search Operator (ASO) and Self-Disclosure (SD) methods. For per-user linked identity counts below 4, SD performs better, but beyond that ASO performs well. As expected, the number of users decreases as the per-user linked identity count increases. Figure 11 shows the per-user identity count distribution for the Social Aggregator (SA) method. The discovery feature and public dataset approaches give better results, while the ongoing search-engine approach provides results comparable to the discovery feature when the per-user identity count increases beyond 5.
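The distributions plotted in the figures can be derived by counting how many users share each per-user linked identity count. A minimal sketch, assuming the identities are grouped per user as a mapping from user to the set of OSNs on which that user was found (a hypothetical layout), is:

```python
from collections import Counter


def per_user_count_distribution(identities_by_user):
    """Map each per-user linked identity count to the number of users having it."""
    return dict(Counter(len(osns) for osns in identities_by_user.values()))
```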

3) Results of Cross-Platform Sharing (CPS) Method: Quantitatively, the Cross-Platform Sharing (CPS) method gives the best results in terms of the number of linked identities obtained. In the CPS method, we used Twitter as the target network on which posts from the source networks, namely Zomato, Facebook, and Instagram, were shared; Table III gives the distribution. Cross-platform sharing from Instagram to Twitter yielded the most linked identities.

Fig. 9. Distribution of per-user identity count using all methods except social aggregator (SA) method.

TABLE III
RESULTS OF THE CROSS-PLATFORM SHARING (CPS) METHOD, DEPICTING THE DISTRIBUTION OF CROSS-PLATFORM SHARED POSTS FROM THREE SOURCE NETWORKS, NAMELY ZOMATO, FACEBOOK AND INSTAGRAM, ON TWITTER.

Source Network              Linked Identities
Zomato                      6,000
Facebook                    40,201
Instagram                   58,032
Total Linked Identities     104,233

B. Qualitative Evaluation

We leverage metrics from the ISO 9000:2015 standard (footnote 10) for quality assessment, namely data completeness, consistency, accuracy, validity, availability, and timeliness.

• Completeness: Completeness, in our context, is defined as the ratio of the collected linked identities of a user to the actual linked identities of the same user across all OSNs. From an information retrieval perspective, this is similar to recall. Ideally, a method should return all linked identities, but in practice this is not possible; see Table IV for explanations.
• Validity: Validity, in the context of linked identity collection, means whether a collected identity pair indeed belongs to the same user in the real world. We expect to get valid linked identities as long as users keep their identity lists and profile descriptions correctly updated in the Social Aggregator (SA) and Self-Disclosure (SD) methods, respectively. In the case of the Advanced Search Operator (ASO) method, if the indexed file is outdated, the linked identities could be stale.
• Consistency: Although each data collection method executes consistently, the results of each run could vary owing to the dynamics of the OSNs. Some OSNs allow greater reconfigurability of user profiles; for instance, the username can be changed on Twitter and Instagram. If a data collection method relies on the username, its results will vary over time.
• Accuracy: All the methods rely on user-contributed information. In the case of Advanced Search Operator (ASO), it is the data entered into servers that are indexed by search engines, whereas the Social Aggregator (SA) and Self-Disclosure (SD) methods depend directly on the information provided by the user. As long as the user-supplied information is accurate, the data collection methods are guaranteed to return true positive linked identities.
• Timeliness: In the context of our problem, timeliness means whether linked identities for a given input user can be provided by the data collection method whenever requested. Of the five methods, Self-Disclosure (SD), Cross-Platform Sharing (CPS), and Friend Finder Feature (FFF) could be employed in such a situation.

Footnote 10: International Standards Organization: https://www.iso.org/standard/45481.html

Fig. 10. Distribution of social networks covered using the Social Aggregator (SA) method for collection of linked identities. This method is by far the best in terms of OSN coverage, with a total of 43 OSNs covered.

Fig. 11. Distribution of per-user linked identity count using the Social Aggregator (SA) method for collection of linked identities. The 24 users with an identity count of more than 20 are not plotted in this graph to keep the visualization comprehensible.

TABLE IV
COMPLETENESS ANALYSIS OF DATA COLLECTION METHODS.

Method   Remarks on Completeness
ASO      Depends on the number of social identities submitted by the user to any server whose data is indexed by search engines
SA       Depends on the number of social identities displayed by the user on social aggregator sites
CPS      Depends on the cross-sharing activity of the user
SD       Depends on the number of URLs mentioning identities on other OSNs revealed by the user in his/her account description
FFF      Depends on the availability of the friend-finder feature on the OSN and on friends having an account on that OSN
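The completeness criterion defined above is a recall-style ratio and can be written down directly. The sketch below assumes identities are represented as (OSN, handle) pairs and that the user's actual identities are known, which in practice is rarely the case; the function name is hypothetical.

```python
def completeness(collected, actual):
    """Share of a user's actual linked identities that a method collected."""
    actual = set(actual)
    if not actual:
        raise ValueError("actual linked identities must be non-empty")
    # Only identities that are truly the user's count towards the numerator.
    return len(set(collected) & actual) / len(actual)
```

For a user with four actual identities of which a method finds two, the ratio is 0.5.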

V. CONCLUSION

Users are joining multiple online social networks for different purposes. An essential first step in the social profiling of users is to link their identities across OSNs. Besides profiling, other important applications include recommendations and link prediction. In this work, we explained five data collection methods and compared them both qualitatively and quantitatively. Based on our experience of collecting linked identities across multiple social networks, we list a few suggestions for prospective researchers. The Social Aggregator (SA) method is useful when we want to study user behavior across a large number of OSNs. The Self-Disclosure (SD) method would yield good coverage of OSNs, but in a limited manner. By contrast, if one has to target only a specific pair of OSNs, then the Cross-Platform Sharing (CPS) method would be a good option. The Advanced Search Operator (ASO) method would be useful if only popular social networks (such as Facebook, LinkedIn, and Twitter) are to be targeted. The Friend Finder Feature (FFF) is applicable only when a large pool of emails is available. FFF would also be useful when one has to investigate an unexplored social network.

There are a few limitations to our work. For the Social Aggregator (SA) method, we have investigated only about.me; it would be interesting to extend it to other platforms such as Google+. Similarly, in the Advanced Search Operator (ASO) method, we may go beyond the Google search engine and explore other search engines such as Bing and DuckDuckGo. In the Cross-Platform Sharing (CPS) method, we have taken Twitter as the target social network, which can be extended to include other OSNs as well. Similarly, only Twitter's bio field is parsed in the Self-Disclosure (SD) method.

Finally, for ethical reasons, all data collection methods in this paper operate on public data only and rely upon the fact that the user has shared this data explicitly at some point in time. However, users may not be aware of the implications of the public availability of their data. Users who are privacy-conscious and do not want their identities to be linked are advised not to cross-post content across OSNs, not to provide details of other OSNs on their social media profile pages, to use a separate email address (not known to their friends) for registering on OSNs, and not to register on websites whose robots.txt allows crawling. Regardless, this work is a step towards a tool [4] that can help users understand and control the amount of their data that is available on OSNs so that they can safeguard themselves from online profiling.
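As an illustration of the last recommendation, the following Python sketch checks whether a given robots.txt permits crawling of a URL, using the standard library module urllib.robotparser. The helper allows_crawling, the example rules, and the example.com URLs are hypothetical; note also that robots.txt is advisory, so a disallow rule does not by itself prevent crawling.

```python
from urllib.robotparser import RobotFileParser


def allows_crawling(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the rules from text so the check works without a network request.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that disallows profile pages for all crawlers.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /profile/
"""
```

With these example rules, a URL under /profile/ is reported as not crawlable, while other paths on the same site are reported as crawlable.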

REFERENCES

[1] B. Krulwich, “Lifestyle finder: Intelligent user profiling using large-scale demographic data,” AI Magazine, vol. 18, no. 2, pp. 37–37, 1997.

[2] W.-S. Yang, J.-B. Dia, H.-C. Cheng, and H.-T. Lin, “Mining social networks for targeted advertising,” in Proceedings of the 39th Annual Hawaii International Conference on System Sciences (HICSS’06), vol. 6. IEEE, 2006, pp. 137a–137a.

[3] M. Huber, M. Mulazzani, M. Leithner, S. Schrittwieser, G. Wondracek, and E. Weippl, “Social snapshots: Digital forensics for online social networks,” in Proceedings of the 27th Annual Computer Security Applications Conference. ACM, 2011, pp. 113–122.

[4] R. Kaushal, S. Chandok, P. Jain, P. Dewan, N. Gupta, and P. Kumaraguru, “Nudging nemo: Helping users control linkability across social networks,” in International Conference on Social Informatics. Springer, 2017, pp. 477–490.

[5] S. Liu, S. Wang, F. Zhu, J. Zhang, and R. Krishnan, “Hydra: Large-scale social identity linkage via heterogeneous behavior modeling,” in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014, pp. 51–62.

[6] X. Mu, F. Zhu, E.-P. Lim, J. Xiao, J. Wang, and Z.-H. Zhou, “User identity linkage by latent user space modelling,” 2016.

[7] N. Korula and S. Lattanzi, “An efficient reconciliation algorithm for social networks,” Proceedings of the VLDB Endowment, vol. 7, no. 5, pp. 377–388, 2014.

[8] Y. Shen and H. Jin, “Controllable information sharing for user accounts linkage across multiple online social networks,” in Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. ACM, 2014, pp. 381–390.

[9] H. Zhang, M.-Y. Kan, Y. Liu, and S. Ma, “Online social network profile linkage,” in Asia Information Retrieval Symposium. Springer, 2014, pp. 197–208.

[10] X. Kong, J. Zhang, and P. S. Yu, “Inferring anchor links across multiple heterogeneous social networks,” in Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 2013, pp. 179–188.

[11] F. Buccafurri, G. Lax, A. Nocera, and D. Ursino, “Discovering links among social networks,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2012, pp. 467–482.

[12] D. Perito, C. Castelluccia, M. A. Kaafar, and P. Manils, “How unique and traceable are usernames?” in International Symposium on Privacy Enhancing Technologies Symposium. Springer, 2011, pp. 1–17.

[13] A. Malhotra, L. Totti, W. Meira Jr, P. Kumaraguru, and V. Almeida, “Studying user footprints in different online social networks,” in Advances in Social Networks Analysis and Mining (ASONAM), 2012 IEEE/ACM International Conference on. IEEE, 2012, pp. 1065–1070.

[14] R. Zafarani and H. Liu, “Connecting users across social media sites: a behavioral-modeling approach,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013, pp. 41–49.

[15] O. Goga, H. Lei, S. H. K. Parthasarathi, G. Friedland, R. Sommer, and R. Teixeira, “Exploiting innocuous activity for correlating users across sites,” in Proceedings of the 22nd International Conference on World Wide Web. ACM, 2013, pp. 447–458.

[16] O. Goga, D. Perito, H. Lei, R. Teixeira, and R. Sommer, “Large-scale correlation of accounts across social networks,” University of California at Berkeley, Berkeley, California, Tech. Rep. TR-13-002, 2013.

[17] T. Iofciu, P. Fankhauser, F. Abel, and K. Bischoff, “Identifying users across social tagging systems,” in ICWSM, 2011.

[18] T. Man, H. Shen, S. Liu, X. Jin, and X. Cheng, “Predict anchor links across social networks via an embedding approach,” in IJCAI, 2016, pp. 1823–1829.

[19] O. Peled, M. Fire, L. Rokach, and Y. Elovici, “Entity matching in online social networks,” in Social Computing (SocialCom), 2013 International Conference on. IEEE, 2013, pp. 339–344.

[20] S. Labitzke, I. Taranu, and H. Hartenstein, “What your friends tell others about you: Low cost linkability of social network profiles,” in Proc. 5th International ACM Workshop on Social Network Mining and Analysis, San Diego, CA, USA, 2011, pp. 1065–1070.

[21] J. Liu, F. Zhang, X. Song, Y.-I. Song, C.-Y. Lin, and H.-W. Hon, “What’s in a name?: an unsupervised approach to link users across communities,” in Proceedings of the Sixth ACM International Conference on Web Search and Data Mining. ACM, 2013, pp. 495–504.

[22] C. Riederer, Y. Kim, A. Chaintreau, N. Korula, and S. Lattanzi, “Linking users across domains with location data: Theory and validation,” in Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2016, pp. 707–719.

[23] M. Motoyama and G. Varghese, “I seek you: searching and matching individuals in social networks,” in Proceedings of the Eleventh International Workshop on Web Information and Data Management. ACM, 2009, pp. 67–75.

[24] A. Narayanan and V. Shmatikov, “De-anonymizing social networks,” in 2009 30th IEEE Symposium on Security and Privacy, May 2009, pp. 173–187.

[25] Y. Nie, Y. Jia, S. Li, X. Zhu, A. Li, and B. Zhou, “Identifying users across social networks based on dynamic core interests,” Neurocomputing, vol. 210, pp. 107–115, 2016.
