All content in this area was uploaded by Yulong Gu on Aug 14, 2019
We Know Where You Are: Home Location
Identification in Location-Based Social Networks
Yulong Gu, Yuan Yao, Weidong Liu, Jiaxing Song
Department of Computer Science and Technology
Tsinghua University
Beijing, 100084, China
guyulongcs@gmail.com, yaoyuan13@mails.tsinghua.edu.cn, {liuwd, jxsong}@tsinghua.edu.cn
Abstract—The rapid spread of smartphones has led to the
increasing popularity of Location-Based Social Networks (LBSNs)
such as Foursquare, Gowalla and Facebook Places, where users
can publish information about their current location. In LBSNs,
identifying the home locations of users is important for
applications such as location-based advertising and
recommendation. However, this problem is challenging
because the location information in LBSNs is sparse and noisy:
only a small percentage of users share their home location,
owing to privacy concerns; users may check in at
diverse places far from their home and make friends far away;
and many users have no check-in information at all. In this
paper, we propose a trust-based influence model, named TSU,
to solve the problem. Specifically, TSU is a Trust-based unified
probabilistic model that models edges in LBSNs based on signals
from Social relationship data (social friendship, social trust) and
User-centric data (check-in data). We propose a Home
Location Identification method based on the TSU model and evaluate
it on a large real-world LBSN dataset. Extensive experiments
demonstrate that our method significantly outperforms
state-of-the-art methods.
Keywords—Home Location Identification, Location-Based So-
cial Networks, Trust, Influence Model, Social Networks
I. INTRODUCTION
In recent years, we have seen a rapid proliferation of
social networks such as Facebook¹, Twitter² and Google+³.
As the largest online social network in the world,
Facebook has over 1.18 billion monthly active users as of
August 2015⁴. The rapid growth of the mobile Internet and of
location-acquisition technologies has led to the increasing
popularity of Location-Based Social Networks (LBSNs) such as
Foursquare⁵, Gowalla⁶ and Brightkite⁷, which embed location
into social networks. Users in LBSNs can conveniently log
their activity histories with spatio-temporal data by checking
in at various venues (e.g., scenic spots, restaurants, airports)
at any time using their smartphones. The inherent nature of
LBSNs encourages users to publish their current location (i.e.,
check-ins). The same is true for most popular social network
websites, such as Facebook, Twitter and Google+.
¹ https://www.facebook.com
² https://twitter.com
³ https://plus.google.com/
⁴ https://en.wikipedia.org/wiki/Facebook
⁵ https://foursquare.com
⁶ https://en.wikipedia.org/wiki/Gowalla
⁷ https://en.wikipedia.org/wiki/Brightkite
User profiling, which aims to infer a user’s attributes such as
age, gender, interests, home location and education, has
been a hot topic in academia. Much research has been done
on user profiling[26, 39] to serve personalized search[8, 32],
targeted advertisement[5, 31], news recommendation[4, 22]
and so on. With the rapid growth of LBSNs, Home Location
Identification has become one of the most important
user profiling problems, because users’ home locations are
extremely important for applications that provide effective
location-based services. For example, knowing users’ home
locations enables search engines to provide personalized search
results in users’ mother tongue, news sites to recommend
localized news and advertisers to recommend local ads. The
home location of a user is defined as the relatively “permanent”
place where the user spends most of his time[20]. It captures
the major and static geographic scope of the user and therefore
provides valuable information for personalized services.
The home location problem is challenging because the
signals that may help to identify users’ home locations are
sparse and noisy. Firstly, only a small percentage of users
provide their home locations in social networks, owing to privacy
concerns. On Twitter, only 16% of users register city-level
locations in their profiles, and most users leave general,
nonsensical or even blank information[20]. Secondly, users
may check in at various places far from their home and
make friends far away. Thirdly, many users do not have any
check-in data. As of September 2013, only 30% of users
provide their location information to at least one social media
account, and 12% of adult smartphone owners have used
geo-social services to check in at some location[2]. This problem
has recently attracted great interest from researchers
[9, 10, 19–21, 30]. Existing approaches can mainly be
divided into two categories: Content Based Approach[9, 10, 19]
and Check-in Based Approach[20, 21, 30]. The content based
approach infers users’ home locations using models built on
location information extracted from texts such as tweets in social
networks. The check-in based approach infers users’ home
locations from their check-in data.
In this paper, we propose a trust-based influence model,
named TSU, to model edges in LBSNs. We represent an
LBSN as a directed heterogeneous graph whose nodes are
users or venues. Edges in the graph are either friendship edges
between users or check-in edges from users to venues. TSU is a
Trust-based unified probabilistic model that models edges in
the graph based on signals from Social relationship data (social
friendship, social trust) and User-centric data (check-in data) in
LBSNs. In TSU, we model each node with a location and an
influence scope. We assume that each edge t → h from a tail
node t to a head node h is generated according to the nodes’
locations, the influence scope of the head node h, and the social
trust value of the head node h for the tail node t. TSU is
motivated by three observations: people tend to make friends with
people living nearby and to check in at venues near them;
people tend to follow celebrities and visit popular places; and
people tend to make friends with those with whom they share more
common friends. We propose using “social trust”, which measures
closeness in social structure, to model
edges in LBSNs. Specifically, we measure social trust between
users by calculating the Jaccard Similarity[18] of their friend
sets. The social trust value is higher if two users have more
common friends. To the best of our knowledge, we are the
first to propose using “social trust” for Home
Location Identification.
For the Home Location Identification problem, we propose
a two-stage method based on the TSU model. In the first stage,
for users who have check-in data, we develop a single-pass
clustering algorithm to cluster their check-ins and select the
center of the largest cluster as their home location. In the
second stage, we use a global iteration method to estimate
users’ home locations so that the joint conditional probability
of generating all the edges is maximized.
We conduct extensive experiments to evaluate our Home
Location Identification method and compare it with
state-of-the-art methods[12, 20, 21, 30, 34] on a large-scale
Foursquare dataset containing about 836K users and 649K
venues. The results show that our method predicts the
home locations of users who have check-in data with an accuracy
of 92.1%, even though the average number of check-ins per user
is only about 2.7. For all users without a known home location,
our method reaches an accuracy of 63.1%, outperforming
state-of-the-art methods by about 6.9%, when only 16.7% of
users have check-in data and 20% of users have no known home
location. In short, our method significantly outperforms
state-of-the-art methods and achieves the best performance.
Our main contributions are:
• We are the first to propose a trust-based unified probabilistic
model, called TSU, for Home Location Identification.
• We are the first to propose using social trust to
measure closeness in social structure for the Home Location
Identification problem.
• We propose a two-stage Home Location Identification
method based on the TSU model, and extensive experiments
demonstrate that it outperforms state-of-the-art
methods by about 6.9%.
The rest of the paper is structured as follows: Section
II introduces related work. In Section III, we describe the
dataset. In Section IV, we formulate the Home Location
Identification problem. In Section V, we present the trust-
based influence model TSU. In Section VI, we introduce our
method for Home Location Identification problem. In Section
VII, we demonstrate the experiment results. In Section VIII,
we conclude the paper and discuss future work.
II. RELATED WORK
In this section, we divide related work into three parts: User
Profiling, Human Mobility and Home Location Identification.
A. User Profiling
User profiling aims to infer a user’s attributes, such as age,
gender, interests, home location and education. Mislove
et al. [26] propose a method for inferring users’ attributes, such
as the colleges, matriculation years and majors of students, by
detecting communities in social networks, based on the observation
that users with common attributes are more likely to be friends
and often form dense communities. Zhong et al. [39] extract
rich semantics from users’ check-ins, employ tensor factorization
to derive low-dimensional representations of users’ intrinsic
check-in preferences, and use the extracted features in a
classifier to infer various demographic attributes. With the
increasing popularity of location-based services (LBSs), there have
been growing concerns about location privacy. Much research has
recently been done to protect users’ privacy[15, 16, 27–29, 36, 38].
User profiling is used in various applications such as
personalized search, targeted advertisement and news
recommendation. The works in [8, 32] focus on profiling users’
interests to serve personalized search. Qiu and Cho [32] show that
users’ preferences can be learned accurately even from little
click-history data and that they can significantly improve the
performance of personalized search.
B. Human Mobility
Much research has been done on the social and
temporal characteristics of how people use location sharing
services. The study of human mobility patterns is
significant for social science, the design of future location-based
services, traffic forecasting, urban planning and so on. Cheng
et al. [11] investigate 22 million check-ins and
find that human mobility follows certain spatial and periodic
patterns. Cho et al. [12] find that humans experience a
combination of periodic movement that is geographically limited
and seemingly random jumps influenced by the social network
structure. Allamanis et al. [6] demonstrate that geographic
distance plays an important role in the creation of new social
connections and that users form new ties with friends of existing
friends because connections arise among users visiting the
same place. Wei et al. [35] propose a trace-driven model for
generating synthetic LBSN datasets that capture the properties
of the original datasets. Foroozani et al. [13] propose a model
that captures human mobility properties by introducing hotspot
zones, using a graph of hotspot zones as the input area map,
dividing the day into periods and modeling various
speeds at different times and places.
C. Home Location Identification
Home Location Identification focuses on identifying the home
locations of users in social networks. There are two types of
approaches to this problem: Content Based Approach
and Check-in Based Approach.
1) Content Based Approach: The content based approach infers
users’ home locations using models built on
location information extracted from texts such as tweets in social
networks. Cheng et al. [10] propose a probabilistic framework for
estimating a Twitter user’s city-level location based purely
on the content of the user’s tweets. They use a classification
component for automatically identifying words in tweets with
a strong local geo-scope and a lattice-based neighborhood
smoothing model for refining a user’s location estimate.
Chandra et al. [9] employ a probabilistic framework to estimate the
city-level location of a Twitter user based on the content of
the tweets in their dialogues. The works in [23, 24] use an
ensemble of statistical and heuristic classifiers to predict the
home locations of Twitter users based on the content of tweets and
the tweeting behavior of users. Li et al. [19] propose a global
location identification method that combines multiple microblogs of
a user and utilizes them to identify the user’s location. The
method organizes points of interest into a tree structure,
extracts candidate locations from each microblog of a user, and
then aggregates these candidate locations and identifies the top-k
locations of the user.
2) Check-in Based Approach: The check-in based approach
infers users’ home locations from their check-in data.
Cho et al. [12] infer the home location by discretizing the
world into 25 by 25 km cells and defining the home location
as the average position of the check-ins in the cell with the most
check-ins. Li et al. [20] propose a unified and discriminative
influence model that models the influence scope of users and
venues. Based on this model, they develop a location prediction
method that identifies the home locations of Twitter users using
signals observed from friends and venues identified in tweets.
Pontes et al. [30] use a majority voting scheme that takes
the most popular location of a user as her home location. Liu
et al. [21] estimate home locations by applying a hierarchical
clustering method to check-ins made at night.
III. DATASET DESCRIPTION
In this section, we briefly introduce the main characteristics
of Foursquare as well as the crawled dataset used in our
experiments.
A. Foursquare: Background
Foursquare is currently one of the largest and most
popular LBSNs. As a local search and recommendation service,
Foursquare provides search results or recommended places to
go based on targeted locations. The service was created in late
2008 and launched in 2009. Users in Foursquare can share
their locations with friends and followers through check-ins.
Check-ins are performed via mobile devices with GPS when
a user is close to specific locations known as venues, which
represent real locations in a great variety of categories such
as airports or restaurants. As of December 2013, Foursquare
had 45 million registered users[1]. Foursquare gives incentives
to users who visit (check in at) specific places (venues), using
rewards such as mayorships for frequent visitors. Users can post
tips at specific venues, commenting on their experiences
when visiting the corresponding physical places. Moreover,
Foursquare enables users to rate venues by answering questions
that help Foursquare understand how people feel about a
place, such as whether or not a user likes it.
More than 50 million people use Foursquare and Swarm
(a companion app to Foursquare) each month across desktop,
mobile web and mobile apps, and people have checked in more
than 8 billion times worldwide as of February 2016⁸.
B. Foursquare Dataset
In this paper, we use a widely used and publicly available
Foursquare dataset extracted from the Foursquare
application through the public API[17, 33]. This dataset contains
2,153,471 users, 1,143,092 venues, 1,021,970 check-ins and
27,098,490 social connections. Each user has a unique id and
a geospatial location (latitude and longitude) that represents
the user’s home town location. Each venue has a unique id
and a geospatial location (latitude and longitude). The social
graph data contains the edges (connections) that
exist between users. Each social connection consists of two
users (friends) represented by two unique ids (first user id and
second user id).
We focus our study of Home Location Identification on
Foursquare users within the continental United States. To this
end, we keep only valid users who are in the social
graph and located in the continental United States. The statistics
after applying this filter are shown in Table I.
TABLE I
SUMMARY STATISTICS OF FOURSQUARE DATASET
Type                  Number
Users                 835,896
Venues                648,825
Check-ins             370,477
Social Graph Edges    12,924,609
C. Mapping Location to City
We need a method to map a location to its corresponding city
so that, given a user’s home location, we know which city the
user lives in. In this paper, we map a location to a
city as follows. The candidate cities are the 297
cities in the United States with a population of at least 100,000
on July 1, 2014, as estimated by the United States Census
Bureau[3]. We define a location’s mapped city as the candidate
city nearest to the location.
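The nearest-city mapping can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper does not state which distance is used, so the great-circle (haversine) distance, the Earth radius constant, the function names and the three-city candidate list are all assumptions made for the example.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def nearest_city(lat, lon, cities):
    """Return the name of the candidate city closest to (lat, lon)."""
    return min(cities, key=lambda name: haversine_miles(lat, lon, *cities[name]))


# Hypothetical candidate list; the paper uses 297 US cities.
CITIES = {
    "New York": (40.7128, -74.0060),
    "Chicago": (41.8781, -87.6298),
    "Los Angeles": (34.0522, -118.2437),
}
```

With this toy list, a point near Philadelphia such as `nearest_city(39.95, -75.17, CITIES)` maps to "New York", which shows that a location is always assigned to some candidate city even when its true city is not in the list.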
⁸ https://foursquare.com/about
Fig. 1. An example of LBSN
IV. HOME LOCATION IDENTIFICATION PROBLEM FORMULATION
In this section, we firstly represent a Location-based Social
Network as a directed heterogeneous graph and then formalize
the Home Location Identification problem.
A. Location-based Social Networks Formulation
We represent a Location-based Social Network as a directed
heterogeneous graph $G = (N, E)$. An example of an LBSN is
shown in Figure 1. In the graph, nodes represent users or
venues. There are two types of edges in the graph: (1) following
relationship edges from users to other users; (2) check-in
edges from users to venues. LBSNs in which friendships are
undirected can also be represented by the directed
heterogeneous graph by creating two following relationship
edges for each undirected edge. We denote concepts in LBSNs
as follows:
• $N = \{n_i\}$, $i = 1 \ldots N$: the set of N nodes in $G$
• $E = \{e\langle n_i, n_j\rangle\}$: the set of E edges in $G$; $n_i$ is the tail node and $n_j$ is the head node of the edge
• $U = \{u_i\}$, $i = 1 \ldots U$: the set of U users in $N$
• $V = \{v_i\}$, $i = 1 \ldots V$: the set of V venues in $N$
• $F = \{f\langle u_i, u_j\rangle\}$: following relationship edges from user node $u_i$ to $u_j$
• $C = \{C_{ij}\} = \{c\langle u_i, v_j\rangle\}$: check-in edges from user node $u_i$ to venue node $v_j$
• $U_H$: the set of users whose home locations are known
• $U_{-H}$: the set of users whose home locations are not known
• $U_C$: the set of users who have check-in data
• $U_{-C}$: the set of users who don’t have check-in data
• $L$: a geographical location denoted by $(Lat, Lon)$, where $Lat$ is the latitude and $Lon$ is the longitude
• $L_{u_i}$: home location of user $u_i$
• $L_{v_j}$: location of venue $v_j$
We have that $N = U \cup V$, $E = F \cup C \cup R$, $U = U_H \cup U_{-H}$
and $U = U_C \cup U_{-C}$.
Further, we denote the edges as follows:
• $I_e(n)$: incoming nodes of node $n$ of edge type $e$
• $O_e(n)$: outgoing nodes of node $n$ of edge type $e$
• $I_f(u_i)$: users who follow user $u_i$
• $O_f(u_i)$: users that are followed by user $u_i$
• $O_c(u_i)$: venues checked in by user $u_i$
• $I_c(v_j)$: users who check in at venue $v_j$
B. Home Location Identification Problem Formulation
Home Location Identification Problem: For a Location-based
Social Network $G = (N, E)$ and each user $u_i \in U_{-H}$,
estimate a home location $\tilde{L}_{u_i}$ so that $\tilde{L}_{u_i}$ is close to $u_i$’s
true home location $L_{u_i}$.
V. TSU: TRUST-BASED INFLUENCE MODEL
In this section, we introduce a trust-based influence model
named TSU to model edges in Location-based Social
Networks.
A. Motivation of TSU model
Existing research has exploited social friendship and
check-in data for Home Location Identification[12, 20, 21, 30,
34]. Our model TSU exploits a new signal, “social trust”.
Specifically, TSU is a Trust-based influence model based on
Social relationship data (social friendship, social trust) and
User-centric data (check-in data).
1) social friendship: It has been observed in social networks
such as Facebook and Twitter that the probability of friendship
decreases as the distance between nodes increases[7, 20].
Li et al. [20] find that different nodes have different
influence in social networks, which means that different head nodes
have different probabilities of attracting tail nodes at the same
distance. For example, a celebrity is more likely than a regular
user to attract users who live far away.
2) social trust: Existing methods[20] treat the friend
relation as binary. However, closer friends in a
social network should have more influence on a user’s home
location. We propose the concept of “social trust” to
measure closeness in social structure and are the first to apply
it to the Home Location Identification problem. If two users have
more common friends, the social trust value between them
is higher and they tend to live nearer to each other.
3) check-in data: We can predict a user’s home location
from his check-in data because users tend to visit nearby
venues[12, 20].
B. Social Trust
In this paper, we propose “social trust” to measure closeness
in social structure and apply it in the TSU model. We denote the
social trust value of node $n_i$ for node $n_j$ as $T_{ji}$ and measure
social trust between nodes using the Jaccard Similarity[18] of
users’ friend sets. The Jaccard Similarity is a statistic for
comparing the similarity and diversity of finite sample sets,
defined as the size of the intersection divided by
the size of the union of the sets. Specifically, for
user nodes $u_i$ and $u_j$, the common friends are
$CF(u_i, u_j) = F(u_i) \cap F(u_j)$,
where $F(u_i)$ is the friend set of user $u_i$. We define $T_{ji}$ as in
Equation 1:
$$T_{ji} = \mathrm{Jaccard}(u_i, u_j) = \frac{|F(u_i) \cap F(u_j)|}{|F(u_i) \cup F(u_j)|} \tag{1}$$
It follows that $T$ is symmetric: $T_{ji} = T_{ij}$.
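As a small illustration of Equation 1, the social trust value can be computed directly on Python sets. The function name and the convention of returning 0 when both friend sets are empty are assumptions made for this sketch; the paper does not specify the empty case.

```python
def social_trust(friends_i, friends_j):
    """Jaccard similarity of two friend sets (Equation 1)."""
    union = friends_i | friends_j
    if not union:
        return 0.0  # assumed convention: no friends at all means no trust signal
    return len(friends_i & friends_j) / len(union)


# Two users sharing friends "b" and "c" out of four distinct friends overall.
trust = social_trust({"a", "b", "c"}, {"b", "c", "d"})  # 2 / 4 = 0.5
```

Because set intersection and union are commutative, swapping the arguments gives the same value, which matches the symmetry of $T$.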
C. Formulation of TSU Model
We use a trust-based influence model called TSU to model
edges in LBSNs. In this model, we denote the influence of
a node $n_i$ as $I_{n_i}$, which is a probability distribution over the
geographic plane. For a node $n_i$, we define $n_i$’s influence on
another node $n_j$ at a location $L$ as the probability that $n_j$ builds
an edge $e\langle n_j, n_i\rangle$ to it. An influential node has a broader
influence scope, and more influence at the same distance, than
an ordinary node.
1) Influence Model of Nodes on Geographic: We choose a
Gaussian distribution to capture a node’s influence model for
its expressiveness and simplicity, the same as previous research
[20]. Specifically, we model a node $n_i$’s influence $I_{n_i}$ as a
bivariate Gaussian distribution $\mathcal{N}(L_{n_i}, \Sigma_{n_i})$, centered at $n_i$’s
location $L_{n_i} = (Lat_{n_i}, Lon_{n_i})$ and with the covariance matrix
$\Sigma_{n_i}$ as its influence scope. We assume the influence scope of
a node on the latitude and longitude dimensions is the same,
so
$$\Sigma_{n_i} = \begin{pmatrix} \sigma_{n_i} & 0 \\ 0 & \sigma_{n_i} \end{pmatrix}.$$
The influence probability of node $n_i$ at a location $L$ is measured in Equation 2:
$$P(L \mid I_{n_i}) = \frac{1}{2\pi\sigma_{n_i}^2} \exp\left(\frac{(Lat_{n_i} - Lat_L)^2 + (Lon_{n_i} - Lon_L)^2}{-2\sigma_{n_i}^2}\right) \tag{2}$$
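2) Social trust-based User Influence Model: The probability
that a user $u_i$ influences a user $u_j$ to build a following edge
to him is measured in Equation 3:
$$P(f\langle u_j, u_i\rangle \mid I_{u_i}, L_{u_j}) = \frac{T_{ji}}{2\pi\sigma_{u_i}^2} \exp\left(T_{ji}\,\frac{(Lat_{u_i} - Lat_{u_j})^2 + (Lon_{u_i} - Lon_{u_j})^2}{-2\sigma_{u_i}^2}\right) \tag{3}$$
3) Venue Influence Model: The probability that a user $u_i$
checks in at venue $v_j$ is measured in Equation 4:
$$P(c\langle u_i, v_j\rangle \mid I_{v_j}, L_{u_i}) = \frac{1}{2\pi\sigma_{v_j}^2} \exp\left(\frac{(Lat_{v_j} - Lat_{u_i})^2 + (Lon_{v_j} - Lon_{u_i})^2}{-2\sigma_{v_j}^2}\right) \tag{4}$$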
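The three influence probabilities in Equations 2 to 4 share one Gaussian kernel, which the following sketch makes explicit. The function names, the representation of a location as a `(lat, lon)` tuple in degrees, and passing the variance `sigma2` directly are assumptions of this example rather than details given in the paper.

```python
import math


def sq_dist(loc_a, loc_b):
    """Squared difference in latitude plus squared difference in longitude."""
    return (loc_a[0] - loc_b[0]) ** 2 + (loc_a[1] - loc_b[1]) ** 2


def node_influence(loc_node, sigma2, loc):
    """Equations 2 and 4: influence of a node (user or venue) at location loc."""
    return math.exp(-sq_dist(loc_node, loc) / (2 * sigma2)) / (2 * math.pi * sigma2)


def follow_probability(loc_ui, sigma2_ui, loc_uj, trust_ji):
    """Equation 3: trust-weighted probability that u_j follows u_i."""
    exponent = -trust_ji * sq_dist(loc_ui, loc_uj) / (2 * sigma2_ui)
    return trust_ji * math.exp(exponent) / (2 * math.pi * sigma2_ui)
```

With `trust_ji = 1` the following-edge probability reduces to the plain node influence, so social trust acts as a weight on an otherwise distance-only model.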
4) TSU Model on LBSNs: We make a conditional
independence assumption that the edge from a tail node to
a head node is conditionally independent of the other edges given
the head node and the tail node. This assumption is widely applied in
machine learning models such as Naive Bayes[25]. The TSU model
is shown in Equation 5, which gives the joint probability of
generating the friendship and check-in edges in an LBSN. We can
estimate the unknown home locations of users using the Maximum
Likelihood Estimation (MLE) principle under the TSU model.
Algorithm 1 HLIA: Home Location Identification Algorithm
Input: G, F, C, R, L_{u_i} (∀ u_i ∈ U_H)
Output: L_{u_i} (∀ u_i ∈ U_{-H})
 1: function HLIA(G, F, C, R, L)
 2:   // Init home location of users in U_{-H}
 3:   for each u_i ∈ U_{-H} do                  // users: no home location
 4:     if u_i ∈ U_C then                       // user: have check-in
 5:       L_{u_i} = SPClustering(C_{u_i}, c_τ)
 6:     else                                    // user: no check-in
 7:       L_{u_i} = Random
 8:     end if
 9:   end for
10:   // Update home locations of users in U_{-H} iteratively
11:   while true do                             // Outer Loop
12:     for each u_i ∈ U do
13:       Update σ²_{u_i} based on Equation 8
14:     end for
15:     for each v_j ∈ V do
16:       Update σ²_{v_j} based on Equation 9
17:     end for
18:     while true do                           // Inner Loop
19:       for each u_i ∈ (U_{-H} ∩ U_{-C}) do
20:         Calculate Lat^new_{u_i} and Lon^new_{u_i} based on Equations 6 and 7
21:       end for
22:       If Inner Loop converges, then break
23:     end while
24:     for each u_i ∈ (U_{-H} ∩ U_{-C}) do
25:       Lat_{u_i} = Lat^new_{u_i}, Lon_{u_i} = Lon^new_{u_i}
26:     end for
27:     If Outer Loop converges, then break
28:   end while
29: end function
30:
Input: L, c_τ                                   // L: the location list, c_τ: cluster threshold
Output: l_c
31: function SPClustering(L, c_τ)
32:   C: clusters
33:   for each i ∈ [1, Length(L)] do
34:     Get the cluster C_min that has the minimum distance d_min with L_i
35:     if d_min < c_τ then
36:       C_min ← L_i
37:     else
38:       Create a new cluster C_new
39:       C_new ← L_i
40:     end if
41:   end for
42:   return l_c which is center of the largest cluster
43: end function
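$$\begin{aligned}
P(E \mid I_u, I_v) &= P(F \mid I_u, I_v) \times P(C \mid I_u, I_v) \\
&= \prod_{f\langle u_j, u_i\rangle \in F} \frac{T_{ji}}{2\pi\sigma_{u_i}^2} \exp\left(T_{ji}\,\frac{(Lat_{u_i} - Lat_{u_j})^2 + (Lon_{u_i} - Lon_{u_j})^2}{-2\sigma_{u_i}^2}\right) \\
&\quad \times \prod_{c\langle u_i, v_j\rangle \in C} \frac{1}{2\pi\sigma_{v_j}^2} \exp\left(\frac{(Lat_{v_j} - Lat_{u_i})^2 + (Lon_{v_j} - Lon_{u_i})^2}{-2\sigma_{v_j}^2}\right)
\end{aligned} \tag{5}$$

$$Lat_{u_i} = \frac{\sum_{u_j \in I_f(u_i)} \frac{T_{ji}\, Lat_{u_j}}{\sigma_{u_i}^2} + \sum_{u_j \in O_f(u_i)} \frac{T_{ij}\, Lat_{u_j}}{\sigma_{u_j}^2} + \sum_{v_j \in O_c(u_i)} \frac{Lat_{v_j}}{\sigma_{v_j}^2}}{\sum_{u_j \in I_f(u_i)} \frac{T_{ji}}{\sigma_{u_i}^2} + \sum_{u_j \in O_f(u_i)} \frac{T_{ij}}{\sigma_{u_j}^2} + \sum_{v_j \in O_c(u_i)} \frac{1}{\sigma_{v_j}^2}} \tag{6}$$

$$Lon_{u_i} = \frac{\sum_{u_j \in I_f(u_i)} \frac{T_{ji}\, Lon_{u_j}}{\sigma_{u_i}^2} + \sum_{u_j \in O_f(u_i)} \frac{T_{ij}\, Lon_{u_j}}{\sigma_{u_j}^2} + \sum_{v_j \in O_c(u_i)} \frac{Lon_{v_j}}{\sigma_{v_j}^2}}{\sum_{u_j \in I_f(u_i)} \frac{T_{ji}}{\sigma_{u_i}^2} + \sum_{u_j \in O_f(u_i)} \frac{T_{ij}}{\sigma_{u_j}^2} + \sum_{v_j \in O_c(u_i)} \frac{1}{\sigma_{v_j}^2}} \tag{7}$$

$$\sigma_{u_i}^2 = \sum_{u_j \in I_f(u_i)} T_{ji}\, \frac{(Lat_{u_j} - Lat_{u_i})^2 + (Lon_{u_j} - Lon_{u_i})^2}{2\,|I_f(u_i)|} \tag{8}$$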
$$\sigma_{v_j}^2 = \sum_{u_i \in I_c(v_j)} \frac{(Lat_{u_i} - Lat_{v_j})^2 + (Lon_{u_i} - Lon_{v_j})^2}{2\,|I_c(v_j)|} \tag{9}$$
VI. HOME LOCATION IDENTIFICATION METHOD
In this section, we develop our Home Location Identification
method based on the TSU model. Specifically, we estimate
the home location of each user so as to maximize the likelihood,
i.e., the joint probability of generating the edges (friendships,
check-ins).
In the TSU model shown in Equation 5, for a user $u_i \in U_{-H}$,
both $L_{u_i}$ and $\sigma_{u_i}$ are unknown; for a user $u_i \in U_H$ and a venue
$v_j \in V$, $\sigma_{u_i}$ and $\sigma_{v_j}$ are unknown. We differentiate Equation
5 with respect to the unknown variables and obtain the results
shown in Equations 6, 7, 8 and 9. In these equations, the unknown
variables depend on each other. We therefore use a two-stage
algorithm called HLIA, shown in Algorithm
1, to solve the problem. In Stage 1, HLIA initializes the home
locations of users who have check-in data by clustering their
check-ins with a single-pass clustering algorithm. In Stage
2, HLIA updates users’ home locations iteratively to maximize
the likelihood. We prove that HLIA converges in
Theorem 6.1.
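The differentiation step for the latitude update can be sketched as follows. The shorthand $\ell$ for the log-likelihood and $d^2(\cdot,\cdot)$ for the squared coordinate difference are introduced here for illustration only, and the sketch assumes $T_{ji} > 0$ on every following edge so that the logarithm is defined.

```latex
% \ell and d^2 are shorthand introduced for this sketch only.
d^2(a, b) = (Lat_a - Lat_b)^2 + (Lon_a - Lon_b)^2

\ell = \log P(E \mid I_u, I_v)
     = \sum_{f\langle u_j, u_i\rangle \in F}
         \left[ \log T_{ji} - \log(2\pi\sigma_{u_i}^2)
                - \frac{T_{ji}\, d^2(u_i, u_j)}{2\sigma_{u_i}^2} \right]
     + \sum_{c\langle u_i, v_j\rangle \in C}
         \left[ -\log(2\pi\sigma_{v_j}^2)
                - \frac{d^2(v_j, u_i)}{2\sigma_{v_j}^2} \right]

\frac{\partial \ell}{\partial Lat_{u_i}}
  = \sum_{u_j \in I_f(u_i)} \frac{T_{ji}\,(Lat_{u_j} - Lat_{u_i})}{\sigma_{u_i}^2}
  + \sum_{u_j \in O_f(u_i)} \frac{T_{ij}\,(Lat_{u_j} - Lat_{u_i})}{\sigma_{u_j}^2}
  + \sum_{v_j \in O_c(u_i)} \frac{Lat_{v_j} - Lat_{u_i}}{\sigma_{v_j}^2}
  = 0
```

Collecting the terms in $Lat_{u_i}$ and solving gives the weighted average of Equation 6; the same argument applied to $Lon_{u_i}$ gives Equation 7.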
A. Stage 1: Initialization
HLIA initializes the home locations of users in $U_{-H}$
from Step 3 to Step 9. For a user who has
check-in data, HLIA initializes his home location by clustering
his check-ins using a single-pass clustering algorithm on
locations, called SPClustering, based on the Single-pass
Clustering Algorithm[14]. For a user who does not have check-in
data, HLIA initializes his home location to a random value.
The SPClustering algorithm is shown from Step 31 to Step
43. It groups a location list into clusters in a single pass and
returns the center of the largest cluster as the result. Specifically,
SPClustering scans each location $L_i$ in the location list
sequentially and finds the nearest cluster $C_{min}$ for $L_i$. If the
minimum distance $d_{min}$ is less than a threshold $c_\tau$, it adds
$L_i$ to the nearest cluster $C_{min}$. Otherwise, it
creates a new cluster $C_{new}$ containing $L_i$. Consequently,
SPClustering is a linear algorithm: it visits each location in the
list exactly once.
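A minimal sketch of the single-pass clustering step is given below. Several details are assumptions, since the paper does not fix them: the distance from a location to a cluster is taken to be the distance to the cluster's running centroid, the default distance is plain Euclidean distance on `(lat, lon)` tuples, and ties for the largest cluster go to the cluster created first.

```python
import math


def sp_clustering(locations, threshold, dist=math.dist):
    """Cluster (lat, lon) points in one pass; return the largest cluster's center."""
    clusters = []  # each cluster is a list of points
    centers = []   # running centroid of each cluster
    for loc in locations:
        if centers:
            # Find the cluster whose centroid is nearest to this location.
            idx = min(range(len(centers)), key=lambda k: dist(centers[k], loc))
            if dist(centers[idx], loc) < threshold:
                clusters[idx].append(loc)
                n = len(clusters[idx])
                centers[idx] = (
                    sum(p[0] for p in clusters[idx]) / n,
                    sum(p[1] for p in clusters[idx]) / n,
                )
                continue
        # No cluster is close enough: start a new one with this location.
        clusters.append([loc])
        centers.append(loc)
    if not clusters:
        return None
    largest = max(range(len(clusters)), key=lambda k: len(clusters[k]))
    return centers[largest]


home = sp_clustering([(40.0, -74.0), (40.1, -74.1), (34.0, -118.0)], threshold=1.0)
```

In this toy run the first two check-ins fall into one cluster and the third starts its own, so the returned center is the mean of the first two points. A geographic distance function can be passed as `dist` if the threshold is expressed in miles or kilometers.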
B. Stage 2: Updating Iteratively
HLIA updates the home locations of users who don’t have
check-in data iteratively from Step 11 to Step 28. The outer loop from
Step 11 to Step 28 updates $\sigma_{u_i}^2$ and $\sigma_{v_j}^2$ based on Equations 8
and 9. The inner loop from Step 18 to Step 26 updates $Lat_{u_i}$
and $Lon_{u_i}$ based on Equations 6 and 7. HLIA stops when the
likelihood converges.
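The location update of Equations 6 and 7 is a weighted average of the positions of a user's followers, followees and visited venues. The sketch below illustrates one such update for a single user; the function name, the tuple layouts of the three input lists and the `None` result for an isolated user are assumptions made for the example.

```python
def update_location(followers, followees, venues, sigma2_self):
    """One update of (Lat, Lon) for user u_i following Equations 6 and 7.

    followers: list of (T_ji, (lat, lon)) for u_j in I_f(u_i)
    followees: list of (T_ij, (lat, lon), sigma2_uj) for u_j in O_f(u_i)
    venues:    list of ((lat, lon), sigma2_vj) for v_j in O_c(u_i)
    """
    # Each neighbor contributes its location with a precision-style weight.
    terms = [(t / sigma2_self, loc) for t, loc in followers]
    terms += [(t / s2, loc) for t, loc, s2 in followees]
    terms += [(1.0 / s2, loc) for loc, s2 in venues]
    total = sum(w for w, _ in terms)
    if total == 0:
        return None  # assumed convention: no usable signal for this user
    lat = sum(w * loc[0] for w, loc in terms) / total
    lon = sum(w * loc[1] for w, loc in terms) / total
    return lat, lon


new_loc = update_location(
    followers=[(1.0, (10.0, 20.0))],
    followees=[(0.5, (0.0, 0.0), 0.5)],
    venues=[((20.0, 40.0), 1.0)],
    sigma2_self=1.0,
)
```

In this example all three neighbors end up with weight 1, so the new location is their plain average. A neighbor with a small influence scope or a high trust value pulls the estimate more strongly.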
Theorem 6.1: The Home Location Identification algorithm
HLIA converges.
Proof: In the inner loop, HLIA converges and obtains the
$Lat_{u_i}$ and $Lon_{u_i}$ that maximize the likelihood for fixed
$\sigma_{u_i}^2$ and $\sigma_{v_j}^2$, as shown in [37]. In the outer loop, HLIA
directly calculates new $\sigma_{u_i}^2$ and $\sigma_{v_j}^2$ according to Equations 8
and 9 given the locations of nodes. Consequently, the likelihood
increases monotonically and the algorithm converges.
VII. EXPERIMENTS
A. Experiment Setup
1) Dataset: As described in Section III-B, the Foursquare
dataset has 835,896 users, 648,825 venues, 370,477 check-ins
and 12,924,609 social graph edges. In the dataset,
138,983 users have check-in data, constituting only
16.7% of all users. For users who have check-in data, the
average number of check-ins per user is about 2.7.
In the experiments, we define the ratio of users who have a
home location as $rh = |U_H| / |U|$. We randomly split users
into two parts: a fraction $rh$ of users have a home location and
the remaining $1 - rh$ do not. In the experiments, we
select $rh = 80\%$, the same setting as existing methods[7,
10, 20]. In this setting, 669,472 users have a home
location and 166,424 users do not. Among the
166,424 users without a home location, 27,781 (16.7%) have
check-in data.
a) Methods:
• UDI is the method developed in [20], which predicts a user’s
location based on an influence model. UDI uses signals such as
friendships and venues mentioned in tweets.
• Maxvote is the baseline method developed in [30], which
predicts a user’s location as the most popular location
of that user. We cannot apply a max-vote scheme directly because
latitude and longitude are continuous, so we first map the
check-in list to a city list using the method
described in Section III-C.
• ClusterHier is the baseline method developed in [21], which
predicts a user’s home location using a hierarchical clustering
algorithm to cluster check-ins made at night (shared from 8:00 p.m. to
7:59 a.m. every day).
• Avg is the baseline method developed in [12, 34], which
discretizes the world into 25 by 25 km cells and defines the
home location as the average position of the check-ins in the cell
with the most check-ins.
• HLIA is our Home Location Identification method.
• $HLIA_{uc}$ is a variant of our Home Location Identification method
that also updates users who have check-in data in the iteration stage,
the same as UDI.
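The Maxvote adaptation described in the list above can be sketched as a majority vote over mapped cities. The function name, the `to_city` callback and the toy east/west mapping are hypothetical; in the experiments the mapping would be the nearest-city rule of Section III-C.

```python
from collections import Counter


def maxvote_home_city(checkins, to_city):
    """Return the city that receives the most check-ins, or None if there are none."""
    votes = Counter(to_city(lat, lon) for lat, lon in checkins)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy mapping used only for illustration: split the country at 100 degrees west.
def toy_city(lat, lon):
    return "East" if lon > -100 else "West"


city = maxvote_home_city([(40.0, -74.0), (41.0, -87.0), (34.0, -118.0)], toy_city)
```

Voting on discrete cities instead of raw coordinates is what makes the scheme applicable, since two check-ins almost never share exactly the same latitude and longitude.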
2) Evaluation Metrics: We measure the performance of
different methods using accuracy within an error distance of
100 miles (ACC), the same as previous work[20]. Specifically,
for a user $u_i$, his true and estimated home locations are $L_{u_i}$ and
$\tilde{L}_{u_i}$ respectively. Let $Err(u_i)$ be the error distance between
$L_{u_i}$ and $\tilde{L}_{u_i}$; then ACC is defined as in Equation 10.
$$ACC = \frac{|\{u_i \mid u_i \in U_{-H} \wedge Err(u_i) \le 100\}|}{|U_{-H}|} \tag{10}$$
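Equation 10 amounts to counting the users whose error distance is within the threshold. A minimal sketch, assuming the per-user error distances in miles have already been computed and that an empty user set yields an accuracy of 0:

```python
def accuracy_within(errors_miles, threshold=100.0):
    """Fraction of users whose error distance is at most `threshold` miles."""
    if not errors_miles:
        return 0.0  # assumed convention for an empty evaluation set
    hits = sum(1 for err in errors_miles if err <= threshold)
    return hits / len(errors_miles)


# Hypothetical error distances for four users; two are within 100 miles.
acc = accuracy_within([10.0, 100.0, 150.0, 2500.0])
```

Passing a different `threshold` gives the accuracy at other error distances with the same function.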
B. Experiment Results
1) Home Location Identification for $U_{-H} \cap U_C$: The methods
Maxvote, ClusterHier and Avg have the limitation that
they can only predict the home locations of users who have
check-in data. This means that they can cover only 16.7%
of the users in $U_{-H}$ in the dataset. We therefore first compare the
performance of different methods on users who have check-in
data. Table II shows the performance of each method. The
results demonstrate that our method HLIA outperforms all
existing methods for users who have check-in data.
Specifically, HLIA predicts the home locations of users who have
check-in data with an accuracy of 92.1%, even though the average
number of check-ins per user is only about 2.7, and achieves
the best performance.
TABLE II
PERFORMANCE OF HOME LOCATION IDENTIFICATION FOR $U_{-H} \cap U_C$
Method         ACC (%)
HLIA           92.1
Maxvote        91.9
ClusterHier    91.6
Avg            88.8
2) Home Location Identification for $U_{-H}$: In this
experiment, we compare the performance of UDI, HLIA and
$HLIA_{uc}$ for all users who don’t have a home location. Table
III shows the performance of each method. The Gain column
in the table gives the relative gain in ACC over UDI:
when the ACC of a method and of UDI are $acc_m$ and $acc_u$
respectively, the gain equals $(acc_m - acc_u) / acc_u$.
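The gain values can be reproduced from the reported accuracies. The helper name below is an assumption; the accuracy figures are those reported in Tables III and IV.

```python
def relative_gain(acc_method, acc_baseline):
    """Relative gain in ACC over the baseline, expressed in percent."""
    return 100.0 * (acc_method - acc_baseline) / acc_baseline


gain_hlia = round(relative_gain(63.1, 59.0), 1)      # HLIA vs. UDI, Table III
gain_hlia_uc = round(relative_gain(62.3, 59.0), 1)   # HLIA_uc vs. UDI, Table III
gain_rh20 = round(relative_gain(48.8, 33.1), 1)      # HLIA vs. UDI, Table IV
```

Note that the gain is relative rather than absolute: an improvement from 59.0% to 63.1% is 4.1 percentage points, which corresponds to a relative gain of about 6.9%.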
a) HLIA vs. UDI: We can see that HLIA improves on UDI
by 6.9% in terms of ACC (a relative gain, from 59.0% to 63.1%).
b) HLIA vs. HLIAuc: In the initialization stage of
HLIA, we initialize the home locations of users in $U_{-C}$ with
random values and the home locations of users in $U_C$ by
clustering their check-in data. If we updated the home locations
of users in $U_C$ using the randomly initialized locations in the
updating stage, the accuracy of the estimated home locations of
users in $U_C$ might be affected. This is confirmed by the
experiments. By comparing HLIA and HLIAuc, we see that updating
only the locations of users in $U_{-C}$ in the updating stage of
HLIA improves the ACC by 1.3% in relative terms (from 62.3% to
63.1%).
Fig. 2. Accuracy under different error distances
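The difference between the two variants can be illustrated with a small sketch. The update rule below (the mean of the friends' current estimates) is only a placeholder chosen for brevity and is not the TSU model; the function name `update_homes` and its parameters are hypothetical. The point of interest is the `update_all` flag, which decides whether users whose estimates came from check-in data are frozen (as in HLIA) or updated as well (as in HLIAuc).

```python
def update_homes(homes, friends, has_checkins, update_all=False, rounds=10):
    """Iteratively re-estimate home locations from friends' current estimates.

    homes: dict user -> (lat, lon) initial estimates.
    friends: dict user -> list of friend ids.
    has_checkins: set of users initialized from their check-in data.
    update_all: False freezes users in has_checkins; True updates everyone.
    The averaging rule is a placeholder, not the model used in the paper.
    """
    current = dict(homes)
    for _ in range(rounds):
        nxt = dict(current)
        for user, nbrs in friends.items():
            if user in has_checkins and not update_all:
                continue  # keep the check-in based estimate untouched
            pts = [current[n] for n in nbrs if n in current]
            if pts:
                nxt[user] = (sum(p[0] for p in pts) / len(pts),
                             sum(p[1] for p in pts) / len(pts))
        current = nxt
    return current
```

With `update_all=False`, a randomly initialized neighbour can never overwrite an estimate derived from check-ins, which is the behaviour the comparison above argues for.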
TABLE III
PERFORMANCE OF HOME LOCATION IDENTIFICATION FOR $U_{-H}$
Method   ACC (%)   Gain (%)
UDI      59.0      0
HLIAuc   62.3      5.6
HLIA     63.1      6.9
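The Gain column can be reproduced from the ACC column with a short calculation. The helper name `relative_gain` is ours; the numbers are those reported in Table III.

```python
def relative_gain(acc_method, acc_baseline):
    """Relative ACC gain of a method over the baseline, in percent."""
    return 100.0 * (acc_method - acc_baseline) / acc_baseline


table_iii = {"UDI": 59.0, "HLIAuc": 62.3, "HLIA": 63.1}
for name, value in table_iii.items():
    print(name, round(relative_gain(value, table_iii["UDI"]), 1))
```

The same function applied to the values in Table IV (48.8 against 33.1) gives 47.4, and applied to HLIA against HLIAuc (63.1 against 62.3) gives 1.3.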
3) Influence of Error Distance: We have used an error
distance of 100 miles to measure accuracy, as described in
Section VII-A2. To investigate the influence of the error
distance, we measure accuracy at different values of the error
distance; the result is shown in Figure 2. Our method performs
much better than state-of-the-art methods when the error distance
is less than 100 miles. For example, the accuracies of our method
and UDI are 0.532 and 0.454 respectively when the error distance
is 20 miles. The accuracy approaches 1 when the error distance
is more than 2,500 miles.
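A curve such as the one in Figure 2 can be derived from the per-user error distances alone. The sketch below assumes the errors are already available in miles; the function name `accuracy_curve` and the sample values are illustrative and are not taken from the dataset.

```python
from bisect import bisect_right


def accuracy_curve(errors_miles, thresholds):
    """Fraction of users whose error distance is <= each threshold."""
    ordered = sorted(errors_miles)
    if not ordered:
        raise ValueError("no error distances given")
    # bisect_right counts how many sorted errors are <= the threshold.
    return [bisect_right(ordered, t) / len(ordered) for t in thresholds]


# Illustrative values only.
print(accuracy_curve([5, 15, 60, 150, 3000], [20, 100, 2500]))
```

Because the errors are sorted once, each additional threshold costs only a binary search, so a dense curve is cheap to compute.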
4) Influence of the ratio of users who have a home location:
To investigate the influence of the ratio of users who have a home
location, we evaluate the methods in another setting where rh =
0.2, which means that only 20% of users have a home location.
This setting is closer to the real-world case. Table IV
shows the performance of each method. From Table IV, we
find that HLIA outperforms UDI by 47.4% in relative terms. By
comparing Tables III and IV, we find that HLIA outperforms
UDI by an even larger margin when fewer users have a home location.
TABLE IV
INFLUENCE OF RATIO OF USERS WHO HAVE HOME LOCATION
Method   ACC (%)   Gain (%)
UDI      33.1      0
HLIA     48.8      47.4
C. Discussion
Content Based Approach [9, 10, 19, 23, 24] infers home
location based on texts in social networks. This approach requires
text data from the social network. Moreover, venue information
in texts can be noisy and ambiguous: users may mention
a venue merely because it appears in the news, and there can be
many places with the same name. Our method avoids such problems
by using check-in data.
Check-in Based Approach [12, 20, 21, 30, 34] infers the home
location of users using check-in data. Existing methods like
Maxvote [30], ClusterHier [21] and Avg [12, 34] have the
shortcoming that they can only predict home locations of users
who have check-in data. However, in the real world, only 12%
of adult smartphone owners have used geo-social services to
check in at some location [2]. Our method HLIA predicts the
home locations of users who have check-in data with an accuracy
of 92.1%, the best among the compared methods, even though the
average number of check-ins per user is only about 2.7. For all
users in $U_{-H}$, including those who do not have check-in data,
our method reaches an accuracy of 63.1% when only 16.7% of them
have check-in data and 20% of users do not have a home location,
outperforming state-of-the-art methods by about 6.9%. Compared
to Li et al. [20], our method can model closeness in social
structure in LBSNs. Moreover, our method does not need to
update the home locations of users who have check-in data in the
updating stage.
In summary, our method outperforms state-of-the-art methods
and achieves the best performance in every setting we evaluated.
VIII. CONCLUSION AND FUTURE WORK
Home Location Identification of users in Location-Based
Social Networks is important for location-based applications
such as personal search and recommendations. In this paper,
we propose a Trust-based influence model called TSU based
on Social relationship data (social friendship, social trust)
and User-centric data (check-ins) in LBSNs. We develop a
Home Location Identification method based on the TSU model.
Extensive experiments on a large-scale dataset demonstrate that
our method outperforms state-of-the-art methods by 6.9%. Compared
to previous research, we are the first to demonstrate the
effectiveness of using social trust, which measures closeness in
social structure, for the Home Location Identification problem.
In the future, we will further study how to use time information
and social structure in social networks for the Home Location
Identification problem. We also plan to study how to improve
location-based services based on our Home Location Identification
method and how to protect the privacy of users in social networks.
REFERENCES
[1] Foursquare. https://en.wikipedia.org/wiki/Foursquare.
[2] Pewresearch. http://www.pewinternet.org/2013/09/12/location-based-services.
[3] United States cities by population. https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population.
[4] Fabian Abel, Qi Gao, Geert-Jan Houben, and Ke Tao.
Analyzing user modeling on twitter for personalized
news recommendations. In User Modeling, Adaption and
Personalization, pages 1–12. Springer, 2011.
[5] Amr Ahmed, Yucheng Low, Mohamed Aly, Vanja Josifovski,
and Alexander J Smola. Scalable distributed inference
of dynamic user interests for behavioral targeting.
In Proceedings of the 17th ACM SIGKDD international
conference on Knowledge discovery and data mining,
pages 114–122. ACM, 2011.
[6] Miltiadis Allamanis, Salvatore Scellato, and Cecilia Mascolo.
Evolution of a location-based online social network:
analysis and models. In Proceedings of the 2012 ACM
conference on Internet measurement conference, pages
145–158. ACM, 2012.
[7] Lars Backstrom, Eric Sun, and Cameron Marlow. Find
me if you can: improving geographical prediction with
social and spatial proximity. In Proceedings of the 19th
international conference on World wide web, pages 61–
70. ACM, 2010.
[8] David Carmel, Naama Zwerdling, Ido Guy, Shila Ofek-
Koifman, Nadav Har’El, Inbal Ronen, Erel Uziel, Sivan
Yogev, and Sergey Chernov. Personalized social search
based on the user’s social network. In Proceedings of
the 18th ACM conference on Information and knowledge
management, pages 1227–1236. ACM, 2009.
[9] Swarup Chandra, Latifur Khan, and Fahad Bin Muhaya.
Estimating twitter user location using social interactions–
a content based approach. In 2011 IEEE Third International
Conference on Privacy, Security, Risk and Trust (PASSAT)
and 2011 IEEE Third International Conference on Social
Computing (SocialCom), pages 838–843. IEEE, 2011.
[10] Zhiyuan Cheng, James Caverlee, and Kyumin Lee. You
are where you tweet: a content-based approach to geo-
locating twitter users. In Proceedings of the 19th ACM
international conference on Information and knowledge
management, pages 759–768. ACM, 2010.
[11] Zhiyuan Cheng, James Caverlee, Kyumin Lee, and
Daniel Z Sui. Exploring millions of footprints in location
sharing services. ICWSM, 2011:81–88, 2011.
[12] Eunjoon Cho, Seth A Myers, and Jure Leskovec. Friendship
and mobility: user movement in location-based social
networks. In Proceedings of the 17th ACM SIGKDD
international conference on Knowledge discovery and
data mining, pages 1082–1090. ACM, 2011.
[13] Ahmad Foroozani, Mohammed Gharib, Ali Mohammad
Afshin Hemmatyar, and Ali Movaghar. A novel
human mobility model for manets based on real data.
In Computer Communication and Networks (ICCCN),
2014 23rd International Conference on, pages 1–7. IEEE,
2014.
[14] William B Frakes and Ricardo Baeza-Yates. Information
retrieval: data structures and algorithms. 1992.
[15] Xiaowen Gong, Xu Chen, Kai Xing, Dong-Hoon Shin,
Mengyuan Zhang, and Junshan Zhang. Personalized location
privacy in mobile networks: A social group utility
approach. In Computer Communications (INFOCOM),
2015 IEEE Conference on, pages 1008–1016. IEEE,
2015.
[16] Hamed Haddadi, Richard Mortier, and Steven Hand.
Privacy analytics. ACM SIGCOMM Computer Communication
Review, 42(2):94–98, 2012.
[17] Justin J Levandoski, Mohamed Sarwat, Ahmed Eldawy,
and Mohamed F Mokbel. Lars: A location-aware recommender
system. In Data Engineering (ICDE), 2012
IEEE 28th International Conference on, pages 450–461.
IEEE, 2012.
[18] Michael Levandowsky and David Winter. Distance
between sets. Nature, 234(5323):34–35, 1971.
[19] Guoliang Li, Jun Hu, Jianhua Feng, and Kian-lee Tan.
Effective location identification from microblogs. In
Data Engineering (ICDE), 2014 IEEE 30th International
Conference on, pages 880–891. IEEE, 2014.
[20] Rui Li, Shengjie Wang, Hongbo Deng, Rui Wang, and
Kevin Chen-Chuan Chang. Towards social user profiling:
unified and discriminative influence model for
inferring home locations. In Proceedings of the 18th
ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 1023–1031. ACM,
2012.
[21] Hao Liu, Yaoxue Zhang, Yuezhi Zhou, Di Zhang, Xiaoming
Fu, and KK Ramakrishnan. Mining checkins
from location-sharing services for client-independent ip
geolocation. In INFOCOM, 2014 Proceedings IEEE,
pages 619–627. IEEE, 2014.
[22] Jiahui Liu, Peter Dolan, and Elin Rønby Pedersen. Personalized
news recommendation based on click behavior.
In Proceedings of the 15th international conference on
Intelligent user interfaces, pages 31–40. ACM, 2010.
[23] Jalal Mahmud, Jeffrey Nichols, and Clemens Drews.
Where is this tweet from? inferring home locations of
twitter users. ICWSM, 12:511–514, 2012.
[24] Jalal Mahmud, Jeffrey Nichols, and Clemens Drews.
Home location identification of twitter users. ACM Transactions
on Intelligent Systems and Technology (TIST), 5(3):47, 2014.
[25] Andrew McCallum, Kamal Nigam, et al. A comparison
of event models for naive bayes text classification. In
AAAI-98 workshop on learning for text categorization,
volume 752, pages 41–48. Citeseer, 1998.
[26] Alan Mislove, Bimal Viswanath, Krishna P Gummadi,
and Peter Druschel. You are who you know: inferring
user profiles in online social networks. In Proceedings of
the third ACM international conference on Web search
and data mining, pages 251–260. ACM, 2010.
[27] Ben Niu, Qinghua Li, Xiaoyan Zhu, and Hui Li. A fine-
grained spatial cloaking scheme for privacy-aware users
in location-based services. In Computer Communication
and Networks (ICCCN), 2014 23rd International Conference
on, pages 1–8. IEEE, 2014.
[28] Ed Novak and Qun Li. Near-pri: Private, proximity based
location sharing. In INFOCOM, 2014 Proceedings IEEE,
pages 37–45. IEEE, 2014.
[29] Sarah Pidcock and Urs Hengartner. Zerosquare: A
privacy-friendly location hub for geosocial applications.
In Proc. 2nd ACM SIGCOMM Workshop Networking,
Systems, and Applications Mobile Handhelds, 2013.
[30] Tatiana Pontes, Marisa Vasconcelos, Jussara Almeida,
Ponnurangam Kumaraguru, and Virgilio Almeida. We
know where you live: privacy characterization of
foursquare behavior. In Proceedings of the 2012 ACM
Conference on Ubiquitous Computing, pages 898–905.
ACM, 2012.
[31] Foster Provost, Brian Dalessandro, Rod Hook, Xiaohan
Zhang, and Alan Murray. Audience selection for on-
line brand advertising: privacy-friendly social network
targeting. In Proceedings of the 15th ACM SIGKDD
international conference on Knowledge discovery and
data mining, pages 707–716. ACM, 2009.
[32] Feng Qiu and Junghoo Cho. Automatic identification of
user interest for personalized search. In Proceedings of
the 15th international conference on World Wide Web,
pages 727–736. ACM, 2006.
[33] Mohamed Sarwat, Justin J Levandoski, Ahmed Eldawy,
and Mohamed F Mokbel. Lars*: a scalable and efficient
location-aware recommender system. IEEE Transactions
on Knowledge and Data Engineering (TKDE), 2013.
[34] Salvatore Scellato, Anastasios Noulas, Renaud Lambiotte,
and Cecilia Mascolo. Socio-spatial properties of
online location-based social networks. ICWSM, 11:329–
336, 2011.
[35] Wei Wei, Xiaojun Zhu, and Qun Li. Lbsnsim: analyzing
and modeling location-based social networks. In INFOCOM,
2014 Proceedings IEEE, pages 1680–1688. IEEE,
2014.
[36] Ning Xia, Han Hee Song, Yong Liao, Marios Iliofotou,
Antonio Nucci, Zhi-Li Zhang, and Aleksandar Kuzmanovic.
Mosaic: Quantifying privacy leakage in mobile
networks. In ACM SIGCOMM Computer Communication
Review, volume 43, pages 279–290. ACM, 2013.
[37] Zhijun Yin, Rui Li, Qiaozhu Mei, and Jiawei Han. Exploring
social tagging graph for web object classification.
In Proceedings of the 15th ACM SIGKDD international
conference on Knowledge discovery and data mining,
pages 957–966. ACM, 2009.
[38] Leah Zhao, Neil Wong Hon Chan, Shanchieh Jay Yang,
and Roy W Melton. Privacy sensitive resource access
monitoring for android systems. In Computer Communication
and Networks (ICCCN), 2015 24th International
Conference on, pages 1–6. IEEE, 2015.
[39] Yuan Zhong, Nicholas Jing Yuan, Wen Zhong, Fuzheng
Zhang, and Xing Xie. You are where you go: Inferring
demographic attributes from location check-ins. In Proceedings
of the Eighth ACM International Conference
on Web Search and Data Mining, pages 295–304. ACM,
2015.