We Know Where You Are: Home Location Identification in Location-Based Social Networks

Yulong Gu, Yuan Yao, Weidong Liu, Jiaxing Song
Department of Computer Science and Technology
Tsinghua University
Beijing, 100084, China
guyulongcs@gmail.com, yaoyuan13@mails.tsinghua.edu.cn, {liuwd, jxsong}@tsinghua.edu.cn
Abstract—The rapid spread of smartphones has led to the increasing
popularity of Location-Based Social Networks (LBSNs) such as Foursquare,
Gowalla and Facebook Places, where users can publish information about
their current location. In LBSNs, identifying the home locations of users
is very important for applications such as effective location-based
advertisement and recommendation. However, this problem is challenging
because the location information in LBSNs is sparse and noisy: only a
small percentage of users share their home location due to privacy
concerns; users may check in at diverse places far from their home and
make friends far away; and many users do not have any check-in
information at all. In this paper, we propose a trust-based influence
model, named TSU, to solve the problem. Specifically, TSU is a Trust-based
unified probabilistic model that models edges in LBSNs based on signals
from Social relationship data (social friendship, social trust) and
User-centric data (check-in data). We propose a Home Location
Identification method based on the TSU model and evaluate it on a large
real-world LBSN dataset. Extensive experiments demonstrate that our method
significantly outperforms state-of-the-art methods.
Keywords—Home Location Identification, Location-Based Social Networks,
Trust, Influence Model, Social Networks
I. INTRODUCTION
In recent years, we have seen a rapid proliferation of social networks
such as Facebook¹, Twitter² and Google+³. As the largest online social
network in the world, Facebook has over 1.18 billion monthly active users
as of August 2015⁴. The rapid growth of the mobile internet and of
location-acquisition technologies has led to the increasing popularity of
Location-Based Social Networks (LBSNs) such as Foursquare⁵, Gowalla⁶ and
Brightkite⁷, which embed location into social networks. Users in LBSNs can
conveniently log their activity histories with spatio-temporal data by
checking in at various venues (e.g., scenic spots, restaurants, airports)
at any time using their smartphones. The inherent nature of LBSNs
encourages users to publish their current location (i.e., check-ins). The
same is true for most popular social network websites, such as Facebook,
Twitter and Google+.
¹ https://www.facebook.com
² https://twitter.com
³ https://plus.google.com/
⁴ https://en.wikipedia.org/wiki/Facebook
⁵ https://foursquare.com
⁶ https://en.wikipedia.org/wiki/Gowalla
⁷ https://en.wikipedia.org/wiki/Brightkite
User profiling, which aims to infer a user’s attributes such as age,
gender, interests, home location and education, has been a hot topic in
academia. Much research has been done on user profiling [26, 39] to serve
personalized search [8, 32], targeted advertisement [5, 31], news
recommendation [4, 22] and so on. With the rapid growth of LBSNs, Home
Location Identification has become one of the most important user
profiling problems, because the home locations of users are extremely
important for applications that provide location-based services. For
example, profiling users’ home locations enables search engines to provide
personalized search results in the users’ mother tongue, news sites to
recommend localized news and advertisers to recommend local ads. The home
location of a user is defined as the relatively “permanent” place where
the user spends most of his time [20]. It captures the major and static
geographic scope of the user and therefore provides valuable information
for personalized services.
The home location problem is quite challenging because the signals that
may help to identify home locations are sparse and noisy. Firstly, only a
small percentage of users provide their home locations in social networks
due to privacy concerns. On Twitter, only 16% of users register city-level
locations in their profiles, and most users leave general, nonsensical or
even blank information [20]. Secondly, users may check in at various
places far from their home and make friends far away. Thirdly, many users
do not have any check-in data. As of September 2013, only 30% of users
provide their location information to at least one social media account,
and 12% of adult smartphone owners have used geo-social services to check
in at some location [2]. This problem has recently attracted great
interest from researchers [9, 10, 19–21, 30]. Existing approaches can
mainly be divided into two categories: the Content Based Approach
[9, 10, 19] and the Check-in Based Approach [20, 21, 30]. The content
based approach infers the home location of users with models built on
location information extracted from texts, such as tweets, in social
networks. The check-in based approach infers the home location of users
from their check-in data.
In this paper, we propose a trust-based influence model, named TSU, to
model edges in LBSNs. We represent an LBSN as a directed heterogeneous
graph where the nodes can be users or venues. Edges in the graph can be
friendships between users or check-in edges from users to venues. TSU is a
Trust-based unified probabilistic model that models edges in the graph
based on signals from Social relationship data (social friendship, social
trust) and User-centric data (check-in data) in LBSNs. In TSU, we model
each node with a location and an influence scope. We assume that each edge
$e\langle t, h\rangle$ from a tail node $t$ to a head node $h$ is
generated according to the nodes’ locations, the influence scope of the
head node $h$, and the social trust value of the head node $h$ for the
tail node $t$. TSU is motivated by three observations: people tend to make
friends with people living nearby and to check in at venues near them;
people tend to follow celebrities and visit popular places; and people
tend to make friends with those who share more common friends with them.
We propose the idea of using “social trust”, which measures closeness in
social structure, to model edges in LBSNs. Specifically, we measure social
trust between users by calculating the Jaccard Similarity [18] of their
friend sets. The social trust value is higher if two users have more
common friends. To the best of our knowledge, we are the first to propose
using “social trust” for Home Location Identification.
For the Home Location Identification problem, we propose a two-stage
method based on the TSU model. In the first stage, for users who have
check-in data, we develop a single-pass clustering algorithm to cluster
their check-ins and select the center of the largest cluster as their home
location. In the second stage, we use a global iteration method to
estimate the home locations of users so that the joint conditional
probability of generating all the edges is maximized.
We conduct extensive experiments to evaluate our Home Location
Identification method and compare it with state-of-the-art methods
[12, 20, 21, 30, 34] on a large-scale Foursquare dataset containing about
836K users and 649K venues. Experimental results show that our method
predicts the home locations of users who have check-in data with an
accuracy of 92.1%, even though the average number of check-ins per user is
only about 2.7. For all users whose home location is unknown, our method
achieves an accuracy of 63.1%, outperforming state-of-the-art methods by
about 6.9%, in a setting where only 16.7% of users have check-in data and
20% of users do not have a home location. In short, our method
significantly outperforms state-of-the-art methods and achieves the best
performance.
Our main contributions are:
• We are the first to propose a trust-based unified probabilistic model,
  called TSU, for Home Location Identification.
• We are the first to propose using social trust to measure closeness in
  social structure for the Home Location Identification problem.
• We propose a two-stage Home Location Identification method based on the
  TSU model, and extensive experiments demonstrate that our method
  outperforms state-of-the-art methods by about 6.9%.
The rest of the paper is structured as follows: Section
II introduces related work. In Section III, we describe the
dataset. In Section IV, we formulate the Home Location
Identification problem. In Section V, we present the trust-
based influence model TSU. In Section VI, we introduce our
method for Home Location Identification problem. In Section
VII, we demonstrate the experiment results. In Section VIII,
we conclude the paper and discuss future work.
II. RELATED WORK
In this section, we divide related work into three parts: User
Profiling, Human Mobility and Home Location Identification.
A. User Profiling
User profiling aims to infer user’s attributes, such as age,
gender, interests, home location, education and so on. Mislove
et al. [26] propose a method of inferring users’ attributes like
colleges, matriculation years and majors of students by detect-
ing communities in social networks based on the phenomenon
that users with common attributes are more likely to be friends
and often form dense communities. Zhong et al. [39] extract
rich semantics of users’ check-ins, employ tensor factorization
to draw out low dimensional representations of users’ intrinsic
check-in preferences and use the extracted features in a classifier to
infer various demographic attributes. With the increasing popularity of
location-based services (LBSs), there have been growing concerns about
location privacy. Much research has recently been done to protect the
privacy of users [15, 16, 27–29, 36, 38].
User profiling is used for various applications such as personalized
search, targeted advertisement and news recommendation. The works in
[8, 32] focus on profiling users’ interests to serve personalized search.
Qiu and Cho [32] show that users’ preferences can be learned accurately
even from little click-history data and that they can significantly
improve the performance of personalized search.
B. Human Mobility
Much research has been done on the social and temporal characteristics of
how people use location sharing services. The study of human mobility
patterns is significant for social science, the design of future
location-based services, traffic forecasting, urban planning and so on.
Cheng et al. [11] investigate 22 million check-ins and find that human
mobility follows certain spatial and periodic patterns. Cho et al. [12]
find that humans experience a combination of periodic movement that is
geographically limited and seemingly random jumps influenced by the social
network structure. Allamanis et al. [6] demonstrate that geographic
distance plays an important role in the creation of new social
connections, and that users form new ties with friends of existing friends
because connections arise among users visiting the same place. Wei et
al. [35] propose a trace-driven model for generating synthetic LBSN
datasets that capture the properties of the original datasets. Foroozani
et al. [13] propose a model that captures human mobility properties by
introducing hotspot zones, using a graph of hotspot zones as the input
area map, dividing the day into periods and modeling various speeds in
different times and spaces.
C. Home Location Identification
Home Location Identification focuses on identifying the home location of
users in social networks. There are two types of approaches to this
problem: the Content Based Approach and the Check-in Based Approach.
1) Content Based Approach: Content based approach infers
home location of users using models based on extracted
location information from texts like tweets in social networks.
Cheng et al. [10] propose a probabilistic framework for
estimating a Twitter user’s city-level location based purely
on the content of the user’s tweets. They use a classification
component for automatically identifying words in tweets with
a strong local geo-scope and a lattice-based neighborhood
smoothing model for refining a user’s location estimate. Chan-
dra et al. [9] employ a probabilistic framework to estimate the
city-level location of a Twitter user, based on the content of
the tweets in their dialogues. The works in [23, 24] use an ensemble of
statistical and heuristic classifiers to predict the home locations of
Twitter users based on the content of tweets and the tweeting behavior of
users. Li et al. [19] propose a global location identification method that
combines multiple microblogs of a user and utilizes them to identify the
user’s location. The method organizes points of interest into a tree
structure, extracts candidate locations from each microblog of a user, and
then aggregates these candidate locations and identifies the top-k
locations of the user.
2) Check-in Based Approach: The check-in based approach infers the home
location of users from their check-in data. Cho et al. [12] infer the home
location by discretizing the world into 25 by 25 km cells and defining the
home location as the average position of check-ins in the cell with the
most check-ins. Li et al. [20] propose a unified and discriminative
influence model which models the influence scope of users and venues.
Based on this model, they develop a location prediction method that
identifies the home locations of Twitter users using signals observed from
friends and from venues identified in tweets. Pontes et al. [30] use a
majority voting scheme which takes the most popular location of a user as
her home location. Liu et al. [21] estimate home locations using a
hierarchical clustering method to cluster check-ins made at night.
III. DATASET DESCRIPTION
In this section, we briefly introduce the main characteristics of
Foursquare as well as the crawled dataset used in our experiments.
A. Foursquare: Background
Foursquare is currently one of the largest and most popular LBSNs. As a
local search and recommendation service, Foursquare provides search
results or recommended places to go based on targeted locations. The
service was created in late 2008 and launched in 2009. Users in Foursquare
can share their locations with friends and followers through check-ins.
Check-ins are performed via mobile devices with GPS when a user is close
to specific locations known as venues, which represent real locations of a
great variety of categories such as airports or restaurants. As of
December 2013, Foursquare had 45 million registered users [1]. Foursquare
gives incentives to users who visit (check in at) specific places (venues)
using rewards such as mayorships for frequent visitors. Users can post
tips at specific venues, commenting on their experiences when visiting the
corresponding physical places. Moreover, Foursquare enables users to rate
venues by answering questions that help Foursquare understand how people
feel about a place, such as whether or not a user likes it. More than 50
million people use Foursquare and Swarm (a companion app to Foursquare)
each month across desktop, mobile web and mobile apps, and people have
checked in more than 8 billion times worldwide as of February 2016⁸.
B. Foursquare Dataset
In this paper, we use a widely used and publicly available Foursquare
dataset extracted from the Foursquare application through the public API
[17, 33]. This dataset contains 2,153,471 users, 1,143,092 venues,
1,021,970 check-ins and 27,098,490 social connections. Each user has a
unique id and a geospatial location (latitude and longitude) that
represents the user’s home town location. Each venue has a unique id and a
geospatial location (latitude and longitude). The social graph data
contains the edges (connections) that exist between users. Each social
connection consists of two users (friends) represented by two unique ids
(first user id and second user id).
We focus our study of Home Location Identification on Foursquare users
within the continental United States. To this end, we keep only valid
users who are in the social graph and located in the continental United
States. The statistics of the dataset after applying this filter are shown
in Table I.
TABLE I
SUMMARY STATISTICS OF FOURSQUARE DATASET
Type                 Number
Users                835,896
Venues               648,825
Check-ins            370,477
Social Graph Edges   12,924,609
C. Mapping Location to City
We need a method to map a location to its corresponding city so that,
given a user’s home location, we know which city the user lives in. In
this paper, we map a location to a city as follows: the candidate cities
are the 297 cities in the United States with a population of at least
100,000 on July 1, 2014, as estimated by the United States Census Bureau
[3]. We define a location’s mapped city as the candidate city nearest to
the location.
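The nearest-city mapping can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper does not state which distance is used, so great-circle (haversine) distance is an assumption here, and the three-city `CITIES` table is a hypothetical stand-in for the 297 census cities.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def map_to_city(lat, lon, cities):
    """Return the name of the candidate city nearest to (lat, lon)."""
    return min(cities, key=lambda name: haversine_miles(lat, lon, *cities[name]))


# Hypothetical candidate list; the paper uses 297 cities from census data.
CITIES = {
    "New York": (40.7128, -74.0060),
    "Chicago": (41.8781, -87.6298),
    "Los Angeles": (34.0522, -118.2437),
}

print(map_to_city(40.73, -73.99, CITIES))
```

With only 297 candidates, a linear scan per location is cheap; a spatial index would only matter for a much larger candidate set.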
⁸ https://foursquare.com/about
Fig. 1. An example of LBSN
IV. HOME LOCATION IDENTIFICATION PROBLEM FORMULATION
In this section, we firstly represent a Location-based Social
Network as a directed heterogeneous graph and then formalize
the Home Location Identification problem.
A. Location-based Social Networks Formulation
We represent a Location-based Social Network as a directed heterogeneous
graph $G = (N, E)$. An example of an LBSN is shown in Figure 1. In the
graph, nodes represent users or venues. There are two types of edges in
the graph: (1) following relationship edges from users to other users;
(2) check-in edges from users to venues. LBSNs where friendships are
undirected can also be represented by the directed heterogeneous graph by
creating two following relationship edges for each undirected edge. We
denote concepts in LBSNs as follows:
$N = \{n_i\},\ i = 1 \ldots |N|$: the set of nodes in $G$
$E = \{e\langle n_i, n_j\rangle\}$: the set of edges in $G$, where $n_i$ is the tail node and $n_j$ is the head node of the edge
$U = \{u_i\},\ i = 1 \ldots |U|$: the set of users in $N$
$V = \{v_i\},\ i = 1 \ldots |V|$: the set of venues in $N$
$F = \{f\langle u_i, u_j\rangle\}$: following relationship edges from user node $u_i$ to user node $u_j$
$C = \{C_{ij}\} = \{c\langle u_i, v_j\rangle\}$: check-in edges from user node $u_i$ to venue node $v_j$
$U_H$: the set of users whose home locations are known
$\overline{U_H}$: the set of users whose home locations are not known
$U_C$: the set of users who have check-in data
$\overline{U_C}$: the set of users who do not have check-in data
$L$: a geographical location denoted by $(Lat, Lon)$, where $Lat$ is the latitude and $Lon$ is the longitude
$L_{u_i}$: home location of user $u_i$
$L_{v_j}$: location of venue $v_j$
We have that $N = U \cup V$, $E = F \cup C \cup R$, $U = U_H \cup \overline{U_H}$ and $U = U_C \cup \overline{U_C}$.
Further, we denote the neighbors of a node along each edge type as follows:
$I_e(n)$: incoming nodes of node $n$ for edge type $e$
$O_e(n)$: outgoing nodes of node $n$ for edge type $e$
$I_f(u_i)$: users who follow user $u_i$
$O_f(u_i)$: users who are followed by user $u_i$
$O_c(u_i)$: venues checked in at by user $u_i$
$I_c(v_j)$: users who check in at venue $v_j$
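One way to hold this graph in memory is sketched below. The class and attribute names are illustrative assumptions, not taken from the paper; each attribute corresponds to one of the neighbor sets defined above, and an undirected friendship is stored as two following edges, as described in the text.

```python
from collections import defaultdict


class LBSNGraph:
    """Directed heterogeneous graph with user-user and user-venue edges."""

    def __init__(self):
        self.followers = defaultdict(set)  # I_f(u): users who follow u
        self.followees = defaultdict(set)  # O_f(u): users followed by u
        self.venues_of = defaultdict(set)  # O_c(u): venues u checked in at
        self.visitors = defaultdict(set)   # I_c(v): users who checked in at v

    def add_follow(self, tail, head):
        """Add a following edge f<tail, head>."""
        self.followees[tail].add(head)
        self.followers[head].add(tail)

    def add_friendship(self, a, b):
        """Represent an undirected friendship as two following edges."""
        self.add_follow(a, b)
        self.add_follow(b, a)

    def add_checkin(self, user, venue):
        """Add a check-in edge c<user, venue>."""
        self.venues_of[user].add(venue)
        self.visitors[venue].add(user)


g = LBSNGraph()
g.add_friendship("u1", "u2")
g.add_follow("u3", "u1")
g.add_checkin("u1", "v1")
print(sorted(g.followers["u1"]))
```

Sets collapse repeated check-ins at the same venue into one edge; a counter would be needed if check-in frequency should be kept.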
B. Home Location Identification Problem Formulation
Home Location Identification Problem: For a Location-based Social Network
$G = (N, E)$, for each user $u_i \in \overline{U_H}$, estimate a home
location $\tilde{L}_{u_i}$ so as to make $\tilde{L}_{u_i}$ close to
$u_i$’s true home location $L_{u_i}$.
V. TSU: TRUST-BASED INFLUENCE MODEL
In this section, we introduce a trust-based influence model, named TSU, to
model edges in Location-based Social Networks.
A. Motivation of TSU model
Existing research has exploited social friendship and check-in data for
Home Location Identification [12, 20, 21, 30, 34]. Our model TSU exploits
a new signal, “social trust”. Specifically, TSU is a Trust-based influence
model based on Social relationship data (social friendship, social trust)
and User-centric data (check-in data).
1) social friendship: It has been observed in social networks such as
Facebook and Twitter that the probability of friendship decreases as the
distance between nodes increases [7, 20]. Li et al. [20] find that
different nodes have different influence in social networks, which means
that different head nodes have different probabilities of attracting tail
nodes at the same distance. For example, a star is more likely than a
regular user to attract users who live far away.
2) social trust: Existing methods [20] treat the friend relation as a
binary relationship. However, closer friends in social networks should
have more influence on each other’s home location. We propose the concept
of “social trust” to measure closeness in social structure and are the
first to apply it to the Home Location Identification problem. If two
users have more common friends, the social trust value between them is
higher and they tend to live nearer to each other.
3) check-in data: We can predict the home location of a user from his
check-in data because users tend to visit nearby venues [12, 20].
B. Social Trust
In this paper, we propose “social trust” to measure closeness in social
structure and apply it in the TSU model. We denote the social trust value
of node $n_i$ for node $n_j$ as $T_{ji}$ and measure social trust between
nodes using the Jaccard Similarity [18] of the users’ friend sets. The
Jaccard Similarity is a statistic for comparing the similarity and
diversity of finite sample sets, defined as the size of the intersection
divided by the size of the union of the sets. Specifically, for user nodes
$u_i$ and $u_j$, we denote their common friends as
$CF(u_i, u_j) = F(u_i) \cap F(u_j)$, where $F(u_i)$ is the friend set of
user $u_i$. We define $T_{ji}$ in Equation 1:
$$T_{ji} = Jaccard(u_i, u_j) = \frac{|F(u_i) \cap F(u_j)|}{|F(u_i) \cup F(u_j)|} \tag{1}$$
It follows that $T$ is symmetric, i.e. $T_{ji} = T_{ij}$.
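Equation 1 can be computed directly from friend sets, as in the sketch below. The function name and the toy `FRIENDS` table are illustrative; returning 0 when both friend sets are empty is an assumption, since the paper does not discuss that case.

```python
def social_trust(friends, ui, uj):
    """Jaccard similarity of the friend sets of users ui and uj (Equation 1)."""
    fi = friends.get(ui, set())
    fj = friends.get(uj, set())
    union = fi | fj
    if not union:
        return 0.0  # assumption: no friends on either side means no trust
    return len(fi & fj) / len(union)


# Toy friend sets: a and b share exactly one friend (c) out of five in total.
FRIENDS = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "e"},
}

print(social_trust(FRIENDS, "a", "b"))
```

Because intersection and union are both symmetric, the function returns the same value for `(ui, uj)` and `(uj, ui)`, matching the symmetry of $T$.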
C. Formulation of TSU Model
We use a trust-based influence model called TSU to model edges in LBSNs.
In this model, we denote the influence of a node $n_i$ as $I_{n_i}$, which
is a probability distribution over the geographic plane. For a node $n_i$,
we define $n_i$’s influence on another node $n_j$ at a location $L$ as the
probability that $n_j$ builds an edge $e\langle n_j, n_i\rangle$ to it. An
influential node has a broader influence scope and more influence at the
same distance than an ordinary node.
1) Influence Model of Nodes on Geographic: Following previous research
[20], we choose a Gaussian distribution to capture a node’s influence for
its expressiveness and simplicity. Specifically, we model a node $n_i$’s
influence $I_{n_i}$ as a bivariate Gaussian distribution
$\mathcal{N}(L_{n_i}, \Sigma_{n_i})$, centered at $n_i$’s location
$L_{n_i} = (Lat_{n_i}, Lon_{n_i})$ and with the covariance matrix
$\Sigma_{n_i}$ as its influence scope. We assume that the influence scope
of a node is the same on the latitude and longitude dimensions, so
$\Sigma_{n_i} = \begin{pmatrix} \sigma_{n_i}^2 & 0 \\ 0 & \sigma_{n_i}^2 \end{pmatrix}$.
The influence probability of node $n_i$ at a location $L$ is given by
Equation 2:
$$P(L \mid I_{n_i}) = \frac{1}{2\pi\sigma_{n_i}^2} \exp\left(-\frac{(Lat_{n_i} - Lat_L)^2 + (Lon_{n_i} - Lon_L)^2}{2\sigma_{n_i}^2}\right) \tag{2}$$
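2) Social trust-based User Influence Model: The probability that a user
$u_i$ influences a user $u_j$ to build a following edge to him is given by
Equation 3:
$$P(f\langle u_j, u_i\rangle \mid I_{u_i}, L_{u_j}) = \frac{T_{ji}}{2\pi\sigma_{u_i}^2} \exp\left(-T_{ji}\,\frac{(Lat_{u_i} - Lat_{u_j})^2 + (Lon_{u_i} - Lon_{u_j})^2}{2\sigma_{u_i}^2}\right) \tag{3}$$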
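3) Venue Influence Model: The probability that a user $u_i$ checks in at
venue $v_j$ is given by Equation 4:
$$P(c\langle u_i, v_j\rangle \mid I_{v_j}, L_{u_i}) = \frac{1}{2\pi\sigma_{v_j}^2} \exp\left(-\frac{(Lat_{v_j} - Lat_{u_i})^2 + (Lon_{v_j} - Lon_{u_i})^2}{2\sigma_{v_j}^2}\right) \tag{4}$$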
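The three influence formulas (Equations 2 to 4) can be evaluated as in the following sketch. It reads them as isotropic Gaussian densities with a negative exponent and treats latitude and longitude as planar coordinates, as the equations do; the function names are illustrative and `sigma2` stands for the squared influence scope $\sigma^2$.

```python
import math


def sq_dist(p, q):
    """Squared coordinate difference between two (lat, lon) pairs."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def influence(loc_node, sigma2, loc):
    """Equation 2: influence of a node located at loc_node, evaluated at loc."""
    return math.exp(-sq_dist(loc_node, loc) / (2 * sigma2)) / (2 * math.pi * sigma2)


def follow_prob(loc_ui, sigma2_ui, loc_uj, t_ji):
    """Equation 3: uj builds a following edge to ui, weighted by trust t_ji."""
    scale = t_ji / (2 * math.pi * sigma2_ui)
    return scale * math.exp(-t_ji * sq_dist(loc_ui, loc_uj) / (2 * sigma2_ui))


def checkin_prob(loc_vj, sigma2_vj, loc_ui):
    """Equation 4: ui checks in at venue vj (Equation 2 applied to a venue)."""
    return influence(loc_vj, sigma2_vj, loc_ui)


near = follow_prob((40.0, -74.0), 1.0, (40.1, -74.1), 0.5)
far = follow_prob((40.0, -74.0), 1.0, (42.0, -76.0), 0.5)
print(near > far)
```

These values are densities over the plane rather than normalized probabilities, so they are meaningful mainly in comparison with one another, which is all the likelihood maximization needs.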
4) TSU Model on LBSNs: We make the conditional independence assumption
that the edges from tail nodes to head nodes are conditionally independent
given the head and tail nodes. This kind of assumption is widely applied
in machine learning models such as Naive Bayes [25]. The TSU model is
shown in Equation 5, which measures the joint probability of generating
the friendship and check-in edges in an LBSN. We can estimate the unknown
home locations of users using the Maximum Likelihood Estimation (MLE)
principle under the TSU model.
Algorithm 1 HLIA: Home Location Identification Algorithm
Input: $G, F, C, R, L_{u_i}\ (u_i \in U_H)$
Output: $L_{u_i}\ (u_i \in \overline{U_H})$
1: function HLIA($G, F, C, R, L$)
2:   // Init home location of users in $\overline{U_H}$
3:   for each $u_i \in \overline{U_H}$ do    // users: no home location
4:     if $u_i \in U_C$ then    // user: has check-ins
5:       $L_{u_i}$ = SPClustering($C_{u_i}, c_\tau$)
6:     else    // user: no check-ins
7:       $L_{u_i}$ = Random
8:     end if
9:   end for
10:   // Update home locations of users in $\overline{U_H}$ iteratively
11:   while true do    // Outer Loop
12:     for each $u_i \in U$ do
13:       Update $\sigma_{u_i}^2$ based on Equation 8
14:     end for
15:     for each $v_j \in V$ do
16:       Update $\sigma_{v_j}^2$ based on Equation 9
17:     end for
18:     while true do    // Inner Loop
19:       for each $u_i \in (\overline{U_H} \cap \overline{U_C})$ do
20:         Calculate $Lat_{u_i}^{new}$ and $Lon_{u_i}^{new}$ based on Equations 6 and 7
21:       end for
22:       If Inner Loop converges, then break
23:     end while
24:     for each $u_i \in (\overline{U_H} \cap \overline{U_C})$ do
25:       $Lat_{u_i} = Lat_{u_i}^{new}$, $Lon_{u_i} = Lon_{u_i}^{new}$
26:     end for
27:     If Outer Loop converges, then break
28:   end while
29: end function
30:
Input: $L, c_\tau$    // $L$: the location list, $c_\tau$: cluster threshold
Output: $l_c$
31: function SPClustering($L, c_\tau$)
32:   $C \leftarrow \emptyset$    // clusters
33:   for each $i \in [1, Length(L)]$ do
34:     Get the cluster $C_{min}$ that has the minimum distance $d_{min}$ with $L_i$
35:     if $d_{min} < c_\tau$ then
36:       $C_{min} \leftarrow C_{min} \cup \{L_i\}$
37:     else
38:       Create a new cluster $C_{new}$
39:       $C_{new} \leftarrow \{L_i\}$
40:     end if
41:   end for
42:   return $l_c$, which is the center of the largest cluster
43: end function
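$$\begin{aligned}
P(E \mid I_u, I_v) &= P(F \mid I_u, I_v) \times P(C \mid I_u, I_v) \\
&= \prod_{f\langle u_j, u_i\rangle \in F} \frac{T_{ji}}{2\pi\sigma_{u_i}^2} \exp\left(-T_{ji}\,\frac{(Lat_{u_i} - Lat_{u_j})^2 + (Lon_{u_i} - Lon_{u_j})^2}{2\sigma_{u_i}^2}\right) \\
&\quad \times \prod_{c\langle u_i, v_j\rangle \in C} \frac{1}{2\pi\sigma_{v_j}^2} \exp\left(-\frac{(Lat_{v_j} - Lat_{u_i})^2 + (Lon_{v_j} - Lon_{u_i})^2}{2\sigma_{v_j}^2}\right)
\end{aligned} \tag{5}$$

$$Lat_{u_i} = \frac{\sum_{u_j \in I_f(u_i)} \frac{T_{ji}\, Lat_{u_j}}{\sigma_{u_i}^2} + \sum_{u_j \in O_f(u_i)} \frac{T_{ij}\, Lat_{u_j}}{\sigma_{u_j}^2} + \sum_{v_j \in O_c(u_i)} \frac{Lat_{v_j}}{\sigma_{v_j}^2}}{\sum_{u_j \in I_f(u_i)} \frac{T_{ji}}{\sigma_{u_i}^2} + \sum_{u_j \in O_f(u_i)} \frac{T_{ij}}{\sigma_{u_j}^2} + \sum_{v_j \in O_c(u_i)} \frac{1}{\sigma_{v_j}^2}} \tag{6}$$

$$Lon_{u_i} = \frac{\sum_{u_j \in I_f(u_i)} \frac{T_{ji}\, Lon_{u_j}}{\sigma_{u_i}^2} + \sum_{u_j \in O_f(u_i)} \frac{T_{ij}\, Lon_{u_j}}{\sigma_{u_j}^2} + \sum_{v_j \in O_c(u_i)} \frac{Lon_{v_j}}{\sigma_{v_j}^2}}{\sum_{u_j \in I_f(u_i)} \frac{T_{ji}}{\sigma_{u_i}^2} + \sum_{u_j \in O_f(u_i)} \frac{T_{ij}}{\sigma_{u_j}^2} + \sum_{v_j \in O_c(u_i)} \frac{1}{\sigma_{v_j}^2}} \tag{7}$$

$$\sigma_{u_i}^2 = \frac{\sum_{u_j \in I_f(u_i)} T_{ji} \left[(Lat_{u_j} - Lat_{u_i})^2 + (Lon_{u_j} - Lon_{u_i})^2\right]}{2\,|I_f(u_i)|} \tag{8}$$

$$\sigma_{v_j}^2 = \frac{\sum_{u_i \in I_c(v_j)} \left[(Lat_{u_i} - Lat_{v_j})^2 + (Lon_{u_i} - Lon_{v_j})^2\right]}{2\,|I_c(v_j)|} \tag{9}$$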
VI. HOME LOCATION IDENTIFICATION METHOD
In this section, we develop our Home Location Identification method based
on the TSU model. Specifically, we estimate the home locations that
maximize the likelihood, i.e. the joint probability of generating the
edges (friendships, check-ins).
In the TSU model shown in Equation 5, for a user $u_i \in \overline{U_H}$,
both $L_{u_i}$ and $\sigma_{u_i}$ are unknown; for a user $u_i \in U_H$
and a venue $v_j \in V$, $\sigma_{u_i}$ and $\sigma_{v_j}$ are unknown. We
differentiate Equation 5 with respect to each unknown variable and obtain
the results shown in Equations 6, 7, 8 and 9. In these equations, the
unknown variables depend on each other. We use a two-stage algorithm
called HLIA, given in Algorithm 1, to solve the problem. In Stage 1, HLIA
initializes the home locations of users who have check-in data by
clustering their check-ins with a single-pass clustering algorithm. In
Stage 2, HLIA updates the home locations of users iteratively so that the
likelihood is maximized. We prove that HLIA converges in Theorem 6.1.
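As a worked step, the location update in Equation 6 can be recovered from Equation 5 as follows. This sketch assumes the exponents in Equation 5 are negative, as for a Gaussian density, and uses $d^2(a, b)$ as shorthand (introduced here, not in the paper) for the squared coordinate difference between two nodes.

```latex
\begin{aligned}
\log P(E \mid I_u, I_v)
  &= \sum_{f\langle u_j, u_i\rangle \in F}
     \left[\log T_{ji} - \log\!\left(2\pi\sigma_{u_i}^2\right)
           - \frac{T_{ji}\, d^2(u_i, u_j)}{2\sigma_{u_i}^2}\right]
   + \sum_{c\langle u_i, v_j\rangle \in C}
     \left[-\log\!\left(2\pi\sigma_{v_j}^2\right)
           - \frac{d^2(u_i, v_j)}{2\sigma_{v_j}^2}\right] \\
\frac{\partial \log P}{\partial Lat_{u_i}}
  &= -\sum_{u_j \in I_f(u_i)} \frac{T_{ji}\,(Lat_{u_i} - Lat_{u_j})}{\sigma_{u_i}^2}
     -\sum_{u_j \in O_f(u_i)} \frac{T_{ij}\,(Lat_{u_i} - Lat_{u_j})}{\sigma_{u_j}^2}
     -\sum_{v_j \in O_c(u_i)} \frac{Lat_{u_i} - Lat_{v_j}}{\sigma_{v_j}^2}
   = 0
\end{aligned}
```

Solving this linear equation for $Lat_{u_i}$ gives the weighted average in Equation 6, and the same steps for $Lon_{u_i}$ give Equation 7. Setting the derivative with respect to $\sigma_{u_i}^2$ to zero in the first sum yields Equation 8 in the same way.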
A. Stage 1: Initialization
HLIA initializes the home locations of users who do not have home
locations in Steps 3 to 9. For a user who has check-in data, HLIA
initializes his home location by clustering his check-ins with a
single-pass clustering algorithm on locations, called SPClustering, which
is based on the Single-pass Clustering Algorithm [14]. For a user who does
not have check-in data, HLIA initializes his home location to a random
value.
The SPClustering algorithm is shown in Steps 31 to 43. It groups a
location list into clusters in a single pass and returns the center of the
largest cluster as the result. Specifically, SPClustering scans each
location $L_i$ in the location list sequentially and finds the nearest
cluster $C_{min}$ for $L_i$. If the minimum distance $d_{min}$ is less
than the threshold $c_\tau$, it adds $L_i$ to $C_{min}$. Otherwise, it
creates a new cluster $C_{new}$ containing $L_i$. Consequently,
SPClustering is a linear algorithm.
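The single-pass clustering described above can be sketched as follows. The paper does not specify how the distance between a location and a cluster is measured, so this sketch assumes the distance to the cluster's running centroid and, by default, plain Euclidean distance on (lat, lon) pairs; both are assumptions, and `c_tau` must be given in the same units as the distance function.

```python
import math


def sp_clustering(locations, c_tau, dist=math.dist):
    """Cluster locations in one pass; return the center of the largest cluster."""
    clusters = []  # each cluster is [sum_lat, sum_lon, count]
    for lat, lon in locations:
        best, d_min = None, math.inf
        for c in clusters:
            center = (c[0] / c[2], c[1] / c[2])
            d = dist((lat, lon), center)
            if d < d_min:
                best, d_min = c, d
        if best is not None and d_min < c_tau:
            # Close enough: add the location to the nearest cluster.
            best[0] += lat
            best[1] += lon
            best[2] += 1
        else:
            # Too far from every cluster (or none yet): start a new one.
            clusters.append([lat, lon, 1])
    if not clusters:
        return None
    largest = max(clusters, key=lambda c: c[2])
    return (largest[0] / largest[2], largest[1] / largest[2])


checkins = [(40.0, -74.0), (40.1, -74.1), (34.0, -118.0)]
print(sp_clustering(checkins, c_tau=1.0))
```

Keeping running sums makes each centroid lookup constant time, so the cost is proportional to the number of locations times the number of clusters, which stays small when a user has only a few check-ins.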
B. Stage 2: Updating Iteratively
HLIA updates the home locations of users who do not have check-in data
iteratively in Steps 11 to 28. The outer loop (Steps 11 to 28) updates
$\sigma_{u_i}^2$ and $\sigma_{v_j}^2$ based on Equations 8 and 9. The
inner loop (Steps 18 to 26) updates $Lat_{u_i}$ and $Lon_{u_i}$ based on
Equations 6 and 7. HLIA stops when the likelihood converges.
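The individual update rules used in Stage 2 can be sketched as below. This is an illustration of Equations 6 to 9 for a single node, not the full HLIA loop; the function names and the tuple layouts of the arguments are assumptions made for the example, and latitude and longitude are treated as planar coordinates, as in the equations.

```python
def update_location(followers, followees, venues, sigma2_ui):
    """Equations 6 and 7: trust- and scope-weighted mean of neighbor locations.

    followers: list of (loc_uj, T_ji) for users who follow ui
    followees: list of (loc_uj, T_ij, sigma2_uj) for users followed by ui
    venues:    list of (loc_vj, sigma2_vj) for venues ui checked in at
    """
    terms = [(loc, t / sigma2_ui) for loc, t in followers]
    terms += [(loc, t / s2) for loc, t, s2 in followees]
    terms += [(loc, 1.0 / s2) for loc, s2 in venues]
    total = sum(w for _, w in terms)
    if total == 0:
        return None  # no usable signal for this user
    lat = sum(loc[0] * w for loc, w in terms) / total
    lon = sum(loc[1] * w for loc, w in terms) / total
    return (lat, lon)


def update_user_sigma2(loc_ui, followers):
    """Equation 8: influence scope of user ui from its followers."""
    if not followers:
        return None
    s = sum(t * ((loc[0] - loc_ui[0]) ** 2 + (loc[1] - loc_ui[1]) ** 2)
            for loc, t in followers)
    return s / (2 * len(followers))


def update_venue_sigma2(loc_vj, visitors):
    """Equation 9: influence scope of venue vj from its visitors' locations."""
    if not visitors:
        return None
    s = sum((loc[0] - loc_vj[0]) ** 2 + (loc[1] - loc_vj[1]) ** 2
            for loc in visitors)
    return s / (2 * len(visitors))


print(update_location([((0.0, 0.0), 1.0)], [((2.0, 2.0), 1.0, 1.0)], [], 1.0))
```

A neighbor with a small influence scope or a high trust value pulls the estimate more strongly, which is how popular venues and loosely connected friends end up contributing less.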
Theorem 6.1: The Home Location Identification algorithm HLIA converges.
Proof: In the inner loop, HLIA converges and obtains the $Lat_{u_i}$ and
$Lon_{u_i}$ that maximize the likelihood for fixed $\sigma_{u_i}^2$ and
$\sigma_{v_j}^2$, as shown in [37]. In the outer loop, HLIA directly
calculates new $\sigma_{u_i}^2$ and $\sigma_{v_j}^2$ according to
Equations 8 and 9 given the locations of the nodes. Consequently, the
likelihood increases monotonically and the algorithm converges.
VII. EXPERIMENTS
A. Experiment Setup
1) Dataset: As described in Section III-B, the Foursquare dataset has
835,896 users, 648,825 venues, 370,477 check-ins and 12,924,609 social
graph edges. In the dataset, 138,983 users have check-in data,
constituting only 16.7% of all users. For users who have check-in data,
the average number of check-ins per user is about 2.7.
In the experiments, we define the ratio of users who have a home location
as $rh = |U_H| / |U|$. We randomly split the users into two parts: a
fraction $rh$ of users have a home location and a fraction $1 - rh$ do
not. We select $rh = 80\%$, following existing methods [7, 10, 20]. In
this setting, 669,472 users have a home location and 166,424 users do not.
Among the 166,424 users without a home location, 27,781 (16.7%) have
check-in data.
a) Methods:
UDI is the method developed in [20], which predicts a user’s location
based on an influence model. UDI uses signals such as friendships and
venues in tweets.
Maxvote is the baseline method developed in [30], which predicts a user’s
location by taking the most popular location of the user. We cannot
directly use a max vote scheme because latitude and longitude are
continuous, so we first map the check-in list to a city list using the
method described in Section III-C.
ClusterHier is the baseline method developed in [21], which predicts a
user’s home location using a hierarchical clustering algorithm to cluster
check-ins made at night (shared from 8:00 p.m. to 7:59 a.m. every day).
Avg is the baseline method developed in [12, 34], which discretizes the
world into 25 by 25 km cells and defines the home location as the average
position of check-ins in the cell with the most check-ins.
HLIA is our Home Location Identification method.
$HLIA_{uc}$ is our Home Location Identification method, except that, like
UDI, it also updates users who have check-in data in the iteration stage.
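As an illustration of the simplest baseline above, the Avg method can be sketched as follows. The cell construction is an assumption: the cited works do not prescribe it here, so this sketch uses a rough conversion of 111 km per degree of latitude and scales longitude by the cosine of the latitude; `KM_PER_DEG` and `avg_baseline` are illustrative names.

```python
import math
from collections import defaultdict

KM_PER_DEG = 111.0  # assumption: approximate km per degree of latitude


def avg_baseline(checkins, cell_km=25.0):
    """Return the mean (lat, lon) of the check-ins in the most populated cell."""
    cells = defaultdict(list)
    for lat, lon in checkins:
        row = math.floor(lat * KM_PER_DEG / cell_km)
        col = math.floor(lon * KM_PER_DEG * math.cos(math.radians(lat)) / cell_km)
        cells[(row, col)].append((lat, lon))
    if not cells:
        return None
    best = max(cells.values(), key=len)
    return (sum(p[0] for p in best) / len(best),
            sum(p[1] for p in best) / len(best))


checkins = [(40.00, -74.00), (40.01, -74.01), (34.00, -118.00)]
print(avg_baseline(checkins))
```

Like Maxvote and ClusterHier, this baseline returns nothing for a user without check-ins, which is the limitation the TSU model addresses by also using friendship edges.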
2) Evaluation Metrics: Following previous work [20], we measure the
performance of different methods using accuracy within an error distance
of 100 miles (ACC). Specifically, for a user $u_i$, let $L_{u_i}$ and
$\tilde{L}_{u_i}$ be his true and estimated home locations respectively,
and let $Err(u_i)$ be the error distance between $L_{u_i}$ and
$\tilde{L}_{u_i}$. Then ACC is defined in Equation 10:
$$ACC = \frac{|\{u_i \in \overline{U_H} : Err(u_i) \le 100\}|}{|\overline{U_H}|} \tag{10}$$
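The ACC metric can be computed as in the sketch below. The paper does not state how the error distance is measured, so great-circle (haversine) distance in miles is an assumption; the function names and the toy locations are illustrative.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def error_miles(p, q):
    """Great-circle distance in miles between two (lat, lon) pairs in degrees."""
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dphi = phi2 - phi1
    dlmb = math.radians(q[1] - p[1])
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def acc(true_locs, est_locs, max_miles=100.0):
    """Fraction of users whose estimated home is within max_miles of the truth."""
    if not true_locs:
        return 0.0
    hits = sum(1 for u, loc in true_locs.items()
               if error_miles(loc, est_locs[u]) <= max_miles)
    return hits / len(true_locs)


true_locs = {"a": (40.7128, -74.0060), "b": (41.8781, -87.6298)}
est_locs = {"a": (40.0, -75.0), "b": (34.0522, -118.2437)}
print(acc(true_locs, est_locs))
```

Passing a different `max_miles` reproduces the kind of sweep over error distances that the paper reports for its accuracy curve.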
B. Experiment Results
1) Home Location Identification for $\overline{U_H} \cap U_C$: The methods
Maxvote, ClusterHier and Avg have the limitation that they can only
predict the home locations of users who have check-in data. This means
that they can only make predictions for 16.7% of the users in
$\overline{U_H}$ in the dataset. We first compare the performance of the
different methods on users who have check-in data. Table II shows the
performance of each method. The results demonstrate that our method HLIA
outperforms all existing methods for users who have check-in data.
Specifically, HLIA predicts the home locations of these users with an
accuracy of 92.1%, even though the average number of check-ins per user is
only about 2.7, and achieves the best performance.
TABLE II
PERFORMANCE OF HOME LOCATION IDENTIFICATION FOR $\overline{U_H} \cap U_C$
Method        ACC (%)
HLIA          92.1
Maxvote       91.9
ClusterHier   91.6
Avg           88.8
2) Home Location Identification for $\overline{U_H}$: In this experiment,
we compare the performance of UDI, HLIA and $HLIA_{uc}$ for all users who
do not have a home location. Table III shows the performance of each
method. The column Gain in the table gives the relative gain in ACC
compared with UDI: when the ACC of a method and of UDI are $acc_m$ and
$acc_u$ respectively, the gain is $\frac{acc_m - acc_u}{acc_u}$.
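As a worked example of the gain definition, using the ACC values reported in Table III (59.0 for UDI, 62.3 for $HLIA_{uc}$ and 63.1 for HLIA):

```latex
\mathrm{Gain}(HLIA) = \frac{63.1 - 59.0}{59.0} \approx 0.069 = 6.9\%,
\qquad
\mathrm{Gain}(HLIA_{uc}) = \frac{62.3 - 59.0}{59.0} \approx 0.056 = 5.6\%
```

The gain is therefore a relative improvement over UDI, not a difference in percentage points; the absolute difference between HLIA and UDI is 4.1 points.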
a) HLIA vs. UDI: HLIA significantly improves on UDI, by 6.9% in terms of
ACC.
b) HLIA vs. $HLIA_{uc}$: In the initialization stage of HLIA, we
initialize the home locations of users in $\overline{U_C}$ to random
values and the home locations of users in $U_C$ by clustering their
check-in data. If we update the home locations of users in $U_C$ using the
randomly initialized locations in the updating stage, the accuracy of the
estimated home locations of users in $U_C$ may be affected. This is
confirmed by the experiments. By comparing HLIA and $HLIA_{uc}$, we see
that updating only the locations of users in $\overline{U_C}$ in the
updating stage of HLIA improves the ACC by 1.3%.
Fig. 2. Accuracy under different error distances
TABLE III
PERFORMANCE OF HOME LOCATION IDENTIFICATION FOR $U_H$
Method    ACC (%)   Gain (%)
UDI       59.0      0
HLIAuc    62.3      5.6
HLIA      63.1      6.9
3) Influence of Error Distance: So far we have measured accuracy at an error distance of 100 miles, as described in Section VII-A2. To investigate the influence of the error distance, we measure accuracy at different values of the error distance; the result is shown in Figure 2. Our method performs much better than state-of-the-art methods when the error distance is less than 100 miles. For example, at an error distance of 20 miles, the accuracies of our method and UDI are 0.532 and 0.454 respectively. The accuracy approaches 1 when the error distance exceeds 2,500 miles.
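A curve like the one in Figure 2 can be obtained by evaluating the accuracy at several error-distance thresholds over the same per-user errors. The sketch below shows one way to do this, assuming the error distances (in miles) have already been computed; the function name accuracy_curve and the sample error values are hypothetical and are not data from our experiments.

```python
from bisect import bisect_right


def accuracy_curve(errors_miles, thresholds):
    """Map each threshold to the fraction of users whose error is <= that threshold."""
    ordered = sorted(errors_miles)
    n = len(ordered)
    # bisect_right counts how many sorted errors are <= the threshold.
    return {t: bisect_right(ordered, t) / n for t in thresholds}


# Hypothetical per-user error distances in miles.
errors = [5, 15, 60, 150, 3000]
print(accuracy_curve(errors, [20, 100, 2500]))  # {20: 0.4, 100: 0.6, 2500: 0.8}
```

Sorting once and using binary search keeps the cost low when many thresholds are evaluated on a large user set.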
4) Influence of Ratio of Users Who Have Home Location: To investigate the influence of the ratio of users who have a home location, we evaluate the methods in another setting where rh = 0.2, which means that only 20% of users have a home location. This setting is closer to the real-world case. Table IV shows the performance of each method. From Table IV, we find that HLIA significantly outperforms UDI, by 47.4%. Comparing Tables III and IV, we find that the advantage of HLIA over UDI grows when fewer users have a home location.
TABLE IV
INFLUENCE OF RATIO OF USERS WHO HAVE HOME LOCATION
Method    ACC (%)   Gain (%)
UDI       33.1      0
HLIA      48.8      47.4
C. Discussion
Content Based Approach [9, 10, 19, 23, 24] infers home location from text in social networks. This approach requires text data, and venue information in text can be noisy and ambiguous: users may mention a venue only because it is in the news, and many places can share the same name. Our method avoids these problems by using check-in data.
Check-in Based Approach [12, 20, 21, 30, 34] infers the home location of users from check-in data. Existing methods such as Maxvote [30], ClusterHier [21] and Avg [12, 34] have the shortcoming that they can only predict home locations of users who have check-in data. However, in the real world, only 12% of adult smartphone owners have used geo-social services to check in at some location [2]. Our method HLIA predicts home locations of users who have check-in data with an accuracy of 92.1%, the best among the compared methods, even though each user has only about 2.7 check-ins on average. Our method can also predict home locations of users who do not have check-in data: over all users without a home location, of whom only 16.7% have check-in data, it reaches an accuracy of 63.1% when 20% of users do not have a home location, outperforming state-of-the-art methods by about 6.9%. Compared with Li et al. [20], our method can model closeness in the social structure of LBSNs. Moreover, it does not need to update the home locations of users who have check-in data in the updating stage.
In short, our method substantially outperforms state-of-the-art methods and achieves the best performance.
VIII. CONCLUSION AND FUTURE WORK
Home Location Identification of users in Location-Based Social Networks is important for location-based applications such as personalized search and recommendation. In this paper, we propose a Trust-based influence model called TSU, based on Social relationship data (social relationships, social trust) and User-centric data (check-ins) in LBSNs. We develop a Home Location Identification method based on the TSU model. Extensive experiments on a large-scale dataset demonstrate that our method outperforms state-of-the-art methods by 6.9%. Compared with previous research, we are the first to demonstrate the effectiveness of using social trust, which measures closeness in the social structure, for the Home Location Identification problem.
In the future, we will further study how to use time information and social structure in social networks for the Home Location Identification problem. We also plan to study how to improve location-based services based on our Home Location Identification method and how to protect the privacy of users in social networks.
REFERENCES
[1] Foursquare. https://en.wikipedia.org/wiki/Foursquare.
[2] Pewresearch. http://www.pewinternet.org/2013/09/12/location-based-services.
[3] United States cities by population. https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population.
[4] Fabian Abel, Qi Gao, Geert-Jan Houben, and Ke Tao.
Analyzing user modeling on twitter for personalized
news recommendations. In User Modeling, Adaption and
Personalization, pages 1–12. Springer, 2011.
[5] Amr Ahmed, Yucheng Low, Mohamed Aly, Vanja Josi-
fovski, and Alexander J Smola. Scalable distributed in-
ference of dynamic user interests for behavioral targeting.
In Proceedings of the 17th ACM SIGKDD international
conference on Knowledge discovery and data mining,
pages 114–122. ACM, 2011.
[6] Miltiadis Allamanis, Salvatore Scellato, and Cecilia Mas-
colo. Evolution of a location-based online social network:
analysis and models. In Proceedings of the 2012 ACM
conference on Internet measurement conference, pages
145–158. ACM, 2012.
[7] Lars Backstrom, Eric Sun, and Cameron Marlow. Find
me if you can: improving geographical prediction with
social and spatial proximity. In Proceedings of the 19th
international conference on World wide web, pages 61–
70. ACM, 2010.
[8] David Carmel, Naama Zwerdling, Ido Guy, Shila Ofek-
Koifman, Nadav Har’El, Inbal Ronen, Erel Uziel, Sivan
Yogev, and Sergey Chernov. Personalized social search
based on the user’s social network. In Proceedings of
the 18th ACM conference on Information and knowledge
management, pages 1227–1236. ACM, 2009.
[9] Swarup Chandra, Latifur Khan, and Fahad Bin Muhaya.
Estimating twitter user location using social interactions–a
content based approach. In Privacy, Security, Risk
and Trust (PASSAT) and 2011 IEEE Third International
Conference on Social Computing (SocialCom), 2011
IEEE Third International Conference on, pages 838–843.
IEEE, 2011.
[10] Zhiyuan Cheng, James Caverlee, and Kyumin Lee. You
are where you tweet: a content-based approach to geo-
locating twitter users. In Proceedings of the 19th ACM
international conference on Information and knowledge
management, pages 759–768. ACM, 2010.
[11] Zhiyuan Cheng, James Caverlee, Kyumin Lee, and
Daniel Z Sui. Exploring millions of footprints in location
sharing services. ICWSM, 2011:81–88, 2011.
[12] Eunjoon Cho, Seth A Myers, and Jure Leskovec. Friend-
ship and mobility: user movement in location-based so-
cial networks. In Proceedings of the 17th ACM SIGKDD
international conference on Knowledge discovery and
data mining, pages 1082–1090. ACM, 2011.
[13] Ahmad Foroozani, Mohammed Gharib, Ali Moham-
mad Afshin Hemmatyar, and Ali Movaghar. A novel
human mobility model for manets based on real data.
In Computer Communication and Networks (ICCCN),
2014 23rd International Conference on, pages 1–7. IEEE,
2014.
[14] William B Frakes and Ricardo Baeza-Yates. Information
retrieval: data structures and algorithms. 1992.
[15] Xiaowen Gong, Xu Chen, Kai Xing, Dong-Hoon Shin,
Mengyuan Zhang, and Junshan Zhang. Personalized lo-
cation privacy in mobile networks: A social group utility
approach. In Computer Communications (INFOCOM),
2015 IEEE Conference on, pages 1008–1016. IEEE,
2015.
[16] Hamed Haddadi, Richard Mortier, and Steven Hand.
Privacy analytics. ACM SIGCOMM Computer Commu-
nication Review, 42(2):94–98, 2012.
[17] Justin J Levandoski, Mohamed Sarwat, Ahmed Eldawy,
and Mohamed F Mokbel. Lars: A location-aware rec-
ommender system. In Data Engineering (ICDE), 2012
IEEE 28th International Conference on, pages 450–461.
IEEE, 2012.
[18] Michael Levandowsky and David Winter. Distance
between sets. Nature, 234(5323):34–35, 1971.
[19] Guoliang Li, Jun Hu, Jianhua Feng, and Kian-lee Tan.
Effective location identification from microblogs. In
Data Engineering (ICDE), 2014 IEEE 30th International
Conference on, pages 880–891. IEEE, 2014.
[20] Rui Li, Shengjie Wang, Hongbo Deng, Rui Wang, and
Kevin Chen-Chuan Chang. Towards social user pro-
filing: unified and discriminative influence model for
inferring home locations. In Proceedings of the 18th
ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 1023–1031. ACM,
2012.
[21] Hao Liu, Yaoxue Zhang, Yuezhi Zhou, Di Zhang, Xi-
aoming Fu, and KK Ramakrishnan. Mining checkins
from location-sharing services for client-independent ip
geolocation. In INFOCOM, 2014 Proceedings IEEE,
pages 619–627. IEEE, 2014.
[22] Jiahui Liu, Peter Dolan, and Elin Rønby Pedersen. Per-
sonalized news recommendation based on click behavior.
In Proceedings of the 15th international conference on
Intelligent user interfaces, pages 31–40. ACM, 2010.
[23] Jalal Mahmud, Jeffrey Nichols, and Clemens Drews.
Where is this tweet from? inferring home locations of
twitter users. ICWSM, 12:511–514, 2012.
[24] Jalal Mahmud, Jeffrey Nichols, and Clemens Drews.
Home location identification of twitter users. ACM Trans-
actions on Intelligent Systems and Technology (TIST), 5
(3):47, 2014.
[25] Andrew McCallum, Kamal Nigam, et al. A comparison
of event models for naive bayes text classification. In
AAAI-98 workshop on learning for text categorization,
volume 752, pages 41–48. Citeseer, 1998.
[26] Alan Mislove, Bimal Viswanath, Krishna P Gummadi,
and Peter Druschel. You are who you know: inferring
user profiles in online social networks. In Proceedings of
the third ACM international conference on Web search
and data mining, pages 251–260. ACM, 2010.
[27] Ben Niu, Qinghua Li, Xiaoyan Zhu, and Hui Li. A fine-
grained spatial cloaking scheme for privacy-aware users
in location-based services. In Computer Communication
and Networks (ICCCN), 2014 23rd International Confer-
ence on, pages 1–8. IEEE, 2014.
[28] Ed Novak and Qun Li. Near-pri: Private, proximity based
location sharing. In INFOCOM, 2014 Proceedings IEEE,
pages 37–45. IEEE, 2014.
[29] Sarah Pidcock and Urs Hengartner. Zerosquare: A
privacy-friendly location hub for geosocial applications.
In Proc. 2nd ACM SIGCOMM Workshop Networking,
Systems, and Applications Mobile Handhelds, 2013.
[30] Tatiana Pontes, Marisa Vasconcelos, Jussara Almeida,
Ponnurangam Kumaraguru, and Virgilio Almeida. We
know where you live: privacy characterization of
foursquare behavior. In Proceedings of the 2012 ACM
Conference on Ubiquitous Computing, pages 898–905.
ACM, 2012.
[31] Foster Provost, Brian Dalessandro, Rod Hook, Xiaohan
Zhang, and Alan Murray. Audience selection for on-
line brand advertising: privacy-friendly social network
targeting. In Proceedings of the 15th ACM SIGKDD
international conference on Knowledge discovery and
data mining, pages 707–716. ACM, 2009.
[32] Feng Qiu and Junghoo Cho. Automatic identification of
user interest for personalized search. In Proceedings of
the 15th international conference on World Wide Web,
pages 727–736. ACM, 2006.
[33] Mohamed Sarwat, Justin J Levandoski, Ahmed Eldawy,
and Mohamed F Mokbel. Lars*: a scalable and efficient
location-aware recommender system. IEEE Transactions
on Knowledge and Data Engineering (TKDE), 2013.
[34] Salvatore Scellato, Anastasios Noulas, Renaud Lam-
biotte, and Cecilia Mascolo. Socio-spatial properties of
online location-based social networks. ICWSM, 11:329–
336, 2011.
[35] Wei Wei, Xiaojun Zhu, and Qun Li. Lbsnsim: analyzing
and modeling location-based social networks. In INFO-
COM, 2014 Proceedings IEEE, pages 1680–1688. IEEE,
2014.
[36] Ning Xia, Han Hee Song, Yong Liao, Marios Iliofotou,
Antonio Nucci, Zhi-Li Zhang, and Aleksandar Kuz-
manovic. Mosaic: Quantifying privacy leakage in mobile
networks. In ACM SIGCOMM Computer Communication
Review, volume 43, pages 279–290. ACM, 2013.
[37] Zhijun Yin, Rui Li, Qiaozhu Mei, and Jiawei Han. Ex-
ploring social tagging graph for web object classification.
In Proceedings of the 15th ACM SIGKDD international
conference on Knowledge discovery and data mining,
pages 957–966. ACM, 2009.
[38] Leah Zhao, Neil Wong Hon Chan, Shanchieh Jay Yang,
and Roy W Melton. Privacy sensitive resource access
monitoring for android systems. In Computer Communi-
cation and Networks (ICCCN), 2015 24th International
Conference on, pages 1–6. IEEE, 2015.
[39] Yuan Zhong, Nicholas Jing Yuan, Wen Zhong, Fuzheng
Zhang, and Xing Xie. You are where you go: Inferring
demographic attributes from location check-ins. In Pro-
ceedings of the Eighth ACM International Conference
on Web Search and Data Mining, pages 295–304. ACM,
2015.
... According to the location-based social network, every user forms a social connection at a specific location. For v n , the function Home Location(v n , L) identifies one location from L as home location h n based on home location algorithm stated in [48]. For ...
... Each node of the triads belongs to one specific location, treated as its home location. e location of home for each user or node is identified using the home detection algorithm [48]. While critically examining the formation of the closed triad, we identified and hence proposed three cases of triads, listed as follows: ...
... In our research, we incorporated the location ID into identified homophily in a network. We utilize the existing home detection algorithm to identify the home location for each user [48]. In location-based social networks, by home location, we mean the most visited and stayed at place. ...
Article
Full-text available
Social Internet of Things (SIoT) is a variation of social networks that adopt the property of peer-to-peer networks, in which connections between the things and social actors are automatically established. SIoT is a part of various organizations that inherit the social interaction, and these organizations include industries, institutions, and other establishments. Triadic closure and homophily are the most commonly used measures to investigate social networks’ formation and nature, where both measures are used exclusively or with statistical models. The triadic closure patterns are mapped for actors’ communication behavior over a location-based social network, affecting the homophily. In this study, we investigate triads emergence in homophilic social networks. This evaluation is based on the empirical review of triads within social networks (SNs) formed on Big Data. We utilized a large location-based dataset for an in-depth analysis, the Chinese telecommunication-based anonymized call detail records (CDRs). Two other openly available datasets, Brightkite and Gowalla, were also studied. We identified and proposed three social triad classes in a homophilic network to feature the correlation between social triads and homophily. The study opened a promising research direction that relates the variation of homophily based on closure triads nature. The homophilic triads are further categorized into transitive and intransitive groups. As our concluding research objective, we examined the relative triadic throughput within a location-based social network for the given datasets. The research study attains significant results highlighting the positive connection between homophily and a specific social triad class.
... Additionally, researchers have demonstrated the ability to estimate a user's home locations (Gu et al., 2016), social relationships (Sadilek et al., 2012), as well as probabilities of returning to a venue (Preoţiuc-Pietro & Cohn, 2013) based solely on geotagged social media contents. Alrayes et al. (2020) summarized the issue of location disclosure through three dimensions (Fig. 2): what's being shared (data), who has access (visibility), and how much does a user know (awareness)? ...
... For instance, research has shown that one's location can be identified based on the textual content and timing of a social media post (McKenzie et al., 2016). Additional studies on user profiling also explored home (Gu et al., 2016) or current location identification (Bellatti et al., 2017;Pontes et al., 2012), future check-in location prediction (Gao et al., 2012), social relationship inference (Sadilek et al., 2012), returning probability computation (Preoţiuc-Pietro & Cohn 2013), and sensitive personal information calculation (e.g., gender, educational background, age, and sexual orientation) (Rossi & Musolesi, 2014;Zhong et al., 2015). Weiser and Scheider (2014) therefore suggest building a civilized cyberspace to prevent misuse of personal location information. ...
Article
Full-text available
Traditional boundaries between people are vanishing due to the rise of Internet of Things technology. Our smart devices keep us connected to the world, but also monitor our daily lives through an unprecedented amount data collection. As a result, defining privacy has become more complicated. Individuals want to leverage new technology (e.g., making friends through sharing private experiences) and also avoid unwanted consequences (e.g., targeted advertising). In the age of ubiquitous digital content, geoprivacy is unique because concerns in this area are constantly changing and context-dependent. Multiple factors influence people’s location disclosure decisions, including time, culture, demographics, spatial granularity, and trust. Existing research primarily focuses on the computational efforts of protecting geoprivacy, while the variation of geoprivacy perceptions has yet to receive adequate attention in the data science literature. In this work, we explore geoprivacy from a cognate-based perspective and tackle our changing perception of the concept from multiple angles. Our objectives are to rehumanize this field from contextual, cultural, and economic dimensions and highlight the uniqueness of geodata under the broad topic of privacy. It is essential that we understand the spatial variations of geoprivacy perceptions in the era of big data. Masking geographic coordinates can no longer fully anonymize spatial data, and targeted geoprivacy protection needs to be further investigated to improve user experience.
... On the other hand, issues such as users' privacy, high energy consumption, and loss of signal must be addressed. First, the accuracy and frequency of positioning sensing of a user allows obtaining personal information such as home and work location [Kang et al. 2004], as well as work times [Gu et al. 2016]. Next, the elevated energy consumption must be considered, given that mobile devices have limited sources of energy and that most of the times the sensing devices have other functions sharing the same source; thus, the frequency of update must be just enough for the purpose. ...
... Such attacks typically use information that users disclose in their profiles and posts to feed ML models capable of inferring sensitive attributes and behavioural patterns [17]. For instance, Gu et al. [18] developed a probabilistic model that employs user-centric and social-relationship data to infer individuals' home location. Such a model can achieve more than 90% accuracy if the targeted user had disclosed some location data in the past (e.g., check-in at a bar or a restaurant) and around 60% when not. ...
Conference Paper
Full-text available
Social Coding Platforms (SCPs) like GitHub have become central to modern software engineering thanks to their collaborative and version-control features. Like in mainstream Online Social Networks (OSNs) such as Facebook, users of SCPs are subjected to privacy attacks and threats given the high amounts of personal and project-related data available in their profiles and software repositories. However, unlike in OSNs, the privacy concerns and practices of SCP users have not been extensively explored nor documented in the current literature. In this work, we present the preliminary results of an online survey (N=105) addressing developers' concerns and perceptions about privacy threats steaming from SCPs. Our results suggest that, although users express concern about social and organisational privacy threats, they often feel safe sharing personal and project-related information on these platforms. Moreover, attacks targeting the inference of sensitive attributes are considered more likely than those seeking to re-identify source-code contributors. Based on these findings, we propose a set of recommendations for future investigations addressing privacy and identity management in SCPs.
... Later, mobile phones and smartphones gained popularity, as they could enable researchers to study individual-level mobility patterns (2)(3)(4). Other emerging mobile device location data sources, such as call detail record (CDR), cellular network data, and social media location-based services, have also been used by researchers to study mobility behavior (5)(6)(7)(8)(9)(10)(11)(12)(13). Mobile device location data has proved to be a great asset for decision-makers amid the current COVID-19 pandemic. ...
Article
The research team has utilized privacy-protected mobile device location data, integrated with COVID-19 case data and census population data, to produce a COVID-19 impact analysis platform that can inform users about the effects of COVID-19 spread and government orders on mobility and social distancing. The platform is being updated daily, to continuously inform decision-makers about the impacts of COVID-19 on their communities, using an interactive analytical tool. The research team has processed anonymized mobile device location data to identify trips and produced a set of variables, including social distancing index, percentage of people staying at home, visits to work and non-work locations, out-of-town trips, and trip distance. The results are aggregated to county and state levels to protect privacy, and scaled to the entire population of each county and state. The research team is making their data and findings, which are updated daily and go back to January 1, 2020, for benchmarking, available to the public to help public officials make informed decisions. This paper presents a summary of the platform and describes the methodology used to process data and produce the platform metrics.
... Prior work has proposed approaches for identifying home and work locations that range from inspecting social graphs, to studying check-ins and precise geolocation data (a survey can be found in [74]). As certain users do not geotag their tweets, previous work has also tried to infer home locations based on tweet content [18], [46], [63] or other information like social ties [33], [17] or check-in behavior [40]. However, these studies only infer key locations at a very coarse granularity. ...
Conference Paper
Full-text available
The exposure of location data constitutes a significant privacy risk to users as it can lead to de-anonymization, the inference of sensitive information, and even physical threats. In this paper we present LPAuditor, a tool that conducts a comprehensive evaluation of the privacy loss caused by public location metadata. First, we demonstrate how our system can pinpoint users’ key locations at an unprecedented granularity by identifying their actual postal addresses. Our evaluation on Twitter data highlights the effectiveness of our techniques which outperform prior approaches by 18.9%-91.6% for homes and 8.7%-21.8% for workplaces. Next we present a novel exploration of automated private information inference that uncovers “sensitive” locations that users have visited (pertaining to health, religion, and sex/nightlife). We find that location metadata can provide additional context to tweets and thus lead to the exposure of private information that might not match the users’ intentions. We further explore the mismatch between user actions and information exposure and find that older versions of the official Twitter apps follow a privacy-invasive policy of including precise GPS coordinates in the metadata of tweets that users have geotagged at a coarse-grained level (e.g., city). The implications of this exposure are further exacerbated by our finding that users are considerably privacy-cautious in regards to exposing precise location data. When users can explicitly select what location data is published, there is a 94.6% reduction in tweets with GPS coordinates. As part of current efforts to give users more control over their data, LPAuditor can be adopted by major services and offered as an auditing tool that informs users about sensitive information they (indirectly) expose through location metadata.
Article
Nowadays, the explosive growth of personalized web applications and the rapid development of artificial intelligence technology have flourished the recent research on mobile user profiling, i.e., inferring the user profile from mobile behavioral data. Particularly, existing studies mainly follow the data-driven paradigm to develop feature engineering and representation learning on such data, which however suffer from the robustness issue, i.e., generalizing poorly across datasets and profiles without considering semantic knowledge therein. In comparison, the rising knowledge-driven paradigm built upon the knowledge graph (KG) offers a potential solution to mitigate such weakness. Therefore, in this paper, we propose a Knowledge Graph aided framework for Mobile User Profiling (KG-MUP). Specifically, to distil semantic knowledge among data, we firstly construct an urban knowledge graph (UrbanKG) with domain entities like users, regions, point of interests (POIs), etc. identified, as well as semantic relations for home, workplace, spatiality, etc. extracted. Moreover, we leverage tensor decomposition and graph neural network to obtain knowledgeable user representations from UrbanKG. In addition, we introduce several customized features to quantify individual mobility characteristics for mobile user profiling. Extensive experiments on three real-world mobility datasets demonstrate that KG-MUP achieves state-of-the-art performance on user profile inference tasks. Moreover, further results also reveal the importance of various semantic knowledge to user profile inference, which provides meaningful insights on user modeling with mobile behavioral data.
Article
Full-text available
The newly emerged machine learning (e.g., deep learning) methods have become a strong driving force to revolutionize a wide range of industries, such as smart healthcare, financial technology, and surveillance systems. Meanwhile, privacy has emerged as a big concern in this machine learning-based artificial intelligence era. This article is a comprehensive study on privacy preservation problems and machine learning. The survey covers three categories of interactions between privacy and machine learning: (i) private machine learning, (ii) machine learning-aided privacy protection, and (iii) machine learning-based privacy attack and corresponding protection schemes. The current research progress in each category is reviewed and the key challenges are identified. Finally, based on our in-depth analysis of the area of privacy and machine learning, we point out future research directions in this field.
Article
Full-text available
Group recommendation has attracted researchers’ attention in various domains, specifically such approaches utilizing location-based social networks (LBSNs). However, point of interest (POI) group recommendation faces the challenge of aggregating diverse user preferences, while group members have different influences on the final decision of the group. Besides, the recommendation of spatial items is different from non-spatial items and the unique features of the spatial items such as distance must be considered in the recommendation. In this paper, a POI group recommendation method is proposed to tackle this problem. User influence is modeled fuzzy and taken into account the difference of users’ personality and their preferences when are alone or in a group, by using historical check-in data in LBSNs and in terms of category, distance and time. The proposed method is integrated with the weighted average aggregation to improve the efficiency of the POI group recommendation. Experimental results in a real dataset show improvement in the accuracy of POI group recommendations in varying sizes of groups. The results also get better when the user influence is calculated using the fuzzy approach. Besides, studying user behavior differences to choose the place to visit when alone or in a group shows that i) the flexibility of users in distance is less than time and category. It is also in the category less than time. ii) Time has a greater range of behavioral change than distance and category. iii) Users who actively participate in group decision making have a more significant number of visits in groups than when they are alone.
Conference Paper
Full-text available
Performance evaluation of mobile networks needs accurate simulation set up including realistic characteristics. The most important issue in mobile networks' simulation is the mobility of the nodes. Since mobile nodes usually are carried by humans, thus, nodes mobility should be modelized as human movement. To the best of our knowledge, none of the existing mobility models have the ability to modelize all the human movement characteristics. In this paper, a new mobility model has been proposed based on the human mobility data collected for more than 6000 hours. The new model captures human mobility properties by introducing hotspot zones, using a graph of hotspot zones as the input area map, dividing day time to some periods and modeling various speeds in different times and spaces. Moreover, it models some other important human mobility features that had been modeled in previous works. To evaluate the performance of the proposed model, it is compared with real collected data.
Conference Paper
Full-text available
Accurately determining the geographic location of an Internet host is important for location-aware applications such as location-based advertising and network diagnostics. Despite their fast response time, widely used database-driven geolocation approaches provide only inaccurate locations. Delay measurement based approaches improve the estimation accuracy but still suffer from a limited precision (about 10 km) and a long response time (tens of seconds) to localize a single PC, which cannot meet the demand of precise and real-time geolocation for location-aware applications. In this paper, we propose a new geolocation approach, Checkin-Geo, which exploits geolocation resources fundamentally different from existing database-driven (using DNS, Whois, etc.) or network delay measurement based approaches. In particular, we leverage the location data that users are willing to share in location-sharing services and logs of user logins from PCs for real-time and accurate geolocation. Experimental results show that compared to existing geolocation techniques, Checkin-Geo achieves 1) a median estimation error of 799 meters (an order of magnitude smaller than existing approaches), and 2) a negligible response time, which are promising for accurate location-aware applications.
Conference Paper
Full-text available
The soaring adoption of location-based social networks (LBSNs) makes it possible to analyze human socio-spatial behaviors based on large-scale realistic data, which is important to both the research community and the design of new location-based social applications. However, performing direct measurements on LBSNs is impractical, because of the security mechanisms of existing LBSNs, and high time and resource costs. The problem is exacerbated by the scarcity of available LBSN datasets, which is mainly due to the privacy concerns and the hardness of distributing large-volume data. As a result, only a very few number of LBSN datasets are publicly released. In this paper, we extract and study the universal statistical features of three LBSN datasets, and propose LBSNSim, a trace-driven model for generating synthetic LBSN datasets capturing the properties of the original datasets. Our evaluation shows that LBSNSim provides an accurate representation of target LBSNs.
Conference Paper
In Location-Based Services (LBSs), mobile users submit location-related queries to an untrusted LBS server to get service. However, such queries increasingly raise privacy concerns among mobile users. To address this problem, we propose FGcloak, a novel fine-grained spatial cloaking scheme for privacy-aware mobile users in LBSs. Based on a novel use of a modified Hilbert curve in a particular area, our scheme effectively guarantees k-anonymity and at the same time provides a larger cloaking region. It also uses a parameter σ that gives users fine-grained control over the system overhead based on the resource constraints of their mobile devices. Security analysis and empirical evaluation results verify the effectiveness and efficiency of our scheme.
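The abstract above does not describe FGcloak's modified Hilbert curve or how σ is used, so the following is only a minimal sketch of the generic Hilbert-curve bucketing idea that such cloaking schemes build on, not FGcloak itself. The function names (`hilbert_index`, `cloak`), the grid size `n`, and the rule of merging a short last bucket into its neighbor are all assumptions made for illustration.

```python
from typing import Dict, Tuple

Cell = Tuple[int, int]           # (x, y) grid cell of a user
Box = Tuple[int, int, int, int]  # (min_x, min_y, max_x, max_y)


def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along a Hilbert curve on an n x n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented consistently.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def cloak(users: Dict[str, Cell], k: int, n: int) -> Dict[str, Box]:
    """Assign every user a bounding box shared with at least k - 1 other users."""
    if k < 1 or len(users) < k:
        raise ValueError("need at least k users to provide k-anonymity")
    # Sort users by Hilbert index; ties are broken by user id for determinism.
    ordered = sorted(users, key=lambda u: (hilbert_index(n, *users[u]), u))
    buckets = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    # Assumption: a short trailing bucket is merged into the previous one.
    if len(buckets) > 1 and len(buckets[-1]) < k:
        tail = buckets.pop()
        buckets[-1].extend(tail)
    regions: Dict[str, Box] = {}
    for bucket in buckets:
        xs = [users[u][0] for u in bucket]
        ys = [users[u][1] for u in bucket]
        box = (min(xs), min(ys), max(xs), max(ys))
        for u in bucket:
            regions[u] = box
    return regions


if __name__ == "__main__":
    demo = {"a": (0, 0), "b": (1, 0), "c": (7, 7), "d": (6, 7), "e": (3, 4)}
    for user, box in sorted(cloak(demo, k=2, n=8).items()):
        print(user, box)
```

Because neighboring positions on the Hilbert curve tend to be spatially close, consecutive buckets usually yield compact regions; a server that sees only the box cannot distinguish the querying user from the other members of the bucket.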
Article
User profiling is crucial to many online services. Several recent studies suggest that demographic attributes are predictable from different online behavioral data, such as users' "Likes" on Facebook, friendship relations, and the linguistic characteristics of tweets. But location check-ins, as a bridge of users' offline and online lives, have by and large been overlooked in inferring user profiles. In this paper, we investigate the predictive power of location check-ins for inferring users' demographics and propose a simple yet general location to profile (L2P) framework. More specifically, we extract rich semantics of users' check-ins in terms of spatiality, temporality, and location knowledge, where the location knowledge is enriched with semantics mined from heterogeneous domains including both online customer review sites and social networks. Additionally, tensor factorization is employed to draw out low dimensional representations of users' intrinsic check-in preferences considering the above factors. Meanwhile, the extracted features are used to train predictive models for inferring various demographic attributes. We collect a large dataset consisting of profiles of 159,530 verified users from an online social network. Extensive experimental results based upon this dataset validate that: 1) Location check-ins are diagnostic representations of a variety of demographic attributes, such as gender, age, education background, and marital status; 2) The proposed framework substantially outperforms compared models for profile inference in terms of various evaluation metrics, such as precision, recall, F-measure, and AUC.
Conference Paper
The rapid development of social networks has resulted in a proliferation of user-generated content (UGC). The UGC data, when properly analyzed, can be beneficial to many applications. For example, identifying a user's locations from microblogs is very important for effective location-based advertisement and recommendation. In this paper, we study the problem of identifying a user's locations from microblogs. This problem is rather challenging because the location information in a microblog is incomplete and we cannot get an accurate location from a local microblog. To address this challenge, we propose a global location identification method, called Glitter. Glitter combines multiple microblogs of a user and utilizes them to identify the user's locations. Glitter not only improves the quality of identifying a user's location but also supplements the location of a microblog so as to obtain an accurate location for it. To facilitate location identification, Glitter organizes points of interest (POIs) into a tree structure where leaf nodes are POIs and non-leaf nodes are segments of POIs, e.g., countries, states, cities, districts, and streets. Using the tree structure, Glitter first extracts candidate locations from each microblog of a user, which correspond to some tree nodes. Then Glitter aggregates these candidate locations and identifies the top-k locations of the user. Using the identified top-k user locations, Glitter refines the candidate locations and computes the top-k locations of each microblog. To achieve high recall, we enable fuzzy matching between locations and microblogs. We propose an incremental algorithm to support dynamic updates of microblogs. Experimental results on real-world datasets show that our method achieves high quality and good performance, and scales very well.
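The aggregate-then-refine flow described in the abstract above can be illustrated with a small sketch. The abstract does not give Glitter's scoring function, so everything here is an assumption: candidate locations are represented as root-to-node paths in the POI tree, each microblog splits one vote equally among its candidates, and refinement prefers the candidate that shares the longest path prefix with one of the user's top-k locations. The names `top_k_user_locations` and `refine` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# A location is a path from the root of the POI tree, e.g. ("US", "TX", "Austin").
Path = Tuple[str, ...]


def top_k_user_locations(posts: Sequence[Sequence[Path]], k: int) -> List[Path]:
    """Aggregate per-microblog candidates into the user's top-k locations.

    Assumed scoring: each microblog casts one vote, split evenly among its candidates.
    """
    scores: Dict[Path, float] = defaultdict(float)
    for candidates in posts:
        if not candidates:
            continue
        vote = 1.0 / len(candidates)
        for path in candidates:
            scores[path] += vote
    # Highest score first; ties are broken by path for determinism.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return ranked[:k]


def common_prefix_len(a: Path, b: Path) -> int:
    """Number of leading tree levels two paths share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def refine(candidates: Sequence[Path], user_top: Sequence[Path]) -> Path:
    """Pick the candidate closest (in the tree) to any of the user's top locations."""
    if not user_top:
        return candidates[0]
    return max(candidates,
               key=lambda c: max(common_prefix_len(c, t) for t in user_top))


if __name__ == "__main__":
    posts = [
        [("US", "CA", "San Francisco")],
        [("US", "CA", "San Francisco")],
        [("US", "TX", "Paris"), ("FR", "IDF", "Paris")],  # ambiguous mention
        [("US", "TX", "Austin")],
    ]
    top = top_k_user_locations(posts, k=2)
    print(top)
    print(refine(posts[2], top))
```

In this toy input the ambiguous "Paris" mention resolves to the Texas candidate because it shares two tree levels with one of the user's top locations, which mirrors the idea of using global evidence to complete a single microblog's incomplete location.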
Conference Paper
As the ubiquity of smartphones increases, we see an increase in the popularity of location-based services. Specifically, online social networks provide services such as alerting the user of friend co-location and finding a user's k nearest neighbors. Location information is sensitive, which makes privacy a strong concern for location-based systems like these. We have built one such service that allows two parties to share location information privately and securely. Our system allows every user to maintain and enforce their own policy. When one party (Alice) queries the location of another party (Bob), our system uses homomorphic encryption to test whether Alice is within Bob's policy. If she is, Bob's location is shared with Alice only. If she is not, no user location information is shared with anyone. Given the importance and sensitivity of location information, and the easily deployable design of our system, we offer a useful, practical, and important system to users. Our main contributions are a flexible, practical protocol for private proximity testing, a useful and efficient technique for representing location values, and a working implementation of the system we design in this paper. It is implemented as an Android application, with the Facebook online social network used for communication between users.