
TripImputor: Real-Time Imputing Taxi Trip Purpose Leveraging Multi-sourced Urban Data

IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS
Chao Chen, Shuhai Jiao, Shu Zhang, Weichen Liu, Member, IEEE, Liang Feng, and Yasha Wang
Abstract: Travel behavior understanding is a long-standing and critically important topic in the area of smart cities. Large volumes of GPS-based travel data can be collected easily, and taxi GPS trajectory data is a typical example. However, GPS trajectory data usually carries little information on travelers’ activities, so it can support only limited applications. Quite a few studies have focused on enriching the semantic meaning of raw data, such as inferring the travel mode or purpose. Unfortunately, trip purpose imputation has received relatively little attention, and existing methods do not require a real-time response. To narrow the gap, we propose a probabilistic two-phase framework named TripImputor for imputing taxi trip purposes in real time and recommending services to passengers at their dropoff points. Specifically, in the first phase, we propose a two-stage clustering algorithm to identify candidate activity areas (CAAs) in the urban space. Then, we extract fine-granularity spatial and temporal patterns of human behaviors inside the CAAs from Foursquare check-in data to approximate the prior probability for each activity, and compute the posterior probabilities (i.e., infer the trip purposes) using Bayes’ theorem. In the second phase, we adopt a procedure that clusters historical dropoff points and matches the dropoff clusters with CAAs to enable the real-time response. Finally, we evaluate the effectiveness and efficiency of the proposed two-phase framework using real-world data sets, which consist of the road network, check-in data generated by over 38000 users in one year, and large-scale taxi trip data generated by over 19000 taxis in a month in Manhattan, New York City, USA. Experimental results demonstrate that the system is able to infer the trip purpose accurately, and can provide recommendation results to passengers within 1.6 s on average in Manhattan, using just a single ordinary PC.
Index Terms: Travel behaviour, trip purpose, smart city, Bayes’ theorem, trajectory data mining.

Manuscript received March 27, 2017; revised July 18, 2017 and October 9, 2017; accepted November 2, 2017. This work was supported in part by the National Key Research and Development Project of China under Grant 2017YFB1002000, in part by the National Science Foundation of China under Grant 61602067 and Grant 71601024, in part by the Fundamental Research Funds for the Central Universities under Grant 106112017cdjxy180001, in part by the Chongqing Basic and Frontier Research Program under Grant cstc2015jcyjA00016, in part by the Open Research Fund Program of Shenzhen Key Laboratory of Spatial Smart Sensing and Services, Shenzhen University, and in part by the Ministry of Education in China Humanities and Social Sciences Youth Foundation under Grant 16yjc630169. The Associate Editor for this paper was K. Savla. (Corresponding author: Chao Chen.)
C. Chen, S. Jiao, and L. Feng are with the College of Computer Science, Chongqing University, Chongqing 400044, China (e-mail: ivanchao.chen@gmail.com; jiaoshuhai@gmail.com; brightfengs@gmail.com).
S. Zhang is with the School of Economics and Business Administration, Chongqing University, Chongqing 400044, China (e-mail: zhangshu@cqu.edu.cn).
W. Liu is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore (e-mail: liu@ntu.edu.sg).
Y. Wang is with the School of Electronics Engineering and Computer Science, Institute of Software, Peking University, Beijing 100871, China (e-mail: wangys@sei.pku.edu.cn).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TITS.2017.2771231
Index Terms—Travel behaviour, trip purpose, smart city, 33
Bayes’ theorem, trajectory data mining. 34
I. INTRODUCTION
Travel behavioural analysis is an important research topic [20]. In recent years, travel behaviour and patterns have become more complex than before, since modern cities are undergoing rapid urbanization [4], [8], [30]. It is well recognized that travel-related data is an important and valuable source for obtaining a holistic and in-depth understanding of travel behaviours. By analyzing such data, urban planners and policy makers can improve their ability to address urban planning, management, and operation issues [4]. Traditionally, travel-related data was mainly collected manually through paper-and-pencil interviews, computer-assisted telephone interviews, and computer-assisted self-interviews. All these methods suffer from several limitations, including high survey cost, heavy respondent burden, short time and space coverage, and underreported trips (inaccuracies) [33].
With the wide proliferation of location-aware devices in daily life, including smart phones and GPS-equipped vehicles, large volumes of time-stamped locational data of individuals have become easily available [38]. Such data contains a wealth of travel behavior information, such as when and where passengers move around the city at a reasonably high resolution, and sometimes which routes they take. For instance, a taxi trip log records the physical coordinates (longitudes and latitudes) and the exact times at which a passenger was picked up and dropped off, as well as the detailed sequence of roads traversed from the source to the destination. Consequently, experimenting with GPS-based data collection methods to supplement or replace the conventional ones is a hot trend. However, the collected GPS data is raw. In general, it lacks semantic information such as the transport mode taken or the activity types performed (travel purposes), i.e., how and why a passenger is moving, which is an essential component required for urban computing. Furthermore, compared to enriching the raw data with ‘how’ semantics,¹ existing methods for ‘why’ semantics are still far from accurate [12], [39]. Indeed, there exists a dilemma: trajectory data is rich thanks to emerging passive data collection technologies, but activity information is poor, although such activity information can directly help reveal the purpose of the trips [15]. Hence, this paper is an attempt to narrow the gap between the raw data and people’s activities, with a particular focus on analyzing taxi passengers’ trip purposes.

¹ Note that taxi GPS trajectory data contains the transport mode information explicitly.

1524-9050 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Trip purpose imputation² has been a long-standing research topic for over a decade [9], [13], [15], [16], [26], [41]. But previous studies have rarely addressed the following two issues: 1) Inferring the trip purpose at an individual level. More specifically, prior research mainly focuses on interpreting trip purposes at an aggregate level, e.g., the city scale, so only smart urban services at the macro level can be enabled. In contrast, to support micro-level smart urban services, such as recommendation services to each passenger according to his/her travel purpose, imputing the trip purpose at the individual level is necessary; 2) Requiring a real-time response, i.e., returning the corresponding purpose as soon as the trip ends. As a matter of fact, real-time recognition of passengers’ travel purposes not only makes it possible to understand what people intend to do, but also enables timely recommendation services to passengers. In this way, passengers can undertake and organize their daily activities more efficiently and economically. For example, it is often desirable that restaurant coupons and/or other discount information be delivered to the passenger as soon as he/she gets off the taxi, if he/she is predicted to go dining. To the best of our knowledge, no work has been reported in this regard. We would like to clarify that we infer the trip purposes after the drop-off point is revealed. This is because, on one hand, although taxi drivers may be aware of the destinations in advance, such information usually cannot be recorded by the embedded GPS systems automatically until the drivers push the passenger status button (from occupied to free) after arriving at the destinations. On the other hand, accurately predicting the destinations of taxi trips based on their partial trajectories is challenging and is a separate research problem in itself, which has received intensive attention from the academic community, e.g., [24], [25], [32], and [34].
To enable real-time taxi trip purpose imputation at the individual level, we need to address the following two challenges:
• Lack of ground truth. The ground truth of the travel purpose of each trip is usually collected by proactively prompted recall [27], where only a very small fraction of users are asked to annotate their traces with the activities that they have done. To make matters worse, the annotated ground truth is contaminated, since many users cannot correctly remember what they have done.
• Real-time response. On one hand, existing algorithms for inferring trip purposes cannot be applied directly, since they do not provide real-time responses. On the other hand, taxi trips are generated continuously and intensively over time, which makes the real-time response even more challenging.

² We use ‘inference’, ‘prediction’, and ‘imputation’ interchangeably throughout the manuscript.
In order to predict with high accuracy what activity a passenger intends to take after getting off the taxi, one should take the drop-off time, the drop-off location, and the nearby geographical context [23] into account. To be more specific, the distribution of different activities that people commonly take (i.e., human behaviours) in the area near the drop-off point at the drop-off time is a useful reference. Fortunately, check-in data, which is left by users when checking in at points of interest (POIs) using location-based social networks (LBSNs) like Foursquare, contains a detailed description of the POIs (e.g., the hierarchical category, the opening hours) [6], [35]. With the check-in information, it is not difficult to understand the passengers’ travel activities as well as the activity distribution in an area during a given time period [19], [29], [41]. For instance, people visit a restaurant to have food and visit a shopping mall to shop. Thus, the problem of trip purpose inference is transformed into the problem of predicting the probabilities of visiting different POI categories once the passenger gets off the taxi.
With the research objectives and challenges discussed above, the main contributions of the paper are:
1) We define a new problem that extends the existing travel purpose inference problem by requiring a real-time response, in order to recommend timely and accurate services to passengers.
2) We propose a novel two-phase framework based on Bayes’ theorem, called TripImputor, to tackle the real-time taxi trip purpose imputation problem. In Phase I, we first propose a two-stage clustering algorithm to aggregate POIs. We identify urban activity regions (UARs), which are bounded and separated by physical barriers, using road network data (Stage 1). For each UAR, with the passenger’s drop-off location and alighting time as input, we identify candidate activity areas (CAAs) based on POI data (Stage 2). Then, we extract fine-granularity spatial and temporal patterns of human behaviours inside the CAAs from Foursquare check-in data to approximate the prior probability for each activity, and compute the posterior probabilities using Bayes’ theorem. In Phase II, to enable the real-time response, after analyzing the computational bottleneck of the first phase, we propose a procedure that includes the clustering of historical drop-off points and the matching between drop-off clusters and CAAs to reduce the online computation time.
3) We conduct extensive evaluations of the effectiveness and efficiency of TripImputor using real-world datasets, which consist of the road network data, the Foursquare check-in data generated by over 38,000 users in one year, and the taxi GPS trajectory data generated by over 19,000 taxis in a month in Manhattan, NYC. Due to the lack of ground truth for each taxi trip, we evaluate the effectiveness indirectly by comparing against travel survey data in the statistical sense at the regional scale, instead of calculating the prediction accuracy for each trip individually. Experimental results show that TripImputor achieves the best prediction accuracy compared to the other two baselines. The average response time for each taxi trip is about 1.588 seconds. The quickest response time is 40 milliseconds, and the longest response time is 7.54 seconds, which is still acceptable for practical applications.
The rest of the paper is organized as follows. In Section II, we review the related work and show how this paper differs from prior research. In Section III, we introduce several basic concepts and present the problem formulation. We present a detailed discussion of our two-phase framework in Section IV and Section V, respectively. We evaluate the performance of the proposed framework in Section VI. Finally, we conclude the paper and discuss future research directions in Section VII.
II. RELATED WORK
A. Semantic Trajectory Enrichment
The passive collection of large-scale time-stamped locational data (trajectory data) has become feasible, both technically and economically, with the rapid development of mobile localization technologies. The data come from many sources, e.g., call detail records from mobile phone users, smart card data from travellers, GPS tracking of private/public vehicles, and so on. The recorded locations have varied formats and resolutions. For instance, GPS-based trajectory data records the physical coordinates of the moving objects, whereas smart card data records the location as a stop name. Besides, some of the data contain the travel mode information explicitly. But an explicit understanding of the individual’s intention in making the trip is generally lacking. In other words, while such unlabelled data is available, the semantic label of travel purpose is missing.
Extracting high-level semantics from raw data and using them to better understand the underlying meaningful movement behaviors (e.g., why people move) have attracted many researchers’ attention [22]. Quite a few techniques have been applied to interpret travel purposes in terms of travel activities after the trip. The techniques mainly include deterministic and heuristic rules, machine learning based approaches, and statistical data mining algorithms [9], [13], [15], [16], [26], [33], [41]. To name a few, Wolf [33] proposed using a set of deterministic rules, coupled with land use data, to derive the trip purpose. Deng and Ji [9] built a decision tree for trip purpose inference, combining other information provided by GIS data and respondents’ socio-demographics. By modelling the probability of points of interest being visited using Bayes’ rule, Gong et al. [15] inferred the travel purposes of taxi trips. Although many approaches have been developed to enrich raw trajectories with semantic meaning, prior work never requires a timely response when inferring the trip purpose, so recommendation services cannot be supported.
B. Check-In Data and Taxi Trajectory Data Mining
Check-in data and taxi trajectory data have been mined to support various smart urban applications and have attracted much attention from researchers in recent years. For example, knowledge hidden in check-in data has been mined to support (personalized) landmark recommendation/search, suggestion of frequently associated POI sequences, understanding of the heat map of landmark popularity at different times, and so on [6], [35].
Information mined from taxi trajectory data can benefit a number of parties, including taxi drivers, passengers, and city planners. Studies for taxi drivers, who are mostly interested in making more money while minimizing the fuel cost, include [10], [14]. Work on recommending the best corner to catch taxis, real-time ordering of free taxis, and taxi fee estimation aims to improve the experience of passengers, e.g., [1]. An interesting work detected anomalous taxi rides and warned the passengers “on-the-fly” that they were being taken on an unnecessary detour [5]. For city planners, taxi trajectory data provides a rich data source to identify flaws in city planning, probe traffic conditions, estimate travel demands, infer land-use efficiency, suggest bus routes, etc. [2]. Recent studies also combine taxi trajectory data with other data sources such as POI data, Foursquare check-in data, and Flickr image data to enable smarter applications, such as inferring building functions, personalized travel route planning, hitchhiking package deliveries, and so on [6], [7], [36]. However, to the best of our knowledge, ours is the first study on inferring trip purposes in real time, leveraging the complementary knowledge embedded in the multi-sourced urban data.
III. BASIC CONCEPTS AND PROBLEM STATEMENT
A. Basic Concepts
Definition 1 (Road Network): A road network is a graph G(N, E), consisting of a node set N and an edge set E, where each element n in N is an intersection with a pair of longitude and latitude coordinates (x, y) representing its spatial location. The edge set E is a subset of the Cartesian product N × N. Each element e(u, v) in E is a street connecting node u and node v, which has several attributes including speed limit, number of lanes, and street level.³
Definition 2 (A Taxi Drop-Off Point): A taxi drop-off point (p_i) is defined as a time-stamped location where the passenger was dropped off, denoted by ((x_i, y_i), t_i).
Definition 3 (POI Category): A POI category is a semantic label for a place, indicating the correlation between the place and potential human activities.
Foursquare maintains a three-level ontology structure for category description [6]. The first level has 9 categories in total. The second and third levels have 412 sub-/sub-subcategories in total. Table I shows the trip purposes (travel activities) and the corresponding primary POI categories [15].
Definition 4 (A Check-In): A check-in is represented by a triple ck = (u_id, v_id, t_i), indicating that a user with id u_id checked in at a venue (i.e., POI) with id v_id at time t_i using Foursquare.
In general, a POI (venue) that is frequently checked in by many users is popular and attractive. In addition, Foursquare provides the physical coordinates, tags, and opening-hours information of any given venue.
³ The road network can be crawled from an open crowdsourced platform, i.e., OpenStreetMap. Refer to www.openstreetmap.org for more details.

TABLE I
NINE TRIP PURPOSES AND THE CORRESPONDING PRIMARY POI CATEGORIES
Definition 5 (Response Time): The response time is defined as the time difference between the drop-off time, when the passenger gets off the taxi, and the time when the passenger receives the recommendation services.
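As an illustration only, Definitions 1, 2, 4, and 5 can be written down as plain data structures. The following is a minimal sketch, not the authors' implementation: the class names, field names, and the `add_street` helper are hypothetical and chosen simply to mirror the notation above.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class DropOffPoint:
    """Definition 2: a time-stamped drop-off location ((x, y), t)."""
    x: float  # longitude
    y: float  # latitude
    t: datetime


@dataclass(frozen=True)
class CheckIn:
    """Definition 4: user u_id checked in at venue v_id at time t."""
    u_id: str
    v_id: str
    t: datetime


@dataclass
class RoadNetwork:
    """Definition 1: graph G(N, E) with attributed edges (hypothetical layout)."""
    nodes: dict = field(default_factory=dict)  # node id -> (x, y)
    edges: dict = field(default_factory=dict)  # (u, v) -> attribute dict

    def add_street(self, u, v, **attrs):
        # Both endpoints must already be intersections in N.
        if u not in self.nodes or v not in self.nodes:
            raise KeyError("both endpoints must be known nodes")
        self.edges[(u, v)] = attrs


def response_time_seconds(drop_off: DropOffPoint, delivered_at: datetime) -> float:
    """Definition 5: delay between drop-off and delivery of the recommendation."""
    return (delivered_at - drop_off.t).total_seconds()
```

For example, a recommendation delivered 1.588 seconds after the drop-off time stored in a `DropOffPoint` yields `response_time_seconds(...) == 1.588`.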
B. Problem Statement
Inferring taxi trip purposes by leveraging multi-sourced urban data can be viewed as predicting the probabilities of taking each of the nine activities, which can be formulated as follows.
Given:
1) A drop-off point ((x_r, y_r), t_r), which is generated in real time;
2) A set of historical check-ins {(u_id, v_id, t_i)} (e.g., from the last month), together with check-ins accumulated several hours before t_r in the designated city;
3) POIs in the designated city, which can be obtained from the check-in data;
4) A road network G(N, E) of the designated city.
Predict the probabilities of taking each of the nine activities for the drop-off point (the objective of Phase I), and provide timely service recommendations related to the top-ranked trip purposes (activities with the top probabilities) for the passenger (the objective of Phase II).
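The prediction step can be pictured as a Bayes update over a finite set of activities followed by a ranking. The sketch below is a generic illustration under stated assumptions: the prior is taken as the relative frequency of activity labels among check-ins, the likelihood is supplied by the caller, and the activity names in any usage are placeholders. The concrete prior and likelihood used by TripImputor are not specified in this section, so none of the function names below should be read as the authors' method.

```python
def prior_from_checkins(categories):
    """Relative frequency of each activity label in a list of check-in labels."""
    counts = {}
    for c in categories:
        counts[c] = counts.get(c, 0) + 1
    total = len(categories)
    return {c: n / total for c, n in counts.items()}


def posterior(prior, likelihood):
    """Bayes' theorem over a finite set of activities.

    prior[a]      -- P(a), e.g. estimated from check-in counts (assumption)
    likelihood[a] -- P(observation | a), supplied by the caller (hypothetical)
    Returns P(a | observation) for every activity a.
    """
    unnormalized = {a: prior[a] * likelihood.get(a, 0.0) for a in prior}
    evidence = sum(unnormalized.values())
    if evidence == 0:
        raise ValueError("observation has zero probability under the prior")
    return {a: w / evidence for a, w in unnormalized.items()}


def top_k(probabilities, k=3):
    """Activities ranked by descending probability."""
    return sorted(probabilities, key=probabilities.get, reverse=True)[:k]
```

With a prior of 0.5/0.25/0.25 over three placeholder activities and likelihoods of 0.2/0.8/0.4, the posterior becomes 0.25/0.5/0.25, so the second activity is ranked first.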
IV. PHASE I: IMPUTING TRIP PURPOSES
A. Urban Activity Region Identification
Human beings are social by nature (i.e., most people live and work together with others), so it is highly likely that people take activities in a small and scattered fraction of the whole city space. A preliminary step for inferring the travel purpose of passengers is to identify all the scattered activity regions in the whole urban space. For ease of presentation, we call these regions Urban Activity Regions (UARs).
Urban activity regions are bounded and separated by physical barriers such as main roads, rivers, and mountains, as can be witnessed in the history of human civilization and urbanization [28], [40]. Each UAR is isolated and bounded by main road segments (or rivers), covering several neighborhoods and narrow streets. Inside each UAR, passengers can easily travel between two points if they are located close to each other. Usually, passengers who get off taxis on one side of a primary road will not cross it (i.e., go to the other side) to take activities, because the road is a significant barrier. In contrast, when getting off taxis on small and narrow streets, passengers can easily walk in any direction. Based on the above observations, in this paper, we mainly rely on road network data to identify the UARs in the target city. We propose a two-step procedure to divide the whole city into a number of disjoint UARs.
Fig. 1. Illustrative example of determining the region that a given POI belongs to (top left); illustrative examples of assigning a huge number of POIs to regions (top right and bottom left); the identified CAAs for the illustrative example (bottom right).
• Step 1: We extract the road network data, including the coordinates of nodes, the edges, and the attributes of edges (e.g., number of lanes, speed limits, road levels/types), from an open crowdsourced platform, i.e., OpenStreetMap. Using the road level/type attributes, we keep only the high-level road segments, i.e., those tagged as ‘motorway’, ‘trunk’, or ‘primary’.
• Step 2: For the trimmed road network consisting only of high-level road segments, we apply the image-processing-based map segmentation algorithm in [37] to obtain connected components. Each connected component is a separated urban activity region (UAR; R1 to R5 in Fig. 1).
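The two steps can be sketched as follows. This is a simplified illustration and not the segmentation algorithm of [37]: Step 1 filters edges on the OpenStreetMap `highway` tag, and Step 2 is replaced by a plain 4-connected flood fill over a small raster in which road cells are marked 1. The function names and the raster encoding are assumptions made for the example.

```python
HIGH_LEVEL = {"motorway", "trunk", "primary"}


def trim_network(edges):
    """Step 1: keep only edges whose road type is high-level.

    edges: iterable of (u, v, attrs), where attrs["highway"] holds the
    road type as tagged in OpenStreetMap.
    """
    return [(u, v, a) for u, v, a in edges if a.get("highway") in HIGH_LEVEL]


def label_regions(grid):
    """Step 2 stand-in: label 4-connected groups of non-road cells.

    grid[r][c] is 1 where a high-level road was rasterized and 0 elsewhere.
    Returns (labels, count): labels uses 0 for road cells and 1..count
    for region ids.
    """
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1 or labels[r][c] != 0:
                continue
            count += 1
            labels[r][c] = count
            stack = [(r, c)]
            while stack:  # iterative flood fill from the seed cell
                i, j = stack.pop()
                for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                    if (0 <= ni < rows and 0 <= nj < cols
                            and grid[ni][nj] == 0 and labels[ni][nj] == 0):
                        labels[ni][nj] = count
                        stack.append((ni, nj))
    return labels, count
```

On a 4 × 3 raster where one vertical and one horizontal road form a "T", the flood fill yields three regions, which play the role of UARs in this toy setting.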
B. Candidate Activity Area Identification
It is well known that POIs are the most common activity unit for human beings. When people travel by taxi, on one hand, they usually prefer to get off as close to the true destination as possible. On the other hand, in a modern city, many different categories of POIs are often located in the same building (e.g., a shopping mall). In this respect, people are more likely to be attracted by the nearest one or two buildings after getting off taxis. Hence, we propose the concept of a candidate activity area (CAA), in which different POIs are located close to each other. The CAAs correspond to small areas, and we use the CAA as the activity unit for taxi passengers. To identify the CAAs, we first determine which UAR a given POI belongs to. Then, we aggregate the POIs belonging to the same UAR into several clusters based on spatial proximity. Finally, we identify each cluster as a CAA. In this sense, a UAR contains several CAAs. However, the assignment of POIs to UARs is quite challenging since we have to address the following two issues:
1) Each UAR is usually of an arbitrary shape, so we cannot simply compare the POI locations to the locations of the UAR boundaries. A simpler but essential problem is the point-in-polygon problem [31]. More specifically, it is the problem of determining whether a given point is inside or outside a given closed polygon (i.e., region), which is proved to be hard [17].
2) The number of POIs is huge (e.g., the number of POIs in Manhattan, NYC is more than 10k), and how to efficiently determine which UAR each POI is located in is also a challenging issue.
Algorithm 1 Algorithm for Determining the Region That a Given POI Belongs to
Input: a given POI (p_i); the trimmed road network and the identified UARs in the target city;
Output: the UAR in which the given point is located, denoted by R_i = PinR(p_i).
Step 1: Based on the location of the given point (p_i), find its nearest node n_i;
Step 2: According to the identified node n_i and the topology of the high-level road network, identify all the regions that share the node n_i. We denote these regions by {R_i};
Step 3.a: For each region in the set {R_i}, apply the ifPinR algorithm to check whether p_i is inside that region;
Step 3.b: The loop ends when ifPinR returns 1.
Without loss of generality, to deal with the first issue, we apply a popular and mature algorithm to determine the relationship (i.e., inside or outside) between a given point and a given region [18]. For simplicity, we denote the algorithm as ifPinR(p_i, r_i). If the point p_i is inside region r_i, ifPinR(p_i, r_i) returns 1; otherwise, it returns 0. To determine which region a given point belongs to, we propose an algorithm that calls ifPinR repeatedly. The pseudo-code of this algorithm is presented in Algorithm 1. For the given point, Step 1 and Step 2 identify all the possible regions that it may belong to, according to the geometrical relationship in the space. Note that a region is represented by a sequence of nodes in the clockwise direction. For instance, the possible regions for p_i in the illustrative example (as shown in the top left of Fig. 1) are marked as R1, R2, and R3. Step 3 shows the repeated calls to ifPinR. The number of loops is usually small since the possible region set contains only a few regions. In the best case, the number of loops is 1, while in the worst case, the number of loops equals the size of the possible region set. The loop number is 1 for the illustrative example since ifPinR returns 1 when checking R1 in the first loop.
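A minimal sketch of Algorithm 1 is given below. The specific point-in-polygon test of [18] is not described in this section, so the sketch substitutes the standard ray-casting (even-odd) test; the function names `if_p_in_r` and `p_in_r`, the brute-force nearest-node search, and the dictionary layout are assumptions for illustration. Points lying exactly on a boundary are not treated specially.

```python
def if_p_in_r(point, polygon):
    """Ray-casting test: return 1 if point is inside polygon, else 0.

    polygon is a list of (x, y) vertices in order (either orientation).
    """
    x, y = point
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return 1 if inside else 0


def p_in_r(point, nodes, regions_by_node, regions):
    """Algorithm 1: find the region that contains point.

    nodes: node id -> (x, y)
    regions_by_node: node id -> list of ids of regions sharing that node
    regions: region id -> polygon vertex list
    Returns a region id, or None if no candidate region contains the point.
    """
    # Step 1: nearest node (brute force here; an index would be used at scale)
    nearest = min(
        nodes,
        key=lambda n: (nodes[n][0] - point[0]) ** 2 + (nodes[n][1] - point[1]) ** 2,
    )
    # Steps 2 and 3: test only regions sharing that node; stop at the first hit
    for rid in regions_by_node[nearest]:
        if if_p_in_r(point, regions[rid]) == 1:
            return rid
    return None
```

Restricting the test to the regions that share the nearest node is what keeps the loop count small; the fallback `None` covers the case where that heuristic misses.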
To deal with the second issue, a straightforward but computationally expensive method is to check each POI using Algorithm 1. In theory, the computational complexity is O(N × M × C), where N is the number of POIs, M is the average number of possible regions for a given POI, which is usually small, and O(C) is the complexity of the ifPinR algorithm. Therefore, in order to accelerate the computation, we should reduce the number of POIs to be checked. Actually, it is unnecessary to check some POIs. More specifically, once we have determined the region in which a given POI is located, we can directly infer with high confidence that its ‘nearby’ POIs are also located inside the same region. Inspired by this observation, we propose a novel and efficient algorithm to determine the regions of the POIs. Briefly speaking, the algorithm mainly consists of random POI selection, determination of the region containing a point, and cell growing, as illustrated in Algorithm 2.
Algorithm 2 Algorithm for Determining Regions That a Huge Number of POIs Belong to
Input: a pool of POIs ($\{p_i\}$) and a set of UARs ($\{R_i\}$) in the target city;
Output: $\{R_i\} = \mathrm{PinR}(\{p_i\})$.
Step 1: Randomly select a POI from $\{p_i\}$ (e.g., $p_s$);
Step 1.1: $R_s = \mathrm{PinR}(p_s)$;
Step 2: Take $p_s$ as the center, get a grid cell with equal width and length ($g_0$);
Step 2.1: $g_i = g_0$;
Step 3: If $g_i$ has no intersection with the boundary of $R_s$, then
Step 3.1: Identify all POIs inside the grid based on the geometric relationship (denoted by $P_{sub}(g_i)$);
Step 3.2: $P_{sub}(g_i)$ should all be located in $R_s$;
Step 3.3: $\{p_i\} = \{p_i\} - P_{sub}(g_i)$;
Step 3.4: Increase the grid cell size by 50% ($g_{i+1} = 1.5 \times g_i$) and re-check Step 3;
Step 4: Repeat Steps 1–3 until $\{p_i\}$ is empty.
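The cell-growing idea of Algorithm 2 can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions that are not in the paper: regions are axis-aligned rectangles `(xmin, ymin, xmax, ymax)` rather than road-network polygons, `pin_r` is a stand-in for Algorithm 1, and the names `label_pois`, `cell_inside` and the initial cell width `g0` are hypothetical.

```python
import random

def pin_r(poi, regions):
    """Stand-in for Algorithm 1: index of the rectangle containing poi, else None."""
    x, y = poi
    for idx, (x0, y0, x1, y1) in enumerate(regions):
        if x0 <= x <= x1 and y0 <= y <= y1:
            return idx
    return None

def cell_inside(center, half, rect):
    """True if the square cell around center does not cross the boundary of rect."""
    (cx, cy), (x0, y0, x1, y1) = center, rect
    return x0 <= cx - half and cx + half <= x1 and y0 <= cy - half and cy + half <= y1

def label_pois(pois, regions, g0=10.0, seed=0):
    """Label every POI with a region index; also count explicit pin_r checks."""
    rng = random.Random(seed)
    pool = set(range(len(pois)))
    labels, checks = {}, 0
    while pool:                                    # Step 4: until the pool is empty
        s = rng.choice(sorted(pool))               # Step 1: random POI
        labels[s] = pin_r(pois[s], regions)        # Step 1.1: explicit region check
        checks += 1
        pool.discard(s)
        if labels[s] is None:
            continue
        (cx, cy), half = pois[s], g0 / 2.0         # Step 2: initial cell around p_s
        # Step 3: grow the cell while it stays strictly within the region.
        while pool and cell_inside((cx, cy), half, regions[labels[s]]):
            inside = [i for i in pool
                      if abs(pois[i][0] - cx) <= half and abs(pois[i][1] - cy) <= half]
            for i in inside:
                labels[i] = labels[s]              # same region, no check needed
            pool.difference_update(inside)
            half *= 1.5                            # grow the cell by 50%
    return labels, checks
```

The returned `checks` counter makes the saving visible: it counts only the POIs that had to go through the explicit point-in-region test, while all other POIs inherit the label of the cell they fall in.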
In the first step, we randomly pick a POI from the pool and determine which region it belongs to (Step 1.1) based on Algorithm 1. In the second step, we construct a grid cell with the selected POI as the center. Fig. 1 (top right) demonstrates the result after the first two steps. If the grid cell does not intersect the region boundaries, all POIs inside the cell must be located in the same region as the selected POI (Steps 3.1 and 3.2). Thus, there is no need to check those POIs, and we can remove them from the POI pool directly (Step 3.3). To further increase the number of POIs that need no check, the grid cell then grows to contain more POIs (Step 3.4), as demonstrated in Fig. 1 (bottom left). Once the grid cell ($g_i$) crosses the region boundary, the algorithm restarts the whole procedure from the first step by randomly selecting a new POI. The process terminates when there is no POI left in the
IEEE Proof
6IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS
pool (Step 4). Finally, each POI is associated with the label of the region that it belongs to.
For POIs inside the same UAR (i.e., POIs with the same region label), we apply the popular DBSCAN algorithm to obtain clusters, since it can identify clusters of different densities and shapes [11]. POIs that are close to each other and within the same UAR are identified as a Candidate Activity Area (CAA). In contrast, as demonstrated in Fig. 1 (bottom right), POIs located in different UARs are grouped into different CAAs, even if they are close to each other.
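The second clustering stage can be illustrated as follows. The sketch runs a minimal, pure-Python DBSCAN separately inside each UAR, so that nearby POIs with different region labels never end up in the same CAA. The function names and the parameter values `eps` and `min_pts` are assumptions made for illustration; the parameters actually used are not given in this section.

```python
from collections import defaultdict
from math import hypot

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one cluster id per point (-1 marks noise)."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j, q in enumerate(points)
                if hypot(points[i][0] - q[0], points[i][1] - q[1]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nb = neighbours(i)
        if len(nb) < min_pts:
            labels[i] = -1                 # provisional noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # noise becomes a border point
            if labels[j] is not None:
                continue                   # already assigned, do not expand
            labels[j] = cluster
            nb_j = neighbours(j)
            if len(nb_j) >= min_pts:       # core point: expand the cluster
                queue.extend(nb_j)
    return labels

def find_caas(pois, region_of, eps=50.0, min_pts=2):
    """Cluster POIs region by region; each cluster is one candidate activity area."""
    by_region = defaultdict(list)
    for idx, region in enumerate(region_of):
        by_region[region].append(idx)
    caas = []
    for region, members in sorted(by_region.items()):
        labels = dbscan([pois[i] for i in members], eps, min_pts)
        groups = defaultdict(list)
        for i, lab in zip(members, labels):
            if lab != -1:
                groups[lab].append(i)
        caas.extend((region, sorted(g)) for g in groups.values())
    return caas
```

In the example below, POIs 1 and 2 are only 10 units apart but carry different region labels, so they fall into different CAAs:

```python
pois = [(0, 0), (10, 0), (20, 0), (30, 0)]
print(find_caas(pois, region_of=[0, 0, 1, 1], eps=15, min_pts=2))
```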
Remark: Although the clustering and identification of CAAs can be done offline, accelerating the procedure is still beneficial, since the city contains a huge number of POIs and dozens of regions. Moreover, POIs in the city are dynamic: some disappear while others emerge, so the CAAs need to be updated regularly. Thus, an efficient algorithm for clustering and identifying CAAs is desirable.
C. Trip Purpose Imputation
The objective of trip purpose imputation is to predict the POI category that the passenger intends to visit after getting off the taxi, given the drop-off location and the drop-off time. We denote the drop-off information of the passenger by $((x, y), t)$. To infer the trip purpose correctly, several factors need to be considered. The first is the distance from the passenger’s final destination to the drop-off location. The closer a POI is to the drop-off point, the more likely it is to be visited, since taxis offer door-to-door services and most passengers prefer to get off as close as possible to their final destination. The second factor is the distribution of POI categories near the drop-off point. For a trip heading to an area mostly covered by restaurants, the trip purpose is probably dining. Last but not least, the alighting time is also vital, as people engage in different activities at different times.
To integrate the above three factors, we take the following three major steps. First, given the location of the drop-off point, we select the top-$k$ nearest CAAs within walkable distance (e.g., 500 meters). We note that passengers visit the top-$k$ CAAs with different probabilities: the closer a CAA is to the drop-off point, the higher the probability that it will be visited, which exhibits the distance decay effect. Specifically, the probability that a CAA will be visited is determined by Eq. 1.
$$P(\mathrm{CAA}_i \mid (x, y)) \propto (d_i)^{-\beta} \quad \text{s.t.} \quad \sum_{i=1}^{k} P(\mathrm{CAA}_i \mid (x, y)) = 1 \tag{1}$$
where $d_i$ refers to the Euclidean distance from the center of $\mathrm{CAA}_i$ to the drop-off point $(x, y)$ of the passenger; $k$ is the number of nearby CAAs considered, which is set to 3 in our study; and $\beta$ is the distance decay parameter. We set $\beta = 1.5$, which is consistent with existing findings in [6] and [20].
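As a small worked illustration of Eq. 1, the following sketch selects the top-$k$ CAAs within walking distance and normalises the distance-decay weights $d_i^{-\beta}$. The function name `caa_visit_probs` and the clamping of very small distances to 1 metre (to avoid division by zero) are assumptions added here, not part of the paper.

```python
from math import hypot

def caa_visit_probs(drop_off, caa_centers, k=3, beta=1.5, max_walk=500.0):
    """Eq. 1: P(CAA_i | (x, y)) proportional to d_i ** -beta over the top-k CAAs."""
    x, y = drop_off
    dists = sorted((hypot(cx - x, cy - y), idx)
                   for idx, (cx, cy) in enumerate(caa_centers))
    nearest = [(d, idx) for d, idx in dists if d <= max_walk][:k]
    # Assumption: clamp distances below 1 m so that d ** -beta stays finite.
    weights = {idx: max(d, 1.0) ** -beta for d, idx in nearest}
    total = sum(weights.values())
    return {idx: w / total for idx, w in weights.items()}
```

For CAA centers at 100 m, 200 m and 400 m from the drop-off point, the weights are $100^{-1.5}$, $200^{-1.5}$ and $400^{-1.5}$, which normalise to roughly 0.68, 0.24 and 0.08; a fourth CAA at 900 m is outside the 500 m walkable distance and is ignored.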
Second, even after the visited CAA has been determined, predicting the POI category for the passenger is still challenging, because the CAA contains different POIs, each with its own category and visiting popularity [15]. To alleviate the issue, inside a determined CAA (e.g., $\mathrm{CAA}_i$), we compute the probability of visiting each POI category (i.e., taking each activity) based on Bayes’ theorem [21], as shown in Eqs. 2 and 3.
$$P(a_j \mid (x, y), t, \mathrm{CAA}_i) = \frac{P((x, y) \mid a_j, t, \mathrm{CAA}_i) \times P(a_j \mid t, \mathrm{CAA}_i) \times P(t, \mathrm{CAA}_i)}{P((x, y), t, \mathrm{CAA}_i)} \tag{2}$$
$$P((x, y), t, \mathrm{CAA}_i) = \sum_{j=1}^{n} P((x, y) \mid a_j, t, \mathrm{CAA}_i) \times P(a_j \mid t, \mathrm{CAA}_i) \times P(t, \mathrm{CAA}_i) \tag{3}$$
Here, $n$ is the total number of activities considered in the paper, and $P((x, y) \mid a_j, t, \mathrm{CAA}_i)$ represents the probability that a passenger gets off the taxi at location $(x, y)$ if he/she has decided to take activity $a_j$ at $\mathrm{CAA}_i$ at time $t$. Gong et al. [15] simply assume that the location and the time of the drop-off point are conditionally independent given the activity type ($a_j$), i.e., that the following equation holds.
$$P((x, y) \mid a_j, t, \mathrm{CAA}_i) = P((x, y) \mid a_j, \mathrm{CAA}_i) \tag{4}$$
However, we argue that Eq. 4 does not hold in most cases, since where passengers choose to get off depends not only on the nearby land use (i.e., the spatial context) [9], [33], but also on the alighting time. For example, passengers may get off near a shopping plaza to shop, whereas in the evening they might get off in a business district to have a meal. In other words, the location and the time of the drop-off point are interdependent. Here, we approximate the true value of $P((x, y) \mid a_j, t, \mathrm{CAA}_i)$ by jointly considering the attractiveness of the CAA and its distribution of POIs over categories, as shown in Eq. 5.
$$P((x, y) \mid a_j, t, \mathrm{CAA}_i) \propto \frac{\mathrm{numberofPOIs}(a_j, \mathrm{CAA}_i)}{\mathrm{numberofPOIs}(\mathrm{CAA}_i)} \times A_i(t) \quad \text{s.t.} \quad \sum_{j=1}^{n} P((x, y) \mid a_j, t, \mathrm{CAA}_i) = 1 \tag{5}$$
In Eq. 5, $\mathrm{numberofPOIs}(\mathrm{CAA}_i)$ and $\mathrm{numberofPOIs}(a_j, \mathrm{CAA}_i)$ refer to the number of POIs and the number of POIs related to $a_j$ within $\mathrm{CAA}_i$, respectively; $A_i(t)$ refers to the attractiveness of $\mathrm{CAA}_i$ in the given time slot, measured by the popularity of $\mathrm{CAA}_i$ at that time relative to the other CAAs in the top-$k$ list. In more detail, we calculate $A_i(t)$ by dividing the number of check-ins of $\mathrm{CAA}_i$ by the total number of check-ins of all top-$k$ CAAs during the given time slot over the historical days (e.g., the last month), as shown in Eq. 6. Note that the check-ins and categories of POIs are easy to extract from the Foursquare platform.
$$A_i(t) = \frac{\mathrm{numberofCheckins}(\mathrm{CAA}_i, t, days)}{\sum_{i'=1}^{k} \mathrm{numberofCheckins}(\mathrm{CAA}_{i'}, t, days)} \tag{6}$$
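Eqs. 5 and 6 reduce to simple ratios of counts. The following sketch assumes the counts have already been aggregated into dictionaries; the function names and the dictionary layout are hypothetical and serve only to make the arithmetic concrete.

```python
def attractiveness(checkins_by_caa):
    """Eq. 6: share of the slot's historical check-ins received by each top-k CAA."""
    total = sum(checkins_by_caa.values())
    if total == 0:
        return {}
    return {caa: n / total for caa, n in checkins_by_caa.items()}

def drop_off_likelihood(poi_counts, a_t):
    """Eq. 5 before normalisation: (POIs of category a_j / all POIs) * A_i(t)."""
    n_pois = sum(poi_counts.values())
    return {cat: n / n_pois * a_t for cat, n in poi_counts.items()}
```

For instance, if two top-$k$ CAAs received 30 and 10 check-ins in the time slot, their attractiveness values are 0.75 and 0.25. A CAA with 6 dining POIs and 2 shopping POIs and $A_i(t) = 0.75$ then yields unnormalised values of 0.5625 for dining and 0.1875 for shopping.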
IEEE Proof
CHEN et al.: TRIPIMPUTOR: REAL-TIME IMPUTING TAXI TRIP PURPOSE LEVERAGING MULTI-SOURCED URBAN DATA 7
$P(a_j \mid t, \mathrm{CAA}_i)$ in Eq. 2 is the probability of taking activity $a_j$ if the passenger is located in $\mathrm{CAA}_i$ at time $t$. The distribution of $P(a_j \mid t, \mathrm{CAA}_i)$ depends on the spatial and temporal patterns of human activity in that area. It is well recognized that human behaviours in terms of taking activities present strong and regular patterns. For instance, with respect to the time dimension, the probability of visiting work-related places during 8:00 am-10:00 am is generally much higher than that of visiting shopping malls. With respect to the space dimension, the pattern may vary across geographical areas. To capture such temporal and spatial regularities at a fine granularity, we again rely on the check-ins from Foursquare. Given the time $t$ and the candidate activity area $\mathrm{CAA}_i$, we approximate the probability of visiting a certain POI category (i.e., taking activity $a_j$) by the ratio of the number of check-ins in the given POI category to the total number of check-ins in $\mathrm{CAA}_i$ during the given time slot over the historical days (e.g., the last month), as shown in Eq. 7.
$$P(a_j \mid t, \mathrm{CAA}_i) = \frac{\mathrm{numberofCheckins}(a_j, \mathrm{CAA}_i, t, days)}{\mathrm{numberofCheckins}(\mathrm{CAA}_i, t, days)} \tag{7}$$
Although strong and regular patterns (i.e., regularity) of human behaviours are frequently observed, dynamics is another salient feature. For instance, human behaviours are interrupted and changed by unexpected, sudden, large social events. To capture such changes, we propose to also incorporate the most recent check-ins in the studied area, since the live data may reflect the affected human activities in a timely manner. Therefore, the probability can be updated by Eq. 8.
$$P(a_j \mid t, \mathrm{CAA}_i) \propto \underbrace{\alpha \times \frac{\mathrm{numberofCheckins}(a_j, \mathrm{CAA}_i, t, days)}{\mathrm{numberofCheckins}(\mathrm{CAA}_i, t, days)}}_{\text{regularity}} + \underbrace{(1 - \alpha) \times \frac{\mathrm{numberofCheckins}(a_j, \mathrm{CAA}_i, t, 4h)}{\mathrm{numberofCheckins}(\mathrm{CAA}_i, t, 4h)}}_{\text{dynamic}} \tag{8}$$
where $\mathrm{numberofCheckins}(a_j, \mathrm{CAA}_i, t, 4h)$ refers to the number of check-ins in the given POI category and $\mathrm{numberofCheckins}(\mathrm{CAA}_i, t, 4h)$ to the total number of check-ins in the area of $\mathrm{CAA}_i$, both counted over the most recent four hours just before time $t$. $\alpha$ is a weighting factor (we set $\alpha = 0.9$ in this study). We note that the probability obtained by Eq. 8 needs to be normalized, i.e., $\sum_{j=1}^{n} P(a_j \mid t, \mathrm{CAA}_i) = 1$, with $n$ being the total number of activities considered in the paper.
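The blending of the “regularity” and “dynamic” terms in Eq. 8 can be sketched as below. The function name `activity_prior` and the handling of empty windows (a window with no check-ins contributes zero) are assumptions made for this illustration.

```python
def activity_prior(hist_counts, recent_counts, alpha=0.9):
    """Eq. 8: alpha * historical share + (1 - alpha) * last-4-hours share, normalised."""
    def share(counts):
        total = sum(counts.values())
        # Assumption: a window without any check-in contributes nothing.
        return {a: (n / total if total else 0.0) for a, n in counts.items()}

    hist, recent = share(hist_counts), share(recent_counts)
    mixed = {a: alpha * hist.get(a, 0.0) + (1 - alpha) * recent.get(a, 0.0)
             for a in set(hist) | set(recent)}
    z = sum(mixed.values())
    return {a: v / z for a, v in mixed.items()} if z else mixed
```

With historical check-ins of 80 (work) and 20 (dining) and recent check-ins of 0 (work) and 10 (dining), the regular shares are 0.8 and 0.2, the dynamic shares are 0 and 1, and the blended prior with $\alpha = 0.9$ is 0.72 for work and 0.28 for dining.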
$P(t, \mathrm{CAA}_i)$ in Eq. 2 is the probability of taking activities in $\mathrm{CAA}_i$ after the passenger gets off the taxi at time $t$, which can be computed by Eq. 9.
$$P(t, \mathrm{CAA}_i) = P(t) \times P(\mathrm{CAA}_i \mid t) \tag{9}$$
The probability of a passenger getting off a taxi at time $t$ (i.e., $P(t)$) differs across times of the day, since human activity has strong time regularity. $P(t)$ can be estimated by the ratio of the number of drop-offs during the given time slot to the number of drop-offs during the whole day. The computation of $P(\mathrm{CAA}_i \mid t)$ is a bit more complicated, so we use the example in Fig. 2 to illustrate the basic idea. Suppose that 6 taxi trips occurred during the given time slot and that 8 CAAs have been identified. For each taxi trip, the passenger chooses one of the CAAs in which to take activities after getting off. Furthermore, as discussed earlier in this section, for each trip we assume the passenger takes activities in one of the top-$k$ CAAs within walkable distance. In the example, the value of a grid cell (e.g., $g_{ij}$) refers to the probability that the passenger of taxi trip $tr_j$ takes an activity in area $\mathrm{CAA}_i$, which can be computed by Eq. 1. For each time slot, the probability of taking an activity in a given CAA ($\mathrm{CAA}_i$) is simply the average of the corresponding row values, i.e.,
$$P(\mathrm{CAA}_i \mid t) = \frac{\sum_{m=1}^{N} g_{im}}{N} \tag{10}$$
where $N$ is the number of taxi trips that occurred in the studied time slot.
Fig. 2. Illustration of the computation of $P(t, \mathrm{CAA}_i)$. The value in each grid cell refers to the probability of taking an activity in the corresponding CAA after the end of the corresponding trip.
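Eqs. 9 and 10 amount to a row average followed by a multiplication. In the sketch below, each trip is represented by a dictionary mapping a CAA index to its Eq. 1 probability (CAAs outside the trip’s top-$k$ list are simply absent and count as zero); this representation and the function names are assumptions for illustration.

```python
def caa_time_prob(trip_probs, n_caas):
    """Eq. 10: average each CAA's Eq. 1 probability over the N trips of a time slot."""
    n_trips = len(trip_probs)
    totals = [0.0] * n_caas
    for probs in trip_probs:              # probs: {caa_index: P(CAA_i | (x, y))}
        for i, p in probs.items():
            totals[i] += p
    return [s / n_trips for s in totals]

def joint_time_caa(drop_offs_in_slot, drop_offs_in_day, p_caa_given_t):
    """Eq. 9: P(t, CAA_i) = P(t) * P(CAA_i | t)."""
    p_t = drop_offs_in_slot / drop_offs_in_day
    return [p_t * p for p in p_caa_given_t]
```

For two trips with probabilities `{0: 0.5, 1: 0.5}` and `{1: 1.0}` over three CAAs, Eq. 10 gives 0.25, 0.75 and 0. If the slot holds 2 of the day’s 8 drop-offs, $P(t) = 0.25$ and Eq. 9 gives 0.0625, 0.1875 and 0.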
In summary, for the taxi trip $(x, y, t)$, the probability of the passenger taking a given activity $a_j$ after getting off the taxi can be approximated by the following equation.
$$P(a_j \mid (x, y), t) \propto P(\mathrm{CAA}_i \mid (x, y)) \times P(a_j \mid (x, y), t, \mathrm{CAA}_i) \quad \text{s.t.} \quad \sum_{j=1}^{n} P(a_j \mid (x, y), t) = 1 \tag{11}$$
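The way Eqs. 2, 3 and 11 fit together can be sketched as follows. One point is an assumption of this sketch rather than something stated in Eq. 11: the per-CAA terms are summed over the top-$k$ CAAs before normalisation. The function names are hypothetical.

```python
def posterior_in_caa(likelihood, prior, p_t_caa):
    """Eqs. 2-3: Bayes posterior over activities inside one CAA."""
    joint = {a: likelihood[a] * prior.get(a, 0.0) * p_t_caa for a in likelihood}
    evidence = sum(joint.values())        # Eq. 3
    return {a: (v / evidence if evidence else 0.0) for a, v in joint.items()}

def trip_purpose(caa_probs, posteriors):
    """Eq. 11, summing the per-CAA terms over the top-k CAAs (an assumption)."""
    scores = {}
    for caa, p_caa in caa_probs.items():
        for a, p in posteriors[caa].items():
            scores[a] = scores.get(a, 0.0) + p_caa * p
    z = sum(scores.values())
    return {a: s / z for a, s in scores.items()} if z else scores
```

Note that $P(t, \mathrm{CAA}_i)$ appears in both the numerator of Eq. 2 and every term of Eq. 3, so inside a single CAA it cancels out of the posterior. As a usage example, with visiting probabilities 0.75 and 0.25 for two CAAs whose posteriors are (dining 0.2, work 0.8) and (dining 1.0, work 0.0), the combined result is dining 0.4 and work 0.6.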
V. PHASE II: ENABLING REAL-TIME RESPONSE
To enable a real-time response to each drop-off event (i.e., to compute the posterior probability of each activity for each drop-off point using Bayes’ theorem in real time), we need to identify the most time-consuming component. As discussed in Section III, the posterior probability calculation mainly consists of four components, the details of which are shown in Table II.
TABLE II
DETAILS ON EACH COMPONENT OF INFERRING TRIP PURPOSES
Fig. 3. A schematic diagram of reducing the time complexity of the first component. The value on each edge is the visiting probability to the corresponding CAA.
As shown in the table, the first component is the probability of visiting a given candidate activity area ($\mathrm{CAA}_i$) if the passenger was dropped off at point $(x, y)$. This probability is computed online, because the distances to the top-$k$ nearest CAAs vary with the drop-off point. However, we argue that two drop-off points that are close to each other have similar values of $P(\mathrm{CAA}_i \mid (x, y))$, i.e., $P(\mathrm{CAA}_i \mid (x_1, y_1)) \approx P(\mathrm{CAA}_i \mid (x_2, y_2))$ if $(x_1, y_1)$ is close to $(x_2, y_2)$. Hence, we aggregate historical drop-off points into drop-off clusters and assume that all drop-off points in the same cluster have an equal value of $P(\mathrm{CAA}_i \mid (x, y))$. In this way, the value of the first component can be pre-computed offline, and the only online job is to identify which drop-off cluster a newly received real-time drop-off point belongs to, which is quite efficient. In this manner, the computation time can be reduced significantly. As shown in Fig. 3, the top-$k$ CAAs of a drop-off cluster can be identified, and the distance to each CAA is measured between the centroid of the drop-off cluster and the centroid of the CAA. Thus, the probability of visiting $\mathrm{CAA}_i$ from a drop-off point inside the drop-off cluster can be calculated offline efficiently. We note that many drop-off clusters can be obtained in advance, given the historical taxi trip data. Each drop-off cluster is associated with $k$ visiting probabilities to its nearby top-$k$ CAAs.
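The split between offline pre-computation and online lookup for the first component can be sketched as follows. The sketch assumes that an incoming drop-off point is assigned to the drop-off cluster with the nearest centroid; the paper does not spell out the matching rule in this passage, so this rule, the function names, and the 1-metre distance clamp are all illustrative assumptions.

```python
from math import hypot

def precompute(cluster_centroids, caa_centers, k=3, beta=1.5, max_walk=500.0):
    """Offline: Eq. 1 probabilities from each drop-off cluster centroid to its top-k CAAs."""
    table = []
    for cx, cy in cluster_centroids:
        near = sorted((hypot(ax - cx, ay - cy), i)
                      for i, (ax, ay) in enumerate(caa_centers))
        near = [(d, i) for d, i in near if d <= max_walk][:k]
        weights = {i: max(d, 1.0) ** -beta for d, i in near}   # clamp: assumption
        z = sum(weights.values())
        table.append({i: w / z for i, w in weights.items()})
    return table

def lookup(drop_off, cluster_centroids, table):
    """Online: find the nearest cluster centroid and reuse its stored probabilities."""
    x, y = drop_off
    nearest = min(range(len(cluster_centroids)),
                  key=lambda c: hypot(cluster_centroids[c][0] - x,
                                      cluster_centroids[c][1] - y))
    return table[nearest]
```

The online step is a linear scan over centroids here; with many clusters a spatial index would be the natural replacement, but that choice is outside what this section describes.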
The second component is the probability of getting off the taxi at point $(x, y)$ if the passenger walks to area $\mathrm{CAA}_i$ and intends to take activity $a_j$ at time $t$. As discussed earlier, two factors are considered. The first is the attractiveness of $\mathrm{CAA}_i$ in the given time slot, which is measured by the popularity of that area. Note that the popularity of a CAA in a given time slot can be calculated in advance, using the historical check-in data contributed by mobile users. The second factor is the POI category distribution in $\mathrm{CAA}_i$, which remains relatively stable. Thus, the value of the second component can obviously be pre-computed offline.
The third component is the conditional probability of taking a given activity (e.g., $a_j$) if the passenger is at $\mathrm{CAA}_i$ at time $t$. To approximate the true value of this component, both the “regularity” and “dynamic” patterns of the area are taken into consideration. As shown in the formula, the “regularity” pattern is based on the historical check-in data, and the “dynamic” pattern is captured by the most recent check-in data just before the drop-off time. Thus, the former part can be pre-processed offline, while the latter part can only be computed online.
The fourth component is the joint probability of visiting the area of $\mathrm{CAA}_i$ at time $t$. The value is determined by two parts: the frequency of getting off taxis in the given time slot, and the spatial distribution of the drop-off points. Both parts are quantified using the historical taxi trip data. Thus, the value can be pre-computed offline.
In summary, two online jobs are required when receiving a streaming drop-off point $(x_r, y_r, t_r)$: identifying the drop-off cluster and extracting the “dynamic” patterns of the top-$k$ CAAs. With the other components computed and structured offline on purpose, the whole process can be quite efficient. We will validate this in the experiments.
VI. EVALUATION
A. Experimental Setup
1) Data Preparation: Three data sets from the Manhattan area of New York City (NYC) are used, i.e., the road network, the Foursquare check-in data, and the taxi GPS trajectory data. Some basic statistical information about the three data sets is shown in Table III.
2) Comparison Algorithms: We compare our approach with two baseline algorithms, the details of which are presented as follows.
• Nearest. The Nearest algorithm simply takes the POI closest to the drop-off location as the final destination of the passenger, regardless of the drop-off time. Thus, the trip purpose is predicted as the activity related to that POI category.
• Bayes’ rule [15]. The major difference between this baseline and our proposed approach is that the baseline assumes that two temporally close drop-off points may share the same prior probability of a given trip purpose, even if the two points are located far away from each other. In contrast, our proposed algorithm considers both regular and dynamic patterns when calculating the prior probability at a very fine spatial and temporal resolution, leveraging the user-generated check-in data.
Fig. 4. Results of UAR and CAA identification in Manhattan, NYC: (a) a full view of the clustering results; (b) a close view of some selected regions; (c) the number of CAAs in the UARs. (Best viewed in an enlarged digital version.)
TABLE III
STATISTICS OF URBAN DATA SETS USED IN THE PAPER
3) Evaluation Environment: All the evaluations in the paper are implemented in Java under the Eclipse J2SE 1.5 integrated development environment, and are run on an Intel Core i5-4950 PC with 8 GB of RAM and the Windows 8 operating system.
B. Evaluation on Candidate Activity Area Identification
Fig. 4 presents the clustering results (i.e., the identification of UARs and CAAs) of our two-stage clustering algorithm. In total, we have identified 30 UARs, all of which are based on the road network data. As shown in Fig. 4(a), most POIs are located in midtown and downtown Manhattan, while only very few are scattered uptown. A close view of some selected regions is shown in Fig. 4(b) to highlight the advantages of our proposed clustering algorithm. For example, due to physical barriers (i.e., wide roads), the POIs in purple in Region 6 are not grouped together with their nearby POIs in Region 5, and several POIs in Region 4 are not merged with their neighbours in Region 5 either. Each UAR contains a different number of CAAs, depending on the spatial distribution of the POIs inside. Fig. 4(c) shows the number of CAAs for each UAR. The $x$ coordinate corresponds to the region number and the $y$ coordinate is the number of CAAs in that region. As shown in the figure, Region 17 contains the maximal number of CAAs, while most regions have fewer than 20 CAAs.
The size of the identified CAAs is also an important metric for evaluating the clustering algorithm. Each CAA should fit within walkable distance. Here the size of a CAA is defined as the area of the minimal rectangle that covers all POIs in the CAA. If the CAA is too big, its POIs are difficult to reach on foot. Fig. 5 shows the Cumulative Distribution Function (CDF) of the size of all CAAs. As can be seen from the figure, over 96% of the CAAs are smaller than 10,000 square meters, showing the effectiveness of our proposed two-stage clustering algorithm.
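The size metric can be illustrated with a short sketch. It assumes that the “minimal rectangle” is axis-aligned and that POI coordinates are given in meters in a projected coordinate system; both are assumptions of this example, as is the function name.

```python
def caa_size(pois):
    """Area of the axis-aligned bounding rectangle of a CAA's POIs (square meters)."""
    xs = [x for x, _ in pois]
    ys = [y for _, y in pois]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))
```

For POIs at (0, 0), (50, 20) and (30, 80), the bounding rectangle is 50 m by 80 m, i.e., 4,000 square meters, which is below the 10,000 square meter mark discussed above.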
Fig. 5. The CDF distribution of the size of CAAs.
C. Evaluation on Trip Purpose Imputation Algorithm
As discussed earlier, due to the lack of ground truth for the taxi trip purpose, it is impossible to calculate the inference accuracy directly. Fortunately, travel purpose survey data at the regional scale (e.g., Manhattan) are available [3], which motivated us to evaluate the system accuracy indirectly. The rationale is as follows: if the distribution of the trip purposes inferred by our proposed method is statistically close to the one obtained from the survey data at the regional scale, our proposed method should be reliable. Since the survey data classifies the travel purpose into 5 categories, i.e., work, education, recreation, shopping and others, to make the results comparable, we manually put ‘dining’, ‘In-home’, ‘Transportation transfer’, ‘Lodging’ and ‘Medical’ into the ‘Others’ category. Next, for each taxi trip, the proposed inference algorithm gives 5 probabilities, one for each of the 5 new trip purposes. Finally, for each trip purpose, we average the probabilities over all taxi trips generated in one month, and use the average value as the percentage of travel for that trip purpose.
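The aggregation used for this indirect evaluation can be sketched as below: the fine-grained categories are first merged into ‘Others’, and the per-trip probabilities are then averaged. The category spellings and the function names are assumptions of this sketch.

```python
# Categories merged into 'Others' to match the survey's five classes.
OTHERS = {"dining", "In-home", "Transportation transfer", "Lodging", "Medical"}

def merge(trip_probs):
    """Fold the fine-grained categories of one trip into the survey categories."""
    merged = {}
    for cat, p in trip_probs.items():
        key = "Others" if cat in OTHERS else cat
        merged[key] = merged.get(key, 0.0) + p
    return merged

def purpose_shares(trips):
    """Average the merged per-trip probabilities over all trips."""
    totals = {}
    for probs in trips:
        for cat, p in merge(probs).items():
            totals[cat] = totals.get(cat, 0.0) + p
    return {cat: s / len(trips) for cat, s in totals.items()}
```

For two trips with probabilities (work 0.5, dining 0.3, Lodging 0.2) and (work 1.0), the resulting shares are 0.75 for work and 0.25 for Others; these shares are what would be compared against the survey percentages.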
We compare our inference results with the travel survey data in Fig. 6, where the results obtained by the two baselines are also plotted. The closer the percentage of each category is to the corresponding survey value, the better the algorithm performs. As can be seen from the results, our proposed algorithm achieves the best performance, the Nearest algorithm achieves the worst performance, and the Bayes’ rule baseline [15] lies in between.
Our proposed inference algorithm also enables us to gain insights into trip purposes at a much finer resolution. We thus select a representative urban activity region (UAR) to investigate the trend of trip purposes at different times of a work day. The selected UAR, together with the POIs distributed inside it, is shown in Fig. 7, where only four POI categories can be found. Fig. 8 shows the trip purpose inference results of the selected region across the whole day (top chart). We also show the corresponding results in the other regions of Manhattan for comparison (bottom chart). As shown in the figure, travel for shopping and dining is more common in the selected region, since it is a well-known shopping and dining center in NYC. Moreover, the number of trips for shopping keeps increasing and remains high in the daytime, even on work days. In both the selected UAR and the other regions of Manhattan, the number of trips for recreation climbs after working hours.
Fig. 6. Comparison results to baseline algorithms and survey data.
Fig. 7. A selected UAR with 4 kinds of POIs. (Best viewed in an enlarged digital version.)
Fig. 8. Trip purpose imputation results for a given day in the selected UAR and in Manhattan, respectively.
D. Evaluation on Response Time
Another key system metric is how long a passenger has to wait for the recommendation services after getting off the taxi. Because all requests are processed sequentially on one machine following the First-In-First-Out (FIFO) rule, one of the following two situations may occur when a request arrives: (1) no other requests are being processed or waiting to be processed in the system; (2) there are other requests in the system, being processed or waiting to be processed. In the first case, the request enters service immediately upon arrival. In the second case, the request has to wait in the queue until the server has finished processing the requests that arrived earlier. Thus, the response time of a request is the time from its arrival until it has been processed. In other words, the response time includes the waiting time and the processing time.
TABLE IV
RESPONSE TIME IN THE WORST CASE IN MANHATTAN AND NYC, RESPECTIVELY
Fig. 9. The CDFs of the response times on one day in Manhattan and the whole NYC, respectively.
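The definition of response time as waiting time plus processing time on a single sequential server can be made concrete with a small simulation. This is an illustrative sketch only; the function name and the input format (arrival times sorted in ascending order, with matching per-request service times) are assumptions.

```python
def response_times(arrivals, service_times):
    """Single-server FIFO queue: response time = waiting time + processing time."""
    free_at = 0.0                          # time at which the server becomes idle
    out = []
    for arrive, service in zip(arrivals, service_times):
        start = max(arrive, free_at)       # wait if an earlier request is in service
        free_at = start + service
        out.append(free_at - arrive)
    return out
```

For arrivals at 0, 1 and 1.5 seconds with 2 seconds of processing each, the response times are 2, 3 and 4.5 seconds: the first request is served immediately, while the later ones queue behind it. The maximum of this list over a day corresponds to the worst-case response time examined in this subsection.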
We are more interested in the longest response time experienced by a request during a day, i.e., the longest time that a request (or a taxi trip) has to wait until it has been processed. The logic is that if the longest response time is acceptable for most users, then the system is useful in practice. The longest response time corresponds to the worst case during a day. Table IV shows the average and the standard deviation of the longest response time in Manhattan and the whole NYC, respectively. Note that the number of observation days is 15. On average, the worst case takes 7.54 seconds and 8.15 seconds to respond to requests from Manhattan and from the whole NYC, respectively, which is acceptable in our application scenarios. Hence, we conclude that our proposed TripImputor is not only able to process requests from the whole NYC with a single normal PC, but also provides timely recommendation services.
We are also interested in the distribution of all response times in Manhattan and the whole NYC, as shown in Fig. 9. As can be observed, although Manhattan contributes 90% of the trip inference requests of the whole city, it still takes more time to respond to a request from the whole city, because the more requests arrive per unit time, the longer the waiting time and hence the response time. Moreover, almost half of the requests can be answered within 50 milliseconds in both Manhattan and the whole NYC. As shown in the figure, although in the worst case it takes up to around 7.54 seconds to process a request, 80% of the requests from Manhattan can be answered within 4.5 seconds and 80% of those from the whole NYC within around 5 seconds. On average, it takes only 1.588 and 1.812 seconds to respond for Manhattan and the whole NYC, respectively. The above results demonstrate the efficiency of our system.
Fig. 10. The longest response time (corresponding to the worst case) under different numbers of requests per hour.
The previous experimental results confirm the efficiency of our proposed system in handling requests from the whole NYC. We are also aware that it takes more time to respond to a request when more requests arrive (as in the whole NYC). Going a step further, we investigate how many cities like NYC a single normal PC can support while still returning a timely response. In Fig. 10, the x-axis refers to the number of requests per hour and the y-axis refers to the longest response time of all requests. As can be seen, it takes at most around 7, 16, 24 and 30 seconds to process 20,000, 40,000, 60,000 and 80,000 requests, respectively. When the number of requests received during one hour keeps increasing, the longest response time grows sharply, because all the requests are processed sequentially on one PC and the queue builds up. The longest response time is more than 9 minutes if the number of requests per hour is 100,000. Note that around 20,000 requests arrive in one hour in the whole NYC during the peak hours. Thus, with our method, one normal PC is capable of handling the requests of 4 cities like NYC, provided that users can accept a maximal response time of around 30 seconds.
VII. CONCLUSION AND FUTURE WORK
In this paper, we present a novel two-phase framework called TripImputor for inferring the taxi trip purpose in real time. In the phase of trip purpose inference, we first propose a two-stage clustering algorithm to identify the candidate activity areas in the urban space, and then calculate the posterior probabilities of taking each activity for each taxi trip using Bayes’ theorem. In the second phase, to reduce the online computation time and ensure a real-time response, we develop a sophisticated procedure that mainly includes clustering historical drop-off points and matching the drop-off clusters with CAAs. Finally, we evaluate the effectiveness and efficiency of the system using real-world datasets. Experimental results demonstrate that the proposed two-phase framework achieves promising performance in both accuracy and response time.
In the future, we plan to broaden and deepen this work in several directions. First, we plan to incorporate more relevant information to further improve the accuracy of the inference algorithm, such as personal background and socio-economic features, with a particular focus on utilizing the information about the pick-up point (the pick-up time and location, and its nearby spatial context as well) and the trip travel time. Second, we intend to investigate the taxi trip purposes in different seasons under different spatial resolutions, as well as the yearly evolution of taxi trip purposes and the underlying motivations. Third, we intend to accelerate the computation by introducing parallel mechanisms such as Spark, since each taxi trip can be handled separately. Finally, we would like to deploy our system on mobile devices and recruit volunteers to test it in actual settings, collecting feedback on how to further improve the service.
REFERENCES
[1] R. K. Balan, K. X. Nguyen, and L. Jiang, “Real-time trip information service for a large taxi fleet,” in Proc. MobiSys, 2011, pp. 99–112.
[2] P. S. Castro, D. Zhang, C. Chen, S. Li, and G. Pan, “From taxi GPS traces to social and community dynamics: A survey,” ACM Comput. Surv., vol. 46, no. 2, pp. 17:1–17:34, 2013.
[3] C. Chen, H. Gong, C. Lawson, and E. Bialostozky, “Evaluating the feasibility of a passive travel survey collection in a complex urban environment: Lessons learned from the New York City case study,” Transp. Res. A, Policy Pract., vol. 44, no. 10, pp. 830–840, 2010.
[4] C. Chen, Z. Wang, and B. Guo, “The road to the Chinese smart city: Progress, challenges, and future directions,” IT Prof., vol. 18, no. 1, pp. 14–17, Jan./Feb. 2016.
[5] C. Chen et al., “iBOAT: Isolation-based online anomalous trajectory detection,” IEEE Trans. Intell. Transp. Syst., vol. 14, no. 2, pp. 806–818, Jun. 2013.
[6] C. Chen, D. Zhang, B. Guo, X. Ma, G. Pan, and Z. Wu, “TripPlanner: Personalized trip planning leveraging heterogeneous crowdsourced digital footprints,” IEEE Trans. Intell. Transp. Syst., vol. 16, no. 3, pp. 1259–1273, Jun. 2015.
[7] C. Chen et al., “CrowdDeliver: Planning city-wide package delivery paths leveraging the crowd of taxis,” IEEE Trans. Intell. Transp. Syst., vol. 18, no. 6, pp. 1478–1496, Jun. 2017.
[8] K. J. Clifton and S. L. Handy, “Qualitative methods in travel behaviour research,” in Transport Survey Quality and Innovation. Emerald Group Publishing Limited, 2003, pp. 283–302.
[9] Z. Deng and M. Ji, “Deriving rules for trip purpose identification from GPS travel survey data and land use data: A machine learning approach,” in Proc. 7th Int. Conf. Traffic Transp. Stud., 2010, pp. 768–777.
[10] Y. Ding, C. Chen, S. Zhang, B. Guo, Z. Yu, and Y. Wang, “GreenPlanner: Planning personalized fuel-efficient driving routes using multi-sourced urban data,” in Proc. PerCom, Mar. 2017, pp. 207–216.
[11] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proc. KDD, vol. 96. 1996, pp. 226–231.
[12] T. Feng and H. J. P. Timmermans, “Detecting activity type from GPS traces using spatial and temporal information,” Eur. J. Transp. Infrastruct. Res., vol. 15, no. 4, pp. 662–674, 2015.
[13] B. Furletti, P. Cintia, C. Renso, and L. Spinsanti, “Inferring human activities from GPS tracks,” in Proc. 2nd ACM SIGKDD Int. Workshop Urban Comput., 2013, p. 5.
[14] Y. Ge, H. Xiong, A. Tuzhilin, K. Xiao, M. Gruteser, and M. Pazzani, “An energy-efficient mobile recommender system,” in Proc. ACM KDD, 2010, pp. 899–908.
[15] L. Gong, X. Liu, L. Wu, and Y. Liu, “Inferring trip purposes and uncovering travel patterns from taxi trajectory data,” Cartogr. Geogr. Inf. Sci., vol. 43, no. 2, pp. 103–114, 2016.
[16] L. Gong, T. Morikawa, T. Yamamoto, and H. Sato, “Deriving personal trip data from GPS data: A literature review on the existing methodologies,” Procedia-Social Behavioral Sci., vol. 138, pp. 557–565, Jul. 2014.
[17] K. Hormann and A. Agathos, “The point in polygon problem for arbitrary polygons,” Comput. Geometry, vol. 20, no. 3, pp. 131–144, 2001.
[18] J. Huang, Y. Li, R. Crawfis, S.-C. Lu, and S.-Y. Liou, “A complete distance field representation,” in Proc. Conf. Vis., 2001, pp. 247–254.
[19] L. Huang, Q. Li, and Y. Yue, “Activity identification from GPS trajectories using spatial temporal POIs’ attractiveness,” in Proc. 2nd ACM SIGSPATIAL Int. Workshop LBSNs, 2010, pp. 27–30.
[20] C. Kang, X. Ma, D. Tong, and Y. Liu, “Intra-urban human mobility patterns: An urban morphology perspective,” Phys. A, Statist. Mech. Appl., vol. 391, no. 4, pp. 1702–1717, 2012.
[21] K.-R. Koch, Introduction to Bayesian Statistics. Springer, 2007.
[22] J. Krumm and D. Rouhana, “Placer: Semantic place labels from diary data,” in Proc. ACM Int. Joint Conf. Pervasive Ubiquitous Comput., 2013, pp. 163–172.
[23] M.-P. Kwan, “How GIS can help address the uncertain geographic context problem in social science research,” Ann. GIS, vol. 18, no. 4, pp. 245–255, 2012.
[24] H. T. Lam, E. Diaz-Aviles, A. Pascale, Y. Gkoufas, and B. Chen. (2015). “(Blue) taxi destination and trip time prediction from partial trajectories.” [Online]. Available: https://arxiv.org/abs/1509.05257
[25] X. Li, M. Li, Y.-J. Gong, X.-L. Zhang, and J. Yin, “T-DesP: Destination prediction based on big trajectory data,” IEEE Trans. Intell. Transp. Syst., vol. 17, no. 8, pp. 2344–2354, Aug. 2016.
[26] Y. Lin, H. Wan, R. Jiang, Z. Wu, and X. Jia, “Inferring the travel purposes of passenger groups for better understanding of passengers,” IEEE Trans. Intell. Transp. Syst., vol. 16, no. 1, pp. 235–243, Feb. 2015.
[27] Y. Lu and L. Zhang, “Imputing trip purposes for long-distance travel,” Transportation, vol. 42, no. 4, pp. 581–595, 2015.
[28] D. Newman and A. Paasi, “Fences and neighbours in the postmodern world: Boundary narratives in political geography,” Prog. Hum. Geogr., vol. 22, no. 2, pp. 186–207, 1998.
[29] T. H. Rashidi, A. Abbasi, M. Maghrebi, S. Hasan, and T. S. Waller, “Exploring the capacity of social media data for modelling travel behaviour: Opportunities and challenges,” Transp. Res. C, Emerg. Technol., vol. 75, pp. 197–211, Feb. 2017.
[30] S. Schönfelder, “Urban rhythms: Modelling the rhythms of individual travel behaviour,” Ph.D. dissertation, ETH Zurich, Zürich, Switzerland, 2006.
[31] M. Shimrat, “Algorithm 112: Position of point relative to polygon,” Commun. ACM, vol. 5, no. 8, p. 434, 1962.
[32] L. Wang, Z. Yu, B. Guo, T. Ku, and F. Yi, “Moving destination prediction using sparse dataset: A mobility gradient descent approach,” ACM Trans. Knowl. Discovery Data, vol. 11, no. 3, p. 37, 2017.
[33] J. Wolf, “Using GPS data loggers to replace travel diaries in the collection of travel data,” Ph.D. dissertation, Georgia Inst. Technol., Atlanta, GA, USA, 2000.
[34] A. Y. Xue, R. Zhang, Y. Zheng, X. Xie, J. Huang, and Z. Xu, “Destination prediction by sub-trajectory synthesis and privacy protection against such prediction,” in Proc. IEEE ICDE, Apr. 2013, pp. 254–265.
[35] D. Yang, D. Zhang, V. W. Zheng, and Z. Yu, “Modeling user activity preference by leveraging user spatial temporal characteristics in LBSNs,” IEEE Trans. Syst., Man, Cybern., Syst., vol. 45, no. 1, pp. 129–142, Jan. 2015.
[36] Z. Yu, H. Xu, Z. Yang, and B. Guo, “Personalized travel package with multi-point-of-interest recommendation based on crowdsourced user footprints,” IEEE Trans. Human–Mach. Syst., vol. 46, no. 1, pp. 151–158, Feb. 2016.
[37] N. J. Yuan, Y. Zheng, and X. Xie, “Segmentation of urban areas using road networks,” Microsoft Res., Tech. Rep., 2012.
[38] Y. Yue, T. Lan, A. G. O. Yeh, and Q.-Q. Li, “Zooming into individuals to understand the collective: A review of trajectory-based travel behaviour studies,” Travel Behaviour Soc., vol. 1, no. 2, pp. 69–78, 2014.
[39] Y. Zheng, Y. Chen, Q. Li, X. Xie, and W.-Y. Ma, “Understanding transportation modes based on GPS data for Web applications,” ACM Trans. Web, vol. 4, no. 1, p. 1, 2010.
[40] C. Zhong, S. M. Arisona, X. Huang, M. Batty, and G. Schmitt, “Detecting the dynamics of urban structure through spatial network analysis,” Int. J. Geogr. Inf. Sci., vol. 28, no. 11, pp. 2178–2199, 2014.
[41] Z. Zhu, U. Blanke, and G. Tröster, “Inferring travel purpose from crowd-augmented human mobility data,” in Proc. 1st Int. Conf. IoT Urban Space, 2014, pp. 44–49.
IEEE Proof
CHEN et al.: TRIPIMPUTOR: REAL-TIME IMPUTING TAXI TRIP PURPOSE LEVERAGING MULTI-SOURCED URBAN DATA 13
Chao Chen received the B.Sc. and M.Sc. degrees in control science and control engineering from Northwestern Polytechnical University, Xi’an, China, in 2007 and 2010, respectively, and the Ph.D. degree from the Université Pierre et Marie Curie and the Institut Mines-Télécom/Télécom SudParis, France, in 2014.
In 2009, he was a Research Assistant with Hong Kong Polytechnic University, Hong Kong. He is currently an Associate Professor with the College of Computer Science, Chongqing University, Chongqing, China. He has authored or co-authored over 40 papers including eight IEEE transactions. His research interests include pervasive computing, mobile computing, urban logistics, data mining from large-scale GPS trajectory data, and big data analytics for smart cities. His work on taxi trajectory data mining was featured by the IEEE Spectrum in 2011 and 2016, respectively. He was also a recipient of the Best Paper Runner-Up Award at MobiQuitous 2011.
Shuhai Jiao received the B.Sc. degree from the College of Information and Software Engineering, Northeast Normal University, Changchun, China, in 2015. He is currently pursuing the master’s degree with the College of Computer Science, Chongqing University, Chongqing, China. He was a Research Intern at Didi Chuxing Company, Beijing, China, in 2017. His research interests include scenic travel route planning and taxi GPS trajectory data mining.
Shu Zhang received the bachelor’s degree from the Civil Aviation University of China, Tianjin, China, in 2007, the master’s degree from Mississippi State University, Starkville, MS, USA, in 2010, and the Ph.D. degree in management sciences from the University of Iowa, Iowa, IA, USA, in 2015. She is currently an Assistant Professor with the College of Economics and Business Administration, Chongqing University, Chongqing. Her research interests include vehicle routing, urban logistics, and transportation network design.
Weichen Liu (S’07–M’11) received the B.Eng. and M.Eng. degrees from the Harbin Institute of Technology, China, and the Ph.D. degree from the Hong Kong University of Science and Technology, Hong Kong. He is currently an Assistant Professor with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. He has authored and co-authored over 70 research papers in peer-reviewed journals, conferences, and books. His research interests include embedded and real-time systems, multiprocessor systems, and fault-tolerant systems. He has received the Best Paper Candidate Awards from CODES+ISSS, CASES, and ASP-DAC.
Liang Feng received the Ph.D. degree from the School of Computer Engineering, Nanyang Technological University, Singapore, in 2014. He was a Post-Doctoral Research Fellow at the Computational Intelligence Graduate Laboratory, Nanyang Technological University. He is currently an Assistant Professor at the College of Computer Science, Chongqing University, China. His research interests include computational and artificial intelligence, memetic computing, big data optimization and learning, and transfer learning.
Yasha Wang received the Ph.D. degree from Northeastern University, Shenyang, China, in 2003. He is currently a Professor and an Associate Director of the National Research and Engineering Center of Software Engineering with Peking University, China. His research interests include urban data analytics, ubiquitous computing, software reuse, and online software development environment. He has authored or co-authored over 50 papers in prestigious conferences and journals, such as ICWS, UbiComp, ICSP, and so on. As a Technical Leader and Manager, he has accomplished several key national projects on software engineering and smart cities. Cooperating with major smart-city solution providing companies, his research work has been adopted in more than 20 cities in China.
IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS 1
TripImputor: Real-Time Imputing Taxi Trip Purpose
Leveraging Multi-Sourced Urban Data
Chao Chen, Shuhai Jiao, Shu Zhang, Weichen Liu, Member, IEEE,
Liang Feng, and Yasha Wang
Abstract—Travel behavior understanding is a long-standing and critically important topic in the area of smart cities. Big volumes of various GPS-based travel data can be easily collected, among which taxi GPS trajectory data is a typical example. However, GPS trajectory data usually carries little information on travelers’ activities, so it can support only limited applications. Quite a few studies have focused on enriching the semantic meaning of raw data, such as inferring the travel mode or purpose. Unfortunately, trip purpose imputation has received relatively little attention, and existing approaches do not require a real-time response. To narrow the gap, we propose a probabilistic two-phase framework named TripImputor for imputing taxi trip purposes in real time and recommending services to passengers at their drop-off points. Specifically, in the first phase, we propose a two-stage clustering algorithm to identify candidate activity areas (CAAs) in the urban space. Then, we extract fine-granularity spatial and temporal patterns of human behaviors inside the CAAs from Foursquare check-in data to approximate the prior probability for each activity, and compute the posterior probabilities (i.e., infer the trip purposes) using Bayes’ theorem. In the second phase, we take a sophisticated procedure that clusters historical drop-off points and matches the drop-off clusters with CAAs to enable the real-time response. Finally, we evaluate the effectiveness and efficiency of the proposed two-phase framework using real-world data sets, which consist of road network data, check-in data generated by over 38,000 users in one year, and large-scale taxi trip data generated by over 19,000 taxis in one month in Manhattan, New York City, USA. Experimental results demonstrate that the system is able to infer the trip purpose accurately, and can provide recommendation results to passengers within 1.6 s in Manhattan on average, using only a single normal PC.

Index Terms—Travel behaviour, trip purpose, smart city, Bayes’ theorem, trajectory data mining.

Manuscript received March 27, 2017; revised July 18, 2017 and October 9, 2017; accepted November 2, 2017. This work was supported in part by the National Key Research and Development Project of China under Grant 2017YFB1002000, in part by the National Science Foundation of China under Grant 61602067 and Grant 71601024, in part by the Fundamental Research Funds for the Central Universities under Grant 106112017cdjxy180001, in part by the Chongqing Basic and Frontier Research Program under Grant cstc2015jcyjA00016, in part by the Open Research Fund Program of Shenzhen Key Laboratory of Spatial Smart Sensing and Services, Shenzhen University, and in part by the Ministry of Education in China Humanities and Social Sciences Youth Foundation under Grant 16yjc630169. The Associate Editor for this paper was K. Savla. (Corresponding author: Chao Chen.)
C. Chen, S. Jiao, and L. Feng are with the College of Computer Science, Chongqing University, Chongqing 400044, China (e-mail: ivanchao.chen@gmail.com; jiaoshuhai@gmail.com; brightfengs@gmail.com).
S. Zhang is with the School of Economics and Business Administration, Chongqing University, Chongqing 400044, China (e-mail: zhangshu@cqu.edu.cn).
W. Liu is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore (e-mail: liu@ntu.edu.sg).
Y. Wang is with the School of Electronics Engineering and Computer Science, Institute of Software, Peking University, Beijing 100871, China (e-mail: wangys@sei.pku.edu.cn).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TITS.2017.2771231
I. INTRODUCTION
Travel behavioural analysis is an important research topic [20]. In recent years, travel behaviour and patterns have become more complex than before, since modern cities are undergoing rapid urbanization [4], [8], [30]. It is well recognized that travel-related data is an important and valuable source for obtaining a holistic and in-depth understanding of travel behaviours. By analyzing such data, urban planners and policy makers can improve their ability to address urban planning, management, and operation issues [4]. Traditionally, travel-related data was mainly collected manually through paper-and-pencil interviews, computer-assisted telephone interviews, and computer-assisted self-interviews. All these methods suffer from several limitations, including high survey cost, heavy respondent burden, limited temporal and spatial coverage, and underreported trips (inaccuracies) [33].
With the wide proliferation of location-aware devices, including smart phones and GPS-equipped vehicles, in daily life, large volumes of time-stamped locational data of individuals have become easily available [38]. Such data contains a wealth of travel behavior information, such as when and where passengers move around the city at a reasonably high resolution, and sometimes which routes they take. For instance, a taxi trip log tells us the concrete physical coordinates (longitudes and latitudes) and the exact times at which a passenger was picked up and dropped off, as well as the detailed sequence of roads traversed from the source to the destination. Consequently, experimenting with GPS-based data collection methods to supplement or replace the conventional ones is a hot trend. However, the collected GPS data is raw. In general, it lacks semantic information such as the transport mode taken or the activity types performed (travel purposes), i.e., how and why a passenger is moving, which is an essential component for urban computing. Furthermore, compared to enriching the raw data with ‘how’ semantics,¹ existing methods for ‘why’ semantics are still far from accurate [12], [39]. Indeed, there exists a dilemma: trajectory data is rich thanks to emerging passive data collection technologies, but activity information is poor, although such activity information can directly help reveal the purpose of the trips [15]. Hence, this paper is an attempt to narrow the gap between the raw data and people’s activities, with a particular focus on analyzing taxi passengers’ trip purposes.

¹ Note that taxi GPS trajectory data contains the transport mode information explicitly.

1524-9050 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Trip purpose imputation² has been a long-standing research topic for over a decade [9], [13], [15], [16], [26], [41]. But previous studies have rarely addressed the following two issues: 1) Infer the trip purpose at an individual level. More specifically, prior research mainly focuses on interpreting trip purposes at an aggregate level, e.g., city scale, so only smart urban services at the macro level can be enabled. In contrast, to support micro smart urban services, such as recommendation services to each passenger according to his/her travel purpose, the imputation of the trip purpose at the individual level is necessary; 2) Require the real-time response, i.e., returning the corresponding purpose as soon as the trip ends. As a matter of fact, real-time recognition of passengers’ travel purposes not only offers the possibility to understand what people intend to do, but also makes it possible to provide timely recommendation services to passengers. In this way, passengers can undertake and organize their daily activities more efficiently and economically. For example, it is often desirable that restaurant coupons and/or other discount information be delivered to the passenger as soon as he/she gets off the taxi, if he/she is predicted to be going to dine. To the best of our knowledge, no work has been reported in this regard. We would like to clarify that we infer the trip purposes after the information about the drop-off point is revealed. This is because, on one hand, although taxi drivers may be aware of the destinations in advance, such information usually cannot be recorded by the embedded GPS systems automatically until the drivers push the passenger status button (from occupied to free) after arriving at the destinations. On the other hand, accurately predicting the destinations of taxi trips based on their partial trajectories is challenging and can be a separate research problem in itself, which has received intensive attention from the academic community, such as [24], [25], [32], and [34].

² We use ‘inference’, ‘prediction’, and ‘imputation’ interchangeably throughout the whole manuscript.

To enable the real-time taxi trip purpose imputation at the individual level, we need to address the following two challenges:
• Lack of ground truth. The ground truth of the travelling purpose per trip is usually collected by proactively prompted recall [27], where only a very small fraction of users are called to annotate their traces with the activities that they have done. To make matters worse, the ground truth of the annotation is contaminated, since many users simply cannot correctly remember what they have done.
• Real-time response. On one hand, existing algorithms for inferring trip purposes cannot be applied directly, since they do not provide real-time responses. On the other hand, taxi trips are generated continuously and intensively as time goes by, which makes the real-time response even more challenging.
In order to predict with high accuracy what activity a passenger intends to take after getting off the taxi, one should take the drop-off time, the drop-off location, and the nearby geographical context [23] into account. To be more specific, the distribution of different activities that people commonly take (i.e., human behaviours) in the area near the drop-off point at the drop-off time is a useful reference. Fortunately, check-in data, which is left by users when checking in at points of interest (POIs) using LBSNs (i.e., Location-based Social Networks) like Foursquare, contains a detailed description of the POIs (e.g., the hierarchical category, the opening hours) [6], [35]. With the check-in information, it is not difficult to understand the passengers’ travel activities as well as the activity distribution in an area during a given time period [19], [29], [41]. For instance, people visit a restaurant to have food and visit a shopping mall to shop. Thus, the problem of trip purpose inference is transformed into the problem of predicting the probabilities of visiting different POI categories once the passenger gets off the taxi.
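The idea of reading an activity distribution off check-in data can be illustrated with a minimal sketch. It is not the paper's procedure: it only counts check-ins per category for a given hour of day and normalises the counts. The function name `activity_distribution`, the category labels, and the toy records are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime


def activity_distribution(checkins, hour):
    """Share of each POI category among check-ins made in the given hour of day.

    `checkins` is an iterable of (category, timestamp) pairs (a hypothetical,
    simplified view of check-in records restricted to one area).
    """
    counts = Counter(cat for cat, ts in checkins if ts.hour == hour)
    total = sum(counts.values())
    if total == 0:
        return {}  # no evidence for this hour
    return {cat: n / total for cat, n in counts.items()}


# Toy records, invented for illustration only.
checkins = [
    ("Food", datetime(2013, 5, 1, 12, 10)),
    ("Food", datetime(2013, 5, 1, 12, 40)),
    ("Shop & Service", datetime(2013, 5, 1, 12, 55)),
    ("Nightlife Spot", datetime(2013, 5, 1, 23, 5)),
]
noon = activity_distribution(checkins, 12)
```

In this toy example, two of the three check-ins made around noon are in the food category, so dining would dominate the estimated distribution for that hour.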
With the research objectives and challenges discussed above, the main contributions of the paper are:
1) We define a new problem which extends the existing travel purpose inferring problem by requiring real-time response, in order to recommend timely and accurate services to passengers accordingly.
2) We propose a novel two-phase framework based on Bayes’ theorem, called TripImputor, to tackle the real-time taxi trip purpose imputation problem. In Phase I, we first propose a two-stage clustering algorithm to aggregate POIs. We identify urban activity regions (UARs), which are bounded and separated by physical barriers, using road network data (Stage 1). For each UAR, with the passenger’s drop-off location and alighting time as input, we identify candidate activity areas (CAAs) based on POI data (Stage 2). Then, we extract fine-granularity spatial and temporal patterns regarding human behaviours inside the CAAs from Foursquare check-in data to approximate the prior probability for each activity, and compute the posterior probabilities using Bayes’ theorem. In Phase II, to enable the real-time response, after analyzing the computational bottleneck of the first phase, we propose a procedure that includes the clustering of historical drop-off points and the matching between drop-off clusters and CAAs to reduce the online computation time.
3) We conduct extensive evaluations on the effectiveness and efficiency of TripImputor using real-world datasets, which consist of the road network data, the Foursquare check-in data generated by over 38,000 users in one year, and the taxi GPS trajectory data generated by over 19,000 taxis in a month in Manhattan, NYC. Due to the lack of ground truth for each taxi trip, we evaluate the effectiveness indirectly by comparing to the travel survey data in the statistical sense at the regional scale, instead of calculating the prediction accuracy for each trip individually. Experimental results show that TripImputor achieves the best prediction accuracy compared with two baselines. The average response time per taxi trip is about 1.588 seconds. The quickest response time is 40 milliseconds, and the longest response time is 7.54 seconds, which is still acceptable for practical applications.
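The Bayesian step mentioned in the second contribution can be sketched generically as follows. This is only an illustration of Bayes’ theorem over a discrete set of activities, not the paper's actual model: the way the prior and the likelihood are estimated is left open here, and the function name `posterior`, the activity labels, and all numbers are assumptions.

```python
def posterior(prior, likelihood):
    """Posterior P(activity | evidence) from P(activity) and P(evidence | activity)."""
    unnormalised = {a: prior[a] * likelihood.get(a, 0.0) for a in prior}
    evidence = sum(unnormalised.values())  # P(evidence), the normalising constant
    if evidence == 0:
        raise ValueError("evidence has zero probability under every activity")
    return {a: v / evidence for a, v in unnormalised.items()}


# Hypothetical numbers for three activities, chosen only to show the arithmetic.
prior = {"dining": 0.5, "shopping": 0.3, "work": 0.2}
likelihood = {"dining": 0.2, "shopping": 0.6, "work": 0.2}
post = posterior(prior, likelihood)
```

Here the unnormalised scores are 0.10, 0.18, and 0.04, so after dividing by their sum (0.32) shopping becomes the most probable purpose even though dining has the larger prior.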
The rest of the paper is organized as follows. In Section II, we review the related work and show how this paper differs from prior research. In Section III, we introduce several basic concepts and present the problem formulation. We present detailed discussion on our two-phase framework in Section IV and Section V respectively. We evaluate the performance of the proposed framework in Section VI. Finally, we conclude the paper and discuss the future research directions in Section VII.
II. RELATED WORK
A. Semantic Trajectory Enrichment
The passive collection of large-scale locational data with time stamps (trajectory data) has become feasible, both technically and economically, with the rapid development of mobile localization technologies. The data come from many sources, e.g., call detail records from mobile phone users, smart card data from travellers, GPS tracking of private/public vehicles, and so on. The recorded locations come in various formats and resolutions. For instance, GPS-based trajectory data records the physical coordinates of the moving objects, whereas smart card data records the location as a stop name. Besides, some of the data contain the travel mode information explicitly. However, an explicit understanding of the individual’s intention in making the trip is generally lacking. In other words, while such unlabelled data is available, the semantic label of travel purpose is missing.
Extracting high-level semantics from raw data and further using them to better understand the underlying meaningful movement behaviors (e.g., why people move) have attracted many researchers’ attention [22]. Quite a few techniques have been applied to interpret travel purposes in terms of travel activities after the trip. The techniques mainly include deterministic and heuristic rules, machine learning based approaches, and statistical data mining algorithms [9], [13], [15], [16], [26], [33], [41]. To name a few, Wolf [33] proposed using a set of deterministic rules to derive the trip purpose, coupled with land use data. Deng and Ji [9] built a decision tree for trip purpose inference, combining other information provided by GIS data and respondents’ socio-demographics. On the basis of modelling the probability of points of interest being visited using Bayes’ rules, Gong et al. [15] inferred the travel purposes of taxi trips. Although many approaches have been developed to enrich raw trajectories with semantic meaning, prior work never requires a timely response when inferring the trip purpose, so recommendation services cannot be supported.
B. Check-In Data and Taxi Trajectory Data Mining
Check-in data and taxi trajectory data have been mined to support various smart urban applications, and have attracted much attention from researchers in recent years. For example, knowledge hidden in check-in data has been mined to support (personalized) landmark recommendation/search, suggestion of frequently associated POI sequences, understanding of the heat-map of landmark popularity at different times, and so on [6], [35].
Information mined from taxi trajectory data can benefit a number of parties, including taxi drivers, passengers, and city planners. Taxi drivers are mostly interested in making more money while minimizing the fuel cost [10], [14]. Work on recommending the best corner to catch a taxi, ordering free taxis in real time, and estimating the taxi fare aims to improve the experience of passengers, e.g., [1]. An interesting work detected anomalous taxi rides and warned the passengers “on-the-fly” that they were being taken on an unnecessary detour [5]. For city planners, taxi trajectory data provides a rich data source to identify flaws in city planning, probe traffic conditions, estimate travel demands, infer land-use efficiency, suggest bus routes, etc. [2]. Recent studies also combine taxi trajectory data with other data sources, such as POI data, Foursquare check-in data, and Flickr image data, to enable smarter applications, such as inferring building functions, personalized travel route planning, hitchhiking package deliveries, and so on [6], [7], [36]. However, to the best of our knowledge, this is the first study on inferring trip purpose in real time, leveraging the complementary knowledge embedded in the multi-sourced urban data.
III. BASIC CONCEPTS AND PROBLEM STATEMENT
A. Basic Concepts
Definition 1 (Road Network): A road network is a graph $G(N, E)$, consisting of a node set $N$ and an edge set $E$, where each element $n$ in $N$ is an intersection with a pair of longitude and latitude coordinates $(x, y)$ representing its spatial location. The edge set $E$ is a subset of the Cartesian product $N \times N$. Each element $e(u, v)$ in $E$ is a street connecting node $u$ and node $v$, which has several attributes including speed limit, number of lanes, and street level.³

³ The road network can be crawled from an open crowdsourced platform, i.e., OpenStreetMap. Refer to www.openstreetmap.org for more details.

Definition 2 (A Taxi Drop-Off Point): A taxi drop-off point ($p_i$) is defined as a time-stamped location where the passenger was dropped off, denoted by $((x_i, y_i), t_i)$.
Definition 3 (POI Category): A POI category is a semantic label for a place, indicating the correlation between the place and potential human activities.
Foursquare maintains a three-level ontology structure for category description [6]. In the first level, it has 9 categories in total. In the second and third levels, it has 412 sub-/sub-subcategories in total. Table I shows the trip purposes (travel activities) and the corresponding primary POI categories [15].
Definition 4 (A Check-In): A check-in is represented by a triple $ck = (u_{id}, v_{id}, t_i)$, indicating that a user with id $u_{id}$ checked in at a venue (i.e., POI) with id $v_{id}$ at time $t_i$ using Foursquare.
In general, a POI (venue) that is frequently checked in at by many users is popular and attractive. In addition, Foursquare provides the physical coordinates, tags, and opening-hours information of any given venue.

TABLE I
NINE TRIP PURPOSES AND THE CORRESPONDING PRIMARY POI CATEGORIES

Definition 5 (Response Time): The response time is defined as the time difference between the drop-off time, when the passenger gets off the taxi, and the time when the passenger receives the recommendation services.
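To make the definitions above concrete, one possible in-memory representation is sketched below. The class and field names (`DropOffPoint`, `CheckIn`, `RoadNetwork`, `response_time`) are hypothetical and chosen for illustration; the paper does not prescribe any particular data layout.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Tuple


@dataclass(frozen=True)
class DropOffPoint:
    """Definition 2: a time-stamped location ((x, y), t)."""
    x: float  # longitude
    y: float  # latitude
    t: datetime


@dataclass(frozen=True)
class CheckIn:
    """Definition 4: a triple (u_id, v_id, t)."""
    uid: str
    vid: str
    t: datetime


@dataclass
class RoadNetwork:
    """Definition 1: nodes are intersections, edges are streets with attributes."""
    nodes: Dict[int, Tuple[float, float]] = field(default_factory=dict)
    edges: Dict[Tuple[int, int], dict] = field(default_factory=dict)


def response_time(drop_off: DropOffPoint, delivered_at: datetime) -> float:
    """Definition 5: seconds between drop-off and delivery of the recommendation."""
    return (delivered_at - drop_off.t).total_seconds()


# Toy values, invented for illustration.
p = DropOffPoint(-73.98, 40.75, datetime(2013, 5, 1, 12, 0, 0))
delay = response_time(p, datetime(2013, 5, 1, 12, 0, 2))
```

The frozen dataclasses keep drop-off points and check-ins immutable and hashable, which is convenient if they are later grouped or used as dictionary keys.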
B. Problem Statement
Inferring the taxi trip purposes leveraging multi-sourced urban data can be viewed as predicting the probabilities of taking each of the nine activities, which can be formulated as:
Given:
1) A drop-off point $((x_r, y_r), t_r)$, which is generated in real time;
2) A set of historical check-ins $\{(u_{id}, v_{id}, t_i)\}$ (e.g., from the last month), together with check-ins accumulated several hours before $t_r$ in the designated city;
3) POIs in the designated city, which can be obtained from the check-in data;
4) A road network $G(N, E)$ of the designated city.
Predict the probabilities of taking each of the nine activities respectively for the drop-off point (the objective of Phase I), and provide timely service recommendations related to the top-ranked trip purposes (activities with top probabilities) for the passenger (the objective of Phase II).
IV. PHASE I: IMPUTING TRIP PURPOSES
A. Urban Activity Region Identification
Human beings are social by nature (i.e., most people live and work together with others), so it is highly likely that people take activities in a small and scattered fraction of the whole city space. A preliminary step for inferring the travel purpose of passengers is therefore to identify all the scattered activity regions in the urban space. For ease of presentation, we call these regions Urban Activity Regions (UARs).
Urban activity regions are bounded and separated by physical barriers such as main roads, rivers, and mountains, as witnessed throughout the history of human civilization and urbanization [28], [40]. Each UAR is isolated and bounded by main road segments (or rivers), and covers several neighborhoods and narrow streets. Inside each UAR, passengers can easily travel between two points if they are close to each other. Usually, passengers who get off a taxi on one side of a primary road will not cross it (i.e., go to the other side) to take activities, because the road is a major barrier. In contrast, passengers who get off on small and narrow streets can easily walk in other directions.
Fig. 1. Illustrative example of determining the region that a given POI belongs to (top left); illustrative examples of assigning a huge number of POIs to regions (top right and bottom left); the identified CAAs for the illustrative example (bottom right).
Based on the above observations, in this paper we mainly rely on the road network data to identify the UARs in the target city. We propose a two-step procedure to divide the whole city into a number of disjoint UARs.
Step 1: We extract the road network data, including the coordinates of nodes, the edges, and the attributes of edges (e.g., number of lanes, speed limits, road levels/types), from an open crowdsourced platform, i.e., OpenStreetMap. Using the road level/type attribute, we keep only the high-level road segments, i.e., those tagged as ‘motorway’, ‘trunk’, or ‘primary’.
Step 2: On the trimmed road network, which consists only of high-level road segments, we apply the image-processing-based map segmentation algorithm in [37] to obtain connected components. Each connected component is one separated urban activity region (UAR; $R_1$–$R_5$ in Fig. 1).
B. Candidate Activity Area Identification
It is well known that POIs are the most common activity units for human beings. When people travel by taxi, on the one hand, they prefer to get off as close to their true destination as possible. On the other hand, in a modern city, many POIs of different categories are often located in the same building (e.g., a shopping mall). In this respect, people are most likely to be attracted by the nearest one or two buildings after getting off the taxi. Hence, we propose the concept of a candidate activity area (CAA), in which different POIs are located close to each other. The CAAs correspond to small areas, and we use the CAA as the activity unit for taxi passengers.
To identify the CAAs, we first determine which UAR each POI belongs to. Then, we aggregate the POIs belonging to the same UAR into several clusters based on spatial proximity. Finally, we identify each cluster as a CAA. In this sense, a UAR contains several CAAs. However, assigning POIs to UARs is challenging, since we have to address the following two issues:
1) Each UAR usually has an arbitrary shape, so we cannot simply compare the POI locations with the locations of the UAR boundaries. A simpler but essential subproblem is the point-in-polygon problem [31], i.e., determining whether a given point is inside or outside a given closed polygon (i.e., region), which is proved to be hard [17].
2) The number of POIs is huge (e.g., there are more than 10k POIs in Manhattan, NYC), so efficiently determining which UAR each POI is located in is also challenging.
Algorithm 1 Algorithm for Determining the Region That a Given POI Belongs to
Input: a given POI ($p_i$); the trimmed road network and the identified UARs in the target city;
Output: the UAR in which the given point is located, denoted by $R_i = PinR(p_i)$.
Step 1: Based on the location of the given point ($p_i$), find its nearest node $n_i$;
Step 2: According to the identified node $n_i$ and the topology of the high-level road network, identify all the regions that share the node $n_i$. We denote these regions by $\{R_i\}$;
Step 3.a: For each region in $\{R_i\}$, apply the ifPinR algorithm to check whether $p_i$ is inside that region;
Step 3.b: The loop ends when ifPinR returns 1.
To deal with the first issue, we apply a popular and mature algorithm to determine the relationship (i.e., inside or outside) between a given point and a given region [18]. For simplicity, we denote the algorithm by $ifPinR(p_i, r_i)$: if the point $p_i$ is inside region $r_i$, $ifPinR(p_i, r_i)$ returns 1; otherwise, it returns 0. To determine which region a given point belongs to, we propose an algorithm that calls ifPinR repeatedly; its pseudo-code is presented in Algorithm 1. For the given point, Step 1 and Step 2 identify all the regions that it may belong to, according to the geometric relationship in space. Note that a region is represented by a sequence of nodes in clockwise order. For instance, the possible regions for $p_i$ in the illustrative example (top left of Fig. 1) are marked as $R_1$, $R_2$, and $R_3$. Step 3 repeatedly calls ifPinR. The number of loop iterations is usually small, since the possible region set contains only a few regions: in the best case it is 1, and in the worst case it equals the size of the possible region set. In the illustrative example it is 1, since ifPinR returns 1 when checking $R_1$ in the first iteration.
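As an illustration of Step 3 of Algorithm 1, the following is a minimal sketch that assumes ifPinR is implemented by even-odd ray casting; the source only cites [18] for this test, so the choice of ray casting is an assumption. The names `if_p_in_r` and `p_in_r` are hypothetical, and the candidate regions (the output of Steps 1 and 2) are assumed to be given as lists of planar vertices.

```python
from typing import Dict, Optional, Sequence, Tuple

Point = Tuple[float, float]


def if_p_in_r(p: Point, region: Sequence[Point]) -> int:
    """Return 1 if p lies inside the polygon `region`, else 0.

    Even-odd ray casting: count how many polygon edges a horizontal ray
    starting at p crosses; an odd count means p is inside.
    """
    x, y = p
    inside = False
    n = len(region)
    for a in range(n):
        (x1, y1), (x2, y2) = region[a], region[(a + 1) % n]
        # Only edges that straddle the horizontal line through p can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return 1 if inside else 0


def p_in_r(p: Point, candidates: Dict[str, Sequence[Point]]) -> Optional[str]:
    """Step 3 of Algorithm 1: test the candidate regions until one contains p."""
    for name, polygon in candidates.items():
        if if_p_in_r(p, polygon) == 1:
            return name  # Step 3.b: stop at the first hit
    return None  # p is in none of the candidates
```

With only a handful of candidate regions per POI, the loop in `p_in_r` runs at most as many times as there are candidates, which matches the best-case and worst-case counts discussed above.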
To deal with the second issue, a straightforward but computationally expensive method is to check each POI with Algorithm 1. In theory, the computational complexity is $O(N \times M \times C)$, where $N$ is the number of POIs, $M$ is the average number of possible regions for a given POI (which is usually small), and $O(C)$ is the complexity of the ifPinR algorithm. Therefore, to accelerate the computation, we should reduce the number of POIs to be checked. In fact, some POIs need not be checked at all. More specifically, once we have determined the region in which a given POI is located, we can directly infer with high confidence that its ‘nearby’ POIs are located inside the same region. Inspired by this observation, we propose a novel and efficient algorithm to determine the regions of the POIs. Briefly, the algorithm consists of random POI selection, point-in-region determination, and cell growing, as illustrated in Algorithm 2.
Algorithm 2 Algorithm for Determining Regions That a Huge Number of POIs Belong to
Input: a pool of POIs ($\{p_i\}$) and a set of UARs ($\{R_i\}$) in the target city;
Output: $\{R_i\} = PinR(\{p_i\})$.
Step 1: Randomly select a POI from $\{p_i\}$ (e.g., $p_s$);
Step 1.1: $R_s = PinR(p_s)$;
Step 2: Take $p_s$ as the center and get a grid cell with equal width and length ($g_0$);
Step 2.1: $g_i = g_0$;
Step 3: If $g_i$ has no intersection with the boundary of $R_s$, then
Step 3.1: Identify all POIs inside the grid cell based on the geometric relationship (denoted by $P_{sub}(g_i)$); the POIs in $P_{sub}(g_i)$ should all be located in $R_s$;
Step 3.2: $\{p_i\} = \{p_i\} - P_{sub}(g_i)$;
Step 3.3: Increase the grid cell size by 50% ($g_{i+1} = 1.5 \times g_i$);
Step 4: Repeat Steps 1–3 until $\{p_i\}$ is empty.
In the first step, we randomly pick a POI from the pool and determine which region it belongs to (Step 1.1) based on Algorithm 1. In the second step, we create a grid cell centered at the selected POI. Fig. 1 (top right) shows the result after the first two steps. If the grid cell does not intersect the region boundaries, all POIs inside the cell must be located in the same region as the selected POI (Step 3.1). Thus, there is no need to check those POIs, and we remove them from the POI pool directly (Step 3.2). To further increase the number of POIs that need no check, the grid cell then grows to contain more POIs (Step 3.3), as demonstrated in Fig. 1 (bottom left). Once the grid cell ($g_i$) crosses the region boundary, the algorithm restarts the whole procedure from the first step by randomly selecting a new POI. The process terminates when no POI is left in the pool (Step 4). Finally, each POI is associated with the label of the region that it belongs to.
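The cell-growing idea of Algorithm 2 can be sketched as follows. This is a simplified illustration, not the authors' implementation: regions are assumed to be axis-aligned rectangles so that the boundary-intersection test stays short, `region_of` stands in for Algorithm 1, and all names are hypothetical. The sketch also labels the randomly selected POI itself as soon as its region is known, which guarantees progress even when the initial cell already crosses the boundary.

```python
import random
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def region_of(p: Point, regions: Dict[str, Rect]) -> str:
    """Stand-in for Algorithm 1 on rectangular regions (an assumption)."""
    for name, (x0, y0, x1, y1) in regions.items():
        if x0 <= p[0] < x1 and y0 <= p[1] < y1:
            return name
    raise ValueError(f"{p} is outside every region")


def label_pois(pois: List[Point], regions: Dict[str, Rect],
               g0: float = 1.0, seed: int = 0) -> Dict[Point, str]:
    """Assign a region label to every POI while calling region_of only on seeds."""
    rng = random.Random(seed)
    pool = set(pois)
    labels: Dict[Point, str] = {}
    while pool:                                  # Step 4: until the pool is empty
        ps = rng.choice(sorted(pool))            # Step 1: random seed POI
        rs = region_of(ps, regions)              # Step 1.1
        labels[ps] = rs                          # the seed itself is now known
        pool.discard(ps)
        x0, y0, x1, y1 = regions[rs]
        half = g0 / 2.0                          # Step 2: square cell around ps
        # Step 3: keep growing while the cell stays strictly inside R_s.
        while (x0 < ps[0] - half and ps[0] + half < x1
               and y0 < ps[1] - half and ps[1] + half < y1):
            covered = {q for q in pool
                       if abs(q[0] - ps[0]) <= half and abs(q[1] - ps[1]) <= half}
            for q in covered:                    # Step 3.1: same region as ps
                labels[q] = rs
            pool -= covered                      # Step 3.2: no check needed
            half *= 1.5                          # Step 3.3: grow the cell by 50%
    return labels
```

For arbitrary polygons, the rectangle containment test would be replaced by a test of the cell against the region boundary; the control flow stays the same.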
For POIs inside the same UAR (i.e., POIs with the same region label), we apply the popular DBSCAN algorithm to obtain clusters, since it can identify clusters of different densities and shapes [11]. POIs that are close to each other and within the same UAR are identified as a Candidate Activity Area (CAA). In contrast, as demonstrated in Fig. 1 (bottom right), POIs located in different UARs are grouped into different CAAs, even if they are close to each other.
Remark: Although the clustering and identification of CAAs can be done offline, accelerating the procedure is still beneficial, since a city has a huge number of POIs and dozens of regions. Moreover, the POIs in a city are dynamic (some POIs disappear while others emerge), which necessitates regular updates of the CAAs. An efficient algorithm for clustering and identifying CAAs is therefore desirable.
C. Trip Purpose Imputation
The objective of trip purpose imputation is to predict the POI category that the passenger intends to visit after getting off the taxi, given the drop-off location and the drop-off time. We denote the drop-off information of the passenger by $((x, y), t)$. To infer the trip purpose correctly, several factors need to be considered. The first is the distance from the passenger’s final destination to the drop-off location: the closer a POI is to the drop-off point, the more likely it is to be visited, since taxis offer door-to-door service and most passengers prefer to get off as close as possible to their final destination. The second factor is the distribution of POI categories near the drop-off point. For a trip heading to an area mostly occupied by restaurants, the trip purpose is probably dining. Last but not least, the alighting time of the passenger is also vital, as people take different activities at different times.
To integrate the above three factors, we take the following three major steps. First, given the location of the drop-off point, we select the top-$k$ nearest CAAs within walkable distance (e.g., 500 meters). Passengers visit the top-$k$ CAAs with different probabilities: the closer a CAA is to the drop-off point, the higher the probability that it will be visited, which exhibits the distance decay effect. Specifically, the probability that a CAA will be visited is determined by Eq. 1.
$$P(CAA_i \mid (x, y)) \propto (d_i)^{-\beta} \quad \text{s.t.} \quad \sum_{i=1}^{k} P(CAA_i \mid (x, y)) = 1 \tag{1}$$
where $d_i$ refers to the Euclidean distance from the center of $CAA_i$ to the drop-off point $(x, y)$ of the passenger; $k$ is the number of nearby CAAs considered, which is set to 3 in our study; and $\beta$ is the distance decay parameter. We set $\beta = 1.5$, which is also consistent with existing findings in [6] and [20].
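As a worked illustration of Eq. 1, the sketch below computes the normalized distance-decay probabilities for the top-$k$ CAAs. It assumes that coordinates are already projected to a planar system in meters and that the decay exponent is negative (closer CAAs receive higher probability); the function name and the small distance floor that avoids division by zero are hypothetical choices.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def caa_visit_probabilities(drop_off: Point, caa_centers: Dict[str, Point],
                            k: int = 3, beta: float = 1.5,
                            max_dist: float = 500.0) -> Dict[str, float]:
    """Eq. 1: P(CAA_i | (x, y)) proportional to d_i ** (-beta), normalized over the top-k."""
    # Keep only CAAs within walkable distance, then take the k nearest.
    candidates = sorted(
        (math.dist(drop_off, center), name)
        for name, center in caa_centers.items()
        if math.dist(drop_off, center) <= max_dist
    )[:k]
    # Floor the distance so a drop-off exactly at a centroid does not divide by zero.
    weights = {name: max(d, 1e-9) ** (-beta) for d, name in candidates}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}
```

For instance, with CAA centers at 100 m, 200 m, and 400 m from the drop-off point, the nearest CAA receives roughly two thirds of the probability mass, because halving the distance multiplies the weight by $2^{1.5}$.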
Second, even once the visited CAA has been determined, predicting the POI category for the passenger is still challenging, because the CAA contains different POIs, each with its own category and visiting popularity [15]. To alleviate this issue, inside a determined CAA (e.g., $CAA_i$), we compute the probability of visiting each POI category (i.e., taking each activity) based on Bayes’ theorem [21], as shown in Eqs. 2 and 3.
$$P(a_j \mid (x, y), t, CAA_i) = \frac{P((x, y) \mid a_j, t, CAA_i) \times P(a_j \mid t, CAA_i) \times P(t, CAA_i)}{P((x, y), t, CAA_i)} \tag{2}$$
$$P((x, y), t, CAA_i) = \sum_{j=1}^{n} P((x, y) \mid a_j, t, CAA_i) \times P(a_j \mid t, CAA_i) \times P(t, CAA_i) \tag{3}$$
Here, $n$ is the total number of activities considered in the paper, and $P((x, y) \mid a_j, t, CAA_i)$ represents the probability that a passenger gets off the taxi at location $(x, y)$ if he/she has decided to take activity $a_j$ in $CAA_i$ at time $t$. Gong et al. [15] simply assume that the location and the time of the drop-off point are conditionally independent given the activity type ($a_j$), i.e., that the following equation holds.
$$P((x, y) \mid a_j, t, CAA_i) = P((x, y) \mid a_j, CAA_i) \tag{4}$$
However, we argue that Eq. 4 does not hold in most cases, since where passengers choose to get off depends not only on the nearby land use (i.e., the spatial context) [9], [33], but also on the alighting time. For instance, passengers may get off near a shopping plaza to shop, whereas in the evening passengers might get off in a business district to have a meal. In other words, the location and the time of the drop-off point are interdependent. Here, we approximate the true value of $P((x, y) \mid a_j, t, CAA_i)$ by jointly considering the attractiveness of the CAA and its distribution of POIs over categories, as shown in Eq. 5.
$$P((x, y) \mid a_j, t, CAA_i) \propto \frac{\text{numberofPOIs}(a_j, CAA_i)}{\text{numberofPOIs}(CAA_i)} \times A_i(t) \quad \text{s.t.} \quad \sum_{j=1}^{n} P((x, y) \mid a_j, t, CAA_i) = 1 \tag{5}$$
Here, $\text{numberofPOIs}(CAA_i)$ and $\text{numberofPOIs}(a_j, CAA_i)$ in Eq. 5 refer to the number of POIs and the number of POIs related to $a_j$ within $CAA_i$, respectively; $A_i(t)$ refers to the attractiveness of $CAA_i$ at the given time slot, which is measured by the popularity of $CAA_i$ at that time relative to the other CAAs in the top-$k$ list. In more detail, we calculate $A_i(t)$ by dividing the number of check-ins in $CAA_i$ by the total number of check-ins in all top-$k$ CAAs during the given time slot over the historical days (e.g., the last month), as shown in Eq. 6. Note that the check-ins and categories of POIs are easy to extract from the Foursquare platform.
$$A_i(t) = \frac{\text{numberofCheckins}(CAA_i, t, days)}{\sum_{i=1}^{k} \text{numberofCheckins}(CAA_i, t, days)} \tag{6}$$
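The two quantities in Eqs. 5 and 6 can be sketched as below. The function names and the dictionary layout are hypothetical, and the uniform fallback for a time slot without check-ins is an assumption that the source does not discuss. Note that normalizing over $j$, as the constraint in Eq. 5 requires, would cancel the common factor $A_i(t)$, so this sketch returns the unnormalized product and leaves normalization to the caller.

```python
from typing import Dict


def attractiveness(checkins_per_caa: Dict[str, int]) -> Dict[str, float]:
    """Eq. 6: share of historical check-ins of each top-k CAA in the time slot."""
    total = sum(checkins_per_caa.values())
    if total == 0:
        # Assumption: with no check-ins at all, treat the CAAs as equally attractive.
        return {c: 1.0 / len(checkins_per_caa) for c in checkins_per_caa}
    return {c: n / total for c, n in checkins_per_caa.items()}


def location_likelihood(poi_counts: Dict[str, int], a_i: float) -> Dict[str, float]:
    """Eq. 5 before normalization: POI-category share in CAA_i times A_i(t).

    `poi_counts` maps each activity a_j to the number of related POIs in CAA_i;
    a CAA is a cluster of POIs, so the total is assumed to be positive.
    """
    n_pois = sum(poi_counts.values())
    return {a: (n / n_pois) * a_i for a, n in poi_counts.items()}
```

A usage example: with 30, 50, and 20 historical check-ins in the three nearest CAAs, the first CAA has attractiveness 0.3; if 6 of its 10 POIs are restaurants, the unnormalized value for dining is 0.6 × 0.3 = 0.18.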
$P(a_j \mid t, CAA_i)$ in Eq. 2 is the probability of taking activity $a_j$ if the passenger is located in $CAA_i$ at time $t$. The distribution of $P(a_j \mid t, CAA_i)$ depends on the spatial and temporal patterns of human activity in that area. It is well recognized that human activities present strong and regular patterns. For instance, along the time dimension, the probability of visiting work-related places during 8:00 am-10:00 am is generally much higher than that of visiting shopping malls. Along the space dimension, the patterns vary across geographical areas. To capture such temporal and spatial regularities at a fine granularity, we again rely on the check-ins from Foursquare. Given the time $t$ and the candidate activity area $CAA_i$, we approximate the probability of visiting a certain POI category (i.e., taking activity $a_j$) by the ratio of the number of check-ins in that POI category to the total number of check-ins in $CAA_i$ during the given time slot over the historical days (e.g., the last month), as shown in Eq. 7.
$$P(a_j \mid t, CAA_i) = \frac{\text{numberofCheckins}(a_j, CAA_i, t, days)}{\text{numberofCheckins}(CAA_i, t, days)} \tag{7}$$
Although strong and regular patterns (i.e., regularity) of human behaviours are frequently observed, dynamics are another salient feature. For instance, human behaviours are interrupted and changed by unexpected, sudden, large-scale social events. To capture such changes, we additionally incorporate the most recent check-ins in the studied area, since the live data can reflect the affected human activities in a timely manner. Therefore, the probability is updated by Eq. 8.
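$$P(a_j \mid t, CAA_i) \propto \underbrace{\alpha \times \frac{\text{numberofCheckins}(a_j, CAA_i, t, days)}{\text{numberofCheckins}(CAA_i, t, days)}}_{\text{regularity}} + \underbrace{(1 - \alpha) \times \frac{\text{numberofCheckins}(a_j, CAA_i, t, 4h)}{\text{numberofCheckins}(CAA_i, t, 4h)}}_{\text{dynamic}} \tag{8}$$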
where $\text{numberofCheckins}(a_j, CAA_i, t, 4h)$ refers to the number of check-ins in the given POI category and $\text{numberofCheckins}(CAA_i, t, 4h)$ to the total number of check-ins in $CAA_i$, both counted over the most recent four hours just before time $t$. $\alpha$ is a weighting factor (we set $\alpha = 0.9$ in this study). We note that the probability obtained by Eq. 8 needs to be normalized, i.e., $\sum_{j=1}^{n} P(a_j \mid t, CAA_i) = 1$, with $n$ being the total number of activities considered in the paper.
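The blending of the "regularity" and "dynamic" terms in Eq. 8 can be sketched as follows. The function names are hypothetical; treating an empty four-hour window as contributing zero (so that the final normalization falls back to the historical shares) and returning a uniform distribution when no check-ins exist at all are assumptions that the source does not state.

```python
from typing import Dict, Iterable


def category_shares(checkins: Dict[str, int], activities: Iterable[str]) -> Dict[str, float]:
    """Eq. 7 style ratio: check-ins per activity over all check-ins in the CAA."""
    acts = list(activities)
    total = sum(checkins.get(a, 0) for a in acts)
    if total == 0:
        return {a: 0.0 for a in acts}  # no data in this window
    return {a: checkins.get(a, 0) / total for a in acts}


def activity_prior(historical: Dict[str, int], recent_4h: Dict[str, int],
                   activities: Iterable[str], alpha: float = 0.9) -> Dict[str, float]:
    """Eq. 8: alpha * regularity + (1 - alpha) * dynamic, then normalized over activities."""
    acts = list(activities)
    regular = category_shares(historical, acts)
    dynamic = category_shares(recent_4h, acts)
    blended = {a: alpha * regular[a] + (1.0 - alpha) * dynamic[a] for a in acts}
    z = sum(blended.values())
    if z == 0:
        return {a: 1.0 / len(acts) for a in acts}  # assumption: uniform fallback
    return {a: v / z for a, v in blended.items()}
```

With $\alpha = 0.9$, a burst of recent dining check-ins shifts the prior toward dining only moderately, which reflects the heavier weight placed on the historical pattern.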
$P(t, CAA_i)$ in Eq. 2 is the probability of taking activities in $CAA_i$ after the passenger gets off the taxi at time $t$, which is computed by Eq. 9.
$$P(t, CAA_i) = P(t) \times P(CAA_i \mid t) \tag{9}$$
Fig. 2. Illustration of the computation of $P(t, CAA_i)$. The value in a grid cell refers to the probability of taking activities in the corresponding CAA after the end of the corresponding trip.
The probability of a passenger getting off a taxi at time $t$ (i.e., $P(t)$) differs across times of the day, since human activity has strong temporal regularity. $P(t)$ can be estimated by the ratio of the number of drop-offs during the given time slot to the number of drop-offs during the whole day. The computation of $P(CAA_i \mid t)$ is a bit more complicated, so we use the example in Fig. 2 to illustrate the basic idea. Suppose that 6 taxi trips occurred during the given time slot and that 8 CAAs have been identified. For each taxi trip, the passenger chooses one of the CAAs to take activities in after getting off. Furthermore, as discussed earlier in this section, for each trip we assume that the passenger takes activities in one of the top-$k$ CAAs within walkable distance. In the example, the value of a grid cell (e.g., $g_{ij}$) refers to the probability that the passenger of taxi trip $tr_j$ takes activities in area $CAA_i$, which is computed by Eq. 1. For each time slot, the probability of taking activities in a given CAA ($CAA_i$) is simply the average of the values in the corresponding row, i.e.,
$$P(CAA_i \mid t) = \frac{\sum_{m=1}^{N} g_{im}}{N} \tag{10}$$
where $N$ is the number of taxi trips that occurred in the studied time slot.
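Eqs. 9 and 10 amount to simple averaging, as the following sketch shows. Each trip is represented by a dictionary of its Eq. 1 probabilities over its top-$k$ CAAs; treating CAAs outside a trip's top-$k$ list as having probability 0 is an assumption about the grid in Fig. 2, and the function names are hypothetical.

```python
from typing import Dict, List


def p_time_slot(dropoffs_in_slot: int, dropoffs_in_day: int) -> float:
    """P(t): share of the day's drop-offs that fall into the time slot."""
    return dropoffs_in_slot / dropoffs_in_day


def p_caa_given_slot(trip_columns: List[Dict[str, float]],
                     caas: List[str]) -> Dict[str, float]:
    """Eq. 10: average, over the N trips in the slot, of each CAA's row values."""
    n_trips = len(trip_columns)
    return {c: sum(col.get(c, 0.0) for col in trip_columns) / n_trips for c in caas}


def p_slot_and_caa(p_t: float, p_caa_t: Dict[str, float]) -> Dict[str, float]:
    """Eq. 9: P(t, CAA_i) = P(t) * P(CAA_i | t)."""
    return {c: p_t * p for c, p in p_caa_t.items()}
```

Because every trip's probabilities sum to 1, the averaged values over all CAAs also sum to 1, so $P(CAA_i \mid t)$ is a proper distribution without further normalization.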
In summary, for the taxi trip $(x, y, t)$, the probability that the passenger takes a given activity $a_j$ after getting off the taxi can be approximated by the following equation.
$$P(a_j \mid (x, y), t) \propto P(CAA_i \mid (x, y)) \times P(a_j \mid (x, y), t, CAA_i) \quad \text{s.t.} \quad \sum_{j=1}^{n} P(a_j \mid (x, y), t) = 1 \tag{11}$$
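The posterior of Eqs. 2 and 3 and the final combination of Eq. 11 can be sketched together. Eq. 11 as printed leaves the CAA index $i$ free on the right-hand side; the sketch below assumes that the contributions of the top-$k$ CAAs are summed (as the law of total probability would suggest) before normalizing over activities, which is an interpretation rather than something the source states. All function names are hypothetical.

```python
from typing import Dict


def posterior_in_caa(likelihood: Dict[str, float], prior: Dict[str, float],
                     p_t_caa: float) -> Dict[str, float]:
    """Eqs. 2 and 3: posterior over activities inside one CAA.

    likelihood[a] = P((x, y) | a, t, CAA_i), prior[a] = P(a | t, CAA_i),
    p_t_caa = P(t, CAA_i). The evidence is assumed to be positive.
    """
    joint = {a: likelihood[a] * prior[a] * p_t_caa for a in prior}
    evidence = sum(joint.values())                 # Eq. 3
    return {a: v / evidence for a, v in joint.items()}  # Eq. 2


def trip_purpose(visit_prob: Dict[str, float],
                 posteriors: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Eq. 11 under the summation assumption: weight each CAA's posterior by Eq. 1."""
    scores: Dict[str, float] = {}
    for caa, w in visit_prob.items():
        for activity, p in posteriors[caa].items():
            scores[activity] = scores.get(activity, 0.0) + w * p
    z = sum(scores.values())
    return {a: s / z for a, s in scores.items()}
```

The top-ranked entries of the returned dictionary would then drive the service recommendations of Phase II.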
V. PHASE II: ENABLING REAL-TIME RESPONSE
To enable a real-time response for each drop-off event (i.e., to compute the posterior probability of each activity for each drop-off point using Bayes’ theorem in real time), we need to identify the most time-consuming components. As discussed in Section IV, the posterior probability calculation mainly consists of four components, the details of which are shown in Table II.
TABLE II. DETAILS ON EACH COMPONENT OF INFERRING TRIP PURPOSES
Fig. 3. A schematic diagram of reducing the time complexity of the first component. The value on each edge carries the visiting probability to the corresponding CAA.
As shown in the table, the first component is the probability of visiting a given candidate activity area ($CAA_i$) if the passenger was dropped off at point $(x, y)$. In principle, this probability has to be computed online, because the distances to the top-$k$ nearest CAAs vary with the drop-off point. However, we argue that two drop-off points that are close to each other have similar values of $P(CAA_i \mid (x, y))$, i.e., $P(CAA_i \mid (x_1, y_1)) \approx P(CAA_i \mid (x_2, y_2))$ if $(x_1, y_1)$ is close to $(x_2, y_2)$. Hence, we aggregate historical drop-off points into drop-off clusters and assume that all drop-off points in the same cluster have an equal value of $P(CAA_i \mid (x, y))$. In this way, the value of the first component can be pre-computed offline. The only online job is to identify which drop-off cluster a newly received real-time drop-off point belongs to, which is quite efficient. In this manner, the computation time can be reduced significantly. As shown in Fig. 3, the top-$k$ CAAs of a drop-off cluster can be identified, and the distance to each CAA is measured between the centroid of the drop-off cluster and the centroid of that CAA. Thus, the probability of visiting $CAA_i$ from any drop-off point inside the drop-off cluster can be calculated offline. We note that many drop-off clusters can be obtained in advance from the historical taxi trip data. Each drop-off cluster is associated with $k$ visiting probabilities, one for each of its top-$k$ nearby CAAs.
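The offline/online split for the first component can be sketched as a lookup table keyed by drop-off cluster. This is a minimal illustration under several assumptions: drop-off clusters are summarized by their centroids, a new drop-off point is matched to the nearest centroid by a linear scan (the source does not specify the matching rule), coordinates are planar, and the class and function names are hypothetical.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def decay_probs(origin: Point, caa_centers: Dict[str, Point],
                k: int = 3, beta: float = 1.5) -> Dict[str, float]:
    """Eq. 1 evaluated from `origin` to its k nearest CAA centroids."""
    nearest = sorted((math.dist(origin, c), name) for name, c in caa_centers.items())[:k]
    weights = {name: max(d, 1e-9) ** (-beta) for d, name in nearest}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


class DropOffIndex:
    """Offline table: drop-off cluster -> visiting probabilities of its top-k CAAs."""

    def __init__(self, cluster_centroids: Dict[str, Point],
                 caa_centers: Dict[str, Point]) -> None:
        self.centroids = cluster_centroids
        # Pre-computed once, using centroid-to-centroid distances as in Fig. 3.
        self.table = {cid: decay_probs(c, caa_centers)
                      for cid, c in cluster_centroids.items()}

    def lookup(self, drop_off: Point) -> Tuple[str, Dict[str, float]]:
        """Online job: find the nearest cluster centroid and read its stored row."""
        cid = min(self.centroids,
                  key=lambda c: math.dist(drop_off, self.centroids[c]))
        return cid, self.table[cid]
```

At query time, only `lookup` runs; with many clusters, the linear scan could be replaced by a spatial index, which is a design choice outside what the source describes.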
The second component is the probability of getting off the taxi at point $(x, y)$ if the passenger walks to area $CAA_i$ and intends to take activity $a_j$ at time $t$. As discussed earlier, two factors are considered. The first is the attractiveness of $CAA_i$ at the given time slot, which is measured by the popularity of that area. Note that the popularity of a CAA at a given time slot can be calculated in advance from the historical check-in data contributed by mobile users. The second factor is the POI category distribution in $CAA_i$, which remains relatively stable. Thus, the value of the second component can obviously be pre-computed offline.
The third component is the conditional probability of taking a given activity (e.g., $a_j$) if the passenger is in $CAA_i$ at time $t$. To approximate the true value of this component, both the “regularity” and the “dynamic” patterns of the area are taken into consideration. As shown in Eq. 8, the “regularity” pattern is based on the historical check-in data, and the “dynamic” pattern is captured by the most recent check-in data just before the drop-off time. Thus, the former part can be pre-processed offline, while the latter part can only be computed online.
The fourth component is the joint probability of visiting the area $CAA_i$ at time $t$. Its value is determined by two parts: the frequency of getting off taxis in the given time slot, and the spatial distribution of the drop-off points. Both parts are quantified using the historical taxi trip data. Thus, the value can be pre-computed offline.
In summary, two online jobs are required when a streaming drop-off point $(x_r, y_r, t_r)$ is received: identifying the drop-off cluster and extracting the “dynamic” patterns of the top-$k$ CAAs. With the other components computed and structured offline, the whole process can be quite efficient. We validate this in the experiments.
VI. EVALUATION
A. Experimental Setup
1) Data Preparation: Three data sets covering the Manhattan area of New York City (NYC) are used, i.e., the road network, the Foursquare check-in data, and the taxi GPS trajectory data. Some basic statistics of the three data sets are shown in Table III.
2) Comparison Algorithms: We compare our approach with two baseline algorithms, which are described as follows.
• Nearest. The Nearest algorithm simply sets the POI that is closest to the drop-off location as the final destination of the passenger, regardless of the drop-off time. Thus, the trip purpose is predicted as taking the activity related to that POI’s category.
• Bayes’ rule [15]. The major difference between this baseline and our proposed method is that the baseline assumes that two temporally close drop-off points may share the same prior probability of a given trip purpose, even if the two points are located far away from each other. In contrast, our proposed algorithm considers both regular and dynamic patterns when calculating the prior probability at a very fine spatial and temporal resolution, leveraging the user-generated check-in data.
Fig. 4. Results of UAR and CAA identification in Manhattan, NYC: a full view of the clustering results (a); a close view of some selected regions (b); the number of CAAs in the UARs (c). (Best viewed in an enlarged digital version.)
TABLE III. STATISTICS OF URBAN DATA SETS USED IN THE PAPER
3) Evaluation Environment: All the evaluations in the paper are implemented in Java under the Eclipse J2SE 1.5 integrated development environment, and are run on an Intel Core i5-4950 PC with 8 GB of RAM and the Windows 8 operating system.
B. Evaluation on Candidate Activity Area Identification
Fig. 4 presents the clustering results (i.e., the identified UARs and CAAs) of our two-stage clustering algorithm. In total, we have identified 30 UARs, all of which are derived from the road network data. As shown in Fig. 4(a), most POIs are located in midtown and downtown Manhattan, while only very few are scattered uptown. A close view of some selected regions is shown in Fig. 4(b) to highlight the advantages of our proposed clustering algorithm. For example, due to physical barriers (i.e., wide roads), the POIs in purple in Region 6 are not grouped together with their nearby POIs in Region 5, and several POIs in Region 4 are not merged with their neighbours in Region 5 either. Each UAR contains a different number of CAAs, depending on the spatial distribution of the POIs inside. Fig. 4(c) shows the number of CAAs in each UAR; the x-axis corresponds to the region number and the y-axis to the number of CAAs in that region. As shown in the figure, Region 17 contains the largest number of CAAs, while most regions have fewer than 20 CAAs.
The size of the identified CAAs is also an important metric for evaluating the clustering algorithm, since each CAA should be coverable within walkable distance. Here, the size of a CAA is defined as the area of the minimal rectangle that covers all POIs in the CAA. If a CAA is too big, the POIs in it are difficult to reach on foot. Fig. 5 shows the Cumulative Distribution Function (CDF) of the sizes of all CAAs. As can be seen from the figure, over 96% of the CAAs are smaller than 10,000 square meters, showing the effectiveness of our proposed two-stage clustering algorithm.
Fig. 5. The CDF distribution of the size of CAAs.
C. Evaluation on Trip Purpose Imputation Algorithm
As discussed earlier, due to the lack of ground truth on taxi trip purposes, it is impossible to calculate the inference accuracy directly. Fortunately, travel purpose survey data are available at the regional scale (e.g., Manhattan) [3], which motivates us to evaluate the system accuracy indirectly. The rationale is as follows: if the distribution of the trip purposes inferred by our proposed method is statistically close to the one obtained from the survey data at the regional scale, our proposed method should be reliable. Since the survey data classifies travel purposes into 5 categories, i.e., work, education, recreation, shopping, and others, to make the results comparable we manually put ‘Dining’, ‘In-home’, ‘Transportation transfer’, ‘Lodging’, and ‘Medical’ into the ‘Others’ category. Next, for each taxi trip, the proposed inference algorithm gives us 5 probabilities, one for each of the 5 new trip purposes. Finally, for each trip purpose, we average the probabilities over all taxi trips generated in one month, and use the average value as the percentage of travel for that trip purpose.
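The aggregation described above (merging the nine activities into the five survey categories and averaging over trips) can be sketched as follows. The lowercase category keys and the function names are hypothetical labels chosen for illustration.

```python
from typing import Dict, List

# Activities that the text folds into the survey's "others" category.
OTHERS = {"dining", "in-home", "transportation transfer", "lodging", "medical"}


def to_survey_categories(probs: Dict[str, float]) -> Dict[str, float]:
    """Merge a trip's nine-activity probabilities into the five survey categories."""
    merged: Dict[str, float] = {}
    for activity, p in probs.items():
        key = "others" if activity in OTHERS else activity
        merged[key] = merged.get(key, 0.0) + p
    return merged


def purpose_shares(trips: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the merged probabilities over all trips to get regional percentages."""
    totals: Dict[str, float] = {}
    for probs in trips:
        for category, p in to_survey_categories(probs).items():
            totals[category] = totals.get(category, 0.0) + p
    return {c: s / len(trips) for c, s in totals.items()}
```

The resulting shares can be compared category by category with the survey distribution, which is the indirect accuracy check used in this subsection.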
We compare our inference results with the travel survey data in Fig. 6, where the results obtained by the two baselines are also plotted. The closer the percentage of each category is to the corresponding survey value, the better the algorithm performs. As can be seen from the results, our proposed algorithm achieves the best performance, the Nearest algorithm achieves the worst, and Bayes’ rule [15] lies in between.
Our proposed inference algorithm also enables us to gain insights into trip purposes at a much finer resolution. We thus select a representative urban activity region (UAR) to investigate the trend of trip purposes at different times of a workday. The selected UAR, together with the POIs distributed inside it, is shown in Fig. 7, where only four POI categories can be found. Fig. 8 shows the trip purpose inference results for the selected region across the whole day (top chart), together with the corresponding results for the other regions of Manhattan for comparison (bottom chart). As shown in the figure, travel for shopping and dining is more common in the selected region, since it is a well-known shopping and dining center in NYC. Moreover, the number of trips for shopping keeps increasing and remains high in the daytime, even on workdays. In both the selected UAR and the other regions of Manhattan, the number of trips for recreation climbs after working hours.
Fig. 6. Comparison results to baseline algorithms and survey data.
Fig. 7. A selected UAR with 4 kinds of POIs. (Best viewed in an enlarged digital version.)
Fig. 8. Trip purpose imputation results for a given day in the selected UAR and in Manhattan, respectively.
D. Evaluation on Response Time
Another key system metric is how long a passenger has to wait for the recommendation service after getting off the taxi. Because all requests are processed sequentially on one machine following the First-In-First-Out (FIFO) rule, one of the following two situations occurs when a request arrives: (1) no other requests are being processed or waiting to be processed in the system; (2) there are other requests in the system, being processed or waiting to be processed. In the first case, the request enters service immediately upon arrival. In the second case, the request has to wait in the queue until the server has finished processing the requests that arrived earlier. Thus, the response time of a request is the time from its arrival until its processing is complete. In other words, the response time consists of the waiting time and the processing time.
TABLE IV
RESPONSE TIME IN THE WORST CASE IN MANHATTAN AND NYC, RESPECTIVELY
Fig. 9. The CDF distributions of the response times at a day in Manhattan and the whole NYC, respectively.
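The single-server FIFO model described in this subsection can be sketched in a few lines of Python. This is only an illustration of the definition "response time = waiting time + processing time"; the function name and the example numbers are hypothetical and are not taken from the paper's experiments.

```python
def fifo_response_times(arrivals, service_times):
    """Response time (waiting + processing) of each request on one FIFO server.

    arrivals: arrival timestamps in seconds, sorted in ascending order.
    service_times: processing time in seconds of each request, same order.
    """
    responses = []
    server_free_at = 0.0
    for arrival, service in zip(arrivals, service_times):
        # Case (1): the server is idle, so processing starts on arrival.
        # Case (2): earlier requests are still in the system, so the request waits.
        start = max(arrival, server_free_at)
        finish = start + service
        responses.append(finish - arrival)
        server_free_at = finish
    return responses


if __name__ == "__main__":
    # Made-up arrivals: the second and third requests queue behind the first.
    print(fifo_response_times([0.0, 1.0, 1.5, 10.0], [2.0, 2.0, 1.0, 0.5]))
```

The longest response time of a day is then simply the maximum of the returned list for that day's requests.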
We are most interested in the longest response time that any request experiences during a day, i.e., the longest time that a request (or a taxi trip) has to wait before it has been processed. The logic is that if the longest response time is acceptable for most users, then the system is useful in practice. The longest response time corresponds to the worst case during a day. Table IV shows the mean and the standard deviation of the daily longest response time in Manhattan and in the whole of NYC, computed over 15 observation days. On average, the worst case takes 7.54 seconds and 8.15 seconds for requests from Manhattan and from the whole of NYC, respectively, which is acceptable in our application scenarios. Hence, we conclude that the proposed TripImputor can not only process requests from the whole of NYC with a single normal PC, but also provide timely recommendation services.
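As a rough illustration of how such worst-case statistics can be derived, the Python sketch below takes the longest response time of each observation day and reports the mean and standard deviation across days. The function name is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def daily_worst_case_stats(response_times_by_day):
    """Mean and sample standard deviation of each day's longest response time.

    response_times_by_day: mapping from a day label to that day's response
    times in seconds. At least two days are needed for the standard deviation.
    """
    daily_maxima = [max(times) for times in response_times_by_day.values()]
    return statistics.mean(daily_maxima), statistics.stdev(daily_maxima)


if __name__ == "__main__":
    # Made-up response times for two days, purely illustrative.
    mean, std = daily_worst_case_stats({"day1": [1.0, 3.0], "day2": [5.0, 2.0]})
    print(mean, std)
```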
We are also interested in the distribution of all response times in Manhattan and the whole of NYC, as shown in Fig. 9. As can be observed, although Manhattan contributes 90% of the trip inference requests of the whole city, it still takes more time to respond to a request when serving the whole city, because the more requests arrive per unit time, the longer the waiting time and hence the response time. Moreover, almost half of the requests are answered within 50 milliseconds in both Manhattan and the whole of NYC. As shown in the figure, although in the worst case it takes around 7.54 seconds to answer a request, 80% of the requests from Manhattan are answered within 4.5 seconds and 80% of those from the whole of NYC within around 5 seconds. On average, it takes only 1.588 and 1.812 seconds to respond for Manhattan and the whole of NYC, respectively. These results demonstrate the efficiency of our system.
Fig. 10. The longest response time (corresponds to the worst case) under different number of requests per hour.
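Statements such as "80% of the requests are answered within 4.5 seconds" are readings of an empirical CDF. The Python sketch below shows one way to compute such values from a list of response times; the function names are hypothetical, and the nearest-rank percentile definition is an assumption rather than the paper's stated method.

```python
import math


def fraction_within(response_times, threshold):
    """Empirical CDF value: share of requests answered within `threshold` seconds."""
    return sum(t <= threshold for t in response_times) / len(response_times)


def percentile(response_times, q):
    """Smallest observed time within which at least a fraction q of requests finish.

    Uses the nearest-rank definition on the sorted sample.
    """
    ordered = sorted(response_times)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    times = [4.0, 1.0, 3.0, 2.0]  # made-up response times in seconds
    print(fraction_within(times, 2.0), percentile(times, 0.75))
```

Evaluating `fraction_within` on a grid of thresholds yields the points of a CDF curve like the one the figure reports.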
The previous experimental results confirm the efficiency of our proposed system in handling requests from the whole of NYC. We also observe that it takes more time to respond to a request when more requests arrive (as in the whole of NYC). Going a step further, we investigate how many cities like NYC a single normal PC can support while still returning a timely response. In Fig. 10, the x-axis is the number of requests per hour and the y-axis is the longest response time among all requests. As can be seen, it takes at most around 7, 16, 24 and 30 seconds to process 20,000, 40,000, 60,000 and 80,000 requests per hour, respectively. As the number of requests received during one hour keeps increasing, the longest response time rises sharply, because all the requests are processed sequentially on one PC: it exceeds 9 minutes when the number of requests per hour reaches 100,000. Note that around 20,000 requests arrive per hour in the whole of NYC during peak hours. Thus, our method can serve requests for 4 cities like NYC using just one normal PC, provided that users accept a maximal response time of around 30 seconds.
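The capacity argument above reduces to simple arithmetic, sketched below in Python. The load and latency pairs are the approximate values quoted in the text; the 100,000-request point is left out because the text only gives a lower bound ("more than 9 minutes") for it. The function and constant names are hypothetical.

```python
# (requests per hour, longest response time in seconds), approximate values from the text.
MEASURED = [(20_000, 7), (40_000, 16), (60_000, 24), (80_000, 30)]

# Peak-hour load of one NYC-sized city, as stated in the text.
PEAK_REQUESTS_PER_CITY = 20_000


def supported_cities(max_acceptable_seconds, measured=MEASURED, per_city=PEAK_REQUESTS_PER_CITY):
    """Number of NYC-sized cities whose combined peak load fits the latency budget.

    Only measured load levels are considered; no interpolation is attempted.
    """
    feasible = [load for load, worst in measured if worst <= max_acceptable_seconds]
    return max(feasible, default=0) // per_city


if __name__ == "__main__":
    print(supported_cities(30))
```

With a 30-second budget the largest feasible measured load is 80,000 requests per hour, which corresponds to the 4 cities stated in the text.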
VII. CONCLUSION AND FUTURE WORK
In this paper, we present a novel two-phase framework called TripImputor for inferring taxi trip purposes in real time. In the first phase, trip purpose inference, we propose a two-stage clustering algorithm to identify the candidate activity areas (CAAs) in the urban space, and then calculate the posterior probability of each activity for each taxi trip using Bayes’ theorem. In the second phase, to reduce the online computation time and enable a real-time response, we develop a procedure that clusters historical drop-off points and matches the drop-off clusters with CAAs. Finally, we evaluate the effectiveness and efficiency of the system using real-world datasets. Experimental results demonstrate that the proposed two-phase framework achieves promising performance in both accuracy and response time.
In the future, we plan to broaden and deepen this work in several directions. First, we plan to incorporate more relevant information to further improve the accuracy of the inference algorithm, such as personal background and socio-economic features, with a particular focus on information about the pick-up point (the pick-up time and location, as well as its nearby spatial context) and the trip travel time. Second, we intend to investigate taxi trip purposes in different seasons and at different spatial resolutions, as well as the yearly evolution of taxi trip purposes and the underlying motivations. Third, we intend to accelerate the computation by introducing parallel mechanisms such as Spark, since each taxi trip can be handled separately. Finally, we would like to deploy our system on mobile devices and recruit volunteers to test it in actual settings, collecting feedback on how to further improve the service.
REFERENCES
[1] R. K. Balan, K. X. Nguyen, and L. Jiang, “Real-time trip information service for a large taxi fleet,” in Proc. MobiSys, 2011, pp. 99–112.
[2] P. S. Castro, D. Zhang, C. Chen, S. Li, and G. Pan, “From taxi GPS traces to social and community dynamics: A survey,” ACM Comput. Surv., vol. 46, no. 2, pp. 17:1–17:34, 2013.
[3] C. Chen, H. Gong, C. Lawson, and E. Bialostozky, “Evaluating the feasibility of a passive travel survey collection in a complex urban environment: Lessons learned from the New York City case study,” Transp. Res. A, Policy Pract., vol. 44, no. 10, pp. 830–840, 2010.
[4] C. Chen, Z. Wang, and B. Guo, “The road to the Chinese smart city: Progress, challenges, and future directions,” IT Prof., vol. 18, no. 1, pp. 14–17, Jan./Feb. 2016.
[5] C. Chen et al., “iBOAT: Isolation-based online anomalous trajectory detection,” IEEE Trans. Intell. Transp. Syst., vol. 14, no. 2, pp. 806–818, Jun. 2013.
[6] C. Chen, D. Zhang, B. Guo, X. Ma, G. Pan, and Z. Wu, “TripPlanner: Personalized trip planning leveraging heterogeneous crowdsourced digital footprints,” IEEE Trans. Intell. Transp. Syst., vol. 16, no. 3, pp. 1259–1273, Jun. 2015.
[7] C. Chen et al., “CrowdDeliver: Planning city-wide package delivery paths leveraging the crowd of taxis,” IEEE Trans. Intell. Transp. Syst., vol. 18, no. 6, pp. 1478–1496, Jun. 2017.
[8] K. J. Clifton and S. L. Handy, “Qualitative methods in travel behaviour research,” in Transport Survey Quality and Innovation. Emerald Group Publishing Limited, 2003, pp. 283–302.
[9] Z. Deng and M. Ji, “Deriving rules for trip purpose identification from GPS travel survey data and land use data: A machine learning approach,” in Proc. 7th Int. Conf. Traffic Transp. Stud., 2010, pp. 768–777.
[10] Y. Ding, C. Chen, S. Zhang, B. Guo, Z. Yu, and Y. Wang, “GreenPlanner: Planning personalized fuel-efficient driving routes using multi-sourced urban data,” in Proc. PerCom, Mar. 2017, pp. 207–216.
[11] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proc. KDD, vol. 96. 1996, pp. 226–231.
[12] T. Feng and H. J. P. Timmermans, “Detecting activity type from GPS traces using spatial and temporal information,” Eur. J. Transp. Infrastruct. Res., vol. 15, no. 4, pp. 662–674, 2015.
[13] B. Furletti, P. Cintia, C. Renso, and L. Spinsanti, “Inferring human activities from GPS tracks,” in Proc. 2nd ACM SIGKDD Int. Workshop Urban Comput., 2013, p. 5.
[14] Y. Ge, H. Xiong, A. Tuzhilin, K. Xiao, M. Gruteser, and M. Pazzani, “An energy-efficient mobile recommender system,” in Proc. ACM KDD, 2010, pp. 899–908.
[15] L. Gong, X. Liu, L. Wu, and Y. Liu, “Inferring trip purposes and uncovering travel patterns from taxi trajectory data,” Cartogr. Geogr. Inf. Sci., vol. 43, no. 2, pp. 103–114, 2016.
[16] L. Gong, T. Morikawa, T. Yamamoto, and H. Sato, “Deriving personal trip data from GPS data: A literature review on the existing methodologies,” Procedia-Social Behavioral Sci., vol. 138, pp. 557–565, Jul. 2014.
[17] K. Hormann and A. Agathos, “The point in polygon problem for arbitrary polygons,” Comput. Geometry, vol. 20, no. 3, pp. 131–144, 2001.
[18] J. Huang, Y. Li, R. Crawfis, S.-C. Lu, and S.-Y. Liou, “A complete distance field representation,” in Proc. Conf. Vis., 2001, pp. 247–254.
[19] L. Huang, Q. Li, and Y. Yue, “Activity identification from GPS trajectories using spatial temporal POIs’ attractiveness,” in Proc. 2nd ACM SIGSPATIAL Int. Workshop LBSNs, 2010, pp. 27–30.
[20] C. Kang, X. Ma, D. Tong, and Y. Liu, “Intra-urban human mobility patterns: An urban morphology perspective,” Phys. A, Statist. Mech. Appl., vol. 391, no. 4, pp. 1702–1717, 2012.
[21] K.-R. Koch, Introduction to Bayesian Statistics. Springer, 2007.
[22] J. Krumm and D. Rouhana, “Placer: Semantic place labels from diary data,” in Proc. ACM Int. Joint Conf. Pervasive Ubiquitous Comput., 2013, pp. 163–172.
[23] M.-P. Kwan, “How GIS can help address the uncertain geographic context problem in social science research,” Ann. GIS, vol. 18, no. 4, pp. 245–255, 2012.
[24] H. T. Lam, E. Diaz-Aviles, A. Pascale, Y. Gkoufas, and B. Chen. (2015). “(Blue) taxi destination and trip time prediction from partial trajectories.” [Online]. Available: https://arxiv.org/abs/1509.05257
[25] X. Li, M. Li, Y.-J. Gong, X.-L. Zhang, and J. Yin, “T-DesP: Destination prediction based on big trajectory data,” IEEE Trans. Intell. Transp. Syst., vol. 17, no. 8, pp. 2344–2354, Aug. 2016.
[26] Y. Lin, H. Wan, R. Jiang, Z. Wu, and X. Jia, “Inferring the travel purposes of passenger groups for better understanding of passengers,” IEEE Trans. Intell. Transp. Syst., vol. 16, no. 1, pp. 235–243, Feb. 2015.
[27] Y. Lu and L. Zhang, “Imputing trip purposes for long-distance travel,” Transportation, vol. 42, no. 4, pp. 581–595, 2015.
[28] D. Newman and A. Paasi, “Fences and neighbours in the postmodern world: Boundary narratives in political geography,” Prog. Hum. Geogr., vol. 22, no. 2, pp. 186–207, 1998.
[29] T. H. Rashidi, A. Abbasi, M. Maghrebi, S. Hasan, and T. S. Waller, “Exploring the capacity of social media data for modelling travel behaviour: Opportunities and challenges,” Transp. Res. C, Emerg. Technol., vol. 75, pp. 197–211, Feb. 2017.
[30] S. Schönfelder, “Urban rhythms: Modelling the rhythms of individual travel behaviour,” Ph.D. dissertation, ETH Zurich, Zürich, Switzerland, 2006.
[31] M. Shimrat, “Algorithm 112: Position of point relative to polygon,” Commun. ACM, vol. 5, no. 8, p. 434, 1962.
[32] L. Wang, Z. Yu, B. Guo, T. Ku, and F. Yi, “Moving destination prediction using sparse dataset: A mobility gradient descent approach,” ACM Trans. Knowl. Discovery Data, vol. 11, no. 3, p. 37, 2017.
[33] J. Wolf, “Using GPS data loggers to replace travel diaries in the collection of travel data,” Ph.D. dissertation, Georgia Inst. Technol., Atlanta, GA, USA, 2000.
[34] A. Y. Xue, R. Zhang, Y. Zheng, X. Xie, J. Huang, and Z. Xu, “Destination prediction by sub-trajectory synthesis and privacy protection against such prediction,” in Proc. IEEE ICDE, Apr. 2013, pp. 254–265.
[35] D. Yang, D. Zhang, V. W. Zheng, and Z. Yu, “Modeling user activity preference by leveraging user spatial temporal characteristics in LBSNs,” IEEE Trans. Syst., Man, Cybern., Syst., vol. 45, no. 1, pp. 129–142, Jan. 2015.
[36] Z. Yu, H. Xu, Z. Yang, and B. Guo, “Personalized travel package with multi-point-of-interest recommendation based on crowdsourced user footprints,” IEEE Trans. Human–Mach. Syst., vol. 46, no. 1, pp. 151–158, Feb. 2016.
[37] N. J. Yuan, Y. Zheng, and X. Xie, “Segmentation of urban areas using road networks,” Microsoft Res., Tech. Rep., 2012.
[38] Y. Yue, T. Lan, A. G. O. Yeh, and Q.-Q. Li, “Zooming into individuals to understand the collective: A review of trajectory-based travel behaviour studies,” Travel Behaviour Soc., vol. 1, no. 2, pp. 69–78, 2014.
[39] Y. Zheng, Y. Chen, Q. Li, X. Xie, and W.-Y. Ma, “Understanding transportation modes based on GPS data for Web applications,” ACM Trans. Web, vol. 4, no. 1, p. 1, 2010.
[40] C. Zhong, S. M. Arisona, X. Huang, M. Batty, and G. Schmitt, “Detecting the dynamics of urban structure through spatial network analysis,” Int. J. Geogr. Inf. Sci., vol. 28, no. 11, pp. 2178–2199, 2014.
[41] Z. Zhu, U. Blanke, and G. Tröster, “Inferring travel purpose from crowd-augmented human mobility data,” in Proc. 1st Int. Conf. IoT Urban Space, 2014, pp. 44–49.
Chao Chen received the B.Sc. and M.Sc. degrees in control science and control engineering from Northwestern Polytechnical University, Xi’an, China, in 2007 and 2010, respectively, and the Ph.D. degree from the Université Pierre et Marie Curie and the Institut Mines-Télécom/Télécom SudParis, France, in 2014. In 2009, he was a Research Assistant with Hong Kong Polytechnic University, Hong Kong. He is currently an Associate Professor with the College of Computer Science, Chongqing University, Chongqing, China. He has authored or co-authored over 40 papers including eight IEEE transactions. His research interests include pervasive computing, mobile computing, urban logistics, data mining from large-scale GPS trajectory data, and big data analytics for smart cities. His work on taxi trajectory data mining was featured by the IEEE Spectrum in 2011 and 2016, respectively. He was also a recipient of the Best Paper Runner-Up Award at MobiQuitous 2011.
Shuhai Jiao received the B.Sc. degree from the College of Information and Software Engineering, Northeast Normal University, Changchun, China, in 2015. He is currently pursuing the master’s degree with the College of Computer Science, Chongqing University, Chongqing, China. He was a Research Intern at Didi Chuxing Company, Beijing, China, in 2017. His research interests include scenic travel route planning and taxi GPS trajectory data mining.
Shu Zhang received the bachelor’s degree from the Civil Aviation University of China, Tianjin, China, in 2007, the master’s degree from Mississippi State University, Starkville, MS, USA, in 2010, and the Ph.D. degree in management sciences from the University of Iowa, Iowa, IA, USA, in 2015. She is currently an Assistant Professor with the College of Economics and Business Administration, Chongqing University, Chongqing. Her research interests include vehicle routing, urban logistics, and transportation network design.
Weichen Liu (S’07–M’11) received the B.Eng. and M.Eng. degrees from the Harbin Institute of Technology, China, and the Ph.D. degree from the Hong Kong University of Science and Technology, Hong Kong. He is currently an Assistant Professor with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. He has authored and co-authored over 70 research papers in peer-reviewed journals, conferences, and books. His research interests include embedded and real-time systems, multiprocessor systems, and fault-tolerant systems. He has received the Best Paper Candidate Awards from CODES+ISSS, CASES, and ASP-DAC.
Liang Feng received the Ph.D. degree from the School of Computer Engineering, Nanyang Technological University, Singapore, in 2014. He was a Post-Doctoral Research Fellow at the Computational Intelligence Graduate Laboratory, Nanyang Technological University. He is currently an Assistant Professor at the College of Computer Science, Chongqing University, China. His research interests include computational and artificial intelligence, memetic computing, big data optimization and learning, and transfer learning.
Yasha Wang received the Ph.D. degree from Northeastern University, Shenyang, China, in 2003. He is currently a Professor and an Associate Director of the National Research and Engineering Center of Software Engineering with Peking University, China. His research interests include urban data analytics, ubiquitous computing, software reuse, and online software development environment. He has authored or co-authored over 50 papers in prestigious conferences and journals, such as ICWS, UbiComp, ICSP, and so on. As a Technical Leader and Manager, he has accomplished several key national projects on software engineering and smart cities. Cooperating with major smart-city solution providing companies, his research work has been adopted in more than 20 cities in China.
... With the rapid development of computer-aided diagnosis technology [1], a variety of medical imaging technologies have emerged clinically for capturing medical images of human internal tissues and organs, such as Computed Tomography (CT), Nuclear Magnetic Resonance Imaging (NMRI), and Ultrasound Imaging (UI). In the ambulance, these medical images can now be presented on ambulancemounted intelligent systems for fast and accurate diagnosis with the development of artificial intelligence technology [2][3][4], mobile communication technology [5][6][7][8], the and Internet of Things [9][10][11]. However, the pixels of these medical images have been already fixed in the process of image production. ...
Article
Full-text available
Ambulance services play a vital role in intelligent transportation systems (ITS). In an intelligent ambulance system, the medical images can help doctors quickly and accurately understand the patients’ condition during first aid. On various display devices in different kinds of ambulances, content-aware image adaption can be used to better present the medical image among different display resolutions and aspect ratios. Most existing methods mainly focus on visual protection of salient areas, such as specific organ parts of the human body, with less attention paid to the visual effect of unimportant areas. However, the human visual system is more sensitive to the edge and contour of images, which are important for ambulance services. To improve the visual effect of adapted images, a contour-maintaining-based image adaption method for an efficient ambulance service in ITS is proposed here. Firstly, the proposed method innovatively combines the weighted gradient, saliency, and edge maps into an importance map. Secondly, energy is optimized for reducing contour distortion and interruption according to the visual slope and curvature of contours and edges in non-salient areas. Finally, applying the sub-procedure of a forward seam carving method, the optimal seams can more evenly pass through the contour areas. The experimental results demonstrate that the proposed method is more effective than other similar methods.
... Nowadays, wireless access networks and cloud infrastructure are faced with severe demands such as ultra-low latency, high reliability and user experience continuity, due to the rapid growth of Internet of Things (IoTs) and various mobile applications [8,11,24,25]. These stringent requirements starve for highly localized services in close proximity to end users at the network edge. ...
Conference Paper
In this paper, we investigate the problem of energy expenditure minimization under latency and accuracy constraints in mobile edge computing (MEC)-based computation offloading. Given the non-convexity of the formulated problem, we first propose an energy-efficient computation resource allocation scheme inspired by the recent successive convex approximation (SCA) advances. After carefully exploring the problem structure, we fortunately derive the optimal solution, whose optimality is theoretically proved. Numerical results show that, compared with the SCA-based scheme and two other benchmark schemes, the optimal computation resource allocation scheme achieves the lowest energy consumption while satisfying the latency and accuracy requirements.
... For example, based on biometric metrics, Lu et al. [26] utilize certain on-screen sliding movements to figure out 'who you are'. Chen et al. [27] calculate users'intention to take taxis through multi-sourced urban data. Some existing approaches [28] [29] are to identify the user's gender and mood through the user's voice. ...
Article
User profile can be used to characterize a person and help us better understand him/her, which further can be utilized to provide enhanced personalized services. When using mobile phone, some of one’s information are unavoidably and unobtrusively passed or stored, which makes it possible to draw the user profile. In this article, we propose to infer user profile including age, gender and personality traits based on mobile phone sensory data. Specifically, we capture data when unlocking screen, playing games as well as some basic mobile phone information, app usage and screen status by using common available sensors in commodity mobile phones. By analyzing the differences in users’ phone usage, we extracted features for user profile inference. Random Forest regression and Random Forest classification models are separately used to estimate age and gender of the user while SVR algorithm is applied to identify personality traits. In addition, we evaluate the model through real-life experiments conducted with a total of 84 phone users. Experimental results show that our approach effective, achieving a RSME of 4.3696 in age estimation and precision of 91.70% in gender detection. As for personality traits identification, the RMSEs of openness, conscientiousness, extraversion, agreeableness and neuroticism are 0.29, 0.3506, 0.465, 0.3022 and 0.452, respectively.
Article
The mismatch between logical and physical I/O granularity inhibits the deployment of embedded file systems. Most existing embedded file systems manage logical space with a small unit, which is no longer the case of the flash operation granularity. Manually enlarging the logical I/O granularity of file systems requires enormous transplanting efforts. Moreover, large logical pages signify the write amplification problem, which turns to severe space consumption and performance collapse. This paper designs a novel storage middleware, NV-middle, for legacy embedded file systems with large-capacity flash memories. Legacy embedded storage schemes can be smoothly transplanted into new platforms with different hardware read/write granularity. Moreover, the legacy optimization schemes can be maximally reserved, without inducing write amplification problems. We implement NV-middle with the state-of-the-art embedded file system, YAFFS2. Comprehensive evaluations show that NV-middle can achieve times of performance improvement over manually transplanted YAFFS2 with various workloads.
Article
Full-text available
We study the vehicle routing problem of China Post Group with time windows (VRP_CPG_TW). A three-level hub model is established, which includes the determination of the number, location of hubs and their service area as well as routes between hubs and local post office. We propose a comprehensive approach by integrating the center distribution method, and the Taboo and Genetic algorithm to solve the VRP_CPG_TW. The proposed algorithm is divided into two phase. The first phase includes initial site selection and regional division and the second phase is responsible for solving vehicle routing problem with time constraint. The two phases are iterated alternately until the feasible solution comes out. Our scheme is compared with the real scheme of Guizhou Post and about 25% operation fee is reduced. The test results confirm that our model has high potential to be applied to optimize the transportation network of China Post. Moreover, the proposed model also provides an effective solution for optimizing the transportation network with hierarchical structure.
Conference Paper
Ride-on-demand (RoD) services such as Uber and Didi (in China) are becoming increasingly popular, and in these services dynamic price plays an important role in balancing the supply (i.e., the number of cars) and demand (i.e., the number of passenger requests) to benefit both drivers and passengers. However, the dynamic price also creates concerns for passengers: the "unpredictable" prices sometimes prevent them from making quick decisions at ease. One may wonder if it is possible to get a lower price if s/he chooses to wait a while. Giving passengers more information helps to tackle this concern, and predicting the prices is a possible solution. In this paper we perform dynamic price prediction based on multi-source urban data. Price prediction helps passengers understand whether they could get a lower price in neighboring locations or within a short time, thus alleviating their concerns. The prediction is based on urban data from multiple sources, including the RoD service itself, taxi service, public transportation, weather, the map of a city, etc. The rationale behind using multi-source urban data is that the dynamic price in RoD may be influenced by different factors found in different data sources. We train a neural network to perform the prediction, and evaluate the prediction accuracy of using different combinations of multi-source urban data. Our results show that using multi-source urban data indeed helps improve the prediction accuracy, and different datasets may have varying influences on the dynamic prices.
Article
Full-text available
Ride-on-demand (RoD) services such as Uber and Didi are becoming increasingly popular, and in these services dynamic prices play an important role in balancing the supply and demand to benefit both drivers and passengers. However, dynamic prices also create concerns. For passengers, the "unpredictable" prices sometimes prevent them from making quick decisions: one may wonder if it is possible to get a lower price if s/he chooses to wait a while. It is necessary to provide more information to them, and predicting the dynamic prices is a possible solution. For the transportation industry and policy makers, there are also concerns about the relationship between RoD services and their more traditional counterparts such as metro, bus, and taxi: whether they affect each other and how. In this paper we tackle these two concerns by predicting the dynamic prices using multi-source urban data. Price prediction could help passengers understand whether they could get a lower price in neighboring locations or within a short time, thus alleviating their concerns. The prediction is based on urban data from multiple sources, including the RoD service itself, taxi service, public transportation, weather, the map of a city, etc. We train a simple linear regression model with high-dimensional composite features to perform the prediction. By combining simple basic features into composite features, we compensate for the loss of expressiveness in a linear model due to the lack of non-linearity. Additionally, the use of multi-source data and a linear model enables us to quantify and explain the relationship between multiple means of transportation by examining the weights of different features in the model. Our hope is that the study not only serves as an accurate prediction to make passengers more satisfied, but also sheds light on the concern about the relationship between different means of transportation for either the industry or policy makers.
Article
In emerging ride-on-demand (RoD) services, dynamic pricing plays an important role in regulating supply and demand and improving service efficiency. Despite this, it also makes passenger anxious: whether the current price is low enough, or otherwise, how to get a lower price. It is thus necessary to provide more information to ease the anxiety, and predicting the prices is one possible solution. In this study, the authors predict the dynamic prices to help passengers learn if there is a lower price around. They first use entropy of historical prices to characterize the predictability of prices in different locations and claim that different prediction algorithms should be used to balance between efficiency and accuracy. They present an ensemble learning approach to price prediction and compare it with two baseline predictors, namely a Markov and a neural network predictor. The performance evaluation is based on the real data from a major RoD service provider. Results verify that the two baseline predictors work well in locations with different levels of predictabilities, and that ensemble learning significantly increases the prediction accuracy. Finally, they also evaluate the effects of prediction, i.e., the probability that passengers could benefit from the prediction and get a lower price.
Thesis
The recent availability of longitudinal data on individual trip making and activity behaviour has enabled analysts to get new insights into the structures and motives of daily life travel. Travel diary data sets such as Mobidrive (six-week continuous travel diary survey) and GPS observations such as Atlanta (up to 2 years of vehicle instrumented GPS monitoring) are exciting sources of information for the description and modelling of the variability of individual travel patterns. The investigation of long-term temporal and spatial phenomena of travel demand is adding to the analysis repertoire of Activity Based Analysis (ABA) which identifies this area as an important issue for research and practice. This thesis picks up two aspects from the wide field of the intra-personal investigation of travel behaviour which are the periodicity in activity demand and the long-term structures of destination choice and activity spaces. These two issues stress the regularity and the stability of day-to-day travel behaviour which has been often neglected in travel behaviour analysis in favour of the legitimate intention to search for complexity and variability in the first place. The first stream of analysis concentrates on the description of the temporal patterns of activity demand by Survival Analysis techniques such as hazard models. The approach which considers parametric as well as non-parametric models is chosen to capture the specific characteristics of interval duration data. The models reveal the effects of socio-economic attributes of travellers on the periodicity of activity execution. The focus of the second stream of analysis is the description and measurement of the spatial distribution of activities. Activity locations which are frequently visited over prolonged periods are structural elements of the activity spaces which may be understood as a “manifestation of our everyday lives”. 
The thesis develops several measurement approaches which focus on the enumeration and mapping of unique locations and the transformation of point patterns into continuous representations of locational choice. The identification and measurement of revealed individual activity spaces is believed to increase transport planning’s ability to realistically define choice sets for destination choice. The analysis is based on a range of individual panel data sets of different data collection methods and survey areas which provides a great variety of behavioural patterns and regional peculiarities. These data sets span the range from rural village and small town (Canton Thurgau, Switzerland) to metropolitan environments (Copenhagen or Atlanta). The analysis tries to trace the possible impacts of these scale differences. The thesis offers interesting new findings on the motives of recurrent patterns of travel and especially on the longitudinal structures of people’s destination choice. A multifaceted and ambiguous character of daily life travel is revealed. Whereas sound routines in time and space seem to dominate daily life, individuals show a considerable amount of variability, flexibility and variety seeking in travel and activity behaviour. The results have strong implications for further methodological developments in travel behaviour analysis and for the ongoing practitioners’ discussion of how to influence people’s mobility patterns.
Article
Moving destination prediction offers an important category of location-based applications and provides essential intelligence to business and governments. In existing studies, a common approach to destination prediction is to match the given query trajectory with massive recorded trajectories by similarity calculation. Unfortunately, due to privacy concerns, budget constraints, and many other factors, in most circumstances we can only obtain a sparse trajectory dataset. In a sparse dataset, the available moving trajectories are far from enough to cover all possible query trajectories; thus the predictability of the matching-based approach decreases remarkably. Toward destination prediction with a sparse dataset, instead of searching for similar trajectories over the sparse records, we examine the changes in distance from the sampling locations to the final destination on the query trajectory. The underlying idea is intuitive and directly motivated by travel purpose: people always get closer to the final destination during the movement. By borrowing the concept of gradient descent from optimization theory, we propose a novel moving destination prediction approach, namely MGDPre. Building upon the mobility gradient descent, MGDPre only investigates the behavior characteristics of the query trajectory itself without matching historical trajectories, and thus is applicable to sparse datasets. We evaluate our approach based on extensive experiments, using GPS trajectories generated by a sample of taxis over a 10-day period in Shenzhen city, China. The results demonstrate that our approach outperforms state-of-the-art baseline methods in effectiveness, efficiency, and scalability. 2017. Moving destination prediction using sparse dataset: A mobility gradient descent approach.
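The intuition stated in this abstract, that a traveller keeps getting closer to the final destination, can be illustrated with a small sketch. This is not MGDPre itself: the scoring rule (the fraction of consecutive steps on which the distance to a candidate shrinks), the use of planar Euclidean distance, and the names `approach_score` and `rank_candidates` are all assumptions made for illustration.

```python
import math


def approach_score(trajectory, candidate):
    """Fraction of consecutive steps that move closer to `candidate`.

    Assumption: points are (x, y) tuples in a planar coordinate system and
    the trajectory has at least two points.
    """
    distances = [math.dist(point, candidate) for point in trajectory]
    steps = list(zip(distances, distances[1:]))
    closer = sum(1 for before, after in steps if after < before)
    return closer / len(steps)


def rank_candidates(trajectory, candidates):
    """Sort candidate destinations from most to least consistent with the trip."""
    return sorted(candidates,
                  key=lambda c: approach_score(trajectory, c),
                  reverse=True)


# Hypothetical partial trip heading east, with two candidate destinations.
partial_trip = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
candidates = [(-5.0, 0.0), (5.0, 0.0)]
print(rank_candidates(partial_trip, candidates))
```

Because the score depends only on the query trajectory, no historical trips are needed, which is the property the abstract highlights for sparse datasets.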
Article
With the recent surge of location-based social networks (LBSNs), activity data of millions of users has become attainable. This data contains not only spatial and temporal stamps of user activity, but also its semantic information. LBSNs can help to understand mobile users' spatial temporal activity preference (STAP), which can enable a wide range of ubiquitous applications, such as personalized context-aware location recommendation and group-oriented advertisement. However, modeling such user-specific STAP needs to tackle high-dimensional data, i.e., user-location-time-activity quadruples, which is complicated and usually suffers from a data sparsity problem. In order to address this problem, we propose a STAP model. It first models the spatial and temporal activity preference separately, and then uses a principled way to combine them for preference inference. In order to characterize the impact of spatial features on user activity preference, we propose the notion of personal functional region and related parameters to model and infer user spatial activity preference. In order to model the user temporal activity preference with sparse user activity data in LBSNs, we propose to exploit the temporal activity similarity among different users and apply nonnegative tensor factorization to collaboratively infer temporal activity preference. Finally, we put forward a context-aware fusion framework to combine the spatial and temporal activity preference models for preference inference. We evaluate our proposed approach on three real-world datasets collected from New York and Tokyo, and show that our STAP model consistently outperforms the baseline approaches in various settings.
Article
In the past few years, the social science literature has paid significant attention to extracting information from social media to track and analyse human movements. In this paper the transportation aspect of social media is investigated and reviewed. A detailed discussion is provided about how social media data from different sources can be used, indirectly and at minimal cost, to extract travel attributes such as trip purpose, mode of transport, activity duration and destination choice, as well as land use variables such as home, job and school location and socio-demographic attributes including gender, age and income. The evolution of the field of transport and travel behaviour around applications of social media over the last few years is studied. Further, this paper presents results of a qualitative survey of travel demand modelling experts around the world on the applicability of social media data for modelling daily travel behaviour. The survey results reveal that the experts hold a positive view of the usefulness of such data sources.
Article
Destination prediction is very important in location-based services such as the recommendation of targeted advertising locations. Most current approaches predict the destination of an ongoing trip from history trajectories. However, no existing work has considered the difference between the effects of passing-by locations and the destination in history trajectories, which seriously impacts the accuracy of the predicted results, as the destination can indicate the purpose of traveling. Meanwhile, the temporal information of history trajectories plays an important role in destination prediction. On one hand, history trajectories from different periods differ in their influence, e.g., the history trajectories from last week can reflect the status quo more accurately than those from two years ago. On the other hand, history trajectories in different time slots reflect different facts about traffic and people's moving habits, e.g., visiting a restaurant in the daytime and visiting a bar at night. Although a huge amount of history trajectories can be obtained in the era of big data, it is still far from covering all the query trajectories, since a road network is widely distributed and trajectory data is sparse. The temporal sensitivity of history trajectories highlights the sparsity problem even more. Therefore, we propose a novel model [Formula: see text] to solve the aforementioned problems. The model comprises two modules: trajectory learning and destination prediction. In the trajectory learning module, a novel method called the mirror absorbing Markov chain model is proposed for modeling the trajectories and isolating the destination. We build a transition tensor to deduce the transition probability between each location pair in a particular time slot. To address the data sparsity problem, we fill the missing values in the transition tensor through a context-aware tensor decomposition approach. 
In the module of destination prediction, an absorbing tensor is derived from the filled transition tensor, and the theoretical model is established for destination prediction. The experiments prove the effectiveness and efficiency of [Formula: see text].
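The general idea of treating destinations as absorbing states can be illustrated with a textbook absorbing Markov chain. The sketch below is not the authors' mirror absorbing Markov chain or their tensor-based model: it assumes a single time slot, a hypothetical transition matrix split into a transient-to-transient block `Q` and a transient-to-absorbing block `R`, and it approximates the absorption probabilities by fixed-point iteration of B = R + QB rather than by matrix inversion.

```python
def absorption_probabilities(Q, R, iterations=200):
    """Approximate B[i][j] = P(absorbed in destination j | start in transient i).

    Assumptions: Q is an n x n list of transient-to-transient probabilities,
    R is an n x m list of transient-to-absorbing probabilities, and every
    transient state can eventually reach an absorbing one, so the iteration
    B <- R + Q B converges.
    """
    n, m = len(R), len(R[0])
    B = [row[:] for row in R]
    for _ in range(iterations):
        B = [
            [R[i][j] + sum(Q[i][k] * B[k][j] for k in range(n)) for j in range(m)]
            for i in range(n)
        ]
    return B


# Hypothetical road network: two passing-by locations, two destinations.
Q = [[0.0, 0.5],
     [0.5, 0.0]]
R = [[0.5, 0.0],
     [0.0, 0.5]]

for row in absorption_probabilities(Q, R):
    print([round(p, 4) for p in row])
```

In this toy example a trip currently at the first passing-by location ends at the first destination with probability 2/3, which follows from solving a0 = 0.5 + 0.5·a1 and a1 = 0.5·a0 by hand.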
Article
Location-based social networks (LBSNs) provide people with an interface to share their locations and write reviews about interesting places of attraction. The shared locations form the crowdsourced digital footprints, in which each user has many connections to many locations, indicating user preference to locations. In this paper, we propose an approach for personalized travel package recommendation to help users make travel plans. The approach utilizes data collected from LBSNs to model users and locations, and it determines users’ preferred destinations using collaborative filtering approaches. Recommendations are generated by jointly considering user preference and spatiotemporal constraints. A heuristic search-based travel route planning algorithm was designed to generate travel packages. We developed a prototype system, which obtains users’ travel demands from mobile client and generates travel packages containing multiple points of interest and their visiting sequence. Experimental results suggest that the proposed approach shows promise with respect to improving recommendation accuracy and diversity.
Article
Real-time estimation of destination and travel time for taxis is of great importance for existing electronic dispatch systems. We present an approach based on trip matching and ensemble learning, in which we leverage the patterns observed in a dataset of roughly 1.7 million taxi journeys to predict the corresponding final destination and travel time for ongoing taxi trips, as a solution for the ECML/PKDD Discovery Challenge 2015 competition. The results of our empirical evaluation show that our approach is effective and very robust, which led our team -- BlueTaxi -- to the 3rd and 7th position of the final rankings for the trip time and destination prediction tasks, respectively. Given the fact that the final rankings were computed using a very small test set (with only 320 trips) we believe that our approach is one of the most robust solutions for the challenge based on the consistency of our good results across the test sets.
Article
Planning and policy analysis at the national, state and inter-regional corridor levels depends on reliable information and forecasts about long-distance travel. Emerging passive data collection technologies such as GPS, smartphones, and social media provide the opportunity for researchers and practitioners to potentially supplement or replace traditional long-distance travel surveys. However, certain important trip information, such as trip purpose, travel mode, and travelers’ socio-demographic characteristics, is missing from passively collected travel data. One promising solution to this data issue is to impute the missing information based on supplementary data (e.g., land use) and advanced statistical or data mining algorithms. This paper develops machine learning methods, including decision tree and meta-learning, to estimate trip purposes for long-distance passenger travel. A passively collected long-distance trip dataset is simulated from the 1995 American Travel Survey for the development and validation of the machine learning methods. The predictive accuracy of the proposed methods is evaluated for several scenarios varying with trip purposes and the extent of data availability as inputs. This research design will provide not only a practically useful approach for long-distance trip purpose imputation, but also generate valuable insights for future long-distance travel surveys. Results show that the accuracy of the trip purpose imputation methods based on all available data decreases from 95 % with two purposes (business and non-business) to 77 % with four purposes (business, personal business, social visit, and leisure). Based on a two-purpose scheme, the predictive accuracy of the imputation algorithms decreases from 95 % when all input data is used (a full-information model), to 72 % with a minimum information model that only utilizes the passively collected data. 
If traveler’s socio-demographic characteristics are available (possibly through other imputation models), the predictive accuracy only decreases from 95 to 91 %.
Article
Semantic place labels are labels like "home", "work", and "school" given to geographic locations where a person spends time. Such labels are important both for giving understandable location information to people and for automatically inferring activities. Deployed products often compute semantic labels with heuristics, which are difficult to program reliably. In this paper, we develop Placer, an algorithm to infer semantic place labels. It uses data from two large government diary studies to create a principled algorithm for labeling places based on machine learning. Our labeling reduces to a classification problem, where we classify locations into different label categories based on individual demographics, the timing of visits, and nearby businesses. Using these government studies gives us an unprecedented amount of training and test data. For instance, one of our experiments used training data from 87,600 place visits (from 10,372 distinct people) evaluated on 1,135,053 visits (from 124,517 distinct people). We show labeling accuracy for a number of experiments, including one that gives a 14 percentage point increase in accuracy when labeling is a function of nearby businesses in addition to demographic and time features. We also test on GPS data from 28 subjects.