
TripImputor: Real-Time Imputing Taxi Trip Purpose Leveraging Multi-sourced Urban Data

December 2018
IEEE Transactions on Intelligent Transportation Systems


Authors:

Chongqing University


Figure: Illustration of the computation of P(t, CAA_i). The value in each grid cell is the probability of taking an activity in the corresponding CAA after the corresponding trip ends.

Figure: Results of UAR and CAA identification in Manhattan, NYC: (a) full view of the clustering results; (b) close-up view of selected regions; (c) number of CAAs in the UARs. (Best viewed in an enlarged digital version.)

Figure: The CDF of the size of CAAs.

IEEE Proof

IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS

TripImputor: Real-Time Imputing Taxi Trip Purpose Leveraging Multi-Sourced Urban Data

Chao Chen, Shuhai Jiao, Shu Zhang, Weichen Liu, Member, IEEE, Liang Feng, and Yasha Wang

Abstract—Travel behavior understanding is a long-standing and critically important topic in the area of smart cities. Large volumes of GPS-based travel data can be collected easily, of which taxi GPS trajectory data is a typical example. However, GPS trajectory data usually contains little information on travelers’ activities, so it can support only limited applications. Quite a few studies have focused on enriching raw data with semantic meaning, such as inferring the travel mode or purpose. Unfortunately, trip purpose imputation has received relatively less attention, and existing work does not require real-time response. To narrow the gap, we propose a probabilistic two-phase framework named TripImputor for imputing taxi trip purposes in real time and recommending services to passengers at their drop-off points. Specifically, in the first phase, we propose a two-stage clustering algorithm to identify candidate activity areas (CAAs) in the urban space. Then, we extract fine-granularity spatial and temporal patterns of human behaviors inside the CAAs from Foursquare check-in data to approximate the prior probability of each activity, and compute the posterior probabilities (i.e., infer the trip purposes) using Bayes’ theorem. In the second phase, we adopt a procedure that clusters historical drop-off points and matches the drop-off clusters to CAAs to enable real-time response. Finally, we evaluate the effectiveness and efficiency of the proposed two-phase framework using real-world data sets, which consist of the road network, check-in data generated by over 38,000 users in one year, and large-scale taxi trip data generated by over 19,000 taxis in one month in Manhattan, New York City, USA. Experimental results demonstrate that the system is able to infer the trip purpose accurately and can provide recommendation results to passengers within 1.6 s on average in Manhattan, using only a single ordinary PC.

Index Terms—Travel behavior, trip purpose, smart city, Bayes’ theorem, trajectory data mining.

Manuscript received March 27, 2017; revised July 18, 2017 and October 9, 2017; accepted November 2, 2017. This work was supported in part by the National Key Research and Development Project of China under Grant 2017YFB1002000, in part by the National Science Foundation of China under Grant 61602067 and Grant 71601024, in part by the Fundamental Research Funds for the Central Universities under Grant 106112017cdjxy180001, in part by the Chongqing Basic and Frontier Research Program under Grant cstc2015jcyjA00016, in part by the Open Research Fund Program of Shenzhen Key Laboratory of Spatial Smart Sensing and Services, Shenzhen University, and in part by the Ministry of Education in China Humanities and Social Sciences Youth Foundation under Grant 16yjc630169. The Associate Editor for this paper was K. Savla. (Corresponding author: Chao Chen.)

C. Chen, S. Jiao, and L. Feng are with the College of Computer Science, Chongqing University, Chongqing 400044, China (e-mail: ivanchao.chen@gmail.com; jiaoshuhai@gmail.com; brightfengs@gmail.com).

S. Zhang is with the School of Economics and Business Administration, Chongqing University, Chongqing 400044, China (e-mail: zhangshu@cqu.edu.cn).

W. Liu is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore (e-mail: liu@ntu.edu.sg).

Y. Wang is with the School of Electronics Engineering and Computer Science, Institute of Software, Peking University, Beijing 100871, China (e-mail: wangys@sei.pku.edu.cn).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TITS.2017.2771231

I. INTRODUCTION

Travel behavior analysis is an important research topic [20]. In recent years, travel behavior and patterns have become more complex because modern cities are undergoing rapid urbanization [4], [8], [30]. It is well recognized that travel-related data is an important and valuable source for obtaining a holistic and in-depth understanding of travel behavior. By analyzing such data, urban planners and policy makers can better address urban planning, management, and operation issues [4]. Traditionally, travel-related data was mainly collected manually through paper-and-pencil interviews, computer-assisted telephone interviews, and computer-assisted self-interviews. All these methods suffer from several limitations, including high survey cost, heavy respondent burden, limited temporal and spatial coverage, and underreported trips (inaccuracies) [33].

With the wide proliferation of location-aware devices in daily life, including smartphones and GPS-equipped vehicles, large volumes of time-stamped location data of individuals have become easily available [38]. Such data contains a wealth of travel behavior information, such as when and where passengers move around the city at a reasonably high resolution, and sometimes which routes they take. For instance, a taxi trip log tells us the physical coordinates (longitude and latitude) and the exact times at which a passenger was picked up and dropped off, as well as the detailed sequence of roads traversed from the source to the destination. Consequently, experimenting with GPS-based data collection methods to supplement or replace the conventional ones is a growing trend. However, the collected GPS data is raw. In general, it lacks semantic information such as the transport mode taken or the activity types performed (travel purposes), i.e., how and why a passenger is moving, which is an essential component of urban computing. Furthermore, compared to enriching the raw data with the ‘how’ semantic,¹ existing methods for the ‘why’ semantic are still far from accurate [12], [39]. Indeed, there is a dilemma: trajectory data is rich thanks to emerging passive data collection technologies, but activity information is poor, although such activity information can directly help reveal the purpose of trips [15]. Hence, this paper attempts to narrow the gap between the raw data and people’s activities, with a particular focus on analyzing taxi passengers’ trip purposes.

¹ Note that taxi GPS trajectory data contains the transport mode information explicitly.

Trip purpose imputation² has been a long-standing research topic for over a decade [9], [13], [15], [16], [26], [41]. However, previous studies have rarely addressed the following two issues. 1) Inferring the trip purpose at the individual level. Prior research mainly focuses on interpreting trip purposes at an aggregate level, e.g., the city scale, so only macro-level smart urban services can be enabled. In contrast, to support micro-level smart urban services, such as recommending services to each passenger according to his/her travel purpose, imputing the trip purpose at the individual level is necessary. 2) Requiring real-time response, i.e., returning the corresponding purpose as soon as the trip ends. Real-time recognition of passengers’ travel purposes not only makes it possible to understand what people intend to do, but also allows timely recommendation services to be provided to passengers. In this way, passengers can undertake and organize their daily activities more efficiently and economically. For example, it is often desirable that restaurant coupons and/or other discount information be delivered to a passenger as soon as he/she gets off the taxi, if he/she is predicted to be going to dine. To the best of our knowledge, no work has been reported in this regard. We would like to clarify that we infer the trip purpose after the drop-off point is revealed. On one hand, although taxi drivers may know the destinations in advance, such information usually cannot be recorded automatically by the embedded GPS systems until the drivers push the passenger status button (from occupied to free) after arriving at the destinations. On the other hand, accurately predicting the destinations of taxi trips from their partial trajectories is challenging and is a separate research problem in itself, which has received intensive attention from the academic community, e.g., [24], [25], [32], and [34].

To enable real-time taxi trip purpose imputation at the individual level, we need to address the following two challenges:

• Lack of ground truth. The ground truth of the travel purpose of each trip is usually collected by proactively prompted recall [27], where only a very small fraction of users are asked to annotate their traces with the activities that they have done. To make matters worse, the annotated ground truth is contaminated, since many users cannot correctly remember what they have done.

• Real-time response. On one hand, existing algorithms for inferring trip purposes cannot be applied directly, since they do not provide real-time responses. On the other hand, taxi trips are generated continuously and intensively over time, which makes real-time response even more challenging.

² We use ‘inference’, ‘prediction’, and ‘imputation’ interchangeably throughout the manuscript.

To predict with high accuracy what activity a passenger intends to take after getting off the taxi, one should take the drop-off time, the drop-off location, and the nearby geographical context [23] into account. More specifically, the distribution of the different activities that people commonly take (i.e., human behaviors) in the area near the drop-off point at the drop-off time is a useful reference. Fortunately, check-in data, which is left by users when checking in at points of interest (POIs) using location-based social networks (LBSNs) such as Foursquare, contains a detailed description of the POIs (e.g., the hierarchical category and the opening hours) [6], [35]. With the check-in information, it is not difficult to understand passengers’ travel activities as well as the activity distribution in an area during a given time period [19], [29], [41]. For instance, people visit a restaurant to eat and visit a shopping mall to shop. Thus, the problem of trip purpose inference is transformed into the problem of predicting the probabilities of visiting different POI categories once the passenger gets off the taxi.

With the research objectives and challenges discussed above, the main contributions of the paper are:

1) We define a new problem that extends the existing travel purpose inference problem by requiring real-time response, in order to recommend timely and accurate services to passengers accordingly.

2) We propose a novel two-phase framework based on Bayes’ theorem, called TripImputor, to tackle the real-time taxi trip purpose imputation problem. In Phase I, we first propose a two-stage clustering algorithm to aggregate POIs. We identify urban activity regions (UARs), which are bounded and separated by physical barriers, using road network data (Stage 1). For each UAR, with the passenger’s drop-off location and alighting time as input, we identify candidate activity areas (CAAs) based on POI data (Stage 2). Then, we extract fine-granularity spatial and temporal patterns of human behaviors inside the CAAs from Foursquare check-in data to approximate the prior probability of each activity, and compute the posterior probabilities using Bayes’ theorem. In Phase II, to enable real-time response, after analyzing the computational bottleneck of the first phase, we propose a procedure that includes the clustering of historical drop-off points and the matching between drop-off clusters and CAAs to reduce the online computation time.

3) We conduct extensive evaluations of the effectiveness and efficiency of TripImputor using real-world datasets, which consist of the road network data, the Foursquare check-in data generated by over 38,000 users in one year, and the taxi GPS trajectory data generated by over 19,000 taxis in one month in Manhattan, NYC. Because the ground truth of each taxi trip is lacking, we evaluate the effectiveness indirectly by comparing with travel survey data in the statistical sense at the regional scale, instead of calculating the prediction accuracy for each trip individually. Experimental results show that TripImputor achieves the best prediction accuracy compared with two baselines. The average response time for each taxi trip is about 1.588 seconds. The quickest response time is 40 milliseconds and the longest is 7.54 seconds, which is still acceptable for practical applications.

The rest of the paper is organized as follows. In Section II, we review the related work and show how this paper differs from prior research. In Section III, we introduce several basic concepts and present the problem formulation. We present a detailed discussion of our two-phase framework in Section IV and Section V, respectively. We evaluate the performance of the proposed framework in Section VI. Finally, we conclude the paper and discuss future research directions in Section VII.

II. RELATED WORK

A. Semantic Trajectory Enrichment

With the rapid development of mobile localization technologies, the passive collection of large-scale time-stamped location data (trajectory data) has become feasible both technically and economically. The data comes from many sources, e.g., call detail records from mobile phone users, smart card data from travelers, GPS tracking of private and public vehicles, and so on. The recorded locations vary in format and resolution. For instance, GPS-based trajectory data records the physical coordinates of the moving objects, whereas smart card data records the location as a stop name. Besides, some of the data contains the travel mode information explicitly. However, there is generally no explicit record of the individual’s intention in making the trip. In other words, while such unlabeled data is available, the semantic label of travel purpose is missing.

Extracting high-level semantics from raw data and using them to better understand the underlying movement behaviors (e.g., why people move) has attracted many researchers’ attention [22]. Quite a few techniques have been applied to interpret travel purposes in terms of the activities taken after the trip. The techniques mainly include deterministic and heuristic rules, machine-learning-based approaches, and statistical data mining algorithms [9], [13], [15], [16], [26], [33], [41]. To name a few, Wolf [33] proposed using a set of deterministic rules, coupled with land use data, to derive the trip purpose. Deng and Ji [9] built a decision tree for trip purpose inference, combining other information provided by GIS data and respondents’ socio-demographics. By modeling the probability of points of interest being visited using Bayes’ rules, Gong et al. [15] inferred the travel purposes of taxi trips. Although many approaches have been developed to enrich raw trajectories with semantic meaning, prior work does not require a timely response when inferring the trip purpose, and thus cannot support recommendation services.

B. Check-In Data and Taxi Trajectory Data Mining

Check-in data and taxi trajectory data have been mined to support various smart urban applications and have attracted much attention from researchers in recent years. For example, knowledge hidden in check-in data has been mined to support (personalized) landmark recommendation and search, suggestion of frequently associated POI sequences, understanding of the heat map of landmark popularity at different times, and so on [6], [35].

Information mined from taxi trajectory data can benefit a number of parties, including taxi drivers, passengers, and city planners. Taxi drivers are mostly interested in making more money while minimizing fuel cost [10], [14]. Work on recommending the best corner to catch a taxi, ordering free taxis in real time, and estimating taxi fares aims to improve the experience of passengers, e.g., [1]. One interesting work detected anomalous taxi rides and warned passengers “on the fly” that they were being taken on an unnecessary detour [5]. For city planners, taxi trajectory data provides a rich data source to identify flaws in city planning, probe traffic conditions, estimate travel demand, infer land-use efficiency, suggest bus routes, etc. [2]. Recent studies also combine taxi trajectory data with other data sources such as POI data, Foursquare check-in data, and Flickr image data to enable smarter applications, such as inferring building functions, personalized travel route planning, hitchhiking package delivery, and so on [6], [7], [36]. However, to the best of our knowledge, this is the first study on inferring trip purposes in real time by leveraging the complementary knowledge embedded in multi-sourced urban data.

III. BASIC CONCEPTS AND PROBLEM STATEMENT

A. Basic Concepts

Definition 1 (Road Network): A road network is a graph G(N, E) consisting of a node set N and an edge set E, where each element n in N is an intersection with a pair of longitude and latitude coordinates (x, y) representing its spatial location. The edge set E is a subset of the Cartesian product N × N. Each element e(u, v) in E is a street connecting node u and node v, which has several attributes including speed limit, number of lanes, and street level.³

Definition 2 (A Taxi Drop-Off Point): A taxi drop-off point (p_i) is defined as a time-stamped location where the passenger was dropped off, denoted by ((x_i, y_i), t_i).

Definition 3 (POI Category): A POI category is a semantic label for a place, indicating the correlation between the place and potential human activities.

Foursquare maintains a three-level ontology structure for category description [6]. The first level has 9 categories in total. The second and third levels have 412 sub-/sub-subcategories in total. Table I shows the trip purposes (travel activities) and the corresponding primary POI categories [15].

Definition 4 (A Check-In): A check-in is represented by a triple c_k = (u_id, v_id, t_i), indicating that a user with id u_id checked in at a venue (i.e., POI) with id v_id at time t_i using Foursquare.

In general, a POI (venue) that is frequently checked in at by many users is popular and attractive. In addition, Foursquare provides the physical coordinates, tags, and opening-hours information of any given venue.
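As a minimal sketch of how check-in triples can be turned into an activity distribution for a given time of day, the Python snippet below counts check-ins per POI category within one hour of the day. The `CheckIn` type, the `category_distribution` helper, the `venue_category` mapping, and the sample records are illustrative assumptions; they are not the paper's implementation or data.

```python
from collections import Counter
from datetime import datetime
from typing import NamedTuple


class CheckIn(NamedTuple):
    """A check-in triple (u_id, v_id, t_i) as in Definition 4."""
    user_id: str
    venue_id: str
    time: datetime


def category_distribution(checkins, venue_category, hour):
    """Empirical share of each POI category among check-ins made in `hour`."""
    counts = Counter(
        venue_category[c.venue_id]
        for c in checkins
        if c.time.hour == hour and c.venue_id in venue_category
    )
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}


# Hypothetical sample data, for illustration only.
checkins = [
    CheckIn("u1", "v1", datetime(2013, 5, 1, 12, 5)),
    CheckIn("u2", "v1", datetime(2013, 5, 2, 12, 40)),
    CheckIn("u3", "v2", datetime(2013, 5, 2, 12, 10)),
    CheckIn("u1", "v2", datetime(2013, 5, 3, 20, 0)),
]
venue_category = {"v1": "Food", "v2": "Shop & Service"}
noon = category_distribution(checkins, venue_category, hour=12)
```

In this toy example, two of the three noon check-ins fall in the hypothetical "Food" venue, so that category receives the largest share for that hour.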

³ The road network can be crawled from an open crowdsourced platform, i.e., OpenStreetMap. Refer to www.openstreetmap.org for more details.

TABLE I
NINE TRIP PURPOSES AND THE CORRESPONDING PRIMARY POI CATEGORIES

Definition 5 (Response Time): The response time is defined as the time difference between the drop-off time, when the passenger gets off the taxi, and the time when the passenger receives the recommendation services.

B. Problem Statement

Inferring taxi trip purposes by leveraging multi-sourced urban data can be viewed as predicting the probabilities of taking each of the nine activities, which can be formulated as follows.

Given:

1) A drop-off point ((x_r, y_r), t_r), which is generated in real time;

2) A set of historical check-ins {(u_id, v_id, t_i)} (e.g., from the last month), together with the check-ins accumulated during the several hours before t_r in the designated city;

3) The POIs in the designated city, which can be obtained from the check-in data;

4) A road network G(N, E) of the designated city.

Predict the probabilities of taking each of the nine activities for the drop-off point (the objective of Phase I), and provide timely service recommendations related to the top-ranked trip purposes (the activities with the highest probabilities) for the passenger (the objective of Phase II).

IV. PHASE I: IMPUTING TRIP PURPOSES

A. Urban Activity Region Identification

Human beings are social by nature (i.e., most people live and work together with others), so it is highly likely that people take activities in a small and scattered fraction of the whole city space. A preliminary step for inferring the travel purpose of passengers is to identify all the scattered activity regions in the whole urban space. To ease the presentation, we call these regions Urban Activity Regions (UARs).

Urban activity regions are bounded and separated by physical barriers such as main roads, rivers, and mountains, as can be witnessed in the history of human civilization and urbanization [28], [40]. Each UAR is isolated and bounded by main road segments (or rivers), and covers several neighborhoods and narrow streets. Inside each UAR, passengers can easily move between two points that are close to each other. Usually, passengers who get off a taxi on one side of a primary road will not cross it (i.e., go to the other side) to take activities, because the road is a huge barrier. In contrast, when getting off a taxi on a small and narrow street, passengers can easily walk in any direction. Based on the above observations, in this paper we mainly rely on the road network data to identify the UARs in the target city. We propose a two-step procedure to divide the whole city into a number of disjoint UARs.

Fig. 1. Illustrative example of determining the region that a given POI belongs to (top left); illustrative examples of assigning a huge number of POIs to regions (top right and bottom left); the identified CAAs for the illustrative example (bottom right).

• Step 1: We extract the road network data, including the coordinates of nodes, the edges, and the attributes of edges (e.g., number of lanes, speed limits, road levels/types), from an open crowdsourced platform, i.e., OpenStreetMap. Using the road level/type attributes, we keep only the high-level road segments, i.e., those tagged as ‘motorway’, ‘trunk’, or ‘primary’.

• Step 2: To the trimmed road network, which consists only of high-level road segments, we apply the image-processing-based map segmentation algorithm in [37] to obtain connected components. Each connected component is one separate urban activity region (UAR; R_1∼R_5 in Fig. 1).
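Step 1 above amounts to a filter on the road type attribute. A minimal Python sketch follows, assuming each edge carries the value of its OpenStreetMap `highway` tag; the `Edge` type, the `trim_road_network` helper, and the sample edges are hypothetical and only illustrate the filtering idea. The map segmentation of Step 2 (the algorithm in [37]) is not reproduced here.

```python
from dataclasses import dataclass

# Road types kept in Step 1, as named in the text.
HIGH_LEVEL = {"motorway", "trunk", "primary"}


@dataclass(frozen=True)
class Edge:
    """A street e(u, v) with its road type (assumed OSM `highway` tag value)."""
    u: int
    v: int
    highway: str


def trim_road_network(edges):
    """Keep only the high-level road segments."""
    return [e for e in edges if e.highway in HIGH_LEVEL]


# Hypothetical toy network.
edges = [
    Edge(1, 2, "primary"),
    Edge(2, 3, "residential"),
    Edge(3, 4, "motorway"),
    Edge(4, 1, "tertiary"),
]
trimmed = trim_road_network(edges)
```

Only the `primary` and `motorway` edges survive the filter in this toy network; the remaining segments would then be passed to the segmentation step.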

B. Candidate Activity Area Identification

It is well known that POIs are the most common activity unit for human beings. When people travel by taxi, on one hand, they always prefer to get off as close to the true destination as possible. On the other hand, in a modern city, there are usually many different categories of POIs located in the same building (e.g., a shopping mall). In this respect, people are more likely to be attracted by the nearest one or two buildings after getting off a taxi. Hence, we propose the concept of a candidate activity area (CAA), in which different POIs are located close to each other. The CAAs correspond to small areas, and we use the CAA as the activity unit for taxi passengers.

To identify such a CAA, we first determine which UAR a given POI belongs to. Then, we aggregate the POIs belonging to the same UAR into several clusters based on spatial proximity. Finally, we identify each cluster as a CAA. In this sense, a UAR contains several CAAs. However, assigning POIs to UARs is quite challenging, since we have to address the following two issues:

1) Each UAR is usually of an arbitrary shape, so we cannot simply compare the POI locations with the locations of the UAR boundaries. A simpler but essential problem is the point-in-polygon problem [31], i.e., the problem of determining whether a given point is inside or outside a given closed polygon (i.e., region), which is proved to be hard [17].

2) The number of POIs is huge (e.g., there are more than 10k POIs in Manhattan, NYC), and efficiently determining which UAR each POI is located in is also challenging.

Algorithm 1 Algorithm for Determining the Region That a Given POI Belongs to

Input: a given POI (p_i); the trimmed road network and the identified UARs in the target city;

Output: the UAR in which the given point is located, denoted by R_i = PinR(p_i).

• Step 1: Based on the location of the given point (p_i), find its nearest node n_i;

• Step 2: According to the identified node n_i and the topology of the high-level road network, identify all the regions that share the node n_i. We denote these regions by {R_i};

• Step 3.a: For each region in {R_i}, apply the ifPinR algorithm to check whether p_i is inside that region;

• Step 3.b: The loop ends when ifPinR returns 1.

To deal with the first issue, we apply a popular and mature algorithm to determine the relationship (i.e., inside or outside) between a given point and a given region [18]. For simplicity, we denote the algorithm by ifPinR(p_i, r_i). If the point p_i is inside region r_i, ifPinR(p_i, r_i) returns 1; otherwise, it returns 0. To determine which region a given point belongs to, we propose an algorithm that calls ifPinR repeatedly; its pseudo-code is presented in Algorithm 1. For the given point, Step 1 and Step 2 identify all the possible regions that it may belong to, according to the geometrical relationship in the space. Note that a region is represented by a sequence of nodes in the clockwise direction. For instance, the possible regions for p_i in the illustrative example (top left of Fig. 1) are marked as R_1, R_2, and R_3. Step 3 shows the repeated calls to ifPinR. The number of loop iterations is usually small, since the possible region set contains only a few regions. In the best case, the number of iterations is 1, while in the worst case it equals the size of the possible region set. The number of iterations is 1 for the illustrative example, since ifPinR returns 1 when checking R_1 in the first iteration.
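The procedure in Algorithm 1 can be sketched in Python as follows. The paper does not state which point-in-polygon test [18] refers to, so this sketch swaps in the standard ray casting (even-odd) test as an assumption. It also treats coordinates as planar, and the names `if_pin_r` and `pin_r` and the toy nodes and regions are hypothetical.

```python
import math


def if_pin_r(point, polygon):
    """Ray casting (even-odd rule): return 1 if point is inside polygon, else 0."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return 1 if inside else 0


def pin_r(point, nodes, regions):
    """Sketch of Algorithm 1: return the name of the region containing point."""
    # Step 1: nearest node of the trimmed road network.
    nearest = min(nodes, key=lambda n: math.dist(point, nodes[n]))
    # Step 2: candidate regions are those whose boundary shares that node.
    candidates = [name for name, ids in regions.items() if nearest in ids]
    # Step 3: test each candidate and stop at the first hit.
    for name in candidates:
        polygon = [nodes[i] for i in regions[name]]
        if if_pin_r(point, polygon):
            return name
    return None


# Hypothetical toy layout: two adjacent square regions, nodes listed clockwise.
nodes = {0: (0, 0), 1: (2, 0), 2: (4, 0), 3: (0, 2), 4: (2, 2), 5: (4, 2)}
regions = {"R1": [0, 3, 4, 1], "R2": [1, 4, 5, 2]}
region = pin_r((2.5, 0.5), nodes, regions)
```

Here the nearest node (node 1) is shared by both regions, so the loop checks R1 first, gets 0, and then returns R2 on the second iteration.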

To deal with the second issue, a straightforward but computationally expensive method is to check each POI using Algorithm 1. In theory, the computational complexity is O(N × M × C), where N is the number of POIs, M is the average number of possible regions for a given POI (which is usually small), and O(C) is the complexity of the ifPinR algorithm. Therefore, to accelerate the computation, we should reduce the number of POIs to be checked. In fact, it is unnecessary to check some POIs. More specifically, once we have determined the region in which a given POI is located, we can directly infer with high confidence that its ‘nearby’ POIs are located inside the same region. Inspired by this observation, we propose a novel and efficient algorithm to determine the regions of the POIs. Briefly, the algorithm mainly consists of random POI selection, point-in-region determination, and cell growing, as illustrated in Algorithm 2.

Algorithm 2 Algorithm for Determining Regions That a Huge Number of POIs Belong to

Input: a pool of POIs ($\{p_i\}$) and a set of UARs ($\{R_i\}$) in the target city;

Output: $\{R_i\}$ = PinR($\{p_i\}$).

• Step 1: Randomly select a POI from $\{p_i\}$ (e.g., $p_s$);

Step 1.1: $R_s$ = PinR($p_s$);

• Step 2: Taking $p_s$ as the center, get a grid cell with equal width and length ($g_0$);

Step 2.1: $g_i = g_0$;

• Step 3: If $g_i$ has no intersection with the boundary of $R_s$, then

Step 3.1: Identify all POIs inside the grid cell based on the geometric relationship (denoted by $P_{sub}(g_i)$);

Step 3.2: All POIs in $P_{sub}(g_i)$ should be located in $R_s$;

Step 3.3: $\{p_i\} = \{p_i\} - P_{sub}(g_i)$;

Step 3.4: Increase the grid cell size by 50% ($g_{i+1} = 1.5 \times g_i$);

• Step 4: Repeat Steps 1–3 until $\{p_i\}$ is empty.

In the first step, we randomly pick a POI from the pool and determine which region the selected POI belongs to (Step 1.1) based on Algorithm 1. In the second step, we determine a grid cell with the selected POI as the center. Fig. 1 (top right) demonstrates the result after the first two steps. All POIs inside the grid cell should be located in the same region as the selected POI if there is no intersection between the grid cell and the region boundaries (Steps 3.1 and 3.2, respectively). Thus, there is no need to check those POIs, and we can remove them from the POI pool directly (Step 3.3). To further increase the number of POIs that need no check, the grid cell grows to contain more POIs (Step 3.4), as demonstrated in Fig. 1 (bottom left). If the grid cell ($g_i$) crosses the region boundary, the algorithm restarts the whole procedure from the first step by randomly selecting a new POI. The process terminates when there is no POI left in the pool (Step 4). Finally, each POI is associated with a label of the region that it belongs to.
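The cell-growing idea can be illustrated with a minimal sketch. Several choices below are assumptions rather than details from the source: regions are taken to be convex polygons (so a square cell lies inside a region exactly when its four corners do), the point-in-polygon test is ray casting, the seed POI is always labeled and removed after its explicit look-up, the look-up scans all regions instead of a pre-filtered candidate set, and the names `label_pois`, `g0`, and `growth` are hypothetical.

```python
import random


def point_in_polygon(p, polygon):
    """Ray-casting test: True if point p is inside the polygon."""
    x, y = p
    inside = False
    for k in range(len(polygon)):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % len(polygon)]
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


def cell_inside_region(center, half, polygon):
    """For a CONVEX polygon, a square cell is inside iff all four corners are."""
    cx, cy = center
    corners = [(cx - half, cy - half), (cx - half, cy + half),
               (cx + half, cy + half), (cx + half, cy - half)]
    return all(point_in_polygon(c, polygon) for c in corners)


def label_pois(pois, regions, g0=1.0, growth=1.5, seed=0):
    """pois: {poi_id: (x, y)}; regions: {region_id: convex polygon}.

    Returns ({poi_id: region_id or None}, number of explicit region look-ups).
    """
    rng = random.Random(seed)
    pool = dict(pois)
    labels, lookups = {}, 0
    while pool:                                            # Step 4
        s = rng.choice(sorted(pool))                       # Step 1
        ps = pool.pop(s)
        rs = next((r for r, poly in regions.items()
                   if point_in_polygon(ps, poly)), None)   # Step 1.1
        lookups += 1
        labels[s] = rs
        if rs is None:
            continue
        half = g0 / 2                                      # Step 2
        while cell_inside_region(ps, half, regions[rs]):   # Step 3
            covered = [q for q, (x, y) in pool.items()
                       if abs(x - ps[0]) <= half and abs(y - ps[1]) <= half]
            for q in covered:
                labels[q] = rs      # same region as the seed, no look-up needed
                del pool[q]
            half *= growth          # grow the cell by 50%
    return labels, lookups


regions = {"A": [(0, 0), (0, 10), (10, 10), (10, 0)],
           "B": [(10, 0), (10, 10), (20, 10), (20, 0)]}
pois = {"p1": (5, 5), "p2": (5.2, 5.1), "p3": (4.8, 5.3),
        "p4": (15, 5), "p5": (15.3, 4.9)}
labels, lookups = label_pois(pois, regions)
print(labels, lookups)
```

In this toy input the five POIs form two tight groups, so far fewer explicit look-ups than POIs are needed; for non-convex regions the corner test would have to be replaced by a proper cell-boundary intersection test.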

For POIs inside the same UAR (POIs with the same region label), we apply the popular DBSCAN algorithm to get clusters since the algorithm can identify clusters with different density and shape [11]. POIs that are close to each other and within the same UAR are identified as a Candidate Activity Area (CAA). However, as demonstrated in Fig. 1 (bottom right), POIs scattered across different UARs are grouped into different CAAs, even if they are close to each other.
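The second clustering stage can be sketched as DBSCAN applied separately inside each UAR. The compact DBSCAN below is a textbook-style, pure-Python version written for illustration; the parameter values (`eps`, `min_pts`), the function names, and the toy coordinates are assumptions, and the source does not state which implementation or parameters were used.

```python
from collections import defaultdict
from math import hypot


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise.

    A point's neighbourhood includes the point itself.
    """
    labels = [None] * len(points)

    def neighbours(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1                  # noise (may later become a border point)
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster         # border point: join, do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:          # core point: expand the cluster
                queue.extend(nb)
    return labels


def identify_caas(pois, region_of, eps, min_pts):
    """Cluster POIs per UAR; a CAA id is (region, cluster) or None for noise."""
    by_region = defaultdict(list)
    for poi_id in pois:
        by_region[region_of[poi_id]].append(poi_id)
    caas = {}
    for region, ids in by_region.items():
        labels = dbscan([pois[p] for p in ids], eps, min_pts)
        for p, lab in zip(ids, labels):
            caas[p] = None if lab == -1 else (region, lab)
    return caas


pois = {"a1": (0, 0), "a2": (1, 0), "a3": (0, 1), "a4": (100, 100),
        "b1": (2, 0), "b2": (3, 0), "b3": (2, 1)}
region_of = {p: ("A" if p.startswith("a") else "B") for p in pois}
caas = identify_caas(pois, region_of, eps=1.5, min_pts=2)
print(caas)
```

Here `a2` and `b1` are only one unit apart, yet they end up in different CAAs because they carry different region labels, which mirrors the behaviour described for Fig. 1 (bottom right).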

Remark: Although the clustering and identification of CAAs can be done offline, accelerating the procedure is still valuable, since a city has a huge number of points of interest and dozens of regions. Moreover, POIs in the city are dynamic: some disappear while others emerge, necessitating regular updates of the CAAs. Thus, an efficient algorithm for clustering and identifying CAAs is desirable.

C. Trip Purpose Imputation

The objective of trip purpose imputation is to predict the POI category that the passenger intends to visit after getting off the taxi, given the drop-off location and the drop-off time. We denote the drop-off information of the passenger by $((x, y), t)$. To infer the trip purpose correctly, several factors need to be considered. The first is the distance from the passenger's final destination to the drop-off location. In more detail, the closer a POI is to the drop-off point, the more likely it is to be visited, since taxis offer door-to-door services. Under such circumstances, most passengers prefer to get off as close as possible to their final destination. The second factor is the distribution of POI categories near the drop-off point. For a trip heading to an area mostly covered by restaurants, the purpose is probably dining. Last but not least, the alighting time is also vital, as people take different activities at different times.

To integrate the above three factors comprehensively, we take the following three major steps. First, given the location of the drop-off point, we select the top-$k$ nearest CAAs within the walkable distance (e.g., 500 meters). We note that passengers visit the top-$k$ CAAs with different probabilities. That is, the closer a CAA is to the drop-off point, the higher the probability that it will be visited, which exhibits the distance decay effect. Specifically, the probability that a CAA will be visited is determined by Eq. 1.

$$P(\mathrm{CAA}_i \mid (x, y)) \propto (d_i)^{-\beta} \quad \text{s.t.} \quad \sum_{i=1}^{k} P(\mathrm{CAA}_i \mid (x, y)) = 1 \tag{1}$$

where $d_i$ refers to the Euclidean distance from the center of $\mathrm{CAA}_i$ to the drop-off point $(x, y)$ of the passenger; $k$ is the number of nearby CAAs considered, which is set to 3 in our study; and $\beta$ is the distance decay parameter. We set $\beta = 1.5$, which is consistent with existing findings in [6] and [20].
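Eq. 1 can be evaluated directly once the CAA centers are known. The sketch below uses the stated settings ($k = 3$, $\beta = 1.5$, a 500 m walkable distance); the function name, the planar coordinates in meters, and the small lower bound on the distance (to avoid division by zero when a drop-off coincides with a CAA center) are assumptions made for illustration.

```python
from math import hypot


def caa_visit_probabilities(drop_off, caa_centers, k=3, beta=1.5, walkable=500.0):
    """Eq. 1: P(CAA_i | (x, y)) proportional to d_i ** (-beta), normalised over top-k."""
    x, y = drop_off
    # Distance from the drop-off point to every CAA center.
    dists = [(hypot(cx - x, cy - y), caa) for caa, (cx, cy) in caa_centers.items()]
    # Keep the k nearest CAAs that are within the walkable distance.
    nearest = sorted(d for d in dists if d[0] <= walkable)[:k]
    # Assumption: clamp tiny distances so the power is always defined.
    weights = {caa: max(d, 1e-9) ** (-beta) for d, caa in nearest}
    total = sum(weights.values())
    return {caa: w / total for caa, w in weights.items()}


centers = {"C1": (100, 0), "C2": (0, 200), "C3": (-400, 0), "C4": (900, 0)}
probs = caa_visit_probabilities((0, 0), centers)
print(probs)
```

With these illustrative centers, `C4` lies beyond the walkable distance and is dropped, and the remaining probabilities decrease with distance as the distance decay effect requires.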

Second, even if the visited CAA has been determined, predicting the POI category for passengers is still challenging, because the CAA contains different POIs, each with its own category and visiting popularity [15]. To alleviate the issue, inside a determined CAA (e.g., $\mathrm{CAA}_i$), we compute the probability of visiting each POI category (i.e., taking each activity) based on Bayes' theorem [21], as shown in Eqs. 2 and 3.

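$$P(a_j \mid (x, y), t, \mathrm{CAA}_i) = \frac{P((x, y) \mid a_j, t, \mathrm{CAA}_i) \times P(a_j \mid t, \mathrm{CAA}_i) \times P(t, \mathrm{CAA}_i)}{P((x, y), t, \mathrm{CAA}_i)} \tag{2}$$

$$P((x, y), t, \mathrm{CAA}_i) = \sum_{j=1}^{n} P((x, y) \mid a_j, t, \mathrm{CAA}_i) \times P(a_j \mid t, \mathrm{CAA}_i) \times P(t, \mathrm{CAA}_i) \tag{3}$$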

Here, $n$ is the total number of activities considered in the paper; $P((x, y) \mid a_j, t, \mathrm{CAA}_i)$ represents the probability that a passenger gets off the taxi at location $(x, y)$ if he/she has decided to take the activity $a_j$ at $\mathrm{CAA}_i$ at time $t$. Gong et al. [15] simply assume that the location and the time of the drop-off point are conditionally independent given the activity type ($a_j$), i.e., that the following equation holds.

$$P((x, y) \mid a_j, t, \mathrm{CAA}_i) = P((x, y) \mid a_j, \mathrm{CAA}_i) \tag{4}$$

However, we argue that Eq. 4 does not hold in most cases, since where passengers choose to get off depends not only on the nearby land use (i.e., spatial context) [9], [33], but also on the alighting time. For example, passengers may get off near a shopping plaza to shop, whereas in the evening they might get off in a business district to have a meal. In other words, the location and the time of the drop-off point are interdependent. Here, we approximate the true value of $P((x, y) \mid a_j, t, \mathrm{CAA}_i)$ by considering the attractiveness of the CAA and its distribution of POIs over categories collectively, as shown in Eq. 5.

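$$P((x, y) \mid a_j, t, \mathrm{CAA}_i) \propto \frac{\mathrm{numberofPOIs}(a_j, \mathrm{CAA}_i)}{\mathrm{numberofPOIs}(\mathrm{CAA}_i)} \times A_i(t) \quad \text{s.t.} \quad \sum_{j=1}^{n} P((x, y) \mid a_j, t, \mathrm{CAA}_i) = 1 \tag{5}$$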

$\mathrm{numberofPOIs}(\mathrm{CAA}_i)$ and $\mathrm{numberofPOIs}(a_j, \mathrm{CAA}_i)$ in Eq. 5 refer to the number of POIs and the number of POIs related to $a_j$ within $\mathrm{CAA}_i$, respectively; $A_i(t)$ refers to the attractiveness of $\mathrm{CAA}_i$ at the given time slot, which is measured by the popularity of $\mathrm{CAA}_i$ at that time compared to the other CAAs in the top-$k$ list. In more detail, we calculate $A_i(t)$ by dividing the number of check-ins of $\mathrm{CAA}_i$ by the total number of check-ins of all top-$k$ CAAs during the given time slot in the historical days (e.g., last month), as shown in Eq. 6. Note that it is easy to extract the check-ins and the categories of POIs from the Foursquare platform.

$$A_i(t) = \frac{\mathrm{numberofCheckins}(\mathrm{CAA}_i, t, \mathit{days})}{\sum_{i=1}^{k} \mathrm{numberofCheckins}(\mathrm{CAA}_i, t, \mathit{days})} \tag{6}$$
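Eqs. 5 and 6 reduce to simple ratios of counts. The sketch below is a worked illustration with made-up counts; the function names and the dictionary inputs are hypothetical, and all totals are assumed to be positive.

```python
def attractiveness(checkins_by_caa):
    """Eq. 6: share of check-ins of each top-k CAA in the given time slot."""
    total = sum(checkins_by_caa.values())
    return {caa: n / total for caa, n in checkins_by_caa.items()}


def location_likelihood(poi_counts, a_i_t):
    """Eq. 5: category share of POIs times A_i(t), normalised over categories."""
    total_pois = sum(poi_counts.values())
    raw = {a: (n / total_pois) * a_i_t for a, n in poi_counts.items()}
    z = sum(raw.values())
    return {a: v / z for a, v in raw.items()}


# Check-ins of the top-3 CAAs in one time slot (illustrative numbers).
a_t = attractiveness({"CAA1": 30, "CAA2": 50, "CAA3": 20})
# POI counts per category inside CAA1 (illustrative numbers).
likelihood = location_likelihood({"Dining": 6, "Shopping": 3, "Work": 1}, a_t["CAA1"])
print(a_t, likelihood)
```

Because the constraint in Eq. 5 normalises over the activities of a single CAA, the factor $A_i(t)$ is common to all categories of that CAA and only rescales the unnormalised scores.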


$P(a_j \mid t, \mathrm{CAA}_i)$ in Eq. 2 is the probability of taking activity $a_j$ if the passenger is located in $\mathrm{CAA}_i$ at time $t$. The distribution of $P(a_j \mid t, \mathrm{CAA}_i)$ depends on the spatial and temporal patterns of human activity in that area. It is well recognized that human behaviours in terms of taking activities present strong and regular patterns. For instance, with respect to the time dimension, the probability of visiting work-related places during 8:00 am–10:00 am is generally much higher than that of visiting shopping malls. With respect to the space dimension, the case may vary depending on geographical areas. To capture such temporal and spatial regularities at a fine granularity, we again rely on the check-ins from Foursquare. Given the time $t$ and candidate activity area $\mathrm{CAA}_i$, we approximate the probability of visiting a certain POI category (i.e., taking the activity $a_j$) by the ratio of the number of check-ins in the given POI category to the total number of check-ins in $\mathrm{CAA}_i$ during the given time slot in the historical days (e.g., last month), as shown in Eq. 7.

$$P(a_j \mid t, \mathrm{CAA}_i) = \frac{\mathrm{numberofCheckins}(a_j, \mathrm{CAA}_i, t, \mathit{days})}{\mathrm{numberofCheckins}(\mathrm{CAA}_i, t, \mathit{days})} \tag{7}$$

Although strong and regular patterns (i.e., regularity) of human behaviours are frequently observed, dynamics is another salient feature. For instance, human behaviours are interrupted and changed by sudden, large social events. To capture such changes, we propose to incorporate the freshest check-ins in the studied area, since live data may reflect the affected human activities in a timely manner. Therefore, the probability is updated by Eq. 8.

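$$P(a_j \mid t, \mathrm{CAA}_i) \propto \alpha \times \underbrace{\frac{\mathrm{numberofCheckins}(a_j, \mathrm{CAA}_i, t, \mathit{days})}{\mathrm{numberofCheckins}(\mathrm{CAA}_i, t, \mathit{days})}}_{\text{regularity}} + (1 - \alpha) \times \underbrace{\frac{\mathrm{numberofCheckins}(a_j, \mathrm{CAA}_i, t, 4h)}{\mathrm{numberofCheckins}(\mathrm{CAA}_i, t, 4h)}}_{\text{dynamic}} \tag{8}$$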

where $\mathrm{numberofCheckins}(a_j, \mathrm{CAA}_i, t, 4h)$ refers to the number of check-ins in the given POI category and $\mathrm{numberofCheckins}(\mathrm{CAA}_i, t, 4h)$ indicates the total number of check-ins in the area of $\mathrm{CAA}_i$, both counted over the most recent four hours just before time $t$. $\alpha$ is a weighting factor (we set $\alpha = 0.9$ in this study). We note that the probability obtained by Eq. 8 needs to be normalized, i.e., $\sum_{j=1}^{n} P(a_j \mid t, \mathrm{CAA}_i) = 1$, with $n$ being the total number of activities considered in the paper.
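The blend in Eq. 8 followed by the normalization can be sketched as below, using the stated $\alpha = 0.9$. The function name and the count dictionaries are hypothetical, and treating an empty four-hour window as an all-zero "dynamic" term is an assumption (the source does not say how that case is handled); at least one historical check-in is assumed.

```python
def activity_prior(hist_counts, recent_counts, alpha=0.9):
    """Eq. 8: alpha * regularity share + (1 - alpha) * dynamic share, normalised."""
    activities = set(hist_counts) | set(recent_counts)

    def share(counts):
        total = sum(counts.values())
        # Assumption: an empty window contributes zero for every activity.
        return {a: (counts.get(a, 0) / total if total else 0.0) for a in activities}

    regular = share(hist_counts)      # check-ins over the historical days
    dynamic = share(recent_counts)    # check-ins over the last four hours
    blended = {a: alpha * regular[a] + (1 - alpha) * dynamic[a] for a in activities}
    z = sum(blended.values())
    return {a: v / z for a, v in blended.items()}


# Historical counts say "Dining"; the last four hours are dominated by "Recreation".
prior = activity_prior({"Dining": 80, "Shopping": 20},
                       {"Dining": 1, "Recreation": 9})
print(prior)
```

With $\alpha = 0.9$ the historical pattern still dominates, but an activity that appears only in the fresh check-ins receives non-zero probability.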

$P(t, \mathrm{CAA}_i)$ in Eq. 2 is the probability of taking activities in $\mathrm{CAA}_i$ after the passenger gets off the taxi at time $t$, which is computed by Eq. 9.

$$P(t, \mathrm{CAA}_i) = P(t) \times P(\mathrm{CAA}_i \mid t) \tag{9}$$

Fig. 2. Illustration for the computation of $P(t, \mathrm{CAA}_i)$. The value in a grid cell refers to the probability of taking an activity in the corresponding CAA after the end of the corresponding trip.

The probability of a passenger getting off a taxi at time $t$ (i.e., $P(t)$) differs across times of the day, since human activity has strong time regularity. $P(t)$ can be estimated by the ratio of the number of drop-offs during the given time slot to the number of drop-offs during the whole day. The computation of $P(\mathrm{CAA}_i \mid t)$ is a bit more complicated. To explain how to compute it, we use the example shown in Fig. 2. Suppose that 6 taxi trips occurred during the given time slot and that 8 CAAs have been identified. For each taxi trip, the passenger would choose one of the CAAs to take activities after getting off. Furthermore, as discussed earlier in the section, for each trip we assume that the passenger takes activities in one of the top-$k$ CAAs within the walkable distance. In the example, the value of a grid cell (e.g., $g_{ij}$) refers to the probability of the passenger from taxi trip $tr_j$ taking an activity in area $\mathrm{CAA}_i$, which is computed based on Eq. 1. For each time slot, the probability of taking an activity in a given CAA ($\mathrm{CAA}_i$) is just the average of the corresponding row values, i.e.,

$$P(\mathrm{CAA}_i \mid t) = \frac{\sum_{m=1}^{N} g_{im}}{N} \tag{10}$$

where $N$ is the number of taxi trips that occurred in the studied time slot.
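Eqs. 9 and 10 amount to a row average followed by a multiplication. The sketch below uses a small made-up matrix (3 CAAs by 2 trips rather than the 8 by 6 example of Fig. 2); the function names and the drop-off counts are hypothetical.

```python
def caa_given_time(g):
    """Eq. 10: g[i][m] is the probability that trip m's passenger visits CAA i.

    Returns the row means, i.e., P(CAA_i | t) for each CAA.
    """
    return [sum(row) / len(row) for row in g]


def p_time(drop_offs_in_slot, drop_offs_in_day):
    """P(t): share of the day's drop-offs that fall into the time slot."""
    return drop_offs_in_slot / drop_offs_in_day


def p_time_and_caa(g, drop_offs_in_slot, drop_offs_in_day):
    """Eq. 9: P(t, CAA_i) = P(t) * P(CAA_i | t)."""
    pt = p_time(drop_offs_in_slot, drop_offs_in_day)
    return [pt * p for p in caa_given_time(g)]


# Rows: CAAs; columns: trips in the slot. Each column sums to 1 (from Eq. 1).
g = [[0.5, 0.0],
     [0.3, 0.6],
     [0.2, 0.4]]
row_means = caa_given_time(g)
joint = p_time_and_caa(g, drop_offs_in_slot=6, drop_offs_in_day=60)
print(row_means, joint)
```

Since every column of `g` sums to 1, the row means also sum to 1, so $P(\mathrm{CAA}_i \mid t)$ is a proper distribution over the CAAs.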

In summary, for the taxi trip $(x, y, t)$, the probability of the passenger taking a given activity $a_j$ after getting off the taxi can be approximated by the following equation.

$$P(a_j \mid (x, y), t) \propto P(\mathrm{CAA}_i \mid (x, y)) \times P(a_j \mid (x, y), t, \mathrm{CAA}_i) \quad \text{s.t.} \quad \sum_{j=1}^{n} P(a_j \mid (x, y), t) = 1 \tag{11}$$
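The pieces can be combined as in the following sketch. `posterior_in_caa` follows Eqs. 2 and 3 directly. Eq. 11 as printed does not spell out how the contributions of the top-$k$ CAAs are aggregated, so `trip_purpose` sums the weighted posteriors over the CAAs before normalizing; that summation, like the function names and the numbers, is an assumption made for illustration.

```python
def posterior_in_caa(likelihood, prior, p_t_caa):
    """Eqs. 2-3: posterior over activities inside one CAA.

    likelihood[a] = P((x, y) | a, t, CAA_i), prior[a] = P(a | t, CAA_i),
    p_t_caa = P(t, CAA_i).
    """
    joint = {a: likelihood[a] * prior[a] * p_t_caa for a in prior}  # numerator of Eq. 2
    evidence = sum(joint.values())                                  # Eq. 3
    return {a: v / evidence for a, v in joint.items()}


def trip_purpose(p_caa, per_caa_posterior):
    """Eq. 11, aggregated over the top-k CAAs by summation (assumption)."""
    scores = {}
    for caa, post in per_caa_posterior.items():
        for a, p in post.items():
            scores[a] = scores.get(a, 0.0) + p_caa[caa] * p
    z = sum(scores.values())
    return {a: s / z for a, s in scores.items()}


post_c1 = posterior_in_caa({"Dining": 0.6, "Shopping": 0.4},
                           {"Dining": 0.5, "Shopping": 0.5}, p_t_caa=0.02)
post_c2 = {"Dining": 0.2, "Shopping": 0.8}
result = trip_purpose({"C1": 0.75, "C2": 0.25}, {"C1": post_c1, "C2": post_c2})
print(post_c1, result)
```

Note that $P(t, \mathrm{CAA}_i)$ appears in both the numerator of Eq. 2 and in Eq. 3, so it does not change the posterior inside a single CAA.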

V. PHASE II: ENABLING REAL-TIME RESPONSE

To enable a real-time response for each drop-off event (i.e., to compute the posterior probability of taking each activity for each drop-off point using Bayes' theorem in real time), we need to identify the most time-consuming component. As discussed in Section III, the posterior probability calculation mainly consists of four components, the details of which are shown in Table II.

TABLE II
DETAILS ON EACH COMPONENT OF INFERRING TRIP PURPOSES

Fig. 3. A schematic diagram of reducing the time complexity of the first component. The value on each edge is the visiting probability to the corresponding CAA.

As shown in the table, the first component is related to the probability of visiting a given candidate activity area ($\mathrm{CAA}_i$) if the passenger was dropped off at point $(x, y)$. This probability would otherwise have to be computed online, because the distances to the top-$k$ nearest CAAs vary if passengers get off at different points. However, we argue that two drop-off points that are close to each other have similar values of $P(\mathrm{CAA}_i \mid (x, y))$, i.e., $P(\mathrm{CAA}_i \mid (x_1, y_1)) \approx P(\mathrm{CAA}_i \mid (x_2, y_2))$ if $(x_1, y_1)$ is close to $(x_2, y_2)$. Hence, we aggregate historical drop-off points into drop-off clusters and assume that all drop-off points in the same cluster have an equal value of $P(\mathrm{CAA}_i \mid (x, y))$. In this way, the value of the first component can be pre-computed offline. The only online job is to identify which drop-off cluster a newly received real-time drop-off point belongs to, which is quite efficient. In this manner, the computation time can be reduced significantly. As shown in Fig. 3, the top-$k$ CAAs of a drop-off cluster can be identified, and the distance to each CAA is measured between the centroid of the drop-off cluster and the centroid of that CAA. Thus, the probability of visiting $\mathrm{CAA}_i$ from a drop-off point inside the drop-off cluster can be calculated offline efficiently. We note that many drop-off clusters can be obtained in advance, given the historical taxi trip data. Each drop-off cluster is associated with $k$ visiting probabilities to its nearby top-$k$ CAAs.
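The offline/online split for the first component can be sketched as a pre-computed table plus a cheap look-up. The source does not specify how a streaming drop-off point is matched to a drop-off cluster, so the nearest-centroid rule below is an assumption, as are the function names, the planar coordinates in meters, and the distance clamp.

```python
from math import hypot


def build_cluster_table(cluster_centroids, caa_centers, k=3, beta=1.5, walkable=500.0):
    """Offline: for each drop-off cluster, pre-compute Eq. 1 from its centroid."""
    table = {}
    for cid, (x, y) in cluster_centroids.items():
        dists = sorted((hypot(cx - x, cy - y), caa)
                       for caa, (cx, cy) in caa_centers.items())
        nearest = [(d, caa) for d, caa in dists if d <= walkable][:k]
        weights = {caa: max(d, 1e-9) ** (-beta) for d, caa in nearest}
        z = sum(weights.values())
        table[cid] = {caa: w / z for caa, w in weights.items()}
    return table


def lookup(drop_off, cluster_centroids, table):
    """Online: assign the point to the nearest cluster centroid (assumption)."""
    x, y = drop_off
    cid = min(cluster_centroids,
              key=lambda c: hypot(cluster_centroids[c][0] - x,
                                  cluster_centroids[c][1] - y))
    return cid, table[cid]


clusters = {"D1": (0, 0), "D2": (1000, 0)}
caas = {"C1": (100, 0), "C2": (0, 200), "C3": (900, 0), "C4": (1000, 300)}
table = build_cluster_table(clusters, caas)       # done once, offline
cid, probs = lookup((20, 10), clusters, table)    # done per streaming drop-off
print(cid, probs)
```

The online cost is one pass over the cluster centroids; a spatial index could replace the linear scan when the number of clusters is large.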

The second component is related to the probability of getting off at point $(x, y)$ if the passenger walks to area $\mathrm{CAA}_i$ and intends to take activity $a_j$ at time $t$. As discussed earlier, two factors are considered. The first is the attractiveness of $\mathrm{CAA}_i$ at the given time slot, which is measured by the popularity of that area. Note that the popularity of a CAA at a given time slot can be calculated in advance, using the historical check-in data contributed by mobile users. The second factor is the POI category distribution in $\mathrm{CAA}_i$, which remains relatively stable. Thus, the value of the second component can obviously be pre-computed offline.

The third component is the conditional probability of taking a given activity (e.g., $a_j$) if the passenger is at $\mathrm{CAA}_i$ at time $t$. To approximate the true value of this component, both the "regularity" and the "dynamic" patterns of the area are taken into consideration. As shown in the formula, the "regularity" pattern is based on the historical check-in data, and the "dynamic" pattern is captured by the most recent check-in data just before the drop-off time. Thus, the former part can be pre-processed offline, while the latter part can only be computed online.

The fourth component is the joint probability of visiting the area of $\mathrm{CAA}_i$ at time $t$. As can be seen, its value is determined by two parts. One is the frequency of getting off taxis at the given time slot, and the other is the spatial distribution of the drop-off points. Both parts are quantified using the historical taxi trip data. Thus, the value can be pre-computed offline.

In summary, two online jobs, identifying the drop-off cluster and extracting the "dynamic" patterns of the top-$k$ CAAs, are required when receiving a streaming drop-off point $(x_r, y_r, t_r)$. With the other components computed and structured offline in advance, the whole process can be quite efficient. We will validate this in the experiments.

VI. EVALUATION

A. Experimental Setup

1) Data Preparation: Three data sets from the Manhattan area of New York City (NYC) are used, i.e., the road network, the Foursquare check-in data, and the taxi GPS trajectory data. Some basic statistics of the three data sets are shown in Table III.

Fig. 4. Results of UARs and CAAs identification in Manhattan, NYC: (a) a full view of the clustering results; (b) a close view of some selected regions; (c) the number of CAAs in the UARs. (Best viewed in an enlarged digital version.)

TABLE III
STATISTICS OF URBAN DATA SETS USED IN THE PAPER

2) Comparison Algorithms: We compare our approach with two baseline algorithms, the details of which are presented as follows.

• Nearest. The Nearest algorithm simply sets the POI that is closest to the drop-off location as the final destination of the passenger, regardless of the drop-off time. Thus, the trip purpose is predicted as taking activities related to that POI category.

• Bayes' rule [15]. The major difference between this baseline and our proposed algorithm is that the baseline assumes that two temporally close drop-off points may share the same prior probability of a given trip purpose, even if the two points are located far away from each other. In contrast, our proposed algorithm considers both regular and dynamic patterns when calculating the prior probability at a very fine spatial and temporal resolution, which leverages the user-generated check-in data.

3) Evaluation Environment: All the evaluations in the paper are programmed in Java under the Eclipse J2SE 1.5 integrated development environment, and are run on an Intel Core i5-4950 PC with 8 GB of RAM and the Windows 8 operating system.

B. Evaluation on Candidate Activity Area Identification

Fig. 4 presents the clustering results (i.e., the identification of UARs and CAAs) of our two-stage clustering algorithm. In total, we have identified 30 UARs, all of which are based on the road network data. As shown in Fig. 4(a), most POIs are located in midtown and downtown Manhattan, while only very few are scattered uptown. A close view of some selected regions is shown in Fig. 4(b) to highlight the advantages of our proposed clustering algorithm. For example, due to physical barriers (i.e., wide roads), the POIs in purple in Region 6 are not grouped with their nearby POIs in Region 5, and several POIs in Region 4 are not merged with their neighbours in Region 5 either. Each UAR contains a different number of CAAs, depending on the spatial distribution of the POIs inside. Fig. 4(c) shows the number of CAAs for each UAR. The x coordinate corresponds to the region number and the y coordinate is the number of CAAs in that region. As shown in the figure, region 17 contains the largest number of CAAs, while most regions have fewer than 20 CAAs.

The size of the identified CAAs is also an important metric for evaluating the clustering algorithm. Each CAA should fit within a walkable distance. Here the size of a CAA is defined as the area of the minimal rectangle that covers all POIs in the CAA. If a CAA is too big, its POIs are difficult to reach on foot. Fig. 5 shows the Cumulative Distribution Function (CDF) of the size of all CAAs. As can be seen from the figure, over 96% of CAAs are smaller than 10,000 square meters, showing the effectiveness of our proposed two-stage clustering algorithm.


Fig. 5. The CDF distribution of the size of CAAs.

C. Evaluation on Trip Purpose Imputation Algorithm

As discussed earlier, due to the lack of ground truth for taxi trip purposes, it is impossible to calculate the inference accuracy directly. Fortunately, travel purpose survey data are available at the regional scale (e.g., Manhattan) [3], which motivated us to evaluate the system accuracy indirectly. The rationale is: if the distribution of trip purposes inferred by our method is statistically close to the one obtained from the survey data at the regional scale, our method should be reliable. Since the survey data classifies travel purposes into 5 categories, i.e., work, education, recreation, shopping, and others, to make the results comparable, we manually put 'Dining', 'In-home', 'Transportation transfer', 'Lodging', and 'Medical' into the 'Others' category. Next, for each taxi trip, the proposed inference algorithm yields 5 probabilities, one for each of the 5 new trip purposes. Finally, for each trip purpose, we average the probabilities over all taxi trips generated in one month, and use the average as the percentage of travel for that trip purpose.
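The indirect evaluation described above is a category merge followed by an average over trips. A minimal sketch follows; the set of merged categories comes from the text, while the function names and the two example posteriors are made up for illustration.

```python
# Categories folded into 'Others' to match the survey (as listed in the text).
OTHERS = {"Dining", "In-home", "Transportation transfer", "Lodging", "Medical"}


def to_survey_categories(posterior):
    """Merge a per-trip posterior over activities into the survey categories."""
    merged = {}
    for activity, p in posterior.items():
        key = "Others" if activity in OTHERS else activity
        merged[key] = merged.get(key, 0.0) + p
    return merged


def purpose_shares(trip_posteriors):
    """Average the merged probabilities over all trips (one value per purpose)."""
    totals = {}
    for post in trip_posteriors:
        for purpose, p in to_survey_categories(post).items():
            totals[purpose] = totals.get(purpose, 0.0) + p
    return {purpose: s / len(trip_posteriors) for purpose, s in totals.items()}


trips = [
    {"Work": 0.5, "Dining": 0.3, "Lodging": 0.2},
    {"Work": 0.1, "Shopping": 0.7, "Medical": 0.2},
]
shares = purpose_shares(trips)
print(shares)
```

The resulting shares sum to 1 and can be compared category by category with the survey percentages.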

We compare our inference results with the travel survey data in Fig. 6. The results obtained by the two baselines are also plotted for comparison. The closer the percentage of each category is to the corresponding survey value, the better an algorithm performs. As can be seen from the results, our proposed algorithm achieves the best performance, the Nearest algorithm achieves the worst, and Bayes' rule [15] falls in between.

Fig. 6. Comparison results to baseline algorithms and survey data.

Fig. 7. A selected UAR with 4 kinds of POIs. (Best viewed in an enlarged digital version.)

Fig. 8. Trip purpose imputation results for a given day in the selected UAR and in Manhattan, respectively.

Our proposed inference algorithm also enables us to gain insights into trip purposes at a much finer resolution. We thus select a representative urban activity region (UAR) to investigate the trend of trip purposes at different times of a workday. The selected UAR, together with the POIs distributed inside it, is shown in Fig. 7, where only four POI categories can be found. Fig. 8 shows the trip purpose inference results for the selected region across the whole day (top chart). We also show the corresponding results for the other regions of Manhattan for comparison (bottom chart). As shown in the figure, travel for shopping and dining is more common in the selected region, since it is a well-known shopping and dining center in NYC. Moreover, the number of shopping trips keeps increasing and remains high in the daytime, even on workdays. In both the selected UAR and the other regions of Manhattan, the number of recreation trips climbs after working hours.

D. Evaluation on Response Time

TABLE IV
RESPONSE TIME IN THE WORST CASE IN MANHATTAN AND NYC, RESPECTIVELY

Fig. 9. The CDF distributions of the response times on a day in Manhattan and the whole NYC, respectively.

Another key system metric is how long a passenger must wait for the recommendation service after getting off the taxi. Because all requests are processed sequentially on one machine following the First-In-First-Out (FIFO) rule, when a request arrives, one of the following two situations may occur. (1) No other requests are being processed or waiting to be processed in the system; (2) there are other requests in the system, being processed or waiting to be processed. In the first case, the request enters service immediately upon arrival. In the second case, the request has to wait in the queue until the server has finished processing the requests that arrived earlier. Thus, the response time of a request is the time from its arrival until it has been processed. In other words, the response time includes the waiting time and the processing time.
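The response-time definition for a single FIFO server can be made concrete with a short simulation. The function name and the arrival and service times below are illustrative assumptions; they are not measurements from the paper.

```python
def response_times(arrivals, service_times):
    """Single server, FIFO: response = waiting time + processing time.

    `arrivals` must be sorted in ascending order; both lists use the same unit.
    """
    free_at = 0.0            # time at which the server becomes idle
    responses = []
    for arrival, service in zip(arrivals, service_times):
        start = max(arrival, free_at)   # wait if earlier requests are unfinished
        free_at = start + service
        responses.append(free_at - arrival)
    return responses


# Four requests; the second and third arrive while the server is busy.
times = response_times([0.0, 1.0, 1.5, 10.0], [2.0, 2.0, 2.0, 2.0])
print(times, max(times))
```

The first and last requests find the server idle and only incur the processing time, while the queued requests accumulate waiting time, which is why the longest response time grows with the arrival rate.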

We are more interested in the longest response time during a day, i.e., the longest time that a request (or a taxi trip) needs to wait until it has been processed. The logic is that if the longest response time is acceptable for most users, then the system is useful in practice. The longest response time corresponds to the worst case during a day. Table IV shows the average of the longest response time and its standard deviation in Manhattan and the whole NYC, respectively. Note that the number of observation days is 15. On average, the worst case takes 7.54 seconds and 8.15 seconds to respond to requests from Manhattan and from the whole NYC, respectively, which is acceptable in our application scenarios. Hence, we conclude that our proposed TripImputor is able not only to process requests from the whole NYC with a single ordinary PC, but also to provide timely recommendation services.

Fig. 10. The longest response time (corresponding to the worst case) under different numbers of requests per hour.

We are also interested in the distribution of all response times in Manhattan and the whole NYC, as shown in Fig. 9. As can be observed, although Manhattan contributes 90% of the trip inference requests of the whole city, it still takes more time to respond to a request from the whole city, because the more requests arrive per unit time, the longer the waiting time and hence the response time. Moreover, almost half of the requests are answered within 50 milliseconds in both Manhattan and the whole NYC. As shown in the figure, although in the worst case it takes up to around 7.54 seconds to process a request, 80% of the requests from Manhattan are answered within 4.5 seconds and those from the whole NYC within around 5 seconds. On average, it takes only 1.588 and 1.812 seconds to respond for Manhattan and the whole NYC, respectively. The above results demonstrate the efficiency of our system.

The previous experimental results confirm the efficiency of our proposed system in handling requests from the whole NYC. We are also aware that it takes more time to respond to a request when more requests arrive (as in the whole NYC). Going a step further, we investigate how many cities like NYC a single ordinary PC can support while returning a timely response. In Fig. 10, the x-axis refers to the number of requests per hour and the y-axis refers to the longest response time of all requests. As can be seen, it takes at most around 7, 16, 24, and 30 seconds to process 20,000, 40,000, 60,000, and 80,000 requests, respectively. When the number of requests received during one hour keeps increasing, the total processing time increases exponentially, because all the requests are processed sequentially on one PC. The longest response time is more than 9 minutes if the number of requests per hour is 100,000. Note that around 20,000 requests arrive in one hour in the whole NYC during the peak hours. Thus, with our method, we are capable of handling requests for 4 cities like NYC using just one ordinary PC, if users can accept a maximal response time of around 30 seconds.

VII. CONCLUSION AND FUTURE WORK

In this paper, we present a novel two-phase framework called TripImputor for inferring taxi trip purposes in real time. In the first phase (trip purpose inference), we propose a two-stage clustering algorithm to identify the candidate activity areas in the urban space, and then calculate the posterior probabilities of taking each activity for each taxi trip using Bayes' theorem. In the second phase, to reduce the online computation time and enable a real-time response, we develop a procedure that mainly consists of clustering historical drop-off points and matching the drop-off clusters with CAAs. Finally, we evaluate the effectiveness and efficiency of the system using real-world datasets. Experimental results demonstrate that the proposed two-phase framework achieves promising performance in both accuracy and response time.

In the future, we plan to broaden and deepen this work in several directions. First, we plan to incorporate more relevant information to further improve the accuracy of the inference algorithm, such as personal background and socio-economic features, with a particular focus on the information about the pick-up point (the pick-up time and location, as well as its nearby spatial context) and the trip travel time. Second, we intend to investigate taxi trip purposes in different seasons and at different spatial resolutions, as well as the yearly evolution of taxi trip purposes and the underlying motivations. Third, we intend to accelerate the computation by introducing parallel mechanisms such as Spark, since each taxi trip can be handled separately. Finally, we would like to deploy our system on mobile devices and recruit volunteers to test it in actual settings, collecting feedback on how to further improve the service.


Chao Chen received the B.Sc. and M.Sc. degrees in control science and control engineering from Northwestern Polytechnical University, Xi’an, China, in 2007 and 2010, respectively, and the Ph.D. degree from the Université Pierre et Marie Curie and the Institut Mines-Télécom/Télécom SudParis, France, in 2014.

In 2009, he was a Research Assistant with Hong Kong Polytechnic University, Hong Kong. He is currently an Associate Professor with the College of Computer Science, Chongqing University, Chongqing, China. He has authored or co-authored over 40 papers including eight IEEE transactions. His research interests include pervasive computing, mobile computing, urban logistics, data mining from large-scale GPS trajectory data, and big data analytics for smart cities. His work on taxi trajectory data mining was featured by IEEE Spectrum in 2011 and 2016. He was also a recipient of the Best Paper Runner-Up Award at MobiQuitous 2011.

Shuhai Jiao received the B.Sc. degree from the College of Information and Software Engineering, Northeast Normal University, Changchun, China, in 2015. He is currently pursuing the master’s degree with the College of Computer Science, Chongqing University, Chongqing, China. He was a Research Intern at Didi Chuxing Company, Beijing, China, in 2017. His research interests include scenic travel route planning and taxi GPS trajectory data mining.

Shu Zhang received the bachelor’s degree from the Civil Aviation University of China, Tianjin, China, in 2007, the master’s degree from Mississippi State University, Starkville, MS, USA, in 2010, and the Ph.D. degree in management sciences from the University of Iowa, Iowa City, IA, USA, in 2015. She is currently an Assistant Professor with the College of Economics and Business Administration, Chongqing University, Chongqing. Her research interests include vehicle routing, urban logistics, and transportation network design.

Weichen Liu (S’07–M’11) received the B.Eng. and M.Eng. degrees from the Harbin Institute of Technology, China, and the Ph.D. degree from the Hong Kong University of Science and Technology, Hong Kong. He is currently an Assistant Professor with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. He has authored and co-authored over 70 research papers in peer-reviewed journals, conferences, and books. His research interests include embedded and real-time systems, multiprocessor systems, and fault-tolerant systems. He has received the Best Paper Candidate Awards from CODES+ISSS, CASES, and ASP-DAC.

Liang Feng received the Ph.D. degree from the School of Computer Engineering, Nanyang Technological University, Singapore, in 2014. He was a Post-Doctoral Research Fellow at the Computational Intelligence Graduate Laboratory, Nanyang Technological University. He is currently an Assistant Professor at the College of Computer Science, Chongqing University, China. His research interests include computational and artificial intelligence, memetic computing, big data optimization and learning, and transfer learning.

Yasha Wang received the Ph.D. degree from Northeastern University, Shenyang, China, in 2003. He is currently a Professor and an Associate Director of the National Research and Engineering Center of Software Engineering with Peking University, China. His research interests include urban data analytics, ubiquitous computing, software reuse, and online software development environments. He has authored or co-authored over 50 papers in prestigious conferences and journals, such as ICWS, UbiComp, and ICSP. As a Technical Leader and Manager, he has accomplished several key national projects on software engineering and smart cities. In cooperation with major smart-city solution providers, his research work has been adopted in more than 20 cities in China.
