Predicting Citywide Crowd Dynamics at Big Events: A Deep
Learning System
RENHE JIANG, The University of Tokyo, SUSTech-UTokyo Joint Research Center on Super Smart City,
Southern University of Science and Technology
ZEKUN CAI, ZHAONAN WANG, and CHUANG YANG, The University of Tokyo
ZIPEI FAN and QUANJUN CHEN, The University of Tokyo, SUSTech-UTokyo Joint Research Center
on Super Smart City, Southern University of Science and Technology
XUAN SONG, SUSTech-UTokyo Joint Research Center on Super Smart City, Southern University of
Science and Technology, The University of Tokyo
RYOSUKE SHIBASAKI, The University of Tokyo
Event crowd management has been a significant research topic with high social impact. When a big
event such as an earthquake, typhoon, or national festival happens, crowd management becomes the first
priority for governments (e.g., police) and public service operators (e.g., subway/bus operators) to protect
people's safety and maintain the operation of public infrastructures. However, under such event situations,
human behavior becomes very different from daily routines, which makes predicting crowd dynamics
at big events highly challenging, especially at a citywide level. Therefore, in this study, we aim to
extract the "deep" trend only from the current momentary observations and generate an accurate prediction
for the trend in the near future, which is considered an effective way to deal with event situations.
Motivated by these, we build an online system called DeepUrbanEvent, which iteratively takes citywide
crowd dynamics from the current one hour as input and reports prediction results for the next one hour
as output. A novel deep learning architecture built with recurrent neural networks is designed to effectively
model these highly complex sequential data in a manner analogous to video prediction tasks. Experimental
results demonstrate the superior performance of our proposed methodology over existing approaches. Lastly,
we apply our prototype system to multiple big real-world events and show that it is highly deployable as an
online crowd management system.
CCS Concepts: • Information systems → Information systems applications; • Human-centered computing → Ubiquitous and mobile computing; • Computing methodologies → Artificial intelligence;
Additional Key Words and Phrases: Crowd management, ubiquitous and mobile computing, deep learning, application and system
This work was supported by Grant-in-Aid for Early-Career Scientists (20K19859) and Grant-in-Aid for Early-Career Scientists (20K19782) of the Japan Society for the Promotion of Science (JSPS).
Authors' addresses: R. Jiang, Z. Fan, and Q. Chen, The University of Tokyo, SUSTech-UTokyo Joint Research Center on Super Smart City, Southern University of Science and Technology; emails: jiangrh@csis.u-tokyo.ac.jp, {fanzipei, chen1990}@iis.u-tokyo.ac.jp; Z. Cai, Z. Wang, C. Yang, and R. Shibasaki, The University of Tokyo; emails: {caizekun, znwang, chuang.yang, shiba}@csis.u-tokyo.ac.jp; X. Song (corresponding author), SUSTech-UTokyo Joint Research Center on Super Smart City, Southern University of Science and Technology, The University of Tokyo; email: songxuan@csis.u-tokyo.ac.jp.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2022 Association for Computing Machinery.
2157-6904/2022/03-ART21 $15.00
https://doi.org/10.1145/3472300
ACM Transactions on Intelligent Systems and Technology, Vol. 13, No. 2, Article 21. Publication date: March 2022.
ACM Reference format:
Renhe Jiang, Zekun Cai, Zhaonan Wang, Chuang Yang, Zipei Fan, Quanjun Chen, Xuan Song, and Ryosuke
Shibasaki. 2022. Predicting Citywide Crowd Dynamics at Big Events: A Deep Learning System. ACM Trans.
Intell. Syst. Technol. 13, 2, Article 21 (March 2022), 24 pages.
https://doi.org/10.1145/3472300
1 INTRODUCTION
Event crowd management has been a significant research topic with high social impact. When
a big event such as an earthquake, typhoon, or national festival happens, crowd management
becomes the first priority for governments (e.g., police) and public service operators (e.g.,
subway/bus operators) to protect people's safety and maintain the operation of public infrastructures.
Especially in a large urban area such as Tokyo or Shanghai, the population density is very
high, which naturally leads to high risks of various accidents and emergency situations. Recall
the tragedy on New Year's Eve in Shanghai: around 300,000 people gathered to celebrate the arrival
of 2015 near Chen Yi Square on the Bund. However, the large crowd was not well controlled,
and a stampede occurred in which 36 people died and 47 were injured. Meanwhile,
artificial intelligence (AI) technology is rapidly developing and 5G mobile internet technology
is forthcoming. Big human mobility data are being continuously generated from a variety
of sources, some of which can be treated and utilized as streaming data for understanding and
predicting crowd dynamics. All of these stimulate us to take new efforts toward this social issue
by using such streaming mobility data and advanced AI technologies. However,
when big events or disasters happen, urban human mobility may dramatically change from normal
situations, meaning that people's movements become almost uncorrelated with their daily routines.
As shown in Figure 1, a big earthquake occurred at 14:46 JST on March 11, 2011. Citywide human
mobility in the Tokyo area was greatly impacted, since the transportation network was suddenly shut
down by the earthquake. Due to this big event, an abnormal pattern of crowd density can be observed
in both the Tokyo station area and the Shinjuku station area. All of these demonstrate that predicting
crowd dynamics under event situations is of high social impact, but very challenging, especially
at a citywide level.
To address this challenge, we aim to extract the "deep" trend only from the current momentary
observations and generate an accurate prediction for the trend in the near future, which is considered
an effective way to handle event situations [10]. We build an intelligent system
called DeepUrbanEvent based on collected big human mobility data and a unique deep learning
architecture. It is designed to be deployed as an online system for crowd management at big events,
which can continuously take limited steps of currently observed crowd dynamics as input and report
multiple steps of prediction results for a short time period in the future as output. With such
multiple steps of prediction, we can easily understand in more detail how the crowd dynamics are
evolving.
Specifically, in this study, citywide crowd dynamics are first decomposed into two parts: crowd
density and crowd flow. By meshing a large urban area into fine-grained grids, both can
be represented by a four-dimensional (4D) tensor (Timestep, Height, Width, Channel)
analogously to a short video, where Timestep represents the number of observation/prediction
steps, and Height and Width are determined by the mesh size. A crowd density/flow video represents
a time series of density/flow values for each mesh-grid; therefore, Channel for density is equal to
1, whereas Channel for flow is equal to the size of the flow kernel window η×η. The stored value
indicates how many people inside a central mesh-grid will transit to each of the η×η neighboring
mesh-grids in a given time interval. A Multitask ConvLSTM Encoder-Decoder architecture is
designed to simultaneously model these two kinds of high-dimensional sequential data to gain
Fig. 1. Citywide human mobility in Tokyo before and after the Great East Japan Earthquake (top). Crowd
density in Tokyo station area and Shinjuku station area (bottom).
concurrent enhancement. Based on this architecture, our system works as an online system that
can continuously take limited steps of currently observed crowd density and crowd flow as input,
and report multiple steps of prediction results as output for the future time period. Finally, we
validate our system on four big real-world events that happened in the Tokyo area, namely, the 3.11
Japan Earthquake, Typhoon Roke (2011), New Year's Day (2012), and the Tokyo Marathon (2011), and
demonstrate its superior performance to baseline models. The overview of our system is
shown in Figure 2. In summary, our work has the following key characteristics that make it unique:
— For predicting crowd dynamics at citywide-level big events, we build an online deployable
system that needs only limited steps of current observations as input.
— Citywide crowd dynamics are decomposed into two kinds of artificial videos, namely, crowd
density video and crowd flow video, and a Multitask ConvLSTM Encoder-Decoder is
designed to simultaneously predict multiple steps of crowd density and flow for the future
time period.
— Using the predicted crowd density and flow videos, we can further build a series of dynamic
crowd mobility graphs to help conduct probabilistic reasoning of crowd movements during
big events.
— We validate our system on four big real-world events with a big human mobility data source
and verify it as a highly deployable prototype system.
The remainder of this article is organized as follows. In Section 2, we review related work.
In Section 3, we provide a description of our data source. In Section 4, we give the problem definition
of citywide crowd dynamics prediction. In Section 5, we illustrate the proposed deep sequential
learning architecture. In Section 6, we explain how to build dynamic crowd mobility graphs. We
Fig. 2. DeepUrbanEvent is designed as an effective real-world system for predicting citywide crowd dynamics
at big events. Citywide crowd dynamics are first decomposed into two parts: crowd density and crowd flow.
By meshing a large urban area into fine-grained grids, both can be represented by a four-dimensional
tensor (Timestep, Height, Width, Channel) analogously to a short video, where Timestep represents
the number of observation/prediction steps, and Height and Width are determined by the mesh size. A crowd
dynamics graph for management can be built based on the predicted density (nodes) and flow (edges).
present and discuss the results of the experimental evaluation in Section 7. In Section 8, we give
our conclusions as well as future work.
2 RELATED WORK
Human mobility data have been widely studied in the fields of data science and AI. A very comprehensive
survey on recent deep spatiotemporal models is given in [48]. Here, we briefly
recap the related work in citywide human mobility prediction. According to the modeling strategy,
it can be divided into three categories: trajectory-based prediction, mesh-based prediction,
and network-based prediction. The trajectory-based methods [9–13, 22, 23, 33, 42] directly model
the trajectories as typical sequential data, whereas the mesh-based [18, 32, 46, 49, 55–57, 59–61, 65]
and network-based ones [5, 14, 21, 28, 29, 34–36, 38, 41, 47, 50, 54, 58] aggregate or map the
trajectories onto mesh-grids or transportation networks.
2.1 Trajectory-Based Mobility Prediction
Many trajectory-based deep learning models have been proposed to predict each individual's movement
[12, 13, 33]. ST-RNN [33] extends a regular RNN by utilizing time- and distance-specific transition
matrices for predicting the next location. DeepMove [12],
considered a state-of-the-art model for trajectory prediction, designed a historical attention
module to capture periodicities and augment prediction accuracy. VANext [13] further enhanced
DeepMove by proposing a novel variational attention mechanism. Besides, [9, 11, 22]
modeled human mobility for a large population at a citywide level. [10, 23, 42] simulated and
predicted human emergency mobility following disasters or events. [20] generated urban human
mobility with a variational autoencoder. [24] transferred urban human mobility via Points-Of-Interest
among different cities. However, these models are built based on millions of individuals'
mobility. In our experiment, we implement CityMomentum [10] as the trajectory-based
baseline.
2.2 Mesh-Based Mobility Prediction
Density (Demand) Prediction. Based on a mesh, predicting citywide crowd density with historical
observations can be represented by a 4D tensor (Timestep, Height, Width, Channel), as
demonstrated in Figure 3. Following the same formulation, Hetero-ConvLSTM [59] predicted traffic
accidents using heterogeneous data sources; DeepSD [46], DMVST-Net [56], and Periodic-CRN
[65] predicted taxi demand using taxi request datasets collected from car-hailing companies. CoST-Net
[57] predicted multiple transportation demands using both taxi and bike data.
In/Out Flow Prediction. Forecasting citywide crowd flow based on a mesh was proposed and
addressed by [18, 60, 61]. As illustrated in Figure 3, they define inflow and outflow to represent
how many people will flow into or out from a certain mesh-grid. The prediction problem can be
represented by a 4D tensor (Timestep, Height, Width, Channel = 2), where Channel stores the inflow
and outflow. [18, 60, 61] build deep learning models using deep neural networks and convolutional
neural networks (CNNs) to do the citywide prediction. Following the same problem definition, a
series of deep learning models have been proposed to improve on the performance of ST-ResNet [60],
including Periodic-CRN [65], STDN [55], and DeepSTN+ [32]. In our experiment, we implement
ST-ResNet [60] as the mesh-based baseline. We adapt ST-ResNet to take in limited steps of the latest
observations on crowd density/flow and perform one-step-by-one-step autoregression to obtain
multiple steps of predictions.
Flow Prediction. Inflow and outflow can only indicate how many people will flow into or out
from a certain mesh-grid, but cannot answer where the people flows come from or transit to. As illustrated
in Figure 3, citywide crowd flow can depict how a crowd of people moves among the entire set of
mesh-grids. The problem can be represented by a 4D tensor (Timestep, Height, Width, Channel = Height
* Width). MDL [62] also utilized multitask learning to simultaneously model and predict crowd
in/out flow and crowd in-out transition. GEML [49] predicted the origin-destination transition matrix
via a Graph Convolutional Network (GCN). Moreover, [2] conducts transition estimation from
Fig. 3. Mesh-based citywide crowd prediction: density, in/out flow, and transition, analogous to video data
prediction on a 4D tensor (Timestep, Height, Width, Channel).
aggregated population data, and [44] estimates the transition populations using inflow and outflow.
However, all these models, including MDL [62] and GEML [49], can only do single-step flow
prediction rather than continuous multi-step prediction. These models also cannot give the
crowd density prediction in a straightforward way, which is very crucial for event crowd
management. Our system aims to simultaneously model and predict citywide crowd density and
crowd flow in a multi-step-to-multi-step manner to better facilitate event crowd management in
the real world.
2.3 Network-Based Mobility Prediction
Based on road network-mapped trajectories, researchers have also applied deep learning to predict
traffic flow [21, 34], traffic speed and congestion [35, 36], travel time [28, 29, 47, 50, 54], and human
mobility with transportation mode [41]. On the other hand, based on road network-aggregated
time series data, GeoMAN [31] designed an LSTM plus attention mechanism as a general solution to
geo-sensory time series prediction, and ST-MetaNet [38] employed meta-learning for traffic flow
prediction. In particular, leveraging the latest technique, GCN, a series of models have been
proposed to address traffic-related problems: ST-GCN [58] and DCRNN [30] first applied GCN
to traffic flow forecasting; ASTGCN [16], GMAN [64], and STSGCN [40] are state-of-the-art models
extended from ST-GCN; Graph WaveNet [52] combined GCN with dilated causal convolution, while
T-GCN [63] combined GCN and a gated recurrent unit to demonstrate superior performance.
Besides, [3, 6, 14] utilized GCN for ride-hailing and car-hailing demand prediction. However, these
models did not directly address the citywide crowd density or crowd flow prediction task.
3 DATA SOURCE
"Konzatsu-Tokei (R)" from ZENRIN DataCom Co., Ltd. was used. It refers to people-flow data collected
from individual location data sent from mobile phones with an enabled AUTO-GPS function
under the users' consent, through the "DoCoMo map navi" service provided by NTT DoCoMo,
Inc. Those data are processed collectively and statistically in order to conceal private information.
The original location data are Global Positioning System (GPS) data (latitude, longitude) sent
at a minimum period of about 5 minutes, and do not include information (such as gender or age)
that could specify individuals. However, the data acquisition is affected by several factors such as loss
of signal or low battery power. In addition, when a mobile phone user stops at a location, the
Table 1. Notation Description

Symbol       Description
M, g         mesh for an urban area, mesh-grid
Timestep     the number of observation/prediction steps
Height       number of mesh-grids along the latitude axis
Width        number of mesh-grids along the longitude axis
Channel      dimension to store the value for density/flow
Γ_i^u        each user's (u) trajectory on each day (i)
d_t^m        crowd density at timeslot t on mesh-grid g_m
f_t^{m,w}    crowd flow from g_m at t−1 to g_w at t
d_t, f_t     citywide density/flow at t
x_d, x_f     current α steps of density/flow w.r.t. t
y_d, y_f     next β steps of density/flow w.r.t. t
X            samples from all timeslots for the current
Y            samples from all timeslots for the next
ˆ            the predicted results
positioning function of his/her mobile phone is automatically turned off to save power. The proposed
methodology is applied to raw GPS data from NTT DoCoMo, Inc. The raw GPS log dataset
was collected anonymously from approximately 1.6 million mobile phone users in Japan over a
three-year period (August 1, 2010, to July 31, 2013). Each record contains user ID, latitude, longitude,
altitude, timestamp, and accuracy level. In this study, we select the Greater Tokyo Area (including
Tokyo City, Kanagawa Prefecture, Chiba Prefecture, and Saitama Prefecture) as the target area. We
obtain 145,507 users' trajectories in total, covering approximately 1% of the real-world population.
This dataset is slightly biased towards young people because, compared with older people
and children, young people are more likely to have a mobile phone with a localization function. To
discover the relationship between our dataset and the real-world population and validate the
representativeness of the dataset, a previous study [19] estimated the home location of each user in the
dataset and compared it with census data on a 1 km mesh-grid, and found the linear relationship:

N_GPS = 0.0063048 · N_census + 0.73551,  R² = 0.79222,

where N_GPS is the population estimated from our GPS dataset, N_census is the population given by
the national census data, and R² is the coefficient of determination. This R² value demonstrates
that our dataset has very good representativeness of the real-world population. Please refer to
[19] for more details on the basic statistics of this dataset.
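As a quick sanity check, the fitted line above can be inverted to scale an observed GPS count up to an estimated real-world population. This small helper is purely illustrative and not part of the original study:

```python
def census_population_from_gps(n_gps):
    # Invert N_GPS = 0.0063048 * N_census + 0.73551 (the fit reported in [19])
    # to estimate the census population behind an observed GPS user count.
    return (n_gps - 0.73551) / 0.0063048
```

By this relationship, one observed user corresponds to very roughly 160 real-world residents, which is broadly consistent with the ~1% coverage mentioned above.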
4 CITYWIDE CROWD DYNAMICS MODELING
Definition 1 (Calibrated Human Trajectory Database). Human trajectories are stored and indexed
by day (i) and user ID (u) in the trajectory database Γ. Given a mesh M of an area, namely a set
of mesh-grids {g_1, g_2, ..., g_m, ..., g_{Height*Width}}, and a time interval Δt, each user's trajectory on
each day Γ_i^u is mapped onto mesh-grids and then calibrated to obtain a constant sampling rate as
follows:

Γ_i^u = (t_1, g_1), ..., (t_k, g_k)  ∧  ∀j ∈ [1, k): |t_{j+1} − t_j| = Δt,

which means that the time interval between any two consecutive timeslots is calibrated to Δt.
For simplicity, from now on we only consider a one-day slice of the trajectory database Γ; the
day index (i) can then be omitted when referring to Γ. Table 1 above summarizes the notation used in
this article.
Definition 2 (Crowd Density). Given Γ and M, crowd density at timeslot t on mesh-grid g_m is defined
as follows:

d_t^m = |{u | Γ^u.g_t = g_m}|,

which intuitively indicates how many people are inside g_m at t.
Definition 3 (Crowd Flow). To capture the crowd flow starting from a certain mesh-grid, we
utilize a kernel window denoted as η×η w.r.t. g_m, which represents a square area made up of η×η
neighboring mesh-grids with g_m as the centroid mesh-grid. Given Γ, M, and a kernel window η×η
w.r.t. each g, crowd flow at timeslot t on mesh-grid g_m is defined as follows:

f_t^{m,w} = |{u | Γ^u.g_{t−1} = g_m ∧ Γ^u.g_t = g_w}|,

which intuitively indicates how many people transit from mesh-grid g_m at timeslot t−1 to a neighboring
mesh-grid g_w inside the kernel window at timeslot t. After calculating the crowd density/flow
for each mesh-grid over the entire mesh, citywide crowd density/flow can be obtained for each
timeslot.
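To make Definitions 2 and 3 concrete, the following is a minimal numpy sketch of how calibrated trajectories can be binned into per-timeslot density and flow tensors. The function name, data layout, and channel indexing are our own illustrative assumptions, not the system's actual code:

```python
import numpy as np

def crowd_dynamics_tensors(trajs, height, width, eta):
    # trajs: user -> list of (row, col) mesh-grid indices, one per timeslot,
    # already calibrated to a constant sampling rate (Definition 1).
    T = len(next(iter(trajs.values())))
    r = eta // 2
    density = np.zeros((T, height, width, 1))
    flow = np.zeros((T, height, width, eta * eta))
    for path in trajs.values():
        for t, (i, j) in enumerate(path):
            density[t, i, j, 0] += 1           # Definition 2: people inside g_m at t
            if t > 0:
                pi, pj = path[t - 1]           # previous mesh-grid g_m
                di, dj = i - pi, j - pj        # offset inside the eta x eta window
                if abs(di) <= r and abs(dj) <= r:
                    k = (di + r) * eta + (dj + r)
                    flow[t, pi, pj, k] += 1    # Definition 3: transition g_m -> g_w
    return density, flow
```

Transitions longer than the kernel window are simply dropped in this sketch; by construction of the window size in Section 5.2, such transitions should not occur within one time interval.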
Definition 4 (Crowd Density/Flow Video). As the mesh is represented in a two-dimensional format,
a crowd density/flow video containing a number of consecutive frames can be represented by a 4D
tensor R^{Timestep×Height×Width×Channel}, where Timestep represents the number of video frames,
Channel for density is equal to 1, and Channel for flow is equal to the size of the given kernel
window, namely η². An illustration of the crowd density/flow video is shown in Figure 2.
Definition 5 (Crowd Density/Flow Video Prediction). Given the currently observed α-step crowd
density/flow video x_d = d_{t−(α−1)}, ..., d_t and x_f = f_{t−(α−1)}, ..., f_t at timeslot t, prediction of the next
β-step density/flow video ŷ_d = d̂_{t+1}, ..., d̂_{t+β} and ŷ_f = f̂_{t+1}, ..., f̂_{t+β} is modeled as follows:

ŷ_d = d̂_{t+1}, d̂_{t+2}, ..., d̂_{t+β} = argmax_{d_{t+1}, ..., d_{t+β}} P(d_{t+1}, d_{t+2}, ..., d_{t+β} | d_{t−(α−1)}, ..., d_t),

ŷ_f = f̂_{t+1}, f̂_{t+2}, ..., f̂_{t+β} = argmax_{f_{t+1}, ..., f_{t+β}} P(f_{t+1}, f_{t+2}, ..., f_{t+β} | f_{t−(α−1)}, ..., f_t).
Definition 6 (Citywide Crowd Dynamics Prediction). Given the currently observed α-step crowd
density/flow video, citywide crowd dynamics prediction aims to simultaneously generate the next β-step
density/flow video, which is modeled as follows:

ŷ_d, ŷ_f = argmax_{y_d, y_f} P(y_d, y_f | x_d, x_f).

Moreover, by jointly modeling these two highly correlated tasks, concurrent enhancements for
both can be expected. It should be noted that the crowd density video and crowd flow video are
summarized and proposed here as a new concept called crowd dynamics, which aims not only to
reflect the crowd density for each mesh-grid but also to depict how a crowd of people moves among
the mesh-grids. Figure 4 illustrates the overall problem definition mentioned above.
5 DEEP SEQUENTIAL LEARNING ARCHITECTURE
As shown above, the citywide crowd dynamics problem has been defined in a manner analogous
to a video prediction task. However, citywide crowd dynamics are a highly complex phenomenon,
especially when big events happen, which makes it very difficult to handle these high-dimensional
Fig. 4. Citywide crowd dynamics prediction.
sequential data with some classical methodologies. This naturally motivates us to employ the most
advanced deep video learning model as the basic component of our system.
Convolutional LSTM. ConvLSTM [53] was proposed to build an end-to-end trainable model
for the precipitation nowcasting problem. It extends the fully connected LSTM to have convolutional
structures in both the input-to-state and state-to-state transitions and achieves new success
on video modeling tasks. Thus, ConvLSTM is utilized as the core component of our system for
the density and flow video prediction tasks. As shown in Figure 2, a ConvLSTM has three gates
comprising an input gate i, an output gate o, and a forget gate f, the same as an ordinary LSTM.
The hidden state h_t in a ConvLSTM is calculated iteratively from t = 1 to T for an input sequence of
frames (x_1, x_2, ..., x_T) as follows:

i_t = σ(W_xi ∗ x_t + W_hi ∗ h_{t−1} + W_ci ∘ c_{t−1} + b_i),
f_t = σ(W_xf ∗ x_t + W_hf ∗ h_{t−1} + W_cf ∘ c_{t−1} + b_f),
c_t = f_t ∘ c_{t−1} + i_t ∘ tanh(W_xc ∗ x_t + W_hc ∗ h_{t−1} + b_c),
o_t = σ(W_xo ∗ x_t + W_ho ∗ h_{t−1} + W_co ∘ c_t + b_o),
h_t = o_t ∘ tanh(c_t),

where W denotes weights, b bias vectors, ∗ denotes the convolution operator, and ∘ represents the Hadamard
product. All of these weight parameters are determined by applying the standard "backpropagation
through time" algorithm, which starts by unfolding the recurrent neural network through time
and then generalizes the backpropagation for feed-forward networks to minimize the defined
loss function, which is the Mean Squared Error (MSE) for our problem. The full details of the
algorithm are omitted here.
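For illustration, the gate equations can be sketched as a single ConvLSTM step in plain numpy. The naive "same" convolution and the parameter-dictionary layout are our own simplifications for readability, not the system's implementation:

```python
import numpy as np

def conv2d_same(x, w):
    # naive 'same' convolution; x: (H, W, C_in), w: (k, k, C_in, C_out)
    k = w.shape[0]
    p = k // 2
    H, W, _ = x.shape
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros((H, W, w.shape[3]))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.tensordot(xp[i:i + k, j:j + k], w,
                                     axes=([0, 1, 2], [0, 1, 2]))
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h_prev, c_prev, P):
    # P maps names like 'Wxi', 'Whi', 'Wci', 'bi' to arrays; '*' in the
    # equations is conv2d_same, the Hadamard product is elementwise '*'.
    i = sigmoid(conv2d_same(x, P['Wxi']) + conv2d_same(h_prev, P['Whi'])
                + P['Wci'] * c_prev + P['bi'])
    f = sigmoid(conv2d_same(x, P['Wxf']) + conv2d_same(h_prev, P['Whf'])
                + P['Wcf'] * c_prev + P['bf'])
    c = f * c_prev + i * np.tanh(conv2d_same(x, P['Wxc'])
                                 + conv2d_same(h_prev, P['Whc']) + P['bc'])
    o = sigmoid(conv2d_same(x, P['Wxo']) + conv2d_same(h_prev, P['Who'])
                + P['Wco'] * c + P['bo'])
    h = o * np.tanh(c)
    return h, c
```

In practice a framework implementation (e.g., a ConvLSTM layer in a deep learning library) replaces this loop-based convolution, but the gate logic is the same.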
5.1 Stacked ConvLSTM Architecture
As a comparative modeling approach, we would like to verify what the performance of our system
would be like if we used a one-step-by-one-step prediction model and obtained multiple steps of
Fig. 5. Stacked ConvLSTM for one-step prediction.
predictions by iterating it in an autoregressive manner. The one-step-by-one-step crowd density
prediction model can be defined as follows:

d̂_{t+1} = argmax_{d_{t+1}} P(d_{t+1} | d_{t−(α−1)}, ..., d_{t−1}, d_t).

Once the one-step model is trained, we can run it β times to obtain d̂_{t+1}, d̂_{t+2}, ..., d̂_{t+β}. Crowd
flow prediction can be modeled in a similar way, which is omitted for simplicity.
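The autoregressive rollout can be sketched generically as follows, where model stands for any trained one-step predictor and the helper name is our own:

```python
def autoregressive_rollout(model, history, beta):
    # history: the latest alpha observed frames; each prediction is fed
    # back as input, so errors can accumulate over the beta steps.
    window = list(history)
    preds = []
    for _ in range(beta):
        nxt = model(window)
        preds.append(nxt)
        window = window[1:] + [nxt]   # slide: drop oldest, append prediction
    return preds
```

The error accumulation visible in this loop is exactly what motivates the multi-step-to-multi-step design in Section 5.3.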
Moreover, the use of multiple stacked layers of neural networks can be considered to boost
performance in difficult time-series modeling tasks [17]. Thus, a deep architecture
constructed with multiple stacked ConvLSTM layers is shown in Figure 5 for one-step
prediction. It has strong representational power, which makes it suitable for giving predictions on
complex phenomena like citywide crowd dynamics. Note that the same network architecture
can also be applied to one-step crowd flow prediction, but a special AutoEncoder component is
first necessary due to the uniqueness of the crowd flow video, which is explained in the following.
5.2 CNN AutoEncoder for Crowd Flow
Crowd density and flow videos are both represented as 4D tensors R^{Timestep×Height×Width×Channel};
however, the Channel for flow is much larger than that for density. In our system, each grid is set to 500 m
× 500 m. Taking into account all the possible transportation modes such as WALK, BUS, CAR,
and TRAIN, the transition distance from one grid-cell to another neighboring one can be up to 4
km within a 5-minute time interval (approximately 48 km/h at most). Thus, the kernel window needs to
be 15 × 15 to capture all the possible crowd flow within 5 minutes. Channel for flow is then equal to
225, which is simply too large for most state-of-the-art video learning models to handle. Thus,
we build a special CNN AutoEncoder [4, 45] to obtain a low-dimensional representation of Channel
for crowd flow.
Compared with traditional neural networks, CNNs were designed specifically for analyzing
visual imagery [27], where the neurons in a layer are only connected to a small region of the
previous layer instead of to all of the neurons in a fully connected manner. CNNs are the state-of-the-art
method for image recognition and classification tasks [26, 39]. For a typical CNN layer, the
Fig. 6. CNN AutoEncoder for crowd flow.
convolutional feature value at location (i, j) in the k-th feature map is calculated as:

c_{i,j,k} = ReLU(w_k · x_{i,j} + b_k),

where w_k and b_k are the weight and bias of the k-th filter, respectively, and x_{i,j} is the input patch
centered at location (i, j). ReLU is often used as the activation function.
The details of the special CNN AutoEncoder are presented in Figure 6. Note that we
aim to model the crowd dynamics at a citywide level by taking the entire city "image" as the
computation unit rather than each grid "pixel." Therefore, the original crowd flow image of the
city is represented with a three-dimensional (3D) tensor (80, 80, 225). An encoder is constructed
with three convolutional layers to encode the image into a small 3D tensor (80, 80, 4), and then
a decoder is constructed with three convolutional layers to decode the compressed tensor back
to the original 3D tensor (80, 80, 225). The end-to-end model parameters can be optimized by
minimizing the reconstruction error between the original flow image and the decoded flow image. In our
system, we aim to obtain a compressed Channel (from 225 to 4) but keep the spatial structural
information of the citywide crowd flow image at (80, 80). Thus, only convolutional layers with a 1 ×
1 kernel window are utilized. At the last layer of the encoder, a unique ReLU(MAX = 1.0) function
is utilized to ensure that the values are all scaled into [0, 1], which keeps the value range of
crowd flow approximately the same as the value range of crowd density. Without this, the multitask
learning mechanism introduced in the following could not function well.
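A minimal numpy sketch of this 1 × 1 channel autoencoder follows; since a 1 × 1 convolution is just a per-pixel linear map over channels, it reduces to a matrix product. The intermediate widths (64 and 16) are illustrative assumptions of ours — the section fixes only the endpoints 225 and 4:

```python
import numpy as np

def conv1x1(x, w, b, act):
    # x: (H, W, C_in), w: (C_in, C_out); a 1x1 conv is a per-pixel linear map
    return act(x @ w + b)

relu = lambda z: np.maximum(z, 0.0)
relu_max1 = lambda z: np.minimum(np.maximum(z, 0.0), 1.0)   # ReLU capped at 1.0

def encode_decode(flow_img, enc_params, dec_params):
    # Three encoder layers compress Channel 225 -> 4; the last one uses
    # ReLU(MAX = 1.0) so the latent values land in [0, 1]. Three decoder
    # layers map the latent back to 225 channels; H x W is untouched.
    z = flow_img
    for w, b in enc_params[:-1]:
        z = conv1x1(z, w, b, relu)
    w, b = enc_params[-1]
    z = conv1x1(z, w, b, relu_max1)
    y = z
    for w, b in dec_params:
        y = conv1x1(y, w, b, relu)
    return z, y
```

Training would minimize the mean squared reconstruction error between flow_img and the decoded output y, as described above.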
5.3 Multitask ConvLSTM Encoder and Decoder
With such a CNN AutoEncoder, citywide crowd flow can be modeled and computed with the same
architecture as for crowd density shown in Figure 5. The prediction can be performed in an iterative
one-by-one manner, but one major limitation of this model appears when predicting relatively long
short-term crowd dynamics. As the iteration goes on, the accumulated iteration error becomes large,
which can result in terrible performance on the last several predicted steps. To tackle this problem,
we improve the one-step-by-one-step modeling with multi-step-to-multi-step modeling (Definition
5), aimed at achieving better performance on "long" short-term predictions. To deliver this
idea, a sequential encoder and decoder architecture [7, 43] is built with four ConvLSTM
Fig. 7. Multitask ConvLSTM encoder-decoder for simultaneous multi-step prediction of crowd density and
crowd flow.
layers in this study. It works in the following steps: (1) the first two hidden layers of ConvLSTM
(encoder) map the α steps of the input crowd density or flow video into a single latent vector,
which contains information about the entire video sequence; (2) this latent vector is repeated β
times to form a constant sequence; and (3) the other two hidden layers of ConvLSTM (decoder) are used
to turn this constant sequence into the β steps of the output video sequence. A batch normalization
layer is added between two consecutive ConvLSTM layers. ReLU is used as the activation
function in the final decoding layer. The ConvLSTM Enc.-Dec. model can be separately trained by
minimizing the prediction errors L(θ_d) and L(θ_f) for crowd density video Y_D and crowd flow video
Y_F, described as follows:

L(θ_d) = ||Ŷ_D − Y_D||²,
L(θ_f) = ||Ŷ_F − Y_F||².
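As an illustration of this repeat-and-decode mechanism, a minimal Keras sketch is given below. It is a sketch of the idea, not the authors' released code: the function name is ours, and the layer widths and grid size are reduced for readability (the paper uses 32 filters and an 80 × 80 grid).

```python
from tensorflow.keras import layers, models

def build_convlstm_enc_dec(alpha, beta, h, w, c, filters=4):
    """Multi-step-to-multi-step ConvLSTM encoder-decoder sketch:
    (1) encode alpha input frames into one latent map,
    (2) repeat the latent beta times as a constant sequence,
    (3) decode it into beta output frames."""
    inp = layers.Input(shape=(alpha, h, w, c))
    # Encoder: two ConvLSTM layers; the second collapses the time axis.
    x = layers.ConvLSTM2D(filters, 3, padding="same", return_sequences=True)(inp)
    x = layers.BatchNormalization()(x)
    z = layers.ConvLSTM2D(filters, 3, padding="same", return_sequences=False)(x)
    # Repeat the latent representation beta times (step 2).
    zrep = layers.Reshape((beta, h, w, filters))(
        layers.RepeatVector(beta)(layers.Flatten()(z)))
    # Decoder: two ConvLSTM layers returning beta frames; ReLU at the end.
    y = layers.ConvLSTM2D(filters, 3, padding="same", return_sequences=True)(zrep)
    y = layers.BatchNormalization()(y)
    out = layers.ConvLSTM2D(c, 3, padding="same", return_sequences=True,
                            activation="relu")(y)
    return models.Model(inp, out)
```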
Crowd density video and crowd flow video share important information and are highly correlated
with each other. The insights behind this are twofold: (1) people tend to follow the
flow, especially under event/emergency situations, which may make crowded places attract
more and more people; and (2) higher inflow leads to higher density for each grid-cell, and
similarly higher outflow reduces the crowd density. Moreover, as mentioned above, a crowd
management system needs to predict not only the crowd density but also the crowd flow for each
grid. Thus, we jointly model these two highly correlated tasks as defined in Definition 6, and propose
a Multitask ConvLSTM Encoder and Decoder architecture as shown in Figure 7. The key concept
Predicting Citywide Crowd Dynamics at Big Events 21:13
of multi-task learning [37] is to learn multiple tasks simultaneously with the aim of gaining mutual
benefits; thus, learning performance can be improved through parallel learning while using
a shared latent representation. Therefore, it is reasonable to expect better performance from this
learning framework for our system. Our Multitask ConvLSTM Encoder and Decoder architecture
first takes X_D and X_F as two separate inputs. The separate input encoders first encode the crowd
density and crowd flow videos respectively. Then, the shared encoder maps the encoded crowd
density and flow into a joint latent representation z_α, which can be taken as the auto-extracted
features for the entire crowd dynamics; z_α is then repeated β times and passed to the shared
decoder, and finally the output decoders give the multiple steps of prediction results for the crowd
density video Ŷ_D and flow video Ŷ_F respectively. The entire model can be trained by minimizing
the total prediction error L(θ) of crowd density and flow, described as follows:

L(θ) = λ||Ŷ_D − Y_D||² + (1 − λ)||Ŷ_F − Y_F||²,

where λ is set to 0.5 to make the two parts of the loss contribute equally. The CNN AutoEncoder still
needs to be applied first to the original crowd flow.
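For illustration, the joint objective can be sketched numerically as follows. This is a minimal stand-in with our own function name; it averages over all tensor elements, one common normalization of the squared-error terms.

```python
import numpy as np

def multitask_loss(yd_pred, yd_true, yf_pred, yf_true, lam=0.5):
    """L(theta) = lam * ||Yd_hat - Yd||^2 + (1 - lam) * ||Yf_hat - Yf||^2,
    averaged over tensor elements; lam = 0.5 weights both tasks equally."""
    loss_d = np.mean((yd_pred - yd_true) ** 2)
    loss_f = np.mean((yf_pred - yf_true) ** 2)
    return lam * loss_d + (1 - lam) * loss_f
```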
6 DYNAMIC CROWD MOBILITY GRAPH
So far, multiple steps of citywide crowd density and crowd flow can be modeled and predicted
simultaneously as crowd dynamics. At this stage, citywide crowd dynamics are still represented
with mesh-grids. To help conduct probabilistic reasoning of crowd movements during big
event situations, we build a series of dynamic crowd mobility graphs using the predicted crowd
dynamics. By using graphs instead of meshes, a high-level representation of citywide crowd dynamics
can be obtained that is more easily and efficiently used for citywide-level event crowd management.
Moreover, if severe disasters happen, some parts of the real transportation network might be damaged;
therefore, we need to build a virtual transportation network to replace or work together
with the real-world one. Compared with the static transportation network, our proposed graph
is a dynamic one because: (1) one crowd mobility graph corresponds to one frame of the crowd
dynamics video; our system can report multiple steps of prediction results, thus we can build a
series of graphs, one for each timeslot; and (2) our graph can be updated every 5 minutes by our online
updating system.
Specifically, given one frame of the predicted crowd density video, we view the centroid of each
mesh-grid as a point and the density of the mesh-grid as its weight, then apply weighted KMeans
clustering on all the weighted points to get clusters of mesh-grids. Each cluster can be taken as
one Region-of-Interest (RoI), as well as one node of the mobility graph. Given one frame of the
predicted crowd flow video, it is easy to build a transition matrix Ω between each mesh-grid pair.
By summing up the total transition number between each node pair, the edges of the mobility
graph can be generated. The details of this process are summarized in Algorithm 1. Note that
other clustering or RoI construction algorithms (e.g., T-Pattern [15], MeanShift, or PopularRoutes
[51]) may also be used here; however, KMeans is adopted in our system because it can be easily
tuned using its single parameter K.
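Node construction can be sketched with a density-weighted variant of Lloyd's algorithm. The numpy implementation below is illustrative only; in practice an off-the-shelf weighted KMeans could be used instead (e.g., scikit-learn's `KMeans.fit(..., sample_weight=...)`).

```python
import numpy as np

def weighted_kmeans(points, weights, k, n_iter=50, seed=0):
    """Lloyd's algorithm where each point contributes to centroid updates
    in proportion to its weight (here: the mesh-grid's crowd density)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the weighted mean of its assigned points.
        for j in range(k):
            mask = labels == j
            if mask.any():
                w = weights[mask]
                centers[j] = (points[mask] * w[:, None]).sum(axis=0) / w.sum()
    return labels, centers
```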
Given a β-step crowd flow video f_1, f_2, ..., f_β, transition matrices Ω_1, Ω_2, ..., Ω_β can be
calculated using each frame of the crowd video. Each Ω_t essentially represents the transition probability between
each mesh-grid pair in timeslot t, namely Ω_t[i][j] = P(g_i → g_j) after normalization. Each Ω_t in our
final system is just a transition matrix within 5 minutes, which is not sufficient for relatively long
short-term probabilistic reasoning. To address this issue, we view each Ω_t as a first-order probability
and leverage the first-order probabilities to get higher-order transition probabilities. Ω_1 × Ω_2
contains all the second-order transition probabilities; thus Ω_{1:2} = Ω_1 + Ω_1 × Ω_2 after normalization
corresponds to the 2-step transition probability, which is the likelihood of transition from g_i to g_j
Table 2. Event Information
Event Abbr. Date Training dates
3.11 Earthquake Earthquake 2011/03/11 2011/03/01–2011/03/10
Typhoon Roke Typhoon 2011/09/21 2011/09/11–2011/09/20
New Year’s Day New Year 2012/01/01 2011/12/22–2011/12/31
Tokyo Marathon Marathon 2011/02/27 2011/02/17–2011/02/26
ALGORITHM 1: Dynamic Crowd Mobility Graph Building
Input: Citywide crowd density and flow y_d, y_f, a mesh M, node number K.
Output: Dynamic Crowd Mobility Graph
1  Ψ ← ∅;  // dynamic crowd mobility graph
2  for each t ∈ [t_1, ..., t_β] do
3      d_t, f_t ← retrieve density, flow at t from y_d, y_f;
4      V ← NodeConstruction(d_t, M, K);
5      E ← EdgeConstruction(f_t, M, V);
6      Ψ ← Ψ ∪ (V, E);
7  return Ψ;
8  Function NodeConstruction(d, M, K):
9      C, W ← ∅, ∅;
10     for each g_m ∈ M do
11         C ← C ∪ g_m.centroid; W ← W ∪ d_m;
12     V ← WeightedKMeansClustering(C, W, K);
13     return V;
14 Function EdgeConstruction(f, M, V):
15     Ω[|M|][|M|] ← build grid transition matrix using f;
16     E[|V|][|V|] ← initialize node transition matrix with 0;
17     for each pair (g_p, g_q) ∈ (M, M) do
18         v ← the node in V that g_p belongs to;
19         v′ ← the node in V that g_q belongs to;
20         E[v][v′] ← E[v][v′] + Ω[p][q];
21     return E;
within two steps, namely 5 × 2 minutes. Analogously, we can get the β-step transition probability
by calculating the matrix Ω_{1:β} = Ω_1 + Ω_1 × Ω_2 + ··· + Ω_1 × Ω_2 × ··· × Ω_β with normalization.
Ω_{1:β}, built with multiple steps of crowd flow, can be used to replace the simple first-order transition
matrix Ω in Algorithm 1 (EdgeConstruction function); the crowd mobility graph can then contain the
transition information within a longer time interval, i.e., 5 × β minutes.
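For illustration, the β-step matrix Ω_{1:β} can be computed as below. This is a minimal numpy sketch; row normalization (so each row sums to 1) is one reasonable reading of "after normalization".

```python
import numpy as np

def multi_step_transition(omegas):
    """Compute Ω_{1:β} = Ω1 + Ω1·Ω2 + ... + Ω1·Ω2·...·Ωβ, then row-normalize
    so each row is a β-step transition probability distribution."""
    acc = None                          # running product Ω1·...·Ωt
    total = np.zeros_like(omegas[0])
    for om in omegas:
        acc = om if acc is None else acc @ om
        total += acc
    row = total.sum(axis=1, keepdims=True)
    # Guard against all-zero rows (grids with no observed outflow).
    return np.divide(total, row, out=np.zeros_like(total), where=row > 0)
```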
7 EXPERIMENT
In this section, we present the results of extensive experiments and compare the performance
of our model with that of other baseline models.
7.1 Settings
Experimental Setup: We selected the Greater Tokyo Area (Long. ∈ [139.50, 139.90], Lat. ∈
[35.50, 35.82]) as our target urban area. Four citywide-level events that happened in this area were
Table 3. Summary of Tuned Parameters
Parameter Tuned value
Height, Width 80, 80
Time Intervals of One Day 5 minutes × 288
Train/Test Timeslots 2880/288
Kernel Window η × η 15 × 15
(1) Timestep α/β (Max lead time) 6/6 (30 minutes)
(2) Timestep α/β (Max lead time) 12/12 (60 minutes)
Optimizer Adam
Epoch 200
Learning Rate 0.0001
Batch Size 4
Scaling Factor for Density/Flow 500/100
selected as the testing events, as summarized in Table 2: (1) 3.11 Earthquake (2011/03/11), a magnitude
9.0–9.1 earthquake off the coast of Japan that occurred at 14:46 JST, which had a great
impact on people's behaviors in the Greater Tokyo Area. (2) Typhoon Roke (2011/09/21), recorded
as one of the strongest typhoons in Japan's history, which made subway operators shut down part
of their services. (3) New Year's Day (2012/01/01). There are a number of New Year celebrations in
the Tokyo area; in particular, for "Hatsumode" (the first visit to a Buddhist temple or shrine), most of
the railway lines operate overnight on New Year's Eve. (4) Tokyo Marathon (2011/02/27).
The number of people attending this event was 2.16 million (1.53 million along the road
and 0.63 million visitors to the Tokyo Marathon Festival).
Also, traffic regulation was strictly enforced along the Marathon route. These four event days
were used as testing dates, and the 10 consecutive days before each event day were utilized as the training
and validation dataset, which means 2011/03/01–2011/03/10, 2011/09/11–2011/09/20, 2011/12/22–
2011/12/31, and 2011/02/17–2011/02/26 were the selected periods for the four events respectively.
Our data source contained approximately 100,000–130,000 users' GPS logs on each day within
the target urban area. After conducting data cleaning and noise reduction on the raw dataset, we
applied linear interpolation to ensure that each user's 24-hour (00:00–23:59) GPS log had a constant
5-minute sampling rate. Then, by mapping each coordinate onto the mesh-grid, crowd density video
and crowd flow video could be generated based on the definitions listed in Section 4.
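The coordinate-to-grid mapping can be sketched as below, using the mesh parameters reported in Table 3 (ΔLong. = 0.005, ΔLat. = 0.004 over [139.50, 139.90] × [35.50, 35.82]); the function name and the (column, row) convention are ours.

```python
def to_grid(lon, lat, lon_min=139.50, lat_min=35.50, dlon=0.005, dlat=0.004):
    """Map a GPS coordinate to its (column, row) index in the 80x80 mesh-grid:
    0.40 deg of longitude / 0.005 = 80 columns, 0.32 deg of latitude / 0.004
    = 80 rows (each cell is roughly 500 m x 500 m)."""
    return int((lon - lon_min) / dlon), int((lat - lat_min) / dlat)
```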
Parameter Settings: The parameter settings are summarized in Table 3. We meshed the entire
area with ΔLong. = 0.005, ΔLat. = 0.004 (approximately 500 m × 500 m) to get an 80 × 80 mesh-grid
map. As mentioned above, the time interval Δt of our system was set to 5 minutes. Therefore, we
got 2,880 timeslots (288 × 10 days) as the training dataset and 288 timeslots as the testing dataset, and
a crowd density frame and a crowd flow frame were generated for each timeslot. The kernel window was
set to 15 × 15 for crowd flow, which could capture enough of the transit distance of crowd flow within
5 minutes. We set the observation step α and the prediction step β both to 6 to generate length-6
density/flow videos as inputs and their corresponding next length-6 videos as outputs. This means
our system could predict the crowd dynamics for the next 30 minutes. Each report contained
six steps of prediction results at 5-minute intervals, and the result at the 6th step gave us the maximum
lead time of 30 minutes. Similarly, an evaluation of the prediction with a 60-minute lead time was
also conducted by setting α and β to 12. Finally, we could get 2,868 sample pairs from the training
dataset, and randomly selected 80% of them (2,294) as the training samples and 20% (574) as
the validation samples. The Adam algorithm was employed to control the overall training process,
where the batch size was set to 4 and the learning rate to 0.0001 for all deep learning models, except
that the learning rate of the CNN AutoEncoder was tuned to 0.001. The training algorithm was
stopped after 200 epochs and only the best model was saved. In addition, we used 500 as
the scaling factor for crowd density to scale the data down to relatively small values, and 100 as
the scaling factor for the crowd flow values. In the evaluation, we rescaled the predicted values back to
the normal values and compared them with the ground truth. The parameter settings were kept
the same for each event. Python and some Python libraries such as Keras [8] and TensorFlow [1]
were used in this study. The experiments were performed on a GPU server with four GeForce GTX
1080Ti graphics cards.
Baseline models: We implemented the following models as baselines for comparison.
(1) HistoricalAverage. Crowd density/flow for each timeslot was estimated by averaging the
corresponding values from the last 10 days.
(2) CopyYesterday. We directly used yesterday's value as the predicted value on event days.
(3) CopyLastFrame. We directly copied the last/latest observation as the predicted value, which
can be a simple but very effective method for event situations.
(4) ARIMA. It is a classical time-series prediction model designed for one-dimensional data. For
each mesh-grid, we built one ARIMA model for time-series density prediction.
However, for the flow tensor (80, 80, 225) at each timeslot, the dimension was simply too high
for ARIMA to handle.
(5) VectorAutoRegressive. It is an advanced time-series prediction model designed for high-dimensional
data. By flattening the density tensor (80, 80, 1) at each timeslot into a 6,400-dimensional
vector, the model could handle the crowd density prediction task. For the flow tensor
(80, 80, 225), the dimension was also too high for VAR to deal with.
(6) CityMomentum [10]. It was first proposed for momentary mobility prediction at the citywide
level for big events. We implemented it using a 500-meter mesh-grid and 5-minute
time interval, following exactly the same settings as our model. Although the model was
built from the perspective of an individual's mobility, the predicted/simulated trajectory of each
individual could be used to generate aggregated crowd density and flow, which makes it
comparable with our system.
(7) ST-ResNet [60]. This deep residual learning-based method shows state-of-the-art performance
on citywide crowd flow prediction. To compare its performance under the same problem
definition as ours, we adapted ST-ResNet to take in limited steps of the latest observations
on crowd density/flow and performed one-step-by-one-step autoregression to obtain multiple
steps of predictions. Here, we also found that 1-residual-unit ST-ResNet without external
features achieved the best performance in our event situations.
(8) CNN. It is a one-step predictor constructed with four Conv layers. Note that the 4D tensor
would be converted to a 3D tensor (Height, Width, Timestep ∗ Channel) by concatenating the
channels at each timestep just as [60] did, so that the CNN could take our 4D tensors as
inputs. The first three Conv layers used 32 filters with a 3 × 3 kernel window, and the final Conv
layer used a ReLU activation function to output a single step of the video frame.
(9) CNN Enc.-Dec. It is a multi-step predictor also constructed with four Conv layers. It shares
the same parameter settings as (8). The only difference is that the final Conv layer outputs a
3D tensor (Height, Width, Timestep ∗ Channel) as multiple steps of predictions.
(10) Multitask CNN Enc.-Dec. It has four Conv layers sharing a similar multitask architecture to
that illustrated in Figure 7, namely, separate input encoding Conv layers, shared encoding and
decoding layers, and separate output Conv layers. All the parameters were kept the same
as in (9).
Table 4. Performance Evaluation of 30 Minutes Ahead Prediction on Four Events
Model Earthquake Typhoon New Year Marathon
Density Flow Density Flow Density Flow Density Flow
HistoricalAverage 106.032 0.726 75.402 0.519 176.013 1.099 33.381 0.223
CopyYesterday 129.436 0.912 85.641 0.592 110.444 0.660 65.765 0.437
CopyLastFrame 7.824 0.116 9.756 0.186 5.498 0.079 6.496 0.107
ARIMA 10.430 NA 13.376 NA 8.343 NA 7.808 NA
VectorAutoRegressive 10.843 NA 13.377 NA 9.511 NA 9.380 NA
CityMomentum [10] 27.670 0.653 29.305 0.962 23.058 0.235 25.774 0.475
ST-ResNet [60] 6.542 0.113 7.802 0.183 4.544 0.080 5.548 0.103
CNN 8.698 0.178 10.245 0.196 6.178 0.083 6.614 0.100
CNN Enc.-Dec. 7.115 0.117 8.571 0.187 5.216 0.079 6.004 0.095
M.T. CNN Enc.-Dec. 6.802 0.119 8.226 0.197 5.158 0.084 5.953 0.097
ConvLSTM 6.737 0.124 7.959 0.195 4.679 0.077 5.675 0.094
ConvLSTM Enc.-Dec. 6.281 0.102 7.508 0.171 4.500 0.074 5.372 0.089
M.T. ConvLSTM Enc.-Dec. 5.549 0.102 6.753 0.170 4.117 0.074 5.012 0.086
(11) ConvLSTM (one-step-by-one-step) and (12) ConvLSTM Enc.-Dec. (multi-step-to-multi-step)
are the proposed comparison models constructed with four ConvLSTM layers in Section 5.
Each ConvLSTM layer uses 32 filters with a 3 × 3 kernel window, and ReLU activation is used
in the final layer. BatchNormalization was added between two consecutive CNN/ConvLSTM
layers for all the models. Note that for all of the crowd flow parts, as shown in Figure 6, the CNN
AutoEncoder is first applied to encode the original flow tensor and then decode the
(predicted) encoded flow back to the original format. Our final system is implemented using the
MultiTask ConvLSTM Enc.-Dec.
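The naive baselines (1) and (3) above are straightforward to express; the numpy sketch below is illustrative only (the function names are ours, and array shapes follow a (days, timeslots, height, width) convention assumed for this sketch).

```python
import numpy as np

def historical_average(history):
    """Baseline (1): history has shape (days, T, H, W); each timeslot is
    predicted as the mean of the corresponding values over the past days."""
    return history.mean(axis=0)

def copy_last_frame(observed, beta):
    """Baseline (3): repeat the latest observed frame beta times,
    turning shape (alpha, H, W) into (beta, H, W)."""
    return np.repeat(observed[-1:], beta, axis=0)
```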
7.2 Performance Evaluation
Evaluation metric: We evaluated the performance of the models with MSE as follows:

MSE = (1/n) Σ_{i=1}^{n} ||Ŷ_i − Y_i||²,

where n is the number of samples, and Y and Ŷ are the ground-truth and predicted values in 4D
tensor format, namely (Timestep, Height, Width, Channel). The density tensor and flow tensor
differ in the Channel dimension.
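The metric can be computed directly in numpy; this sketch averages over all tensor elements, one common normalization of the per-sample squared-error terms.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between ground-truth and predicted 4D tensors
    of shape (Timestep, Height, Width, Channel)."""
    return float(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))
```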
Overall performance: We compared the performance of the baseline models and our proposed
model, Multitask ConvLSTM Enc.-Dec., on the four events. The overall evaluation results are summarized
in Table 4 for 30-minutes-ahead prediction and Table 5 for 60-minutes-ahead prediction.
Both show that, across all four events: (1) our model performed better than the others;
and (2) all deep learning models had advantages over the existing methodologies (CityMomentum
and VAR). In particular, we could also observe: (1) the superiority of ConvLSTM to CNN
on video-like modeling tasks; (2) the advantage of the encoder-decoder architecture on multi-step
sequential prediction tasks; and (3) the effectiveness of multitask learning in enhancing the correlated
tasks.
Performance on density: We also verified the performance of our system with a time-series
evaluation over each event day, showing the ground-truth and predicted density for selected areas
(Tokyo Station Area and Shinjuku Station Area) in the city. Each area consists of 3 × 3 neighboring
mesh-grids, with Tokyo Station and Shinjuku Station located at the central mesh-grid
Table 5. Performance Evaluation of 60 Minutes Ahead Prediction on Four Events
Model Earthquake Typhoon New Year Marathon
Density Flow Density Flow Density Flow Density Flow
HistoricalAverage 104.604 0.731 75.927 0.529 175.344 1.102 33.422 0.225
CopyYesterday 128.133 0.920 86.069 0.601 106.991 0.645 65.725 0.440
CopyLastFrame 13.020 0.168 16.607 0.252 8.650 0.096 10.004 0.128
ARIMA 24.296 NA 32.933 NA 15.411 NA 15.259 NA
VectorAutoRegressive 22.355 NA 29.872 NA 21.072 NA 17.261 NA
CityMomentum [10] 32.034 0.570 35.090 0.821 25.867 0.207 28.825 0.400
ST-ResNet [60] 11.899 0.157 13.418 0.238 7.633 0.103 8.501 0.124
CNN 12.247 0.189 17.670 0.360 10.469 0.119 12.114 0.223
CNN Enc.-Dec. 11.372 0.164 13.876 0.245 8.311 0.097 9.127 0.119
M.T. CNN Enc.-Dec. 10.812 0.177 13.800 0.247 8.153 0.101 9.004 0.124
ConvLSTM 11.355 0.139 12.285 0.228 7.615 0.118 9.511 0.140
ConvLSTM Enc.-Dec. 9.309 0.122 11.186 0.197 6.885 0.086 7.843 0.103
M.T. ConvLSTM Enc.-Dec. 8.094 0.122 9.900 0.196 6.496 0.085 7.483 0.101
respectively. From Figure 8, we can straightforwardly confirm the effectiveness of our model for
60-minutes-ahead prediction and its high deployability for a real-world online event crowd management
system. Referring to the normal situation (the prediction result of HistoricalAverage) shown in
the figure, we can find that the densities on event days differ greatly from normal situations. Furthermore,
even comparing these four events, the density patterns are quite different from each other.
This further demonstrates that crowd management in event situations is really challenging
and that our online prediction system is indispensable for these special cases.
Performance on dynamic graph: Using the algorithm proposed in Section 6, we build a series of
dynamic crowd mobility graphs for the 3.11 Earthquake event, and demonstrate three snapshots of
the graph at 14:00, 15:00, and 16:00 (the earthquake occurred at 14:46 JST) in Figure 9. We generate
100 nodes and build the edge for each node pair based on the 6-step transition matrix Ω_{1:6}, which
indicates the crowd flow transition probability in the next 30 minutes. Through Figure 9, we
can see how the crowd dynamics gradually evolved during the earthquake. Both the ground
truth and the predictions show quite different details at 14:00, 15:00, and 16:00. Crowd
density prediction achieved very good performance, as shown previously, and the nodes
of the graph show a close resemblance between the ground truth and the predictions. However,
there still exists some gap between the ground truth and the prediction of the transition probabilities
on the edges. Underestimation can be observed in the figure, which leaves room for further
improvement on the high-dimensional crowd flow video.
Overall efficiency: The implementation was done with TensorFlow v1.10 and the efficiency test
was done on a GeForce GTX 1080Ti graphics card. Our proposed model (MultiTask ConvLSTM
Encoder-Decoder) has 271,224 parameters in total. We verify the overall efficiency of our model by
plotting the learning curves on validation loss in Figure 10. We can observe that, for 30-minute-lead-time
prediction, i.e., α/β = 6/6, it takes around 80 epochs for our model to converge
on the four datasets, while it takes 100 epochs on average to converge when
the lead time is extended to 60 minutes, i.e., α/β = 12/12. Each training epoch takes around 120 seconds,
and each batch takes around 52 milliseconds, which means that the deployed model can deliver
the prediction result in less than one second.
Hyperparameter study: We conduct hyperparameter studies for our proposed model (MultiTask
ConvLSTM Encoder-Decoder) on two hyperparameters: one is the observation/prediction step α/β,
Fig. 8. Visualization of the ground-truth crowd density, the prediction result of HistoricalAverage (seen as
the normal situation), and the prediction result of our model (MultiTask ConvLSTM Enc.-Dec.) at four events.
and the other is the multitask weight λ; the results are summarized in Figure 11. We can
observe that as α/β increases from 3 to 15, the MSE losses on both density and flow show a slow
and gradual increase on the four events, which is due to the limitation of LSTM in modeling overly long
sequences. We can also see that when we adjust λ from 0.1 to 0.9, the MSE loss on density gradually
decreases and the MSE loss on flow slowly increases, since the multitask weights for density and
flow are λ and 1 − λ respectively. This also verifies that the CNN AutoEncoder could encode the
flow part into a task relatively balanced with the density part. Thus, to balance the density
prediction and the flow prediction well, we set λ equal to 0.5 in our final system.
Fig. 9. Visualization of the ground-truth dynamic crowd mobility graph (top) and the predicted results (bottom)
at the 3.11 Earthquake from 14:00 to 16:00. Larger and darker nodes have higher crowd density, and
darker edges represent higher transition probability. The node number K is set to 100, and the edges
correspond to the six-step transition matrix Ω_{1:6}.
Fig. 10. Learning curves on validation loss.
8 CONCLUSION
This article is an extended journal version of our previous work [25]. In this study, we built a
data-driven intelligent system called DeepUrbanEvent to predict citywide crowd dynamics at big
events in a manner analogous to a video prediction task. We proposed to decompose crowd
dynamics into crowd density and crowd flow, and designed a Multitask ConvLSTM Encoder-Decoder
architecture to simultaneously predict multiple steps of crowd density and crowd flow
Fig. 11. Hyperparameter study: MSE on crowd density and flow prediction.
Fig. 12. Our prototype: Tokyo crowd dynamics system. The left part shows the real-time scalar values
of crowd density and flow at the citywide level with a bar chart, as well as for a selected region with a time-series
chart. The right part, through the 3D histogram and OD (Origin-Destination) flow chart,
simultaneously shows the multiple steps of crowd density and flow with different types of geospatial visualization.
The prediction result for the next step is highlighted with a large layout in the upper right. The Tokyo
Metropolitan Government will develop and deploy a real crowd dynamics system based on our prototype.
for the future. The experimental results based on four big real-world events demonstrated the
superior performance of our proposed model compared with the baseline methods. Our model
has been successfully deployed to monitor the crowd dynamics of the Greater Tokyo Area,
as demonstrated in Figure 12. The source code for these models has been released at https:
//github.com/deepkashiwa20/DeepUrbanEvent.
REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy
Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael
Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga,
Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar,
Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg,
Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous
systems. Retrieved from http://tensorflow.org/.
[2] Yasunori Akagi, Takuya Nishimura, Takeshi Kurashima, and Hiroyuki Toda. 2018. A fast and accurate method for
estimating people flow from spatiotemporal population data. In Proceedings of the 27th International Joint Conference
on Artificial Intelligence. 3293–3300.
[3] Lei Bai, Lina Yao, Salil S. Kanhere, Xianzhi Wang, and Quan Z. Sheng. 2019. Stg2seq: Spatial-temporal graph to
sequence model for multi-step passenger demand forecasting. In IJCAI. 1981–1987.
[4] Yoshua Bengio. 2009. Learning deep architectures for AI. Foundations and Trends® in Machine Learning 2, 1 (2009),
1–127.
[5] Pablo Samuel Castro, Daqing Zhang, and Shijian Li. 2012. Urban traffic modelling and prediction using large scale
taxi GPS traces. In Pervasive Computing. Springer, 57–72.
[6] Di Chai, Leye Wang, and Qiang Yang. 2018. Bike flow prediction with multi-graph convolutional networks. In Proceedings
of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. 397–400.
[7] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation.
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1724–1734.
[8] François Chollet. 2015. keras. Retrieved from https://github.com/fchollet/keras.
[9] Zipei Fan, Xuan Song, Renhe Jiang, Quanjun Chen, and Ryosuke Shibasaki. 2019. Decentralized attention-based
personalized human mobility prediction. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous
Technologies 3, 4 (2019), 1–26.
[10] Zipei Fan, Xuan Song, Ryosuke Shibasaki, and Ryutaro Adachi. 2015. CityMomentum: An online approach for crowd
behavior prediction at a citywide level. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and
Ubiquitous Computing. ACM, 559–569.
[11] Zipei Fan, Xuan Song, Tianqi Xia, Renhe Jiang, Ryosuke Shibasaki, and Ritsu Sakuramachi. 2018. Online deep
ensemble learning for predicting citywide human mobility. Proceedings of the ACM on Interactive, Mobile, Wearable and
Ubiquitous Technologies 2, 3 (2018), 1–21.
[12] Jie Feng, Yong Li, Chao Zhang, Funing Sun, Fanchao Meng, Ang Guo, and Depeng Jin. 2018. Deepmove: Predicting
human mobility with attentional recurrent networks. In Proceedings of the 2018 World Wide Web Conference. International
World Wide Web Conferences Steering Committee, 1459–1468.
[13] Qiang Gao, Fan Zhou, Goce Trajcevski, Kunpeng Zhang, Ting Zhong, and Fengli Zhang. 2019. Predicting human
mobility via variational attention. In Proceedings of the World Wide Web Conference. ACM, 2750–2756.
[14] Xu Geng, Yaguang Li, Leye Wang, Lingyu Zhang, Qiang Yang, Jieping Ye, and Yan Liu. 2019. Spatiotemporal multi-graph
convolution network for ride-hailing demand forecasting. In Proceedings of the 2019 AAAI Conference on Artificial
Intelligence.
[15] Fosca Giannotti, Mirco Nanni, Fabio Pinelli, and Dino Pedreschi. 2007. Trajectory pattern mining. In Proceedings of
the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 330–339.
[16] Shengnan Guo, Youfang Lin, Ning Feng, Chao Song, and Huaiyu Wan. 2019. Attention based spatial-temporal graph
convolutional networks for traffic flow forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence.
Vol. 33. 922–929.
[17] Michiel Hermans and Benjamin Schrauwen. 2013. Training and analysing deep recurrent neural networks. In Proceedings
of the Advances in Neural Information Processing Systems. 190–198.
[18] Minh X. Hoang, Yu Zheng, and Ambuj K. Singh. 2016. Forecasting citywide crowd flows based on big data. In Proceedings
of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems.
[19] T. Horanont, A. Witayangkurn, Y. Sekimoto, and R. Shibasaki. 2013. Large-scale auto-GPS analysis for discerning
behavior change during crisis. IEEE Intelligent Systems 28, 4 (2013), 26–34.
[20] Dou Huang, Xuan Song, Zipei Fan, Renhe Jiang, Ryosuke Shibasaki, Yu Zhang, Haizhong Wang, and Yugo Kato. 2019.
A variational autoencoder based generative model of urban human mobility. In Proceedings of the 2019 IEEE Conference
on Multimedia Information Processing and Retrieval. IEEE, 425–430.
[21] Wenhao Huang, Guojie Song, Haikun Hong, and Kunqing Xie. 2014. Deep architecture for traffic flow prediction:
Deep belief networks with multitask learning. IEEE Transactions on Intelligent Transportation Systems 15, 5 (2014),
2191–2201.
[22] Renhe Jiang, Xuan Song, Zipei Fan, Tianqi Xia, Quanjun Chen, Qi Chen, and Ryosuke Shibasaki. 2018. Deep ROI-
based modeling for urban human mobility prediction. Proceedings of the ACM on Interactive, Mobile, Wearable and
Ubiquitous Technologies 2, 1 (2018), 1–29.
[23] Renhe Jiang, Xuan Song, Zipei Fan, Tianqi Xia, Quanjun Chen, Satoshi Miyazawa, and Ryosuke Shibasaki. 2018.
DeepUrbanMomentum: An online deep-learning system for short-term urban mobility prediction. In Proceedings of the 32nd
AAAI Conference on Artificial Intelligence. 784–791.
[24] Renhe Jiang, Xuan Song, Zipei Fan, Tianqi Xia, Zhaonan Wang, Quanjun Chen, Zekun Cai, and Ryosuke Shibasaki.
2021. Transfer urban human mobility via POI embedding over multiple cities. ACM Transactions on Data Science 2, 1
(2021), 1–26.
[25] Renhe Jiang, Xuan Song, Dou Huang, Xiaoya Song, Tianqi Xia, Zekun Cai, Zhaonan Wang, Kyoung-Sook Kim, and Ryosuke Shibasaki. 2019. DeepUrbanEvent: A system for predicting citywide crowd dynamics at big events. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2114–2122.
[26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems. 1097–1105.
[27] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278–2324.
[28] Xiucheng Li, Gao Cong, Aixin Sun, and Yun Cheng. 2019. Learning travel time distributions with deep generative
model. In Proceedings of the World Wide Web Conference. ACM, 1017–1027.
[29] Yaguang Li, Kun Fu, Zheng Wang, Cyrus Shahabi, Jieping Ye, and Yan Liu. 2018. Multi-task representation learning for travel time estimation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1695–1704.
[30] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2018. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In Proceedings of the International Conference on Learning Representations.
[31] Yuxuan Liang, Songyu Ke, Junbo Zhang, Xiuwen Yi, and Yu Zheng. 2018. GeoMAN: Multi-level attention networks for geo-sensory time series prediction. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 3428–3434.
[32] Ziqian Lin, Jie Feng, Ziyang Lu, Yong Li, and Depeng Jin. 2019. DeepSTN+: Context-aware spatial-temporal neural network for crowd flow prediction in metropolis. In Proceedings of the AAAI Conference on Artificial Intelligence.
[33] Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan. 2016. Predicting the next location: A recurrent model with spatial and temporal contexts. In Proceedings of the 30th AAAI Conference on Artificial Intelligence.
[34] Yisheng Lv, Yanjie Duan, Wenwen Kang, Zhengxi Li, and Fei-Yue Wang. 2015. Traffic flow prediction with big data: A deep learning approach. IEEE Transactions on Intelligent Transportation Systems 16, 2 (2015), 865–873.
[35] Xiaolei Ma, Zhimin Tao, Yinhai Wang, Haiyang Yu, and Yunpeng Wang. 2015. Long short-term memory neural network for traffic speed prediction using remote microwave sensor data. Transportation Research Part C: Emerging Technologies 54 (2015), 187–197.
[36] Xiaolei Ma, Haiyang Yu, Yunpeng Wang, and Yinhai Wang. 2015. Large-scale transportation network congestion evolution prediction using deep learning theory. PLoS ONE 10, 3 (2015), e0119044.
[37] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. 2011. Multimodal deep
learning. In Proceedings of the 28th International Conference on Machine Learning. 689–696.
[38] Zheyi Pan, Yuxuan Liang, Weifeng Wang, Yong Yu, Yu Zheng, and Junbo Zhang. 2019. Urban traffic prediction from spatio-temporal data using deep meta learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1720–1730.
[39] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556 (2014).
[40] Chao Song, Youfang Lin, Shengnan Guo, and Huaiyu Wan. 2020. Spatial-temporal synchronous graph convolutional networks: A new framework for spatial-temporal network data forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 914–921.
[41] Xuan Song, Hiroshi Kanasugi, and Ryosuke Shibasaki. 2016. DeepTransport: Prediction and simulation of human mobility and transportation mode at a citywide level. In Proceedings of the 25th International Joint Conference on Artificial Intelligence. 2618–2624.
[42] Xuan Song, Quanshi Zhang, Yoshihide Sekimoto, Ryosuke Shibasaki, Nicholas Jing Yuan, and Xing Xie. 2015. A simulator of human emergency mobility following disasters: Knowledge transfer from big disaster data. In Proceedings of the 29th AAAI Conference on Artificial Intelligence.
[43] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Proceedings
of the Advances in Neural Information Processing Systems. 3104–3112.
[44] Yusuke Tanaka, Tomoharu Iwata, Takeshi Kurashima, Hiroyuki Toda, and Naonori Ueda. 2018. Estimating latent people flow without tracking individuals. In Proceedings of the 2018 International Joint Conference on Artificial Intelligence. 3556–3563.
[45] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11, Dec (2010), 3371–3408.
[46] Dong Wang, Wei Cao, Jian Li, and Jieping Ye. 2017. DeepSD: Supply-demand prediction for online car-hailing services
using deep neural networks. In Proceedings of the 2017 IEEE 33rd International Conference on Data Engineering. IEEE,
243–254.
[47] Dong Wang, Junbo Zhang, Wei Cao, Jian Li, and Yu Zheng. 2018. When will you arrive? Estimating travel time based on deep neural networks. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
[48] Senzhang Wang, Jiannong Cao, and Philip Yu. 2020. Deep learning for spatio-temporal data mining: A survey. IEEE
Transactions on Knowledge and Data Engineering (2020).
[49] Yuandong Wang, Hongzhi Yin, Hongxu Chen, Tianyu Wo, Jie Xu, and Kai Zheng. 2019. Origin-destination matrix
prediction via graph convolution: A new perspective of passenger demand modeling. In Proceedings of the 25th ACM
SIGKDD International Conference on Knowledge Discovery & Data Mining. 1227–1235.
[50] Zheng Wang, Kun Fu, and Jieping Ye. 2018. Learning to estimate the travel time. In Proceedings of the 24th ACM
SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 858–866.
[51] Ling-Yin Wei, Yu Zheng, and Wen-Chih Peng. 2012. Constructing popular routes from uncertain trajectories. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 195–203.
[52] Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. 2019. Graph WaveNet for deep spatial-temporal graph modeling. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. 1907–1913.
[53] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the Advances in Neural Information Processing Systems. 802–810.
[54] Yu Yang, Fan Zhang, and Desheng Zhang. 2018. SharedEdge: GPS-free fine-grained travel time estimation in state-level highway systems. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2, 1 (2018), 48.
[55] Huaxiu Yao, Xianfeng Tang, Hua Wei, Guanjie Zheng, and Zhenhui Li. 2019. Revisiting spatial-temporal similarity: A deep learning framework for traffic prediction. In Proceedings of the 2019 AAAI Conference on Artificial Intelligence.
[56] Huaxiu Yao, Fei Wu, Jintao Ke, Xianfeng Tang, Yitian Jia, Siyu Lu, Pinghua Gong, Jieping Ye, and Zhenhui Li. 2018. Deep multi-view spatial-temporal network for taxi demand prediction. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
[57] Junchen Ye, Leilei Sun, Bowen Du, Yanjie Fu, Xinran Tong, and Hui Xiong. 2019. Co-prediction of multiple transportation demands based on deep spatio-temporal neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 305–313.
[58] Bing Yu, Haoteng Yin, and Zhanxing Zhu. 2018. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. AAAI Press, 3634–3640.
[59] Zhuoning Yuan, Xun Zhou, and Tianbao Yang. 2018. Hetero-ConvLSTM: A deep learning approach to traffic accident prediction on heterogeneous spatio-temporal data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 984–992.
[60] Junbo Zhang, Yu Zheng, and Dekang Qi. 2017. Deep spatio-temporal residual networks for citywide crowd flows prediction. In Proceedings of the 31st AAAI Conference on Artificial Intelligence. 1655–1661.
[61] Junbo Zhang, Yu Zheng, Dekang Qi, Ruiyuan Li, and Xiuwen Yi. 2016. DNN-based prediction model for spatio-temporal data. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 92.
[62] Junbo Zhang, Yu Zheng, Junkai Sun, and Dekang Qi. 2019. Flow prediction in spatio-temporal networks based on
multitask deep learning. IEEE Transactions on Knowledge and Data Engineering 32, 3 (2019), 468–478.
[63] Ling Zhao, Yujiao Song, Chao Zhang, Yu Liu, Pu Wang, Tao Lin, Min Deng, and Haifeng Li. 2019. T-GCN: A temporal graph convolutional network for traffic prediction. IEEE Transactions on Intelligent Transportation Systems 21, 9 (2019), 3848–3858.
[64] Chuanpan Zheng, Xiaoliang Fan, Cheng Wang, and Jianzhong Qi. 2020. GMAN: A graph multi-attention network for traffic prediction. In Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 1234–1241.
[65] Ali Zonoozi, Jung-jae Kim, Xiao-Li Li, and Gao Cong. 2018. Periodic-CRN: A convolutional recurrent model for crowd density prediction with recurring periodic patterns. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 3732–3738.
Received October 2020; revised April 2021; accepted June 2021