Predicting Citywide Crowd Dynamics at Big Events: A Deep
Learning System
RENHE JIANG, The University of Tokyo, SUSTech-UTokyo Joint Research Center on Super Smart City,
Southern University of Science and Technology
ZEKUN CAI, ZHAONAN WANG, and CHUANG YANG, The University of Tokyo
ZIPEI FAN and QUANJUN CHEN, The University of Tokyo, SUSTech-UTokyo Joint Research Center
on Super Smart City, Southern University of Science and Technology
XUAN SONG, SUSTech-UTokyo Joint Research Center on Super Smart City, Southern University of
Science and Technology, The University of Tokyo
RYOSUKE SHIBASAKI, The University of Tokyo
Event crowd management has been a significant research topic with high social impact. When a big
event such as an earthquake, typhoon, or national festival happens, crowd management becomes the first
priority for governments (e.g., police) and public service operators (e.g., subway/bus operators) to protect
people's safety and maintain the operation of public infrastructures. However, under such event situations,
human behavior becomes very different from daily routines, which makes predicting crowd dynamics
at big events highly challenging, especially at a citywide level. Therefore, in this study, we aim to
extract the "deep" trend only from the current momentary observations and generate an accurate prediction
for the trend in the near future, which is considered an effective way to deal with event situations.
Motivated by these, we build an online system called DeepUrbanEvent, which iteratively takes citywide
crowd dynamics from the current one hour as input and reports prediction results for the next one hour
as output. A novel deep learning architecture built with recurrent neural networks is designed to effectively
model these highly complex sequential data in a manner analogous to video prediction tasks. Experimental
results demonstrate the superior performance of our proposed methodology over existing approaches. Lastly,
we apply our prototype system to multiple big real-world events and show that it is highly deployable as an
online crowd management system.
CCS Concepts: • Information systems → Information systems applications; • Human-centered computing → Ubiquitous and mobile computing; • Computing methodologies → Artificial intelligence;
Additional Key Words and Phrases: Crowd management, ubiquitous and mobile computing, deep learning, application and system
This work was supported by Grant-in-Aid for Early-Career Scientists (20K19859) and Grant-in-Aid for Early-Career Scientists (20K19782) of the Japan Society for the Promotion of Science (JSPS).
Authors' addresses: R. Jiang, Z. Fan, and Q. Chen, The University of Tokyo, SUSTech-UTokyo Joint Research Center on Super Smart City, Southern University of Science and Technology; emails: jiangrh@csis.u-tokyo.ac.jp, {fanzipei, chen1990}@iis.u-tokyo.ac.jp; Z. Cai, Z. Wang, C. Yang, and R. Shibasaki, The University of Tokyo; emails: {caizekun, znwang, chuang.yang, shiba}@csis.u-tokyo.ac.jp; X. Song (corresponding author), SUSTech-UTokyo Joint Research Center on Super Smart City, Southern University of Science and Technology, The University of Tokyo; email: songxuan@csis.u-tokyo.ac.jp.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2022 Association for Computing Machinery.
2157-6904/2022/03-ART21 $15.00
https://doi.org/10.1145/3472300
ACM Transactions on Intelligent Systems and Technology, Vol. 13, No. 2, Article 21. Publication date: March 2022.
ACM Reference format:
Renhe Jiang, Zekun Cai, Zhaonan Wang, Chuang Yang, Zipei Fan, Quanjun Chen, Xuan Song, and Ryosuke
Shibasaki. 2022. Predicting Citywide Crowd Dynamics at Big Events: A Deep Learning System. ACM Trans.
Intell. Syst. Technol. 13, 2, Article 21 (March 2022), 24 pages.
https://doi.org/10.1145/3472300
1 INTRODUCTION
Event crowd management has been a significant research topic with high social impact. When
a big event such as an earthquake, typhoon, or national festival happens, crowd management
becomes the first priority for governments (e.g., police) and public service operators (e.g.,
subway/bus operators) to protect people's safety and maintain the operation of public infrastructures.
Especially in a large urban area such as Tokyo or Shanghai, the population density is very
high, which naturally leads to high risks of various accidents and emergency situations. Recall
the tragedy on New Year's Eve in Shanghai: around 300,000 people gathered to celebrate the arrival
of 2015 near Chen Yi Square on the Bund. However, the large crowd was not well controlled,
and a stampede occurred in which 36 people died and 47 were injured. Meanwhile,
artificial intelligence (AI) technology is rapidly developing and 5G mobile internet technology
is forthcoming. Big human mobility data are being continuously generated from a variety
of sources, some of which can be treated and utilized as streaming data for understanding and
predicting crowd dynamics. All of these stimulate us to take new efforts toward this social issue
by using such streaming mobility data and advanced AI technologies. However,
when big events or disasters happen, urban human mobility may dramatically change from normal
situations, meaning that people's movements become almost uncorrelated with their daily routines.
As shown in Figure 1, a big earthquake occurred at 14:46 JST on March 11, 2011. Citywide human
mobility in the Tokyo area was greatly impacted, since the transportation network was suddenly shut
down by the earthquake. Due to this big event, an abnormal pattern of crowd density can be observed
in both the Tokyo station area and the Shinjuku station area. All of these demonstrate that predicting
crowd dynamics under event situations is of high social impact, but very challenging, especially
at a citywide level.
To address this challenge, we aim to extract the "deep" trend only from the current momentary
observations and generate an accurate prediction for the trend in the near future, which is considered
an effective way to handle event situations [10]. We build an intelligent system
called DeepUrbanEvent based on collected big human mobility data and a unique deep learning
architecture. It is designed to be deployed as an online system for crowd management at big events,
which can continuously take limited steps of currently observed crowd dynamics as input and report
multiple steps of prediction results for a short time period in the future as output. With such
multiple steps of prediction, we can easily understand in more detail how the crowd dynamics are
evolving.
Specifically, in this study, citywide crowd dynamics are first decomposed into two parts: crowd
density and crowd flow. By meshing a large urban area into fine-grained grids, both can
be represented by a four-dimensional (4D) tensor (Timestep, Height, Width, Channel)
analogously to a short video, where Timestep represents the number of observation/prediction
steps, and Height and Width are determined by the mesh size. A crowd density/flow video represents
a time series of density/flow values for each mesh-grid; therefore, Channel for density is equal to
1, whereas Channel for flow is equal to the size of the flow kernel window η×η. The stored value
indicates how many people inside a central mesh-grid will transit to each of the η×η neighboring
mesh-grids in a given time interval. A Multitask ConvLSTM Encoder-Decoder architecture is
designed to simultaneously model these two kinds of high-dimensional sequential data to gain
Fig. 1. Citywide human mobility in Tokyo before and after the Great East Japan Earthquake (top). Crowd
density in Tokyo station area and Shinjuku station area (bottom).
concurrent enhancement. Based on this architecture, our system works as an online system that
can continuously take limited steps of currently observed crowd density and crowd flow as input,
and report multiple steps of prediction results as output for the future time period. Finally, we
validate our system on four big real-world events that happened in the Tokyo area, namely, the 3.11
Japan Earthquake, Typhoon Roke (2011), New Year's Day (2012), and the Tokyo Marathon (2011), and
demonstrate its superior performance to baseline models. The overview of our system is
shown in Figure 2. In summary, our work has the following key characteristics that make it unique:
— For predicting crowd dynamics at citywide-level big events, we build an online deployable
system that needs only limited steps of current observations as input.
— Citywide crowd dynamics are decomposed into two kinds of artificial videos, namely, crowd
density video and crowd flow video, and a Multitask ConvLSTM Encoder-Decoder is
designed to simultaneously predict multiple steps of crowd density and flow for the future
time period.
— Using the predicted crowd density and flow videos, we can further build a series of dynamic
crowd mobility graphs to help conduct probabilistic reasoning of crowd movements during
big events.
— We validate our system on four big real-world events with a big human mobility data source
and verify it as a highly deployable prototype system.
The remainder of this article is organized as follows. In Section 2, we review related work.
In Section 3, we provide a description of our data source. In Section 4, we give the problem definition
of citywide crowd dynamics prediction. In Section 5, we illustrate the proposed deep sequential
learning architecture. In Section 6, we explain how to build dynamic crowd mobility graphs. We
Fig. 2. DeepUrbanEvent is designed as an effective real-world system for predicting citywide crowd dynamics
at big events. Citywide crowd dynamics are first decomposed into two parts: crowd density and crowd flow.
By meshing a large urban area into fine-grained grids, both can be represented by a four-dimensional
tensor (Timestep, Height, Width, Channel) analogously to a short video, where Timestep represents
the number of observation/prediction steps, and Height and Width are determined by the mesh size. A crowd
dynamics graph for management can be built based on the predicted density (nodes) and flow (edges).
present and discuss the results of the experimental evaluation in Section 7. In Section 8, we give
our conclusions as well as future work.
2 RELATED WORK
Human mobility data have been widely studied in the fields of data science and AI. A very comprehensive
survey on recent deep spatiotemporal models is given in [48]. Here, we briefly
recap the related work in citywide human mobility prediction. According to the modeling strategy,
it can be divided into three categories: trajectory-based prediction, mesh-based prediction,
and network-based prediction. The trajectory-based methods [9–13, 22, 23, 33, 42] directly model
the trajectories as typical sequential data, whereas the mesh-based [18, 32, 46, 49, 55–57, 59–61, 65]
and network-based ones [5, 14, 21, 28, 29, 34–36, 38, 41, 47, 50, 54, 58] aggregate or map the
trajectories onto mesh-grids or transportation networks.
2.1 Trajectory-Based Mobility Prediction
Many trajectory-based deep learning models have been proposed to predict each individual's movement
[12, 13, 33]. ST-RNN [33] extends a regular RNN by utilizing time- and distance-specific transition
matrices for predicting the next location. DeepMove [12],
considered a state-of-the-art model for trajectory prediction, designed a historical attention
module to capture periodicities and augment prediction accuracy. VANext [13] further enhanced
DeepMove by proposing a novel variational attention mechanism. Besides, [9, 11, 22]
modeled human mobility for a large population at a citywide level. [10, 23, 42] simulated and
predicted human emergency mobility following disasters or events. [20] generated urban human
mobility with a variational autoencoder. [24] transferred urban human mobility via Points-Of-Interest
among different cities. However, these models are built based on millions of individuals'
mobility. In our experiment, we implement CityMomentum [10] as the trajectory-based
baseline.
2.2 Mesh-Based Mobility Prediction
Density (Demand) Prediction. Based on a mesh, predicting citywide crowd density with historical
observations can be represented by a 4D tensor (Timestep, Height, Width, Channel), as
demonstrated in Figure 3. Following the same formulation, Hetero-ConvLSTM [59] predicted traffic
accidents using heterogeneous data sources; DeepSD [46], DMVST-Net [56], and Periodic-CRN
[65] predicted taxi demand using taxi request datasets collected from car-hailing companies. CoST-Net
[57] predicted multiple transportation demands using both taxi and bike data.
In/Out Flow Prediction. Forecasting citywide crowd flow based on a mesh was proposed and
addressed by [18, 60, 61]. As illustrated in Figure 3, they define inflow and outflow to represent
how many people will flow into or out from a certain mesh-grid. The prediction problem can be
represented by a 4D tensor (Timestep, Height, Width, Channel = 2), where Channel stores the inflow
and outflow. [18, 60, 61] build deep learning models using deep neural networks and convolutional
neural networks (CNNs) to do the citywide prediction. Following the same problem definition, a
series of deep learning models have been proposed to improve on the performance of ST-ResNet [60],
including Periodic-CRN [65], STDN [55], and DeepSTN+ [32]. In our experiment, we implement
ST-ResNet [60] as the mesh-based baseline. We adapt ST-ResNet to take in limited steps of the latest
observations on crowd density/flow and perform one-step-by-one-step autoregression to obtain
multiple steps of predictions.
Flow Prediction. Inflow and outflow can only indicate how many people will flow into or out
from a certain mesh-grid, but cannot answer where the people flows come from or transit to. As illustrated
in Figure 3, citywide crowd flow can depict how a crowd of people moves among the entire set of
mesh-grids. The problem can be represented by a 4D tensor (Timestep, Height, Width, Channel = Height
* Width). MDL [62] also utilized multitask learning to simultaneously model and predict crowd
in/out flow and crowd in-out transition. GEML [49] predicted the origin-destination transition matrix
via a Graph Convolutional Network (GCN). Moreover, [2] conducts transition estimation from
Fig. 3. Mesh-based citywide crowd prediction: density, in/out flow, and transition, analogous to video data
prediction on a 4D tensor (Timestep, Height, Width, Channel).
aggregated population data, and [44] estimates the transition populations using inflow and outflow.
However, all these models, including MDL [62] and GEML [49], can only do single-step flow
prediction rather than continuous multi-step prediction. These models also cannot give the
crowd density prediction in a straightforward way, which is very crucial for event crowd
management. Our system aims to simultaneously model and predict citywide crowd density and
crowd flow in a multi-step-to-multi-step manner to better facilitate event crowd management in
the real world.
2.3 Network-Based Mobility Prediction
Based on road network-mapped trajectories, researchers have also applied deep learning to predict
traffic flow [21, 34], traffic speed and congestion [35, 36], travel time [28, 29, 47, 50, 54], and human
mobility with transportation mode [41]. On the other hand, based on road network-aggregated
time series data, GeoMAN [31] designed an LSTM plus attention mechanism as a general solution to
geo-sensory time series prediction, and ST-MetaNet [38] employed meta-learning for traffic flow
prediction. In particular, leveraging the latest technique, GCN, a series of models have been
proposed to address traffic-related problems: ST-GCN [58] and DCRNN [30] first applied GCN
to traffic flow forecasting; ASTGCN [16], GMAN [64], and STSGCN [40] are state-of-the-art models
extended from ST-GCN; Graph WaveNet [52] combined GCN with dilated causal convolution, while
T-GCN [63] combined GCN and a gated recurrent unit to demonstrate superior performance.
Besides, [3, 6, 14] utilized GCN for ride-hailing and car-hailing demand prediction. However, these
models did not directly address the citywide crowd density or crowd flow prediction task.
3 DATA SOURCE
"Konzatsu-Tokei (R)" from ZENRIN DataCom Co., Ltd. was used. It refers to people-flow data collected
from individual location data sent from mobile phones with an enabled AUTO-GPS function
under the users' consent, through the "DoCoMo map navi" service provided by NTT DoCoMo,
Inc. Those data are processed collectively and statistically in order to conceal private information.
The original location data are Global Positioning System (GPS) data (latitude, longitude) sent
at a minimum period of about 5 minutes, and do not include information (such as gender or age)
that could specify individuals. However, the data acquisition is affected by several factors such as loss
of signal or low battery power. In addition, when a mobile phone user stops at a location, the
Table 1. Notation Description

Symbol       Description
M, g         mesh for an urban area, mesh-grid
Timestep     the number of observation/prediction steps
Height       number of mesh-grids along the latitude axis
Width        number of mesh-grids along the longitude axis
Channel      dimension to store the value for density/flow
Γ_i^u        each user's (u) trajectory on each day (i)
d_t^m        crowd density at timeslot t on mesh-grid g_m
f_t^{m,w}    crowd flow from g_m at t−1 to g_w at t
d_t, f_t     citywide density/flow at t
x_d, x_f     current α steps of density/flow w.r.t. t
y_d, y_f     next β steps of density/flow w.r.t. t
X            samples from all timeslots for the current
Y            samples from all timeslots for the next
ˆ            the predicted results
positioning function of his/her mobile phone is automatically turned off to save power. The proposed
methodology is applied to raw GPS data from NTT DoCoMo, Inc. The raw GPS log dataset
was collected anonymously from approximately 1.6 million mobile phone users in Japan over a
three-year period (August 1, 2010, to July 31, 2013). Each record contains user ID, latitude, longitude,
altitude, timestamp, and accuracy level. In this study, we select the Greater Tokyo Area (including
Tokyo City, Kanagawa Prefecture, Chiba Prefecture, and Saitama Prefecture) as the target area. We
obtain 145,507 users' trajectories in total, covering approximately 1% of the real-world population.
This dataset is slightly biased towards young people because, compared with older people
and children, young people are more likely to have a mobile phone with a localization function. To
discover the relationship between our dataset and the real-world population and validate the
representativeness of the dataset, a previous study [19] estimated the home location of each user in the
dataset and compared it with census data on a 1 km mesh-grid, and found the linear relationship:

N_GPS = 0.0063048 · N_census + 0.73551,  R² = 0.79222,

where N_GPS is the population estimated from our GPS dataset, N_census is the population given by
the national census data, and R² is the coefficient of determination. This R² value demonstrates
that our dataset has very good representativeness of the real-world population. Please refer to
[19] for more details on the basic statistics of this dataset.
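As a quick sanity check, the fitted line above can be inverted to scale an observed GPS count up to an estimated real-world population. This small helper is purely illustrative and not part of the original study:

```python
def census_population_from_gps(n_gps):
    # Invert N_GPS = 0.0063048 * N_census + 0.73551 (the fit reported in [19])
    # to estimate the census population behind an observed GPS user count.
    return (n_gps - 0.73551) / 0.0063048
```

By this relationship, one observed user corresponds to very roughly 160 real-world residents, which is broadly consistent with the ~1% coverage mentioned above.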
4 CITYWIDE CROWD DYNAMICS MODELING
Definition 1 (Calibrated Human Trajectory Database). Human trajectories are stored and indexed
by day (i) and user ID (u) in the trajectory database Γ. Given a mesh M of an area, namely a set
of mesh-grids {g_1, g_2, ..., g_m, ..., g_{Height*Width}}, and a time interval Δt, each user's trajectory on
each day Γ_i^u is mapped onto mesh-grids and then calibrated to obtain a constant sampling rate as
follows:

Γ_i^u = (t_1, g_1), ..., (t_k, g_k)  ∧  ∀j ∈ [1, k): |t_{j+1} − t_j| = Δt,

which means that the time interval between any two consecutive timeslots is calibrated to Δt.
For simplicity, from now on we only consider a one-day slice of the trajectory database Γ; the
day index (i) can then be omitted when referring to Γ. Table 1 above summarizes the notation used in
this article.
Definition 2 (Crowd Density). Given Γ and M, crowd density at timeslot t on mesh-grid g_m is defined
as follows:

d_t^m = |{u | Γ^u.g_t = g_m}|,

which intuitively indicates how many people are inside g_m at t.
Definition 3 (Crowd Flow). To capture the crowd flow starting from a certain mesh-grid, we
utilize a kernel window denoted as η×η w.r.t. g_m, which represents a square area made up of η×η
neighboring mesh-grids with g_m as the centroid mesh-grid. Given Γ, M, and a kernel window η×η
w.r.t. each g, crowd flow at timeslot t on mesh-grid g_m is defined as follows:

f_t^{m,w} = |{u | Γ^u.g_{t−1} = g_m ∧ Γ^u.g_t = g_w}|,

which intuitively indicates how many people transit from mesh-grid g_m at timeslot t−1 to a neighboring
mesh-grid g_w inside the kernel window at timeslot t. After calculating the crowd density/flow
for each mesh-grid over the entire mesh, citywide crowd density/flow can be obtained for each
timeslot.
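To make Definitions 2 and 3 concrete, the following is a minimal numpy sketch of how calibrated trajectories can be binned into per-timeslot density and flow tensors. The function name, data layout, and channel indexing are our own illustrative assumptions, not the system's actual code:

```python
import numpy as np

def crowd_dynamics_tensors(trajs, height, width, eta):
    # trajs: user -> list of (row, col) mesh-grid indices, one per timeslot,
    # already calibrated to a constant sampling rate (Definition 1).
    T = len(next(iter(trajs.values())))
    r = eta // 2
    density = np.zeros((T, height, width, 1))
    flow = np.zeros((T, height, width, eta * eta))
    for path in trajs.values():
        for t, (i, j) in enumerate(path):
            density[t, i, j, 0] += 1           # Definition 2: people inside g_m at t
            if t > 0:
                pi, pj = path[t - 1]           # previous mesh-grid g_m
                di, dj = i - pi, j - pj        # offset inside the eta x eta window
                if abs(di) <= r and abs(dj) <= r:
                    k = (di + r) * eta + (dj + r)
                    flow[t, pi, pj, k] += 1    # Definition 3: transition g_m -> g_w
    return density, flow
```

Transitions longer than the kernel window are simply dropped in this sketch; by construction of the window size in Section 5.2, such transitions should not occur within one time interval.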
Definition 4 (Crowd Density/Flow Video). As the mesh is represented in a two-dimensional format,
a crowd density/flow video containing a number of consecutive frames can be represented by a 4D
tensor R^{Timestep×Height×Width×Channel}, where Timestep represents the number of video frames,
Channel for density is equal to 1, and Channel for flow is equal to the size of the given kernel
window, namely η². An illustration of the crowd density/flow video is shown in Figure 2.
Definition 5 (Crowd Density/Flow Video Prediction). Given the currently observed α-step crowd
density/flow video x_d = d_{t−(α−1)}, ..., d_t and x_f = f_{t−(α−1)}, ..., f_t at timeslot t, prediction of the next
β-step density/flow video ŷ_d = d̂_{t+1}, ..., d̂_{t+β} and ŷ_f = f̂_{t+1}, ..., f̂_{t+β} is modeled as follows:

ŷ_d = d̂_{t+1}, d̂_{t+2}, ..., d̂_{t+β} = argmax_{d_{t+1}, ..., d_{t+β}} P(d_{t+1}, d_{t+2}, ..., d_{t+β} | d_{t−(α−1)}, ..., d_t),

ŷ_f = f̂_{t+1}, f̂_{t+2}, ..., f̂_{t+β} = argmax_{f_{t+1}, ..., f_{t+β}} P(f_{t+1}, f_{t+2}, ..., f_{t+β} | f_{t−(α−1)}, ..., f_t).
Definition 6 (Citywide Crowd Dynamics Prediction). Given the currently observed α-step crowd
density/flow video, citywide crowd dynamics prediction aims to simultaneously generate the next β-step
density/flow video, which is modeled as follows:

ŷ_d, ŷ_f = argmax_{y_d, y_f} P(y_d, y_f | x_d, x_f).

Moreover, by jointly modeling these two highly correlated tasks, concurrent enhancements for
both can be expected. It should be noted that the crowd density video and crowd flow video are
summarized and proposed here as a new concept called crowd dynamics, which aims not only to
reflect the crowd density for each mesh-grid but also to depict how a crowd of people moves among
the mesh-grids. Figure 4 illustrates the overall problem definition mentioned above.
5 DEEP SEQUENTIAL LEARNING ARCHITECTURE
As shown above, the citywide crowd dynamics problem has been defined in a manner analogous
to a video prediction task. However, citywide crowd dynamics are a highly complex phenomenon,
especially when big events happen, which makes it very difficult to handle these high-dimensional
Fig. 4. Citywide crowd dynamics prediction.
sequential data with some classical methodologies. This naturally motivates us to employ the most
advanced deep video learning model as the basic component of our system.
Convolutional LSTM. ConvLSTM [53] was proposed to build an end-to-end trainable model
for the precipitation nowcasting problem. It extends the fully connected LSTM to have convolutional
structures in both the input-to-state and state-to-state transitions and achieves new success
on video modeling tasks. Thus, ConvLSTM is utilized as the core component of our system for
the density and flow video prediction tasks. As shown in Figure 2, a ConvLSTM has three gates
comprising an input gate i, an output gate o, and a forget gate f, the same as an ordinary LSTM.
The hidden state h_t in a ConvLSTM is calculated iteratively from t = 1 to T for an input sequence of
frames (x_1, x_2, ..., x_T) as follows:

i_t = σ(W_xi ∗ x_t + W_hi ∗ h_{t−1} + W_ci ∘ c_{t−1} + b_i),
f_t = σ(W_xf ∗ x_t + W_hf ∗ h_{t−1} + W_cf ∘ c_{t−1} + b_f),
c_t = f_t ∘ c_{t−1} + i_t ∘ tanh(W_xc ∗ x_t + W_hc ∗ h_{t−1} + b_c),
o_t = σ(W_xo ∗ x_t + W_ho ∗ h_{t−1} + W_co ∘ c_t + b_o),
h_t = o_t ∘ tanh(c_t),

where W denotes weights, b bias vectors, ∗ denotes the convolution operator, and ∘ represents the Hadamard
product. All of these weight parameters are determined by applying the standard "backpropagation
through time" algorithm, which starts by unfolding the recurrent neural network through time
and then generalizes the backpropagation for feed-forward networks to minimize the defined
loss function, which is the Mean Squared Error (MSE) for our problem. The full details of the
algorithm are omitted here.
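For illustration, the gate equations can be sketched as a single ConvLSTM step in plain numpy. The naive "same" convolution and the parameter-dictionary layout are our own simplifications for readability, not the system's implementation:

```python
import numpy as np

def conv2d_same(x, w):
    # naive 'same' convolution; x: (H, W, C_in), w: (k, k, C_in, C_out)
    k = w.shape[0]
    p = k // 2
    H, W, _ = x.shape
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros((H, W, w.shape[3]))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.tensordot(xp[i:i + k, j:j + k], w,
                                     axes=([0, 1, 2], [0, 1, 2]))
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h_prev, c_prev, P):
    # P maps names like 'Wxi', 'Whi', 'Wci', 'bi' to arrays; '*' in the
    # equations is conv2d_same, the Hadamard product is elementwise '*'.
    i = sigmoid(conv2d_same(x, P['Wxi']) + conv2d_same(h_prev, P['Whi'])
                + P['Wci'] * c_prev + P['bi'])
    f = sigmoid(conv2d_same(x, P['Wxf']) + conv2d_same(h_prev, P['Whf'])
                + P['Wcf'] * c_prev + P['bf'])
    c = f * c_prev + i * np.tanh(conv2d_same(x, P['Wxc'])
                                 + conv2d_same(h_prev, P['Whc']) + P['bc'])
    o = sigmoid(conv2d_same(x, P['Wxo']) + conv2d_same(h_prev, P['Who'])
                + P['Wco'] * c + P['bo'])
    h = o * np.tanh(c)
    return h, c
```

In practice a framework implementation (e.g., a ConvLSTM layer in a deep learning library) replaces this loop-based convolution, but the gate logic is the same.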
5.1 Stacked ConvLSTM Architecture
As a comparative modeling approach, we would like to verify what the performance of our system
would be like if we used a one-step-by-one-step prediction model and obtained multiple steps of
Fig. 5. Stacked ConvLSTM for one-step prediction.
predictions by iterating it in an autoregressive manner. The one-step-by-one-step crowd density
prediction model can be defined as follows:

d̂_{t+1} = argmax_{d_{t+1}} P(d_{t+1} | d_{t−(α−1)}, ..., d_{t−1}, d_t).

Once the one-step model is trained, we can run it β times to obtain d̂_{t+1}, d̂_{t+2}, ..., d̂_{t+β}. Crowd
flow prediction can be modeled in a similar way, which is omitted for simplicity.
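The autoregressive rollout can be sketched generically as follows, where model stands for any trained one-step predictor and the helper name is our own:

```python
def autoregressive_rollout(model, history, beta):
    # history: the latest alpha observed frames; each prediction is fed
    # back as input, so errors can accumulate over the beta steps.
    window = list(history)
    preds = []
    for _ in range(beta):
        nxt = model(window)
        preds.append(nxt)
        window = window[1:] + [nxt]   # slide: drop oldest, append prediction
    return preds
```

The error accumulation visible in this loop is exactly what motivates the multi-step-to-multi-step design in Section 5.3.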
Moreover, the use of multiple stacked layers of neural networks can be considered to boost
performance in difficult time-series modeling tasks [17]. Thus, a deep architecture
constructed with multiple stacked ConvLSTM layers is shown in Figure 5 for one-step
prediction. It has strong representational power, which makes it suitable for giving predictions on
complex phenomena like citywide crowd dynamics. Note that the same network architecture
can also be applied to one-step crowd flow prediction, but a special AutoEncoder component is
first necessary due to the uniqueness of the crowd flow video, which is explained in the following.
5.2 CNN AutoEncoder for Crowd Flow
Crowd density and flow videos are both represented as 4D tensors R^{Timestep×Height×Width×Channel};
however, the Channel for flow is much larger than that for density. In our system, each grid is set to 500 m
× 500 m. Taking into account all the possible transportation modes such as WALK, BUS, CAR,
and TRAIN, the transition distance from one grid-cell to another neighboring one can be up to 4
km within a 5-minute time interval (approximately 48 km/h at most). Thus, the kernel window needs to
be 15 × 15 to capture all the possible crowd flow within 5 minutes. Channel for flow is then equal to
225, which is simply too large for most state-of-the-art video learning models to handle. Thus,
we build a special CNN AutoEncoder [4, 45] to obtain a low-dimensional representation of Channel
for crowd flow.
Compared with traditional neural networks, CNNs were designed specifically for analyzing
visual imagery [27], where the neurons in a layer are only connected to a small region of the
previous layer instead of to all of the neurons in a fully connected manner. CNNs are the state-of-the-art
method for image recognition and classification tasks [26, 39]. For a typical CNN layer, the
Fig. 6. CNN AutoEncoder for crowd flow.
convolutional feature value at location (i, j) in the k-th feature map is calculated as:

c_{i,j,k} = ReLU(w_k · x_{i,j} + b_k),

where w_k and b_k are the weight and bias of the k-th filter, respectively, and x_{i,j} is the input patch
centered at location (i, j). ReLU is often used as the activation function.
The details of the special CNN AutoEncoder are presented in Figure 6. Note that we
aim to model the crowd dynamics at a citywide level by taking the entire city "image" as the
computation unit rather than each grid "pixel." Therefore, the original crowd flow image of the
city is represented with a three-dimensional (3D) tensor (80, 80, 225). An encoder is constructed
with three convolutional layers to encode the image into a small 3D tensor (80, 80, 4), and then
a decoder is constructed with three convolutional layers to decode the compressed tensor back
to the original 3D tensor (80, 80, 225). The end-to-end model parameters can be optimized by
minimizing the reconstruction error between the original flow image and the decoded flow image. In our
system, we aim to obtain a compressed Channel (from 225 to 4) but keep the spatial structural
information of the citywide crowd flow image at (80, 80). Thus, only convolutional layers with a 1 ×
1 kernel window are utilized. At the last layer of the encoder, a unique ReLU(MAX = 1.0) function
is utilized to ensure that the values are all scaled into [0, 1], which keeps the value range of
crowd flow approximately the same as the value range of crowd density. Without this, the multitask
learning mechanism introduced in the following could not function well.
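A minimal numpy sketch of this 1 × 1 channel autoencoder follows; since a 1 × 1 convolution is just a per-pixel linear map over channels, it reduces to a matrix product. The intermediate widths (64 and 16) are illustrative assumptions of ours — the section fixes only the endpoints 225 and 4:

```python
import numpy as np

def conv1x1(x, w, b, act):
    # x: (H, W, C_in), w: (C_in, C_out); a 1x1 conv is a per-pixel linear map
    return act(x @ w + b)

relu = lambda z: np.maximum(z, 0.0)
relu_max1 = lambda z: np.minimum(np.maximum(z, 0.0), 1.0)   # ReLU capped at 1.0

def encode_decode(flow_img, enc_params, dec_params):
    # Three encoder layers compress Channel 225 -> 4; the last one uses
    # ReLU(MAX = 1.0) so the latent values land in [0, 1]. Three decoder
    # layers map the latent back to 225 channels; H x W is untouched.
    z = flow_img
    for w, b in enc_params[:-1]:
        z = conv1x1(z, w, b, relu)
    w, b = enc_params[-1]
    z = conv1x1(z, w, b, relu_max1)
    y = z
    for w, b in dec_params:
        y = conv1x1(y, w, b, relu)
    return z, y
```

Training would minimize the mean squared reconstruction error between flow_img and the decoded output y, as described above.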
5.3 Multitask ConvLSTM Encoder and Decoder
With such a CNN AutoEncoder, citywide crowd flow can be modeled and computed with the same
architecture as for crowd density shown in Figure 5. The prediction can be performed in an iterative
one-by-one manner, but one major limitation of this model appears when predicting relatively long
short-term crowd dynamics. As the iteration goes on, the accumulated iteration error becomes large,
which can result in terrible performance on the last several predicted steps. To tackle this problem,
we improve the one-step-by-one-step modeling with multi-step-to-multi-step modeling (Definition
5), aimed at achieving better performance on "long" short-term predictions. To deliver this
idea, a sequential encoder and decoder architecture [7, 43] is built with four ConvLSTM
Fig. 7. Multitask ConvLSTM encoder-decoder for simultaneous multi-step prediction of crowd density and
crowd flow.
layers in this study. It works in the following steps: (1) the first two hidden layers of ConvLSTM
(encoder) map the α steps of the input crowd density or flow video into a single latent vector,
which contains information about the entire video sequence; (2) this latent vector is repeated β
times to form a constant sequence; and (3) the other two hidden layers of ConvLSTM (decoder) are used
to turn this constant sequence into the β steps of the output video sequence. A batch normalization
layer is added between two consecutive ConvLSTM layers. ReLU is used as the activation
function in the final decoding layer. The ConvLSTM Enc.-Dec. model can be separately trained by
minimizing the prediction errors L(θ_d) and L(θ_f) for crowd density video Y_D and crowd flow video
Y_F, described as follows:

L(θ_d) = ||Ŷ_D − Y_D||²,
L(θ_f) = ||Ŷ_F − Y_F||².
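As an illustration of this repeat-and-decode mechanism, a minimal Keras sketch is given below. It is a sketch of the idea, not the authors' released code: the function name is ours, and the layer widths and grid size are reduced for readability (the paper uses 32 filters and an 80 × 80 grid).

```python
from tensorflow.keras import layers, models

def build_convlstm_enc_dec(alpha, beta, h, w, c, filters=4):
    """Multi-step-to-multi-step ConvLSTM encoder-decoder sketch:
    (1) encode alpha input frames into one latent map,
    (2) repeat the latent beta times as a constant sequence,
    (3) decode it into beta output frames."""
    inp = layers.Input(shape=(alpha, h, w, c))
    # Encoder: two ConvLSTM layers; the second collapses the time axis.
    x = layers.ConvLSTM2D(filters, 3, padding="same", return_sequences=True)(inp)
    x = layers.BatchNormalization()(x)
    z = layers.ConvLSTM2D(filters, 3, padding="same", return_sequences=False)(x)
    # Repeat the latent representation beta times (step 2).
    zrep = layers.Reshape((beta, h, w, filters))(
        layers.RepeatVector(beta)(layers.Flatten()(z)))
    # Decoder: two ConvLSTM layers returning beta frames; ReLU at the end.
    y = layers.ConvLSTM2D(filters, 3, padding="same", return_sequences=True)(zrep)
    y = layers.BatchNormalization()(y)
    out = layers.ConvLSTM2D(c, 3, padding="same", return_sequences=True,
                            activation="relu")(y)
    return models.Model(inp, out)
```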
Crowd density video and crowd flow video share important information and are highly correlated
with each other. The insights behind this are twofold: (1) people tend to follow the
flow, especially under event/emergency situations, which may make crowded places attract
more and more people; and (2) higher inflow leads to higher density for each grid-cell, and
similarly higher outflow reduces the crowd density. Moreover, as mentioned above, a crowd
management system needs to predict not only the crowd density but also the crowd flow for each
grid. Thus, we jointly model these two highly correlated tasks as defined in Definition 6, and propose
a Multitask ConvLSTM Encoder and Decoder architecture as shown in Figure 7. The key concept
Predicting Citywide Crowd Dynamics at Big Events 21:13
of multi-task learning [37] is to learn multiple tasks simultaneously with the aim of gaining mutual
benefits; thus, learning performance can be improved through parallel learning while using
a shared latent representation. Therefore, it is reasonable to expect better performance from this
learning framework for our system. Our Multitask ConvLSTM Encoder and Decoder architecture
first takes X_D and X_F as two separate inputs. The separate input encoders first encode the crowd
density and crowd flow videos respectively. Then, the shared encoder maps the encoded crowd
density and flow into a joint latent representation z_α, which can be taken as the auto-extracted
features for the entire crowd dynamics; z_α is then repeated β times and passed to the shared
decoder, and finally the output decoders give the multiple steps of prediction results for the crowd
density video Ŷ_D and flow video Ŷ_F respectively. The entire model can be trained by minimizing
the total prediction error L(θ) of crowd density and flow, described as follows:

L(θ) = λ||Ŷ_D − Y_D||² + (1 − λ)||Ŷ_F − Y_F||²,

where λ is set to 0.5 to make the two parts of the loss contribute equally. The CNN AutoEncoder still
needs to be applied first to the original crowd flow.
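For illustration, the joint objective can be sketched numerically as follows. This is a minimal stand-in with our own function name; it averages over all tensor elements, one common normalization of the squared-error terms.

```python
import numpy as np

def multitask_loss(yd_pred, yd_true, yf_pred, yf_true, lam=0.5):
    """L(theta) = lam * ||Yd_hat - Yd||^2 + (1 - lam) * ||Yf_hat - Yf||^2,
    averaged over tensor elements; lam = 0.5 weights both tasks equally."""
    loss_d = np.mean((yd_pred - yd_true) ** 2)
    loss_f = np.mean((yf_pred - yf_true) ** 2)
    return lam * loss_d + (1 - lam) * loss_f
```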
6 DYNAMIC CROWD MOBILITY GRAPH
So far, multiple steps of citywide crowd density and crowd flow can be modeled and predicted
simultaneously as crowd dynamics. At this stage, citywide crowd dynamics are still represented
with mesh-grids. To help conduct probabilistic reasoning of crowd movements during big
event situations, we build a series of dynamic crowd mobility graphs using the predicted crowd
dynamics. By using graphs instead of meshes, a high-level representation of citywide crowd dynamics
can be obtained that is more easily and efficiently used for citywide-level event crowd management.
Moreover, if severe disasters happen, some parts of the real transportation network might be damaged;
therefore, we need to build a virtual transportation network to replace or work together
with the real-world one. Compared with the static transportation network, our proposed graph
is a dynamic one because: (1) one crowd mobility graph corresponds to one frame of the crowd
dynamics video; our system can report multiple steps of prediction results, thus we can build a
series of graphs, one for each timeslot; and (2) our graph can be updated every 5 minutes by our online
updating system.
Specifically, given one frame of the predicted crowd density video, we view the centroid of each
mesh-grid as a point and the density of the mesh-grid as its weight, then apply weighted KMeans
clustering on all the weighted points to get clusters of mesh-grids. Each cluster can be taken as
one Region-of-Interest (RoI), as well as one node of the mobility graph. Given one frame of the
predicted crowd flow video, it is easy to build a transition matrix Ω between each mesh-grid pair.
By summing up the total transition number between each node pair, the edges of the mobility
graph can be generated. The details of this process are summarized in Algorithm 1. Note that
other clustering or RoI construction algorithms (e.g., T-Pattern [15], MeanShift, or PopularRoutes
[51]) may also be used here; however, KMeans is adopted in our system because it can be easily
tuned using its single parameter K.
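Node construction can be sketched with a density-weighted variant of Lloyd's algorithm. The numpy implementation below is illustrative only; in practice an off-the-shelf weighted KMeans could be used instead (e.g., scikit-learn's `KMeans.fit(..., sample_weight=...)`).

```python
import numpy as np

def weighted_kmeans(points, weights, k, n_iter=50, seed=0):
    """Lloyd's algorithm where each point contributes to centroid updates
    in proportion to its weight (here: the mesh-grid's crowd density)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the weighted mean of its assigned points.
        for j in range(k):
            mask = labels == j
            if mask.any():
                w = weights[mask]
                centers[j] = (points[mask] * w[:, None]).sum(axis=0) / w.sum()
    return labels, centers
```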
Given a β-step crowd flow video f_1, f_2, ..., f_β, transition matrices Ω_1, Ω_2, ..., Ω_β can be
calculated using each frame of the crowd video. Each Ω_t essentially represents the transition probability between
each mesh-grid pair in timeslot t, namely Ω_t[i][j] = P(g_i → g_j) after normalization. Each Ω_t in our
final system is just a transition matrix within 5 minutes, which is not sufficient for relatively long
short-term probabilistic reasoning. To address this issue, we view each Ω_t as a first-order probability
and leverage the first-order probabilities to get higher-order transition probabilities. Ω_1 × Ω_2
contains all the second-order transition probabilities; thus Ω_{1:2} = Ω_1 + Ω_1 × Ω_2 after normalization
corresponds to the 2-step transition probability, which is the likelihood of transition from g_i to g_j
Table 2. Event Information
Event Abbr. Date Training dates
3.11 Earthquake Earthquake 2011/03/11 2011/03/01–2011/03/10
Typhoon Roke Typhoon 2011/09/21 2011/09/11–2011/09/20
New Year’s Day New Year 2012/01/01 2011/12/22–2011/12/31
Tokyo Marathon Marathon 2011/02/27 2011/02/17–2011/02/26
ALGORITHM 1: Dynamic Crowd Mobility Graph Building
Input: Citywide crowd density and flow y_d, y_f, a mesh M, node number K.
Output: Dynamic Crowd Mobility Graph
1  Ψ ← ∅;  // dynamic crowd mobility graph
2  for each t ∈ [t_1, ..., t_β] do
3      d_t, f_t ← retrieve density, flow at t from y_d, y_f;
4      V ← NodeConstruction(d_t, M, K);
5      E ← EdgeConstruction(f_t, M, V);
6      Ψ ← Ψ ∪ (V, E);
7  return Ψ;
8  Function NodeConstruction(d, M, K):
9      C, W ← ∅, ∅;
10     for each g_m ∈ M do
11         C ← C ∪ g_m.centroid; W ← W ∪ d_m;
12     V ← WeightedKMeansClustering(C, W, K);
13     return V;
14 Function EdgeConstruction(f, M, V):
15     Ω[|M|][|M|] ← build grid transition matrix using f;
16     E[|V|][|V|] ← initialize node transition matrix with 0;
17     for each pair (g_p, g_q) ∈ (M, M) do
18         v ← the node in V that g_p belongs to;
19         v′ ← the node in V that g_q belongs to;
20         E[v][v′] ← E[v][v′] + Ω[p][q];
21     return E;
within two steps, namely 5 × 2 minutes. Analogously, we can get the β-step transition probability
by calculating the matrix Ω_{1:β} = Ω_1 + Ω_1 × Ω_2 + ··· + Ω_1 × Ω_2 × ··· × Ω_β with normalization.
Ω_{1:β}, built with multiple steps of crowd flow, can be used to replace the simple first-order transition
matrix Ω in Algorithm 1 (EdgeConstruction function); the crowd mobility graph can then contain the
transition information within a longer time interval, i.e., 5 × β minutes.
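For illustration, the β-step matrix Ω_{1:β} can be computed as below. This is a minimal numpy sketch; row normalization (so each row sums to 1) is one reasonable reading of "after normalization".

```python
import numpy as np

def multi_step_transition(omegas):
    """Compute Ω_{1:β} = Ω1 + Ω1·Ω2 + ... + Ω1·Ω2·...·Ωβ, then row-normalize
    so each row is a β-step transition probability distribution."""
    acc = None                          # running product Ω1·...·Ωt
    total = np.zeros_like(omegas[0])
    for om in omegas:
        acc = om if acc is None else acc @ om
        total += acc
    row = total.sum(axis=1, keepdims=True)
    # Guard against all-zero rows (grids with no observed outflow).
    return np.divide(total, row, out=np.zeros_like(total), where=row > 0)
```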
7 EXPERIMENT
In this section, we present the results of extensive experiments and compare the performance
of our model with that of other baseline models.
7.1 Settings
Experimental Setup: We selected the Greater Tokyo Area (Long. ∈ [139.50, 139.90], Lat. ∈
[35.50, 35.82]) as our target urban area. Four citywide-level events that happened in this area were
Table 3. Summary of Tuned Parameters
Parameter Tuned value
Height, Width 80, 80
Time Intervals of One Day 5 minutes × 288
Train/Test Timeslots 2880/288
Kernel Window η × η 15 × 15
(1) Timestep α/β (Max lead time) 6/6 (30 minutes)
(2) Timestep α/β (Max lead time) 12/12 (60 minutes)
Optimizer Adam
Epoch 200
Learning Rate 0.0001
Batch Size 4
Scaling Factor for Density/Flow 500/100
selected as the testing events, as summarized in Table 2: (1) 3.11 Earthquake (2011/03/11), a magnitude
9.0–9.1 earthquake off the coast of Japan that occurred at 14:46 JST, which had a great
impact on people's behaviors in the Greater Tokyo Area. (2) Typhoon Roke (2011/09/21), recorded
as one of the strongest typhoons in Japan's history, which made subway operators shut down part
of their services. (3) New Year's Day (2012/01/01). There are a number of New Year celebrations in
the Tokyo area; in particular, for "Hatsumode" (the first visit to a Buddhist temple or shrine), most of
the railway lines operate overnight on New Year's Eve. (4) Tokyo Marathon (2011/02/27).
The number of people attending this event was 2.16 million (1.53 million along the road
and 0.63 million visitors to the Tokyo Marathon Festival).
Also, traffic regulation was strictly enforced along the Marathon route. These four event days
were used as testing dates, and the 10 consecutive days before each event day were utilized as the training
and validation dataset, which means 2011/03/01–2011/03/10, 2011/09/11–2011/09/20, 2011/12/22–
2011/12/31, and 2011/02/17–2011/02/26 were the selected periods for the four events respectively.
Our data source contained approximately 100,000–130,000 users' GPS logs on each day within
the target urban area. After conducting data cleaning and noise reduction on the raw dataset, we
applied linear interpolation to ensure that each user's 24-hour (00:00–23:59) GPS log had a constant
5-minute sampling rate. Then, by mapping each coordinate onto the mesh-grid, crowd density video
and crowd flow video could be generated based on the definitions listed in Section 4.
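The coordinate-to-grid mapping can be sketched as below, using the mesh parameters reported in Table 3 (ΔLong. = 0.005, ΔLat. = 0.004 over [139.50, 139.90] × [35.50, 35.82]); the function name and the (column, row) convention are ours.

```python
def to_grid(lon, lat, lon_min=139.50, lat_min=35.50, dlon=0.005, dlat=0.004):
    """Map a GPS coordinate to its (column, row) index in the 80x80 mesh-grid:
    0.40 deg of longitude / 0.005 = 80 columns, 0.32 deg of latitude / 0.004
    = 80 rows (each cell is roughly 500 m x 500 m)."""
    return int((lon - lon_min) / dlon), int((lat - lat_min) / dlat)
```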
Parameter Settings: The parameter settings are summarized in Table 3. We meshed the entire
area with ΔLong. = 0.005, ΔLat. = 0.004 (approximately 500 m × 500 m) to get an 80 × 80 mesh-grid
map. As mentioned above, the time interval Δt of our system was set to 5 minutes. Therefore, we
got 2,880 timeslots (288 × 10 days) as the training dataset and 288 timeslots as the testing dataset, and
a crowd density frame and a crowd flow frame were generated for each timeslot. The kernel window was
set to 15 × 15 for crowd flow, which could capture enough of the transit distance of crowd flow within
5 minutes. We set the observation step α and the prediction step β both to 6 to generate length-6
density/flow videos as inputs and their corresponding next length-6 videos as outputs. This means
our system could predict the crowd dynamics for the next 30 minutes. Each report contained
six steps of prediction results at 5-minute intervals, and the result at the 6th step gave us the maximum
lead time of 30 minutes. Similarly, an evaluation of the prediction with a 60-minute lead time was
also conducted by setting α and β to 12. Finally, we could get 2,868 sample pairs from the training
dataset, and randomly selected 80% of them (2,294) as the training samples and 20% (574) as
the validation samples. The Adam algorithm was employed to control the overall training process,
where the batch size was set to 4 and the learning rate to 0.0001 for all deep learning models, except
that the learning rate of the CNN AutoEncoder was tuned to 0.001. The training algorithm was
stopped after 200 epochs and only the best model was saved. In addition, we used 500 as
the scaling factor for crowd density to scale the data down to relatively small values, and 100 as
the scaling factor for the crowd flow values. In the evaluation, we rescaled the predicted values back to
the normal values and compared them with the ground truth. The parameter settings were kept
the same for each event. Python and some Python libraries such as Keras [8] and TensorFlow [1]
were used in this study. The experiments were performed on a GPU server with four GeForce GTX
1080Ti graphics cards.
Baseline models: We implemented the following models as baselines for comparison.
(1) HistoricalAverage. Crowd density/flow for each timeslot was estimated by averaging the
corresponding values from the last 10 days.
(2) CopyYesterday. We directly used yesterday's value as the predicted value on event days.
(3) CopyLastFrame. We directly copied the last/latest observation as the predicted value, which
can be a simple but very effective method for event situations.
(4) ARIMA. It is a classical time-series prediction model designed for one-dimensional data. For
each mesh-grid, we built one ARIMA model for time-series density prediction.
However, for the flow tensor (80, 80, 225) at each timeslot, the dimension was simply too high
for ARIMA to handle.
(5) VectorAutoRegressive. It is an advanced time-series prediction model designed for high-dimensional
data. By flattening the density tensor (80, 80, 1) at each timeslot into a 6,400-dimensional
vector, the model could handle the crowd density prediction task. For the flow tensor
(80, 80, 225), the dimension was also too high for VAR to deal with.
(6) CityMomentum [10]. It was first proposed for momentary mobility prediction at the citywide
level for big events. We implemented it using a 500-meter mesh-grid and 5-minute
time interval, following exactly the same settings as our model. Although the model was
built from the perspective of an individual's mobility, the predicted/simulated trajectory of each
individual could be used to generate aggregated crowd density and flow, which makes it
comparable with our system.
(7) ST-ResNet [60]. This deep residual learning-based method shows state-of-the-art performance
on citywide crowd flow prediction. To compare its performance under the same problem
definition as ours, we adapted ST-ResNet to take in limited steps of the latest observations
on crowd density/flow and performed one-step-by-one-step autoregression to obtain multiple
steps of predictions. Here, we also found that 1-residual-unit ST-ResNet without external
features achieved the best performance in our event situations.
(8) CNN. It is a one-step predictor constructed with four Conv layers. Note that the 4D tensor
would be converted to a 3D tensor (Height, Width, Timestep ∗ Channel) by concatenating the
channels at each timestep just as [60] did, so that the CNN could take our 4D tensors as
inputs. The first three Conv layers used 32 filters with a 3 × 3 kernel window, and the final Conv
layer used a ReLU activation function to output a single step of the video frame.
(9) CNN Enc.-Dec. It is a multi-step predictor also constructed with four Conv layers. It shares
the same parameter settings as (8). The only difference is that the final Conv layer outputs a
3D tensor (Height, Width, Timestep ∗ Channel) as multiple steps of predictions.
(10) Multitask CNN Enc.-Dec. It has four Conv layers sharing a similar multitask architecture to
that illustrated in Figure 7, namely, separate input encoding Conv layers, shared encoding and
decoding layers, and separate output Conv layers. All the parameters were kept the same
as in (9).
Table 4. Performance Evaluation of 30 Minutes Ahead Prediction on Four Events
Model Earthquake Typhoon New Year Marathon
Density Flow Density Flow Density Flow Density Flow
HistoricalAverage 106.032 0.726 75.402 0.519 176.013 1.099 33.381 0.223
CopyYesterday 129.436 0.912 85.641 0.592 110.444 0.660 65.765 0.437
CopyLastFrame 7.824 0.116 9.756 0.186 5.498 0.079 6.496 0.107
ARIMA 10.430 NA 13.376 NA 8.343 NA 7.808 NA
VectorAutoRegressive 10.843 NA 13.377 NA 9.511 NA 9.380 NA
CityMomentum [10] 27.670 0.653 29.305 0.962 23.058 0.235 25.774 0.475
ST-ResNet [60] 6.542 0.113 7.802 0.183 4.544 0.080 5.548 0.103
CNN 8.698 0.178 10.245 0.196 6.178 0.083 6.614 0.100
CNN Enc.-Dec. 7.115 0.117 8.571 0.187 5.216 0.079 6.004 0.095
M.T. CNN Enc.-Dec. 6.802 0.119 8.226 0.197 5.158 0.084 5.953 0.097
ConvLSTM 6.737 0.124 7.959 0.195 4.679 0.077 5.675 0.094
ConvLSTM Enc.-Dec. 6.281 0.102 7.508 0.171 4.500 0.074 5.372 0.089
M.T. ConvLSTM Enc.-Dec. 5.549 0.102 6.753 0.170 4.117 0.074 5.012 0.086
(11) ConvLSTM (one-step-by-one-step) and (12) ConvLSTM Enc.-Dec. (multi-step-to-multi-step)
are the proposed comparison models constructed with four ConvLSTM layers in Section 5.
Each ConvLSTM layer uses 32 filters with a 3 × 3 kernel window, and ReLU activation is used
in the final layer. BatchNormalization was added between two consecutive CNN/ConvLSTM
layers for all the models. Note that for all of the crowd flow parts, as shown in Figure 6, the CNN
AutoEncoder is first applied to encode the original flow tensor and then decode the
(predicted) encoded flow back to the original format. Our final system is implemented using the
MultiTask ConvLSTM Enc.-Dec.
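The naive baselines (1) and (3) above are straightforward to express; the numpy sketch below is illustrative only (the function names are ours, and array shapes follow a (days, timeslots, height, width) convention assumed for this sketch).

```python
import numpy as np

def historical_average(history):
    """Baseline (1): history has shape (days, T, H, W); each timeslot is
    predicted as the mean of the corresponding values over the past days."""
    return history.mean(axis=0)

def copy_last_frame(observed, beta):
    """Baseline (3): repeat the latest observed frame beta times,
    turning shape (alpha, H, W) into (beta, H, W)."""
    return np.repeat(observed[-1:], beta, axis=0)
```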
7.2 Performance Evaluation
Evaluation metric: We evaluated the performance of the models with MSE as follows:

MSE = (1/n) Σ_{i=1}^{n} ||Ŷ_i − Y_i||²,

where n is the number of samples, and Y and Ŷ are the ground-truth and predicted values in 4D
tensor format, namely (Timestep, Height, Width, Channel). The density tensor and flow tensor
differ in the Channel dimension.
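The metric can be computed directly in numpy; this sketch averages over all tensor elements, one common normalization of the per-sample squared-error terms.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between ground-truth and predicted 4D tensors
    of shape (Timestep, Height, Width, Channel)."""
    return float(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))
```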
Overall performance: We compared the performance of the baseline models and our proposed
model, Multitask ConvLSTM Enc.-Dec., on the four events. The overall evaluation results are summarized
in Table 4 for 30-minutes-ahead prediction and Table 5 for 60-minutes-ahead prediction.
Both show that, across all four events: (1) our model performed better than the others;
and (2) all deep learning models had advantages over the existing methodologies (CityMomentum
and VAR). In particular, we could also observe: (1) the superiority of ConvLSTM to CNN
on video-like modeling tasks; (2) the advantage of the encoder-decoder architecture on multi-step
sequential prediction tasks; and (3) the effectiveness of multitask learning in enhancing the correlated
tasks.
Performance on density: We also verified the performance of our system with a time-series
evaluation over each event day, showing the ground-truth and predicted density for selected areas
(Tokyo Station Area and Shinjuku Station Area) in the city. Each area consists of 3 × 3 neighboring
mesh-grids, with Tokyo Station and Shinjuku Station located at the central mesh-grid
Table 5. Performance Evaluation of 60 Minutes Ahead Prediction on Four Events
Model Earthquake Typhoon New Year Marathon
Density Flow Density Flow Density Flow Density Flow
HistoricalAverage 104.604 0.731 75.927 0.529 175.344 1.102 33.422 0.225
CopyYesterday 128.133 0.920 86.069 0.601 106.991 0.645 65.725 0.440
CopyLastFrame 13.020 0.168 16.607 0.252 8.650 0.096 10.004 0.128
ARIMA 24.296 NA 32.933 NA 15.411 NA 15.259 NA
VectorAutoRegressive 22.355 NA 29.872 NA 21.072 NA 17.261 NA
CityMomentum [10] 32.034 0.570 35.090 0.821 25.867 0.207 28.825 0.400
ST-ResNet [60] 11.899 0.157 13.418 0.238 7.633 0.103 8.501 0.124
CNN 12.247 0.189 17.670 0.360 10.469 0.119 12.114 0.223
CNN Enc.-Dec. 11.372 0.164 13.876 0.245 8.311 0.097 9.127 0.119
M.T. CNN Enc.-Dec. 10.812 0.177 13.800 0.247 8.153 0.101 9.004 0.124
ConvLSTM 11.355 0.139 12.285 0.228 7.615 0.118 9.511 0.140
ConvLSTM Enc.-Dec. 9.309 0.122 11.186 0.197 6.885 0.086 7.843 0.103
M.T. ConvLSTM Enc.-Dec. 8.094 0.122 9.900 0.196 6.496 0.085 7.483 0.101
respectively. From Figure 8, we can straightforwardly confirm the effectiveness of our model for
60-minutes-ahead prediction and its high deployability for a real-world online event crowd management
system. Referring to the normal situation (the prediction result of HistoricalAverage) shown in
the figure, we can find that the densities on event days differ greatly from normal situations. Furthermore,
even comparing these four events, the density patterns are quite different from each other.
This further demonstrates that crowd management in event situations is really challenging
and that our online prediction system is indispensable for these special cases.
Performance on dynamic graph: Using the algorithm proposed in Section 6, we build a series of
dynamic crowd mobility graphs for the 3.11 Earthquake event, and demonstrate three snapshots of
the graph at 14:00, 15:00, and 16:00 (the earthquake occurred at 14:46 JST) in Figure 9. We generate
100 nodes and build the edge for each node pair based on the 6-step transition matrix Ω_{1:6}, which
indicates the crowd flow transition probability in the next 30 minutes. Through Figure 9, we
can see how the crowd dynamics gradually evolved during the earthquake. Both the ground
truth and the predictions show quite different details at 14:00, 15:00, and 16:00. Crowd
density prediction achieved very good performance, as shown previously, and the nodes
of the graph show a close resemblance between the ground truth and the predictions. However,
there still exists some gap between the ground truth and the prediction of the transition probabilities
on the edges. Underestimation can be observed in the figure, which leaves room for further
improvement on the high-dimensional crowd flow video.
Overall efficiency: The implementation was done with TensorFlow v1.10 and the efficiency test
was done on a GeForce GTX 1080Ti graphics card. Our proposed model (MultiTask ConvLSTM
Encoder-Decoder) has 271,224 parameters in total. We verify the overall efficiency of our model by
plotting the learning curves on validation loss in Figure 10. We can observe that, for 30-minute-lead-time
prediction, i.e., α/β = 6/6, it takes around 80 epochs for our model to converge
on the four datasets, while it takes 100 epochs on average to converge when
the lead time is extended to 60 minutes, i.e., α/β = 12/12. Each training epoch takes around 120 seconds,
and each batch takes around 52 milliseconds, which means that the deployed model can deliver
the prediction result in less than one second.
Hyperparameter study: We conduct hyperparameter studies for our proposed model (MultiTask
ConvLSTM Encoder-Decoder) on two hyperparameters: one is the observation/prediction step α/β,
Fig. 8. Visualization of the ground-truth crowd density, the prediction result of HistoricalAverage (seen as
the normal situation), and the prediction result of our model (MultiTask ConvLSTM Enc.-Dec.) at four events.
and the other is the multitask weight λ; the results are summarized in Figure 11. We can
observe that as α/β increases from 3 to 15, the MSE losses on both density and flow show a slow
and gradual increase on the four events, which is due to the limitation of LSTM in modeling overly long
sequences. We can also see that when we adjust λ from 0.1 to 0.9, the MSE loss on density gradually
decreases and the MSE loss on flow slowly increases, since the multitask weights for density and
flow are λ and 1 − λ respectively. This also verifies that the CNN AutoEncoder could encode the
flow part into a task relatively balanced with the density part. Thus, to balance the density
prediction and the flow prediction well, we set λ equal to 0.5 in our final system.
Fig. 9. Visualization of the ground-truth dynamic crowd mobility graph (top) and the predicted results (bottom)
at the 3.11 Earthquake from 14:00 to 16:00. Larger and darker nodes have higher crowd density, and
darker edges represent higher transition probability. The node number K is set to 100, and the edges
correspond to the six-step transition matrix Ω_{1:6}.
Fig. 10. Learning curves on validation loss.
8 CONCLUSION
This article is an extended journal version of our previous work [25]. In this study, we built a
data-driven intelligent system called DeepUrbanEvent to predict citywide crowd dynamics at big
events in a manner analogous to a video prediction task. We proposed to decompose crowd
dynamics into crowd density and crowd flow, and designed a Multitask ConvLSTM Encoder-Decoder
architecture to simultaneously predict multiple steps of crowd density and crowd flow
Fig. 11. Hyperparameter study: MSE on crowd density and flow prediction.
Fig. 12. Our prototype: Tokyo crowd dynamics system. The left part shows the real-time scalar values
of crowd density and flow at the citywide level with a bar chart, as well as for a selected region with a time-series
chart. The right part, through the 3D histogram and OD (Origin-Destination) flow chart,
simultaneously shows the multiple steps of crowd density and flow with different types of geospatial visualization.
The prediction result for the next step is highlighted with a large layout in the upper right. The Tokyo
Metropolitan Government will develop and deploy a real crowd dynamics system based on our prototype.
for the future. The experimental results based on four big real-world events demonstrated the
superior performance of our proposed model compared with the baseline methods. Our model
has been successfully deployed to monitor the crowd dynamics of the Greater Tokyo Area,
as demonstrated in Figure 12. The source code for these models has been released at https:
//github.com/deepkashiwa20/DeepUrbanEvent.
REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy
Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael
Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga,
Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar,
Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg,
Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous
systems. Retrieved from http://tensorflow.org/.
[2] Yasunori Akagi, Takuya Nishimura, Takeshi Kurashima, and Hiroyuki Toda. 2018. A fast and accurate method for
estimating people flow from spatiotemporal population data. In Proceedings of the 27th International Joint Conference
on Artificial Intelligence. 3293–3300.
[3] Lei Bai, Lina Yao, Salil S. Kanhere, Xianzhi Wang, and Quan Z. Sheng. 2019. Stg2seq: Spatial-temporal graph to
sequence model for multi-step passenger demand forecasting. In IJCAI. 1981–1987.
[4] Yoshua Bengio. 2009. Learning deep architectures for AI. Foundations and Trends® in Machine Learning 2, 1 (2009),
1–127.
[5] Pablo Samuel Castro, Daqing Zhang, and Shijian Li. 2012. Urban traffic modelling and prediction using large scale
taxi GPS traces. In Pervasive Computing. Springer, 57–72.
[6] Di Chai, Leye Wang, and Qiang Yang. 2018. Bike flow prediction with multi-graph convolutional networks. In Proceedings
of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. 397–400.
[7] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation.
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1724–1734.
[8] François Chollet. 2015. keras. Retrieved from https://github.com/fchollet/keras.
[9] Zipei Fan, Xuan Song, Renhe Jiang, Quanjun Chen, and Ryosuke Shibasaki. 2019. Decentralized attention-based
personalized human mobility prediction. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous
Technologies 3, 4 (2019), 1–26.
[10] Zipei Fan, Xuan Song, Ryosuke Shibasaki, and Ryutaro Adachi. 2015. CityMomentum: An online approach for crowd
behavior prediction at a citywide level. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and
Ubiquitous Computing. ACM, 559–569.
[11] Zipei Fan, Xuan Song, Tianqi Xia, Renhe Jiang, Ryosuke Shibasaki, and Ritsu Sakuramachi. 2018. Online deep
ensemble learning for predicting citywide human mobility. Proceedings of the ACM on Interactive, Mobile, Wearable and
Ubiquitous Technologies 2, 3 (2018), 1–21.
[12] Jie Feng, Yong Li, Chao Zhang, Funing Sun, Fanchao Meng, Ang Guo, and Depeng Jin. 2018. Deepmove: Predicting
human mobility with attentional recurrent networks. In Proceedings of the 2018 World Wide Web Conference. International
World Wide Web Conferences Steering Committee, 1459–1468.
[13] Qiang Gao, Fan Zhou, Goce Trajcevski, Kunpeng Zhang, Ting Zhong, and Fengli Zhang. 2019. Predicting human
mobility via variational attention. In Proceedings of the World Wide Web Conference. ACM, 2750–2756.
[14] Xu Geng, Yaguang Li, Leye Wang, Lingyu Zhang, Qiang Yang, Jieping Ye, and Yan Liu. 2019. Spatiotemporal multi-graph
convolution network for ride-hailing demand forecasting. In Proceedings of the 2019 AAAI Conference on Artificial
Intelligence.
[15] Fosca Giannotti, Mirco Nanni, Fabio Pinelli, and Dino Pedreschi. 2007. Trajectory pattern mining. In Proceedings of
the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 330–339.
[16] Shengnan Guo, Youfang Lin, Ning Feng, Chao Song, and Huaiyu Wan. 2019. Attention based spatial-temporal graph
convolutional networks for traffic flow forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence.
Vol. 33. 922–929.
[17] Michiel Hermans and Benjamin Schrauwen. 2013. Training and analysing deep recurrent neural networks. In Proceedings
of the Advances in Neural Information Processing Systems. 190–198.
[18] Minh X. Hoang, Yu Zheng, and Ambuj K. Singh. 2016. Forecasting citywide crowd flows based on big data. In Proceedings
of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems.
[19] T. Horanont, A. Witayangkurn, Y. Sekimoto, and R. Shibasaki. 2013. Large-scale auto-GPS analysis for discerning
behavior change during crisis. IEEE Intelligent Systems 28, 4 (2013), 26–34.
[20] Dou Huang, Xuan Song, Zipei Fan, Renhe Jiang, Ryosuke Shibasaki, Yu Zhang, Haizhong Wang, and Yugo Kato. 2019.
A variational autoencoder based generative model of urban human mobility. In Proceedings of the 2019 IEEE Conference
on Multimedia Information Processing and Retrieval. IEEE, 425–430.
[21] Wenhao Huang, Guojie Song, Haikun Hong, and Kunqing Xie. 2014. Deep architecture for traffic flow prediction:
Deep belief networks with multitask learning. IEEE Transactions on Intelligent Transportation Systems 15, 5 (2014),
2191–2201.
[22] Renhe Jiang, Xuan Song, Zipei Fan, Tianqi Xia, Quanjun Chen, Qi Chen, and Ryosuke Shibasaki. 2018. Deep ROI-
based modeling for urban human mobility prediction. Proceedings of the ACM on Interactive, Mobile, Wearable and
Ubiquitous Technologies 2, 1 (2018), 1–29.
[23] Renhe Jiang, Xuan Song, Zipei Fan, Tianqi Xia, Quanjun Chen, Satoshi Miyazawa, and Ryosuke Shibasaki. 2018.
DeepUrbanMomentum: An online deep-learning system for short-term urban mobility prediction. In Proceedings of the 32nd
AAAI Conference on Artificial Intelligence. 784–791.
[24] Renhe Jiang, Xuan Song, Zipei Fan, Tianqi Xia, Zhaonan Wang, Quanjun Chen, Zekun Cai, and Ryosuke Shibasaki.
2021. Transfer urban human mobility via POI embedding over multiple cities. ACM Transactions on Data Science 2, 1
(2021), 1–26.
[25] Renhe Jiang, Xuan Song, Dou Huang, Xiaoya Song, Tianqi Xia, Zekun Cai, Zhaonan Wang, Kyoung-Sook Kim, and Ryosuke Shibasaki. 2019. DeepUrbanEvent: A system for predicting citywide crowd dynamics at big events. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2114–2122.
[26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems. 1097–1105.
[27] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278–2324.
[28] Xiucheng Li, Gao Cong, Aixin Sun, and Yun Cheng. 2019. Learning travel time distributions with deep generative
model. In Proceedings of the World Wide Web Conference. ACM, 1017–1027.
[29] Yaguang Li, Kun Fu, Zheng Wang, Cyrus Shahabi, Jieping Ye, and Yan Liu. 2018. Multi-task representation learning for travel time estimation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1695–1704.
[30] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2018. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In Proceedings of the International Conference on Learning Representations.
[31] Yuxuan Liang, Songyu Ke, Junbo Zhang, Xiuwen Yi, and Yu Zheng. 2018. GeoMAN: Multi-level attention networks for geo-sensory time series prediction. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 3428–3434.
[32] Ziqian Lin, Jie Feng, Ziyang Lu, Yong Li, and Depeng Jin. 2019. DeepSTN+: Context-aware spatial-temporal neural network for crowd flow prediction in metropolis. In Proceedings of the AAAI Conference on Artificial Intelligence.
[33] Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan. 2016. Predicting the next location: A recurrent model with spatial and temporal contexts. In Proceedings of the 30th AAAI Conference on Artificial Intelligence.
[34] Yisheng Lv, Yanjie Duan, Wenwen Kang, Zhengxi Li, and Fei-Yue Wang. 2015. Traffic flow prediction with big data: A deep learning approach. IEEE Transactions on Intelligent Transportation Systems 16, 2 (2015), 865–873.
[35] Xiaolei Ma, Zhimin Tao, Yinhai Wang, Haiyang Yu, and Yunpeng Wang. 2015. Long short-term memory neural network for traffic speed prediction using remote microwave sensor data. Transportation Research Part C: Emerging Technologies 54 (2015), 187–197.
[36] Xiaolei Ma, Haiyang Yu, Yunpeng Wang, and Yinhai Wang. 2015. Large-scale transportation network congestion evolution prediction using deep learning theory. PLoS ONE 10, 3 (2015), e0119044.
[37] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. 2011. Multimodal deep
learning. In Proceedings of the 28th International Conference on Machine Learning. 689–696.
[38] Zheyi Pan, Yuxuan Liang, Weifeng Wang, Yong Yu, Yu Zheng, and Junbo Zhang. 2019. Urban traffic prediction from spatio-temporal data using deep meta learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1720–1730.
[39] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556 (2014).
[40] Chao Song, Youfang Lin, Shengnan Guo, and Huaiyu Wan. 2020. Spatial-temporal synchronous graph convolutional networks: A new framework for spatial-temporal network data forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 914–921.
[41] Xuan Song, Hiroshi Kanasugi, and Ryosuke Shibasaki. 2016. DeepTransport: Prediction and simulation of human mobility and transportation mode at a citywide level. In Proceedings of the 25th International Joint Conference on Artificial Intelligence. 2618–2624.
[42] Xuan Song, Quanshi Zhang, Yoshihide Sekimoto, Ryosuke Shibasaki, Nicholas Jing Yuan, and Xing Xie. 2015. A simulator of human emergency mobility following disasters: Knowledge transfer from big disaster data. In Proceedings of the 29th AAAI Conference on Artificial Intelligence.
[43] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Proceedings
of the Advances in Neural Information Processing Systems. 3104–3112.
[44] Yusuke Tanaka, Tomoharu Iwata, Takeshi Kurashima, Hiroyuki Toda, and Naonori Ueda. 2018. Estimating latent people flow without tracking individuals. In Proceedings of the 2018 International Joint Conference on Artificial Intelligence. 3556–3563.
[45] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11, Dec (2010), 3371–3408.
[46] Dong Wang, Wei Cao, Jian Li, and Jieping Ye. 2017. DeepSD: Supply-demand prediction for online car-hailing services
using deep neural networks. In Proceedings of the 2017 IEEE 33rd International Conference on Data Engineering. IEEE,
243–254.
[47] Dong Wang, Junbo Zhang, Wei Cao, Jian Li, and Yu Zheng. 2018. When will you arrive? Estimating travel time based on deep neural networks. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
[48] Senzhang Wang, Jiannong Cao, and Philip Yu. 2020. Deep learning for spatio-temporal data mining: A survey. IEEE
Transactions on Knowledge and Data Engineering (2020).
[49] Yuandong Wang, Hongzhi Yin, Hongxu Chen, Tianyu Wo, Jie Xu, and Kai Zheng. 2019. Origin-destination matrix
prediction via graph convolution: A new perspective of passenger demand modeling. In Proceedings of the 25th ACM
SIGKDD International Conference on Knowledge Discovery & Data Mining. 1227–1235.
[50] Zheng Wang, Kun Fu, and Jieping Ye. 2018. Learning to estimate the travel time. In Proceedings of the 24th ACM
SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 858–866.
[51] Ling-Yin Wei, Yu Zheng, and Wen-Chih Peng. 2012. Constructing popular routes from uncertain trajectories. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 195–203.
[52] Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. 2019. Graph WaveNet for deep spatial-temporal graph modeling. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. 1907–1913.
[53] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the Advances in Neural Information Processing Systems. 802–810.
[54] Yu Yang, Fan Zhang, and Desheng Zhang. 2018. SharedEdge: GPS-free fine-grained travel time estimation in state-level highway systems. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2, 1 (2018), 48.
[55] Huaxiu Yao, Xianfeng Tang, Hua Wei, Guanjie Zheng, and Zhenhui Li. 2019. Revisiting spatial-temporal similarity: A deep learning framework for traffic prediction. In Proceedings of the 2019 AAAI Conference on Artificial Intelligence.
[56] Huaxiu Yao, Fei Wu, Jintao Ke, Xianfeng Tang, Yitian Jia, Siyu Lu, Pinghua Gong, Jieping Ye, and Zhenhui Li. 2018. Deep multi-view spatial-temporal network for taxi demand prediction. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
[57] Junchen Ye, Leilei Sun, Bowen Du, Yanjie Fu, Xinran Tong, and Hui Xiong. 2019. Co-prediction of multiple transportation demands based on deep spatio-temporal neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 305–313.
[58] Bing Yu, Haoteng Yin, and Zhanxing Zhu. 2018. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. AAAI Press, 3634–3640.
[59] Zhuoning Yuan, Xun Zhou, and Tianbao Yang. 2018. Hetero-ConvLSTM: A deep learning approach to traffic accident prediction on heterogeneous spatio-temporal data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 984–992.
[60] Junbo Zhang, Yu Zheng, and Dekang Qi. 2017. Deep spatio-temporal residual networks for citywide crowd flows prediction. In Proceedings of the 31st AAAI Conference on Artificial Intelligence. 1655–1661.
[61] Junbo Zhang, Yu Zheng, Dekang Qi, Ruiyuan Li, and Xiuwen Yi. 2016. DNN-based prediction model for spatio-temporal data. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 92.
[62] Junbo Zhang, Yu Zheng, Junkai Sun, and Dekang Qi. 2019. Flow prediction in spatio-temporal networks based on
multitask deep learning. IEEE Transactions on Knowledge and Data Engineering 32, 3 (2019), 468–478.
[63] Ling Zhao, Yujiao Song, Chao Zhang, Yu Liu, Pu Wang, Tao Lin, Min Deng, and Haifeng Li. 2019. T-GCN: A temporal graph convolutional network for traffic prediction. IEEE Transactions on Intelligent Transportation Systems 21, 9 (2019), 3848–3858.
[64] Chuanpan Zheng, Xiaoliang Fan, Cheng Wang, and Jianzhong Qi. 2020. GMAN: A graph multi-attention network for traffic prediction. In Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 1234–1241.
[65] Ali Zonoozi, Jung-jae Kim, Xiao-Li Li, and Gao Cong. 2018. Periodic-CRN: A convolutional recurrent model for crowd density prediction with recurring periodic patterns. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 3732–3738.
Received October 2020; revised April 2021; accepted June 2021