Predicting Citywide Crowd Dynamics at Big Events: A Deep
Learning System
RENHE JIANG, The University of Tokyo, SUSTech-UTokyo Joint Research Center on Super Smart City,
Southern University of Science and Technology
ZEKUN CAI, ZHAONAN WANG, and CHUANG YANG, The University of Tokyo
ZIPEI FAN and QUANJUN CHEN, The University of Tokyo, SUSTech-UTokyo Joint Research Center
on Super Smart City, Southern University of Science and Technology
XUAN SONG, SUSTech-UTokyo Joint Research Center on Super Smart City, Southern University of
Science and Technology, The University of Tokyo
RYOSUKE SHIBASAKI, The University of Tokyo
Event crowd management has been a significant research topic with high social impact. When some big
events happen, such as an earthquake, typhoon, or national festival, crowd management becomes the first
priority for governments (e.g., police) and public service operators (e.g., subway/bus operators) to protect
people's safety or maintain the operation of public infrastructures. However, under such event situations,
human behavior becomes very different from daily routines, which makes the prediction of crowd dynamics
at big events highly challenging, especially at a citywide level. Therefore, in this study, we aim to
extract the "deep" trend only from the current momentary observations and generate an accurate prediction
for the trend in the near future, which is considered to be an effective way to deal with event situations.
Motivated by these observations, we build an online system called DeepUrbanEvent, which can iteratively take citywide
crowd dynamics from the current one hour as input and report the prediction results for the next one hour
as output. A novel deep learning architecture built with recurrent neural networks is designed to effectively
model these highly complex sequential data in an analogous manner to video prediction tasks. Experimental
results demonstrate the superior performance of our proposed methodology over the existing approaches. Lastly,
we apply our prototype system to multiple big real-world events and show that it is highly deployable as an
online crowd management system.
CCS Concepts: • Information systems → Information systems applications; • Human-centered computing → Ubiquitous and mobile computing; • Computing methodologies → Artificial intelligence;
Additional Key Words and Phrases: Crowd management, ubiquitous and mobile computing, deep learning,
application and system
This work was supported by Grant-in-Aid for Early-Career Scientists (20K19859) and Grant-in-Aid for Early-Career Scientists (20K19782) of the Japan Society for the Promotion of Science (JSPS).
Authors' addresses: R. Jiang, Z. Fan, and Q. Chen, The University of Tokyo, SUSTech-UTokyo Joint Research Center on Super Smart City, Southern University of Science and Technology; emails: jiangrh@csis.u-tokyo.ac.jp, {fanzipei, chen1990}@iis.u-tokyo.ac.jp; Z. Cai, Z. Wang, C. Yang, and R. Shibasaki, The University of Tokyo; emails: {caizekun, znwang, chuang.yang, shiba}@csis.u-tokyo.ac.jp; X. Song (corresponding author), SUSTech-UTokyo Joint Research Center on Super Smart City, Southern University of Science and Technology, The University of Tokyo; email: songxuan@csis.u-tokyo.ac.jp.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2022 Association for Computing Machinery.
2157-6904/2022/03-ART21 $15.00
https://doi.org/10.1145/3472300
ACM Transactions on Intelligent Systems and Technology, Vol. 13, No. 2, Article 21. Publication date: March 2022.
ACM Reference format:
Renhe Jiang, Zekun Cai, Zhaonan Wang, Chuang Yang, Zipei Fan, Quanjun Chen, Xuan Song, and Ryosuke
Shibasaki. 2022. Predicting Citywide Crowd Dynamics at Big Events: A Deep Learning System. ACM Trans.
Intell. Syst. Technol. 13, 2, Article 21 (March 2022), 24 pages.
https://doi.org/10.1145/3472300
1 INTRODUCTION
Event crowd management has been a significant research topic with high social impact. When some big events happen, such as an earthquake, typhoon, or national festival, crowd management becomes the first priority for governments (e.g., police) and public service operators (e.g., subway/bus operators) to protect people's safety or maintain the operation of public infrastructures. Especially in a large urban area such as Tokyo or Shanghai, the population density is very high, which naturally leads to high risks of accidents and emergency situations. Recall the tragedy on New Year's Eve in Shanghai: around 300,000 people gathered to celebrate the arrival of 2015 near Chen Yi Square on the Bund. The large crowd was not well controlled, and a stampede occurred in which 36 people died and 47 were injured. Meanwhile, artificial intelligence (AI) technology is rapidly developing and 5G mobile internet technology is forthcoming. Big human mobility data are being continuously generated from a variety of sources, some of which can be treated and utilized as streaming data for understanding and predicting crowd dynamics. All of this stimulates us to make new efforts on this social issue by using such streaming mobility data and advanced AI technologies. However, when big events or disasters happen, urban human mobility may change dramatically from normal situations; people's movements become almost uncorrelated with their daily routines. As shown in Figure 1, a big earthquake occurred at 14:46 JST on March 11, 2011. Citywide human mobility in the Tokyo area was greatly impacted because the transportation network was suddenly shut down by the earthquake. Due to this big event, an abnormal pattern of crowd density can be observed in both the Tokyo station area and the Shinjuku station area. All of this demonstrates that predicting crowd dynamics under event situations is of high social impact, but very challenging, especially at a citywide level.
To address this challenge, we aim to extract the "deep" trend only from the current momentary observations and generate an accurate prediction for the trend in the near future, which is considered to be an effective way to handle event situations [10]. We build an intelligent system called DeepUrbanEvent based on collected big human mobility data and a unique deep learning architecture. It is designed to be deployed as an online system for crowd management at big events, which can continuously take a limited number of steps of currently observed crowd dynamics as input and report multiple steps of prediction results for a short time period in the future as output. With such multi-step prediction, we can easily understand in more detail how the crowd dynamics are evolving.
Specically, in this study, citywide crowd dynamics are rst decomposed into two parts: crowd
density and crowd ow. By meshing a large urban area into ne-grained grids, they can both
be represented by a four-dimensional (4D) tensor (Timestep,Heht,Width,andChannel)
analogously to a short video, where Timestep represents the number of observation/prediction
steps, and Heiдht and Width are determined by mesh size. Crowd density/ow video represents
a time series of density/ow value for each mesh-grid, therefore Channel for density is equal to
1, whereas Channel for ow is equal to the size of ow kernel window η×η.Thestoredvalue
indicates how many people inside a central mesh-grid will transit to each of η×ηneighboring
mesh-grids in a given time interval. A Multitask ConvLSTM Encoder-Decoder architecture is
designed to simultaneously model these two kinds of high-dimensional sequential data to gain
Fig. 1. Citywide human mobility in Tokyo before and after the Great East Japan Earthquake (top). Crowd density in the Tokyo station area and Shinjuku station area (bottom).
concurrent enhancement. Based on this architecture, our system works as an online system that can continuously take limited steps of currently observed crowd density and crowd flow as input, and report multiple steps of prediction results as output for the future time period. Finally, we validate our system on four big real-world events that happened in the Tokyo area, namely, the 3.11 Japan Earthquake, Typhoon Roke (2011), New Year's Day (2012), and Tokyo Marathon (2011), and demonstrate its superior performance over baseline models. The overview of our system is shown in Figure 2. In summary, our work has the following key characteristics that make it unique:
• For predicting crowd dynamics at citywide-level big events, we build an online deployable system that needs only limited steps of current observations as input.
• Citywide crowd dynamics are decomposed into two kinds of artificial videos, namely, crowd density video and crowd flow video, and a Multitask ConvLSTM Encoder-Decoder is designed to simultaneously predict multiple steps of crowd density and flow for the future time period.
• Using the predicted crowd density and flow videos, we can further build a series of dynamic crowd mobility graphs to help conduct probabilistic reasoning about crowd movements during big events.
• We validate our system on four big real-world events with a big human mobility data source and verify it as a highly deployable prototype system.
The remainder of this article is organized as follows. In Section 2, we review related works. In Section 3, we provide a description of our data source. In Section 4, we give the problem definition of citywide crowd dynamics prediction. In Section 5, we illustrate the proposed deep sequential learning architecture. In Section 6, we explain how to build the dynamic crowd mobility graph. We present and discuss the results of the experimental evaluation in Section 7. In Section 8, we give our conclusions as well as future work.

Fig. 2. DeepUrbanEvent is designed as an effective real-world system for predicting citywide crowd dynamics at big events. Citywide crowd dynamics are first decomposed into two parts: crowd density and crowd flow. By meshing a large urban area into fine-grained grids, they can both be represented by a four-dimensional tensor (Timestep, Height, Width, Channel) analogously to a short video, where Timestep represents the number of observation/prediction steps, and Height and Width are determined by the mesh size. A crowd dynamics graph for management can be built based on the predicted density (nodes) and flow (edges).
2 RELATED WORK
Human mobility data have been widely studied in the field of data science and AI. A very comprehensive survey on recent deep spatiotemporal models is given in [48]. Here, we briefly recap the related works on citywide human mobility prediction. According to the modeling strategy, they can be divided into three categories: trajectory-based prediction, mesh-based prediction, and network-based prediction. The trajectory-based methods [9–13, 22, 23, 33, 42] directly model the trajectories as typical sequential data, whereas the mesh-based [18, 32, 46, 49, 55–57, 59–61, 65] and network-based ones [5, 14, 21, 28, 29, 34–36, 38, 41, 47, 50, 54, 58] aggregate or map the trajectories onto a mesh-grid or transportation network.
2.1 Trajectory-Based Mobility Prediction
Many trajectory-based deep learning models have been proposed to predict each individual's movement [12, 13, 33]. ST-RNN [33] extends a regular RNN by utilizing time- and distance-specific transition matrices to predict the next location. DeepMove [12], considered to be a state-of-the-art model for trajectory prediction, designed a historical attention module to capture periodicities and augment prediction accuracy. VANext [13] further enhanced DeepMove by proposing a novel variational attention mechanism. Besides, [9, 11, 22] modeled human mobility for a large population at a citywide level; [10, 23, 42] simulated and predicted human emergency mobility following disasters or events; [20] generated urban human mobility with a variational autoencoder; and [24] transferred urban human mobility via Points-Of-Interest among different cities. However, these models are built on millions of individuals' mobility traces. In our experiment, we implement CityMomentum [10] as a trajectory-based baseline.
2.2 Mesh-Based Mobility Prediction
Density (Demand) Prediction. Based on a mesh, predicting citywide crowd density from historical observations can be represented by a 4D tensor (Timestep, Height, Width, Channel), as demonstrated in Figure 3. Following the same formulation, Hetero-ConvLSTM [59] predicted traffic accidents using heterogeneous data sources; DeepSD [46], DMVST-Net [56], and Periodic-CRN [65] predicted taxi demand using taxi request datasets collected from car-hailing companies; and CoST-Net [57] predicted multiple transportation demands using both taxi and bike data.
In/Out Flow Prediction. Forecasting the citywide crowd flow based on a mesh was proposed and addressed by [18, 60, 61]. As illustrated by Figure 3, they define inflow and outflow to represent how many people will flow into or out of a certain mesh-grid. The prediction problem can be represented by a 4D tensor (Timestep, Height, Width, Channel = 2), where Channel stores the inflow and outflow. [18, 60, 61] build deep learning models using deep neural networks and convolutional neural networks (CNNs) to do the citywide prediction. Following the same problem definition, a series of deep learning models have been proposed to improve on the performance of ST-ResNet [60], including Periodic-CRN [65], STDN [55], and DeepSTN+ [32]. In our experiment, we implement ST-ResNet [60] as the mesh-based baseline. We adapt ST-ResNet to take in limited steps of the latest observations of crowd density/flow and perform one-step-by-one-step autoregression to obtain multiple steps of predictions.
Flow Prediction. Inflow and outflow can only indicate how many people will flow into or out of a certain mesh-grid, but cannot answer where the people come from or transit to. As illustrated by Figure 3, citywide crowd flow can depict how a crowd of people move among the entire set of mesh-grids. The problem can be represented by a 4D tensor (Timestep, Height, Width, Channel = Height × Width). MDL [62] also utilized multitask learning to simultaneously model and predict crowd in/out flow and crowd in-out transition. GEML [49] predicted the origin-destination transition matrix via a Graph Convolutional Network (GCN). Moreover, [2] conducts transition estimation from aggregated population data, and [44] estimates the transition populations using inflow and outflow. However, all these models, including MDL [62] and GEML [49], can only do single-step flow prediction rather than continuous multi-step prediction. Nor can these models give the crowd density prediction in a straightforward way, which is crucial for event crowd management. Our system aims to simultaneously model and predict citywide crowd density and crowd flow in a multi-step-to-multi-step manner to better facilitate event crowd management in the real world.

Fig. 3. Mesh-based citywide crowd prediction: density, in/out flow, and transition, analogous to video data prediction on a 4D tensor (Timestep, Height, Width, Channel).
2.3 Network-Based Mobility Prediction
Based on road-network-mapped trajectories, researchers have also applied deep learning to predict traffic flow [21, 34], traffic speed and congestion [35, 36], travel time [28, 29, 47, 50, 54], and human mobility with transportation mode [41]. On the other hand, based on road-network-aggregated time series data, GeoMAN [31] designed an LSTM plus attention mechanism as a general solution to geo-sensory time series prediction, and ST-MetaNet [38] employed meta-learning for traffic flow prediction. In particular, leveraging the latest GCN techniques, a series of models have been proposed to address traffic-related problems: ST-GCN [58] and DCRNN [30] first applied GCN to traffic flow forecasting; ASTGCN [16], GMAN [64], and STSGCN [40] are state-of-the-art models extended from ST-GCN; Graph WaveNet [52] combined GCN with dilated causal convolution, while T-GCN [63] combined GCN and a gated recurrent unit to demonstrate superior performance. Besides, [3, 6, 14] utilized GCN for ride-hailing and car-hailing demand prediction. However, these models did not directly address the citywide crowd density or crowd flow prediction task.
3 DATA SOURCE
"Konzatsu-Tokei (R)" from ZENRIN DataCom Co., Ltd. was used. It refers to people flow data constructed from individual location data sent from mobile phones with the AUTO-GPS function enabled, under the users' consent, through the "DoCoMo map navi" service provided by NTT DoCoMo, Inc. Those data are processed collectively and statistically in order to conceal private information. The original location data are Global Positioning System (GPS) data (latitude, longitude) sent at a minimum period of about 5 minutes, and do not include information (such as gender or age) that could specify individuals. However, the data acquisition is affected by several factors such as loss of signal or low battery power. In addition, when a mobile phone user stops at a location, the positioning function of his/her mobile phone is automatically turned off to save power. The proposed methodology is applied to raw GPS data from NTT DoCoMo, Inc. The raw GPS log dataset was collected anonymously from approximately 1.6 million mobile phone users in Japan over a three-year period (August 1, 2010, to July 31, 2013). Each record contains user ID, latitude, longitude, altitude, timestamp, and accuracy level. In this study, we select the Greater Tokyo Area (including Tokyo City, Kanagawa Prefecture, Chiba Prefecture, and Saitama Prefecture) as the target area. We obtain 145,507 users' trajectories in total, covering approximately 1% of the real-world population. This dataset is slightly biased towards young people because, compared with older people and children, young people are more likely to have a mobile phone with a localization function. To discover the relationship between our dataset and the real-world population and validate the representativeness of the dataset, a previous study [19] estimated the home location of each user in the dataset and compared it with census data on a 1 km mesh-grid, finding the linear relationship:

$N_{GPS} = 0.0063048 \, N_{census} + 0.73551, \quad R^2 = 0.79222,$

where $N_{GPS}$ is the population estimated from our GPS dataset, $N_{census}$ is the population given by the national census data, and $R^2$ is the coefficient of determination. This $R^2$ value demonstrates that our dataset is highly representative of the real-world population. Please refer to [19] for more details on the basic statistics of this dataset.

Table 1. Notation Description

Symbol        Description
M, g          mesh for an urban area; mesh-grid
Timestep      the number of observation/prediction steps
Height        number of mesh-grids along the latitude axis
Width         number of mesh-grids along the longitude axis
Channel       dimension to store the value for density/flow
Γ_iu          each user's (u) trajectory on each day (i)
d_tm          crowd density at timeslot t on mesh-grid g_m
f_tmw         crowd flow from g_m at t−1 to g_w at t
d_t, f_t      citywide density/flow at t
x_d, x_f      current α steps of density/flow w.r.t. t
y_d, y_f      next β steps of density/flow w.r.t. t
X             samples from all timeslots for the current
Y             samples from all timeslots for the next
ŷ             the predicted results
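As a quick sketch of how the fitted relation from [19] can be used in practice, it can be inverted to scale a GPS-observed user count in a 1 km mesh-grid up to a census-level population estimate (the function name and this usage are illustrative, not part of the original system):

```python
def census_scale(n_gps: float) -> float:
    """Invert the fitted relation N_GPS = 0.0063048 * N_census + 0.73551
    to estimate the real-world population of a 1 km mesh-grid from the
    number of GPS users observed in it (coefficients from [19])."""
    return (n_gps - 0.73551) / 0.0063048
```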
4 CITYWIDE CROWD DYNAMICS MODELING
Denition 1 (Calibrated Human Trajectory Database). Human trajectory is stored and indexed
by day (i) and userid (u) in the trajectory database Γ. Given a mesh Mof an area, namely a set
of mesh-grids {д1,д2,...,дm,...,дHeiдhtWidth}, and a time interval Δt, each user’s trajectory on
each day Γiu is mapped onto mesh-grids and then calibrated to obtain constant sampling rate as
follows:
Γiu =(t1,д1),...,(tk,дk)∧∀j[1,k),|tj+1tj|=Δt,
which means that the time interval between any two consecutive timeslots is calibrated into Δt.
For simplicity, from now on we only consider a one-day slice of the trajectory database Γ, then the
day index (i) can be omitted when referring to Γ.Table1above summarizes the notations used in
this article in a comprehensible way.
Denition 2 (Crowd Density). Given Γ,M, crowd density at timeslot ton mesh-grid дmis dened
as follows:
dtm =|{u|Γ
u.дt=дm}|,
which intuitively indicates how many people inside дmat t.
Denition 3 (Crowd Flow). To capture the crowd ow starting from a certain mesh-grid, we
utilize a kernel window denoted as η×ηw.r.t дm, which represents a square area made up ofη×η
neighboring mesh-grids with дmas the centroid mesh-grid. Given Γ,M, and a kernel window η×η
w.r.t ea ch д, crowd ow at timeslot ton mesh-grid дmis dened as follows:
ftmw =|{u|Γ
u.дt1=дmΓ
u.дt=дw}|,
which intuitively indicates how many people transit from mesh-grid дmat timeslot t-1toaneigh-
boring mesh grid дwinside a kernel window at timeslott. After calculating the crowd density/ow
for each mesh-grid over the entire mesh, citywide crowd density/ow can be obtained for each
timeslot.
Denition 4 (Crowd Density/ow Video). As the mesh is represented in a two-dimensional format,
a crowd density/ow video containing a couple of consecutive frames can be represented by a 4D
tensor RTimestep×Heiдht×Width×Channel,whereTimestep represents the number of video frames,
Channel for density is equal to 1, and Channel for ow is equal to the size of the given kernel
window namely η2. An illustration for crowd density/ow video has been shown in Figure 2.
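Definitions 2–4 can be sketched in code. The snippet below builds the density and flow videos from calibrated trajectories, assuming a hypothetical input layout (one grid cell per timeslot per user); it is a minimal illustration of the definitions, not the production pipeline:

```python
import numpy as np

def crowd_videos(trajs, H, W, T, eta=3):
    """Build crowd density/flow videos from calibrated trajectories.
    trajs: {user_id: [(row, col), ...]} with one mesh-grid per timeslot
    (a hypothetical layout for illustration; see Definitions 1-4).
    Returns density (T, H, W, 1) and flow (T, H, W, eta*eta)."""
    r = eta // 2
    density = np.zeros((T, H, W, 1), dtype=np.int32)
    flow = np.zeros((T, H, W, eta * eta), dtype=np.int32)
    for cells in trajs.values():
        for t, (i, j) in enumerate(cells):
            density[t, i, j, 0] += 1            # d_tm: people inside g_m at t
            if t == 0:
                continue
            pi, pj = cells[t - 1]               # g_m at t-1 (source cell)
            di, dj = i - pi, j - pj             # offset to g_w at t
            if abs(di) <= r and abs(dj) <= r:   # inside the eta x eta kernel
                # flow channel stored at the central (source) mesh-grid
                flow[t, pi, pj, (di + r) * eta + (dj + r)] += 1
    return density, flow
```

For the real system, η = 15, so the flow tensor carries 225 channels per grid, which is exactly what motivates the CNN AutoEncoder in Section 5.2.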
Denition 5 (Crowd Density/ow Video Prediction). Given currently observed a-step crowd den-
sity/ow video xd=dt(α1),...,dt,xf=ft(α1),...,ftat timeslot t, prediction for the next
β-step density/ow video ˆ
yd=ˆ
dt+1,..., ˆ
dt+β,ˆ
yf=ˆ
ft+1,..., ˆ
ft+βis modeled as follows:
ˆ
yd=ˆ
dt+1,ˆ
dt+2,..., ˆ
dt+β
=argmax
dt+1,dt+2,...,dt+β
P(dt+1,dt+2,...,dt+β|dt(α1),...,dt),
ˆ
yf=ˆ
ft+1,ˆ
ft+2,..., ˆ
ft+β
=argmax
ft+1,ft+2,...,ft+β
P(ft+1,ft+2,...,ft+β|ft(α1),...,ft).
Denition 6 (Citywide Crowd Dynamics Prediction). Given currently observed a-step crowd den-
sity/ow video, citywide crowd dynamics prediction aims to simultaneously generate next β-step
density/ow video, which is modeled as follows:
ˆ
yd,ˆ
yf=argmax
yd,yf
P(yd,yf|xd,xf).
Moreover, by jointly modeling these two highly correlated tasks, concurrent enhancement of both can be expected. It should be noted that the crowd density video and crowd flow video are together proposed here as a new concept called crowd dynamics, which aims not only to reflect the crowd density of each mesh-grid but also to depict how a crowd of people move among the mesh-grids. Figure 4 illustrates the overall problem definition described above.
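As a minimal sketch of how the X/Y samples of Table 1 can be generated from one long crowd video, the helper below slices aligned α-step input and β-step target windows (the default window lengths and the function name are assumptions for illustration):

```python
import numpy as np

def make_samples(video, alpha=6, beta=6):
    """Slice one long crowd video (T, H, W, C) into aligned (X, Y)
    training pairs: alpha observed steps -> beta future steps
    (the X/Y samples over all timeslots described in Table 1)."""
    xs, ys = [], []
    for t in range(alpha, video.shape[0] - beta + 1):
        xs.append(video[t - alpha:t])   # x: current alpha steps
        ys.append(video[t:t + beta])    # y: next beta steps
    return np.stack(xs), np.stack(ys)
```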
5 DEEP SEQUENTIAL LEARNING ARCHITECTURE
As shown above, the citywide crowd dynamics problem has been defined in an analogous manner to a video prediction task. However, citywide crowd dynamics are a highly complex phenomenon, especially when big events happen, which makes it very difficult to handle such high-dimensional sequential data with classical methodologies. This naturally motivates us to employ the most advanced deep video learning models as the basic components of our system.

Fig. 4. Citywide crowd dynamics prediction.
Convolutional LSTM. ConvLSTM [53] was proposed to build an end-to-end trainable model for the precipitation nowcasting problem. It extends the fully connected LSTM to have convolutional structures in both the input-to-state and state-to-state transitions and achieved new success on video modeling tasks. Thus, ConvLSTM is utilized as the core component of our system for the density and flow video prediction tasks. As shown in Figure 2, a ConvLSTM has three gates, comprising an input gate i, an output gate o, and a forget gate f, just as in an ordinary LSTM. The hidden state h_t in a ConvLSTM is calculated iteratively from t = 1 to T for an input sequence of frames (x_1, x_2, ..., x_T) as follows:

$i_t = \sigma(W_{xi} * x_t + W_{hi} * h_{t-1} + W_{ci} \circ c_{t-1} + b_i),$
$f_t = \sigma(W_{xf} * x_t + W_{hf} * h_{t-1} + W_{cf} \circ c_{t-1} + b_f),$
$c_t = f_t \circ c_{t-1} + i_t \circ \tanh(W_{xc} * x_t + W_{hc} * h_{t-1} + b_c),$
$o_t = \sigma(W_{xo} * x_t + W_{ho} * h_{t-1} + W_{co} \circ c_t + b_o),$
$h_t = o_t \circ \tanh(c_t),$

where W denotes the weights, b the bias vectors, * the convolution operator, and ∘ the Hadamard product. All of these weight parameters are determined by applying the standard "backpropagation through time" algorithm, which starts by unfolding the recurrent neural network through time and then generalizes the backpropagation for feed-forward networks to minimize the defined loss function, which is the Mean Squared Error (MSE) for our problem. The full details of the algorithm are omitted here.
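The gate equations above can be illustrated with a minimal single-channel NumPy sketch (the naive convolution and scalar peephole weights are simplifications for clarity; a real implementation would use a deep learning framework with multi-channel kernels):

```python
import numpy as np

def conv2d(x, w):
    """'Same'-padded 2D convolution for one single-channel map
    (naive loops; enough to illustrate the gate equations, not fast)."""
    k = w.shape[0]
    r = k // 2
    xp = np.pad(x, r)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, p):
    """One ConvLSTM update (Shi et al. [53]); p holds the convolutional
    weights W_x*, W_h*, the Hadamard peephole weights W_c*, and biases b_*."""
    i = sigmoid(conv2d(x, p['Wxi']) + conv2d(h, p['Whi']) + p['Wci'] * c + p['bi'])
    f = sigmoid(conv2d(x, p['Wxf']) + conv2d(h, p['Whf']) + p['Wcf'] * c + p['bf'])
    c_new = f * c + i * np.tanh(conv2d(x, p['Wxc']) + conv2d(h, p['Whc']) + p['bc'])
    o = sigmoid(conv2d(x, p['Wxo']) + conv2d(h, p['Who']) + p['Wco'] * c_new + p['bo'])
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```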
5.1 Stacked ConvLSTM Architecture
As a comparative modeling approach, we would like to verify how our system performs if we use a one-step-by-one-step prediction model and obtain multiple steps of predictions by iterating it in an autoregressive manner. The one-step-by-one-step crowd density prediction model can be defined as follows:

$\hat{d}_{t+1} = \arg\max_{d_{t+1}} P(d_{t+1} \mid d_{t-(\alpha-1)}, \ldots, d_{t-1}, d_t).$

Once the one-step model is trained, we can run it β times to obtain $\hat{d}_{t+1}, \hat{d}_{t+2}, \ldots, \hat{d}_{t+\beta}$. Crowd flow prediction can be modeled in a similar way, which is omitted for simplicity.

Fig. 5. Stacked ConvLSTM for one-step prediction.
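The autoregressive iteration just described can be sketched as follows, where one_step_model stands for any trained one-step predictor (a generic callable; the name is illustrative):

```python
import numpy as np

def rollout(one_step_model, x_obs, beta):
    """Obtain beta future frames from a one-step predictor by
    autoregression: each predicted frame is appended to the input
    window and fed back in. Errors accumulate step by step, which is
    the limitation the multi-step encoder-decoder later addresses."""
    window = list(x_obs)          # the alpha currently observed frames
    preds = []
    for _ in range(beta):
        nxt = one_step_model(np.stack(window))
        preds.append(nxt)
        window = window[1:] + [nxt]   # slide the alpha-step window forward
    return np.stack(preds)
```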
Moreover, the use of multiple stacked layers of neural networks can be considered to boost performance in difficult time-series modeling tasks, according to [17]. Thus, a deep architecture constructed with multiple stacked ConvLSTM layers is shown in Figure 5 for one-step prediction. It has strong representational power, which makes it suitable for giving predictions for complex phenomena like citywide crowd dynamics. Note that the same network architecture can also be applied to one-step crowd flow prediction, but a special AutoEncoder component is first necessary due to the uniqueness of the crowd flow video, which will be explained in the following.
5.2 CNN AutoEncoder for Crowd Flow
Crowd density and flow videos are both represented as 4D tensors $\mathbb{R}^{Timestep \times Height \times Width \times Channel}$; however, Channel for flow is much larger than that for density. In our system, each grid is set to 500 m × 500 m. Taking into account all the possible transportation modes such as WALK, BUS, CAR, and TRAIN, the transition distance from one grid-cell to a neighboring one can be up to 4 km within a 5-minute time interval (approximately 48 km/h at most). Thus, the kernel window needs to be 15 × 15 to capture all the possible crowd flow within 5 minutes. Channel for flow is then equal to 225, which is simply too large for most state-of-the-art video learning models to handle. Thus, we build a special CNN AutoEncoder [4, 45] to obtain a low-dimensional representation of Channel for the crowd flow.
Compared with traditional neural networks, CNNs were designed specifically for analyzing visual imagery [27], where the neurons in a layer are connected only to a small region of the previous layer instead of to all of the neurons in a fully connected manner. CNNs are the state-of-the-art method for image recognition and classification tasks [26, 39]. For a typical CNN layer, the convolutional feature value at location (i, j) in the k-th feature map, $c_{i,j,k}$, is calculated as:

$c_{i,j,k} = \mathrm{ReLU}(w_k \cdot x_{i,j} + b_k),$

where $w_k$ and $b_k$ are the weight and bias of the k-th filter, respectively, and $x_{i,j}$ is the input patch centered at location (i, j). ReLU is often used as the activation function.

Fig. 6. CNN AutoEncoder for crowd flow.
The details of the special CNN AutoEncoder are presented in Figure 6. Note that we aim to model the crowd dynamics at a citywide level by taking the entire city "image" as the computation unit rather than each grid "pixel". Therefore, the original crowd flow image of the city is represented with a three-dimensional (3D) tensor (80, 80, 225). An encoder is constructed with three convolutional layers to encode the image into a small 3D tensor (80, 80, 4), and then a decoder is constructed with three convolutional layers to decode the compressed tensor back to the original 3D tensor (80, 80, 225). The end-to-end model parameters can be optimized by minimizing the reconstruction error between the original flow image and the decoded flow image. In our system, we aim to obtain a compressed Channel (from 225 to 4) while keeping the spatial structural information of the citywide crowd flow image at (80, 80). Thus, only convolutional layers with a 1 × 1 kernel window are utilized. At the last layer of the encoder, a unique capped ReLU(MAX = 1.0) function is utilized to ensure that the values are all scaled into [0, 1], which keeps the value range of crowd flow approximately the same as the value range of crowd density. Without this, the multitask learning mechanism introduced in the following could not function well.
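As a rough NumPy sketch (toy shapes and random weights, not the actual Keras implementation), a 1 × 1 convolution is simply a per-pixel linear map over channels, and the capped ReLU(MAX = 1.0) amounts to clipping:

```python
import numpy as np

def conv1x1(x, w, b, cap=None):
    """Apply a 1x1 convolution to an (H, W, C_in) tensor.

    A 1x1 kernel is a per-pixel linear map over channels, so it
    compresses the channel dimension while leaving the (H, W)
    spatial structure untouched. `cap` optionally clips the ReLU
    output, mimicking ReLU(MAX = cap)."""
    y = np.einsum('hwc,co->hwo', x, w) + b  # pointwise linear map
    y = np.maximum(y, 0.0)                  # ReLU
    if cap is not None:
        y = np.minimum(y, cap)              # cap the activations
    return y

# Toy stand-in for the encoder's last layer: 8 channels -> 2 on a
# 4x4 "city image" (the real system compresses 225 -> 4 on 80x80).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=(4, 4, 8))
w = rng.normal(size=(8, 2))
b = np.zeros(2)
z = conv1x1(x, w, b, cap=1.0)  # all values end up in [0, 1]
```

Stacking three such layers gives the encoder; the decoder mirrors them back to 225 channels.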
5.3 Multitask ConvLSTM Encoder and Decoder
Fig. 7. Multitask ConvLSTM encoder-decoder for simultaneous multi-step prediction of crowd density and crowd flow.

With such a CNN AutoEncoder, citywide crowd flow can be modeled and computed with the same architecture for crowd density shown in Figure 5. The prediction can be performed in an iterative one-by-one manner, but one major limitation of this model lies in predicting relatively long short-term crowd dynamics. As the iteration goes on, the accumulated iteration error becomes large, which can result in poor performance on the last several predicted steps. To tackle this problem, we improve the one-step-by-one-step modeling with multi-step-to-multi-step modeling (Definition 5), aiming at better performance on "long" short-term predictions. To deliver this idea, a sequential encoder and decoder architecture [7, 43] has been built with four ConvLSTM layers in this study. It works in the following steps: (1) the first two hidden layers of ConvLSTM (the encoder) map the α steps of the inputted crowd density or flow video into a single latent vector, which contains information about the entire video sequence; (2) this latent vector is repeated β times to form a constant sequence; and (3) the other two hidden layers of ConvLSTM (the decoder) turn this constant sequence into the β steps of the output video sequence. A batch normalization layer is added between two consecutive ConvLSTM layers. ReLU is used as the activation function in the final decoding layer. The ConvLSTM Enc.-Dec. model can be separately trained by minimizing the prediction errors L(θ_d) and L(θ_f) for the crowd density video Y_D and crowd flow video Y_F, described as follows:
$$L(\theta_d) = \|\hat{Y}_D - Y_D\|^2, \qquad L(\theta_f) = \|\hat{Y}_F - Y_F\|^2.$$
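The three steps above can be sketched shape-wise in NumPy, with a plain mean and a random linear map standing in for the ConvLSTM encoder and decoder (toy sizes, purely illustrative):

```python
import numpy as np

# (1) encode alpha input frames into a single latent vector,
# (2) repeat the latent beta times, (3) decode each copy into an
# output frame. Linear maps stand in for the ConvLSTM layers.
rng = np.random.default_rng(1)
alpha, beta, H, W = 6, 6, 4, 4

frames = rng.uniform(size=(alpha, H, W))          # input "video"
latent = frames.reshape(alpha, -1).mean(axis=0)   # (1) single latent vector
repeated = np.tile(latent, (beta, 1))             # (2) constant sequence
W_dec = rng.normal(size=(H * W, H * W))
outputs = (repeated @ W_dec).reshape(beta, H, W)  # (3) beta output frames
```

The repeat step is what decouples the output length β from the input length α.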
The crowd density video and crowd flow video share important information and are highly correlated with each other. The insights behind this are twofold: (1) people flows tend to follow the trend, especially under event/emergency situations, which may make crowded places attract more and more people; and (2) higher inflow will lead to higher density for each grid-cell, and similarly, higher outflow will reduce the crowd density. Moreover, as mentioned above, a crowd management system needs to predict not only the crowd density but also the crowd flow for each grid. Thus, we jointly model these two highly correlated tasks as defined in Definition 6, and propose a Multitask ConvLSTM Encoder and Decoder architecture as shown in Figure 7. The key concept
of multitask learning [37] is to learn multiple tasks simultaneously with the aim of gaining mutual benefits; thus, learning performance can be improved through parallel learning while using a shared latent representation. Therefore, it is reasonable to expect better performance from this learning framework for our system. Our Multitask ConvLSTM Encoder and Decoder architecture first takes X_D and X_F as two separate inputs. The separate input encoders first encode the crowd density and crowd flow videos, respectively. Then, the shared encoder maps the encoded crowd density and flow into a joint latent representation z_α, which can be taken as the auto-extracted features for the entire crowd dynamics; z_α is then repeated β times and passed to the shared decoder, and finally the output decoders give the multiple steps of prediction results for the crowd density video Ŷ_D and flow video Ŷ_F, respectively. The entire model can be trained by minimizing the total prediction error L(θ) of crowd density and flow, described as follows:
$$L(\theta) = \lambda\,\|\hat{Y}_D - Y_D\|^2 + (1 - \lambda)\,\|\hat{Y}_F - Y_F\|^2,$$
where λ is set to 0.5 so that the two parts of the loss contribute equally. Note that the CNN AutoEncoder still needs to be applied first to the original crowd flow.
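The joint objective can be written in a few lines (a NumPy toy with sums of squared errors standing in for the tensor norms):

```python
import numpy as np

def multitask_loss(pred_d, true_d, pred_f, true_f, lam=0.5):
    """Weighted sum of the density and flow squared errors.

    With lam = 0.5 both tasks contribute equally, which is why the
    flow is first compressed by the CNN AutoEncoder into the same
    [0, 1] value range as the density."""
    loss_d = np.sum((pred_d - true_d) ** 2)  # density term
    loss_f = np.sum((pred_f - true_f) ** 2)  # flow term
    return lam * loss_d + (1.0 - lam) * loss_f

# Toy example: density error of 3, flow error of 0, lam = 0.5.
val = multitask_loss(np.ones(3), np.zeros(3),
                     np.zeros(3), np.zeros(3), lam=0.5)
```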
6 DYNAMIC CROWD MOBILITY GRAPH
So far, multiple steps of citywide crowd density and crowd flow can be modeled and predicted simultaneously as crowd dynamics. At this stage, citywide crowd dynamics are still represented with mesh-grids. In order to help conduct probabilistic reasoning of crowd movements during big event situations, we build a series of dynamic crowd mobility graphs using the predicted crowd dynamics. By using a graph instead of a mesh, a high-level representation of citywide crowd dynamics can be obtained, which can be more easily and efficiently used for citywide-level event crowd management. Moreover, if severe disasters happen, some parts of the real transportation network might be damaged; therefore, we need to build a virtual transportation network to replace or work together with the real-world one. Compared with the static transportation network, our proposed graph is a dynamic one because: (1) one crowd mobility graph corresponds to one frame of the crowd dynamics video; our system can report multiple steps of prediction results, thus we can build a series of graphs, one for each timeslot; and (2) our graph can be updated every 5 minutes by our online updating system.
Specically, given one frame of predicted crowd density video, we view the centroid of each
mesh-grid as a point, the density of the mesh-grid as the weight, then apply weighted KMeans
clustering on all the weighted points to get clusters of mesh-grids. Each cluster can be taken as
one Region-of-Interest (RoI), as well as one node of the mobility graph. Given one frame of
predicted crowd ow video, it is easy to build a transition matrix Ωbetween each mesh-grid pair.
By summing up the total transition number between each node pair, the edges of the mobility
graph can be generated. The details of this process are summarized in Algorithm 1.Notethat
other clustering or RoI construction algorithms (e.g., T-Pattern [15], MeanShift, or PopularRoutes
[51]) may also be used here, however, KMeans is adopted in our system because it can be easily
tuned using the only one parameter K.
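The node construction can be sketched with a minimal weighted k-means in NumPy (random initialization and a fixed iteration count; the system's actual clustering routine may differ):

```python
import numpy as np

def weighted_kmeans(points, weights, k, iters=20, seed=0):
    """Weighted k-means over grid-cell centroids: cluster centers are
    weight-averaged, so dense cells pull RoI centers toward crowded
    areas. Returns the cluster (node) label of each point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)          # assign each cell to a center
        for j in range(k):
            m = labels == j
            if m.any():                    # weighted centroid update
                centers[j] = np.average(points[m], axis=0, weights=weights[m])
    return labels

# Toy mesh: two well-separated groups of grid centroids -> two RoI nodes.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
wts = np.array([10.0, 1.0, 1.0, 10.0, 1.0, 1.0])  # densities as weights
labels = weighted_kmeans(pts, wts, k=2)
```

Each resulting label set is one RoI node; its weighted centroid is the node position.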
Given a β-step crowd flow video f_1, f_2, ..., f_β, transition matrices Ω_1, Ω_2, ..., Ω_β can be calculated using each frame of the crowd video. Each Ω_t essentially represents the transition probability between each mesh-grid pair in timeslot t, namely Ω_t[i][j] = P(g_i → g_j) after normalization. Each Ω_t in our final system is just a transition matrix within 5 minutes, which is not sufficient for relatively long short-term probabilistic reasoning. To address this issue, we view each Ω_t as a first-order probability and leverage it to get higher-order transition probabilities. Ω_1 × Ω_2 contains all of the second-order transition probabilities; thus, Ω_{1:2} = Ω_1 + Ω_1 × Ω_2 after normalization corresponds to the 2-step transition probability, which is the likelihood of a transition from g_i to g_j within two steps, namely 5 × 2 minutes. Analogously, we can get the β-step transition probability by calculating the matrix Ω_{1:β} = Ω_1 + Ω_1 × Ω_2 + ··· + Ω_1 × Ω_2 × ··· × Ω_β with normalization. Ω_{1:β}, built with multiple steps of crowd flow, can be used to replace the simple first-order transition matrix Ω in Algorithm 1 (EdgeConstruction function); the crowd mobility graph can then contain the transition information within a longer time interval, i.e., 5 × β minutes.

ALGORITHM 1: Dynamic Crowd Mobility Graph Building
Input: Citywide crowd density and flow y_d, y_f, a mesh M, node number K.
Output: Dynamic crowd mobility graph Ψ
    Ψ ← ∅                                    // dynamic crowd mobility graph
    for each t ∈ [t_1, ..., t_β] do
        d_t, f_t ← retrieve density and flow at t from y_d, y_f
        V ← NodeConstruction(d_t, M, K)
        E ← EdgeConstruction(f_t, M, V)
        Ψ ← Ψ ∪ (V, E)
    return Ψ

    Function NodeConstruction(d, M, K):
        C, W ← ∅, ∅
        for each g_m ∈ M do
            C ← C ∪ g_m.centroid;  W ← W ∪ d_m
        V ← WeightedKMeansClustering(C, W, K)
        return V

    Function EdgeConstruction(f, M, V):
        Ω[|M|][|M|] ← build the grid transition matrix using f
        E[|V|][|V|] ← initialize the node transition matrix with 0
        for each pair (g_p, g_q) ∈ (M, M) do
            v ← the node in V that g_p belongs to
            v′ ← the node in V that g_q belongs to
            E[v][v′] ← E[v][v′] + Ω[p][q]
        return E

Table 2. Event Information

Event            Abbr.       Date        Training dates
3.11 Earthquake  Earthquake  2011/03/11  2011/03/01–2011/03/10
Typhoon Roke     Typhoon     2011/09/21  2011/09/11–2011/09/20
New Year's Day   New Year    2012/01/01  2011/12/22–2011/12/31
Tokyo Marathon   Marathon    2011/02/27  2011/02/17–2011/02/26
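The higher-order composition of Ω_{1:β} described in this section can be sketched in NumPy (toy 3-cell transition matrices; `normalize_rows` is a helper introduced here, not from the paper):

```python
import numpy as np

def normalize_rows(m):
    """Row-normalize so each row is again a probability distribution."""
    s = m.sum(axis=1, keepdims=True)
    return np.divide(m, s, out=np.zeros_like(m), where=s > 0)

def multi_step_transition(omegas):
    """Omega_{1:beta} = O1 + O1@O2 + ... + O1@O2@...@Obeta, normalized."""
    acc = None
    prod = np.eye(omegas[0].shape[0])
    for om in omegas:
        prod = prod @ om                       # O1 @ ... @ Ot
        acc = prod.copy() if acc is None else acc + prod
    return normalize_rows(acc)

# Two toy 5-minute transition matrices over 3 grid cells.
o1 = normalize_rows(np.array([[1.0, 1.0, 0.0],
                              [0.0, 1.0, 1.0],
                              [1.0, 0.0, 1.0]]))
o2 = normalize_rows(np.array([[1.0, 0.0, 1.0],
                              [1.0, 1.0, 0.0],
                              [0.0, 1.0, 1.0]]))
o12 = multi_step_transition([o1, o2])  # 2-step (10-minute) probabilities
```

Because a product of row-stochastic matrices is row-stochastic, the final normalization simply divides each row by the number of summed terms.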
7 EXPERIMENT
In this section, we present the results of extensive experiments and compare the performance of our model with other baseline models.
7.1 Settings
Experimental Setup: We selected the Greater Tokyo Area (Lon. ∈ [139.50, 139.90], Lat. ∈ [35.50, 35.82]) as our target urban area. Four citywide-level events that happened in this area were
Table 3. Summary of Tuned Parameters

Parameter                            Tuned value
Height, Width                        80, 80
Time intervals of one day            5 minutes × 288
Train/test timeslots                 2,880/288
Kernel window η × η                  15 × 15
(1) Timestep α/β (max lead time)     6/6 (30 minutes)
(2) Timestep α/β (max lead time)     12/12 (60 minutes)
Optimizer                            Adam
Epoch                                200
Learning rate                        0.0001
Batch size                           4
Scaling factor for density/flow      500/100
selected as the testing events, as summarized in Table 2: (1) 3.11 Earthquake (2011/03/11), a magnitude 9.0–9.1 earthquake off the coast of Japan that occurred at 14:46 JST, which caused a great impact on people's behaviors in the Greater Tokyo Area. (2) Typhoon Roke (2011/09/21), recorded as one of the strongest typhoons in Japan's history, which made subway operators shut down part of their services. (3) New Year's Day (2012/01/01). There are a number of New Year celebrations in the Tokyo area; in particular, for "Hatsumode" (the first visit to a Buddhist temple or shrine), most of the railway lines operate overnight on New Year's Eve. (4) Tokyo Marathon (2011/02/27). The number of people attending this event was 2.16 million (1.53 million people along the road and 0.63 million visitors to the Tokyo Marathon Festival). Also, traffic regulation was strictly conducted along the Marathon route. These four event days were used as the testing dates, and the 10 consecutive days before each event day were utilized as the training and validation dataset, which means 2011/03/01–2011/03/10, 2011/09/11–2011/09/20, 2011/12/22–2011/12/31, and 2011/02/17–2011/02/26 were the selected periods for the four events, respectively. Our data source contained approximately 100,000–130,000 users' GPS logs on each day within the target urban area. After conducting data cleaning and noise reduction on the raw dataset, we performed linear interpolation to make sure each user's 24-hour (00:00–23:59) GPS log has a constant 5-minute sampling rate. Then, by mapping each coordinate onto the mesh-grid, the crowd density video and crowd flow video could be generated based on the definitions listed in Section 4.
Parameter Settings: The parameter settings are summarized in Table 3. We meshed the entire area with ΔLon. = 0.005, ΔLat. = 0.004 (approximately 500 m × 500 m) to get an 80 × 80 mesh-grid map. As mentioned above, the time interval Δt of our system was set to 5 minutes. Therefore, we got 2,880 timeslots (288 × 10 days) as the training dataset and 288 timeslots as the testing dataset, and a crowd density frame and a crowd flow frame were generated for each timeslot. The kernel window was set to 15 × 15 for crowd flow, which could capture enough transit distance of crowd flow within 5 minutes. We set the observation step α and the prediction step β both to 6 to generate length-6 crowd density/flow videos as inputs and their corresponding next length-6 videos as outputs. This means our system could predict the crowd dynamics for the next 30 minutes. Each report contained six steps of prediction results at every 5 minutes, and the result at the 6th step gave us the maximum lead time of 30 minutes. Similarly, an evaluation of the prediction with a 60-minute lead time was also conducted by setting α and β to 12. Finally, we could get 2,868 sample pairs from the training dataset, and we randomly selected 80% of them (2,294) as the training samples and 20% of them (574) as the validation samples. The Adam algorithm was employed to control the overall training process, where the batch size was set to 4 and the learning rate to 0.0001 for all deep learning models, except that the learning rate of the CNN AutoEncoder was tuned as 0.001. The training was stopped after 200 epochs and only the best model was saved. In addition, we used 500 as the scaling factor for crowd density to scale the data down to relatively small values, and 100 as the scaling factor for the crowd flow values. In the evaluation, we rescaled the predicted values back to the normal values and compared them with the ground truth. The parameter settings were kept the same for each event. Python and some Python libraries such as Keras [8] and TensorFlow [1] were used in this study. The experiments were performed on a GPU server with four GeForce GTX 1080Ti graphics cards.
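The meshing step can be sketched in plain Python; the Tokyo Station coordinate used below is an approximate illustrative value, not a figure from the paper:

```python
# Map a GPS coordinate onto the 80 x 80 mesh-grid covering
# Lon. [139.50, 139.90], Lat. [35.50, 35.82] with cell sizes
# dLon = 0.005, dLat = 0.004 (about 500 m x 500 m).
LON0, LAT0 = 139.50, 35.50
D_LON, D_LAT = 0.005, 0.004
GRID_W, GRID_H = 80, 80

def to_grid(lon, lat):
    """Return the (row, col) mesh-grid cell of a coordinate,
    or None if it falls outside the target area."""
    col = int((lon - LON0) / D_LON)
    row = int((lat - LAT0) / D_LAT)
    if 0 <= col < GRID_W and 0 <= row < GRID_H:
        return row, col
    return None

# Tokyo Station is at roughly (139.767, 35.681).
cell = to_grid(139.767, 35.681)
```

Counting the users falling into each cell per timeslot yields the density frames; counting per-cell pairs between consecutive timeslots yields the flow frames.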
Baseline models: We implemented the following models as baselines for comparison.
(1) HistoricalAverage. The crowd density/flow for each timeslot was estimated by averaging the corresponding values of the last 10 days.
(2) CopyYesterday. We directly used yesterday's value as the predicted value on event days.
(3) CopyLastFrame. We directly copied the last/latest observation as the predicted value, which can be a simple but very effective method for event situations.
(4) ARIMA. A classical time-series prediction model designed for one-dimensional data. For each mesh-grid, we built one ARIMA model for time-series density prediction. However, for the flow tensor (80, 80, 225) at each timeslot, the dimension was just too high for ARIMA to handle.
(5) VectorAutoRegressive. An advanced time-series prediction model designed for high-dimensional data. By flattening the density tensor (80, 80, 1) at each timeslot into a 6,400-dimension vector, the model could handle the crowd density prediction task. For the flow tensor (80, 80, 225), the dimension was also just too high for VAR to deal with.
(6) CityMomentum [10]. It was first proposed for momentary mobility prediction at the citywide level for big events. We implemented it using a 500-meter mesh-grid and a 5-minute time interval, following exactly the same setting as our model. Although the model was built from the perspective of an individual's mobility, the predicted/simulated trajectory of each individual could be used for generating aggregated crowd density and flow, which makes it comparable with our system.
(7) ST-ResNet [60]. This deep residual learning-based method shows state-of-the-art performance on citywide crowd flow prediction. To compare its performance under the same problem definition as ours, we adapted ST-ResNet to take in limited steps of the latest observations on crowd density/flow and performed one-step-by-one-step autoregression to obtain multiple steps of predictions. Here, we also found that 1-residual-unit ST-ResNet without external features could achieve the best performance in our event situations.
(8) CNN. A one-step predictor constructed with four Conv layers. Note that the 4D tensor is converted to a 3D tensor (Height, Width, Timestep × Channel) by concatenating the channels at each timestep, just as [60] did, so that the CNN could take our 4D tensors as inputs. The first three Conv layers used 32 filters with a 3 × 3 kernel window, and the final Conv layer used a ReLU activation function to output a single step of the video frame.
(9) CNN Enc.-Dec. A multi-step predictor also constructed with four Conv layers. It shares the same parameter settings as (8). The only difference is that the final Conv layer outputs a 3D tensor (Height, Width, Timestep × Channel) as multiple steps of predictions.
(10) Multitask CNN Enc.-Dec. It has four Conv layers sharing a similar multitask architecture as illustrated in Figure 7, namely, separate input encoding Conv layers, shared encoding and decoding layers, and separate output Conv layers. All the parameters were kept the same as (9).
Table 4. Performance Evaluation of 30 Minutes Ahead Prediction on Four Events

Model                    | Earthquake      | Typhoon         | New Year        | Marathon
                         | Density  Flow   | Density  Flow   | Density  Flow   | Density  Flow
HistoricalAverage        | 106.032  0.726  | 75.402   0.519  | 176.013  1.099  | 33.381   0.223
CopyYesterday            | 129.436  0.912  | 85.641   0.592  | 110.444  0.660  | 65.765   0.437
CopyLastFrame            | 7.824    0.116  | 9.756    0.186  | 5.498    0.079  | 6.496    0.107
ARIMA                    | 10.430   NA     | 13.376   NA     | 8.343    NA     | 7.808    NA
VectorAutoRegressive     | 10.843   NA     | 13.377   NA     | 9.511    NA     | 9.380    NA
CityMomentum [10]        | 27.670   0.653  | 29.305   0.962  | 23.058   0.235  | 25.774   0.475
ST-ResNet [60]           | 6.542    0.113  | 7.802    0.183  | 4.544    0.080  | 5.548    0.103
CNN                      | 8.698    0.178  | 10.245   0.196  | 6.178    0.083  | 6.614    0.100
CNN Enc.-Dec.            | 7.115    0.117  | 8.571    0.187  | 5.216    0.079  | 6.004    0.095
M.T. CNN Enc.-Dec.       | 6.802    0.119  | 8.226    0.197  | 5.158    0.084  | 5.953    0.097
ConvLSTM                 | 6.737    0.124  | 7.959    0.195  | 4.679    0.077  | 5.675    0.094
ConvLSTM Enc.-Dec.       | 6.281    0.102  | 7.508    0.171  | 4.500    0.074  | 5.372    0.089
M.T. ConvLSTM Enc.-Dec.  | 5.549    0.102  | 6.753    0.170  | 4.117    0.074  | 5.012    0.086
(11) ConvLSTM (one-step-by-one-step) and (12) ConvLSTM Enc.-Dec. (multi-step-to-multi-step) are the proposed comparison models constructed with four ConvLSTM layers in Section 5. Each ConvLSTM layer uses 32 filters with a 3 × 3 kernel window, and ReLU activation is used in the final layer. BatchNormalization was added between two consecutive CNN/ConvLSTM layers for all the models. Note that for all of the crowd flow parts, as shown in Figure 6, the CNN AutoEncoder will first be applied to encode the original flow tensor and then decode the (predicted) encoded flow back to the original format. Our final system is implemented using the MultiTask ConvLSTM Enc.-Dec.
7.2 Performance Evaluation
Evaluation metric: We evaluated the performance of the models with MSE as follows:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \|\hat{Y}_i - Y_i\|^2,$$

where n is the number of samples, and Y and Ŷ are the ground-truth value and the predicted value in 4D tensor format, namely (Timestep, Height, Width, Channel). The density tensor and the flow tensor differ in the Channel dimension.
Overall performance: We compared the performance of the baseline models and our proposed model, Multitask ConvLSTM Enc.-Dec., on the four events. The overall evaluation results are summarized in Table 4 for 30-minute-ahead prediction and Table 5 for 60-minute-ahead prediction, both of which show that, across all four events: (1) our model performed better than the others; and (2) all deep learning models had advantages over the existing methodologies (CityMomentum and VAR). In particular, we could also observe (1) the superiority of ConvLSTM over CNN on video-like modeling tasks; (2) the advantage of the encoder-decoder architecture on multi-step sequential prediction tasks; and (3) the effectiveness of multitask learning in enhancing the correlated tasks.
Performance on density: We also verified the performance of our system by using a time-series evaluation over the event day to show the ground-truth and predicted density for selected areas (Tokyo Station Area and Shinjuku Station Area) in the city. Each area consists of 3 × 3 neighboring mesh-grids, with Tokyo Station and Shinjuku Station located at the central mesh-grid
Table 5. Performance Evaluation of 60 Minutes Ahead Prediction on Four Events

Model                    | Earthquake      | Typhoon         | New Year        | Marathon
                         | Density  Flow   | Density  Flow   | Density  Flow   | Density  Flow
HistoricalAverage        | 104.604  0.731  | 75.927   0.529  | 175.344  1.102  | 33.422   0.225
CopyYesterday            | 128.133  0.920  | 86.069   0.601  | 106.991  0.645  | 65.725   0.440
CopyLastFrame            | 13.020   0.168  | 16.607   0.252  | 8.650    0.096  | 10.004   0.128
ARIMA                    | 24.296   NA     | 32.933   NA     | 15.411   NA     | 15.259   NA
VectorAutoRegressive     | 22.355   NA     | 29.872   NA     | 21.072   NA     | 17.261   NA
CityMomentum [10]        | 32.034   0.570  | 35.090   0.821  | 25.867   0.207  | 28.825   0.400
ST-ResNet [60]           | 11.899   0.157  | 13.418   0.238  | 7.633    0.103  | 8.501    0.124
CNN                      | 12.247   0.189  | 17.670   0.360  | 10.469   0.119  | 12.114   0.223
CNN Enc.-Dec.            | 11.372   0.164  | 13.876   0.245  | 8.311    0.097  | 9.127    0.119
M.T. CNN Enc.-Dec.       | 10.812   0.177  | 13.800   0.247  | 8.153    0.101  | 9.004    0.124
ConvLSTM                 | 11.355   0.139  | 12.285   0.228  | 7.615    0.118  | 9.511    0.140
ConvLSTM Enc.-Dec.       | 9.309    0.122  | 11.186   0.197  | 6.885    0.086  | 7.843    0.103
M.T. ConvLSTM Enc.-Dec.  | 8.094    0.122  | 9.900    0.196  | 6.496    0.085  | 7.483    0.101
respectively. From Figure 8, we can straightforwardly confirm the effectiveness of our model for 60-minute-ahead prediction and its high deployability for a real-world online event crowd management system. Referring to the normal situation (the prediction result of HistoricalAverage) shown in the figure, we can find that the densities on event days differ greatly from normal situations. Furthermore, even comparing these four events, the density patterns are quite different from each other. This further demonstrates that crowd management in event situations is truly challenging and that our online prediction system can be indispensable for these special cases.
Performance on dynamic graph: Using the algorithm proposed in Section 6, we build a series of dynamic crowd mobility graphs for the 3.11 Earthquake event, and demonstrate three snapshots of the graph at 14:00, 15:00, and 16:00 (the earthquake occurred at 14:46 JST) in Figure 9. We generate 100 nodes and build the edge for each node pair based on the 6-step transition matrix Ω_{1:6}, which can indicate the crowd flow transition probability in the next 30 minutes. Through Figure 9, we can see how the crowd dynamics were gradually evolving during the earthquake. Both the ground truth and the predictions show quite different details at 14:00, 15:00, and 16:00. Crowd density prediction could achieve very good performance as shown previously, and the nodes of the graph show a close resemblance between the ground truth and the predictions. However, there still exists some gap between the ground truth and the prediction of the transition probability on the edges. Underestimation can be observed in the figure, which leaves us room for further improvement on the high-dimensional crowd flow video.
Overall efficiency: The implementation is done with TensorFlow v1.10, and the efficiency test is done on a GeForce GTX 1080Ti graphics card. Our proposed model (MultiTask ConvLSTM Encoder-Decoder) has 271,224 parameters in total. We verify the overall efficiency of our model by plotting the learning curves on the validation loss in Figure 10. Through it we can observe that, for the 30-minute lead-time prediction, i.e., α/β = 6/6, it takes around 80 epochs for our model to converge on those four datasets, while it takes an average of 100 epochs to converge on the four datasets when the lead time comes to 60 minutes, i.e., α/β = 12/12. Each training epoch takes around 120 seconds, and each batch takes around 52 milliseconds, which means that the deployed model can deliver the prediction result in less than one second.
Fig. 8. Visualization of the ground-truth crowd density, the prediction result of HistoricalAverage (seen as the normal situation), and the prediction result of our model (MultiTask ConvLSTM Enc.-Dec.) at four events.

Hyperparameter study: We conduct hyperparameter studies for our proposed model (MultiTask ConvLSTM Encoder-Decoder) on two hyperparameters: one is the observation/prediction step α/β, and the other is the multitask weight λ; the results are summarized in Figure 11. We can observe that as α/β increases from 3 to 15, the MSE losses on both density and flow show a slow and gradual increase on the four events, which is due to the limitation of LSTM in modeling overly long sequences. We can also see that as we adjust λ from 0.1 to 0.9, the MSE loss on density gradually decreases and the MSE loss on flow slowly increases, since the multitask weights for density and flow are λ and 1 − λ, respectively. This also verifies that the CNN AutoEncoder encodes the flow part into a task relatively balanced with the density part. Thus, to balance the density prediction and the flow prediction well, we set λ to 0.5 in our final system.
Fig. 9. Visualization of the ground-truth dynamic crowd mobility graph (top) and the predicted results (bottom) at the 3.11 Earthquake from 14:00 to 16:00. The larger and darker nodes have higher crowd density, and the darker edges represent higher transition probability. The node number K is set to 100, and the edges correspond to the six-step transition matrix Ω_{1:6}.
Fig. 10. Learning curves on validation loss.
8 CONCLUSION
This article is an extended journal version of our previous work [25]. In this study, we built a data-driven intelligent system called DeepUrbanEvent to predict citywide crowd dynamics at big events in an analogous manner to a video prediction task. We proposed to decompose crowd dynamics into crowd density and crowd flow, and designed a Multitask ConvLSTM Encoder-Decoder architecture to simultaneously predict multiple steps of crowd density and crowd flow for the future. The experimental results based on four big real-world events demonstrated the superior performance of our proposed model compared with the baseline methods. Our model has been successfully deployed to monitor the crowd dynamics for the Greater Tokyo Area, as demonstrated in Figure 12. The source code for these models has been released at https://github.com/deepkashiwa20/DeepUrbanEvent.

Fig. 11. Hyperparameter study: MSE on crowd density and flow prediction.

Fig. 12. Our prototype: Tokyo crowd dynamics system. The left part shows the real-time scalar values of crowd density and flow at the citywide level with a bar chart, as well as for a selected region with a time-series chart. The right part, through the 3D histogram and OD (Origin-Destination) flow chart, simultaneously shows the multiple steps of crowd density and flow with different types of geospatial visualization. The prediction result for the next step is highlighted with a large layout in the upper right. The Tokyo Metropolitan Government will develop and deploy a real crowd dynamics system based on our prototype.
REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Retrieved from http://tensorflow.org/.
[2] Yasunori Akagi, Takuya Nishimura, Takeshi Kurashima, and Hiroyuki Toda. 2018. A fast and accurate method for estimating people flow from spatiotemporal population data. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 3293–3300.
[3] Lei Bai, Lina Yao, Salil S. Kanhere, Xianzhi Wang, and Quan Z. Sheng. 2019. STG2Seq: Spatial-temporal graph to sequence model for multi-step passenger demand forecasting. In IJCAI. 1981–1987.
[4] Yoshua Bengio. 2009. Learning deep architectures for AI. Foundations and Trends® in Machine Learning 2, 1 (2009), 1–127.
[5] Pablo Samuel Castro, Daqing Zhang, and Shijian Li. 2012. Urban traffic modelling and prediction using large scale taxi GPS traces. In Pervasive Computing. Springer, 57–72.
[6] Di Chai, Leye Wang, and Qiang Yang. 2018. Bike flow prediction with multi-graph convolutional networks. In Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. 397–400.
[7] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1724–1734.
[8] François Chollet. 2015. keras. Retrieved from https://github.com/fchollet/keras.
[9] Zipei Fan, Xuan Song, Renhe Jiang, Quanjun Chen, and Ryosuke Shibasaki. 2019. Decentralized attention-based personalized human mobility prediction. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 3, 4 (2019), 1–26.
[10] Zipei Fan, Xuan Song, Ryosuke Shibasaki, and Ryutaro Adachi. 2015. CityMomentum: An online approach for crowd behavior prediction at a citywide level. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 559–569.
[11] Zipei Fan, Xuan Song, Tianqi Xia, Renhe Jiang, Ryosuke Shibasaki, and Ritsu Sakuramachi. 2018. Online deep ensemble learning for predicting citywide human mobility. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2, 3 (2018), 1–21.
[12] Jie Feng, Yong Li, Chao Zhang, Funing Sun, Fanchao Meng, Ang Guo, and Depeng Jin. 2018. DeepMove: Predicting human mobility with attentional recurrent networks. In Proceedings of the 2018 World Wide Web Conference. International World Wide Web Conferences Steering Committee, 1459–1468.
[13] Qiang Gao, Fan Zhou, Goce Trajcevski, Kunpeng Zhang, Ting Zhong, and Fengli Zhang. 2019. Predicting human mobility via variational attention. In Proceedings of the World Wide Web Conference. ACM, 2750–2756.
[14] Xu Geng, Yaguang Li, Leye Wang, Lingyu Zhang, Qiang Yang, Jieping Ye, and Yan Liu. 2019. Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting. In Proceedings of the 2019 AAAI Conference on Artificial Intelligence.
[15] Fosca Giannotti, Mirco Nanni, Fabio Pinelli, and Dino Pedreschi. 2007. Trajectory pattern mining. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 330–339.
[16] Shengnan Guo, Youfang Lin, Ning Feng, Chao Song, and Huaiyu Wan. 2019. Attention based spatial-temporal graph
convolutional networks for trac ow forecasting. In Proceedings of the AAAI Conference on Articial Intelligence.
Vol. 33. 922–929.
[17] Michiel Hermans and Benjamin Schrauwen. 2013. Training and analysing deep recurrent neural networks. In Proceed-
ings of the Advances in Neural Information Processing Systems. 190–198.
[18] Minh X Hoang, Yu Zheng, and Ambuj K Singh. 2016. Forecasting citywide crowd ows based on big data. In Proceed-
ings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems.
[19] T. Horanont, A. Witayangkurn, Y. Sekimoto, and R. Shibasaki. 2013. Large-scale auto-GPS analysis for discerning
behavior change during crisis. IEEE Intelligent Systems 28, 4 (2013), 26–34.
[20] Dou Huang, Xuan Song, Zipei Fan, Renhe Jiang, Ryosuke Shibasaki, Yu Zhang, Haizhong Wang, and Yugo Kato. 2019.
A variational autoencoder based generative model of urban human mobility. In Proceedings of the 2019 IEEE Conference
on Multimedia Information Processing and Retrieval. IEEE, 425–430.
[21] Wenhao Huang, Guojie Song, Haikun Hong, and Kunqing Xie. 2014. Deep architecture for traffic flow prediction: Deep belief networks with multitask learning. IEEE Transactions on Intelligent Transportation Systems 15, 5 (2014), 2191–2201.
[22] Renhe Jiang, Xuan Song, Zipei Fan, Tianqi Xia, Quanjun Chen, Qi Chen, and Ryosuke Shibasaki. 2018. Deep ROI-
based modeling for urban human mobility prediction. Proceedings of the ACM on Interactive, Mobile, Wearable and
Ubiquitous Technologies 2, 1 (2018), 1–29.
[23] Renhe Jiang, Xuan Song, Zipei Fan, Tianqi Xia, Quanjun Chen, Satoshi Miyazawa, and Ryosuke Shibasaki. 2018. DeepUrbanMomentum: An online deep-learning system for short-term urban mobility prediction. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence. 784–791.
[24] Renhe Jiang, Xuan Song, Zipei Fan, Tianqi Xia, Zhaonan Wang, Quanjun Chen, Zekun Cai, and Ryosuke Shibasaki.
2021. Transfer urban human mobility via POI embedding over multiple cities. ACM Transactions on Data Science 2, 1
(2021), 1–26.
ACM Transactions on Intelligent Systems and Technology, Vol. 13, No. 2, Article 21. Publication date: March 2022.
[25] Renhe Jiang, Xuan Song, Dou Huang, Xiaoya Song, Tianqi Xia, Zekun Cai, Zhaonan Wang, Kyoung-Sook Kim, and Ryosuke Shibasaki. 2019. DeepUrbanEvent: A system for predicting citywide crowd dynamics at big events. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2114–2122.
[26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems. 1097–1105.
[27] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278–2324.
[28] Xiucheng Li, Gao Cong, Aixin Sun, and Yun Cheng. 2019. Learning travel time distributions with deep generative
model. In Proceedings of the World Wide Web Conference. ACM, 1017–1027.
[29] Yaguang Li, Kun Fu, Zheng Wang, Cyrus Shahabi, Jieping Ye, and Yan Liu. 2018. Multi-task representation learning for travel time estimation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1695–1704.
[30] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2018. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In Proceedings of the International Conference on Learning Representations.
[31] Yuxuan Liang, Songyu Ke, Junbo Zhang, Xiuwen Yi, and Yu Zheng. 2018. GeoMAN: Multi-level attention networks for geo-sensory time series prediction. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 3428–3434.
[32] Ziqian Lin, Jie Feng, Ziyang Lu, Yong Li, and Depeng Jin. 2019. DeepSTN+: Context-aware spatial-temporal neural network for crowd flow prediction in metropolis. In Proceedings of the AAAI Conference on Artificial Intelligence.
[33] Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan. 2016. Predicting the next location: A recurrent model with spatial and temporal contexts. In Proceedings of the 13th AAAI Conference on Artificial Intelligence.
[34] Yisheng Lv, Yanjie Duan, Wenwen Kang, Zhengxi Li, and Fei-Yue Wang. 2015. Traffic flow prediction with big data: A deep learning approach. IEEE Transactions on Intelligent Transportation Systems 16, 2 (2015), 865–873.
[35] Xiaolei Ma, Zhimin Tao, Yinhai Wang, Haiyang Yu, and Yunpeng Wang. 2015. Long short-term memory neural network for traffic speed prediction using remote microwave sensor data. Transportation Research Part C: Emerging Technologies 54 (2015), 187–197.
[36] Xiaolei Ma, Haiyang Yu, Yunpeng Wang, and Yinhai Wang. 2015. Large-scale transportation network congestion evolution prediction using deep learning theory. PLoS ONE 10, 3 (2015), e0119044.
[37] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. 2011. Multimodal deep
learning. In Proceedings of the 28th International Conference on Machine Learning. 689–696.
[38] Zheyi Pan, Yuxuan Liang, Weifeng Wang, Yong Yu, Yu Zheng, and Junbo Zhang. 2019. Urban traffic prediction from spatio-temporal data using deep meta learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1720–1730.
[39] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556 (2014).
[40] Chao Song, Youfang Lin, Shengnan Guo, and Huaiyu Wan. 2020. Spatial-temporal synchronous graph convolutional networks: A new framework for spatial-temporal network data forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 914–921.
[41] Xuan Song, Hiroshi Kanasugi, and Ryosuke Shibasaki. 2016. DeepTransport: Prediction and simulation of human mobility and transportation mode at a citywide level. In Proceedings of the 25th International Joint Conference on Artificial Intelligence. 2618–2624.
[42] Xuan Song, Quanshi Zhang, Yoshihide Sekimoto, Ryosuke Shibasaki, Nicholas Jing Yuan, and Xing Xie. 2015. A simulator of human emergency mobility following disasters: Knowledge transfer from big disaster data. In Proceedings of the 29th AAAI Conference on Artificial Intelligence.
[43] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Proceedings
of the Advances in Neural Information Processing Systems. 3104–3112.
[44] Yusuke Tanaka, Tomoharu Iwata, Takeshi Kurashima, Hiroyuki Toda, and Naonori Ueda. 2018. Estimating latent people flow without tracking individuals. In Proceedings of the 2018 International Joint Conference on Artificial Intelligence. 3556–3563.
[45] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11, Dec (2010), 3371–3408.
[46] Dong Wang, Wei Cao, Jian Li, and Jieping Ye. 2017. DeepSD: Supply-demand prediction for online car-hailing services
using deep neural networks. In Proceedings of the 2017 IEEE 33rd International Conference on Data Engineering. IEEE,
243–254.
[47] Dong Wang, Junbo Zhang, Wei Cao, Jian Li, and Yu Zheng. 2018. When will you arrive? Estimating travel time based on deep neural networks. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
[48] Senzhang Wang, Jiannong Cao, and Philip Yu. 2020. Deep learning for spatio-temporal data mining: A survey. IEEE
Transactions on Knowledge and Data Engineering (2020).
[49] Yuandong Wang, Hongzhi Yin, Hongxu Chen, Tianyu Wo, Jie Xu, and Kai Zheng. 2019. Origin-destination matrix
prediction via graph convolution: A new perspective of passenger demand modeling. In Proceedings of the 25th ACM
SIGKDD International Conference on Knowledge Discovery & Data Mining. 1227–1235.
[50] Zheng Wang, Kun Fu, and Jieping Ye. 2018. Learning to estimate the travel time. In Proceedings of the 24th ACM
SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 858–866.
[51] Ling-Yin Wei, Yu Zheng, and Wen-Chih Peng. 2012. Constructing popular routes from uncertain trajectories. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 195–203.
[52] Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. 2019. Graph WaveNet for deep spatial-temporal graph modeling. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. 1907–1913.
[53] Shi Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. 2015. Convolutional
LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the Advances in Neural
Information Processing Systems. 802–810.
[54] Yu Yang, Fan Zhang, and Desheng Zhang. 2018. SharedEdge: GPS-free fine-grained travel time estimation in state-level highway systems. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2, 1 (2018), 48.
[55] Huaxiu Yao, Xianfeng Tang, Hua Wei, Guanjie Zheng, and Zhenhui Li. 2019. Revisiting spatial-temporal similarity: A deep learning framework for traffic prediction. In Proceedings of the 2019 AAAI Conference on Artificial Intelligence.
[56] Huaxiu Yao, Fei Wu, Jintao Ke, Xianfeng Tang, Yitian Jia, Siyu Lu, Pinghua Gong, Jieping Ye, and Zhenhui Li. 2018. Deep multi-view spatial-temporal network for taxi demand prediction. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
[57] Junchen Ye, Leilei Sun, Bowen Du, Yanjie Fu, Xinran Tong, and Hui Xiong. 2019. Co-prediction of multiple transportation demands based on deep spatio-temporal neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 305–313.
[58] Bing Yu, Haoteng Yin, and Zhanxing Zhu. 2018. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. AAAI Press, 3634–3640.
[59] Zhuoning Yuan, Xun Zhou, and Tianbao Yang. 2018. Hetero-ConvLSTM: A deep learning approach to traffic accident prediction on heterogeneous spatio-temporal data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 984–992.
[60] Junbo Zhang, Yu Zheng, and Dekang Qi. 2017. Deep spatio-temporal residual networks for citywide crowd flows prediction. In Proceedings of the 31st AAAI Conference on Artificial Intelligence. 1655–1661.
[61] Junbo Zhang, Yu Zheng, Dekang Qi, Ruiyuan Li, and Xiuwen Yi. 2016. DNN-based prediction model for spatio-temporal data. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 92.
[62] Junbo Zhang, Yu Zheng, Junkai Sun, and Dekang Qi. 2019. Flow prediction in spatio-temporal networks based on
multitask deep learning. IEEE Transactions on Knowledge and Data Engineering 32, 3 (2019), 468–478.
[63] Ling Zhao, Yujiao Song, Chao Zhang, Yu Liu, Pu Wang, Tao Lin, Min Deng, and Haifeng Li. 2019. T-GCN: A temporal graph convolutional network for traffic prediction. IEEE Transactions on Intelligent Transportation Systems 21, 9 (2019), 3848–3858.
[64] Chuanpan Zheng, Xiaoliang Fan, Cheng Wang, and Jianzhong Qi. 2020. GMAN: A graph multi-attention network for traffic prediction. In Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 1234–1241.
[65] Ali Zonoozi, Jung-jae Kim, Xiao-Li Li, and Gao Cong. 2018. Periodic-CRN: A convolutional recurrent model for crowd density prediction with recurring periodic patterns. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 3732–3738.
Received October 2020; revised April 2021; accepted June 2021