Crowd sensing and spatiotemporal analysis in urban open space using multi‐viewpoint geotagged videos

Authors: Feng Liu 1,2 | Zhigang Han 1,2,3,4 | Hongquan Song 1,2,4 | Jiayao Wang 1,2,3,5 | Chun Liu 3,5,6 | Gaohan Ban 1,2

Abstract

Increasing concern for urban public safety has motivated the deployment of a large number of surveillance cameras in open spaces such as city squares, stations, and shopping malls. The efficient detection of crowd dynamics in urban open spaces using multi-viewpoint surveillance videos continues to be a fundamental problem in the field of urban security. The use of existing methods for extracting features from video images has resulted in significant progress in single-camera image space. However, surveillance videos are geotagged videos with location information, and few studies have fully exploited the spatial semantics of these videos. In this study, multi-viewpoint videos in geographic space are used to fuse object trajectories for crowd sensing and spatiotemporal analysis. The YOLOv3-DeepSORT model is used to detect pedestrians and extract the corresponding image coordinates; spatial semantics (such as the position of a pedestrian in the field of view of the camera) are then combined to build a projection transformation matrix and map the objects recorded by a single camera to geographic space. Trajectories from multi-viewpoint videos are fused based on the features of location, time, and direction to generate a complete pedestrian trajectory. Then, crowd spatial pattern analysis, density estimation, and motion trend analysis are performed. Experimental results demonstrate that the proposed method can be used to identify crowd dynamics and analyze the corresponding spatiotemporal pattern in an urban open space from a global perspective, providing a means of intelligent spatiotemporal analysis of geotagged videos.
Transactions in GIS. 2023;27:494–515. © 2023 John Wiley & Sons Ltd. wileyonlinelibrary.com/journal/tgis
1 Key Laboratory of Geospatial Technology for the Middle and Lower Yellow River Regions (Henan University), Ministry of Education, Kaifeng, China
2 College of Geography and Environmental Science, Henan University, Kaifeng, China
3 Henan Industrial Technology Academy of Spatiotemporal Big Data, Henan University, Zhengzhou, China
4 Urban Big Data Institute, Henan University, Kaifeng, China
5 Henan Technology Innovation Center of Spatiotemporal Big Data, Henan University, Zhengzhou, China
6 School of Computer and Information Engineering, Henan University, Kaifeng, China

Correspondence
Zhigang Han, Key Laboratory of Geospatial Technology for the Middle and Lower Yellow River Regions (Henan University), Ministry of Education, Kaifeng, China.
Email: zghan@henu.edu.cn

Funding information
National Natural Science Foundation of China
DOI: 10.1111/tgis.13036
Received: 28 June 2022    Revised: 25 January 2023    Accepted: 6 February 2023
1 | INTRODUCTION
Continuous urbanization and the rapid growth of urban populations in recent years have resulted in increased atten-
tion being given to urban security (Laufs et al., 2020; Nishiyama, 2018; Socha & Kogut, 2020). Urban open spaces,
such as city squares, large shopping malls and stations, are characterized by dense crowds, frequent exchanges,
and complex situations. As a result, these spaces often become primary targets for terrorism (Jing et al., 2021;
Newburn, 2021; Qian et al., 2019), and the potential occurrence of emergencies, such as crowd stampedes (Zhang
et al., 2017; Zhao et al., 2021), poses severe challenges to urban security. In urban open spaces, the perception and
analysis of the temporal and spatial dynamics of crowds can provide key support for accurate decision-making, effi-
cient early warnings and emergency responses to relevant urban security issues (Draghici & Steen, 2018; Han, Li, Cui,
Han, et al., 2019; Socha & Kogut, 2020).
Continuous advancements in smart city development have enabled the deployment of large numbers of surveillance cameras in urban open spaces to capture real-time video data as a significant means of triggering early warnings
for crowd congestion, deterring criminal behavior, and ensuring urban security. This method has been widely used
in major cities around the world due to its low cost and ease of maintenance (Laufs et al., 2020). Surveillance video
is typically observed 24 h a day by dedicated personnel to achieve real-time monitoring of scenes, which, however,
makes continuous surveillance very expensive (Ahmed et al., 2018; Elharrouss et al., 2021). Automatic identification
of dynamic targets, such as crowds, from surveillance videos, perception of the spatial and temporal distribution of
these targets, and early warning predictions have become major concerns in urban security. The rapid development
of computer vision in recent years has facilitated significant progress in object detection and analysis in image space.
A series of deep learning-based object detection and tracking models, such as the region-based convolutional neural
network (RCNN), You Only Look Once (YOLO) network and Siamese Net, have been developed (Li et al., 2013; Liu
et al., 2020; Marvasti-Zadeh et al., 2021; Zou et al., 2019). Attention has been given to analyzing and mining surveil-
lance video data in many fields (Subudhi et al., 2019).
Video collected by surveillance cameras contains a stream representing observations of a particular geographic
space. This stream is a typical geotagged video that includes both spatial and temporal features captured through
ground- or nonground-based cameras with interior and exterior orientation elements (Furht, 2008; Kong, 2010).
Geotagged video is natural perception data that contains rich spatial semantics (Han et al., 2016; Jamonnak
et al., 2021; Lewis et al., 2011). Integrating surveillance video data with a geographic information system (GIS), using
a unified geographic reference system for video data management and analysis, and enhancing urban video surveil-
lance systems are very useful for urban security (Milosavljević et al., 2016; Patel et al., 2021; Wu et al., 2017; Xie
et al., 2017). Continual progress in object detection and tracking algorithms based on deep learning has enabled the
implementation of intelligent analysis and target perception of surveillance videos by extracting moving objects from
surveillance videos. However, research in this area has focused on feature analysis of images from single cameras, and
there has been limited investigation of fusion of multi-viewpoint surveillance video deployed in multiple locations
and relatively little spatiotemporal analysis of crowd trajectories (Elharrouss et al., 2021; Li et al., 2022; Milosavljević
et al., 2016; Zhang et al., 2019). In this study, multi-viewpoint surveillance video is used to conduct crowd sensing
in an urban open space and spatiotemporal analysis in a unified geographic space, providing a means of using spatial
semantics to perform surveillance video analysis and essential support for urban public safety. Two issues are consid-
ered in this study: (1) How can surveillance video be used to perform pedestrian detection and tracking? (2) How can
geospatial mapping from video image space to geographic space and pedestrian trajectory fusion be performed for
application to crowd spatiotemporal analysis?
The remainder of this article is organized as follows. A literature review is presented in Section 2. A methodology
for crowd sensing and spatiotemporal analysis in an urban open space using multi-viewpoint geotagged videos is
introduced in Section 3. Experiments on pedestrian detection and tracking and geospatial mapping are described in
Section 4. The results of pedestrian trajectory fusion and crowd spatiotemporal analysis are presented in Section 5.
The article is concluded in Section 6 with a brief summary and discussion.
2 | RELATED STUDIES
Video surveillance systems constitute one of the most active research areas in computer vision (Subudhi et al., 2019).
In major cities around the world, thousands of cameras collect massive quantities of video data every day. Detect-
ing and tracking moving objects are key to video surveillance (Zou et al., 2019). Object detection is the process of
identifying boxes and categories for objects in video images (Liu et al., 2020). There are primarily two types of object
detection algorithms, that is, conventional visual detection and deep learning-based methods. The former primarily
use traditional computer vision algorithms based on image features, including histogram of oriented gradient (HOG)
detection, frame difference, background difference, optical flow, and others (Kilger, 1992; Lipton et al., 1998; Mae
et al., 1996; Neri et al., 1998; Stauffer & Grimson, 1999). Deep learning methods have been applied to object detec-
tion using two types of algorithms. One type of algorithm employs two-stage detection to extract a set of object
candidate boxes by a selective search and then inputs each candidate box into the convolutional neural network
(CNN) for feature extraction and recognition of object categories. Algorithms within this category include RCNN, Fast
RCNN, spatial pyramid pooling network (SPP-Net), region-based fully convolutional network (R-FCN), mask RCNN,
and feature pyramid network, which have high accuracies but low calculation speeds (Dai et al., 2016; Girshick, 2015;
Girshick et al., 2014; He et al., 2017; Lin et al., 2017). The other type of algorithm performs single-stage detection to
enhance the detection speed and uses a single network structure for object detection. The main algorithms in this
category include YOLO, single-shot detection and Retina-Net (Lin et al., 2020b; Liu, Anguelov, et al., 2016; Redmon
et al., 2016). YOLO divides an image into multiple regions and predicts the bounding box of each region simultane-
ously with the probability of the object to which the box belongs to. With the introduction of multiscale features, the
small object detection performance of YOLOv3 improves significantly (Redmon & Farhadi, 2018).
Object tracking identifies the path or trajectory of an object in a given video sequence when the video frame
contains only the initial state of the object (Marvasti-Zadeh et al., 2021; Yilmaz et al., 2006). Conventional tracking
methods in computer vision primarily use image features (e.g., HOG) and model the appearance and motion of a target
by adopting template matching, mean filtering, scale-invariant feature transformation, the Kanade–Lucas–Tomasi
tracking algorithm, the Kalman filter, and the Hungarian algorithm (Bae & Song, 2008; Du et al., 2012; Lowe, 2004;
Sahbani & Adiprawita, 2016; Simon, 2001; Svoboda, 2007). The Hungarian algorithm is a classical combinatorial
optimization schema that is used in multitarget tracking for target matching between two frames, that is, the front
and rear frames. In the simple, online, and real-time (SORT) tracker, a Kalman filter operating in the image space is combined with the frame-by-frame data association of the Hungarian algorithm, thereby increasing the frame rate of multitarget tracking (Bewley et al., 2016). With the rapid adoption of deep learning, CNNs have been introduced into the target tracking calculation, and Siamese neural networks based on similarity learning have been designed for target tracking with high accuracy and reliability (Bertinetto et al., 2016). The deficiencies of
SORT in tracking occlusion have been addressed in the DeepSORT tracker. A CNN network is integrated into Deep-
SORT for feature extraction, and a combined metric for the association of target motion and appearance information
is used to increase the robustness against target omission and occlusion. DeepSORT is also easy to implement,
computationally efficient, and suitable for multitarget tracking (Wojke et al., 2017).
Video clips are a commonly used type of media that consist of a collection of images with temporal relation-
ships. Video clips offer the advantages of spatiotemporal semantics, high information resolution, direct expression,
and accurate transmission of spatial relationships. Geotagged video can integrate geospatial semantics and video
image features by extracting spatial information, such as the video location and field of view. Thus, geotagged video
is an important source of geographic information that has received attention in many fields, such as GIS, computer
vision and data mining (Luo et al., 2011). Several studies have been performed in this area over the last several years,
including on geotagged video data collection and processing (Burr et al., 2018; Mills et al., 2010); data modeling (Han
et al., 2016; Lewis et al., 2011); video retrieval based on spatial semantics (Kim et al., 2010; Lu & Shahabi, 2017;
Ma et al., 2014; Wu et al., 2018); video mapping, analysis, and mining (Jamonnak et al., 2021; Rumora et al., 2021;
Wang  et al., 2022; Zhang et al., 2021); and video synopsis (Jamonnak et al., 2020; Xie et al., 2022; Zhang et al., 2019).
The integration and fusion of surveillance video and GIS can significantly improve the management efficiency of
surveillance video and enhance video surveillance systems. GIS offers significant advantages as a general framework
for video surveillance. Two types of modes, namely, GIS-enhanced video and video-enhanced GIS, have been defined
and verified using the GeoScopeAVS prototype (Milosavljević et al., 2010; Sankaranarayanan & Davis, 2008). Spatial
semantics are used to integrate dynamic targets between GIS and surveillance videos and retrieve target trajectories
(Li et al., 2021; Xie et al., 2017, 2021). The continuous trajectory of dynamic objects in video is extracted through
background subtraction and Canny operator fusion, and objects in three-dimensional (3D) geographic scenes are
located by combining imaging rays and digital surface model intersections (Han et al., 2022; Li et al., 2022). Align-
ment and matching of video image space and 3D geographic space are key to integrating surveillance video and GIS.
For effective image matching, a method for projecting vector data into surveillance video has been proposed based
on using remote sensing and video images (Shao et al., 2020). High-resolution orthoimages and digital elevation
models can be used in conjunction with the Levenberg–Marquardt iterative optimization method to determine the
locations and orientations of cameras (Milosavljević et al., 2017). A nonlinear perspective correction model has been
used to calculate a homography matrix based on multiple matching points to implement the geospatial mapping of
video objects (Zhang et al., 2021). This method has the advantages of facile parameter acquisition and convenient
calculation (Lin et al., 2020a).
In summary, the advantages of deep learning have facilitated remarkable progress in object detection and track-
ing. Surveillance video is a typical geotagged video, such that its spatial semantics can be integrated with GIS to
perform spatiotemporal analysis and facilitate the creation of more intelligent geographic information through geospatial artificial intelligence (GeoAI) (Janowicz et al., 2020). A review of the current literature reveals that limited studies
have been performed on crowd sensing and spatiotemporal analysis in urban open spaces based on multi-viewpoint
cameras. In this study, object detection and tracking methods are integrated to extract crowd trajectories from
multi-viewpoint surveillance videos and perform geospatial mapping for crowd sensing and corresponding spatio-
temporal analysis.
3 | METHODOLOGY
Crowd sensing and spatiotemporal analysis in urban open spaces are based on determining pedestrian locations and performing spatial analysis. The main framework of this method is described in this section (Figure 1) and consists of four parts: pedestrian detection and crowd tracking, geospatial mapping of pedestrians, trajectory fusion, and crowd spatiotemporal analysis. The spatiotemporal analysis methods are presented in Section 5.
3.1 | YOLOv3 model for crowd detection
Object detection and tracking are key tasks in surveillance video analysis. Considering the low latency requirements
for surveillance video data processing, a single-stage detection model that considers both speed and accuracy has
significant advantages. In this study, the YOLOv3 model for object detection is integrated with the DeepSORT
tracking algorithm for crowd detection and tracking in surveillance videos (Ahmed et al., 2021; Punn et al., 2020). The
video image data are input into the YOLOv3 detector to determine the object bounding box as output to the Deep-
SORT tracker (Figure 1). The advantages of the YOLOv3-DeepSORT model include robustness to object occlusion and illumination interference and high tracking reliability, which enable multitarget detection and crowd tracking in surveillance videos and lay a foundation for crowd trajectory fusion and spatiotemporal analysis (Hossain & Lee, 2019; Zhang et al., 2019).
Within YOLOv3, object detection is treated as a regression problem. Feature maps of different scales are
extracted and fused from the input image through multiple convolutions and pooling. The model structure is
shown in Figure 1. YOLOv3 includes a feature extraction layer based on Darknet-53 and a YOLO multiscale
prediction layer. Darknet-53 consists of five residual blocks, each of which is composed of a prescribed number
of convolution blocks following a prescribed number of cycles through fusion. In the YOLO prediction layer, detection is first performed at three scales corresponding to downsampling factors of 32, 16, and 8; then, upsampled features are fused to extract deep features, and prediction maps with sizes of 13 × 13, 26 × 26, and 52 × 52 are finally used to predict objects. The YOLOv3 calculation process is as follows: first, the video frame image is input and resized; then, convolution and pooling operations are used to generate a three-layer feature map. These
feature maps of different scales are used to predict the object bounding box for each scale, where each bound-
ing box includes four coordinates (the center point coordinates and the width and height). Then, the classification
confidence is output as the detection result (Redmon & Farhadi, 2018). The object type in this study is set as
“pedestrians,” and the YOLOv3 algorithm is used to generate image coordinates and assign unique IDs to detected
pedestrians.
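As a concrete illustration of this detection step, the sketch below runs a YOLOv3 network with OpenCV's DNN module and keeps only the "person" class. The configuration and weight file names and the class index are assumptions (the model in this study is trained on VOC2007, whose class indices differ from COCO), so this is a minimal sketch rather than the authors' exact setup.

```python
import cv2
import numpy as np

# Hypothetical file names; any YOLOv3 config/weights pair with a pedestrian class can be used.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
out_names = net.getUnconnectedOutLayersNames()
PERSON_CLASS_ID = 0  # 0 for COCO-trained weights; a VOC-trained model uses a different index

def detect_pedestrians(frame, conf_thr=0.5, nms_thr=0.4):
    """Return (box, confidence) pairs for the 'person' class in one video frame."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    boxes, scores = [], []
    for out in net.forward(out_names):          # predictions from the 13x13, 26x26, 52x52 scales
        for det in out:                         # det = [cx, cy, bw, bh, objectness, class scores...]
            class_scores = det[5:]
            if int(np.argmax(class_scores)) != PERSON_CLASS_ID:
                continue
            conf = float(class_scores[PERSON_CLASS_ID])
            if conf < conf_thr:
                continue
            cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            scores.append(conf)
    if not boxes:
        return []
    keep = cv2.dnn.NMSBoxes(boxes, scores, conf_thr, nms_thr)   # non-maximum suppression
    return [(boxes[i], scores[i]) for i in np.array(keep).reshape(-1)]
```

The centers of the returned bounding boxes are the image coordinates that are later passed to the tracker and to the geospatial mapping step.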
FIGURE 1 The proposed method for crowd sensing and spatiotemporal analysis.
3.2 | DeepSORT model for crowd tracking
The object bounding box detected by YOLOv3 is input into the DeepSORT tracker, which is combined with the
improved Kalman filter to predict the object position. The Mahalanobis distance and the cosine distance of the deep appearance descriptor are fused into a single metric that is used by the Hungarian algorithm to perform cascade matching, and the object tracking results are output (Figure 1). The Kalman filter and Hungarian algorithm are combined in SORT to model the motion of the detected object as linear and to predict the position of the object in the next frame according to the
position of the current frame and the target motion speed (Bewley et al., 2016). The correlation between the predic-
tion and truth is then measured by using the appearance function and Mahalanobis distance. The CNN in the Deep-
SORT algorithm uses a large-scale pedestrian re-identification dataset for pretraining to build the target tracking appearance features (Wojke et al., 2017). Data association is thus performed using both motion and appearance information, thereby reducing interference from occlusion. Using the object bounding box detected by YOLOv3 as the input to DeepSORT tracking accomplishes the multitarget visual tracking task according to the unique pedestrian ID.
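The association step described above can be sketched as follows: a Mahalanobis (motion) term and a cosine (appearance) term are gated and fused into one cost matrix, which is solved with the Hungarian algorithm via SciPy. The weighting lam and the gate values are illustrative assumptions, and DeepSORT's cascade matching by track age is omitted for brevity, so this is a simplified sketch rather than the full tracker.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

INF_COST = 1e5  # placeholder cost for gated-out (infeasible) track-detection pairs

def associate(pred_xy, pred_cov, track_feats, det_xy, det_feats,
              lam=0.2, maha_gate=9.49, cos_gate=0.3):
    """One association step: fuse motion and appearance distances, then solve the
    assignment with the Hungarian algorithm. Returns matched (track, detection) pairs."""
    cost = np.full((len(pred_xy), len(det_xy)), INF_COST)
    for i, (mean, cov, tf) in enumerate(zip(pred_xy, pred_cov, track_feats)):
        inv_cov = np.linalg.inv(cov)
        for j, (z, df) in enumerate(zip(det_xy, det_feats)):
            diff = np.asarray(z, dtype=float) - np.asarray(mean, dtype=float)
            maha = float(diff @ inv_cov @ diff)                     # Kalman-predicted motion term
            cos = 1.0 - float(np.dot(tf, df) /
                              (np.linalg.norm(tf) * np.linalg.norm(df) + 1e-9))
            if maha > maha_gate or cos > cos_gate:                  # gate implausible pairs
                continue
            cost[i, j] = lam * maha + (1.0 - lam) * cos             # fused metric
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < INF_COST]
```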
3.3 | Geospatial mapping for video objects
After the objects in the video are detected and tracked, the pixel coordinates of the object are generated in image
space. The pixel coordinates must be converted to geographic coordinates to perform crowd spatiotemporal anal-
ysis. This transformation is carried out using a camera calibration model based on the homography matrix method.
Consider a point P (X, Y, Z) in the real world that is projected to the two-dimensional (2D) image plane in the
camera calibration model as p (u, v). Due to imaging deformation, a distortion correction coefficient must be added
to the mapping matrix to improve conversion accuracy. Normalized homogeneous coordinates are used to define
$P = \begin{bmatrix} X & Y & Z & 1 \end{bmatrix}^T$ and $p = \begin{bmatrix} u & v & 1 \end{bmatrix}^T$. The mapping matrix is denoted by $M$, and the transformation from pixel coordinates to geographic coordinates is defined as $p = MP$. To perform camera calibration, it is necessary to determine the camera interior orientation elements, such as the center point and focal length, as well as the camera exterior orientation elements, such as the translation, rotation and scale factor. $M$ is defined as the mapping matrix for the camera calibration process and includes the camera interior and exterior orientation elements and camera lens distortion coefficient. Thus, the mapping matrix can be defined as $M = sKQ$, where $s$ is the scale factor, and $K$ is the camera interior orientation parameter matrix related to the internal structure of each camera, as defined below:

$$K = \begin{bmatrix} 1/d_x & 0 & u_0 \\ 0 & 1/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \tag{1}$$

where $d_x$ and $d_y$ are the pixel sizes of the camera in the x and y directions, respectively; $u_0$ and $v_0$ are the coordinates of the image center point; $f_x$ and $f_y$ are the camera focal lengths; and $c_x$ and $c_y$ are the camera optical axis offsets. $Q$ is the matrix of camera exterior orientation elements and is defined as $Q = \begin{bmatrix} r_1 & r_2 & r_3 & t \end{bmatrix}$, where $r_1$, $r_2$ and $r_3$ denote the rotation angles along the X-, Y- and Z-axes, respectively; and $t$ denotes the translation value between the two coordinate systems. In the homography matrix method, the field of view of the video in the geographic space is assumed to be a plane, that is, the ground is assumed to be the Z = 0 plane. Then, $P = \begin{bmatrix} X & Y & 0 & 1 \end{bmatrix}^T$ is defined, such that the mapping relationship between the image and geographic spaces can be considered a mapping from one plane to another. The $r_3$ term rotating around the Z-axis can be removed to yield the following simplified formula:
$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = s \begin{bmatrix} 1/d_x & 0 & u_0 \\ 0 & 1/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_1 & r_2 & t \end{bmatrix} \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} = s \cdot K \cdot Q \cdot P = MP \tag{2}$$
The inverse of the M matrix can be used to map the 2D image coordinates to the world coordinate system.
$$\begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} = M^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \tag{3}$$
The camera interior orientation parameters for the transformation process are calculated using Zhang's (2000)
calibration method. The camera exterior orientation elements are calculated by the Perspective-n-Point (PnP) algo-
rithm, where the points with the same labels are used to establish the mapping relationship between the image and
geographic space (Lepetit et al., 2009).
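The paper assembles M from the calibrated interior parameters and the PnP-derived exterior parameters. Under the same Z = 0 ground-plane assumption, a lighter-weight route that is mathematically equivalent is to fit the image-to-ground homography directly from the matched control points and apply it as in Equation (3). The sketch below does this with OpenCV; it is not the authors' exact implementation, and the control-point values are illustrative placeholders rather than the paper's surveyed points.

```python
import cv2
import numpy as np

# Matched control points for one camera: pixel coordinates (u, v) after distortion
# correction, and the corresponding ground-plane coordinates (X, Y) in a projected
# system such as UTM. These values are illustrative placeholders.
img_pts = np.array([[412, 633], [988, 601], [1525, 710], [1047, 912],
                    [620, 845]], dtype=np.float32)
geo_pts = np.array([[702134.2, 3857201.5], [702141.8, 3857203.1], [702148.6, 3857198.9],
                    [702140.3, 3857194.7], [702135.9, 3857196.4]], dtype=np.float32)

# Under the Z = 0 assumption the image-to-ground mapping is a 3x3 homography,
# which plays the role of M^-1 in Equation (3).
M_inv, _ = cv2.findHomography(img_pts, geo_pts)

def pixel_to_ground(u, v):
    """Map an undistorted pixel coordinate to ground-plane (X, Y) coordinates."""
    p = np.array([[[u, v]]], dtype=np.float32)
    X, Y = cv2.perspectiveTransform(p, M_inv)[0, 0]
    return float(X), float(Y)

print(pixel_to_ground(900.0, 750.0))
```

When the full interior and exterior orientation elements are available, the same mapping can instead be obtained by composing and inverting M as in Equations (2) and (3); fitting the homography directly simply folds the calibration and PnP results into a single matrix.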
3.4 | Pedestrian trajectory fusion based on a multidimensional similarity measure
Pedestrian detection and tracking and the corresponding geospatial mapping are used to generate the pedestrian
trajectories within the field of view of each camera. However, it is necessary to use the similarity measure method
for trajectory fusion to build complete trajectories of pedestrians in urban open spaces for crowd spatiotemporal
analysis. Dynamic time warping (DTW) (Senin, 2008), longest common subsequence (LCSS) (Vlachos et al., 2002) and
edit distance on real sequence (EDR) (Chen et al., 2005) are commonly used to calculate the trajectory similarity (Jing
et al., 2022). The DTW algorithm is primarily used to process trajectory sequences of equal length, whereas both
the LCSS and EDR are variants of the editing distance algorithm. The trajectory similarity is calculated by defining a
matching threshold to search for the longest common subsequence between two trajectory sequences. Most of the
abovementioned algorithms use a single feature, such as the Euclidean distance to calculate the trajectory similarity.
Considering the distance, time and direction features of pedestrian trajectories within the field of view of multiple
cameras, the multidimensional similarity measure (MSM) based on multiple features is used to perform multicamera
pedestrian trajectory matching and fusion (Furtado et al., 2016). The direction of each point in the trajectory is deter-
mined by calculating the azimuth of the vector that points from the previous point to the current point. The MSM is
used to calculate the similarity score for two trajectory sequences
AT
1
and
AT
2
by searching for the best matching score
for all the elements in these sequences. The azimuth angles of the corresponding trajectory points, distance and time
are compared. The number of elements in
AT
1
and
AT
2
are denoted by
AN
1
and
AN
2
 , respectively, and the similarity index
M is defined as follows:
M
(T1,T2)=
0 if N1= 0 or N2
=0
p(T1,T2)+p(T2,T1)
N
1
+N
2
otherwise
(4)
where
Ap
(
T
1
,T
2)
is the sum of the maximum matching scores of all the elements
At
i
in
AT
1
and any element in
AT
2
and vice
versa.
p
(T1,T2)=
t
i
T
1
max
s
ti,tj
:tjT2
(5)
LIU et al.501
where
As
ti,tj
is the weighted sum of the matching scores between the trajectory sequence elements
At
i
and
At
j
in
AK
dimension features and is defined as:
s
ti,tj=
K
k=1
mkti,tjωk
(6)
where
Aωk
denotes the weight. The matching score
Am
k
of
At
i
and
At
j
is a binary value: 1 if the matching condition is met
and 0 otherwise. The matching conditions are defined based on the distance, time and azimuth difference of the
trajectory points:
m
kti,tj=
1 if dkti,tjD
k
0 otherwise
(7)
where
AD
k
is the matching threshold. The number of points in
AT
1
and
AT
2
may not be identical, resulting in a discrepancy
in the number of accumulations for the maximum
As
ti,tj
when computing
Ap
(T
1
,T
2)
and
Ap
(T
2
,T
1)
 . Thus, it is neces-
sary to calculate
Ap
(T
1
,T
2)
and
Ap
(T
2
,T
1)
respectively. The similarity index of the pedestrian trajectories extracted by
different cameras is calculated to realize trajectory matching and fusion of the same pedestrian trajectory with the
multi-viewpoint cameras, thereby generating the complete trajectory of all pedestrians in the monitoring area.
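A minimal sketch of Equations (4)-(7) is given below. It assumes each trajectory point is stored as a tuple (x, y, t, azimuth) and that the thresholds and weights are ordered as (distance, time, direction); this bookkeeping is an assumption for illustration, not part of the published method.

```python
import numpy as np

def _feature_distances(a, b):
    """Per-dimension distances d_k between two points stored as (x, y, t, azimuth)."""
    d_loc = float(np.hypot(a[0] - b[0], a[1] - b[1]))      # spatial distance (m)
    d_time = abs(a[2] - b[2])                              # time difference (s)
    d_dir = abs((a[3] - b[3] + 180.0) % 360.0 - 180.0)     # azimuth difference (degrees)
    return np.array([d_loc, d_time, d_dir])

def _p(T1, T2, thresholds, weights):
    """p(T1, T2): best weighted match score in T2 for every element of T1 (Eqs. 5-7)."""
    total = 0.0
    for ti in T1:
        best = 0.0
        for tj in T2:
            m = (_feature_distances(ti, tj) <= thresholds).astype(float)  # Eq. (7)
            best = max(best, float(np.dot(m, weights)))                   # Eq. (6)
        total += best
    return total

def msm(T1, T2, thresholds, weights):
    """Similarity index M(T1, T2) of Eq. (4)."""
    if len(T1) == 0 or len(T2) == 0:
        return 0.0
    return (_p(T1, T2, thresholds, weights)
            + _p(T2, T1, thresholds, weights)) / (len(T1) + len(T2))
```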
4 | PEDESTRIAN DETECTION AND GEOSPATIAL MAPPING
A campus square is selected as the study area. This square is almost circular with a diameter of approximately 110 m,
and the geographic coordinates of the center point are 114°18′12.88″ E, 34°49′7.55″ N. Four Sony FDR-AXP55-4K
cameras with a resolution of 1920 × 1080 pixels are deployed in the area for scene monitoring, where the camera
positions are shown in Figure 2. Video data are synchronously captured in four video clips, each of which is 31 s long
with 775 frames. Experiments are performed on pedestrian detection and tracking, geospatial mapping and crowd
spatiotemporal analysis, and the results are analyzed. The hardware environment is a 3.60-GHz Intel Core i7 4790
processor with 16.0 GB of random-access memory. The software environment consists of TensorFlow 1.15 for deep-learning-based object detection and tracking and OpenCV 4.2 for video data processing, and Python scripts are used for the corresponding calculation and analysis.
4.1 | Pedestrian detection and tracking
Pedestrian detection and tracking are the basis of crowd spatiotemporal analysis. OpenCV is first used to read the
video frame sequence, and the YOLOv3-DeepSORT model is then used to detect and track the objects of the corre-
sponding frame. The VOC2007 dataset is used to train the YOLOv3 model and generate the model parameters. In
this study, we set the object category to “pedestrian” in the calculations. The center point of the object bounding box is used as the output pedestrian position, and corresponding control points between the image and geographic coordinates are collected using a handheld GPS receiver. Figure 3 shows the results of the YOLOv3-DeepSORT object detection and
tracking. The average detection and tracking speed of the YOLOv3-DeepSORT model is 5.09 frames per second,
which takes a total of 2.55 min in our experiment. The screenshots of the two videos before and after running the
detection and tracking model indicate that the overlapped and occluded images of the pedestrians with IDs 23 and
24 are identified by the model as objects 23 and 24, thus achieving continuous tracking. As the DeepSORT algorithm
performs nearest-neighbor-matching based on object appearance features, the robustness of pedestrian tracking is
significantly improved.
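A sketch of this processing loop is shown below. The clip name is a placeholder, detect_pedestrians stands for the detection sketch in Section 3.1, and the tracker update is indicated only as a comment because the DeepSORT implementation used by the authors is not reproduced here.

```python
import cv2

cap = cv2.VideoCapture("camera_c1.mp4")      # placeholder name; one of the four 31 s clips
frame_idx = 0
while True:
    ok, frame = cap.read()                   # read the next frame of the sequence
    if not ok:
        break
    detections = detect_pedestrians(frame)   # YOLOv3 detection sketch from Section 3.1
    # tracker.update(detections) would assign persistent pedestrian IDs (DeepSORT step)
    frame_idx += 1
cap.release()
print(f"processed {frame_idx} frames")
```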
4.2 | Geospatial mapping for video objects
Geospatial mapping of video objects comprises three steps, that is, camera calibration, calculation of the exterior
orientation elements, and coordinate transformation. Zhang's (2000) method is used to perform camera calibration
in this study. A calibrated checkerboard image is shot and corner points are identified to generate the internal camera
parameter matrix and distortion coefficient, and the video frame image distortion is corrected. The interior orienta-
tion parameter and distortion correction metrics of the four cameras are shown in Appendix A. The calculated exterior
orientation elements are compiled into the matrix
AQ
 , including the rotation and translation parameters. Pedestrians
hold a handheld GPS receiver in the monitoring area and simultaneously appear in the images of the four cameras
enabling the geographic coordinates of the pedestrians to be recorded (Figure 4). Then, the YOLOv3-DeepSORT
model is used to sequentially calculate the coordinates of the pedestrians in the image space. A total of 10 points of
GPS coordinates are recorded, and the geospatial coordinates are calculated using the Universal Transverse Merca-
tor projection. The image and geographic coordinates of the points are shown in Appendix B. The PnP algorithm in
OpenCV is used to generate the external camera parameter matrix, thereby establishing the geospatial mapping
FIGURE 2 The location and estimated field of view of the cameras in the study area.
relationship between the image and the geographic coordinates. The coordinate transformation accuracy is evaluated
in terms of the root mean square error (RMSE) and the mean absolute error (MAE), which are defined below:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( P_i - \hat{P}_i \right)^2} \tag{8}$$
FIGURE 3 Pedestrian detection and tracking using the YOLOv3-DeepSORT model. (a) Overlapping and occlusion of pedestrians 23 and 24 during detection and tracking; (b) detection and tracking of the overlapped and occluded images of pedestrians 23 and 24.
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| P_i - \hat{P}_i \right| \tag{9}$$
where $n$ denotes the number of samples, $P_i$ represents the transformed coordinates, and $\hat{P}_i$ denotes the real coordinates. The results for the evaluation indicators are provided in Appendix C. The average RMSE and MAE are
1.115 and 0.984 m, respectively, for the transformed x-coordinates and 2.000 and 1.580 m, respectively, for the
transformed y-coordinates. The errors in the geospatial mapping for video objects are typically caused by multi-
ple factors, including equipment error, data acquisition error, and data processing error. The inherent errors of
cameras, GPS receivers and other equipment introduce systematic errors into the calibration of the interior and
exterior orientation elements of the cameras and the coordinate determination. During data acquisition, changes in
lighting, texture, etc., cause the detection frame to be offset during object detection. The occlusion of buildings also
affects the location accuracy of the GPS receivers. The coordinate transformation generates fitting residuals for the solution of the transformation parameters using the pinhole camera model, etc. These errors can be reduced by
using more precise equipment, selecting optimal lighting conditions and objects with dense textures, and repeating
the calculations for fitting the transformation parameters. As crowd analysis is used to determine an overall distri-
bution, the conversion results can be used to perform a spatiotemporal analysis of the crowd distribution in the
study area.
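For reference, the two indicators of Equations (8) and (9) can be computed per coordinate axis with a few lines of NumPy; the arrays below are placeholders for the transformed and GPS-measured control-point coordinates, not the values in Appendices B and C.

```python
import numpy as np

def rmse_mae(transformed, measured):
    """Per-axis RMSE and MAE (Eqs. 8-9) for n transformed vs. reference points."""
    err = np.asarray(transformed, dtype=float) - np.asarray(measured, dtype=float)
    return np.sqrt((err ** 2).mean(axis=0)), np.abs(err).mean(axis=0)

# Placeholder (X, Y) coordinate pairs in meters.
pred = np.array([[702134.9, 3857200.2], [702141.1, 3857204.8], [702149.8, 3857197.1]])
true = np.array([[702134.2, 3857201.5], [702141.8, 3857203.1], [702148.6, 3857198.9]])
rmse, mae = rmse_mae(pred, true)
print("RMSE (x, y):", rmse, "MAE (x, y):", mae)
```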
FIGURE 4 Pedestrian locations in the map and image.
5 | FUSION OF PEDESTRIAN TRAJECTORIES AND CROWD SPATIOTEMPORAL
ANALYSIS
5.1 | Fusion of pedestrian trajectories
The pedestrian detection and tracking and geospatial mapping results are used to extract the pedestrian trajectories
from the videos captured by the four cameras. These trajectories are fused using the method presented in Section 3.4.
The time threshold is set to 10 s, the angle threshold is 1°, and the distance threshold is 0.5 m. The weights are 0.2, 0.2,
and 0.6. Table 1 shows the similarity results for three pedestrian trajectories, that is, 2, 33, and 49, in the C1 camera
video, and three pedestrian trajectories, that is, 1, 6 and 10, in the C2 camera video. The similarity value of Trajectory
49 (C1) and Trajectory 10 (C2) is 0.4539, which is significantly higher than that for Trajectories 1 and 6 (C2). Among the
aforementioned trajectories, Trajectory 2 (C1) has the highest similarity with Trajectory 6 (C2) of 0.3816. The similarity
of Trajectory 2 (C1) and Trajectory 1 (C2) of 0.3802 is higher than that of Trajectory 33 (C1) and Trajectory 1 (C2) of
0.1977, but lower than the aforementioned similarity of Trajectories 2 and 6 (0.3816); therefore, Trajectories 2 and 6
and Trajectories 33 and 1 are best matched. Figure 5a,b shows that Trajectory 33 (C1) and Trajectory 1 (C2) are the same target trajectory, and Trajectories 2 and 49 (C1) correspond to the C2 video targets 6 and 10, respectively. Despite
the closeness of Trajectories 1 and 6 and Trajectories 2 and 33, Trajectories 1 and 33 and Trajectories 2 and 6 can still
be correctly matched because the MSM method comprehensively considers distance, direction, and time factors. The
corresponding relationship is used to combine the trajectory points into a complete pedestrian trajectory dataset across
the camera field of view of the study area, where each trajectory point contains time and location information.
TABLE 1 The similarity scores for trajectories extracted from different cameras.

                                    Trajectory 2 (C1)    Trajectory 33 (C1)    Trajectory 49 (C1)
Trajectory 1 extracted from C2      0.3802               0.1977                0.1810
Trajectory 6 extracted from C2      0.3816               0.1511                0.1708
Trajectory 10 extracted from C2     0.1744               0.0773                0.4539

Note: The similarity score is assessed using both rows and columns to identify the best-matched trajectories (shown in bold in the original: Trajectories 2 and 6, 33 and 1, and 49 and 10). The similarity score between Trajectories 2 and 1 is 0.3802, which is lower than that between Trajectories 2 and 6 (0.3816), indicating that Trajectory 2 is a better match to Trajectory 6 than to Trajectory 1. As the best-matched trajectory to Trajectory 2 has been found, only the similarity scores for Trajectory 1 and the remaining trajectories must be compared. Trajectory 1 has a higher similarity score with Trajectory 33 (0.1977) than with Trajectory 49 (0.1810).
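Using the msm sketch from Section 3.4, the fusion step above amounts to scoring every cross-camera trajectory pair with the stated thresholds and keeping the best-scoring pairs. The pairing of the listed weights with the (distance, time, direction) features is an assumption, since the paper gives the values but not their order, and the toy trajectories below are illustrative rather than the experimental data.

```python
import numpy as np

# Toy trajectories (x, y, t, azimuth) standing in for tracks seen by cameras C1 and C2.
traj_c1 = [(0.0, 0.0, 0.0, 45.0), (0.4, 0.4, 1.0, 45.0), (0.8, 0.8, 2.0, 45.0)]
traj_c2 = [(0.1, 0.1, 0.2, 45.3), (0.5, 0.5, 1.2, 44.8), (0.9, 0.9, 2.2, 45.1)]

thresholds = np.array([0.5, 10.0, 1.0])   # distance 0.5 m, time 10 s, angle 1 degree
weights    = np.array([0.2, 0.2, 0.6])    # assumed feature order: distance, time, direction

print(msm(traj_c1, traj_c2, thresholds, weights))   # a value near 1 suggests the same pedestrian
```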
FIGURE 5 Fusion of pedestrian trajectories using the MSM method. (a) Plots of the pedestrian trajectories
with IDs 2, 33, and 49 recorded by the C1 camera; (b) plots of the pedestrian trajectories with IDs 1, 6, and 10
recorded by the C2 camera; and (c) the fusion results for Trajectories 2 and 6; 33 and 1; and 49 and 10.
5.2 | Crowd movement analysis
To analyze the movement pattern of the crowd, the pedestrian trajectory points are connected into trajectory lines
through the sequence of trajectory ID and time, and the movement direction of each pedestrian is analyzed at the
individual level. Figure 6a shows the classification results for the movement directions of different pedestrian  trajec-
tories for the crowd. The green movement trajectory is in the southeast direction, whereas Pedestrian 72 moves in the opposite direction; moving against the majority southeast flow increases the risk of collisions. Pedestrian 2, in a stagnant state, presents a high risk in an urban venue with high foot traffic. Overall, crowd movement patterns can be detected by considering the direction of single trajectories. Figure 6b shows the movement conditions of the crowd, presenting the accumulated displacements of the target trajectories in the same movement direction and indicating that the movement of the crowd is primarily concentrated in the southwest direction,
followed by the northeast direction, corresponding to a two-way movement mode. Different movement patterns can
represent the crowd at different places and times, and scientific management can be conducted for different places.
For example, entrances and exits of large-scale sports and performance stadiums can only accommodate one-way
movement patterns, whereas the overpasses and zebra crossings of roads can accommodate two-way movement
patterns. Timely warnings are necessary for disorderly movement patterns in specific places.
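A sketch of this direction analysis is shown below, assuming trajectory points are (x, y) tuples ordered in time with x as easting and y as northing; the 45° binning follows the caption of Figure 6, and the accumulation mirrors the displacement totals in Figure 6b.

```python
import numpy as np

def movement_azimuth(traj):
    """Azimuth (degrees clockwise from north) of the start-to-end displacement vector."""
    dx, dy = traj[-1][0] - traj[0][0], traj[-1][1] - traj[0][1]   # easting, northing deltas
    return (np.degrees(np.arctan2(dx, dy)) + 360.0) % 360.0

def accumulated_displacement_by_direction(trajectories, bin_deg=45):
    """Total start-to-end displacement per direction class (cf. Figure 6b)."""
    bins = np.zeros(int(360 / bin_deg))
    for traj in trajectories:
        dx, dy = traj[-1][0] - traj[0][0], traj[-1][1] - traj[0][1]
        displacement = float(np.hypot(dx, dy))
        if displacement == 0.0:                       # stagnant pedestrian, no direction
            continue
        bins[int(movement_azimuth(traj) // bin_deg) % len(bins)] += displacement
    return bins
```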
5.3 | Crowd distribution analysis
The trajectory points at different times are used to calculate the crowd density to identify and visualize congested
regions. First, the study area is divided into 1 × 1 m grids. The number of trajectory points in the grid is counted at
two time points, that is, t1 = 15 s and t2 = 31 s, to calculate the crowd density, and thematic visualization is performed
to analyze the spatial distribution and changes in congested areas over time. Figure 7a,b shows the crowd
FIGURE 6 Crowd movement analysis. (a) The pedestrian trajectories, categorized at 45° intervals by determining their azimuth angles in accordance with the starting and ending points of the trajectories; (b) the corresponding accumulated trajectory displacements for the different categories.
density distribution at t1 and t2, respectively, in which the orange-red and red grids correspond to congested
regions. By observing the changes in the crowd density distribution at two time points, congested areas can be
identified in time, and early warnings can be promptly issued. Second, the standard deviational ellipse method is
applied to the pedestrian trajectory dataset to analyze the overall directions of the crowd distribution at t1 and t2
(Figure 7c). The center of the crowd distribution ellipses at t1 and t2 moves from A (114°18′12.82″, 34°49′07.42″)
to B (114°18′12.84″, 34°49′07.41″), and the azimuth angle of the major axis of the ellipse changes from 50.68° to
48.95°, thus realizing the spatiotemporal analysis of the overall direction of the crowd distribution.
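The two analyses above can be sketched as follows, assuming the trajectory points for one time slice are given as an (n, 2) array of projected (X, Y) coordinates in meters. The rotation-angle expression is the classical standard deviational ellipse formula and may differ in minor conventions from the GIS tool actually used by the authors.

```python
import numpy as np

def density_grid(points, x_range, y_range, cell=1.0):
    """Count trajectory points per 1 m x 1 m cell for one time slice."""
    x_edges = np.arange(x_range[0], x_range[1] + cell, cell)
    y_edges = np.arange(y_range[0], y_range[1] + cell, cell)
    counts, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=[x_edges, y_edges])
    return counts

def standard_deviational_ellipse(points):
    """Centre and major-axis azimuth (degrees from north) of the unweighted SDE."""
    x, y = points[:, 0], points[:, 1]
    cx, cy = float(x.mean()), float(y.mean())
    dx, dy = x - cx, y - cy
    A = float((dx ** 2).sum() - (dy ** 2).sum())
    C = float(2.0 * (dx * dy).sum())
    B = float(np.sqrt(A ** 2 + C ** 2))
    if C == 0:
        theta = 90.0 if A > 0 else 0.0        # degenerate case: axis-aligned ellipse
    else:
        theta = float(np.degrees(np.arctan((A + B) / C)))
    return (cx, cy), theta % 180.0
```

Comparing the grids and ellipse parameters obtained at t1 and t2 reproduces the kind of congestion and orientation change reported above.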
6 | DISCUSSION AND CONCLUSIONS
Video surveillance has become an integral part of our surroundings, and surveillance cameras deployed in urban open
spaces are very useful for urban security (Socha & Kogut, 2020). Advances in machine learning and computer vision
have resulted in the increasing use of object detection and tracking in surveillance video and its intelligent analysis
(Ahmed et al., 2018). Surveillance video is typical geotagged video data and its spatial semantics are of great value
in video analysis. Spatial information from multi-viewpoint surveillance videos of open spaces (e.g., urban squares) is
used to develop a method for crowd sensing and its spatiotemporal analysis in this study. The object detection and
tracking model YOLOv3-DeepSORT is used to extract image coordinates of pedestrians from video clips. The camera
calibration and PnP algorithm are applied to the calculation of camera interior and exterior orientation elements to
perform the geospatial mapping from image coordinates to geographic coordinates by solving for the transformation
matrix. Multiple pedestrian trajectories generated using single cameras with different viewpoints are matched and
fused using the MSM method by integrating features of distance, time and direction. A campus square is selected as
the study area for data collection, pedestrian detection and tracking, trajectory extraction and fusion experiments.
Crowd spatiotemporal analysis is performed on the pedestrian trajectory dataset by applying standard deviational ellipses and density estimation methods, and the overall characteristics of the crowd distribution and movement
patterns are identified. The results demonstrate that multiple surveillance cameras can be used in conjunction with
the pedestrian detection and tracking method and geospatial mapping of video objects to generate complete pedes-
trian trajectories in the monitoring area, as well as to create situational awareness and perform spatiotemporal anal-
ysis of the crowd, thereby providing solutions for the intelligent analysis of surveillance video and promoting the
in-depth application of geotagged videos to urban security.
In practical applications, different tactics should be employed depending on whether surveillance cameras are
deployed. For new open spaces without cameras, detailed camera parameters should be inferred by combining the data
FIGURE 7 Crowd distribution analysis. (a,b) The crowd density distribution at t1 = 15 s and t2 = 31 s. (c) The
standard deviation ellipse for the crowd distribution at t1 and t2.
for building footprints and points of interest in the surveillance area, and spatial optimization models, such as the maximum
coverage location problem, should be used to estimate the number and location of cameras for camera planning to
achieve high surveillance coverage at a low cost (Han, Li, Cui, Song, et al., 2019; Liu, Sridharan, et al., 2016). Then, camera
internal and external orientation element calibration should be performed for geospatial mapping and crowd sensing.
For spaces with deployed cameras, the camera interior and exterior orientation elements should be determined using Zhang's
(2000) calibration method and the PnP algorithm. In an expansive urban open space, the overlap area of the field of view
of the camera may be minimal or nonexistent or the crowd may be obscured by buildings, which can interrupt pedestrian
trajectories tracked by different cameras. To address this issue, pedestrian reidentification technology can be used to
match pedestrians detected by different cameras based on image features, and the pedestrian trajectories can then be
fused (Brasó & Leal-Taixé, 2020; Weng et al., 2022). A long short-term memory network and graph-based spatiotemporal
reasoning models can be used in conjunction with the MSM approach for trajectory matching to predict and comple-
ment the interrupted trajectories. Appropriate image or location obfuscation methods should also be used to effectively
address privacy protection concerns associated with video surveillance.
An important direction for future research is to combine video surveillance and the Internet of Video Things
(IoVT) with an edge computing framework (Chen, 2020). Video data are currently mainly collected by cameras and
then transmitted to a server for processing. Uninterrupted collection of surveillance video continuously generates
large quantities of video data, resulting in a high network transmission and server computing load. Visual sensors
are being continuously developed and can be combined with IoT and edge computing technology as a paradigm
of IoVT technology, such that a portion of the video processing tasks can be assigned to the camera side. As each
camera completes these processing tasks, the results are returned to the server side to achieve load balancing and
reduce the video data volume. For example, the object detection and tracking model can be integrated on the camera
side to extract dynamic targets, such as pedestrians, from each camera and send the results to the server side
for trajectory fusion and spatiotemporal analysis, thus improving the computational efficiency and usability of the
surveillance system. Another direction would be to develop an intelligent surveillance platform based on geotagged
videos for real-time analysis and efficient early warning of crowd congestion. Object trajectories determined using
multi-viewpoint surveillance videos and geospatial mapping methods can be fused for spatiotemporal analysis under
a unified geo-referencing system to determine the overall distribution of objects and significantly enhance the power
of the video surveillance system.
We consider that the methodology presented in this article (in particular, crowd sensing, and spatiotemporal
analysis based on multi-viewpoint geotagged videos) has application potential to intelligent surveillance systems
used for urban security. A variety of innovative surveillance systems could be developed by integration with spatial
analysis in GIS and video data processing algorithms for use in urban security and smart cities.
ACKNOWLEDGMENTS
Special thanks go to the editor and anonymous reviewers of this article for their constructive comments especially
during the COVID-19 pandemic, which substantially improved the quality of this article.
FUNDING INFORMATION
This research was funded by the National Natural Science Foundation of China under Grant 41871316, the National
Key R&D Program of China under Grant 2021YFE0106700, the Key Technologies R&D Program of the Henan
province under Grant 212102310421, the Foundation of Key Laboratory of Soil and Water Conservation on the
Loess Plateau of Ministry of Water Resources under Grant WSCLP202101, and Natural Resources Science and Tech-
nology Innovation Project of Henan Province under Grant No. 202016511.
CONFLICT OF INTEREST STATEMENT
No potential conflict of interest was reported by the authors.
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available on request from the corresponding author. The data are
not publicly available due to privacy and ethical restrictions.
ORCID
Zhigang Han https://orcid.org/0000-0002-9993-3382
REFERENCES
Ahmed, I., Ahmad, M., Rodrigues, J. J., Jeon, G., & Din, S. (2021). A deep learning-based social distance monitoring framework
for COVID-19. Sustainable Cities and Society, 65, 102571. https://doi.org/10.1016/j.scs.2020.102571
Ahmed, S. A., Dogra, D. P., Kar, S., & Roy, P. P. (2018). Trajectory-based surveillance analysis: A survey. IEEE Transactions on
Circuits and Systems for Video Technology, 29(7), 1985–1997. https://doi.org/10.1109/TCSVT.2018.2857489
Bae, J. S., & Song, T. L. (2008). Image tracking algorithm using template matching and PSNF-m. International Journal of Control,
Automation, and Systems, 6(3), 413–423. https://koreascience.kr/article/JAKO200822049838914.page
Bertinetto, L., Valmadre, J., Henriques, J. F., Vedaldi, A., & Torr, P. H. (2016). Fully-convolutional siamese networks for object
tracking. In B. Leibe, J. Matel, N. Sebe, & M. Welling (Eds.), Computer vision—ECCV 2016 (pp. 850–865). Springer. https://
doi.org/10.1007/978-3-319-48881-3_56
Bewley, A., Ge, Z., Ott, L., Ramos, F., & Upcroft, B. (2016). Simple online and realtime tracking. IEEE International Conference
on Image Processing, Phoenix, AR (pp. 3464–3468). IEEE. https://doi.org/10.1109/ICIP.2016.7533003
Brasó, G., & Leal-Taixé, L. (2020). Learning a neural solver for multiple object tracking. IEEE/CVF Conference on Computer
Vision and Pattern Recognition, Seattle, WA (pp. 6247–6257). IEEE. https://doi.org/10.48550/arXiv.1912.07515
Burr, A., Schaeg, N., & Hall, D. M. (2018). Assessing residential front yards using Google street view and geospatial video:
A virtual survey approach for urban pollinator conservation. Applied Geography, 92, 12–20. https://doi.org/10.1016/j.
apgeog.2018.01.010
Chen, C. W. (2020). Internet of video things: Next-generation IoT with visual sensors. IEEE Internet of Things Journal, 7(8),
6676–6685. https://doi.org/10.1109/JIOT.2020.3005727
Chen, L., Özsu, M. T., & Oria, V. (2005). Robust and fast similarity search for moving object trajectories. ACM SIGMOD International
Conference on Management of Data, Baltimore, MD (pp. 491–502). ACM. https://doi.org/10.1145/1066157.1066213
Dai, J., Li, Y., He, K., & Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. 30th Confer-
ence on Advances in Neural Information Processing Systems, Barcelona, Spain (pp. 1–29). https://doi.org/10.48550/
arXiv.1605.06409
Draghici, A., & Steen, M. V. (2018). A survey of techniques for automatically sensing the behavior of a crowd. ACM Computing
Surveys, 51(1), 1–40. https://doi.org/10.1145/3129343
Du, K., Ju, Y., Jin, Y., Li, G., Qian, S., & Li, Y. (2012). MeanShift tracking algorithm with adaptive block color histogram. Second
International Conference on Consumer Electronics, Communications and Networks, Yichang, China (pp. 2692–2695). IEEE.
https://doi.org/10.1109/CECNet.2012.6202074
Elharrouss, O., Almaadeed, N., & Al-Maadeed, S. (2021). A review of video surveillance systems. Journal of Visual Communica-
tion and Image Representation, 77, 103116. https://doi.org/10.1016/j.jvcir.2021.103116
Furht, B. (2008). Geographic video content. Encyclopedia of multimedia (2nd ed., pp. 271–272). https://doi.
org/10.1007/978-0-387-78414-4_328
Furtado, A. S., Kopanaki, D., Alvares, L. O., & Bogorny, V. (2016). Multidimensional similarity measuring for semantic trajecto-
ries. Transactions in GIS, 20(2), 280–298. https://doi.org/10.1111/tgis.12156
Girshick, R. (2015). Fast R-CNN. IEEE International Conference on Computer Vision (ICCV), Santiago, Chile (pp. 1440–1448).
IEEE. https://doi.org/10.1109/ICCV.2015.169
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic
segmentation. IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH (pp. 580–587). IEEE. https://
doi.org/10.1109/CVPR.2014.81
Han, S., Dong, X., Hao, X., & Miao, S. (2022). Extracting objects' spatial–temporal information based on surveillance videos
and the digital surface model. ISPRS International Journal of Geo-Information, 11(2), 103. https://doi.org/10.3390/
ijgi11020103
Han, Z., Cui, C., Kong, Y., Qin, F., & Fu, P. (2016). Video data model and retrieval service framework using geographic informa-
tion. Transactions in GIS, 20(5), 701–717. https://doi.org/10.1111/tgis.12175
Han, Z., Li, S., Cui, C., Han, D., & Song, H. (2019). Geosocial media as a proxy for security: A review. IEEE Access, 7, 154224–
154238. https://doi.org/10.1109/ACCESS.2019.2949115
Han, Z., Li, S., Cui, C., Song, H., Kong, Y., & Qin, F. (2019). Camera planning for area surveillance: A new method for coverage
inference and optimization using location-based service data. Computers, Environment and Urban Systems, 78, 101396.
https://doi.org/10.1016/j.compenvurbsys.2019.101396
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. IEEE International Conference on Computer Vision, Venice,
Italy (pp. 2980–2988). IEEE. https://doi.org/10.1109/ICCV.2017.322
Hossain, S., & Lee, D. J. (2019). Deep learning-based real-time multiple-object detection and tracking from aerial imagery via
a flying robot with GPU-based embedded devices. Sensors, 19(15), 3371. https://doi.org/10.3390/s19153371
Jamonnak, S., Zhao, Y., Curtis, A., Al-Dohuki, S., Ye, X., Kamw, F., & Yang, J. (2020). GeoVisuals: A visual analytics approach to
leverage the potential of spatial videos and associated geonarratives. International Journal of Geographical Information
Science, 34(11), 2115–2135. https://doi.org/10.1080/13658816.2020.1737700
Jamonnak, S., Zhao, Y., Huang, X., & Amiruzzaman, M. (2021). Geo-context aware study of vision-based autonomous driving
models and spatial video data. IEEE Transactions on Visualization and Computer Graphics, 28(1), 1019–1029. https://doi.
org/10.1109/TVCG.2021.3114853
Janowicz, K., Gao, S., McKenzie, G., Hu, Y., & Bhaduri, B. (2020). GeoAI: Spatially explicit artificial intelligence techniques for
geographic knowledge discovery and beyond. International Journal of Geographical Information Science, 34(4), 625–636.
https://doi.org/10.1080/13658816.2019.1684500
Jing, C., Hu, Y., Zhang, H., Du, M., Xu, S., Guo, X., & Jiang, J. (2022). Context-aware matrix factorization for the identification
of urban functional regions with POI and taxi OD data. ISPRS International Journal of Geo-Information, 11(6), 351. https://
doi.org/10.3390/ijgi11060351
Jing, C., Zhu, Y., Du, M., & Liu, X. (2021). Visualizing spatiotemporal patterns of city service demand through a space-time
exploratory approach. Transactions in GIS, 25(4), 1766–1783. https://doi.org/10.1111/tgis.12820
Kilger, M. A. (1992). Shadow handler in a video-based real-time traffic monitoring system. IEEE Workshop on Applications of
Computer Vision, Palm Springs, CA (pp. 11–18). IEEE. https://doi.org/10.1109/ACV.1992.240332
Kim, S. H., Ay, S. A., & Zimmermann, R. (2010). Design and implementation of geo-tagged video search framework. Journal of
Visual Communication and Image Representation, 21(8), 773–786. https://doi.org/10.1016/j.jvcir.2010.07.004
Kong, Y. (2010). Design of GeoVideo data model and implementation of web-based VideoGIS. Geomatics and Information
Science of Wuhan University, 35(2), 133–137. https://doi.org/10.13203/j.whugis2010.02.019
Laufs, J., Borrion, H., & Bradford, B. (2020). Security and the smart city: A systematic review. Sustainable Cities and Society, 55,
102023. https://doi.org/10.1016/j.scs.2020.102023
Lepetit, V., Moreno-Noguer, F., & Fua, P. (2009). EPnP: An accurate O (n) solution to the PnP problem. International Journal of
Computer Vision, 81(2), 155–166. https://doi.org/10.1007/s11263-008-0152-6
Lewis, P., Fotheringham, S., & Winstanley, A. (2011). Spatial video and GIS. International Journal of Geographical Information
Science, 25(5), 697–716. https://doi.org/10.1080/13658816.2010.505196
Li, C., Liu, Z., Zhao, Z., & Dai, Z. (2021). A fast fusion method for multi-videos with three-dimensional GIS scenes. Multimedia
Tools and Applications, 80(2), 1671–1686. https://doi.org/10.1007/s11042-020-09742-4
Li, J., Wei, J., Jiang, J., Lu, Y., Liu, L., Tang, Y., & Li, X. (2022). Spatio-temporal information extraction method for dynamic
targets in multi-perspective surveillance video. Acta Geodaetica et Cartographica Sinica, 51(3), 388–400. https://doi.
org/10.11947/j.AGCS.2022.20200507
Li, X., Hu, W., Shen, C., Zhang, Z., Dick, A., & Hengel, A. V. D. (2013). A survey of appearance models in visual object tracking.
ACM Transactions on Intelligent Systems and Technology, 4(4), 1–48. https://doi.org/10.1145/2508037.2508039
Lin, B., Xu, C., Lan, X., & Zhou, L. (2020a). A method of perspective normalization for video images based on map data. Annals
of GIS, 26(1), 35–47. https://doi.org/10.1080/19475683.2019.1704870
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detec-
tion. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI (pp. 936–944). IEEE. https://doi.
org/10.1109/CVPR.2017.106
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2020b). Focal loss for dense object detection. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 42(2), 318–327. https://doi.org/10.1109/TPAMI.2018.2858826
Lipton, A. J., Fujiyoshi, H., & Patil, R. S. (1998). Moving target classification and tracking from real-time video. Fourth IEEE
Workshop on Applications of Computer Vision WACV'98 (Cat. No. 98EX201), Princeton, NJ (pp. 8–14). IEEE. https://doi.
org/10.1109/ACV.1998.732851
Liu, J., Sridharan, S., & Fookes, C. (2016). Recent advances in camera planning for large area surveillance: A comprehensive
review. ACM Computing Surveys, 49, 6–37. https://doi.org/10.1145/2906148
Liu, L., Ouyang, W., Wang, X., Fieguth, P., Chen, J., Liu, X., & Pietikäinen, M. (2020). Deep learning for generic object detection:
A survey. International Journal of Computer Vision, 128(2), 261–318. https://doi.org/10.1007/s11263-019-01247-4
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In B.
Leibe, J. Matas, N. Sebe, & M. Welling (Eds.), European Conference on Computer Vision—ECCV 2016 (pp. 21–37). Springer.
https://doi.org/10.1007/978-3-319-46448-0_2
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2),
91–110. https://doi.org/10.1023/B:VISI.0000029664.99615.94
Lu, Y., & Shahabi, C. (2017). Efficient indexing and querying of geo-tagged aerial videos. 25th ACM SIGSPATIAL Interna-
tional Conference on Advances in Geographic Information Systems, Redondo Beach, CA (pp. 1–10). ACM. https://doi.
org/10.1145/3139958.3140046
Luo, J., Joshi, D., Yu, J., & Gallagher, A. (2011). Geotagging in multimedia and computer vision—A survey. Multimedia Tools and
Applications, 51(1), 187–211. https://doi.org/10.1007/s11042-010-0623-y
Ma, H., Arslan Ay, S., Zimmermann, R., & Kim, S. H. (2014). Large-scale geo-tagged video indexing and queries. GeoInformat-
ica, 18(4), 671–697. https://doi.org/10.1007/s10707-013-0199-6
Mae, Y., Shirai, Y., Miura, J., & Kuno, Y. (1996). Object tracking in cluttered background based on optical flow and edges.
13th International Conference on Pattern Recognition, Vienna, Austria (Vol. 1, pp. 196–200). https://doi.org/10.1109/
ICPR.1996.546018
Marvasti-Zadeh, S. M., Cheng, L., Ghanei-Yakhdan, H., & Kasaei, S. (2021). Deep learning for visual tracking: A compre-
hensive survey. IEEE Transactions on Intelligent Transportation Systems, 23(5), 3943–3968. https://doi.org/10.1109/
TITS.2020.3046478
Mills, J. W., Curtis, A., Kennedy, B., Kennedy, S. W., & Edwards, J. D. (2010). Geospatial video for field data collection. Applied
Geography, 30(4), 533–547. https://doi.org/10.1016/j.apgeog.2010.03.008
Milosavljević, A., Dimitrijević, A., & Rančić, D. (2010). GIS-augmented video surveillance. International Journal of Geographical
Information Science, 24(9), 1415–1433. https://doi.org/10.1080/13658811003792213
Milosavljević, A., Rančić, D., Dimitrijević, A., Predić, B., & Mihajlović, V. (2016). Integration of GIS and video surveillance.
International Journal of Geographical Information Science, 30(10), 2089–2107. https://doi.org/10.1080/13658816.201
6.1161197
Milosavljević, A., Rančić, D., Dimitrijević, A., Predić, B., & Mihajlović, V. (2017). A method for estimating surveillance video
georeferences. ISPRS International Journal of Geo-Information, 6(7), 211. https://doi.org/10.3390/ijgi6070211
Neri, A., Colonnese, S., Russo, G., & Talone, P. (1998). Automatic moving object and background separation. Signal Processing,
66(2), 219–232. https://doi.org/10.1016/S0165-1684(98)00007-3
Newburn, T. (2021). The causes and consequences of urban riot and unrest. Annual Review of Criminology, 4, 53–73. https://
doi.org/10.1146/annurev-criminol-061020-124931
Nishiyama, H. (2018). Crowd surveillance: The (in) securitization of the urban body. Security Dialogue, 49(3), 200–216. https://
doi.org/10.1177/0967010617741436
Patel, T., Yao, A. Y. H., Qiang, Y., Ooi, W. T., & Zimmermann, R. (2021). Multi-camera video scene graphs for surveillance videos
indexing and retrieval. IEEE International Conference on Image Processing (ICIP), Anchorage, AK (pp. 2383–2387). IEEE.
https://doi.org/10.1109/ICIP42928.2021.9506713
Punn, N. S., Sonbhadra, S. K., Agarwal, S., & Rai, G. (2020). Monitoring COVID-19 social distancing with person detec-
tion and tracking via fine-tuned YOLO v3 and DeepSORT techniques. arXiv:2005.01385. https://doi.org/10.48550/
arXiv.2005.01385
Qian, X., Li, M., Ren, Y., & Jiang, S. (2019). Social media based event summarization by user–text–image co-clustering.
Knowledge-Based Systems, 164, 107–121. https://doi.org/10.1016/j.knosys.2018.10.028
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV (pp. 779–788). IEEE. https://doi.org/10.1109/
CVPR.2016.91
Redmon, J., & Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv:1804.02767. https://doi.org/10.48550/
arXiv.1804.02767
Rumora, L., Majić, I., Miler, M., & Medak, D. (2021). Spatial video remote sensing for urban vegetation mapping using vegeta-
tion indices. Urban Ecosystem, 24(1), 21–33. https://doi.org/10.1007/s11252-020-01002-5
Sahbani, B., & Adiprawita, W. (2016). Kalman filter and iterative-Hungarian algorithm implementation for low complexity
point tracking as part of fast multiple object tracking system. Sixth International Conference on System Engineering and
Technology (ICSET), Shah Alam, Malaysia (pp. 109–115). https://doi.org/10.1109/ICSEngT.2016.7849633
Sankaranarayanan, K., & Davis, J. W. (2008). A fast linear registration framework for multi-camera GIS coordination. Fifth
IEEE International Conference on Advanced Video and Signal Based Surveillance, Santa Fe, NM (pp. 245–251). https://doi.
org/10.1109/AVSS.2008.20
Senin, P. (2008). Dynamic time warping algorithm review. Information and Computer Science Department, University of Hawaii
at Manoa, Honolulu. https://csdl.ics.hawaii.edu/techreports/2008/08-04/08-04.pdf
Shao, Z., Li, C., Li, D., Altan, O., Zhang, L., & Ding, L. (2020). An accurate matching method for projecting vector data into
surveillance video to monitor and protect cultivated land. ISPRS International Journal of Geo-Information, 9(7), 448.
https://doi.org/10.3390/ijgi9070448
Simon, D. (2001). Kalman filtering. Embedded Systems Programming, 14(6), 72–79. http://abel.math.harvard.edu/archive/116_
fall_03/handouts/kalman.pdf
Socha, R., & Kogut, B. (2020). Urban video surveillance as a tool to improve security in public spaces. Sustainability, 12(15),
6210. https://doi.org/10.3390/su12156210
Stauffer, C., & Grimson, W. E. L. (1999). Adaptive background mixture models for real-time tracking. IEEE Computer Soci-
ety Conference on Computer Vision and Pattern Recognition, Fort Collins, CO (Vol. 2, pp. 246–252). IEEE. https://doi.
org/10.1109/CVPR.1999.784637
Subudhi, B. N., Rout, D. K., & Ghosh, A. (2019). Big data analytics for video surveillance. Multimedia Tools and Applications,
78(18), 26129–26162. https://doi.org/10.1007/s11042-019-07793-w
Svoboda, T. (2007). Kanade-Lucas-Tomasi tracking (KLT tracker). Czech Technical University in Prague, Center for Machine
Perception. https://cs.gmu.edu/~zduric/cs682/slides/klt.pdf
Vlachos, M., Kollios, G., & Gunopulos, D. (2002). Discovering similar multidimensional trajectories. 18th International Confer-
ence on Data Engineering, San Jose, CA (pp. 673–684). https://doi.org/10.1109/ICDE.2002.994784
Wang, X., Wang, M., Liu, X., Zhu, L., Glade, T., Chen, M., Zhao, W., & Xie, Y. (2022). A novel quality control model of rainfall
estimation with videos—A survey based on multi-surveillance cameras. Journal of Hydrology, 605, 127312. https://doi.
org/10.1016/j.jhydrol.2021.127312
Weng, X., Ivanovic, B., Kitani, K., & Pavone, M. (2022). Whose track is it anyway? Improving robustness to tracking errors with
affinity-based trajectory prediction. IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA
(pp. 6573–6582). IEEE. https://doi.org/10.1109/CVPR52688.2022.00646
Wojke, N., Bewley, A., & Paulus, D. (2017). Simple online and realtime tracking with a deep association metric. IEEE Inter-
national Conference on Image Processing (ICIP), Beijing, China (pp. 3645–3649). IEEE. https://doi.org/10.1109/
ICIP.2017.8296962
Wu, C., Zhu, Q., Zhang, Y., Du, Z., Ye, X., Qin, H., & Zhou, Y. (2017). A NOSQL–SQL hybrid organization and management
approach for real-time geospatial data: A case study of public security video surveillance. ISPRS International Journal of
Geo-Information, 6(1), 21. https://doi.org/10.3390/ijgi6010021
Wu, C., Zhu, Q., Zhang, Y., Xie, X., Qin, H., Zhou, Y., Zhang, P., & Yang, W. (2018). Movement-oriented objectified organiza-
tion and retrieval approach for heterogeneous GeoVideo data. ISPRS International Journal of Geo-Information, 7(7), 255.
https://doi.org/10.3390/ijgi7070255
Xie, Y., Wang, M., Liu, X., Wang, X., Wu, Y., Wang, F., & Wang, X. (2022). Multi-camera video synopsis of a geographic scene
based on optimal virtual viewpoint. Transactions in GIS, 26(3), 1221–1239. https://doi.org/10.1111/tgis.12862
Xie, Y., Wang, M., Liu, X., Wang, Z., Mao, B., Wang, F., & Wang, X. (2021). Spatiotemporal retrieval of dynamic video object
trajectories in geographical scenes. Transactions in GIS, 25(1), 450–467. https://doi.org/10.1111/tgis.12696
Xie, Y., Wang, M., Liu, X., & Wu, Y. (2017). Integration of GIS and moving objects in surveillance video. ISPRS International
Journal of Geo-Information, 6(4), 94. https://doi.org/10.3390/ijgi6040094
Yilmaz, A., Javed, O., & Shah, M. (2006). Object tracking: A survey. ACM Computing Surveys, 38(4), 13-es. https://doi.
org/10.1145/1177352.1177355
Zhang, J., Zheng, Y., & Qi, D. (2017). Deep spatio-temporal residual networks for citywide crowd flows prediction. Thirty-First
AAAI Conference on Artificial Intelligence, San Francisco, California, USA (pp. 1–7). https://doi.org/10.1609/aaai.
v31i1.10735
Zhang, X., Hao, X., Liu, S., Wang, J., Xu, J., & Hu, J. (2019). Multi-target tracking of surveillance video with differential
YOLO and DeepSORT. Eleventh International Conference on Digital Image Processing (ICDIP 2019), Los Angeles, CA (Vol.
11179, pp. 701–710). SPIE. https://doi.org/10.1117/12.2540269
Zhang, X., Shi, X., Luo, X., Sun, Y., & Zhou, Y. (2021). Real-time web map construction based on multiple cameras and GIS.
ISPRS International Journal of Geo-Information, 10(12), 803. https://doi.org/10.3390/ijgi10120803
Zhang, Z. (2000). A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 22(11), 1330–1334. https://doi.org/10.1109/34.888718
Zhao, R., Wang, D., Wang, Y., Han, C., Jia, P., Li, C., & Ma, Y. (2021). Macroscopic view: Crowd evacuation dynamics at
T-shaped street junctions using a modified aw-Rascle traffic flow model. IEEE Transactions on Intelligent Transportation
Systems, 22(10), 6612–6621. https://doi.org/10.1109/TITS.2021.3095829
Zou, Z., Shi, Z., Guo, Y., & Ye, J. (2019). Object detection in 20 years: A survey. arXiv:1905.05055. https://doi.org/10.48550/
arXiv.1905.05055
How to cite this article: Liu, F., Han, Z., Song, H., Wang, J., Liu, C., & Ban, G. (2023). Crowd sensing and
spatiotemporal analysis in urban open space using multi-viewpoint geotagged videos. Transactions in GIS,
27, 494–515. https://doi.org/10.1111/tgis.13036
APPENDIX A
THE CAMERA INTERIOR ORIENTATION ELEMENTS, DISTORTION CORRECTION COEFFICIENTS, AND EXTERIOR ORIENTATION ELEMENT MATRIX
Camera C1
  Image center point (u0, v0): 994.35, 530.72
  Camera focal length (fx, fy): 159.31, 158.71
  Distortion correction coefficients (Ak1, Ak2, Ak3, Aq1, Aq2): 0.184952, −0.182719, −0.105806, 0.006815, 0.153519
  Exterior orientation elements matrix A:
    [ 1.363081e+03   1.091650e+03   1.482575e+05 ]
    [ 8.446769e+02   8.446769e+02   8.194328e+04 ]
    [ 3.986530e−01   1.143757e−01   9.241020e+01 ]

Camera C2
  Image center point (u0, v0): 967.99, 521.39
  Camera focal length (fx, fy): 253.18, 164.39
  Distortion correction coefficients (Ak1, Ak2, Ak3, Aq1, Aq2): 0.037365, −0.022141, 0.044989, 0.009549, 0.004931
  Exterior orientation elements matrix A:
    [ 1.935160e+03   7.660439e+01   1.881655e+05 ]
    [ 1.994035e+02   6.504662e+03   1.745761e+05 ]
    [ 7.312295e−01   2.701989e−03   1.161133e+02 ]

Camera C3
  Image center point (u0, v0): 962.45, 34.06
  Camera focal length (fx, fy): 160.89, 262.40
  Distortion correction coefficients (Ak1, Ak2, Ak3, Aq1, Aq2): −0.009123, 0.514534, −0.058777, 0.001946, −0.937182
  Exterior orientation elements matrix A:
    [ 9.844750e+02   1.138992e+03   4.123191e+04 ]
    [ 8.393584e+02   1.076158e+03   1.115333e+05 ]
    [ 2.037393e−01   1.246928e−01   5.489779e+01 ]

Camera C4
  Image center point (u0, v0): 956.08, 521.43
  Camera focal length (fx, fy): 155.22, 175.30
  Distortion correction coefficients (Ak1, Ak2, Ak3, Aq1, Aq2): 0.043948, 0.132562, −0.091745, 0.000965, −0.200468
  Exterior orientation elements matrix A:
    [ 2.631332e+02   5.564476e+01   8.320149e+04 ]
    [ 6.413476e+02   1.383987e+03   1.644971e+05 ]
    [ 7.394724e−01   1.929569e−01   1.515290e+02 ]
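For reference, the sketch below illustrates one way a 3 × 3 matrix of this form can be applied: treating it as a planar projective (homography-style) transformation of homogeneous pixel coordinates. It uses the matrix listed above for C1 (with the exponent signs as reconstructed here); the function name image_to_geo and the example pixel position are illustrative assumptions and do not reproduce the exact conventions of the transformation used in the paper.

```python
import numpy as np

# Exterior orientation elements matrix listed for camera C1 in Appendix A
# (third-row exponent signs reconstructed from the printed table).
A_C1 = np.array([
    [1.363081e+03, 1.091650e+03, 1.482575e+05],
    [8.446769e+02, 8.446769e+02, 8.194328e+04],
    [3.986530e-01, 1.143757e-01, 9.241020e+01],
])


def image_to_geo(A, u, v):
    """Map a pixel coordinate (u, v) through a 3x3 projective matrix.

    Assumes a homography-style model on homogeneous coordinates; any axis
    conventions or coordinate offsets used in the paper are not reproduced.
    """
    x, y, w = A @ np.array([u, v, 1.0])
    return x / w, y / w


# Illustrative call with an arbitrary pixel position.
print(image_to_geo(A_C1, 550.0, 440.0))
```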
APPENDIX B
THE IMAGE AND GEOSPATIAL COORDINATES FOR THE 10 POINTS
No. | C1 camera (u, v) | C2 camera (u, v) | C3 camera (u, v) | C4 camera (u, v) | Lon | Lat | x | y
1 | 261, 544 | 1095, 646 | 1801, 483 | 819, 787 | 114°18′12.32″ | 34°49′7.59″ | 3,857,923.15 | 802,164.55
2 | 138, 583 | 390, 659 | 1177, 472 | 1836, 763 | 114°18′12.87″ | 34°49′7.00″ | 3,857,905.42 | 802,179.13
3 | 895, 606 | 109, 725 | 612, 469 | 1572, 748 | 114°18′13.23″ | 34°49′7.02″ | 3,857,906.34 | 802,188.26
4 | 1526, 573 | 766, 722 | 118, 524 | 977, 743 | 114°18′13.58″ | 34°49′7.44″ | 3,857,919.58 | 802,196.73
5 | 1104, 560 | 1291, 735 | 627, 535 | 685, 754 | 114°18′13.31″ | 34°49′7.73″ | 3,857,928.29 | 802,189.57
6 | 876, 521 | 1527, 684 | 1119, 552 | 426, 812 | 114°18′12.78″ | 34°49′7.77″ | 3,857,929.08 | 802,176.06
7 | 362, 560 | 993, 658 | 1489, 511 | 1118, 825 | 114°18′12.34″ | 34°49′7.43″ | 3,857,918.23 | 802,165.22
8 | 459, 580 | 720, 656 | 1167, 492 | 1384, 788 | 114°18′12.65″ | 34°49′7.30″ | 3,857,914.49 | 802,173.23
9 | 716, 599 | 509, 686 | 841, 496 | 1461, 771 | 114°18′12.90″ | 34°49′7.18″ | 3,857,911.00 | 802,179.70
10 | 752, 561 | 874, 662 | 944, 497 | 1087, 749 | 114°18′12.84″ | 34°49′7.62″ | 3,857,924.51 | 802,177.74
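Control-point correspondences of this kind are the input needed to estimate a per-camera image-to-map projective transformation. As an illustration only, and not necessarily the estimation procedure used in the paper, the sketch below fits a planar homography for camera C1 from the ten points above with OpenCV's findHomography (thousands separators removed from the printed coordinates).

```python
import numpy as np
import cv2

# Pixel coordinates (u, v) of the ten control points in camera C1 (Appendix B).
c1_pixels = np.array([
    [261, 544], [138, 583], [895, 606], [1526, 573], [1104, 560],
    [876, 521], [362, 560], [459, 580], [716, 599], [752, 561],
], dtype=np.float64)

# Corresponding projected geospatial coordinates (x, y) from Appendix B.
geo_xy = np.array([
    [3857923.15, 802164.55], [3857905.42, 802179.13], [3857906.34, 802188.26],
    [3857919.58, 802196.73], [3857928.29, 802189.57], [3857929.08, 802176.06],
    [3857918.23, 802165.22], [3857914.49, 802173.23], [3857911.00, 802179.70],
    [3857924.51, 802177.74],
], dtype=np.float64)

# Least-squares planar homography from image space to the map plane;
# RANSAC (cv2.RANSAC) could be used instead if some points were unreliable.
H, _ = cv2.findHomography(c1_pixels, geo_xy, method=0)
print(H)
```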
APPENDIX C
THE EVALUATION OF THE COORDINATE TRANSFORMATION
No. | Camera ID | Image coordinates (u, v) | Real geospatial coordinates (Lon, Lat, x, y) | Transformed geospatial coordinates (Lon, Lat, x, y) | Errors (x, y)
1 | C1 | 550, 440 | 114°18′12.30″, 34°49′07.71″, 802,163.81, 3,857,926.94 | 114°18′12.27″, 34°49′07.64″, 802,163.35, 3,857,924.72 | 0.464, 2.214
2 | C1 | 601, 345 | 114°18′12.41″, 34°49′07.86″, 802,166.50, 3,857,931.43 | 114°18′12.37″, 34°49′07.83″, 802,165.46, 3,857,930.62 | 1.033, 0.812
3 | C1 | 537, 547 | 114°18′12.27″, 34°49′07.55″, 802,163.40, 3,857,921.99 | 114°18′12.25″, 34°49′07.43″, 802,162.86, 3,857,918.22 | 0.543, 3.771
4 | C2 | 809, 320 | 114°18′12.79″, 34°49′07.88″, 802,176.09, 3,857,932.54 | 114°18′12.72″, 34°49′07.89″, 802,174.30, 3,857,932.75 | 1.792, −0.211
5 | C2 | 682, 274 | 114°18′12.56″, 34°49′07.96″, 802,170.23, 3,857,934.85 | 114°18′12.51″, 34°49′07.97″, 802,168.87, 3,857,935.15 | 1.358, −0.308
6 | C3 | 606, 154 | 114°18′12.42″, 34°49′08.15″, 802,166.65, 3,857,940.39 | 114°18′12.38″, 34°49′08.21″, 802,165.56, 3,857,942.19 | 1.085, −1.790
7 | C3 | 757, 98 | 114°18′12.70″, 34°49′08.22″, 802,173.68, 3,857,943.01 | 114°18′12.64″, 34°49′08.32″, 802,171.96, 3,857,946.02 | 1.722, −3.012
8 | C4 | 421, 397 | 114°18′12.07″, 34°49′07.78″, 802,158.12, 3,857,928.88 | 114°18′12.06″, 34°49′07.72″, 802,157.83, 3,857,926.94 | 0.290, 1.938
9 | C4 | 491, 258 | 114°18′12.21″, 34°49′07.99″, 802,161.30, 3,857,935.39 | 114°18′12.19″, 34°49′08.00″, 802,160.73, 3,857,935.55 | 0.567, −0.166
RMSE (x, y): 1.115, 2.000
MAE (x, y): 0.984, 1.580
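The RMSE and MAE reported in the last two rows follow their standard definitions over the nine check points. A minimal sketch that reproduces the reported values from the per-point errors (values are in the units of the projected coordinates):

```python
import numpy as np

# Per-point planar errors (x, y) from Appendix C.
errors = np.array([
    [0.464, 2.214], [1.033, 0.812], [0.543, 3.771],
    [1.792, -0.211], [1.358, -0.308], [1.085, -1.790],
    [1.722, -3.012], [0.290, 1.938], [0.567, -0.166],
])

rmse = np.sqrt(np.mean(errors ** 2, axis=0))  # root-mean-square error per axis
mae = np.mean(np.abs(errors), axis=0)         # mean absolute error per axis

print(f"RMSE: x = {rmse[0]:.3f}, y = {rmse[1]:.3f}")  # ~1.115, ~2.000
print(f"MAE:  x = {mae[0]:.3f}, y = {mae[1]:.3f}")    # ~0.984, ~1.580
```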