Crowd sensing and spatiotemporal analysis in urban open space using multi‐viewpoint geotagged videos

Authors: Feng Liu 1,2 | Zhigang Han 1,2,3,4 | Hongquan Song 1,2,4 | Jiayao Wang 1,2,3,5 | Chun Liu 3,5,6 | Gaohan Ban 1,2

Abstract

Increasing concern for urban public safety has motivated the deployment of a large number of surveillance cameras in open spaces such as city squares, stations, and shopping malls. The efficient detection of crowd dynamics in urban open spaces using multi-viewpoint surveillance videos continues to be a fundamental problem in the field of urban security. The use of existing methods for extracting features from video images has resulted in significant progress in single-camera image space. However, surveillance videos are geotagged videos with location information, and few studies have fully exploited the spatial semantics of these videos. In this study, multi-viewpoint videos in geographic space are used to fuse object trajectories for crowd sensing and spatiotemporal analysis. The YOLOv3-DeepSORT model is used to detect pedestrians and extract the corresponding image coordinates; spatial semantics (such as the position of a pedestrian in the field of view of the camera) are then combined to build a projection transformation matrix and map the objects recorded by a single camera to geographic space. Trajectories from multi-viewpoint videos are fused based on the features of location, time, and direction to generate a complete pedestrian trajectory. Then, crowd spatial pattern analysis, density estimation, and motion trend analysis are performed. Experimental results demonstrate that the proposed method can be used to identify crowd dynamics and analyze the corresponding spatiotemporal pattern in an urban open space from a global perspective, providing a means of intelligent spatiotemporal analysis of geotagged videos.
Transactions in GIS. 2023;27:494–515. © 2023 John Wiley & Sons Ltd. wileyonlinelibrary.com/journal/tgis
1 Key Laboratory of Geospatial Technology for the Middle and Lower Yellow River Regions (Henan University), Ministry of Education, Kaifeng, China
2 College of Geography and Environmental Science, Henan University, Kaifeng, China
3 Henan Industrial Technology Academy of Spatiotemporal Big Data, Henan University, Zhengzhou, China
4 Urban Big Data Institute, Henan University, Kaifeng, China
5 Henan Technology Innovation Center of Spatiotemporal Big Data, Henan University, Zhengzhou, China
6 School of Computer and Information Engineering, Henan University, Kaifeng, China

Correspondence
Zhigang Han, Key Laboratory of Geospatial Technology for the Middle and Lower Yellow River Regions (Henan University), Ministry of Education, Kaifeng, China.
Email: zghan@henu.edu.cn

Funding information
National Natural Science Foundation of China
DOI: 10.1111/tgis.13036
Received: 28 June 2022    Revised: 25 January 2023    Accepted: 6 February 2023
1 | INTRODUCTION
Continuous urbanization and the rapid growth of urban populations in recent years have resulted in increased atten-
tion being given to urban security (Laufs et al., 2020; Nishiyama, 2018; Socha & Kogut, 2020). Urban open spaces,
such as city squares, large shopping malls and stations, are characterized by dense crowds, frequent exchanges,
and complex situations. As a result, these spaces often become primary targets for terrorism (Jing et al., 2021;
Newburn, 2021; Qian et al., 2019), and the potential occurrence of emergencies, such as crowd stampedes (Zhang
et al., 2017; Zhao et al., 2021), poses severe challenges to urban security. In urban open spaces, the perception and
analysis of the temporal and spatial dynamics of crowds can provide key support for accurate decision-making, effi-
cient early warnings and emergency responses to relevant urban security issues (Draghici & Steen, 2018; Han, Li, Cui,
Han, et al., 2019; Socha & Kogut, 2020).
Continuous advancements in smart city development have enabled the deployment of large numbers of surveillance cameras in urban open spaces to capture real-time video data as a significant means of triggering early warnings
for crowd congestion, deterring criminal behavior, and ensuring urban security. This method has been widely used
in major cities around the world due to its low cost and ease of maintenance (Laufs et al., 2020). Surveillance video
is typically observed 24 h a day by dedicated personnel to achieve real-time monitoring of scenes, which, however,
makes continuous surveillance very expensive (Ahmed et al., 2018; Elharrouss et al., 2021). Automatic identification
of dynamic targets, such as crowds, from surveillance videos, perception of the spatial and temporal distribution of
these targets, and early warning predictions have become major concerns in urban security. The rapid development
of computer vision in recent years has facilitated significant progress in object detection and analysis in image space.
A series of deep learning-based object detection and tracking models, such as the region-based convolutional neural
network (RCNN), You Only Look Once (YOLO) network and Siamese Net, have been developed (Li et al., 2013; Liu
et al., 2020; Marvasti-Zadeh et al., 2021; Zou et al., 2019). Attention has been given to analyzing and mining surveil-
lance video data in many fields (Subudhi et al., 2019).
Video collected by surveillance cameras contains a stream representing observations of a particular geographic
space. This stream is a typical geotagged video that includes both spatial and temporal features captured through
ground- or nonground-based cameras with interior and exterior orientation elements (Furht, 2008; Kong, 2010).
Geotagged video is natural perception data that contains rich spatial semantics (Han et al., 2016; Jamonnak
et al., 2021; Lewis et al., 2011). Integrating surveillance video data with a geographic information system (GIS), using
a unified geographic reference system for video data management and analysis, and enhancing urban video surveil-
lance systems are very useful for urban security (Milosavljević et al., 2016; Patel et al., 2021; Wu et al., 2017; Xie
et al., 2017). Continual progress in object detection and tracking algorithms based on deep learning has enabled the
implementation of intelligent analysis and target perception of surveillance videos by extracting moving objects from
surveillance videos. However, research in this area has focused on feature analysis of images from single cameras, and
there has been limited investigation of fusion of multi-viewpoint surveillance video deployed in multiple locations
and relatively little spatiotemporal analysis of crowd trajectories (Elharrouss et al., 2021; Li et al., 2022; Milosavljević
et al., 2016; Zhang et al., 2019). In this study, multi-viewpoint surveillance video is used to conduct crowd sensing
in an urban open space and spatiotemporal analysis in a unified geographic space, providing a means of using spatial
semantics to perform surveillance video analysis and essential support for urban public safety. Two issues are consid-
ered in this study: (1) How can surveillance video be used to perform pedestrian detection and tracking? (2) How can
geospatial mapping from video image space to geographic space and pedestrian trajectory fusion be performed for
application to crowd spatiotemporal analysis?
The remainder of this article is organized as follows. A literature review is presented in Section 2. A methodology
for crowd sensing and spatiotemporal analysis in an urban open space using multi-viewpoint geotagged videos is
introduced in Section 3. Experiments on pedestrian detection and tracking and geospatial mapping are described in
Section 4. The results of pedestrian trajectory fusion and crowd spatiotemporal analysis are presented in Section 5.
The article is concluded in Section 6 with a brief summary and discussion.
2 | RELATED STUDIES
Video surveillance systems constitute one of the most active research areas in computer vision (Subudhi et al., 2019).
In major cities around the world, thousands of cameras collect massive quantities of video data every day. Detect-
ing and tracking moving objects are key to video surveillance (Zou et al., 2019). Object detection is the process of
identifying boxes and categories for objects in video images (Liu et al., 2020). There are primarily two types of object
detection algorithms, that is, conventional visual detection and deep learning-based methods. The former primarily
use traditional computer vision algorithms based on image features, including histogram of oriented gradient (HOG)
detection, frame difference, background difference, optical flow, and others (Kilger, 1992; Lipton et al., 1998; Mae
et al., 1996; Neri et al., 1998; Stauffer & Grimson, 1999). Deep learning methods have been applied to object detec-
tion using two types of algorithms. One type of algorithm employs two-stage detection to extract a set of object
candidate boxes by a selective search and then inputs each candidate box into the convolutional neural network
(CNN) for feature extraction and recognition of object categories. Algorithms within this category include RCNN, Fast
RCNN, spatial pyramid pooling network (SPP-Net), region-based fully convolutional network (R-FCN), mask RCNN,
and feature pyramid network, which have high accuracies but low calculation speeds (Dai et al., 2016; Girshick, 2015;
Girshick et al., 2014; He et al., 2017; Lin et al., 2017). The other type of algorithm performs single-stage detection to
enhance the detection speed and uses a single network structure for object detection. The main algorithms in this
category include YOLO, single-shot detection and Retina-Net (Lin et al., 2020b; Liu, Anguelov, et al., 2016; Redmon
et al., 2016). YOLO divides an image into multiple regions and predicts the bounding box of each region simultane-
ously with the probability of the object to which the box belongs to. With the introduction of multiscale features, the
small object detection performance of YOLOv3 improves significantly (Redmon & Farhadi, 2018).
Object tracking identifies the path or trajectory of an object in a given video sequence when the video frame
contains only the initial state of the object (Marvasti-Zadeh et al., 2021; Yilmaz et al., 2006). Conventional tracking
methods in computer vision primarily use image features (e.g., HOG) and model the appearance and motion of a target
by adopting template matching, mean filtering, scale-invariant feature transformation, the Kanade–Lucas–Tomasi
tracking algorithm, the Kalman filter, and the Hungarian algorithm (Bae & Song, 2008; Du et al., 2012; Lowe, 2004;
Sahbani & Adiprawita, 2016; Simon, 2001; Svoboda, 2007). The Hungarian algorithm is a classical combinatorial
optimization schema that is used in multitarget tracking for target matching between two frames, that is, the front
and rear frames. In the simple, online, and real-time (SORT) tracker, a Kalman filter operating in the image space is combined with the frame-by-frame data association of the Hungarian algorithm, thereby increasing the frame rate of multitarget tracking (Bewley et al., 2016). With the rapid adoption of deep learning, CNNs have been introduced into the target tracking calculation, and Siamese neural networks based on similarity learning have been designed for target tracking with high accuracy and reliability (Bertinetto et al., 2016). The deficiencies of
SORT in tracking occlusion have been addressed in the DeepSORT tracker. A CNN network is integrated into Deep-
SORT for feature extraction, and a combined metric for the association of target motion and appearance information
is used to increase the robustness against target omission and occlusion. DeepSORT is also easy to implement,
computationally efficient, and suitable for multitarget tracking (Wojke et al., 2017).
Video clips are a commonly used type of media that consist of a collection of images with temporal relation-
ships. Video clips offer the advantages of spatiotemporal semantics, high information resolution, direct expression,
and accurate transmission of spatial relationships. Geotagged video can integrate geospatial semantics and video
image features by extracting spatial information, such as the video location and field of view. Thus, geotagged video
is an important source of geographic information that has received attention in many fields, such as GIS, computer
vision and data mining (Luo et al., 2011). Several studies have been performed in this area over the last several years,
including on geotagged video data collection and processing (Burr et al., 2018; Mills et al., 2010); data modeling (Han
et al., 2016; Lewis et al., 2011); video retrieval based on spatial semantics (Kim et al., 2010; Lu & Shahabi, 2017;
Ma et al., 2014; Wu et al., 2018); video mapping, analysis, and mining (Jamonnak et al., 2021; Rumora et al., 2021;
Wang  et al., 2022; Zhang et al., 2021); and video synopsis (Jamonnak et al., 2020; Xie et al., 2022; Zhang et al., 2019).
The integration and fusion of surveillance video and GIS can significantly improve the management efficiency of
surveillance video and enhance video surveillance systems. GIS offers significant advantages as a general framework
for video surveillance. Two types of modes, namely, GIS-enhanced video and video-enhanced GIS, have been defined
and verified using the GeoScopeAVS prototype (Milosavljević et al., 2010; Sankaranarayanan & Davis, 2008). Spatial
semantics are used to integrate dynamic targets between GIS and surveillance videos and retrieve target trajectories
(Li et al., 2021; Xie et al., 2017, 2021). The continuous trajectory of dynamic objects in video is extracted through
background subtraction and Canny operator fusion, and objects in three-dimensional (3D) geographic scenes are
located by combining imaging rays and digital surface model intersections (Han et al., 2022; Li et al., 2022). Align-
ment and matching of video image space and 3D geographic space are key to integrating surveillance video and GIS.
For effective image matching, a method for projecting vector data into surveillance video has been proposed based
on using remote sensing and video images (Shao et al., 2020). High-resolution orthoimages and digital elevation
models can be used in conjunction with the Levenberg–Marquardt iterative optimization method to determine the
locations and orientations of cameras (Milosavljević et al., 2017). A nonlinear perspective correction model has been
used to calculate a homography matrix based on multiple matching points to implement the geospatial mapping of
video objects (Zhang et al., 2021). This method has the advantages of facile parameter acquisition and convenient
calculation (Lin et al., 2020a).
In summary, the advantages of deep learning have facilitated remarkable progress in object detection and track-
ing. Surveillance video is a typical geotagged video, such that its spatial semantics can be integrated with GIS to
perform spatiotemporal analysis and facilitate the creation of more intelligent geographic information through geospatial artificial intelligence (GeoAI) (Janowicz et al., 2020). A review of the current literature reveals that limited studies
have been performed on crowd sensing and spatiotemporal analysis in urban open spaces based on multi-viewpoint
cameras. In this study, object detection and tracking methods are integrated to extract crowd trajectories from
multi-viewpoint surveillance videos and perform geospatial mapping for crowd sensing and corresponding spatio-
temporal analysis.
3 | METHODOLOGY
Crowd sensing and spatiotemporal analysis in urban open spaces are based on determining pedestrian locations and performing spatial analysis. The main framework of this method is described in this section (Figure 1) and consists of four parts: pedestrian detection and crowd tracking, geospatial mapping of pedestrians, trajectory fusion, and crowd spatiotemporal analysis. The spatiotemporal analysis methods are presented in Section 5.
3.1 | YOLOv3 model for crowd detection
Object detection and tracking are key tasks in surveillance video analysis. Considering the low latency requirements
for surveillance video data processing, a single-stage detection model that considers both speed and accuracy has
significant advantages. In this study, the YOLOv3 model for object detection is integrated with the DeepSORT
tracking algorithm for crowd detection and tracking in surveillance videos (Ahmed et al., 2021; Punn et al., 2020). The
video image data are input into the YOLOv3 detector to determine the object bounding box as output to the Deep-
SORT tracker (Figure 1). The advantages of the YOLOv3-DeepSORT model include robustness to object occlusion and illumination interference and high tracking reliability, which enable multitarget detection and crowd tracking in surveillance videos and lay a foundation for crowd trajectory fusion and spatiotemporal analysis (Hossain & Lee, 2019; Zhang et al., 2019).
Within YOLOv3, object detection is treated as a regression problem. Feature maps of different scales are
extracted and fused from the input image through multiple convolutions and pooling. The model structure is
shown in Figure 1. YOLOv3 includes a feature extraction layer based on Darknet-53 and a YOLO multiscale
prediction layer. Darknet-53 consists of five residual blocks, each of which is composed of a prescribed number
of convolution blocks following a prescribed number of cycles through fusion. In the YOLO prediction layer, detection is first performed at three scales corresponding to downsampling factors of 32, 16, and 8; then, upsampled features are fused to extract deep features, and prediction maps with sizes of 13 × 13, 26 × 26, and 52 × 52 are finally used to predict objects. The YOLOv3 calculation process is as follows: first, the video frame image is input and resized; then, convolution and pooling operations are used to generate a three-layer feature map. These
feature maps of different scales are used to predict the object bounding box for each scale, where each bound-
ing box includes four coordinates (the center point coordinates and the width and height). Then, the classification
confidence is output as the detection result (Redmon & Farhadi, 2018). The object type in this study is set as
“pedestrians,” and the YOLOv3 algorithm is used to generate image coordinates and assign unique IDs to detected
pedestrians.
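As a concrete illustration of this detection step, the sketch below runs a YOLOv3 network with OpenCV's DNN module and keeps only the "person" class. The configuration and weight file names and the class index are assumptions (the model in this study is trained on VOC2007, whose class indices differ from COCO), so this is a minimal sketch rather than the authors' exact setup.

```python
import cv2
import numpy as np

# Hypothetical file names; any YOLOv3 config/weights pair with a pedestrian class can be used.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
out_names = net.getUnconnectedOutLayersNames()
PERSON_CLASS_ID = 0  # 0 for COCO-trained weights; a VOC-trained model uses a different index

def detect_pedestrians(frame, conf_thr=0.5, nms_thr=0.4):
    """Return (box, confidence) pairs for the 'person' class in one video frame."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    boxes, scores = [], []
    for out in net.forward(out_names):          # predictions from the 13x13, 26x26, 52x52 scales
        for det in out:                         # det = [cx, cy, bw, bh, objectness, class scores...]
            class_scores = det[5:]
            if int(np.argmax(class_scores)) != PERSON_CLASS_ID:
                continue
            conf = float(class_scores[PERSON_CLASS_ID])
            if conf < conf_thr:
                continue
            cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            scores.append(conf)
    if not boxes:
        return []
    keep = cv2.dnn.NMSBoxes(boxes, scores, conf_thr, nms_thr)   # non-maximum suppression
    return [(boxes[i], scores[i]) for i in np.array(keep).reshape(-1)]
```

The centers of the returned bounding boxes are the image coordinates that are later passed to the tracker and to the geospatial mapping step.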
FIGURE 1 The proposed method for crowd sensing and spatiotemporal analysis.
3.2 | DeepSORT model for crowd tracking
The object bounding box detected by YOLOv3 is input into the DeepSORT tracker, which is combined with the
improved Kalman filter to predict the object position. The Mahalanobis distance and the cosine distance of the deep appearance descriptor are fused into a single metric that is used by the Hungarian algorithm to perform cascade matching, and the object tracking results are output (Figure 1). The Kalman filter and Hungarian algorithm are combined in SORT to model the motion of the detected object as linear and to predict the position of the object in the next frame according to the
position of the current frame and the target motion speed (Bewley et al., 2016). The correlation between the predic-
tion and truth is then measured by using the appearance function and Mahalanobis distance. The CNN in the Deep-
SORT algorithm uses a large-scale pedestrian re-identification dataset for pretraining to build the target tracking appearance features (Wojke et al., 2017). Data association is thus performed using both motion and appearance information, thereby reducing interference from occlusion. Using the object bounding box detected by YOLOv3 as the input to DeepSORT tracking accomplishes the multitarget visual tracking task according to the unique pedestrian ID.
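The association step described above can be sketched as follows: a Mahalanobis (motion) term and a cosine (appearance) term are gated and fused into one cost matrix, which is solved with the Hungarian algorithm via SciPy. The weighting lam and the gate values are illustrative assumptions, and DeepSORT's cascade matching by track age is omitted for brevity, so this is a simplified sketch rather than the full tracker.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

INF_COST = 1e5  # placeholder cost for gated-out (infeasible) track-detection pairs

def associate(pred_xy, pred_cov, track_feats, det_xy, det_feats,
              lam=0.2, maha_gate=9.49, cos_gate=0.3):
    """One association step: fuse motion and appearance distances, then solve the
    assignment with the Hungarian algorithm. Returns matched (track, detection) pairs."""
    cost = np.full((len(pred_xy), len(det_xy)), INF_COST)
    for i, (mean, cov, tf) in enumerate(zip(pred_xy, pred_cov, track_feats)):
        inv_cov = np.linalg.inv(cov)
        for j, (z, df) in enumerate(zip(det_xy, det_feats)):
            diff = np.asarray(z, dtype=float) - np.asarray(mean, dtype=float)
            maha = float(diff @ inv_cov @ diff)                     # Kalman-predicted motion term
            cos = 1.0 - float(np.dot(tf, df) /
                              (np.linalg.norm(tf) * np.linalg.norm(df) + 1e-9))
            if maha > maha_gate or cos > cos_gate:                  # gate implausible pairs
                continue
            cost[i, j] = lam * maha + (1.0 - lam) * cos             # fused metric
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < INF_COST]
```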
3.3 | Geospatial mapping for video objects
After the objects in the video are detected and tracked, the pixel coordinates of the object are generated in image
space. The pixel coordinates must be converted to geographic coordinates to perform crowd spatiotemporal anal-
ysis. This transformation is carried out using a camera calibration model based on the homography matrix method.
Consider a point P (X, Y, Z) in the real world that is projected to the two-dimensional (2D) image plane in the
camera calibration model as p (u, v). Due to imaging deformation, a distortion correction coefficient must be added
to the mapping matrix to improve conversion accuracy. Normalized homogeneous coordinates are used to define
$P = \begin{bmatrix} X & Y & Z & 1 \end{bmatrix}^T$ and $p = \begin{bmatrix} u & v & 1 \end{bmatrix}^T$. The mapping matrix is denoted by $M$, and the transformation from pixel coordinates to geographic coordinates is defined as $p = MP$. To perform camera calibration, it is necessary to determine the camera interior orientation elements, such as the center point and focal length, as well as the camera exterior orientation elements, such as the translation, rotation and scale factor. $M$ is defined as the mapping matrix for the camera calibration process and includes the camera interior and exterior orientation elements and camera lens distortion coefficient. Thus, the mapping matrix can be defined as $M = sKQ$, where $s$ is the scale factor, and $K$ is the camera interior orientation parameter matrix related to the internal structure of each camera, as defined below:

$$K = \begin{bmatrix} 1/d_x & 0 & u_0 \\ 0 & 1/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \tag{1}$$

where $d_x$ and $d_y$ are the pixel sizes of the camera in the x and y directions, respectively; $u_0$ and $v_0$ are the coordinates of the image center point; $f_x$ and $f_y$ are the camera focal lengths; and $c_x$ and $c_y$ are the camera optical axis offsets. $Q$ is the matrix of camera exterior orientation elements and is defined as $Q = \begin{bmatrix} r_1 & r_2 & r_3 & t \end{bmatrix}$, where $r_1$, $r_2$ and $r_3$ denote the rotation angles along the X-, Y- and Z-axes, respectively; and $t$ denotes the translation value between the two coordinate systems. In the homography matrix method, the field of view of the video in the geographic space is assumed to be a plane, that is, the ground is assumed to be the Z = 0 plane. Then, $P = \begin{bmatrix} X & Y & 0 & 1 \end{bmatrix}^T$ is defined, such that the mapping relationship between the image and geographic spaces can be considered a mapping from one plane to another. The $r_3$ term rotating around the Z-axis can be removed to yield the following simplified formula:
$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = s \begin{bmatrix} 1/d_x & 0 & u_0 \\ 0 & 1/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_1 & r_2 & t \end{bmatrix} \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} = s \cdot K \cdot Q \cdot P = MP \tag{2}$$
The inverse of the M matrix can be used to map the 2D image coordinates to the world coordinate system.
$$\begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} = M^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \tag{3}$$
The camera interior orientation parameters for the transformation process are calculated using Zhang's (2000)
calibration method. The camera exterior orientation elements are calculated by the Perspective-n-Point (PnP) algo-
rithm, where the points with the same labels are used to establish the mapping relationship between the image and
geographic space (Lepetit et al., 2009).
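The paper assembles M from the calibrated interior parameters and the PnP-derived exterior parameters. Under the same Z = 0 ground-plane assumption, a lighter-weight route that is mathematically equivalent is to fit the image-to-ground homography directly from the matched control points and apply it as in Equation (3). The sketch below does this with OpenCV; it is not the authors' exact implementation, and the control-point values are illustrative placeholders rather than the paper's surveyed points.

```python
import cv2
import numpy as np

# Matched control points for one camera: pixel coordinates (u, v) after distortion
# correction, and the corresponding ground-plane coordinates (X, Y) in a projected
# system such as UTM. These values are illustrative placeholders.
img_pts = np.array([[412, 633], [988, 601], [1525, 710], [1047, 912],
                    [620, 845]], dtype=np.float32)
geo_pts = np.array([[702134.2, 3857201.5], [702141.8, 3857203.1], [702148.6, 3857198.9],
                    [702140.3, 3857194.7], [702135.9, 3857196.4]], dtype=np.float32)

# Under the Z = 0 assumption the image-to-ground mapping is a 3x3 homography,
# which plays the role of M^-1 in Equation (3).
M_inv, _ = cv2.findHomography(img_pts, geo_pts)

def pixel_to_ground(u, v):
    """Map an undistorted pixel coordinate to ground-plane (X, Y) coordinates."""
    p = np.array([[[u, v]]], dtype=np.float32)
    X, Y = cv2.perspectiveTransform(p, M_inv)[0, 0]
    return float(X), float(Y)

print(pixel_to_ground(900.0, 750.0))
```

When the full interior and exterior orientation elements are available, the same mapping can instead be obtained by composing and inverting M as in Equations (2) and (3); fitting the homography directly simply folds the calibration and PnP results into a single matrix.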
3.4 | Pedestrian trajectory fusion based on a multidimensional similarity measure
Pedestrian detection and tracking and the corresponding geospatial mapping are used to generate the pedestrian
trajectories within the field of view of each camera. However, it is necessary to use the similarity measure method
for trajectory fusion to build complete trajectories of pedestrians in urban open spaces for crowd spatiotemporal
analysis. Dynamic time warping (DTW) (Senin, 2008), longest common subsequence (LCSS) (Vlachos et al., 2002) and
edit distance on real sequence (EDR) (Chen et al., 2005) are commonly used to calculate the trajectory similarity (Jing
et al., 2022). The DTW algorithm is primarily used to process trajectory sequences of equal length, whereas both
the LCSS and EDR are variants of the editing distance algorithm. The trajectory similarity is calculated by defining a
matching threshold to search for the longest common subsequence between two trajectory sequences. Most of the
abovementioned algorithms use a single feature, such as the Euclidean distance to calculate the trajectory similarity.
Considering the distance, time and direction features of pedestrian trajectories within the field of view of multiple
cameras, the multidimensional similarity measure (MSM) based on multiple features is used to perform multicamera
pedestrian trajectory matching and fusion (Furtado et al., 2016). The direction of each point in the trajectory is deter-
mined by calculating the azimuth of the vector that points from the previous point to the current point. The MSM is
used to calculate the similarity score for two trajectory sequences
AT
1
and
AT
2
by searching for the best matching score
for all the elements in these sequences. The azimuth angles of the corresponding trajectory points, distance and time
are compared. The number of elements in
AT
1
and
AT
2
are denoted by
AN
1
and
AN
2
 , respectively, and the similarity index
M is defined as follows:
M
(T1,T2)=
0 if N1= 0 or N2
=0
p(T1,T2)+p(T2,T1)
N
1
+N
2
otherwise
(4)
where
Ap
(
T
1
,T
2)
is the sum of the maximum matching scores of all the elements
At
i
in
AT
1
and any element in
AT
2
and vice
versa.
p
(T1,T2)=
t
i
T
1
max
s
ti,tj
:tjT2
(5)
LIU et al.501
where
As
ti,tj
is the weighted sum of the matching scores between the trajectory sequence elements
At
i
and
At
j
in
AK
dimension features and is defined as:
s
ti,tj=
K
k=1
mkti,tjωk
(6)
where
Aωk
denotes the weight. The matching score
Am
k
of
At
i
and
At
j
is a binary value: 1 if the matching condition is met
and 0 otherwise. The matching conditions are defined based on the distance, time and azimuth difference of the
trajectory points:
m
kti,tj=
1 if dkti,tjD
k
0 otherwise
(7)
where
AD
k
is the matching threshold. The number of points in
AT
1
and
AT
2
may not be identical, resulting in a discrepancy
in the number of accumulations for the maximum
As
ti,tj
when computing
Ap
(T
1
,T
2)
and
Ap
(T
2
,T
1)
 . Thus, it is neces-
sary to calculate
Ap
(T
1
,T
2)
and
Ap
(T
2
,T
1)
respectively. The similarity index of the pedestrian trajectories extracted by
different cameras is calculated to realize trajectory matching and fusion of the same pedestrian trajectory with the
multi-viewpoint cameras, thereby generating the complete trajectory of all pedestrians in the monitoring area.
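A minimal sketch of Equations (4)-(7) is given below. It assumes each trajectory point is stored as a tuple (x, y, t, azimuth) and that the thresholds and weights are ordered as (distance, time, direction); this bookkeeping is an assumption for illustration, not part of the published method.

```python
import numpy as np

def _feature_distances(a, b):
    """Per-dimension distances d_k between two points stored as (x, y, t, azimuth)."""
    d_loc = float(np.hypot(a[0] - b[0], a[1] - b[1]))      # spatial distance (m)
    d_time = abs(a[2] - b[2])                              # time difference (s)
    d_dir = abs((a[3] - b[3] + 180.0) % 360.0 - 180.0)     # azimuth difference (degrees)
    return np.array([d_loc, d_time, d_dir])

def _p(T1, T2, thresholds, weights):
    """p(T1, T2): best weighted match score in T2 for every element of T1 (Eqs. 5-7)."""
    total = 0.0
    for ti in T1:
        best = 0.0
        for tj in T2:
            m = (_feature_distances(ti, tj) <= thresholds).astype(float)  # Eq. (7)
            best = max(best, float(np.dot(m, weights)))                   # Eq. (6)
        total += best
    return total

def msm(T1, T2, thresholds, weights):
    """Similarity index M(T1, T2) of Eq. (4)."""
    if len(T1) == 0 or len(T2) == 0:
        return 0.0
    return (_p(T1, T2, thresholds, weights)
            + _p(T2, T1, thresholds, weights)) / (len(T1) + len(T2))
```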
4 | PEDESTRIAN DETECTION AND GEOSPATIAL MAPPING
A campus square is selected as the study area. This square is almost circular with a diameter of approximately 110 m,
and the geographic coordinates of the center point are 114°18′12.88″ E, 34°49′7.55″ N. Four Sony FDR-AXP55-4K
cameras with a resolution of 1920 × 1080 pixels are deployed in the area for scene monitoring, where the camera
positions are shown in Figure 2. Video data are synchronously captured in four video clips, each of which is 31 s long
with 775 frames. Experiments are performed on pedestrian detection and tracking, geospatial mapping and crowd
spatiotemporal analysis, and the results are analyzed. The hardware environment is a 3.60-GHz Intel Core i7 4790
processor with 16.0 GB of random-access memory. The software environment consists of TensorFlow 1.15 for deep-learning-based object detection and tracking and OpenCV 4.2 for video data processing, and Python scripts are used for the corresponding calculation and analysis.
4.1 | Pedestrian detection and tracking
Pedestrian detection and tracking are the basis of crowd spatiotemporal analysis. OpenCV is first used to read the
video frame sequence, and the YOLOv3-DeepSORT model is then used to detect and track the objects of the corre-
sponding frame. The VOC2007 dataset is used to train the YOLOv3 model and generate the model parameters. In
this study, we set the object category to “pedestrian” in the calculations. The center point of the object bounding box is used as the output pedestrian position, and corresponding control points between the image and geographic coordinates are collected using a handheld GPS receiver. Figure 3 shows the results of the YOLOv3-DeepSORT object detection and
tracking. The average detection and tracking speed of the YOLOv3-DeepSORT model is 5.09 frames per second,
which takes a total of 2.55 min in our experiment. The screenshots of the two videos before and after running the
detection and tracking model indicate that the overlapped and occluded images of the pedestrians with IDs 23 and
24 are identified by the model as objects 23 and 24, thus achieving continuous tracking. As the DeepSORT algorithm
performs nearest-neighbor-matching based on object appearance features, the robustness of pedestrian tracking is
significantly improved.
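A sketch of this processing loop is shown below. The clip name is a placeholder, detect_pedestrians stands for the detection sketch in Section 3.1, and the tracker update is indicated only as a comment because the DeepSORT implementation used by the authors is not reproduced here.

```python
import cv2

cap = cv2.VideoCapture("camera_c1.mp4")      # placeholder name; one of the four 31 s clips
frame_idx = 0
while True:
    ok, frame = cap.read()                   # read the next frame of the sequence
    if not ok:
        break
    detections = detect_pedestrians(frame)   # YOLOv3 detection sketch from Section 3.1
    # tracker.update(detections) would assign persistent pedestrian IDs (DeepSORT step)
    frame_idx += 1
cap.release()
print(f"processed {frame_idx} frames")
```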
4.2 | Geospatial mapping for video objects
Geospatial mapping of video objects comprises three steps, that is, camera calibration, calculation of the exterior
orientation elements, and coordinate transformation. Zhang's (2000) method is used to perform camera calibration
in this study. A calibrated checkerboard image is shot and corner points are identified to generate the internal camera
parameter matrix and distortion coefficient, and the video frame image distortion is corrected. The interior orienta-
tion parameter and distortion correction metrics of the four cameras are shown in Appendix A. The calculated exterior
orientation elements are compiled into the matrix
AQ
 , including the rotation and translation parameters. Pedestrians
hold a handheld GPS receiver in the monitoring area and simultaneously appear in the images of the four cameras
enabling the geographic coordinates of the pedestrians to be recorded (Figure 4). Then, the YOLOv3-DeepSORT
model is used to sequentially calculate the coordinates of the pedestrians in the image space. A total of 10 points of
GPS coordinates are recorded, and the geospatial coordinates are calculated using the Universal Transverse Merca-
tor projection. The image and geographic coordinates of the points are shown in Appendix B. The PnP algorithm in
OpenCV is used to generate the external camera parameter matrix, thereby establishing the geospatial mapping
FIGURE 2 The location and estimated field of view of the cameras in the study area.
relationship between the image and the geographic coordinates. The coordinate transformation accuracy is evaluated
in terms of the root mean square error (RMSE) and the mean absolute error (MAE), which are defined below:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( P_i - \hat{P}_i \right)^2} \tag{8}$$
FIGURE 3 Pedestrian detection and tracking using the YOLOv3-DeepSORT model. (a) Overlapping and occlusion of pedestrians 23 and 24 during detection and tracking; (b) detection and tracking of the overlapped and occluded images of pedestrians 23 and 24.
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| P_i - \hat{P}_i \right| \tag{9}$$
where $n$ denotes the number of samples, $P_i$ represents the transformed coordinates, and $\hat{P}_i$ denotes the real coordinates. The results for the evaluation indicators are provided in Appendix C. The average RMSE and MAE are
1.115 and 0.984 m, respectively, for the transformed x-coordinates and 2.000 and 1.580 m, respectively, for the
transformed y-coordinates. The errors in the geospatial mapping for video objects are typically caused by multi-
ple factors, including equipment error, data acquisition error, and data processing error. The inherent errors of
cameras, GPS receivers and other equipment introduce systematic errors into the calibration of the interior and
exterior orientation elements of the cameras and the coordinate determination. During data acquisition, changes in
lighting, texture, etc., cause the detection frame to be offset during object detection. The occlusion of buildings also
affects the location accuracy of the GPS receivers. The coordinate transformation generates fitting residuals for the solution of the transformation parameters using the pinhole camera model, etc. These errors can be reduced by
using more precise equipment, selecting optimal lighting conditions and objects with dense textures, and repeating
the calculations for fitting the transformation parameters. As crowd analysis is used to determine an overall distri-
bution, the conversion results can be used to perform a spatiotemporal analysis of the crowd distribution in the
study area.
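For reference, the two indicators of Equations (8) and (9) can be computed per coordinate axis with a few lines of NumPy; the arrays below are placeholders for the transformed and GPS-measured control-point coordinates, not the values in Appendices B and C.

```python
import numpy as np

def rmse_mae(transformed, measured):
    """Per-axis RMSE and MAE (Eqs. 8-9) for n transformed vs. reference points."""
    err = np.asarray(transformed, dtype=float) - np.asarray(measured, dtype=float)
    return np.sqrt((err ** 2).mean(axis=0)), np.abs(err).mean(axis=0)

# Placeholder (X, Y) coordinate pairs in meters.
pred = np.array([[702134.9, 3857200.2], [702141.1, 3857204.8], [702149.8, 3857197.1]])
true = np.array([[702134.2, 3857201.5], [702141.8, 3857203.1], [702148.6, 3857198.9]])
rmse, mae = rmse_mae(pred, true)
print("RMSE (x, y):", rmse, "MAE (x, y):", mae)
```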
FIGURE 4 Pedestrian locations in the map and image.
5 | FUSION OF PEDESTRIAN TRAJECTORIES AND CROWD SPATIOTEMPORAL
ANALYSIS
5.1 | Fusion of pedestrian trajectories
The pedestrian detection and tracking and geospatial mapping results are used to extract the pedestrian trajectories
from the videos captured by the four cameras. These trajectories are fused using the method presented in Section 3.4.
The time threshold is set to 10 s, the angle threshold is 1°, and the distance threshold is 0.5 m. The weights are 0.2, 0.2,
and 0.6. Table 1 shows the similarity results for three pedestrian trajectories, that is, 2, 33, and 49, in the C1 camera
video, and three pedestrian trajectories, that is, 1, 6 and 10, in the C2 camera video. The similarity value of Trajectory
49 (C1) and Trajectory 10 (C2) is 0.4539, which is significantly higher than that for Trajectories 1 and 6 (C2). Among the
aforementioned trajectories, Trajectory 2 (C1) has the highest similarity with Trajectory 6 (C2) of 0.3816. The similarity
of Trajectory 2 (C1) and Trajectory 1 (C2) of 0.3802 is higher than that of Trajectory 33 (C1) and Trajectory 1 (C2) of
0.1977, but lower than the aforementioned similarity of Trajectories 2 and 6 (0.3816); therefore, Trajectories 2 and 6
and Trajectories 33 and 1 are best matched. Figure 5a,b shows that Trajectory 33 (C1) and Trajectory 1 (C2) are the same target trajectory, and Trajectories 2 and 49 (C1) correspond to the C2 video targets 6 and 10, respectively. Despite
the closeness of Trajectories 1 and 6 and Trajectories 2 and 33, Trajectories 1 and 33 and Trajectories 2 and 6 can still
be correctly matched because the MSM method comprehensively considers distance, direction, and time factors. The
corresponding relationship is used to combine the trajectory points into a complete pedestrian trajectory dataset across
the camera field of view of the study area, where each trajectory point contains time and location information.
TABLE 1 The similarity scores for trajectories extracted from different cameras.

                                    Trajectory 2 (C1)    Trajectory 33 (C1)    Trajectory 49 (C1)
Trajectory 1 extracted from C2      0.3802               0.1977                0.1810
Trajectory 6 extracted from C2      0.3816               0.1511                0.1708
Trajectory 10 extracted from C2     0.1744               0.0773                0.4539

Note: The similarity score is assessed using both rows and columns to identify the best-matched trajectories (shown in bold in the original: Trajectories 2 and 6, 33 and 1, and 49 and 10). The similarity score between Trajectories 2 and 1 is 0.3802, which is lower than that between Trajectories 2 and 6 (0.3816), indicating that Trajectory 2 is a better match to Trajectory 6 than to Trajectory 1. As the best-matched trajectory to Trajectory 2 has been found, only the similarity scores for Trajectory 1 and the remaining trajectories must be compared. Trajectory 1 has a higher similarity score with Trajectory 33 (0.1977) than with Trajectory 49 (0.1810).
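Using the msm sketch from Section 3.4, the fusion step above amounts to scoring every cross-camera trajectory pair with the stated thresholds and keeping the best-scoring pairs. The pairing of the listed weights with the (distance, time, direction) features is an assumption, since the paper gives the values but not their order, and the toy trajectories below are illustrative rather than the experimental data.

```python
import numpy as np

# Toy trajectories (x, y, t, azimuth) standing in for tracks seen by cameras C1 and C2.
traj_c1 = [(0.0, 0.0, 0.0, 45.0), (0.4, 0.4, 1.0, 45.0), (0.8, 0.8, 2.0, 45.0)]
traj_c2 = [(0.1, 0.1, 0.2, 45.3), (0.5, 0.5, 1.2, 44.8), (0.9, 0.9, 2.2, 45.1)]

thresholds = np.array([0.5, 10.0, 1.0])   # distance 0.5 m, time 10 s, angle 1 degree
weights    = np.array([0.2, 0.2, 0.6])    # assumed feature order: distance, time, direction

print(msm(traj_c1, traj_c2, thresholds, weights))   # a value near 1 suggests the same pedestrian
```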
FIGURE 5 Fusion of pedestrian trajectories using the MSM method. (a) Plots of the pedestrian trajectories
with IDs 2, 33, and 49 recorded by the C1 camera; (b) plots of the pedestrian trajectories with IDs 1, 6, and 10
recorded by the C2 camera; and (c) the fusion results for Trajectories 2 and 6; 33 and 1; and 49 and 10.
5.2 | Crowd movement analysis
To analyze the movement pattern of the crowd, the pedestrian trajectory points are connected into trajectory lines
through the sequence of trajectory ID and time, and the movement direction of each pedestrian is analyzed at the
individual level. Figure 6a shows the classification results for the movement directions of different pedestrian  trajec-
tories for the crowd. The green movement trajectory is in the southeast direction, whereas Pedestrian 72 moves in the opposite direction; moving against the majority southeast flow increases the risk of collisions. Pedestrian 2, in a stagnant state, presents a high risk in an urban venue with high foot traffic. Overall, crowd movement patterns can be detected by considering the direction of single trajectories. Figure 6b shows the movement conditions of the crowd, presenting the accumulated displacements of the target trajectories in the same movement direction and indicating that the movement of the crowd is primarily concentrated in the southwest direction,
followed by the northeast direction, corresponding to a two-way movement mode. Different movement patterns can
represent the crowd at different places and times, and scientific management can be conducted for different places.
For example, entrances and exits of large-scale sports and performance stadiums can only accommodate one-way
movement patterns, whereas the overpasses and zebra crossings of roads can accommodate two-way movement
patterns. Timely warnings are necessary for disorderly movement patterns in specific places.
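A sketch of this direction analysis is shown below, assuming trajectory points are (x, y) tuples ordered in time with x as easting and y as northing; the 45° binning follows the caption of Figure 6, and the accumulation mirrors the displacement totals in Figure 6b.

```python
import numpy as np

def movement_azimuth(traj):
    """Azimuth (degrees clockwise from north) of the start-to-end displacement vector."""
    dx, dy = traj[-1][0] - traj[0][0], traj[-1][1] - traj[0][1]   # easting, northing deltas
    return (np.degrees(np.arctan2(dx, dy)) + 360.0) % 360.0

def accumulated_displacement_by_direction(trajectories, bin_deg=45):
    """Total start-to-end displacement per direction class (cf. Figure 6b)."""
    bins = np.zeros(int(360 / bin_deg))
    for traj in trajectories:
        dx, dy = traj[-1][0] - traj[0][0], traj[-1][1] - traj[0][1]
        displacement = float(np.hypot(dx, dy))
        if displacement == 0.0:                       # stagnant pedestrian, no direction
            continue
        bins[int(movement_azimuth(traj) // bin_deg) % len(bins)] += displacement
    return bins
```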
5.3 | Crowd distribution analysis
The trajectory points at different times are used to calculate the crowd density to identify and visualize congested
regions. First, the study area is divided into 1 × 1 m grids. The number of trajectory points in the grid is counted at
two time points, that is, t1 = 15 s and t2 = 31 s, to calculate the crowd density, and thematic visualization is performed
to analyze the spatial distribution and changes in congested areas over time. Figure 7a,b shows the crowd
FIGURE 6 Crowd movement analysis. (a) The pedestrian trajectories, categorized at 45° intervals by determining their azimuth angles in accordance with the starting and ending points of the trajectories; (b) the corresponding accumulated trajectory displacements for the different categories.
density distribution at t1 and t2, respectively, in which the orange-red and red grids correspond to congested
regions. By observing the changes in the crowd density distribution at two time points, congested areas can be
identified in time, and early warnings can be promptly issued. Second, the standard deviational ellipse method is
applied to the pedestrian trajectory dataset to analyze the overall directions of the crowd distribution at t1 and t2
(Figure 7c). The center of the crowd distribution ellipses at t1 and t2 moves from A (114°18′12.82″, 34°49′07.42″)
to B (114°18′12.84″, 34°49′07.41″), and the azimuth angle of the major axis of the ellipse changes from 50.68° to
48.95°, thus realizing the spatiotemporal analysis of the overall direction of the crowd distribution.
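The two analyses above can be sketched as follows, assuming the trajectory points for one time slice are given as an (n, 2) array of projected (X, Y) coordinates in meters. The rotation-angle expression is the classical standard deviational ellipse formula and may differ in minor conventions from the GIS tool actually used by the authors.

```python
import numpy as np

def density_grid(points, x_range, y_range, cell=1.0):
    """Count trajectory points per 1 m x 1 m cell for one time slice."""
    x_edges = np.arange(x_range[0], x_range[1] + cell, cell)
    y_edges = np.arange(y_range[0], y_range[1] + cell, cell)
    counts, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=[x_edges, y_edges])
    return counts

def standard_deviational_ellipse(points):
    """Centre and major-axis azimuth (degrees from north) of the unweighted SDE."""
    x, y = points[:, 0], points[:, 1]
    cx, cy = float(x.mean()), float(y.mean())
    dx, dy = x - cx, y - cy
    A = float((dx ** 2).sum() - (dy ** 2).sum())
    C = float(2.0 * (dx * dy).sum())
    B = float(np.sqrt(A ** 2 + C ** 2))
    if C == 0:
        theta = 90.0 if A > 0 else 0.0        # degenerate case: axis-aligned ellipse
    else:
        theta = float(np.degrees(np.arctan((A + B) / C)))
    return (cx, cy), theta % 180.0
```

Comparing the grids and ellipse parameters obtained at t1 and t2 reproduces the kind of congestion and orientation change reported above.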
6 | DISCUSSION AND CONCLUSIONS
Video surveillance has become an integral part of our surroundings, and surveillance cameras deployed in urban open
spaces are very useful for urban security (Socha & Kogut, 2020). Advances in machine learning and computer vision
have resulted in the increasing use of object detection and tracking in surveillance video and its intelligent analysis
(Ahmed et al., 2018). Surveillance video is typical geotagged video data and its spatial semantics are of great value
in video analysis. Spatial information from multi-viewpoint surveillance videos of open spaces (e.g., urban squares) is
used to develop a method for crowd sensing and its spatiotemporal analysis in this study. The object detection and
tracking model YOLOv3-DeepSORT is used to extract image coordinates of pedestrians from video clips. The camera
calibration and PnP algorithm are applied to the calculation of camera interior and exterior orientation elements to
perform the geospatial mapping from image coordinates to geographic coordinates by solving for the transformation
matrix. Multiple pedestrian trajectories generated using single cameras with different viewpoints are matched and
fused using the MSM method by integrating features of distance, time and direction. A campus square is selected as
the study area for data collection, pedestrian detection and tracking, trajectory extraction and fusion experiments.
Crowd spatiotemporal analysis is performed on the pedestrian trajectory dataset by applying standard deviational ellipses and density estimation methods, and the overall characteristics of the crowd distribution and movement
patterns are identified. The results demonstrate that multiple surveillance cameras can be used in conjunction with
the pedestrian detection and tracking method and geospatial mapping of video objects to generate complete pedes-
trian trajectories in the monitoring area, as well as to create situational awareness and perform spatiotemporal anal-
ysis of the crowd, thereby providing solutions for the intelligent analysis of surveillance video and promoting the
in-depth application of geotagged videos to urban security.
In practical applications, different tactics should be employed depending on whether surveillance cameras are
deployed. For new open spaces without cameras, detailed camera parameters should be inferred by combining the data
FIGURE 7 Crowd distribution analysis. (a,b) The crowd density distribution at t1 = 15 s and t2 = 31 s. (c) The
standard deviation ellipse for the crowd distribution at t1 and t2.
for building footprints and points of interest in the surveillance area, and spatial optimization models, such as the maximum
coverage location problem, should be used to estimate the number and location of cameras for camera planning to
achieve high surveillance coverage at a low cost (Han, Li, Cui, Song, et al., 2019; Liu, Sridharan, et al., 2016). Then, camera
internal and external orientation element calibration should be performed for geospatial mapping and crowd sensing.
For spaces with deployed cameras, the camera interior and exterior orientation elements should be determined using Zhang's
(2000) calibration method and the PnP algorithm. In an expansive urban open space, the overlap area of the field of view
of the camera may be minimal or nonexistent or the crowd may be obscured by buildings, which can interrupt pedestrian
trajectories tracked by different cameras. To address this issue, pedestrian reidentification technology can be used to
match pedestrians detected by different cameras based on image features, and the pedestrian trajectories can then be
fused (Brasó & Leal-Taixé, 2020; Weng et al., 2022). A long short-term memory network and graph-based spatiotemporal
reasoning models can be used in conjunction with the MSM approach for trajectory matching to predict and comple-
ment the interrupted trajectories. Appropriate image or location obfuscation methods should also be used to effectively
address privacy protection concerns associated with video surveillance.
An important direction for future research is to combine video surveillance and the Internet of Video Things
(IoVT) with an edge computing framework (Chen, 2020). Video data are currently mainly collected by cameras and
then transmitted to a server for processing. Uninterrupted collection of surveillance video continuously generates
large quantities of video data, resulting in a high network transmission and server computing load. Visual sensors
are being continuously developed and can be combined with IoT and edge computing technology as a paradigm
of IoVT technology, such that a portion of the video processing tasks can be assigned to the camera side. As each
camera completes these processing tasks, the results are returned to the server side to achieve load balancing and
reduce the video data volume. For example, the object detection and tracking model can be integrated on the camera
side to extract dynamic targets, such as pedestrians, from each camera and send the results to the server side
for trajectory fusion and spatiotemporal analysis, thus improving the computational efficiency and usability of the
surveillance system. Another direction would be to develop an intelligent surveillance platform based on geotagged
videos for real-time analysis and efficient early warning of crowd congestion. Object trajectories determined using
multi-viewpoint surveillance videos and geospatial mapping methods can be fused for spatiotemporal analysis under
a unified geo-referencing system to determine the overall distribution of objects and significantly enhance the power
of the video surveillance system.
We consider that the methodology presented in this article (in particular, crowd sensing, and spatiotemporal
analysis based on multi-viewpoint geotagged videos) has application potential to intelligent surveillance systems
used for urban security. A variety of innovative surveillance systems could be developed by integration with spatial
analysis in GIS and video data processing algorithms for use in urban security and smart cities.
ACKNOWLEDGMENTS
Special thanks go to the editor and anonymous reviewers of this article for their constructive comments especially
during the COVID-19 pandemic, which substantially improved the quality of this article.
FUNDING INFORMATION
This research was funded by the National Natural Science Foundation of China under Grant 41871316, the National
Key R&D Program of China under Grant 2021YFE0106700, the Key Technologies R&D Program of the Henan
province under Grant 212102310421, the Foundation of Key Laboratory of Soil and Water Conservation on the
Loess Plateau of Ministry of Water Resources under Grant WSCLP202101, and Natural Resources Science and Tech-
nology Innovation Project of Henan Province under Grant No. 202016511.
CONFLICT OF INTEREST STATEMENT
No potential conflict of interest was reported by the authors.
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available on request from the corresponding author. The data are
not publicly available due to privacy and ethical restrictions.
ORCID
Zhigang Han https://orcid.org/0000-0002-9993-3382
REFERENCES
Ahmed, I., Ahmad, M., Rodrigues, J. J., Jeon, G., & Din, S. (2021). A deep learning-based social distance monitoring framework
for COVID-19. Sustainable Cities and Society, 65, 102571. https://doi.org/10.1016/j.scs.2020.102571
Ahmed, S. A., Dogra, D. P., Kar, S., & Roy, P. P. (2018). Trajectory-based surveillance analysis: A survey. IEEE Transactions on
Circuits and Systems for Video Technology, 29(7), 1985–1997. https://doi.org/10.1109/TCSVT.2018.2857489
Bae, J. S., & Song, T. L. (2008). Image tracking algorithm using template matching and PSNF-m. International Journal of Control,
Automation, and Systems, 6(3), 413–423. https://koreascience.kr/article/JAKO200822049838914.page
Bertinetto, L., Valmadre, J., Henriques, J. F., Vedaldi, A., & Torr, P. H. (2016). Fully-convolutional siamese networks for object
tracking. In B. Leibe, J. Matel, N. Sebe, & M. Welling (Eds.), Computer vision—ECCV 2016 (pp. 850–865). Springer. https://
doi.org/10.1007/978-3-319-48881-3_56
Bewley, A., Ge, Z., Ott, L., Ramos, F., & Upcroft, B. (2016). Simple online and realtime tracking. IEEE International Conference
on Image Processing, Phoenix, AR (pp. 3464–3468). IEEE. https://doi.org/10.1109/ICIP.2016.7533003
Brasó, G., & Leal-Taixé, L. (2020). Learning a neural solver for multiple object tracking. IEEE/CVF Conference on Computer
Vision and Pattern Recognition, Seattle, WA (pp. 6247–6257). IEEE. https://doi.org/10.48550/arXiv.1912.07515
Burr, A., Schaeg, N., & Hall, D. M. (2018). Assessing residential front yards using Google street view and geospatial video:
A virtual survey approach for urban pollinator conservation. Applied Geography, 92, 12–20. https://doi.org/10.1016/j.
apgeog.2018.01.010
Chen, C. W. (2020). Internet of video things: Next-generation IoT with visual sensors. IEEE Internet of Things Journal, 7(8),
6676–6685. https://doi.org/10.1109/JIOT.2020.3005727
Chen, L., Özsu, M. T., & Oria, V. (2005). Robust and fast similarity search for moving object trajectories. ACM SIGMOD International
Conference on Management of Data, Baltimore, MD (pp. 491–502). ACM. https://doi.org/10.1145/1066157.1066213
Dai, J., Li, Y., He, K., & Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. 30th Confer-
ence on Advances in Neural Information Processing Systems, Barcelona, Spain (pp. 1–29). https://doi.org/10.48550/
arXiv.1605.06409
Draghici, A., & Steen, M. V. (2018). A survey of techniques for automatically sensing the behavior of a crowd. ACM Computing
Surveys, 51(1), 1–40. https://doi.org/10.1145/3129343
Du, K., Ju, Y., Jin, Y., Li, G., Qian, S., & Li, Y. (2012). MeanShift tracking algorithm with adaptive block color histogram. Second
International Conference on Consumer Electronics, Communications and Networks, Yichang, China (pp. 2692–2695). IEEE.
https://doi.org/10.1109/CECNet.2012.6202074
Elharrouss, O., Almaadeed, N., & Al-Maadeed, S. (2021). A review of video surveillance systems. Journal of Visual Communica-
tion and Image Representation, 77, 103116. https://doi.org/10.1016/j.jvcir.2021.103116
Furht, B. (2008). Geographic video content. Encyclopedia of multimedia (2nd ed., pp. 271–272). https://doi.
org/10.1007/978-0-387-78414-4_328
Furtado, A. S., Kopanaki, D., Alvares, L. O., & Bogorny, V. (2016). Multidimensional similarity measuring for semantic trajecto-
ries. Transactions in GIS, 20(2), 280–298. https://doi.org/10.1111/tgis.12156
Girshick, R. (2015). Fast R-CNN. IEEE International Conference on Computer Vision (ICCV), Santiago, Chile (pp. 1440–1448).
IEEE. https://doi.org/10.1109/ICCV.2015.169
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic
segmentation. IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH (pp. 580–587). IEEE. https://
doi.org/10.1109/CVPR.2014.81
Han, S., Dong, X., Hao, X., & Miao, S. (2022). Extracting objects' spatial–temporal information based on surveillance videos
and the digital surface model. ISPRS International Journal of Geo-Information, 11(2), 103. https://doi.org/10.3390/
ijgi11020103
Han, Z., Cui, C., Kong, Y., Qin, F., & Fu, P. (2016). Video data model and retrieval service framework using geographic informa-
tion. Transactions in GIS, 20(5), 701–717. https://doi.org/10.1111/tgis.12175
Han, Z., Li, S., Cui, C., Han, D., & Song, H. (2019). Geosocial media as a proxy for security: A review. IEEE Access, 7, 154224–
154238. https://doi.org/10.1109/ACCESS.2019.2949115
Han, Z., Li, S., Cui, C., Song, H., Kong, Y., & Qin, F. (2019). Camera planning for area surveillance: A new method for coverage
inference and optimization using location-based service data. Computers, Environment and Urban Systems, 78, 101396.
https://doi.org/10.1016/j.compenvurbsys.2019.101396
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. IEEE International Conference on Computer Vision, Venice,
Italy (pp. 2980–2988). IEEE. https://doi.org/10.1109/ICCV.2017.322
Hossain, S., & Lee, D. J. (2019). Deep learning-based real-time multiple-object detection and tracking from aerial imagery via
a flying robot with GPU-based embedded devices. Sensors, 19(15), 3371. https://doi.org/10.3390/s19153371
Jamonnak, S., Zhao, Y., Curtis, A., Al-Dohuki, S., Ye, X., Kamw, F., & Yang, J. (2020). GeoVisuals: A visual analytics approach to
leverage the potential of spatial videos and associated geonarratives. International Journal of Geographical Information
Science, 34(11), 2115–2135. https://doi.org/10.1080/13658816.2020.1737700
Jamonnak, S., Zhao, Y., Huang, X., & Amiruzzaman, M. (2021). Geo-context aware study of vision-based autonomous driving
models and spatial video data. IEEE Transactions on Visualization and Computer Graphics, 28(1), 1019–1029. https://doi.
org/10.1109/TVCG.2021.3114853
Janowicz, K., Gao, S., McKenzie, G., Hu, Y., & Bhaduri, B. (2020). GeoAI: Spatially explicit artificial intelligence techniques for
geographic knowledge discovery and beyond. International Journal of Geographical Information Science, 34(4), 625–636.
https://doi.org/10.1080/13658816.2019.1684500
Jing, C., Hu, Y., Zhang, H., Du, M., Xu, S., Guo, X., & Jiang, J. (2022). Context-aware matrix factorization for the identification
of urban functional regions with POI and taxi OD data. ISPRS International Journal of Geo-Information, 11(6), 351. https://
doi.org/10.3390/ijgi11060351
Jing, C., Zhu, Y., Du, M., & Liu, X. (2021). Visualizing spatiotemporal patterns of city service demand through a space-time
exploratory approach. Transactions in GIS, 25(4), 1766–1783. https://doi.org/10.1111/tgis.12820
Kilger, M. A. (1992). Shadow handler in a video-based real-time traffic monitoring system. IEEE Workshop on Applications of
Computer Vision, Palm Springs, CA (pp. 11–18). IEEE. https://doi.org/10.1109/ACV.1992.240332
Kim, S. H., Ay, S. A., & Zimmermann, R. (2010). Design and implementation of geo-tagged video search framework. Journal of
Visual Communication and Image Representation, 21(8), 773–786. https://doi.org/10.1016/j.jvcir.2010.07.004
Kong, Y. (2010). Design of GeoVideo data model and implementation of web-based VideoGIS. Geomatics and Information
Science of Wuhan University, 35(2), 133–137. https://doi.org/10.13203/j.whugis2010.02.019
Laufs, J., Borrion, H., & Bradford, B. (2020). Security and the smart city: A systematic review. Sustainable Cities and Society, 55,
102023. https://doi.org/10.1016/j.scs.2020.102023
Lepetit, V., Moreno-Noguer, F., & Fua, P. (2009). EPnP: An accurate O (n) solution to the PnP problem. International Journal of
Computer Vision, 81(2), 155–166. https://doi.org/10.1007/s11263-008-0152-6
Lewis, P., Fotheringham, S., & Winstanley, A. (2011). Spatial video and GIS. International Journal of Geographical Information
Science, 25(5), 697–716. https://doi.org/10.1080/13658816.2010.505196
Li, C., Liu, Z., Zhao, Z., & Dai, Z. (2021). A fast fusion method for multi-videos with three-dimensional GIS scenes. Multimedia
Tools and Applications, 80(2), 1671–1686. https://doi.org/10.1007/s11042-020-09742-4
Li, J., Wei, J., Jiang, J., Lu, Y., Liu, L., Tang, Y., & Li, X. (2022). Spatio-temporal information extraction method for dynamic
targets in multi-perspective surveillance video. Acta Geodaetica et Cartographica Sinica, 51(3), 388–400. https://doi.
org/10.11947/j.AGCS.2022.20200507
Li, X., Hu, W., Shen, C., Zhang, Z., Dick, A., & Hengel, A. V. D. (2013). A survey of appearance models in visual object tracking.
ACM Transactions on Intelligent Systems and Technology, 4(4), 1–48. https://doi.org/10.1145/2508037.2508039
Lin, B., Xu, C., Lan, X., & Zhou, L. (2020a). A method of perspective normalization for video images based on map data. Annals
of GIS, 26(1), 35–47. https://doi.org/10.1080/19475683.2019.1704870
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detec-
tion. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI (pp. 936–944). IEEE. https://doi.
org/10.1109/CVPR.2017.106
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2020b). Focal loss for dense object detection. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 42(2), 318–327. https://doi.org/10.1109/TPAMI.2018.2858826
Lipton, A. J., Fujiyoshi, H., & Patil, R. S. (1998). Moving target classification and tracking from real-time video. Fourth IEEE
Workshop on Applications of Computer Vision WACV'98 (Cat. No. 98EX201), Princeton, NJ (pp. 8–14). IEEE. https://doi.
org/10.1109/ACV.1998.732851
Liu, J., Sridharan, S., & Fookes, C. (2016). Recent advances in camera planning for large area surveillance: A comprehensive
review. ACM Computing Surveys, 49, 6–37. https://doi.org/10.1145/2906148
Liu, L., Ouyang, W., Wang, X., Fieguth, P., Chen, J., Liu, X., & Pietikäinen, M. (2020). Deep learning for generic object detection:
A survey. International Journal of Computer Vision, 128(2), 261–318. https://doi.org/10.1007/s11263-019-01247-4
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In B.
Leibe, J. Matas, N. Sebe, & M. Welling (Eds.), European Conference on Computer Vision—ECCV 2016 (pp. 21–37). Springer.
https://doi.org/10.1007/978-3-319-46448-0_2
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2),
91–110. https://doi.org/10.1023/B:VISI.0000029664.99615.94
Lu, Y., & Shahabi, C. (2017). Efficient indexing and querying of geo-tagged aerial videos. 25th ACM SIGSPATIAL Interna-
tional Conference on Advances in Geographic Information Systems, Redondo Beach, CA (pp. 1–10). ACM. https://doi.
org/10.1145/3139958.3140046
Luo, J., Joshi, D., Yu, J., & Gallagher, A. (2011). Geotagging in multimedia and computer vision—A survey. Multimedia Tools and
Applications, 51(1), 187–211. https://doi.org/10.1007/s11042-010-0623-y
Ma, H., Arslan Ay, S., Zimmermann, R., & Kim, S. H. (2014). Large-scale geo-tagged video indexing and queries. GeoInformat-
ica, 18(4), 671–697. https://doi.org/10.1007/s10707-013-0199-6
Mae, Y., Shirai, Y., Miura, J., & Kuno, Y. (1996). Object tracking in cluttered background based on optical flow and edges.
13th International Conference on Pattern Recognition, Vienna, Austria (Vol. 1, pp. 196–200). https://doi.org/10.1109/
ICPR.1996.546018
Marvasti-Zadeh, S. M., Cheng, L., Ghanei-Yakhdan, H., & Kasaei, S. (2021). Deep learning for visual tracking: A compre-
hensive survey. IEEE Transactions on Intelligent Transportation Systems, 23(5), 3943–3968. https://doi.org/10.1109/
TITS.2020.3046478
Mills, J. W., Curtis, A., Kennedy, B., Kennedy, S. W., & Edwards, J. D. (2010). Geospatial video for field data collection. Applied
Geography, 30(4), 533–547. https://doi.org/10.1016/j.apgeog.2010.03.008
Milosavljević, A., Dimitrijević, A., & Rančić, D. (2010). GIS-augmented video surveillance. International Journal of Geographical
Information Science, 24(9), 1415–1433. https://doi.org/10.1080/13658811003792213
Milosavljević, A., Rančić, D., Dimitrijević, A., Predić, B., & Mihajlović, V. (2016). Integration of GIS and video surveillance.
International Journal of Geographical Information Science, 30(10), 2089–2107. https://doi.org/10.1080/13658816.201
6.1161197
Milosavljević, A., Rančić, D., Dimitrijević, A., Predić, B., & Mihajlović, V. (2017). A method for estimating surveillance video
georeferences. ISPRS International Journal of Geo-Information, 6(7), 211. https://doi.org/10.3390/ijgi6070211
Neri, A., Colonnese, S., Russo, G., & Talone, P. (1998). Automatic moving object and background separation. Signal Processing,
66(2), 219–232. https://doi.org/10.1016/S0165-1684(98)00007-3
Newburn, T. (2021). The causes and consequences of urban riot and unrest. Annual Review of Criminology, 4, 53–73. https://
doi.org/10.1146/annurev-criminol-061020-124931
Nishiyama, H. (2018). Crowd surveillance: The (in) securitization of the urban body. Security Dialogue, 49(3), 200–216. https://
doi.org/10.1177/0967010617741436
Patel, T., Yao, A. Y. H., Qiang, Y., Ooi, W. T., & Zimmermann, R. (2021). Multi-camera video scene graphs for surveillance videos
indexing and retrieval. IEEE International Conference on Image Processing (ICIP), Anchorage, AK (pp. 2383–2387). IEEE.
https://doi.org/10.1109/ICIP42928.2021.9506713
Punn, N. S., Sonbhadra, S. K., Agarwal, S., & Rai, G. (2020). Monitoring COVID-19 social distancing with person detec-
tion and tracking via fine-tuned YOLO v3 and DeepSORT techniques. arXiv:2005.01385. https://doi.org/10.48550/
arXiv.2005.01385
Qian, X., Li, M., Ren, Y., & Jiang, S. (2019). Social media based event summarization by user–text–image co-clustering.
Knowledge-Based Systems, 164, 107–121. https://doi.org/10.1016/j.knosys.2018.10.028
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV (pp. 779–788). IEEE. https://doi.org/10.1109/
CVPR.2016.91
Redmon, J., & Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv:1804.02767. https://doi.org/10.48550/
arXiv.1804.02767
Rumora, L., Majić, I., Miler, M., & Medak, D. (2021). Spatial video remote sensing for urban vegetation mapping using vegeta-
tion indices. Urban Ecosystem, 24(1), 21–33. https://doi.org/10.1007/s11252-020-01002-5
Sahbani, B., & Adiprawita, W. (2016). Kalman filter and iterative-Hungarian algorithm implementation for low complexity
point tracking as part of fast multiple object tracking system. Sixth International Conference on System Engineering and
Technology (ICSET), Shah Alam, Malaysia (pp. 109–115). https://doi.org/10.1109/ICSEngT.2016.7849633
Sankaranarayanan, K., & Davis, J. W. (2008). A fast linear registration framework for multi-camera GIS coordination. Fifth
IEEE International Conference on Advanced Video and Signal Based Surveillance, Santa Fe, NM (pp. 245–251). https://doi.
org/10.1109/AVSS.2008.20
Senin, P. (2008). Dynamic time warping algorithm review. Information and Computer Science Department, University of Hawaii
at Manoa, Honolulu. https://csdl.ics.hawaii.edu/techreports/2008/08-04/08-04.pdf
Shao, Z., Li, C., Li, D., Altan, O., Zhang, L., & Ding, L. (2020). An accurate matching method for projecting vector data into
surveillance video to monitor and protect cultivated land. ISPRS International Journal of Geo-Information, 9(7), 448.
https://doi.org/10.3390/ijgi9070448
Simon, D. (2001). Kalman filtering. Embedded Systems Programming, 14(6), 72–79. http://abel.math.harvard.edu/archive/116_
fall_03/handouts/kalman.pdf
Socha, R., & Kogut, B. (2020). Urban video surveillance as a tool to improve security in public spaces. Sustainability, 12(15),
6210. https://doi.org/10.3390/su12156210
Stauffer, C., & Grimson, W. E. L. (1999). Adaptive background mixture models for real-time tracking. IEEE Computer Soci-
ety Conference on Computer Vision and Pattern Recognition, Fort Collins, CO (Vol. 2, pp. 246–252). IEEE. https://doi.
org/10.1109/CVPR.1999.784637
Subudhi, B. N., Rout, D. K., & Ghosh, A. (2019). Big data analytics for video surveillance. Multimedia Tools and Applications,
78(18), 26129–26162. https://doi.org/10.1007/s11042-019-07793-w
Svoboda, T. (2007). Kanade-Lucas-Tomasi tracking (KLT tracker). Czech Technical University in Prague, Center for Machine
Perception. https://cs.gmu.edu/~zduric/cs682/slides/klt.pdf
Vlachos, M., Kollios, G., & Gunopulos, D. (2002). Discovering similar multidimensional trajectories. 18th International Confer-
ence on Data Engineering, San Jose, CA (pp. 673–684). https://doi.org/10.1109/ICDE.2002.994784
Wang, X., Wang, M., Liu, X., Zhu, L., Glade, T., Chen, M., Zhao, W., & Xie, Y. (2022). A novel quality control model of rainfall
estimation with videos—A survey based on multi-surveillance cameras. Journal of Hydrology, 605, 127312. https://doi.
org/10.1016/j.jhydrol.2021.127312
Weng, X., Ivanovic, B., Kitani, K., & Pavone, M. (2022). Whose track is it anyway? Improving robustness to tracking errors with
affinity-based trajectory prediction. IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA
(pp. 6573–6582). IEEE. https://doi.org/10.1109/CVPR52688.2022.00646
Wojke, N., Bewley, A., & Paulus, D. (2017). Simple online and realtime tracking with a deep association metric. IEEE Inter-
national Conference on Image Processing (ICIP), Beijing, China (pp. 3645–3649). IEEE. https://doi.org/10.1109/
ICIP.2017.8296962
Wu, C., Zhu, Q., Zhang, Y., Du, Z., Ye, X., Qin, H., & Zhou, Y. (2017). A NOSQL–SQL hybrid organization and management
approach for real-time geospatial data: A case study of public security video surveillance. ISPRS International Journal of
Geo-Information, 6(1), 21. https://doi.org/10.3390/ijgi6010021
Wu, C., Zhu, Q., Zhang, Y., Xie, X., Qin, H., Zhou, Y., Zhang, P., & Yang, W. (2018). Movement-oriented objectified organiza-
tion and retrieval approach for heterogeneous GeoVideo data. ISPRS International Journal of Geo-Information, 7(7), 255.
https://doi.org/10.3390/ijgi7070255
Xie, Y., Wang, M., Liu, X., Wang, X., Wu, Y., Wang, F., & Wang, X. (2022). Multi-camera video synopsis of a geographic scene
based on optimal virtual viewpoint. Transactions in GIS, 26(3), 1221–1239. https://doi.org/10.1111/tgis.12862
Xie, Y., Wang, M., Liu, X., Wang, Z., Mao, B., Wang, F., & Wang, X. (2021). Spatiotemporal retrieval of dynamic video object
trajectories in geographical scenes. Transactions in GIS, 25(1), 450–467. https://doi.org/10.1111/tgis.12696
Xie, Y., Wang, M., Liu, X., & Wu, Y. (2017). Integration of GIS and moving objects in surveillance video. ISPRS International
Journal of Geo-Information, 6(4), 94. https://doi.org/10.3390/ijgi6040094
Yilmaz, A., Javed, O., & Shah, M. (2006). Object tracking: A survey. ACM Computing Surveys, 38(4), 13-es. https://doi.
org/10.1145/1177352.1177355
Zhang, J., Zheng, Y., & Qi, D. (2017). Deep spatio-temporal residual networks for citywide crowd flows prediction. Thirty-First
AAAI Conference on Artificial Intelligence, San Francisco, California, USA (pp. 1–7). https://doi.org/10.1609/aaai.
v31i1.10735
Zhang, X., Hao, X., Liu, S., Wang, J., Xu, J., & Hu, J. (2019). Multi-target tracking of surveillance video with differential
YOLO and DeepSORT. Eleventh International Conference on Digital Image Processing (ICDIP 2019), Los Angeles, CA (Vol.
11179, pp. 701–710). SPIE. https://doi.org/10.1117/12.2540269
Zhang, X., Shi, X., Luo, X., Sun, Y., & Zhou, Y. (2021). Real-time web map construction based on multiple cameras and GIS.
ISPRS International Journal of Geo-Information, 10(12), 803. https://doi.org/10.3390/ijgi10120803
Zhang, Z. (2000). A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 22(11), 1330–1334. https://doi.org/10.1109/34.888718
Zhao, R., Wang, D., Wang, Y., Han, C., Jia, P., Li, C., & Ma, Y. (2021). Macroscopic view: Crowd evacuation dynamics at
T-shaped street junctions using a modified aw-Rascle traffic flow model. IEEE Transactions on Intelligent Transportation
Systems, 22(10), 6612–6621. https://doi.org/10.1109/TITS.2021.3095829
Zou, Z., Shi, Z., Guo, Y., & Ye, J. (2019). Object detection in 20 years: A survey. arXiv:1905.05055. https://doi.org/10.48550/
arXiv.1905.05055
How to cite this article: Liu, F., Han, Z., Song, H., Wang, J., Liu, C., & Ban, G. (2023). Crowd sensing and
spatiotemporal analysis in urban open space using multi-viewpoint geotagged videos. Transactions in GIS,
27, 494–515. https://doi.org/10.1111/tgis.13036
APPENDIX A
THE CAMERA INTERIOR ORIENTATION ELEMENTS, DISTORTION CORRECTION COEFFICIENTS, AND EXTERIOR ORIENTATION ELEMENT MATRIX
Camera C1
  Image center point (u0, v0): 994.35, 530.72
  Camera focal length (fx, fy): 159.31, 158.71
  Distortion correction coefficients (Ak1, Ak2, Ak3, Aq1, Aq2): 0.184952, −0.182719, −0.105806, 0.006815, 0.153519
  Exterior orientation elements matrix A:
    [ 1.363081e+03   1.091650e+03   1.482575e+05 ]
    [ 8.446769e+02   8.446769e+02   8.194328e+04 ]
    [ 3.986530e−01   1.143757e−01   9.241020e+01 ]

Camera C2
  Image center point (u0, v0): 967.99, 521.39
  Camera focal length (fx, fy): 253.18, 164.39
  Distortion correction coefficients (Ak1, Ak2, Ak3, Aq1, Aq2): 0.037365, −0.022141, 0.044989, 0.009549, 0.004931
  Exterior orientation elements matrix A:
    [ 1.935160e+03   7.660439e+01   1.881655e+05 ]
    [ 1.994035e+02   6.504662e+03   1.745761e+05 ]
    [ 7.312295e−01   2.701989e−03   1.161133e+02 ]

Camera C3
  Image center point (u0, v0): 962.45, 34.06
  Camera focal length (fx, fy): 160.89, 262.40
  Distortion correction coefficients (Ak1, Ak2, Ak3, Aq1, Aq2): −0.009123, 0.514534, −0.058777, 0.001946, −0.937182
  Exterior orientation elements matrix A:
    [ 9.844750e+02   1.138992e+03   4.123191e+04 ]
    [ 8.393584e+02   1.076158e+03   1.115333e+05 ]
    [ 2.037393e−01   1.246928e−01   5.489779e+01 ]

Camera C4
  Image center point (u0, v0): 956.08, 521.43
  Camera focal length (fx, fy): 155.22, 175.30
  Distortion correction coefficients (Ak1, Ak2, Ak3, Aq1, Aq2): 0.043948, 0.132562, −0.091745, 0.000965, −0.200468
  Exterior orientation elements matrix A:
    [ 2.631332e+02   5.564476e+01   8.320149e+04 ]
    [ 6.413476e+02   1.383987e+03   1.644971e+05 ]
    [ 7.394724e−01   1.929569e−01   1.515290e+02 ]
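For reference, the sketch below illustrates one way a 3 × 3 matrix of this form can be applied: treating it as a planar projective (homography-style) transformation of homogeneous pixel coordinates. It uses the matrix listed above for C1 (with the exponent signs as reconstructed here); the function name image_to_geo and the example pixel position are illustrative assumptions and do not reproduce the exact conventions of the transformation used in the paper.

```python
import numpy as np

# Exterior orientation elements matrix listed for camera C1 in Appendix A
# (third-row exponent signs reconstructed from the printed table).
A_C1 = np.array([
    [1.363081e+03, 1.091650e+03, 1.482575e+05],
    [8.446769e+02, 8.446769e+02, 8.194328e+04],
    [3.986530e-01, 1.143757e-01, 9.241020e+01],
])


def image_to_geo(A, u, v):
    """Map a pixel coordinate (u, v) through a 3x3 projective matrix.

    Assumes a homography-style model on homogeneous coordinates; any axis
    conventions or coordinate offsets used in the paper are not reproduced.
    """
    x, y, w = A @ np.array([u, v, 1.0])
    return x / w, y / w


# Illustrative call with an arbitrary pixel position.
print(image_to_geo(A_C1, 550.0, 440.0))
```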
APPENDIX B
THE IMAGE AND GEOSPATIAL COORDINATES FOR THE 10 POINTS
No. | C1 camera (u, v) | C2 camera (u, v) | C3 camera (u, v) | C4 camera (u, v) | Lon | Lat | x | y
1 | 261, 544 | 1095, 646 | 1801, 483 | 819, 787 | 114°18′12.32″ | 34°49′7.59″ | 3,857,923.15 | 802,164.55
2 | 138, 583 | 390, 659 | 1177, 472 | 1836, 763 | 114°18′12.87″ | 34°49′7.00″ | 3,857,905.42 | 802,179.13
3 | 895, 606 | 109, 725 | 612, 469 | 1572, 748 | 114°18′13.23″ | 34°49′7.02″ | 3,857,906.34 | 802,188.26
4 | 1526, 573 | 766, 722 | 118, 524 | 977, 743 | 114°18′13.58″ | 34°49′7.44″ | 3,857,919.58 | 802,196.73
5 | 1104, 560 | 1291, 735 | 627, 535 | 685, 754 | 114°18′13.31″ | 34°49′7.73″ | 3,857,928.29 | 802,189.57
6 | 876, 521 | 1527, 684 | 1119, 552 | 426, 812 | 114°18′12.78″ | 34°49′7.77″ | 3,857,929.08 | 802,176.06
7 | 362, 560 | 993, 658 | 1489, 511 | 1118, 825 | 114°18′12.34″ | 34°49′7.43″ | 3,857,918.23 | 802,165.22
8 | 459, 580 | 720, 656 | 1167, 492 | 1384, 788 | 114°18′12.65″ | 34°49′7.30″ | 3,857,914.49 | 802,173.23
9 | 716, 599 | 509, 686 | 841, 496 | 1461, 771 | 114°18′12.90″ | 34°49′7.18″ | 3,857,911.00 | 802,179.70
10 | 752, 561 | 874, 662 | 944, 497 | 1087, 749 | 114°18′12.84″ | 34°49′7.62″ | 3,857,924.51 | 802,177.74
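Control-point correspondences of this kind are the input needed to estimate a per-camera image-to-map projective transformation. As an illustration only, and not necessarily the estimation procedure used in the paper, the sketch below fits a planar homography for camera C1 from the ten points above with OpenCV's findHomography (thousands separators removed from the printed coordinates).

```python
import numpy as np
import cv2

# Pixel coordinates (u, v) of the ten control points in camera C1 (Appendix B).
c1_pixels = np.array([
    [261, 544], [138, 583], [895, 606], [1526, 573], [1104, 560],
    [876, 521], [362, 560], [459, 580], [716, 599], [752, 561],
], dtype=np.float64)

# Corresponding projected geospatial coordinates (x, y) from Appendix B.
geo_xy = np.array([
    [3857923.15, 802164.55], [3857905.42, 802179.13], [3857906.34, 802188.26],
    [3857919.58, 802196.73], [3857928.29, 802189.57], [3857929.08, 802176.06],
    [3857918.23, 802165.22], [3857914.49, 802173.23], [3857911.00, 802179.70],
    [3857924.51, 802177.74],
], dtype=np.float64)

# Least-squares planar homography from image space to the map plane;
# RANSAC (cv2.RANSAC) could be used instead if some points were unreliable.
H, _ = cv2.findHomography(c1_pixels, geo_xy, method=0)
print(H)
```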
APPENDIX C
THE EVALUATION OF THE COORDINATE TRANSFORMATION
No. | Camera ID | Image coordinates (u, v) | Real geospatial coordinates (Lon, Lat, x, y) | Transformed geospatial coordinates (Lon, Lat, x, y) | Errors (x, y)
1 | C1 | 550, 440 | 114°18′12.30″, 34°49′07.71″, 802,163.81, 3,857,926.94 | 114°18′12.27″, 34°49′07.64″, 802,163.35, 3,857,924.72 | 0.464, 2.214
2 | C1 | 601, 345 | 114°18′12.41″, 34°49′07.86″, 802,166.50, 3,857,931.43 | 114°18′12.37″, 34°49′07.83″, 802,165.46, 3,857,930.62 | 1.033, 0.812
3 | C1 | 537, 547 | 114°18′12.27″, 34°49′07.55″, 802,163.40, 3,857,921.99 | 114°18′12.25″, 34°49′07.43″, 802,162.86, 3,857,918.22 | 0.543, 3.771
4 | C2 | 809, 320 | 114°18′12.79″, 34°49′07.88″, 802,176.09, 3,857,932.54 | 114°18′12.72″, 34°49′07.89″, 802,174.30, 3,857,932.75 | 1.792, −0.211
5 | C2 | 682, 274 | 114°18′12.56″, 34°49′07.96″, 802,170.23, 3,857,934.85 | 114°18′12.51″, 34°49′07.97″, 802,168.87, 3,857,935.15 | 1.358, −0.308
6 | C3 | 606, 154 | 114°18′12.42″, 34°49′08.15″, 802,166.65, 3,857,940.39 | 114°18′12.38″, 34°49′08.21″, 802,165.56, 3,857,942.19 | 1.085, −1.790
7 | C3 | 757, 98 | 114°18′12.70″, 34°49′08.22″, 802,173.68, 3,857,943.01 | 114°18′12.64″, 34°49′08.32″, 802,171.96, 3,857,946.02 | 1.722, −3.012
8 | C4 | 421, 397 | 114°18′12.07″, 34°49′07.78″, 802,158.12, 3,857,928.88 | 114°18′12.06″, 34°49′07.72″, 802,157.83, 3,857,926.94 | 0.290, 1.938
9 | C4 | 491, 258 | 114°18′12.21″, 34°49′07.99″, 802,161.30, 3,857,935.39 | 114°18′12.19″, 34°49′08.00″, 802,160.73, 3,857,935.55 | 0.567, −0.166
RMSE (x, y): 1.115, 2.000
MAE (x, y): 0.984, 1.580
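The RMSE and MAE reported in the last two rows follow their standard definitions over the nine check points. A minimal sketch that reproduces the reported values from the per-point errors (values are in the units of the projected coordinates):

```python
import numpy as np

# Per-point planar errors (x, y) from Appendix C.
errors = np.array([
    [0.464, 2.214], [1.033, 0.812], [0.543, 3.771],
    [1.792, -0.211], [1.358, -0.308], [1.085, -1.790],
    [1.722, -3.012], [0.290, 1.938], [0.567, -0.166],
])

rmse = np.sqrt(np.mean(errors ** 2, axis=0))  # root-mean-square error per axis
mae = np.mean(np.abs(errors), axis=0)         # mean absolute error per axis

print(f"RMSE: x = {rmse[0]:.3f}, y = {rmse[1]:.3f}")  # ~1.115, ~2.000
print(f"MAE:  x = {mae[0]:.3f}, y = {mae[1]:.3f}")    # ~0.984, ~1.580
```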