Digitalization of Traffic Scenes in Support of Intelligent Transportation Applications

Linjun Lu, S.M.ASCE1; and Fei Dai, M.ASCE2
Abstract: Digitalization of real-world traffic scenes is a fundamental task in development of digital twins of road transportation. However,
the existing digitalization approaches are either expensive in equipment costs or inapplicable to collect granular level data of traffic scenes.
This study proposed a vision-based method for real-time digitalization of traffic scenes through modeling and merging the road infrastructure
(static components) and road users (dynamic components) progressively. Specifically, the former is reconstructed by leveraging unmanned
aerial vehicles (UAVs) and structure from motion; and the latter is digitized via using roadside surveillance videos and a new reconstruction
process through applying deep learning and view geometry. Last, the digital model of the traffic scene is built by merging the digital models of
static and dynamic components. A field experiment was performed to evaluate the performance of the proposed method. The results showed
that the traffic scene can be successfully digitalized by the proposed method with promising accuracy, thus signifying the method's
potential for the development of the digital twins of road transportation in support of intelligent transportation applications.
DOI: 10.1061/JCCEE5.CPENG-5204. © 2023 American Society of Civil Engineers.
Author keywords: Digital twins; Image-based methods; Intelligent transportation; Computer vision; Automated vehicles.
Introduction
Digital twins have been receiving a surge of attention in the traffic
community as a promising and powerful tool for the improvement
of road transportation management and operations (Gao et al. 2021;
Liu et al. 2021a; Lv et al. 2021). A digital twin of road transpor-
tation is a digital replica of an actual traffic system created in a
virtual space, which enables real-time information interaction,
closed-loop data transmission, and adaptive iterative optimization
between the digital and physical counterparts (Bao et al. 2021).
Thanks to recent advances in smart sensing and artificial intelli-
gence, the performance of digital twins has been significantly im-
proved, allowing for analysis, quantification, and modeling of the
intricate hierarchical relationships and interactions among different
traffic entities. By continuously infusing the observed traffic data,
the digital twins of road transportation will progress in intelligence
and fidelity and can make ever-improving predictions about traffic
behaviors and potential problems. This will assist the traffic man-
agement systems in making optimal and timely traffic coordination
decisions on a global scope. As a result, the digital twins of road
transportation have potential to actively contribute to the realization
of intelligent transportation systems (Liu et al. 2021a; Pan et al.
2021; Strigel et al. 2014).
The digitalization of real-world traffic scenes is a fundamental
task for the development of the digital twins of road transportation,
which is a process of converting the valuable information in traffic
scenes into an electronically stored digital format that is ready for
use by transportation managers (Hu et al. 2021). Laser scanning is
the widely used technology in this regard for constructing three-
dimensional (3D) models of structures and scenes. It produces the
3D point clouds by measuring the distance from the laser sensor to
the scanning targets based on the time-of-flight or phase-based
principle (Aryan et al. 2021). However, the adoption of laser scan-
ners is currently limited in traffic monitoring applications because
of high equipment cost. Additionally, the semantic understanding
of traffic scenes remains a challenge in the use of monochromatic
3D point clouds generated by laser scanners, although some research
efforts have been devoted to coping with this issue (Chen et al.
2019).
With advances in electro-optical sensors and computational
capacities, machine vision is regarded as a cost-efficient alternative
for 3D model reconstruction (Szeliski 2010), upon which the se-
mantic information can be feasibly extracted from the recorded
high-resolution images. This study proposed a machine vision-
based method for real-time digitalization of traffic scenes by lever-
aging unmanned aerial vehicles (UAVs) and roadside surveillance
cameras. The method's performance was evaluated through a field
experiment at a road intersection. The results showed that the traffic
scene can be successfully digitalized by the proposed method with
promising accuracy, thus signifying the method's potential for the
development of digital twins of road transportation in support of
intelligent transportation applications.
Background
State of Research in Digital Twins for Intelligent
Transportation Applications
A few studies have been conducted to explore the applicability of
digital twins for intelligent transportation management and opera-
tions. For instance, by leveraging digital twins and machine learn-
ing, Kumar et al. (2018) developed a virtual vehicle model for
driving intention prediction that can be utilized to assist automated
1Graduate Research Assistant, Wadsworth Dept. of Civil and Environ-
mental Engineering, West Virginia Univ., Morgantown, WV 26506. Email:
ll0074@mix.wvu.edu
2Associate Professor, Wadsworth Dept. of Civil and Environmental
Engineering, West Virginia Univ., Morgantown, WV 26506 (corresponding
author). ORCID: https://orcid.org/0000-0002-8868-2821. Email: fei.dai@
mail.wvu.edu
Note. This manuscript was submitted on September 29, 2022; approved
on February 1, 2023; published online on May 19, 2023. Discussion period
open until October 19, 2023; separate discussions must be submitted for
individual papers. This paper is part of the Journal of Computing in Civil
Engineering, © ASCE, ISSN 0887-3801.
and legacy vehicles to make optimal path planning. Hui et al.
(2021) proposed a digital twin-enabled path planning scheme to
help increase traffic efficiency. Particularly, the proposed scheme
embedded a personalized utility module able to meet the specific
needs from each vehicle, thereby contributing to higher overall ve-
hicle utility than the traditional path planning schemes. Liu et al.
(2021a) introduced an infrastructure-vehicle cooperation system
for the advancement of automated driving. Their approach involved
using roadside sensors to digitize traffic scenes and transmitting the
perception results to the neighboring automated vehicle, which en-
ables the automated vehicles to get a full picture of the current traf-
fic situation.
However, the creation of digital twins in the aforementioned
methods relies on the information gathered from different types
of roadside sensors, such as cameras, LiDARs, and radars, making
these methods less applicable in field applications due to the high
equipment costs. To tackle this hurdle, El Marai et al. (2020) ex-
ploited omnidirectional cameras to create the digital twins of traffic
scenes in urban cities and simultaneously implemented the deep
learning method for detection and recognition of the road users.
However, the lack of spatial information about road users, which
would otherwise be of vital importance for traffic condition analy-
sis and prediction, rendered the digital twins generated by this
method less practical.
State of Research in Traffic Scene Reconstruction
Traffic scene reconstruction and understanding is an active topic
in intelligent transportation systems, in which the primary task
is to perceive 3D attributes of road users in dynamic environments.
Using monocular video for traffic scene reconstruction has been
receiving extensive interest recently because of the advantage of
simple implementation (Kar et al. 2015). The monocular-based
methods normally start with detecting the objects in the images
with two-dimensional (2D) bounding boxes and key points and
then fitting 3D templates (3D pose and shape) to best match their
2D observations (Su et al. 2015). Subsequently, postprocessing
is carried out to improve the initial estimation via nonlinear opti-
mization (Mottaghi et al. 2015). However, these methods are
only applicable to limited types of road users. In addition, they
are substantially time-consuming and fall short of the real-time
processing speeds needed by a plethora of intelligent transporta-
tion applications.
Recently, convolutional neural networks (CNNs) are gaining in-
creasing popularity and have been exploited to directly predict the
spatial location and pose of road users in traffic images in an end-
to-end manner (Ke et al. 2020; Reddy et al. 2018). Nevertheless, the
reconstruction results produced by the CNN-based methods are
usually too coarse to be used in real traffic applications due to
the difficulty of retrieving the depth information missing in 2D im-
ages. To address these hurdles, some research efforts have resorted
to multiple-view reconstruction or integrating depth information
into monocular images. Particularly, the multiple-view-based meth-
ods reconstruct traffic scenes by utilizing the triangulation tech-
nique to infer the geometrical and spatial information of traffic
entities based on a sequence of images gathered from different ori-
ented cameras (Muller et al. 2005).
However, this type of method entails complicated camera cal-
ibration over different cameras, and requires manually setting 3D
control points in the observed scene and accurately determining
their spatial location, which is oftentimes a laborious and time-
consuming process. Integrating depth information into monocular
images can effectively assist in recovering discriminative object
features and spatial information of road users in 2D images.
Common depth-sensing devices that can be used for this purpose
include radar (Niesen and Unnikrishnan 2020), depth-sensing cam-
eras (Xia et al. 2015), and LiDAR (Yang et al. 2018). Nevertheless,
the equipment cost of these sensors is typically expensive, thus im-
peding their wide application in real traffic settings.
Visual Surveillance Systems in Road Transportation
Applications
Visual surveillance systems have been widely deployed in modern
road transportation systems to facilitate the monitoring and man-
agement of traffic activities (Zhang et al. 2022). In comparison with
other traffic monitoring techniques such as inductive loop detectors,
radar sensors, and infrared sensors, a visual surveillance system can
be deployed for various applications by using different computer
vision algorithms. The potential applications include vehicle detec-
tion and classification (Lu and Dai 2023), human detection and rec-
ognition (Nikouei et al. 2018), incident detection (Sharma and
Sungheetha 2021), and illegal activity detection (Bhatti et al. 2021).
If the deployed surveillance cameras are properly calibrated, the
applicability of the surveillance system can be further widened, en-
abling capabilities such as speed measurement (Yang et al. 2019)
and behavior understanding (Giannakeris et al. 2018).
To eliminate the challenges with respect to network congestion,
latency, and data storage, the aforementioned image analysis tasks
are usually completed at the edge computing units in close prox-
imity to the surveillance cameras. The extracted semantic informa-
tion from each edge computing unit is then transmitted to the traffic
management center to assist the road operators in understanding
current traffic conditions and conducting system-wide management
of regional transportation (Zhang et al. 2022).
Problem Statement and Research Objective
Machine vision, especially the monocular fashion, has been recog-
nized as a cost-efficient and promising technology in assisting dig-
ital twin creation because it can capture both rich semantic and
geometric information. Nevertheless, the existing vision-based
methods are only applicable to extraction of one or a few specific
types of traffic data from real-world traffic scenes, and a unified
method allowing for collection of granular-level data of traffic
scenes is still missing in the existing literature. The granular-level
data of traffic scenes here refer to the road infrastructure features
(e.g., road markings and lane dividers) and road users' type, loca-
tion, dimension, speed, and so on, all of which are vital to building
a trustworthy digital twin of road transportation and making opti-
mal traffic management and coordination decisions.
To fill this gap, this study proposed a new vision-based method
that digitalizes the traffic scenes by recovering spatial information
of both static road infrastructure and dynamic road users through
creation of a new reconstruction process that addresses the diffi-
culty in retrieving 3D spatial information of road users in 2D im-
ages. The main contribution of the study lies in developing a new
vision-based framework by leveraging a diversity of machine learn-
ing and computer vision techniques through customization and
synergy for granular traffic data collection in assisting digital twin
creation, which cannot be achieved by the existing methods.
Methodology
The overall framework of the proposed method for digitalization
of traffic scenes is illustrated in Fig. 1, which consists of three
main modules: (1) digitalization of static road infrastructure by
leveraging UAV and structure from motion, (2) digitalization of
moving road users through road surveillance cameras as well as
applying deep learning and view geometry, and (3) global registra-
tion and merging of digital models of static and dynamic compo-
nents. Because the road infrastructure usually remains constant
over a long period of time, the first module only needs to be per-
formed once unless the key infrastructure elements are retrofitted or
rebuilt, whereas the second and third modules have to be contin-
uously implemented for each recorded frame due to the dynamic
nature of road users. In the following subsections, each module is
explained in detail. The symbols in bold represent homogeneous
vectors; a multiplication sign stands for the cross-product operator.
Digitalization of Static Road Infrastructure Using
Structure from Motion
Structure from motion is employed in the proposed method for the
digitalization of static road infrastructure, which is essentially a
process to simultaneously estimate both the 3D location of scene
structure and camera poses from a series of overlapping photographs
(Szeliski 2010). In general, the framework of structure from motion
consists of five sequential steps: (1) detecting and matching the
salient feature points across different images, e.g., using a scale-
invariant feature transform (Lowe 1999) followed by approximate
nearest neighbors (Indyk and Motwani 1998), (2) calculating the
fundamental matrix and estimating the relative camera pose be-
tween each pair of images, (3) using triangulation to compute
the spatial locations of the point correspondences and populating
them into the 3D point cloud model, (4) exploiting bundle adjust-
ment to further refine all estimates, i.e., the camera poses and 3D
points, with guaranteeing that the global reprojection error is mini-
mized, and (5) removing the scale ambiguity.
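As a concrete illustration of steps (1)-(4), the following is a minimal two-view sketch using OpenCV; the image file names and the camera intrinsics K are placeholder assumptions, and a full pipeline would chain many overlapping views and refine all estimates with bundle adjustment.

```python
# Minimal two-view structure-from-motion sketch (OpenCV), covering steps (1)-(4)
import cv2
import numpy as np

K = np.array([[2400.0, 0, 960], [0, 2400.0, 540], [0, 0, 1]])  # assumed intrinsics

img1 = cv2.imread("aerial_001.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder images
img2 = cv2.imread("aerial_002.jpg", cv2.IMREAD_GRAYSCALE)

# Step 1: detect and match SIFT features (ratio test prunes ambiguous matches)
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.7 * n.distance]
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# Step 2: estimate the relative camera pose via the essential matrix (RANSAC)
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

# Step 3: triangulate inlier correspondences into 3D points
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
pts3d = (pts4d[:3] / pts4d[3]).T  # up to scale; step (5) resolves the ambiguity
print(pts3d.shape)
```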
It is worth pointing out that an overlap of 50%-80% between the
adjacent aerial images is required when using structure from mo-
tion in order to achieve a reliable camera pose estimation (Khaloo
and Lattanzi 2017). Thus, the UAV flight configurations regarding
flight speed and altitude need to be properly designed prior to the
surveying mission. On the other hand, to eliminate the adverse
impact of moving vehicles on the reconstruction accuracy of road
infrastructure (Lee et al. 2022), the vehicles are manually annotated
and masked in the recorded aerial images before being passed on to
structure from motion.
Fig. 2(a) displays an example of the 3D point cloud model and
camera poses that are recovered from the vehicle-free aerial images
with the aid of structure from motion. It can be seen that the re-
constructed 3D point cloud model is too sparse to provide desired
semantic information such as road markings and lane dividers, for
the development of the digital twin of road infrastructure. Thus, a
dense and photorealistic 3D point cloud model is ultimately gen-
erated by applying clustering views for multiview stereo
(CMVS) to the sparse counterpart, a patch-based algorithm
capable of clustering overlapping images and supplying additional
semantic cloud points to the 3D model (Furukawa
et al. 2010). The densified 3D point cloud model is displayed in
Fig. 2(b).
After 3D reconstruction, semantic segmentation is conducted to
identify the road features in the 3D point cloud model of road infra-
structure, which can be subsequently used to assist the analysis and
understanding of traffic data collected from the surveillance cam-
eras. In this study, the semantic segmentation is manually done by
leveraging the software CloudCompare version 2.7.0, which is an
open-source tool for working with 3D point clouds and meshes.
Particularly, the instances in the road infrastructure scenes are di-
vided into nine classes: road, sidewalk, crosswalk, road marking,
lane division, traffic sign, lamp pole, traffic light, and others. Fig. 3
represents an example of the semantic segmentation result of the
road infrastructure model.
Fig. 1. Overall framework of proposed method for digitalization of traffic scenes.

Fig. 2. Road infrastructure reconstruction using structure from motion: (a) sparse 3D point cloud model; and (b) dense 3D point cloud model.

Digitalization of Moving Road Users from Surveillance Cameras

Road User Instance Segmentation Using Cascade Mask R-CNN
As the first step of the digitalization of road users, Cascade Mask region-based convolutional neural network (R-CNN) is utilized to detect the road user instances in the recorded frames from the surveillance cameras. The architecture of the Cascade Mask R-CNN is depicted in Fig. 4.
Different from the other deep learning architectures, the
Cascade Mask R-CNN employs a set of cascaded detectors for
object detection (Cai and Vasconcelos 2019). In particular, the
cascaded detectors are trained sequentially; namely, the output
(i.e., regressed bounding boxes) of the previous detector with a
certain intersection over union (IoU) threshold is used as a set of
fine-tuned proposals for the next detector with a higher IoU threshold.
Meanwhile, the regions of interest (ROIs) in the following detectors
are updated using the refined proposals and the shared feature maps
from the backbone. This is primarily motivated by the observation that
the output IoU of a regressor is almost invariably better than the
input IoU (Cai and Vasconcelos 2019). It has been proven that such
a resampling mechanism can effectively address the overfitting
problem during training and eliminate quality mismatches at infer-
ence, thus contributing to the outperformance of Cascade Mask
R-CNN over most existing deep learning architectures for ob-
ject detection and instance segmentation. In this study, the IoU
thresholds were specified as 0.5, 0.6, and 0.7 for different detectors,
respectively.
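For readers who wish to reproduce this step, Cascade Mask R-CNN implementations are available in open-source toolboxes such as MMDetection; the sketch below shows the typical inference calls, where the config and checkpoint file names are hypothetical placeholders rather than artifacts released with this study.

```python
# Hedged inference sketch using MMDetection, which ships Cascade Mask R-CNN
from mmdet.apis import init_detector, inference_detector

config = "cascade_mask_rcnn_swin_s_roaduser.py"  # hypothetical custom config
checkpoint = "cascade_mask_rcnn_roaduser.pth"    # hypothetical trained weights

model = init_detector(config, checkpoint, device="cuda:0")
result = inference_detector(model, "frame_000123.jpg")  # placeholder frame
# In MMDetection 2.x, `result` is a (bbox_results, segm_results) pair per class;
# the instance masks feed the 3D bounding box construction described next.
```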
Construction of Road User 3D Bounding Boxes
Once the road users' instances are obtained, their 3D spatial infor-
mation can be retrieved in the 2D images in the form of 3D bound-
ing boxes. The approach for 3D bounding box construction
has been developed in a previous study (Lu and Dai 2023), which
consists of two consecutive steps: vanishing point estimation and
3D bounding box construction. To make this paper self-contained,
the previously developed approach is briefly illustrated herein. For
details, interested readers may refer to Lu and Dai (2023).
The developed 3D bounding box construction approach entails
the identification of three orthogonal vanishing points in the traffic
scenes. To this end, a random sample consensus (RANSAC)-based
method was proposed for vanishing point estimation, and the cor-
responding pseudocode is shown in Fig. 5. In comparison with
other vanishing point estimation methods, the advantage of the pro-
posed method lies in that it eliminates the need for a trial-and-error
parameter tuning process, thus allowing for easy implementation
across a variety of traffic scenes with little-to-no human input.
In this study, the three orthogonal vanishing points are defined
as follows. The first vanishing point v1 is along the dominant traffic
direction, the second vanishing point v2 is in the direction perpen-
dicular to v1 and parallel to the road surface, and the third vanishing
point v3 is in the direction perpendicular to the road surface, as de-
picted in Fig. 6.
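To make the idea behind Fig. 5 concrete, the following is a minimal RANSAC sketch for estimating one vanishing point from a set of line segments; the angular inlier test and thresholds are illustrative assumptions, not the exact algorithm of Lu and Dai (2023).

```python
# Minimal RANSAC sketch for vanishing point estimation from line segments
import numpy as np

def to_line(seg):
    # Homogeneous line through segment endpoints (x1, y1, x2, y2)
    p1 = np.array([seg[0], seg[1], 1.0])
    p2 = np.array([seg[2], seg[3], 1.0])
    return np.cross(p1, p2)

def angle_residual(vp, seg):
    # Angle between the segment direction and the direction toward the VP
    mid = np.array([(seg[0] + seg[2]) / 2, (seg[1] + seg[3]) / 2])
    d_seg = np.array([seg[2] - seg[0], seg[3] - seg[1]])
    d_vp = vp[:2] / vp[2] - mid if abs(vp[2]) > 1e-9 else vp[:2]
    cos = abs(d_seg @ d_vp) / (np.linalg.norm(d_seg) * np.linalg.norm(d_vp) + 1e-12)
    return np.degrees(np.arccos(np.clip(cos, -1, 1)))

def ransac_vp(segments, iters=500, tol_deg=2.0, rng=np.random.default_rng(0)):
    best_vp, best_inliers = None, []
    for _ in range(iters):
        i, j = rng.choice(len(segments), 2, replace=False)
        vp = np.cross(to_line(segments[i]), to_line(segments[j]))  # hypothesis
        if np.allclose(vp, 0):
            continue  # degenerate sample (parallel or identical segments)
        inliers = [s for s in segments if angle_residual(vp, s) < tol_deg]
        if len(inliers) > len(best_inliers):
            best_vp, best_inliers = vp, inliers
    return best_vp, best_inliers
```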
After the three orthogonal vanishing points are successfully
identified, the 3D bounding box can be constructed for each sur-
veilled road user in the images. Fig. 6 illustrates the pipeline of 3D
bounding box construction under the right camera view. It starts
with generating five lines L1-L5 that are tangent to the road user's
silhouette (i.e., the instance result obtained by the Cascade Mask
R-CNN) and pass through the three vanishing points [Fig. 6(a)],
and then progressively identifies the vertices B1-B8 by making use
of the outcome from the previous steps [Figs. 6(b and c)]. The
construction process can be formulated as follows:
Fig. 5. Algorithm for vanishing point estimation.
Fig. 4. Architecture of Cascade Mask R-CNN. (Images by Linjun Lu.)
Fig. 3. Example of road infrastructure instance segmentation.
\[ B_1 = L_1 \times L_3; \quad B_2 = L_3 \times L_5; \quad B_3 = L_2 \times L_5 \tag{1} \]
\[ L_6 = B_1 \times v_2; \quad L_7 = B_3 \times v_3; \quad L_8 = B_2 \times v_1; \quad L_9 = B_4 \times v_2 \tag{2} \]
\[ B_4 = L_4 \times L_9; \quad B_5 = L_6 \times L_7; \quad B_6 = L_8 \times L_9 \tag{3} \]
\[ L_{10} = B_5 \times v_1; \quad L_{11} = B_6 \times v_3 \tag{4} \]
\[ B_7 = L_1 \times L_{11}; \quad B_8 = L_4 \times L_{10} \tag{5} \]
After the eight vertices are identified, the 3D bounding box is
finally constructed by joining the adjacent vertices pairwise, as
shown in Fig. 6(d). The corresponding pseudocode of 3D bounding
box construction is shown in Fig. 7. For the camera positioned on
the other side of the road, the 3D bounding box can be constructed
in a similar manner as illustrated here.
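Because all quantities are homogeneous, Eqs. (1)-(5) reduce to a chain of cross products: the cross product of two points yields the line through them, and the cross product of two lines yields their intersection. A sketch of this chain is given below; note that Eqs. (2) and (3) as printed define L9 and B4 in terms of each other, so taking L9 as the line through B3 toward v2 is an assumption made here to break that circularity.

```python
# Sketch of Eqs. (1)-(5): 3D bounding box vertices from tangent lines and VPs
import numpy as np

def construct_box_vertices(L1, L2, L3, L4, L5, v1, v2, v3):
    x = np.cross  # point x point -> line; line x line -> point (homogeneous)
    B1, B2, B3 = x(L1, L3), x(L3, L5), x(L2, L5)        # Eq. (1)
    L6, L7, L8 = x(B1, v2), x(B3, v3), x(B2, v1)        # Eq. (2)
    # Assumption: the printed Eqs. (2)-(3) are circular (L9 <-> B4); a line
    # through B3 toward v2 is one plausible reading.
    L9 = x(B3, v2)
    B4, B5, B6 = x(L4, L9), x(L6, L7), x(L8, L9)        # Eq. (3)
    L10, L11 = x(B5, v1), x(B6, v3)                     # Eq. (4)
    B7, B8 = x(L1, L11), x(L4, L10)                     # Eq. (5)
    verts = np.array([B1, B2, B3, B4, B5, B6, B7, B8], dtype=float)
    return verts[:, :2] / verts[:, 2:3]                 # pixel coordinates
```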
Rectification of Constructed 3D Bounding Boxes
As shown in Fig. 8, because of the irregular and somewhat bent
appearances of road users, the locations of the initially drawn tangent
lines (i.e., L1-L5) typically deviate radially toward the centroid of the
road user from their actual locations, resulting in the 3D bounding
boxes constructed from the previous step being slightly smaller in
width and length than those in reality. In order to increase the ac-
curacy of road user localization, it would be wise to make some
efforts to rectify or enlarge the constructed 3D bounding boxes.
The rectification can proceed by re-estimating the locations of
the four bottom vertices B2, B3, B4, and B6 by scaling their coordinates
along the two symmetry axes of the bottom plane of the 3D bound-
ing box.
However, it may be intractable to directly carry out the coordi-
nate scaling in the image plane due to the shape distortion brought
on by projective transformation. For instance, in Fig. 6, the surfaces
of the 3D bounding box are not rectangular in the image, although
the originals are. To overcome this hurdle, a hierarchy of transfor-
mations was developed and applied to remove the projective/affine
distortion in the image plane and scale the vertex coordinates there-
after. The main concept of this hierarchy is illustrated in Fig. 9,
where l = (l_1, l_2, l_3)^T is the vanishing line of the ground plane,
given by l = v_1 × v_2.
Geometrically, the projective distortion can be rectified by trans-
forming the vanishing line l back to its canonical position
l∞ = (0, 0, 1)^T, namely, mapping the image plane π1 to π2 as shown
in Fig. 9. The suitable projective matrix that achieves this transfor-
mation is

\[ H_P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ l_1 & l_2 & l_3 \end{bmatrix} \tag{6} \]

After applying H_P to the image plane π1, one can immediately
verify that l' = H_P^{-T} l = (0, 0, 1)^T, v_1' = H_P v_1 = (v_{1,x}, v_{1,y}, 0)^T,
and v_2' = H_P v_2 = (v_{2,x}, v_{2,y}, 0)^T, as desired. Meanwhile,
the locations of the four bottom vertices on π2 are obtained by
B_i^P = H_P B_i. On top of this, the shape property of the bottom plane
can then be affinely recovered by mapping v_1' to (0, 1, 0)^T and
v_2' to (1, 0, 0)^T, namely, mapping π2 to π3. Correspondingly, the
affine matrix is given by

\[ H_A = \begin{bmatrix} v_{2,x} & v_{1,x} & 0 \\ v_{2,y} & v_{1,y} & 0 \\ 0 & 0 & 1 \end{bmatrix}^{-1} \tag{7} \]
Fig. 7. Algorithm for 3D bounding box construction.
Fig. 8. Effect of bending shapes of road users on 3D bounding box
construction.
Fig. 9. Pipeline of 3D bounding box rectification.
Fig. 6. Pipeline of constructing 3D bounding box: (a) Step 1; (b) Step 2; (c) Step 3; and (d) Step 4. (Images by Linjun Lu.)
Likewise, the locations of the four bottom vertices on π3 are
determined as B_i^A = H_A B_i^P. Once the shape property is recovered,
the coordinates of the four bottom vertices can be easily scaled by
performing a similarity transformation H_S on π3, with

\[ H_S = \begin{bmatrix} s_x & 0 & \Delta t_x \\ 0 & s_y & \Delta t_y \\ 0 & 0 & 1 \end{bmatrix} \tag{8} \]

where (Δt_x, Δt_y)^T = inhomogeneous coordinates of the centroid of
the bottom plane on π3; and s_x and s_y = scaling factors along the x- and
y-directions, respectively. The scaling factors for each type of road
user can be statistically calibrated by comparing the measured
values from original 3D bounding boxes with the ground-truth
counterparts. In summary, the hierarchy of transformations per-
formed on the bottom vertices can be expressed in the form
B_i^S = H_S H_A H_P B_i. After coordinate scaling, the re-estimated loca-
tions of the vertices are reprojected to the original image plane π1 by
B̃_i = H_P^{-1} H_A^{-1} B_i^S. Finally, the 3D bounding box is reconstructed
by following the same construction pipeline as elaborated in Fig. 7
with the use of the updated bottom vertices. The pseudocode of the
described rectification process is summarized in Fig. 10. The sug-
gested scaling factors, which were calibrated from a series of field
comparison tests, are listed in Table 1. Fig. 11 shows some exam-
ples of unrectified and rectified 3D bounding boxes of road users
under different traffic scenes and camera view angles.
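A compact sketch of the full rectification chain B_i^S = H_S H_A H_P B_i is given below, assuming the vanishing points and bottom vertices are available as homogeneous numpy arrays; interpreting (Δt_x, Δt_y) as centering offsets so that scaling occurs about the centroid is one reading of Eq. (8), and the default scaling factors follow the car entry of Table 1.

```python
# Sketch of the rectification chain of Eqs. (6)-(8)
import numpy as np

def rectify_bottom_vertices(B, v1, v2, sx=1.05, sy=1.15):
    l = np.cross(v1, v2)                       # vanishing line of the ground plane
    HP = np.array([[1, 0, 0], [0, 1, 0], list(l)], dtype=float)      # Eq. (6)
    v1p, v2p = HP @ v1, HP @ v2                # mapped to points at infinity
    HA = np.linalg.inv(np.array([[v2p[0], v1p[0], 0],
                                 [v2p[1], v1p[1], 0],
                                 [0,      0,      1]], dtype=float))  # Eq. (7)
    Bp = [HA @ (HP @ b) for b in B]
    Bp = np.array([b / b[2] for b in Bp])      # affine-rectified bottom vertices
    cx, cy = Bp[:, 0].mean(), Bp[:, 1].mean()  # centroid of the bottom plane
    HS = np.array([[sx, 0, cx * (1 - sx)],     # Eq. (8): scale about the
                   [0, sy, cy * (1 - sy)],     # centroid (one reading of
                   [0, 0, 1.0]])               # the offsets Δtx, Δty)
    back = np.linalg.inv(HP) @ np.linalg.inv(HA)   # B̃ = HP^-1 HA^-1 B^S
    Bs = [back @ (HS @ b) for b in Bp]
    return np.array([b / b[2] for b in Bs])    # rectified vertices, image plane
```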
Localization and Trajectory Tracking of Road Users
Once the road user's 3D bounding box is constructed, its location
on the road plane can be represented by the four bottom vertices (or
better, the bottom plane) of its 3D bounding box, after converting
them to the physical coordinates in the real world. Assum-
ing that the road shape is approximately flat and the camera pose
remains fixed, the projective transformation between the image
plane and road plane can be described as follows (Hartley and
Zisserman 2003):
\[ s\, m_i = H M_i \tag{9} \]

where m_i = (x_i, y_i, 1)^T and M_i = (X_i, Y_i, 1)^T (i = 1, 2, ..., n)
are the homogeneous coordinates of the ith pair of point corre-
spondences on the image plane and road plane, respectively;
and H is the 3 × 3 homography matrix that depicts the projective
Fig. 10. Algorithm for 3D bounding box rectification.
Table 1. Scaling factors specified for different types of road users
Direction | Car | Bus | Truck | Pedestrian | Bicyclist | Motorcyclist
Width, s_x | 1.05 | 1.05 | 1.05 | 1.20 | 1.20 | 1.20
Length, s_y | 1.15 | 1.08 | 1.08 | 1.20 | 1.10 | 1.10
Fig. 11. Examples of 3D bounding boxes of road users before and after rectification. (Images by Linjun Lu.)
relationship between the image plane and road plane. The equality
in Eq. (9) is defined up to an arbitrary nonzero scale factor s. This
means that there are only eight independent degrees of freedom
in H. As a consequence, the homography matrix H can be
uniquely determined by specifying at least four point correspond-
ences in a general configuration and solved by using the normalized
direct linear transform (Lu and Dai 2022).
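The following sketch illustrates Eq. (9) with OpenCV, whose findHomography routine implements a normalized DLT; the pixel and road-plane coordinates below are placeholders rather than the field-measured reference points in Fig. 12(a).

```python
# Sketch of Eq. (9): homography estimation and road user localization
import cv2
import numpy as np

img_pts = np.float32([[412, 655], [1507, 642], [300, 980], [1650, 955]])  # pixels
road_pts = np.float32([[0, 0], [17.25, 0], [0, 32.5], [17.25, 32.5]])     # meters

H, _ = cv2.findHomography(img_pts, road_pts, method=0)  # least-squares DLT

# Map a box's four bottom vertices from image to road-plane coordinates
bottom = np.float32([[[820, 710]], [[930, 705]], [[845, 760]], [[955, 753]]])
road_xy = cv2.perspectiveTransform(bottom, H).reshape(-1, 2)
length = np.linalg.norm(road_xy[0] - road_xy[2])  # distance between vertices
print(road_xy, length)
```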
Fig. 12 illustrates an example of exploiting homography trans-
formation for road user localization. The reference points selected
for homography estimation are shown in Fig. 12(a), whose pixel
coordinates and physical coordinates are manually measured from
the image plane and road plane, respectively. By making use of the
homography matrix, each road user's physical location is retrieved
by mapping the four bottom vertices from the image coordinates to
the physical coordinates, as shown in Figs. 12(b and c).
Meanwhile, the length and width of road users are obtained by
measuring the distance between the remapped vertices. Together
with the height information, the road user's 3D shape can be fully
recovered, which is a valuable clue for fine-grained vehicle classi-
fication and model recognition (Sochor et al. 2016). The physical
height of the 3D bounding box can be estimated by referring to one
object of known height, such as traffic signs and lampposts, for
which the top and base are imaged (Criminisi et al. 2000). The for-
mula for height estimation can be expressed as
\[ d = \frac{\lVert \tilde{b}_1 - r_2 \rVert \left( \lVert r_1 - r_2 \rVert - \lVert v_3 - r_2 \rVert \right)}{\lVert r_1 - r_2 \rVert \left( \lVert \tilde{b}_1 - r_2 \rVert - \lVert v_3 - r_2 \rVert \right)}\, d_r \tag{10} \]

with

\[ \tilde{b}_1 = \left( \left( (r_2 \times b_2) \times l \right) \times b_1 \right) \times (r_1 \times r_2) \tag{11} \]

where r_1 and r_2 = imaged top and base points of the reference ob-
ject, respectively; b_1 and b_2 = top and base points of the height of
the 3D bounding box, respectively; and d_r = physical height of the
reference object, as annotated in Fig. 12(b).
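A sketch of this height-transfer computation, following the reconstructed Eqs. (10) and (11), is given below; it assumes all points are homogeneous numpy arrays and that the vertical vanishing point v3 is finite.

```python
# Sketch of single-view height estimation, Eqs. (10)-(11) (Criminisi et al. 2000)
import numpy as np

def dist(p, q):
    # Euclidean distance between two homogeneous image points
    return np.linalg.norm(p[:2] / p[2] - q[:2] / q[2])

def estimate_height(r1, r2, b1, b2, v3, l, d_r):
    # Eq. (11): transfer top point b1 onto the reference line via the
    # vanishing line l; u is where line (r2, b2) meets l
    u = np.cross(np.cross(r2, b2), l)
    b1t = np.cross(np.cross(u, b1), np.cross(r1, r2))
    # Eq. (10): ratio of distances along the vertical reference direction
    num = dist(b1t, r2) * (dist(r1, r2) - dist(v3, r2))
    den = dist(r1, r2) * (dist(b1t, r2) - dist(v3, r2))
    return num / den * d_r
```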
In addition to localization in each frame, the Deep Sort tracker
(Wojke et al. 2017) is employed to lock onto identical road users
based on their spatial and appearance information, and continu-
ously track each of them over sequential frames until they leave
the field of view of the camera. Fig. 12(d) illustrates an example
of the localization and trajectory tracking of road users over a se-
quence of frames.
Global Registration and Merging of Digital Models
Upon the completion of Modules 2 and 3, the digital model of the
traffic scene is eventually constructed by merging the digital mod-
els of the road infrastructure and road users. To this end, the co-
ordinate systems of these two digital models have to be aligned.
Typically, the coordinate system alignment can be achieved by per-
forming a 3D rigid-body transformation on one digital model while
taking the other one as the reference. The transformation matrix can
be determined by specifying at least three ground control points
(that is, the points with known 3D geospatial locations in the
Euclidean world system) in both coordinate systems. Fig. 13 pro-
vides an example of the digitalized traffic scene by merging the 3D
point cloud model of road infrastructure in Fig. 2(b) and the road
user model in Fig. 12(d) together after performing coordinate
Fig. 12. Localization of road users by applying homography transformation: (a) reference points; (b) surveilled view; (c) surveilled view after
projective mapping; and (d) spatial locations and trajectories of road users. (Images by Linjun Lu.)
system alignment. For the cases where multiple cameras are de-
ployed in the same traffic scene, the model merging can be progres-
sively conducted for each camera in the same fashion, taking
the coordinate system of the road infrastructure as the reference.
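One standard way to compute such a rigid-body transformation from three or more control point pairs is the SVD-based (Kabsch) solution sketched below; the coordinates shown are placeholders, not the surveyed control points of this study.

```python
# Sketch of 3D rigid-body alignment from >= 3 ground control point pairs
import numpy as np

def rigid_transform(src, dst):
    # Returns R (3x3) and t (3,) such that dst ~= R @ src + t
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflection
    R = Vt.T @ D @ U.T
    return R, c_dst - R @ c_src

# Placeholder control points in the two model coordinate systems (meters)
src = np.array([[0.0, 0.0, 0.0], [17.25, 0.0, 0.0], [0.0, 32.5, 0.0]])
dst = np.array([[5.1, 2.0, 0.3], [22.3, 2.4, 0.3], [4.6, 34.4, 0.4]])
R, t = rigid_transform(src, dst)
```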
Data Set Establishment and Model Training
A road user data set that consists of 35,401 images at various res-
olutions was established by the authors for model training. In order
to ensure the trained model's ability to generalize, the road user
images were collected from a variety of traffic settings, including
urban, rural, and highway areas, and under varying lighting con-
ditions. Some images in the developed data set were taken from
several public data sets such as COCO (Lin et al. 2014), Cityscapes
(Cordts et al. 2016), and nuScenes (Caesar et al. 2020). From an
application standpoint and to facilitate the annotation, the
road user instances are grouped into six categories: car, bus, truck,
pedestrian, bicyclist, and motorcyclist. The image annotation task
was completed by VGG Image Annotator version 2.0.12 (Dutta and
Zisserman 2019), which is an open-source image annotation soft-
ware developed by the Visual Geometry Group. To ensure the best
level of instance segmentation accuracy, the road user instances
were precisely labelled by fine-grained polygons in the gathered
images. Some samples of annotated road user images are shown
in Fig. 14.
The annotated images were randomly split into training and val-
idation sets at a ratio of 8:2 for model training. Four widely used
backbone networks, that is, ResNet-50, ResNet-101, Swin-T, and
Swin-S (He et al. 2016;Liu et al. 2021b), were chosen as the can-
didate backbones of the Cascade Mask R-CNN for feature extrac-
tion, resulting in four different Cascade Mask R-CNN models.
These four Cascade Mask R-CNN models were separately trained
on the customized data set, and COCO evaluation metrics (Lin et al.
2014) were adopted to evaluate the performance of each trained
model. The training task was conducted on the Amazon Web
Services (AWS) platform with a single GPU (NVIDIA Tesla V100).
The hyperparameters, that is, momentum, weight decay, and batch
size, were set to 0.9, 0.0001, and 2 for all models; the learning rate was
set to 0.001 and 0.0001 for ResNet- and Swin-based models, re-
spectively. The training data set was augmented during the training
process by implementing horizontal image flipping.
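For concreteness, the reported optimizer settings can be expressed as a standard PyTorch SGD configuration, as sketched below with a placeholder module standing in for the detector.

```python
# The reported training hyperparameters as a PyTorch SGD optimizer
import torch

model = torch.nn.Conv2d(3, 8, 3)  # placeholder for the Cascade Mask R-CNN
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.001,              # 0.0001 was used for the Swin-based models
    momentum=0.9,
    weight_decay=0.0001,
)
```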
Table 2 presents a comparison of the average precision (AP) and
mean average precision (mAP) performance of the trained models
with different backbones on the validation data set (the bold value
represents the best one in each column). Specifically, mAP50 and
mAP75 stand for the mAP at IoU thresholds of 0.5 and 0.75, respectively;
and APS, APM, and APL represent the AP on small, medium, and large
objects, respectively (Lin et al. 2014). As can be seen, Swin-S out-
performs the other three backbones regarding all the mAPs and
APs. Therefore, Swin-S was chosen as the backbone in this study.
Performance Validation
Experimental Setup and Implementation Details
The field experiment was carried out at a road intersection near
the Evansdale campus of West Virginia University to evaluate the
performance of the proposed method for constructing the digital
models of the traffic scenes. Two commercial cameras (Panasonic
HC-W580K, Kadoma, Japan) were employed to surveil two differ-
ent regions of the intersection, both at full high-definition (HD)
resolution (1,920 × 1,080 pixels) and a frame rate of 30 frames per
second (fps). Each camera was rigidly mounted on one lamppost
with an elevated height of approximately 6 m above the road sur-
face. The camera locations as well as the surveilled traffic scene are
illustrated in Fig. 15.
A commercial-level UAV (DJI Phantom 4 Pro+ V2.0, Shenzhen,
China) equipped with a high-resolution camera was utilized to scan
the road intersection for 3D model reconstruction. The resolution of
the onboard camera is 3,840 × 2,160 pixels, and the focal length of
the lens is 35 mm. The flight speed and altitude were set to around
1 m/s and 10-15 m, respectively. Because of the measurement uncer-
tainty in practical applications, using more point correspondences
can help achieve higher accuracy for homography estimation, but at
the expense of taking more effort for ground control point setting.
After taking a trade-off between accuracy and complexity, a total of
four and five ground control points were finally arranged on the
road surface for homography matrix estimation for Cameras 1
and 2, respectively. Moreover, an additional two ground control
points were also set on the road surface and adopted in conjunction
with the abovementioned nine for coordinate system alignment.
The locations of these 11 ground control points are annotated in
Fig. 15, whose geospatial coordinates were precisely measured
from the total station.
The localization accuracy of road users would undoubtedly be
an essential metric for the performance evaluation of the proposed
method. Therefore, the ground-truth locations of road users have to
be known a priori. To this end, the DJI UAV was employed to
Fig. 13. Example of digitalized traffic scene by merging digital models of static road infrastructure and moving road users.
continuously monitor the same traffic scene in a bird's-eye view
during the data acquisition of the surveillance cameras. By aligning
the optical axis of the onboard camera perpendicular to the road
surface, the appeared road users can be easily and accurately lo-
cated in the aerial images via 2D bounding boxes, as illustrated in
Fig. 16.
The UAV was set in a hover mode with a flight altitude of approximately
25 m. Under such a flight configuration, the field of view of the
UAV was about 71 × 40 m. Accordingly, the accuracy level of
the UAV for road user localization can be roughly computed as
7,100/3,840 ≈ 1.85 cm/pixel. As such, the road users'
locations estimated from the UAV can safely serve as ground
truths for comparison purposes. By leveraging the same ground
control points as shown in Fig. 15, the projective relationship be-
tween the ground plane and the image plane of the onboard camera
can be determined through Eq. (9), allowing for the conversion of
the road users' locations from pixel units to physical units.
Although the UAV used flight stabilization control during record-
ing, the self-motion resulting from wind could not be avoided.
Thus, it was necessary to conduct the homography estimation
for each recorded aerial frame. The clocks of the DJI UAV and
Panasonic cameras were synchronized with the Internet time
server.
Performance Metrics and Experimental Results
A total of 331 aerial images and two 30-min surveillance videos
were acquired from the field experiment and used for the digitali-
zation of the traffic scene. The data processing was conducted on
the AWS platform (GPU: NVIDIA Tesla V100), which can reach an average
processing speed of 9.2 fps for the digitalization of traffic scenes.
Fig. 17 illustrates the digitalized traffic scene at a certain time point.
As can be seen, the road infrastructure and the road users within
two surveilled traffic regions were successfully digitalized by the
proposed method.
To quantitatively evaluate the performance of the digitalized
traffic scene model, three performance metrics were adopted, that
Fig. 14. Samples of annotated images in customized road user data set. (Images by Linjun Lu.)
Table 2. Performance of Cascade Mask R-CNN with different backbones
Backbone | mAP | mAP50 | mAP75 | APS | APM | APL
ResNet-50 | 15.0 | 30.0 | 13.1 | 4.6 | 12.8 | 25.4
ResNet-101 | 33.3 | 58.9 | 32.5 | 14.0 | 32.8 | 49.4
Swin-T | 35.6 | 60.7 | 35.8 | 17.7 | 33.6 | 51.7
Swin-S | 35.7 | 61.0 | 35.9 | 17.9 | 33.7 | 51.8
Fig. 15. Overview of field experimental setup. (Images by Linjun Lu.)
is, the road user instance segmentation performance, the geometric ac-
curacy of the digitalized road infrastructure model, and the localization
and dimension estimation accuracy of digitalized road users.
Traffic data such as speed and acceleration were not evaluated in this
study because they are directly derived from the location estima-
tion results. Furthermore, a quantitative evaluation of the dig-
ital model of the traffic scene as a whole should be based on specific traffic
applications, because different applications may have varying accu-
racy requirements for a particular type of traffic data; such an evaluation was
thus not carried out at the current stage of this study. A demo
of the proposed method for digitalizing the dynamic traffic scene
can be found online (ICIL 2022).
Road User Instance Segmentation Performance
In order to examine the effectiveness of the trained Cascade Mask
R-CNN model for road user detection and instance segmentation,
which is a critical indicator to assess the reliability of the digitalized
traffic scene, a total of 360 frames were extracted from two re-
corded surveillance videos with a resampling rate of 5 s and used
for model testing. Fig. 18 presents the detection results of different
types of road users in the form of a confusion matrix. Specifically,
each row of the confusion matrix represents the instances in true
classes, whereas each column represents the instances in predicted
classes. It turns out that the precision and recall for the detection of
each type of road user were above 90.0% and 85.7%, respectively.
It was also found that all misclassifications of road users
appeared to happen among the cars, buses, and trucks. This is
primarily attributed to their similar appearances, which make it
difficult for the trained deep learning model to distinguish them
correctly. On the other hand, after checking the raw images, the
misdetection of road users typically occurred when they were
heavily obscured by others. A viable solution to tackle the misclas-
sification and misdetection issues would be to introduce more
images for deep learning model training and install the cameras
at a greater height so that the level of occlusion can be largely
minimized.
Geometric Accuracy of Digitalized Road Infrastructure
Model
The geometric accuracy of the digitalized road infrastructure
model was assessed by comparing the dimensions of the digitalized
objects and their actual values in reality. To this end, eight distinc-
tive natural markers distributed over the traffic scene were selected
as the checking points. The locations of these eight checking points
are annotated in Fig. 19. Accordingly, the Euclidean distance be-
tween each pair of checking points was measured in both the
digitalized road infrastructure model and the real world. One
laser distance meter [Bosch (Gerlingen, Germany) Blaze Pro
GLM400 CL] was employed for actual distance acquisition, which
has a measurement accuracy of 2 mm.
Fig. 16. Example of road user localization using UAV. (Images by Linjun Lu.)
Fig. 17. Example of digitalized traffic scene in field experiment. (Images by Linjun Lu.)
Table 3 summarizes the average relative dimension error of each
checking point to the others in comparison with the real counter-
part. It can be seen that the average relative dimension error asso-
ciated with each checking point was no more than 0.29% and the
overall relative dimension error was 0.25%. According to Smith
and Vericat (2015), the relative dimension error of a qualifying
digitalized model is expected to be less than 0.3%, indicating that
the digitalized road infrastructure model by the proposed method
has a satisfactory level of geometric accuracy.
Localization and Dimension Estimation Accuracy of
Digitalized Road Users
Table 4 lists the statistics of localization errors of different types of
road users by the proposed method in comparison with the UAV-
obtained ground truths. The localization error of each object was
computed by averaging the differences between the measured lo-
cations of the four bottom vertices and their actual values. The appa-
rently false 3D bounding boxes of road users brought by occlusion
were excluded for performance validation. It is safe to do this be-
cause the occlusion issue can be easily tackled by setting additional
surveillance cameras to surveil the same traffic area from different
perspectives in parallel.
The results showed that the mean and maximum localization
errors were 34.5 and 50.3, 38.0 and 55.5, 48.7 and 77.4, 15.6
and 35.2, 18.2 and 34.3, and 19.8 and 32.4 cm for cars, buses, trucks,
pedestrians, bicyclists, and motorcyclists, respectively. Additionally,
the standard deviation of localization error for these six types of
road users was 10.3, 8.7, 13.8, 6.6, 5.2, and 6.9 cm, respectively.
It is evident that the highest localization accuracy can be achieved
for pedestrians, which is mostly due to the fact that the pedestrians
occupy less road space, allowing them to be more easily and pre-
cisely located on the road surface.
The statistics of road user dimension estimation errors are sum-
marized in Table 5. To facilitate the comparison, the relative esti-
mation errors are provided. The ground-truth heights of cars, buses,
and trucks were obtained from manual measurement or by referring
to the online dimension information provided by the vehicle man-
ufacturers, whereas the ground-truth heights of pedestrians were
acquired from onsite inquiry. However, because there was un-
fortunately no means in the field experiment that could be used to
obtain the ground-truth heights of bicyclists and motorcyclists, the
relative height estimation errors with respect to these two types of
road users were not listed herein.
It can be seen that the dimension estimation accuracy of cars,
buses, and trucks was much better than those of pedestrians, bicy-
clists, and motorcyclists. It is probably because the former three
types of road users have comparatively more regular shapes than
those of the latter three. Particularly, the pedestrians had the largest
relative dimension estimation error, which was primarily attributed
to their much more irregular appearances than those of other types
Fig. 18. Confusion matrix of road user detection results.
Table 3. Statistics of relative dimension error (%)
Point no. | Relative error
1 | 0.27
2 | 0.19
3 | 0.24
4 | 0.28
5 | 0.29
6 | 0.26
7 | 0.23
8 | 0.19
All | 0.25
Table 4. Statistics of road user localization errors (cm)
Road user | Mean | Maximum | Standard deviation
Car | 34.5 | 50.3 | 10.3
Bus | 38.0 | 55.5 | 8.7
Truck | 48.7 | 77.4 | 13.8
Pedestrian | 15.6 | 35.2 | 6.6
Bicyclist | 18.2 | 34.3 | 5.2
Motorcyclist | 19.8 | 32.4 | 6.9
Table 5. Statistics of road user relative dimension estimation errors (%)
Road user | Length | Width | Height
Car | 6.5 | 9.2 | 3.9
Bus | 4.1 | 7.7 | 3.6
Truck | 5.3 | 8.4 | 6.7
Pedestrian | 30.6 | 28.3 | 17.1
Bicyclist | 15.6 | 19.6 | N/A
Motorcyclist | 13.6 | 12.9 | N/A
Fig. 19. Location of checking points used for geometrical accuracy
evaluation.
of road users, making it challenging for the proposed method to
construct the 3D bounding boxes for them accurately. This would
be a critical concern when applying the proposed method in safety-
related applications because pedestrians are the most vulnerable en-
tities in traffic scenes. To address this issue, the source of the error
of the proposed method in dimension estimation should be well
understood, quantified, and minimized in future work, so that
error compensation efforts can be made to enhance the performance
of the proposed method for road user localization.
Discussion and Future Work
The digital models of the traffic scenes have great potential to
be employed in numerous intelligent transportation applications.
The most significant application is the infrastructure-vehicle co-
operative autonomous driving. Automated vehicles may suffer from un-
reliable and insufficient perception due to occlusions and complex
traffic conditions (particularly in crowded urban intersections),
which poses safety risks to the automated vehicles and passengers.
The digital models of the traffic scenes provide a promising solu-
tion to overcome this issue by providing the automated vehicles
with a global view of the current traffic situation as well as the com-
plementary position information of the surrounding road users.
According to a qualitative analysis revealed by Williams and Barth
(2020), in order to ensure that the safety-critical autonomous driving
modules (e.g., collision warning and prediction) function correctly,
lane-level positioning accuracy (90 cm) is required for road user
positioning. It is worth pointing out that the maximum localization
error by the proposed method was 77.4 cm, which happened in the
case of truck localization. Therefore, it indicates that the proposed
method has great potential to be exploited to enhance the fields of
view of automated vehicles and assist them in making safer and
more efficient driving decisions. Similarly, legacy vehicles can also
benefit from this application if communication between the drivers
and visual surveillance systems is established.
Meanwhile, the digital model of the traffic scene can effectively
simulate the real traffic situation and provide an ideal test ground
for the development and optimization of automated driving algo-
rithms for automated vehicles, which are a crucial part of intelligent
transportation systems. Moreover, the digital models of the traffic
scenes also offer a platform for studying the behavior of different
road users. Following that, this information can be used to inves-
tigate traffic safety and create or modify policies to increase traffic
efficiency.
Additionally, the digital models of the traffic scenes could
serve as a tool for overheight collision prevention, because the
height of passing vehicles and the clearance of infrastructures
can be easily obtained in the digital model. Furthermore, the digital
model of road infrastructure provides 3D road surface information
that can be used for pavement distress detection and condition
maintenance, which cannot be acquired from a 2D road infra-
structure map.
On top of the current study, there are still some tasks that should
be carried out in the future. First, in order to fully cover the traffic
scene, multiple cameras are needed to be deployed at different lo-
cations and work with each other to monitor the relevant traffic
scene in parallel. Thus, it is necessary to take some efforts to de-
velop a feasible strategy to merge or remove the duplicate entities
within the overlapped monitoring zones.
Second, it is also desirable to conduct investigations for trajec-
tory prediction of road users based on the digital models of the
traffic scenes, which is of vital significance for collision warnings
and automated driving. To fit this purpose, some deep learning
techniques, such as the long short-term memory network (Kim
et al. 2017) and Social generative adversarial network (GAN)
(Gupta et al. 2018), can be exploited to function in conjunction with
the trajectory tracking module (Deep Sort) for trajectory prediction.
Third, to promote the adoption of digital twins in support of the
development of intelligent transportation systems, a more fine-
grained digital model of the traffic scene is sometimes required.
To this end, additional cutting-edge sensors and image processing
technologies will be incorporated into the proposed method to help
digitalize and introduce more real-world information into the dig-
ital models, such as license plate numbers, physical conditions of
pedestrians, weather conditions, and so on.
Fourth, the class imbalance problem in the established road
user data set should also be appropriately addressed, as it may
cause the trained deep learning model to exhibit a bias towards
the majority classes (e.g., car and pedestrian) due to their increased
prior probability. To address this hurdle, one commonly used strat-
egy is to randomly discard the majority class samples or replicate
the minority class samples to balance the class distribution on the
training data set (Van Hulse et al. 2007). Another alternative strat-
egy to handle the class imbalance is to take the class penalty and
weight into the consideration when designing the loss function for
model training (Johnson and Khoshgoftaar 2019). In the future
work, comprehensive experiments will be conducted to compare
the performance of different strategies to deal with the class imbal-
ance problem in the developed road user data set.
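As a sketch of the second strategy, inverse-frequency class weights can be passed to a standard cross-entropy loss so that minority classes contribute more per sample; the class counts below are illustrative assumptions, not the data set's actual distribution.

```python
# Sketch of class-weighted loss for handling class imbalance
import torch

counts = torch.tensor([18000.0, 900.0, 1400.0, 11000.0, 800.0, 600.0])  # assumed
weights = counts.sum() / (len(counts) * counts)   # inverse-frequency weights
criterion = torch.nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, 6)            # dummy predictions: 4 samples, 6 classes
labels = torch.tensor([0, 4, 5, 1])   # dummy ground-truth class indices
loss = criterion(logits, labels)
```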
Conclusions
This study proposed a vision-based method for the digitalization of
traffic scenes by leveraging UAV and roadside surveillance cam-
eras. The performance of the proposed method was evaluated
through a field experiment conducted at a road intersection. The
experiment results showed that the traffic scene can be successfully
digitalized by the proposed method with promising accuracy. The
contribution of this study lies in providing a cost-efficient tool for
the development of the digital twins of road transportation in sup-
port of intelligent transportation applications. Also, the proposed
method is scalable, namely, the coverage of the digital model
can be easily extended by adding new cameras to the surveillance
network. The potential applications of the proposed method include
infrastructure-vehicle cooperative autonomous driving, collision
warning and prediction, road user behavior analysis and under-
standing, and so on. However, the proposed method is still in its
infancy; thus, further investigations and performance testing need
to be conducted in the future to promote its applications in real-
world traffic scenes.
Data Availability Statement
The annotations for the image data sets and the created models that
support the findings of this study are available from the correspond-
ing author upon reasonable request.
Acknowledgments
This work was sponsored by a grant from the Center for Integrated
Asset Management for Multimodal Transportation Infrastructure
Systems (CIAMTIS), a US Department of Transportation University
Transportation Center, under federal Grant No. 69A3551847103.
The authors are grateful for the support. Any opinions, findings,
conclusions, and recommendations expressed in this paper are
those of the authors and do not necessarily reflect the views of
the CIAMTIS.
References
Aryan, A., F. Bosché, and P. Tang. 2021. "Planning for terrestrial laser scanning in construction: A review." Autom. Constr. 125 (May): 103551. https://doi.org/10.1016/j.autcon.2021.103551.
Bao, L., Q. Wang, and Y. Jiang. 2021. "Review of digital twin for intelligent transportation system." In Proc., Int. Conf. on Information Control, Electrical Engineering and Rail Transit (ICEERT), 309–315. New York: IEEE.
Bhatti, M. T., M. G. Khan, M. Aslam, and M. J. Fiaz. 2021. "Weapon detection in real-time CCTV videos using deep learning." IEEE Access 9 (Feb): 34366–34382. https://doi.org/10.1109/ACCESS.2021.3059170.
Caesar, H., V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom. 2020. "nuScenes: A multimodal dataset for autonomous driving." In Proc., IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 11621–11631. New York: IEEE.
Cai, Z., and N. Vasconcelos. 2019. "Cascade R-CNN: High quality object detection and instance segmentation." IEEE Trans. Pattern Anal. Mach. Intell. 43 (5): 1483–1498. https://doi.org/10.1109/TPAMI.2019.2956516.
Chen, J., Z. Kira, and Y. K. Cho. 2019. "Deep learning approach to point cloud scene understanding for automated scan to 3D reconstruction." J. Comput. Civ. Eng. 33 (4): 04019027. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000842.
Cordts, M., M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. 2016. "The cityscapes dataset for semantic urban scene understanding." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 3213–3223. New York: IEEE.
Criminisi, A., I. Reid, and A. Zisserman. 2000. "Single view metrology." Int. J. Comput. Vision 40 (2): 123–148. https://doi.org/10.1023/A:1026598000963.
Dutta, A., and A. Zisserman. 2019. "The VIA annotation software for images, audio and video." In Proc., 27th ACM Int. Conf. on Multimedia, 2276–2279. New York: Association for Computing Machinery.
El Marai, O., T. Taleb, and J. Song. 2020. "Roads infrastructure digital twin: A step toward smarter cities realization." IEEE Network 35 (2): 136–143. https://doi.org/10.1109/MNET.011.2000398.
Furukawa, Y., B. Curless, S. M. Seitz, and R. Szeliski. 2010. "Towards internet-scale multi-view stereo." In Proc., 2010 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, 1434–1441. New York: IEEE.
Gao, Y., S. Qian, Z. Li, P. Wang, F. Wang, and Q. He. 2021. "Digital twin and its application in transportation infrastructure." In Proc., 2021 IEEE 1st Int. Conf. on Digital Twins and Parallel Intelligence (DTPI), 298–301. New York: IEEE.
Giannakeris, P., V. Kaltsa, K. Avgerinakis, A. Briassouli, S. Vrochidis, and I. Kompatsiaris. 2018. "Speed estimation and abnormality detection from surveillance cameras." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition Workshops, 93–99. New York: IEEE.
Gupta, A., J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi. 2018. "Social GAN: Socially acceptable trajectories with generative adversarial networks." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 2255–2264. New York: IEEE.
Hartley, R., and A. Zisserman. 2003. Multiple view geometry in computer vision. Cambridge, UK: Cambridge University Press.
He, K., X. Zhang, S. Ren, and J. Sun. 2016. "Deep residual learning for image recognition." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 770–778. New York: IEEE.
Hu, C., W. Fan, E. Zeng, Z. Hang, F. Wang, L. Qi, and M. Z. A. Bhuiyan. 2021. "Digital twin-assisted real-time traffic data prediction method for 5G-enabled internet of vehicles." IEEE Trans. Ind. Inf. 18 (4): 2811–2819. https://doi.org/10.1109/TII.2021.3083596.
Hui, Y., Q. Wang, N. Cheng, R. Chen, X. Xiao, and T. H. Luan. 2021. "Time or reward: Digital-twin enabled personalized vehicle path planning." In Proc., 2021 IEEE Global Communications Conf. (GLOBECOM), 1–6. New York: IEEE.
ICIL (Integrated Construction Informatics Laboratory). 2022. "Digital twinning of traffic scenes." Accessed September 29, 2022. https://www.youtube.com/watch?v=tDywTX8pEoY.
Indyk, P., and R. Motwani. 1998. "Approximate nearest neighbors: Towards removing the curse of dimensionality." In Proc., 30th Annual ACM Symp. on Theory of Computing, 604–613. New York: Association for Computing Machinery.
Johnson, J. M., and T. M. Khoshgoftaar. 2019. "Survey on deep learning with class imbalance." J. Big Data 6 (27): 1–54.
Kar, A., S. Tulsiani, J. Carreira, and J. Malik. 2015. "Category-specific object reconstruction from a single image." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 1966–1974. New York: IEEE.
Ke, L., S. Li, Y. Sun, Y.-W. Tai, and C.-K. Tang. 2020. "GSNet: Joint vehicle pose and shape reconstruction with geometrical and scene-aware supervision." In Proc., European Conf. on Computer Vision, 515–532. New York: Springer.
Khaloo, A., and D. Lattanzi. 2017. "Hierarchical dense structure-from-motion reconstructions for infrastructure condition assessment." J. Comput. Civ. Eng. 31 (1): 04016047. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000616.
Kim, B., C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi. 2017. "Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network." In Proc., 2017 IEEE 20th Int. Conf. on Intelligent Transportation Systems (ITSC), 399–404. New York: IEEE.
Kumar, S. A., R. Madhumathi, P. R. Chelliah, L. Tao, and S. Wang. 2018. "A novel digital twin-centric approach for driver intention prediction and traffic congestion avoidance." J. Reliab. Intell. Environ. 4 (4): 199–209. https://doi.org/10.1007/s40860-018-0069-y.
Lee, S., S. Kim, and S. Moon. 2022. "Development of a car-free street mapping model using an integrated system with unmanned aerial vehicles, aerial mapping cameras, and a deep learning algorithm." J. Comput. Civ. Eng. 36 (3): 04022003. https://doi.org/10.1061/(ASCE)CP.1943-5487.0001013.
Lin, T. Y., M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. 2014. "Microsoft COCO: Common objects in context." In Proc., European Conf. on Computer Vision, 740–755. New York: Springer.
Liu, S., B. Yu, J. Tang, and Q. Zhu. 2021a. "Towards fully intelligent transportation through infrastructure-vehicle cooperative autonomous driving: Challenges and opportunities." In Proc., 2021 58th ACM/IEEE Design Automation Conf. (DAC), 1323–1326. New York: IEEE.
Liu, Z., Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. 2021b. "Swin transformer: Hierarchical vision transformer using shifted windows." In Proc., IEEE/CVF Int. Conf. on Computer Vision, 10012–10022. New York: IEEE.
Lowe, D. G. 1999. "Object recognition from local scale-invariant features." In Proc., 7th IEEE Int. Conf. on Computer Vision, 1150–1157. New York: IEEE.
Lu, L., and F. Dai. 2022. "A unified normalization method for homography estimation using combined point and line correspondences." Comput.-Aided Civ. Infrastruct. Eng. 37 (8): 1010–1026. https://doi.org/10.1111/mice.12788.
Lu, L., and F. Dai. 2023. "Automated visual surveying of vehicle heights to help measure the risk of overheight collisions using deep learning and view geometry." Comput.-Aided Civ. Infrastruct. Eng. 38 (2): 194–210. https://doi.org/10.1111/mice.12842.
Lv, Z., Y. Li, H. Feng, and H. Lv. 2021. "Deep learning for security in digital twins of cooperative intelligent transportation systems." IEEE Trans. Intell. Transp. Syst. 23 (9): 16666–16675. https://doi.org/10.1109/TITS.2021.3113779.
Mottaghi, R., Y. Xiang, and S. Savarese. 2015. "A coarse-to-fine model for 3D pose estimation and sub-category recognition." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 418–426. New York: IEEE.
Muller, K., A. Smolic, M. Drose, P. Voigt, and T. Wiegand. 2005. "3-D reconstruction of a dynamic environment with a fully calibrated background for traffic scenes." IEEE Trans. Circuits Syst. Video Technol. 15 (4): 538–549. https://doi.org/10.1109/TCSVT.2005.844452.
Niesen, U., and J. Unnikrishnan. 2020. "Camera-radar fusion for 3-D depth reconstruction." In Proc., 2020 IEEE Intelligent Vehicles Symp. (IV), 265–271. New York: IEEE.
Nikouei, S. Y., Y. Chen, S. Song, R. Xu, B.-Y. Choi, and T. R. Faughnan. 2018. "Real-time human detection as an edge service enabled by a lightweight CNN." In Proc., 2018 IEEE Int. Conf. on Edge Computing (EDGE), 125–129. New York: IEEE.
Pan, Y., N. Wu, T. Qu, P. Li, K. Zhang, and H. Guo. 2021. "Digital-twin-driven production logistics synchronization system for vehicle routing problems with pick-up and delivery in industrial park." Int. J. Comput. Integr. Manuf. 34 (7–8): 814–828. https://doi.org/10.1080/0951192X.2020.1829059.
Reddy, N. D., M. Vo, and S. G. Narasimhan. 2018. "CarFusion: Combining point tracking and part detection for dynamic 3D reconstruction of vehicles." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 1906–1915. New York: IEEE.
Sharma, R., and A. Sungheetha. 2021. "An efficient dimension reduction based fusion of CNN and SVM model for detection of abnormal incident in video surveillance." J. Soft Comput. Paradigm 3 (2): 55–69. https://doi.org/10.36548/jscp.2021.2.001.
Smith, M. W., and D. Vericat. 2015. "From experimental plots to experimental landscapes: Topography, erosion and deposition in sub-humid badlands from structure-from-motion photogrammetry." Earth Surf. Processes Landforms 40 (12): 1656–1671. https://doi.org/10.1002/esp.3747.
Sochor, J., A. Herout, and J. Havel. 2016. "BoxCars: 3D boxes as CNN input for improved fine-grained vehicle recognition." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 3006–3015. New York: IEEE.
Strigel, E., D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer. 2014. "The Ko-PER intersection laserscanner and video dataset." In Proc., 17th Int. IEEE Conf. on Intelligent Transportation Systems (ITSC), 1900–1901. New York: IEEE.
Su, H., C. R. Qi, Y. Li, and L. J. Guibas. 2015. "Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views." In Proc., IEEE Int. Conf. on Computer Vision, 2686–2694. New York: IEEE.
Szeliski, R. 2010. Computer vision: Algorithms and applications. Berlin: Springer Science & Business Media.
Van Hulse, J., T. M. Khoshgoftaar, and A. Napolitano. 2007. "Experimental perspectives on learning from imbalanced data." In Proc., 24th Int. Conf. on Machine Learning, 935–942. New York: Association for Computing Machinery.
Williams, N., and M. Barth. 2020. "A qualitative analysis of vehicle positioning requirements for connected vehicle applications." IEEE Intell. Transp. Syst. Mag. 13 (1): 225–242. https://doi.org/10.1109/MITS.2019.2953521.
Wojke, N., A. Bewley, and D. Paulus. 2017. "Simple online and realtime tracking with a deep association metric." In Proc., 2017 IEEE Int. Conf. on Image Processing (ICIP), 3645–3649. New York: IEEE.
Xia, Y., W. Xu, L. Zhang, X. Shi, and K. Mao. 2015. "Integrating 3D structure into traffic scene understanding with RGB-D data." Neurocomputing 151 (Mar): 700–709. https://doi.org/10.1016/j.neucom.2014.05.091.
Yang, B., W. Luo, and R. Urtasun. 2018. "PIXOR: Real-time 3D object detection from point clouds." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 7652–7660. New York: IEEE.
Yang, L., M. Li, X. Song, Z. Xiong, C. Hou, and B. Qu. 2019. "Vehicle speed measurement based on binocular stereovision system." IEEE Access 7 (Jul): 106628–106641. https://doi.org/10.1109/ACCESS.2019.2932120.
Zhang, X., Y. Feng, P. Angeloudis, and Y. Demiris. 2022. "Monocular visual traffic surveillance: A review." IEEE Trans. Intell. Transp. Syst. 23 (9): 14148–14165. https://doi.org/10.1109/TITS.2022.3147770.