Digitalization of Traffic Scenes in Support of Intelligent Transportation Applications

Linjun Lu, S.M.ASCE1; and Fei Dai, M.ASCE2
Abstract: Digitalization of real-world traffic scenes is a fundamental task in development of digital twins of road transportation. However,
the existing digitalization approaches are either expensive in equipment costs or inapplicable to collect granular level data of traffic scenes.
This study proposed a vision-based method for real-time digitalization of traffic scenes through modeling and merging the road infrastructure
(static components) and road users (dynamic components) progressively. Specifically, the former is reconstructed by leveraging unmanned
aerial vehicles (UAVs) and structure from motion; and the latter is digitized via using roadside surveillance videos and a new reconstruction
process through applying deep learning and view geometry. Last, the digital model of the traffic scene is built by merging the digital models of
static and dynamic components. A field experiment was performed to evaluate the performance of the proposed method. The results showed
that the traffic scene can be successfully digitalized by the proposed method with promising accuracy, thus signifying the method's
potential for the development of the digital twins of road transportation in support of intelligent transportation applications.
DOI: 10.1061/JCCEE5.CPENG-5204. © 2023 American Society of Civil Engineers.
Author keywords: Digital twins; Image-based methods; Intelligent transportation; Computer vision; Automated vehicles.
Introduction
Digital twins have been receiving a surge of attention in the traffic
community as a promising and powerful tool for the improvement
of road transportation management and operations (Gao et al. 2021;
Liu et al. 2021a; Lv et al. 2021). A digital twin of road transpor-
tation is a digital replica of an actual traffic system created in a
virtual space, which enables real-time information interaction,
closed-loop data transmission, and adaptive iterative optimization
between the digital and physical counterparts (Bao et al. 2021).
Thanks to recent advances in smart sensing and artificial intelli-
gence, the performance of digital twins has been significantly im-
proved, allowing for analysis, quantification, and modeling of the
intricate hierarchical relationships and interactions among different
traffic entities. By continuously infusing the observed traffic data,
the digital twins of road transportation will progress in intelligence
and fidelity and can make ever-improving predictions about traffic
behaviors and potential problems. This will assist the traffic man-
agement systems in making optimal and timely traffic coordination
decisions on a global scope. As a result, the digital twins of road
transportation have potential to actively contribute to the realization
of intelligent transportation systems (Liu et al. 2021a; Pan et al.
2021; Strigel et al. 2014).
The digitalization of real-world traffic scenes is a fundamental
task for the development of the digital twins of road transportation,
which is a process of converting the valuable information in traffic
scenes into an electronically stored digital format that is ready for
use by transportation managers (Hu et al. 2021). Laser scanning is
the widely used technology in this regard for constructing three-
dimensional (3D) models of structures and scenes. It produces the
3D point clouds by measuring the distance from the laser sensor to
the scanning targets based on the time-of-flight or phase-based
principle (Aryan et al. 2021). However, the adoption of laser scan-
ners is currently limited in traffic monitoring applications because
of high equipment cost. Additionally, the semantic understanding
of traffic scenes remains a challenge in the use of monochromatic
3D point clouds generated by laser scanners, although some research
efforts have been devoted to coping with this issue (Chen et al.
2019).
With advances in electro-optical sensors and computational
capacities, machine vision is regarded as a cost-efficient alternative
for 3D model reconstruction (Szeliski 2010), upon which the se-
mantic information can be feasibly extracted from the recorded
high-resolution images. This study proposed a machine vision-
based method for real-time digitalization of traffic scenes by lever-
aging unmanned aerial vehicles (UAVs) and roadside surveillance
cameras. The method's performance was evaluated through a field
experiment at a road intersection. The results showed that the traffic
scene can be successfully digitalized by the proposed method with
promising accuracy, thus signifying the method's potential for the
development of digital twins of road transportation in support of
intelligent transportation applications.
Background
State of Research in Digital Twins for Intelligent
Transportation Applications
A few studies have been conducted to explore the applicability of
digital twins for intelligent transportation management and opera-
tions. For instance, by leveraging digital twins and machine learn-
ing, Kumar et al. (2018) developed a virtual vehicle model for
driving intention prediction that can be utilized to assist automated
1Graduate Research Assistant, Wadsworth Dept. of Civil and Environ-
mental Engineering, West Virginia Univ., Morgantown, WV 26506. Email:
ll0074@mix.wvu.edu
2Associate Professor, Wadsworth Dept. of Civil and Environmental
Engineering, West Virginia Univ., Morgantown, WV 26506 (corresponding
author). ORCID: https://orcid.org/0000-0002-8868-2821. Email: fei.dai@
mail.wvu.edu
Note. This manuscript was submitted on September 29, 2022; approved
on February 1, 2023; published online on May 19, 2023. Discussion period
open until October 19, 2023; separate discussions must be submitted for
individual papers. This paper is part of the Journal of Computing in Civil
Engineering, © ASCE, ISSN 0887-3801.
and legacy vehicles to make optimal path planning. Hui et al.
(2021) proposed a digital twin-enabled path planning scheme to
help increase traffic efficiency. Particularly, the proposed scheme
embedded a personalized utility module able to meet the specific
needs from each vehicle, thereby contributing to higher overall ve-
hicle utility than the traditional path planning schemes. Liu et al.
(2021a) introduced an infrastructure-vehicle cooperation system
for the advancement of automated driving. Their approach involved
using roadside sensors to digitize traffic scenes and transmitting the
perception results to the neighboring automated vehicle, which en-
ables the automated vehicles to get a full picture of the current traf-
fic situation.
However, the creation of digital twins in the aforementioned
methods relies on the information gathered from different types
of roadside sensors, such as cameras, LiDARs, and radars, making
these methods less applicable in field applications due to the high
equipment costs. To tackle this hurdle, El Marai et al. (2020) ex-
ploited omnidirectional cameras to create the digital twins of traffic
scenes in urban cities and simultaneously implemented the deep
learning method for detection and recognition of the road users.
However, the lack of spatial information about road users, which
would otherwise be of vital importance for traffic condition analy-
sis and prediction, rendered the digital twins generated by this
method less practical.
State of Research in Traffic Scene Reconstruction
Traffic scene reconstruction and understanding is an active topic
in intelligent transportation systems, in which the primary task
is to perceive 3D attributes of road users in dynamic environments.
Using monocular video for traffic scene reconstruction has been
receiving extensive interest recently because of the advantage of
simple implementation (Kar et al. 2015). The monocular-based
methods normally start with detecting the objects in the images
with two-dimensional (2D) bounding boxes and key points and
then fitting 3D templates (3D pose and shape) to best match their
2D observations (Su et al. 2015). Subsequently, postprocessing
is carried out to improve the initial estimation via nonlinear opti-
mization (Mottaghi et al. 2015). However, these methods are
only applicable to limited types of road users. In addition, they
are substantially time-consuming and fall short of the real-time
processing speeds needed by a plethora of intelligent transporta-
tion applications.
Recently, convolutional neural networks (CNNs) are gaining in-
creasing popularity and have been exploited to directly predict the
spatial location and pose of road users in traffic images in an end-
to-end manner (Ke et al. 2020; Reddy et al. 2018). Nevertheless, the
reconstruction results produced by the CNN-based methods are
usually too coarse to be used in real traffic applications due to
the difficulty of retrieving the depth information missing in 2D im-
ages. To address these hurdles, some research efforts have resorted
to multiple-view reconstruction or integrating depth information
into monocular images. Particularly, the multiple-view-based meth-
ods reconstruct traffic scenes by utilizing the triangulation tech-
nique to infer the geometrical and spatial information of traffic
entities based on a sequence of images gathered from different ori-
ented cameras (Muller et al. 2005).
However, this type of method entails complicated camera cal-
ibration over different cameras, and requires manually setting 3D
control points in the observed scene and accurately determining
their spatial location, which is oftentimes a laborious and time-
consuming process. Integrating depth information into monocular
images can effectively assist in recovering discriminative object
features and spatial information of road users in 2D images.
Common depth-sensing devices that can be used for this purpose
include radar (Niesen and Unnikrishnan 2020), depth-sensing cam-
eras (Xia et al. 2015), and LiDAR (Yang et al. 2018). Nevertheless,
the equipment cost of these sensors is typically expensive, thus im-
peding their wide application in real traffic settings.
Visual Surveillance Systems in Road Transportation
Applications
Visual surveillance systems have been widely deployed in modern
road transportation systems to facilitate the monitoring and man-
agement of traffic activities (Zhang et al. 2022). In comparison with
other traffic monitoring techniques such as inductive loop detectors,
radar sensors, and infrared sensors, a visual surveillance system can
be deployed for various applications by using different computer
vision algorithms. The potential applications include vehicle detec-
tion and classification (Lu and Dai 2023), human detection and rec-
ognition (Nikouei et al. 2018), incident detection (Sharma and
Sungheetha 2021), and illegal activity detection (Bhatti et al. 2021).
If the deployed surveillance cameras are properly calibrated, the
applicability of the surveillance system can be further widened, en-
abling capabilities such as speed measurement (Yang et al. 2019)
and behavior understanding (Giannakeris et al. 2018).
To eliminate the challenges with respect to network congestion,
latency, and data storage, the aforementioned image analysis tasks
are usually completed at the edge computing units in close prox-
imity to the surveillance cameras. The extracted semantic informa-
tion from each edge computing unit is then transmitted to the traffic
management center to assist the road operators in understanding
current traffic conditions and conducting system-wide management
of regional transportation (Zhang et al. 2022).
Problem Statement and Research Objective
Machine vision, especially the monocular fashion, has been recog-
nized as a cost-efficient and promising technology in assisting dig-
ital twin creation because it can capture both rich semantic and
geometric information. Nevertheless, the existing vision-based
methods are only applicable to extraction of one or a few specific
types of traffic data from real-world traffic scenes, and a unified
method allowing for collection of granular-level data of traffic
scenes is still missing in the existing literature. The granular-level
data of traffic scenes here refer to the road infrastructure features
(e.g., road markings and lane dividers) and road users' type, loca-
tion, dimension, speed, and so on, all of which are vital to building
a trustworthy digital twin of road transportation and making opti-
mal traffic management and coordination decisions.
To fill this gap, this study proposed a new vision-based method
that digitalizes the traffic scenes by recovering spatial information
of both static road infrastructure and dynamic road users through
creation of a new reconstruction process that addresses the diffi-
culty in retrieving 3D spatial information of road users in 2D im-
ages. The main contribution of the study lies in developing a new
vision-based framework by leveraging a diversity of machine learn-
ing and computer vision techniques through customization and
synergy for granular traffic data collection in assisting digital twin
creation, which cannot be achieved by the existing methods.
Methodology
The overall framework of the proposed method for digitalization
of traffic scenes is illustrated in Fig. 1, which consists of three
main modules: (1) digitalization of static road infrastructure by
leveraging UAV and structure from motion, (2) digitalization of
moving road users through road surveillance cameras as well as
applying deep learning and view geometry, and (3) global registra-
tion and merging of digital models of static and dynamic compo-
nents. Because the road infrastructure usually remains constant
over a long period of time, the first module only needs to be per-
formed once unless the key infrastructure elements are retrofitted or
rebuilt, whereas the second and third modules have to be contin-
uously implemented for each recorded frame due to the dynamic
nature of road users. In the following subsections, each module is
explained in detail. The symbols in bold represent homogeneous
vectors; a multiplication sign stands for the cross-product operator.
Digitalization of Static Road Infrastructure Using
Structure from Motion
Structure from motion is employed in the proposed method for the
digitalization of static road infrastructure, which is essentially a
process to simultaneously estimate both the 3D location of scene
structure and camera poses from a series of overlapping photographs
(Szeliski 2010). In general, the framework of structure from motion
consists of five sequential steps: (1) detecting and matching the
salient feature points across different images, e.g., using a scale-
invariant feature transform (Lowe 1999) followed by approximate
nearest neighbors (Indyk and Motwani 1998), (2) calculating the
fundamental matrix and estimating the relative camera pose be-
tween each pair of images, (3) using triangulation to compute
the spatial locations of the point correspondences and populating
them into the 3D point cloud model, (4) exploiting bundle adjust-
ment to further refine all estimates, i.e., the camera poses and 3D
points, with guaranteeing that the global reprojection error is mini-
mized, and (5) removing the scale ambiguity.
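As a concrete illustration of steps (1)-(4), the following is a minimal two-view sketch using OpenCV; the image file names and the camera intrinsics K are placeholder assumptions, and a full pipeline would chain many overlapping views and refine all estimates with bundle adjustment.

```python
# Minimal two-view structure-from-motion sketch (OpenCV), covering steps (1)-(4)
import cv2
import numpy as np

K = np.array([[2400.0, 0, 960], [0, 2400.0, 540], [0, 0, 1]])  # assumed intrinsics

img1 = cv2.imread("aerial_001.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder images
img2 = cv2.imread("aerial_002.jpg", cv2.IMREAD_GRAYSCALE)

# Step 1: detect and match SIFT features (ratio test prunes ambiguous matches)
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.7 * n.distance]
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# Step 2: estimate the relative camera pose via the essential matrix (RANSAC)
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

# Step 3: triangulate inlier correspondences into 3D points
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
pts3d = (pts4d[:3] / pts4d[3]).T  # up to scale; step (5) resolves the ambiguity
print(pts3d.shape)
```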
It is worth pointing out that an overlap of 50%-80% between the
adjacent aerial images is required when using structure from mo-
tion in order to achieve a reliable camera pose estimation (Khaloo
and Lattanzi 2017). Thus, the UAV flight configurations regarding
flight speed and altitude need to be properly designed prior to the
surveying mission. On the other hand, to eliminate the adverse
impact of moving vehicles on the reconstruction accuracy of road
infrastructure (Lee et al. 2022), the vehicles are manually annotated
and masked in the recorded aerial images before being passed on to
structure from motion.
Fig. 2(a) displays an example of the 3D point cloud model and
camera poses that are recovered from the vehicle-free aerial images
with the aid of structure from motion. It can be seen that the re-
constructed 3D point cloud model is too sparse to provide desired
semantic information such as road markings and lane dividers, for
the development of the digital twin of road infrastructure. Thus, a
dense and photorealistic 3D point cloud model is ultimately gen-
erated by applying clustering views for multiview stereo
(CMVS) to the sparse counterpart, a patch-based algorithm
capable of clustering overlapping images and supplying additional
semantic cloud points to the 3D model (Furukawa
et al. 2010). The densified 3D point cloud model is displayed in
Fig. 2(b).
After 3D reconstruction, semantic segmentation is conducted to
identify the road features in the 3D point cloud model of road infra-
structure, which can be subsequently used to assist the analysis and
understanding of traffic data collected from the surveillance cam-
eras. In this study, the semantic segmentation is manually done by
leveraging the software CloudCompare version 2.7.0, which is an
open-source tool for working with 3D point clouds and meshes.
Particularly, the instances in the road infrastructure scenes are di-
vided into nine classes: road, sidewalk, crosswalk, road marking,
lane division, traffic sign, lamp pole, traffic light, and others. Fig. 3
represents an example of the semantic segmentation result of the
road infrastructure model.
Fig. 1. Overall framework of proposed method for digitalization of traffic scenes.

Fig. 2. Road infrastructure reconstruction using structure from motion: (a) sparse 3D point cloud model; and (b) dense 3D point cloud model.

Digitalization of Moving Road Users from Surveillance Cameras

Road User Instance Segmentation Using Cascade Mask R-CNN
As the first step of the digitalization of road users, Cascade Mask region-based convolutional neural network (R-CNN) is utilized to detect the road user instances in the recorded frames from the surveillance cameras. The architecture of the Cascade Mask R-CNN is depicted in Fig. 4.
Different from the other deep learning architectures, the
Cascade Mask R-CNN employs a set of cascaded detectors for
object detection (Cai and Vasconcelos 2019). In particular, the
cascaded detectors are trained sequentially; namely, the output
(i.e., regressed bounding boxes) of the previous detector with a
certain intersection over union (IoU) threshold is used as a set of
fine-tuned proposals for the next detector with a higher IoU threshold.
Meanwhile, the regions of interest (ROIs) in the following detectors
are updated using the refined proposals and the shared feature maps
from the backbone. This is primarily motivated by the observation that
the output IoU of a regressor is almost invariably better than the
input IoU (Cai and Vasconcelos 2019). It has been proven that such
a resampling mechanism can effectively address the overfitting
problem during training and eliminate quality mismatches at infer-
ence, thus contributing to the outperformance of Cascade Mask
R-CNN over most existing deep learning architectures for ob-
ject detection and instance segmentation. In this study, the IoU
thresholds were specified as 0.5, 0.6, and 0.7 for different detectors,
respectively.
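For readers who wish to reproduce this step, Cascade Mask R-CNN implementations are available in open-source toolboxes such as MMDetection; the sketch below shows the typical inference calls, where the config and checkpoint file names are hypothetical placeholders rather than artifacts released with this study.

```python
# Hedged inference sketch using MMDetection, which ships Cascade Mask R-CNN
from mmdet.apis import init_detector, inference_detector

config = "cascade_mask_rcnn_swin_s_roaduser.py"  # hypothetical custom config
checkpoint = "cascade_mask_rcnn_roaduser.pth"    # hypothetical trained weights

model = init_detector(config, checkpoint, device="cuda:0")
result = inference_detector(model, "frame_000123.jpg")  # placeholder frame
# In MMDetection 2.x, `result` is a (bbox_results, segm_results) pair per class;
# the instance masks feed the 3D bounding box construction described next.
```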
Construction of Road User 3D Bounding Boxes
Once the road users' instances are obtained, their 3D spatial infor-
mation can be retrieved in the 2D images in the form of 3D bound-
ing boxes. The approach for 3D bounding box construction
has been developed in a previous study (Lu and Dai 2023), which
consists of two consecutive steps: vanishing point estimation and
3D bounding box construction. To make this paper self-contained,
the previously developed approach is briefly illustrated herein. For
details, interested readers may refer to Lu and Dai (2023).
The developed 3D bounding box construction approach entails
the identification of three orthogonal vanishing points in the traffic
scenes. To this end, a random sample consensus (RANSAC)-based
method was proposed for vanishing point estimation, and the cor-
responding pseudocode is shown in Fig. 5. In comparison with
other vanishing point estimation methods, the advantage of the pro-
posed method lies in that it eliminates the need for a trial-and-error
parameter tuning process, thus allowing for easy implementation
across a variety of traffic scenes with little-to-no human input.
In this study, the three orthogonal vanishing points are defined
as follows. The first vanishing point v1 is along the dominant traffic
direction, the second vanishing point v2 is in the direction perpen-
dicular to v1 and parallel to the road surface, and the third vanishing
point v3 is in the direction perpendicular to the road surface, as de-
picted in Fig. 6.
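To make the idea behind Fig. 5 concrete, the following is a minimal RANSAC sketch for estimating one vanishing point from a set of line segments; the angular inlier test and thresholds are illustrative assumptions, not the exact algorithm of Lu and Dai (2023).

```python
# Minimal RANSAC sketch for vanishing point estimation from line segments
import numpy as np

def to_line(seg):
    # Homogeneous line through segment endpoints (x1, y1, x2, y2)
    p1 = np.array([seg[0], seg[1], 1.0])
    p2 = np.array([seg[2], seg[3], 1.0])
    return np.cross(p1, p2)

def angle_residual(vp, seg):
    # Angle between the segment direction and the direction toward the VP
    mid = np.array([(seg[0] + seg[2]) / 2, (seg[1] + seg[3]) / 2])
    d_seg = np.array([seg[2] - seg[0], seg[3] - seg[1]])
    d_vp = vp[:2] / vp[2] - mid if abs(vp[2]) > 1e-9 else vp[:2]
    cos = abs(d_seg @ d_vp) / (np.linalg.norm(d_seg) * np.linalg.norm(d_vp) + 1e-12)
    return np.degrees(np.arccos(np.clip(cos, -1, 1)))

def ransac_vp(segments, iters=500, tol_deg=2.0, rng=np.random.default_rng(0)):
    best_vp, best_inliers = None, []
    for _ in range(iters):
        i, j = rng.choice(len(segments), 2, replace=False)
        vp = np.cross(to_line(segments[i]), to_line(segments[j]))  # hypothesis
        if np.allclose(vp, 0):
            continue  # degenerate sample (parallel or identical segments)
        inliers = [s for s in segments if angle_residual(vp, s) < tol_deg]
        if len(inliers) > len(best_inliers):
            best_vp, best_inliers = vp, inliers
    return best_vp, best_inliers
```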
After the three orthogonal vanishing points are successfully
identified, the 3D bounding box can be constructed for each sur-
veilled road user in the images. Fig. 6 illustrates the pipeline of 3D
bounding box construction under the right camera view. It starts
with generating five lines L1-L5 that are tangent to the road user's
silhouette (i.e., the instance result obtained by the Cascade Mask
R-CNN) and pass through the three vanishing points [Fig. 6(a)],
and then progressively identifies the vertices B1-B8 by making use
of the outcome from the previous steps [Figs. 6(b and c)]. The
construction process can be formulated as follows:
Fig. 5. Algorithm for vanishing point estimation.
Fig. 4. Architecture of Cascade Mask R-CNN. (Images by Linjun Lu.)
Fig. 3. Example of road infrastructure instance segmentation.
\[ B_1 = L_1 \times L_3; \quad B_2 = L_3 \times L_5; \quad B_3 = L_2 \times L_5 \tag{1} \]
\[ L_6 = B_1 \times v_2; \quad L_7 = B_3 \times v_3; \quad L_8 = B_2 \times v_1; \quad L_9 = B_4 \times v_2 \tag{2} \]
\[ B_4 = L_4 \times L_9; \quad B_5 = L_6 \times L_7; \quad B_6 = L_8 \times L_9 \tag{3} \]
\[ L_{10} = B_5 \times v_1; \quad L_{11} = B_6 \times v_3 \tag{4} \]
\[ B_7 = L_1 \times L_{11}; \quad B_8 = L_4 \times L_{10} \tag{5} \]
After the eight vertices are identified, the 3D bounding box is
finally constructed by joining the adjacent vertices pairwise, as
shown in Fig. 6(d). The corresponding pseudocode of 3D bounding
box construction is shown in Fig. 7. For the camera positioned on
the other side of the road, the 3D bounding box can be constructed
in a similar manner as illustrated here.
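Because all quantities are homogeneous, Eqs. (1)-(5) reduce to a chain of cross products: the cross product of two points yields the line through them, and the cross product of two lines yields their intersection. A sketch of this chain is given below; note that Eqs. (2) and (3) as printed define L9 and B4 in terms of each other, so taking L9 as the line through B3 toward v2 is an assumption made here to break that circularity.

```python
# Sketch of Eqs. (1)-(5): 3D bounding box vertices from tangent lines and VPs
import numpy as np

def construct_box_vertices(L1, L2, L3, L4, L5, v1, v2, v3):
    x = np.cross  # point x point -> line; line x line -> point (homogeneous)
    B1, B2, B3 = x(L1, L3), x(L3, L5), x(L2, L5)        # Eq. (1)
    L6, L7, L8 = x(B1, v2), x(B3, v3), x(B2, v1)        # Eq. (2)
    # Assumption: the printed Eqs. (2)-(3) are circular (L9 <-> B4); a line
    # through B3 toward v2 is one plausible reading.
    L9 = x(B3, v2)
    B4, B5, B6 = x(L4, L9), x(L6, L7), x(L8, L9)        # Eq. (3)
    L10, L11 = x(B5, v1), x(B6, v3)                     # Eq. (4)
    B7, B8 = x(L1, L11), x(L4, L10)                     # Eq. (5)
    verts = np.array([B1, B2, B3, B4, B5, B6, B7, B8], dtype=float)
    return verts[:, :2] / verts[:, 2:3]                 # pixel coordinates
```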
Rectification of Constructed 3D Bounding Boxes
As shown in Fig. 8, because of the irregular and somewhat bent
appearances of road users, the locations of the initially drawn tangent
lines (i.e., L1-L5) typically deviate radially toward the centroid of the
road user from their actual locations, resulting in the 3D bounding
boxes constructed from the previous step being slightly smaller in
width and length than those in reality. In order to increase the ac-
curacy of road user localization, it would be wise to make some
efforts to rectify or enlarge the constructed 3D bounding boxes.
The rectification can proceed by re-estimating the locations of
the four bottom vertices B2, B3, B4, and B6 by scaling their coordinates
along the two symmetry axes of the bottom plane of the 3D bound-
ing box.
However, it may be intractable to directly carry out the coordi-
nate scaling in the image plane due to the shape distortion brought
on by projective transformation. For instance, in Fig. 6, the surfaces
of the 3D bounding box are not rectangular in the image, although
the originals are. To overcome this hurdle, a hierarchy of transfor-
mations was developed and applied to remove the projective/affine
distortion in the image plane and scale the vertex coordinates there-
after. The main concept of this hierarchy is illustrated in Fig. 9,
where l = (l_1, l_2, l_3)^T is the vanishing line of the ground plane,
given by l = v_1 × v_2.
Geometrically, the projective distortion can be rectified by trans-
forming the vanishing line l back to its canonical position
l∞ = (0, 0, 1)^T, namely, mapping the image plane π1 to π2 as shown
in Fig. 9. The suitable projective matrix that achieves this transfor-
mation is

\[ H_P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ l_1 & l_2 & l_3 \end{bmatrix} \tag{6} \]

After applying H_P to the image plane π1, one can immediately
verify that l' = H_P^{-T} l = (0, 0, 1)^T, v_1' = H_P v_1 = (v_{1,x}, v_{1,y}, 0)^T,
and v_2' = H_P v_2 = (v_{2,x}, v_{2,y}, 0)^T, as desired. Meanwhile,
the locations of the four bottom vertices on π2 are obtained by
B_i^P = H_P B_i. On top of this, the shape property of the bottom plane
can then be affinely recovered by mapping v_1' to (0, 1, 0)^T and
v_2' to (1, 0, 0)^T, namely, mapping π2 to π3. Correspondingly, the
affine matrix is given by

\[ H_A = \begin{bmatrix} v_{2,x} & v_{1,x} & 0 \\ v_{2,y} & v_{1,y} & 0 \\ 0 & 0 & 1 \end{bmatrix}^{-1} \tag{7} \]
Fig. 7. Algorithm for 3D bounding box construction.
Fig. 8. Effect of bending shapes of road users on 3D bounding box
construction.
Fig. 9. Pipeline of 3D bounding box rectification.
Fig. 6. Pipeline of constructing 3D bounding box: (a) Step 1; (b) Step 2; (c) Step 3; and (d) Step 4. (Images by Linjun Lu.)
Likewise, the locations of the four bottom vertices on π3 are
determined as B_i^A = H_A B_i^P. Once the shape property is recovered,
the coordinates of the four bottom vertices can be easily scaled by
performing a similarity transformation H_S on π3, with

\[ H_S = \begin{bmatrix} s_x & 0 & \Delta t_x \\ 0 & s_y & \Delta t_y \\ 0 & 0 & 1 \end{bmatrix} \tag{8} \]

where (Δt_x, Δt_y)^T = inhomogeneous coordinates of the centroid of
the bottom plane on π3; and s_x and s_y = scaling factors along the x- and
y-directions, respectively. The scaling factors for each type of road
user can be statistically calibrated by comparing the measured
values from original 3D bounding boxes with the ground-truth
counterparts. In summary, the hierarchy of transformations per-
formed on the bottom vertices can be expressed in the form
B_i^S = H_S H_A H_P B_i. After coordinate scaling, the re-estimated loca-
tions of the vertices are reprojected to the original image plane π1 by
B̃_i = H_P^{-1} H_A^{-1} B_i^S. Finally, the 3D bounding box is reconstructed
by following the same construction pipeline as elaborated in Fig. 7
with the use of the updated bottom vertices. The pseudocode of the
described rectification process is summarized in Fig. 10. The sug-
gested scaling factors, which were calibrated from a series of field
comparison tests, are listed in Table 1. Fig. 11 shows some exam-
ples of unrectified and rectified 3D bounding boxes of road users
under different traffic scenes and camera view angles.
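A compact sketch of the full rectification chain B_i^S = H_S H_A H_P B_i is given below, assuming the vanishing points and bottom vertices are available as homogeneous numpy arrays; interpreting (Δt_x, Δt_y) as centering offsets so that scaling occurs about the centroid is one reading of Eq. (8), and the default scaling factors follow the car entry of Table 1.

```python
# Sketch of the rectification chain of Eqs. (6)-(8)
import numpy as np

def rectify_bottom_vertices(B, v1, v2, sx=1.05, sy=1.15):
    l = np.cross(v1, v2)                       # vanishing line of the ground plane
    HP = np.array([[1, 0, 0], [0, 1, 0], list(l)], dtype=float)      # Eq. (6)
    v1p, v2p = HP @ v1, HP @ v2                # mapped to points at infinity
    HA = np.linalg.inv(np.array([[v2p[0], v1p[0], 0],
                                 [v2p[1], v1p[1], 0],
                                 [0,      0,      1]], dtype=float))  # Eq. (7)
    Bp = [HA @ (HP @ b) for b in B]
    Bp = np.array([b / b[2] for b in Bp])      # affine-rectified bottom vertices
    cx, cy = Bp[:, 0].mean(), Bp[:, 1].mean()  # centroid of the bottom plane
    HS = np.array([[sx, 0, cx * (1 - sx)],     # Eq. (8): scale about the
                   [0, sy, cy * (1 - sy)],     # centroid (one reading of
                   [0, 0, 1.0]])               # the offsets Δtx, Δty)
    back = np.linalg.inv(HP) @ np.linalg.inv(HA)   # B̃ = HP^-1 HA^-1 B^S
    Bs = [back @ (HS @ b) for b in Bp]
    return np.array([b / b[2] for b in Bs])    # rectified vertices, image plane
```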
Localization and Trajectory Tracking of Road Users
Once the road user's 3D bounding box is constructed, its location
on the road plane can be represented by the four bottom vertices (or
better, the bottom plane) of its 3D bounding box, after converting
them to the physical coordinates in the real world. Assum-
ing that the road shape is approximately flat and the camera pose
remains fixed, the projective transformation between the image
plane and road plane can be described as follows (Hartley and
Zisserman 2003):
\[ s\, m_i = H M_i \tag{9} \]

where m_i = (x_i, y_i, 1)^T and M_i = (X_i, Y_i, 1)^T (i = 1, 2, ..., n)
are the homogeneous coordinates of the ith pair of point corre-
spondences on the image plane and road plane, respectively;
and H is the 3 × 3 homography matrix that depicts the projective
Fig. 10. Algorithm for 3D bounding box rectification.
Table 1. Scaling factors specified for different types of road users
Direction | Car | Bus | Truck | Pedestrian | Bicyclist | Motorcyclist
Width, s_x | 1.05 | 1.05 | 1.05 | 1.20 | 1.20 | 1.20
Length, s_y | 1.15 | 1.08 | 1.08 | 1.20 | 1.10 | 1.10
Fig. 11. Examples of 3D bounding boxes of road users before and after rectification. (Images by Linjun Lu.)
relationship between the image plane and road plane. The equality
in Eq. (9) is defined up to an arbitrary nonzero scale factor s. This
means that there are only eight independent degrees of freedom
in H. As a consequence, the homography matrix H can be
uniquely determined by specifying at least four point correspond-
ences in a general configuration and solved by using the normalized
direct linear transform (Lu and Dai 2022).
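The following sketch illustrates Eq. (9) with OpenCV, whose findHomography routine implements a normalized DLT; the pixel and road-plane coordinates below are placeholders rather than the field-measured reference points in Fig. 12(a).

```python
# Sketch of Eq. (9): homography estimation and road user localization
import cv2
import numpy as np

img_pts = np.float32([[412, 655], [1507, 642], [300, 980], [1650, 955]])  # pixels
road_pts = np.float32([[0, 0], [17.25, 0], [0, 32.5], [17.25, 32.5]])     # meters

H, _ = cv2.findHomography(img_pts, road_pts, method=0)  # least-squares DLT

# Map a box's four bottom vertices from image to road-plane coordinates
bottom = np.float32([[[820, 710]], [[930, 705]], [[845, 760]], [[955, 753]]])
road_xy = cv2.perspectiveTransform(bottom, H).reshape(-1, 2)
length = np.linalg.norm(road_xy[0] - road_xy[2])  # distance between vertices
print(road_xy, length)
```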
Fig. 12 illustrates an example of exploiting homography trans-
formation for road user localization. The reference points selected
for homography estimation are shown in Fig. 12(a), whose pixel
coordinates and physical coordinates are manually measured from
the image plane and road plane, respectively. By making use of the
homography matrix, each road user's physical location is retrieved
by mapping the four bottom vertices from the image coordinates to
the physical coordinates, as shown in Figs. 12(b and c).
Meanwhile, the length and width of road users are obtained by
measuring the distance between the remapped vertices. Together
with the height information, the road user's 3D shape can be fully
recovered, which is a valuable clue for fine-grained vehicle classi-
fication and model recognition (Sochor et al. 2016). The physical
height of the 3D bounding box can be estimated by referring to one
object of known height, such as traffic signs and lampposts, for
which the top and base are imaged (Criminisi et al. 2000). The for-
mula for height estimation can be expressed as
\[ d = \frac{\lVert \tilde{b}_1 - r_2 \rVert \left( \lVert r_1 - r_2 \rVert - \lVert v_3 - r_2 \rVert \right)}{\lVert r_1 - r_2 \rVert \left( \lVert \tilde{b}_1 - r_2 \rVert - \lVert v_3 - r_2 \rVert \right)}\, d_r \tag{10} \]

with

\[ \tilde{b}_1 = \left( \left( (r_2 \times b_2) \times l \right) \times b_1 \right) \times (r_1 \times r_2) \tag{11} \]

where r_1 and r_2 = imaged top and base points of the reference ob-
ject, respectively; b_1 and b_2 = top and base points of the height of
the 3D bounding box, respectively; and d_r = physical height of the
reference object, as annotated in Fig. 12(b).
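A sketch of this height-transfer computation, following the reconstructed Eqs. (10) and (11), is given below; it assumes all points are homogeneous numpy arrays and that the vertical vanishing point v3 is finite.

```python
# Sketch of single-view height estimation, Eqs. (10)-(11) (Criminisi et al. 2000)
import numpy as np

def dist(p, q):
    # Euclidean distance between two homogeneous image points
    return np.linalg.norm(p[:2] / p[2] - q[:2] / q[2])

def estimate_height(r1, r2, b1, b2, v3, l, d_r):
    # Eq. (11): transfer top point b1 onto the reference line via the
    # vanishing line l; u is where line (r2, b2) meets l
    u = np.cross(np.cross(r2, b2), l)
    b1t = np.cross(np.cross(u, b1), np.cross(r1, r2))
    # Eq. (10): ratio of distances along the vertical reference direction
    num = dist(b1t, r2) * (dist(r1, r2) - dist(v3, r2))
    den = dist(r1, r2) * (dist(b1t, r2) - dist(v3, r2))
    return num / den * d_r
```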
In addition to localization in each frame, the Deep Sort tracker
(Wojke et al. 2017) is employed to lock onto identical road users
based on their spatial and appearance information, and continu-
ously track each of them over sequential frames until they leave
the field of view of the camera. Fig. 12(d) illustrates an example
of the localization and trajectory tracking of road users over a se-
quence of frames.
Global Registration and Merging of Digital Models
Upon the completion of Modules 2 and 3, the digital model of the
traffic scene is eventually constructed by merging the digital mod-
els of the road infrastructure and road users. To this end, the co-
ordinate systems of these two digital models have to be aligned.
Typically, the coordinate system alignment can be achieved by per-
forming a 3D rigid-body transformation on one digital model while
taking the other one as the reference. The transformation matrix can
be determined by specifying at least three ground control points
(that is, the points with known 3D geospatial locations in the
Euclidean world system) in both coordinate systems. Fig. 13 pro-
vides an example of the digitalized traffic scene by merging the 3D
point cloud model of road infrastructure in Fig. 2(b) and the road
user model in Fig. 12(d) together after performing coordinate
Fig. 12. Localization of road users by applying homography transformation: (a) reference points; (b) surveilled view; (c) surveilled view after
projective mapping; and (d) spatial locations and trajectories of road users. (Images by Linjun Lu.)
system alignment. For the cases where multiple cameras are de-
ployed in the same traffic scene, the model merging can be progres-
sively conducted for each camera in the same fashion, taking
the coordinate system of the road infrastructure as the reference.
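One standard way to compute such a rigid-body transformation from three or more control point pairs is the SVD-based (Kabsch) solution sketched below; the coordinates shown are placeholders, not the surveyed control points of this study.

```python
# Sketch of 3D rigid-body alignment from >= 3 ground control point pairs
import numpy as np

def rigid_transform(src, dst):
    # Returns R (3x3) and t (3,) such that dst ~= R @ src + t
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflection
    R = Vt.T @ D @ U.T
    return R, c_dst - R @ c_src

# Placeholder control points in the two model coordinate systems (meters)
src = np.array([[0.0, 0.0, 0.0], [17.25, 0.0, 0.0], [0.0, 32.5, 0.0]])
dst = np.array([[5.1, 2.0, 0.3], [22.3, 2.4, 0.3], [4.6, 34.4, 0.4]])
R, t = rigid_transform(src, dst)
```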
Data Set Establishment and Model Training
A road user data set that consists of 35,401 images at various res-
olutions was established by the authors for model training. In order
to ensure the trained model's ability to generalize, the road user
images were collected from a variety of traffic settings, including
urban, rural, and highway areas, and under varying lighting con-
ditions. Some images in the developed data set were taken from
several public data sets such as COCO (Lin et al. 2014), Cityscapes
(Cordts et al. 2016), and nuScenes (Caesar et al. 2020). From an
application standpoint and to facilitate the annotation, the
road user instances are grouped into six categories: car, bus, truck,
pedestrian, bicyclist, and motorcyclist. The image annotation task
was completed by VGG Image Annotator version 2.0.12 (Dutta and
Zisserman 2019), which is an open-source image annotation soft-
ware developed by the Visual Geometry Group. To ensure the best
level of instance segmentation accuracy, the road user instances
were precisely labelled by fine-grained polygons in the gathered
images. Some samples of annotated road user images are shown
in Fig. 14.
The annotated images were randomly split into training and val-
idation sets at a ratio of 8:2 for model training. Four widely used
backbone networks, that is, ResNet-50, ResNet-101, Swin-T, and
Swin-S (He et al. 2016;Liu et al. 2021b), were chosen as the can-
didate backbones of the Cascade Mask R-CNN for feature extrac-
tion, resulting in four different Cascade Mask R-CNN models.
These four Cascade Mask R-CNN models were separately trained
on the customized data set, and COCO evaluation metrics (Lin et al.
2014) were adopted to evaluate the performance of each trained
model. The training task was conducted on the Amazon Web
Services (AWS) platform with a single GPU (NVIDIA Tesla V100).
The hyperparameters, that is, momentum, weight decay, and batch
size, were set to 0.9, 0.0001, and 2 for all models; the learning rate was
set to 0.001 and 0.0001 for ResNet- and Swin-based models, re-
spectively. The training data set was augmented during the training
process by implementing horizontal image flipping.
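For concreteness, the reported optimizer settings can be expressed as a standard PyTorch SGD configuration, as sketched below with a placeholder module standing in for the detector.

```python
# The reported training hyperparameters as a PyTorch SGD optimizer
import torch

model = torch.nn.Conv2d(3, 8, 3)  # placeholder for the Cascade Mask R-CNN
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.001,              # 0.0001 was used for the Swin-based models
    momentum=0.9,
    weight_decay=0.0001,
)
```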
Table 2 presents a comparison of the average precision (AP) and
mean average precision (mAP) performance of the trained models
with different backbones on the validation data set (the bold value
represents the best one in each column). Specifically, mAP50 and
mAP75 stand for the mAP at IoU thresholds of 0.5 and 0.75, respectively;
and APS, APM, and APL represent the AP on small, medium, and large
objects, respectively (Lin et al. 2014). As can be seen, Swin-S out-
performs the other three backbones regarding all the mAPs and
APs. Therefore, Swin-S was chosen as the backbone in this study.
Performance Validation
Experimental Setup and Implementation Details
The field experiment was carried out at a road intersection near
the Evansdale campus of West Virginia University to evaluate the
performance of the proposed method for constructing the digital
models of the traffic scenes. Two commercial cameras (Panasonic
HC-W580K, Kadoma, Japan) were employed to surveil two differ-
ent regions of the intersection, both at full high-definition (HD)
resolution (1,920 × 1,080 pixels) and a frame rate of 30 frames per
second (fps). Each camera was rigidly mounted on one lamppost
with an elevated height of approximately 6 m above the road sur-
face. The camera locations as well as the surveilled traffic scene are
illustrated in Fig. 15.
A commercial-level UAV (DJI Phantom 4 Pro+ V2.0, Shenzhen,
China) equipped with a high-resolution camera was utilized to scan
the road intersection for 3D model reconstruction. The resolution of
the onboard camera is 3,840 × 2,160 pixels, and the focal length of
the lens is 35 mm. The flight speed and altitude were set to around
1 m/s and 10-15 m, respectively. Because of the measurement uncer-
tainty in practical applications, using more point correspondences
can help achieve higher accuracy for homography estimation, but at
the expense of taking more effort for ground control point setting.
After taking a trade-off between accuracy and complexity, a total of
four and five ground control points were finally arranged on the
road surface for homography matrix estimation for Cameras 1
and 2, respectively. Moreover, an additional two ground control
points were also set on the road surface and adopted in conjunction
with the abovementioned nine for coordinate system alignment.
The locations of these 11 ground control points are annotated in
Fig. 15, whose geospatial coordinates were precisely measured
from the total station.
The localization accuracy of road users would undoubtedly be
an essential metric for the performance evaluation of the proposed
method. Therefore, the ground-truth locations of road users have to
be known a priori. To this end, the DJI UAV was employed to
Fig. 13. Example of digitalized traffic scene by merging digital models of static road infrastructure and moving road users.
continuously monitor the same traffic scene in a bird's-eye view
during the data acquisition of the surveillance cameras. By aligning
the optical axis of the onboard camera perpendicular to the road
surface, the appeared road users can be easily and accurately lo-
cated in the aerial images via 2D bounding boxes, as illustrated in
Fig. 16.
The UAV was set in a hover mode with a flight altitude of approximately
25 m. Under such a flight configuration, the field of view of the
UAV was about 71 × 40 m. Accordingly, the accuracy level of
the UAV for road user localization can be roughly computed as
7,100/3,840 ≈ 1.85 cm/pixel. As such, the road users'
locations estimated from the UAV can safely serve as ground
truths for comparison purposes. By leveraging the same ground
control points as shown in Fig. 15, the projective relationship be-
tween the ground plane and the image plane of the onboard camera
can be determined through Eq. (9), allowing for the conversion of
the road users' locations from pixel units to physical units.
Although the UAV used flight stabilization control during record-
ing, the self-motion resulting from wind could not be avoided.
Thus, it was necessary to conduct the homography estimation
for each recorded aerial frame. The clocks of the DJI UAV and
Panasonic cameras were synchronized with the Internet time
server.
Performance Metrics and Experimental Results
A total of 331 aerial images and two 30-min surveillance videos
were acquired from the field experiment and used for the digitali-
zation of the traffic scene. The data processing was conducted on
the AWS platform (GPU: NVIDIA Tesla V100), which can reach an average
processing speed of 9.2 fps for the digitalization of traffic scenes.
Fig. 17 illustrates the digitalized traffic scene at a certain time point.
As can be seen, the road infrastructure and the road users within
two surveilled traffic regions were successfully digitalized by the
proposed method.
To quantitatively evaluate the performance of the digitalized
traffic scene model, three performance metrics were adopted, that
Fig. 14. Samples of annotated images in customized road user data set. (Images by Linjun Lu.)
Table 2. Performance of Cascade Mask R-CNN with different backbones
Backbone | mAP | mAP50 | mAP75 | APS | APM | APL
ResNet-50 | 15.0 | 30.0 | 13.1 | 4.6 | 12.8 | 25.4
ResNet-101 | 33.3 | 58.9 | 32.5 | 14.0 | 32.8 | 49.4
Swin-T | 35.6 | 60.7 | 35.8 | 17.7 | 33.6 | 51.7
Swin-S | 35.7 | 61.0 | 35.9 | 17.9 | 33.7 | 51.8
Fig. 15. Overview of field experimental setup. (Images by Linjun Lu.)
is, the road user instance segmentation performance, the geometric ac-
curacy of the digitalized road infrastructure model, and the localization
and dimension estimation accuracy of digitalized road users.
Traffic data such as speed and acceleration were not evaluated in this
study because they are directly derived from the location estima-
tion results. Furthermore, a quantitative evaluation of the dig-
ital model of the traffic scene as a whole should be based on specific traffic
applications, because different applications may have varying accu-
racy requirements for a particular type of traffic data; such an evaluation was
thus not carried out at the current stage of this study. A demo
of the proposed method for digitalizing the dynamic traffic scene
can be found online (ICIL 2022).
Road User Instance Segmentation Performance
In order to examine the effectiveness of the trained Cascade Mask
R-CNN model for road user detection and instance segmentation,
which is a critical indicator to assess the reliability of the digitalized
traffic scene, a total of 360 frames were extracted from two re-
corded surveillance videos with a resampling rate of 5 s and used
for model testing. Fig. 18 presents the detection results of different
types of road users in the form of a confusion matrix. Specifically,
each row of the confusion matrix represents the instances in true
classes, whereas each column represents the instances in predicted
classes. It turns out that the precision and recall for the detection of
each type of road user were above 90.0% and 85.7%, respectively.
It was also found that all misclassifications of road users
appeared to happen among the cars, buses, and trucks. This is
primarily attributed to their similar appearances, which make it
difficult for the trained deep learning model to distinguish them
correctly. On the other hand, after checking the raw images, the
misdetection of road users typically occurred when they were
heavily obscured by others. A viable solution to tackle the misclas-
sification and misdetection issues would be to introduce more
images for deep learning model training and install the cameras
at a greater height so that the level of occlusion can be largely
minimized.
Geometric Accuracy of Digitalized Road Infrastructure
Model
The geometric accuracy of the digitalized road infrastructure
model was assessed by comparing the dimensions of the digitalized
objects and their actual values in reality. To this end, eight distinc-
tive natural markers distributed over the traffic scene were selected
as the checking points. The locations of these eight checking points
are annotated in Fig. 19. Accordingly, the Euclidean distance be-
tween each pair of checking points was measured in both the
digitalized road infrastructure model and the real world. One
laser distance meter [Bosch (Gerlingen, Germany) Blaze Pro
GLM400 CL] was employed for actual distance acquisition, which
has a measurement accuracy of 2 mm.
Fig. 16. Example of road user localization using UAV. (Images by Linjun Lu.)
Fig. 17. Example of digitalized traffic scene in field experiment. (Images by Linjun Lu.)
Table 3 summarizes the average relative dimension error of each
checking point to the others in comparison with the real counter-
part. It can be seen that the average relative dimension error asso-
ciated with each checking point was no more than 0.29% and the
overall relative dimension error was 0.25%. According to Smith
and Vericat (2015), the relative dimension error of a qualifying
digitalized model is expected to be less than 0.3%, indicating that
the digitalized road infrastructure model by the proposed method
has a satisfactory level of geometric accuracy.
Localization and Dimension Estimation Accuracy of
Digitalized Road Users
Table 4 lists the statistics of localization errors of different types of
road users by the proposed method in comparison with the UAV-
obtained ground truths. The localization error of each object was
computed by averaging the differences between the measured lo-
cations of the four bottom vertices and their actual values. The appa-
rently false 3D bounding boxes of road users brought by occlusion
were excluded for performance validation. It is safe to do this be-
cause the occlusion issue can be easily tackled by setting additional
surveillance cameras to surveil the same traffic area from different
perspectives in parallel.
The results showed that the mean and maximum localization
errors were 34.5 and 50.3, 38.0 and 55.5, 48.7 and 77.4, 15.6
and 35.2, 18.2 and 34.3, and 19.8 and 32.4 cm for cars, buses, trucks,
pedestrians, bicyclists, and motorcyclists, respectively. Additionally,
the standard deviation of localization error for these six types of
road users was 10.3, 8.7, 13.8, 6.6, 5.2, and 6.9 cm, respectively.
It is evident that the highest localization accuracy can be achieved
for pedestrians, which is mostly due to the fact that the pedestrians
occupy less road space, allowing them to be more easily and pre-
cisely located on the road surface.
The statistics of road user dimension estimation errors are sum-
marized in Table 5. To facilitate the comparison, the relative esti-
mation errors are provided. The ground-truth heights of cars, buses,
and trucks were obtained from manual measurement or by referring
to the online dimension information provided by the vehicle man-
ufacturers, whereas the ground-truth heights of pedestrians were
acquired from onsite inquiry. However, because there was un-
fortunately no means in the field experiment that could be used to
obtain the ground-truth heights of bicyclists and motorcyclists, the
relative height estimation errors with respect to these two types of
road users were not listed herein.
It can be seen that the dimension estimation accuracy of cars,
buses, and trucks was much better than those of pedestrians, bicy-
clists, and motorcyclists. It is probably because the former three
types of road users have comparatively more regular shapes than
those of the latter three. Particularly, the pedestrians had the largest
relative dimension estimation error, which was primarily attributed
to their much more irregular appearances than those of other types
Fig. 18. Confusion matrix of road user detection results.
Table 3. Statistics of relative dimension error (%)
Point no. | Relative error
1 | 0.27
2 | 0.19
3 | 0.24
4 | 0.28
5 | 0.29
6 | 0.26
7 | 0.23
8 | 0.19
All | 0.25
Table 4. Statistics of road user localization errors (cm)
Road user | Mean | Maximum | Standard deviation
Car | 34.5 | 50.3 | 10.3
Bus | 38.0 | 55.5 | 8.7
Truck | 48.7 | 77.4 | 13.8
Pedestrian | 15.6 | 35.2 | 6.6
Bicyclist | 18.2 | 34.3 | 5.2
Motorcyclist | 19.8 | 32.4 | 6.9
Table 5. Statistics of road user relative dimension estimation errors (%)
Road user | Length | Width | Height
Car | 6.5 | 9.2 | 3.9
Bus | 4.1 | 7.7 | 3.6
Truck | 5.3 | 8.4 | 6.7
Pedestrian | 30.6 | 28.3 | 17.1
Bicyclist | 15.6 | 19.6 | N/A
Motorcyclist | 13.6 | 12.9 | N/A
Fig. 19. Location of checking points used for geometrical accuracy
evaluation.
of road users, making it challenging for the proposed method to
construct the 3D bounding boxes for them accurately. This would
be a critical concern when applying the proposed method in safety-
related applications because pedestrians are the most vulnerable en-
tities in traffic scenes. To address this issue, the source of the error
of the proposed method in dimension estimation should be well
understood, quantified, and minimized in future work, so that
error compensation efforts can be made to enhance the performance
of the proposed method for road user localization.
Discussion and Future Work
The digital models of the traffic scenes have great potential to
be employed in numerous intelligent transportation applications.
The most significant application is the infrastructure-vehicle co-
operative autonomous driving. Automated vehicles may suffer from un-
reliable and insufficient perception due to occlusions and complex
traffic conditions (particularly in crowded urban intersections),
which poses safety risks to the automated vehicles and passengers.
The digital models of the traffic scenes provide a promising solu-
tion to overcome this issue by providing the automated vehicles
with a global view of the current traffic situation as well as the com-
plementary position information of the surrounding road users.
According to a qualitative analysis revealed by Williams and Barth
(2020), in order to ensure that the safety-critical autonomous driving
modules (e.g., collision warning and prediction) function correctly,
lane-level positioning accuracy (90 cm) is required for road user
positioning. It is worth pointing out that the maximum localization
error by the proposed method was 77.4 cm, which happened in the
case of truck localization. Therefore, it indicates that the proposed
method has great potential to be exploited to enhance the fields of
view of automated vehicles and assist them in making safer and
more efficient driving decisions. Similarly, legacy vehicles can also
benefit from this application if communication between the drivers
and visual surveillance systems is established.
Meanwhile, the digital model of the traffic scene can effectively
simulate the real traffic situation and provide an ideal test ground
for the development and optimization of automated driving algo-
rithms for automated vehicles, which are a crucial part of intelligent
transportation systems. Moreover, the digital models of the traffic
scenes also offer a platform for studying the behavior of different
road users. Following that, this information can be used to inves-
tigate traffic safety and create or modify policies to increase traffic
efficiency.
Additionally, the digital models of the traffic scenes could
serve as a tool for overheight collision prevention, because the
height of passing vehicles and the clearance of infrastructures
can be easily obtained in the digital model. Furthermore, the digital
model of road infrastructure provides 3D road surface information
that can be used for pavement distress detection and condition
maintenance, which cannot be acquired from a 2D road infra-
structure map.
On top of the current study, there are still some tasks that should
be carried out in the future. First, in order to fully cover the traffic
scene, multiple cameras are needed to be deployed at different lo-
cations and work with each other to monitor the relevant traffic
scene in parallel. Thus, it is necessary to take some efforts to de-
velop a feasible strategy to merge or remove the duplicate entities
within the overlapped monitoring zones.
Second, it is also desirable to conduct investigations for trajec-
tory prediction of road users based on the digital models of the
traffic scenes, which is of vital significance for collision warnings
and automated driving. To fit this purpose, some deep learning
techniques, such as the long short-term memory network (Kim
et al. 2017) and Social generative adversarial network (GAN)
(Gupta et al. 2018), can be exploited to function in conjunction with
the trajectory tracking module (Deep Sort) for trajectory prediction.
Third, to promote the adoption of digital twins in support of the
development of intelligent transportation systems, a more fine-
grained digital model of the traffic scene is sometimes required.
To this end, additional cutting-edge sensors and image processing
technologies will be incorporated into the proposed method to help
digitalize and introduce more real-world information into the dig-
ital models, such as license plate numbers, physical conditions of
pedestrians, weather conditions, and so on.
Fourth, the class imbalance problem in the established road
user data set should also be appropriately addressed, as it may
cause the trained deep learning model to exhibit a bias towards
the majority classes (e.g., car and pedestrian) due to their increased
prior probability. To address this hurdle, one commonly used strat-
egy is to randomly discard the majority class samples or replicate
the minority class samples to balance the class distribution on the
training data set (Van Hulse et al. 2007). Another alternative strat-
egy to handle the class imbalance is to take the class penalty and
weight into the consideration when designing the loss function for
model training (Johnson and Khoshgoftaar 2019). In the future
work, comprehensive experiments will be conducted to compare
the performance of different strategies to deal with the class imbal-
ance problem in the developed road user data set.
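As a sketch of the second strategy, inverse-frequency class weights can be passed to a standard cross-entropy loss so that minority classes contribute more per sample; the class counts below are illustrative assumptions, not the data set's actual distribution.

```python
# Sketch of class-weighted loss for handling class imbalance
import torch

counts = torch.tensor([18000.0, 900.0, 1400.0, 11000.0, 800.0, 600.0])  # assumed
weights = counts.sum() / (len(counts) * counts)   # inverse-frequency weights
criterion = torch.nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, 6)            # dummy predictions: 4 samples, 6 classes
labels = torch.tensor([0, 4, 5, 1])   # dummy ground-truth class indices
loss = criterion(logits, labels)
```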
Conclusions
This study proposed a vision-based method for the digitalization of
traffic scenes by leveraging UAV and roadside surveillance cam-
eras. The performance of the proposed method was evaluated
through a field experiment conducted at a road intersection. The
experiment results showed that the traffic scene can be successfully
digitalized by the proposed method with promising accuracy. The
contribution of this study lies in providing a cost-efficient tool for
the development of the digital twins of road transportation in sup-
port of intelligent transportation applications. Also, the proposed
method is scalable, namely, the coverage of the digital model
can be easily extended by adding new cameras to the surveillance
network. The potential applications of the proposed method include
infrastructure-vehicle cooperative autonomous driving, collision
warning and prediction, road user behavior analysis and under-
standing, and so on. However, the proposed method is still in its
infancy; thus, further investigations and performance testing need
to be conducted in the future to promote its applications in real-
world traffic scenes.
Data Availability Statement
The annotations for the image data sets and the created models that
support the findings of this study are available from the correspond-
ing author upon reasonable request.
Acknowledgments
This work was sponsored by a grant from the Center for Integrated
Asset Management for Multimodal Transportation Infrastructure
Systems (CIAMTIS), a US Department of Transportation University
Transportation Center, under federal Grant No. 69A3551847103.
The authors are grateful for the support. Any opinions, findings,
conclusions, and recommendations expressed in this paper are
those of the authors and do not necessarily reflect the views of
the CIAMTIS.
References
Aryan, A., F. Bosché, and P. Tang. 2021. "Planning for terrestrial laser scanning in construction: A review." Autom. Constr. 125 (May): 103551. https://doi.org/10.1016/j.autcon.2021.103551.
Bao, L., Q. Wang, and Y. Jiang. 2021. "Review of digital twin for intelligent transportation system." In Proc., Int. Conf. on Information Control, Electrical Engineering and Rail Transit (ICEERT), 309–315. New York: IEEE.
Bhatti, M. T., M. G. Khan, M. Aslam, and M. J. Fiaz. 2021. "Weapon detection in real-time CCTV videos using deep learning." IEEE Access 9 (Feb): 34366–34382. https://doi.org/10.1109/ACCESS.2021.3059170.
Caesar, H., V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom. 2020. "nuScenes: A multimodal dataset for autonomous driving." In Proc., IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 11621–11631. New York: IEEE.
Cai, Z., and N. Vasconcelos. 2019. "Cascade R-CNN: High quality object detection and instance segmentation." IEEE Trans. Pattern Anal. Mach. Intell. 43 (5): 1483–1498. https://doi.org/10.1109/TPAMI.2019.2956516.
Chen, J., Z. Kira, and Y. K. Cho. 2019. "Deep learning approach to point cloud scene understanding for automated scan to 3D reconstruction." J. Comput. Civ. Eng. 33 (4): 04019027. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000842.
Cordts, M., M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. 2016. "The cityscapes dataset for semantic urban scene understanding." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 3213–3223. New York: IEEE.
Criminisi, A., I. Reid, and A. Zisserman. 2000. "Single view metrology." Int. J. Comput. Vision 40 (2): 123–148. https://doi.org/10.1023/A:1026598000963.
Dutta, A., and A. Zisserman. 2019. "The VIA annotation software for images, audio and video." In Proc., 27th ACM Int. Conf. on Multimedia, 2276–2279. New York: Association for Computing Machinery.
El Marai, O., T. Taleb, and J. Song. 2020. "Roads infrastructure digital twin: A step toward smarter cities realization." IEEE Network 35 (2): 136–143. https://doi.org/10.1109/MNET.011.2000398.
Furukawa, Y., B. Curless, S. M. Seitz, and R. Szeliski. 2010. "Towards internet-scale multi-view stereo." In Proc., 2010 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, 1434–1441. New York: IEEE.
Gao, Y., S. Qian, Z. Li, P. Wang, F. Wang, and Q. He. 2021. "Digital twin and its application in transportation infrastructure." In Proc., 2021 IEEE 1st Int. Conf. on Digital Twins and Parallel Intelligence (DTPI), 298–301. New York: IEEE.
Giannakeris, P., V. Kaltsa, K. Avgerinakis, A. Briassouli, S. Vrochidis, and I. Kompatsiaris. 2018. "Speed estimation and abnormality detection from surveillance cameras." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition Workshops, 93–99. New York: IEEE.
Gupta, A., J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi. 2018. "Social GAN: Socially acceptable trajectories with generative adversarial networks." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 2255–2264. New York: IEEE.
Hartley, R., and A. Zisserman. 2003. Multiple view geometry in computer vision. Cambridge, UK: Cambridge University Press.
He, K., X. Zhang, S. Ren, and J. Sun. 2016. "Deep residual learning for image recognition." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 770–778. New York: IEEE.
Hu, C., W. Fan, E. Zeng, Z. Hang, F. Wang, L. Qi, and M. Z. A. Bhuiyan. 2021. "Digital twin-assisted real-time traffic data prediction method for 5G-enabled internet of vehicles." IEEE Trans. Ind. Inf. 18 (4): 2811–2819. https://doi.org/10.1109/TII.2021.3083596.
Hui, Y., Q. Wang, N. Cheng, R. Chen, X. Xiao, and T. H. Luan. 2021. "Time or reward: Digital-twin enabled personalized vehicle path planning." In Proc., 2021 IEEE Global Communications Conf. (GLOBECOM), 1–6. New York: IEEE.
ICIL (Integrated Construction Informatics Laboratory). 2022. "Digital twinning of traffic scenes." Accessed September 29, 2022. https://www.youtube.com/watch?v=tDywTX8pEoY.
Indyk, P., and R. Motwani. 1998. "Approximate nearest neighbors: Towards removing the curse of dimensionality." In Proc., 30th Annual ACM Symp. on Theory of Computing, 604–613. New York: Association for Computing Machinery.
Johnson, J. M., and T. M. Khoshgoftaar. 2019. "Survey on deep learning with class imbalance." J. Big Data 6 (27): 1–54.
Kar, A., S. Tulsiani, J. Carreira, and J. Malik. 2015. "Category-specific object reconstruction from a single image." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 1966–1974. New York: IEEE.
Ke, L., S. Li, Y. Sun, Y.-W. Tai, and C.-K. Tang. 2020. "GSNet: Joint vehicle pose and shape reconstruction with geometrical and scene-aware supervision." In Proc., European Conf. on Computer Vision, 515–532. New York: Springer.
Khaloo, A., and D. Lattanzi. 2017. "Hierarchical dense structure-from-motion reconstructions for infrastructure condition assessment." J. Comput. Civ. Eng. 31 (1): 04016047. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000616.
Kim, B., C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi. 2017. "Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network." In Proc., 2017 IEEE 20th Int. Conf. on Intelligent Transportation Systems (ITSC), 399–404. New York: IEEE.
Kumar, S. A., R. Madhumathi, P. R. Chelliah, L. Tao, and S. Wang. 2018. "A novel digital twin-centric approach for driver intention prediction and traffic congestion avoidance." J. Reliab. Intell. Environ. 4 (4): 199–209. https://doi.org/10.1007/s40860-018-0069-y.
Lee, S., S. Kim, and S. Moon. 2022. "Development of a car-free street mapping model using an integrated system with unmanned aerial vehicles, aerial mapping cameras, and a deep learning algorithm." J. Comput. Civ. Eng. 36 (3): 04022003. https://doi.org/10.1061/(ASCE)CP.1943-5487.0001013.
Lin, T. Y., M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. 2014. "Microsoft COCO: Common objects in context." In Proc., European Conf. on Computer Vision, 740–755. New York: Springer.
Liu, S., B. Yu, J. Tang, and Q. Zhu. 2021a. "Towards fully intelligent transportation through infrastructure-vehicle cooperative autonomous driving: Challenges and opportunities." In Proc., 2021 58th ACM/IEEE Design Automation Conf. (DAC), 1323–1326. New York: IEEE.
Liu, Z., Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. 2021b. "Swin transformer: Hierarchical vision transformer using shifted windows." In Proc., IEEE/CVF Int. Conf. on Computer Vision, 10012–10022. New York: IEEE.
Lowe, D. G. 1999. "Object recognition from local scale-invariant features." In Proc., 7th IEEE Int. Conf. on Computer Vision, 1150–1157. New York: IEEE.
Lu, L., and F. Dai. 2022. "A unified normalization method for homography estimation using combined point and line correspondences." Comput.-Aided Civ. Infrastruct. Eng. 37 (8): 1010–1026. https://doi.org/10.1111/mice.12788.
Lu, L., and F. Dai. 2023. "Automated visual surveying of vehicle heights to help measure the risk of overheight collisions using deep learning and view geometry." Comput.-Aided Civ. Infrastruct. Eng. 38 (2): 194–210. https://doi.org/10.1111/mice.12842.
Lv, Z., Y. Li, H. Feng, and H. Lv. 2021. "Deep learning for security in digital twins of cooperative intelligent transportation systems." IEEE Trans. Intell. Transp. Syst. 23 (9): 16666–16675. https://doi.org/10.1109/TITS.2021.3113779.
Mottaghi, R., Y. Xiang, and S. Savarese. 2015. "A coarse-to-fine model for 3D pose estimation and sub-category recognition." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 418–426. New York: IEEE.
Muller, K., A. Smolic, M. Drose, P. Voigt, and T. Wiegand. 2005. "3-D reconstruction of a dynamic environment with a fully calibrated background for traffic scenes." IEEE Trans. Circuits Syst. Video Technol. 15 (4): 538–549. https://doi.org/10.1109/TCSVT.2005.844452.
Niesen, U., and J. Unnikrishnan. 2020. "Camera-radar fusion for 3-D depth reconstruction." In Proc., 2020 IEEE Intelligent Vehicles Symp. (IV), 265–271. New York: IEEE.
Nikouei, S. Y., Y. Chen, S. Song, R. Xu, B.-Y. Choi, and T. R. Faughnan. 2018. "Real-time human detection as an edge service enabled by a lightweight CNN." In Proc., 2018 IEEE Int. Conf. on Edge Computing (EDGE), 125–129. New York: IEEE.
Pan, Y., N. Wu, T. Qu, P. Li, K. Zhang, and H. Guo. 2021. "Digital-twin-driven production logistics synchronization system for vehicle routing problems with pick-up and delivery in industrial park." Int. J. Comput. Integr. Manuf. 34 (7–8): 814–828. https://doi.org/10.1080/0951192X.2020.1829059.
Reddy, N. D., M. Vo, and S. G. Narasimhan. 2018. "CarFusion: Combining point tracking and part detection for dynamic 3D reconstruction of vehicles." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 1906–1915. New York: IEEE.
Sharma, R., and A. Sungheetha. 2021. "An efficient dimension reduction based fusion of CNN and SVM model for detection of abnormal incident in video surveillance." J. Soft Comput. Paradigm 3 (2): 55–69. https://doi.org/10.36548/jscp.2021.2.001.
Smith, M. W., and D. Vericat. 2015. "From experimental plots to experimental landscapes: Topography, erosion and deposition in sub-humid badlands from structure-from-motion photogrammetry." Earth Surf. Processes Landforms 40 (12): 1656–1671. https://doi.org/10.1002/esp.3747.
Sochor, J., A. Herout, and J. Havel. 2016. "BoxCars: 3D boxes as CNN input for improved fine-grained vehicle recognition." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 3006–3015. New York: IEEE.
Strigel, E., D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer. 2014. "The Ko-PER intersection laserscanner and video dataset." In Proc., 17th Int. IEEE Conf. on Intelligent Transportation Systems (ITSC), 1900–1901. New York: IEEE.
Su, H., C. R. Qi, Y. Li, and L. J. Guibas. 2015. "Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views." In Proc., IEEE Int. Conf. on Computer Vision, 2686–2694. New York: IEEE.
Szeliski, R. 2010. Computer vision: Algorithms and applications. Berlin: Springer Science & Business Media.
Van Hulse, J., T. M. Khoshgoftaar, and A. Napolitano. 2007. "Experimental perspectives on learning from imbalanced data." In Proc., 24th Int. Conf. on Machine Learning, 935–942. New York: Association for Computing Machinery.
Williams, N., and M. Barth. 2020. "A qualitative analysis of vehicle positioning requirements for connected vehicle applications." IEEE Intell. Transp. Syst. Mag. 13 (1): 225–242. https://doi.org/10.1109/MITS.2019.2953521.
Wojke, N., A. Bewley, and D. Paulus. 2017. "Simple online and realtime tracking with a deep association metric." In Proc., 2017 IEEE Int. Conf. on Image Processing (ICIP), 3645–3649. New York: IEEE.
Xia, Y., W. Xu, L. Zhang, X. Shi, and K. Mao. 2015. "Integrating 3D structure into traffic scene understanding with RGB-D data." Neurocomputing 151 (Mar): 700–709. https://doi.org/10.1016/j.neucom.2014.05.091.
Yang, B., W. Luo, and R. Urtasun. 2018. "PIXOR: Real-time 3D object detection from point clouds." In Proc., IEEE Conf. on Computer Vision and Pattern Recognition, 7652–7660. New York: IEEE.
Yang, L., M. Li, X. Song, Z. Xiong, C. Hou, and B. Qu. 2019. "Vehicle speed measurement based on binocular stereovision system." IEEE Access 7 (Jul): 106628–106641. https://doi.org/10.1109/ACCESS.2019.2932120.
Zhang, X., Y. Feng, P. Angeloudis, and Y. Demiris. 2022. "Monocular visual traffic surveillance: A review." IEEE Trans. Intell. Transp. Syst. 23 (9): 14148–14165. https://doi.org/10.1109/TITS.2022.3147770.