Three-Dimensional Tracking of Construction Resources
Using an On-Site Camera System
Man-Woo Park, A.M.ASCE1; Christian Koch2; and Ioannis Brilakis, M.ASCE3
Abstract: Vision trackers have been proposed as a promising alternative for tracking at large-scale, congested construction sites. They
provide the location of a large number of entities in a camera view across frames. However, vision trackers provide only two-dimensional
(2D) pixel coordinates, which are not adequate for construction applications. This paper proposes and validates a method that overcomes this
limitation by employing stereo cameras and converting 2D pixel coordinates to three-dimensional (3D) metric coordinates. The proposed
method consists of four steps: camera calibration, camera pose estimation, 2D tracking, and triangulation. Given that the method employs
fixed, calibrated stereo cameras with a long baseline, appropriate algorithms are selected for each step. Once the first two steps reveal camera
system parameters, the third step determines 2D pixel coordinates of entities in subsequent frames. The 2D coordinates are triangulated on
the basis of the camera system parameters to obtain 3D coordinates. The methodology presented in this paper has been implemented
and tested with data collected from a construction site. The results demonstrate the suitability of this method for on-site tracking purposes.
DOI: 10.1061/(ASCE)CP.1943-5487.0000168. © 2012 American Society of Civil Engineers.
CE Database subject headings: Automation; Imaging techniques; Models; Information technology (IT); Cameras; Remote sensing;
Construction management.
Author keywords: Automation; Imaging techniques; Computer-aided vision system; Models; Information technology; Remote sensing.
Introduction
Three-dimensional (3D) object tracking on construction sites has a
wide variety of applications. It allows identification and tracking of
personnel, equipment, and materials to support effective progress
monitoring, activity sequence analysis, productivity measurements,
and asset management and to enhance site safety. In addition,
tracking instantly enables the identification of critical activities
and problems, which allows for on-site project control and
decision-making capabilities. Available tracking solutions are pri-
marily on the basis of radio frequency technologies, including
global positioning system (GPS), radio frequency identification
(RFID), and ultra-wideband (UWB) technologies. They all work
under the same principle of having a sensor attached on each entity
to be tracked. These technologies have been applied and proven to
work excellently for most scenarios involved in construction man-
agement, such as proactive work zone safety and material registra-
tion and installation (Teizer et al. 2007b; Ergen et al. 2007; Song
et al. 2006). However, when it comes to large-scale and congested
construction sites, the installation of the sensor system can be costly
and time-consuming because of the large number of items involved.
Also, privacy issues can arise out of tagging workers. For these
specific scenarios, vision-based tracking may have the potential
for use as an efficient alternative.
Vision-based methods have been introduced for tracking entities
on construction sites. Vision-based tracking works by receiving
video streams and estimating an entity's motion in subsequent
video frames on the basis of the history of its appearance and
location. Its capability of tracking multiple entities without the
installation of sensors on the entities has great potential in
construction applications. It provides two-dimensional (2D) pixel
coordinates, x and y, of the entities across time. The 2D results
may be useful when predefined measurements on an entity's
trajectory are available, as in the research work of Gong and Caldas
(2010). However, the 2D results are generally not enough to extract
substantial information for most construction management tasks
because it is unknown how far entities are located from the camera.
Because of the lack of depth information (z), even approximate
distance measurements between two entities, e.g., workers and mobile
equipment, are not reliable, although such measurements are necessary
for safety management. Also, any movement along the z axis is not
measurable. Brilakis et al. (2011) proposed a framework for 3D
vision-based tracking that can provide 3D coordinates of entities by
deploying stereo cameras. The framework consists of several processes,
including construction entity detection, 2D tracking, and correlation
of 2D tracking results to calculate 3D locations. Thus far, only 2D
tracking of construction resources has been validated successfully
(Park et al. 2011) from this framework. Because of the large scope
of the framework, each single process has not yet been fully detailed
and validated.
This paper presents and validates the framework's method for
correlating 2D tracking results across multiple views and
calculating the 3D location of construction entities. This method employs
stereo vision to provide 3D trajectories of moving entities. In
the current state of research, stereo vision has been applied to
1Ph.D. Candidate, School of Civil and Environmental Engineering,
Georgia Institute of Technology, 130 Hinman Research Building, 723
Cherry St., Atlanta, GA 30332 (corresponding author). E-mail: mw.park@
gatech.edu
2Postdoctoral Associate, Computing in Engineering, Faculty of
Civil and Environmental Engineering, Ruhr-Universität Bochum, Uni-
versitätsstraße 150, 44780 Bochum, Germany. E-mail: koch@inf.bi.rub.de
3Assistant Professor, School of Civil and Environmental Engineering,
Georgia Institute of Technology, 328 Sustainable Education Building, 788
Atlantic Dr. NW, Atlanta, GA 30332. E-mail: brilakis@gatech.edu
Note. This manuscript was submitted on June 10, 2011; approved on
September 16, 2011; published online on September 19, 2011. Discussion
period open until December 1, 2012; separate discussions must be sub-
mitted for individual papers. This paper is part of the Journal of Comput-
ing in Civil Engineering, Vol. 26, No. 4, July 1, 2012. © ASCE, ISSN
0887-3801/2012/4-541–549/$25.00.
3D modeling for construction progress monitoring (Chae and Kano
2007; Son and Kim 2010; Golparvar-Fard et al. 2010), focusing on
the retrieval and visualization of static structure components,
whereas the focus of this paper lies in 3D localization of moving
entities and the accuracy of localization measurements. The camera
system consists of two fixed cameras with a baseline of several meters,
which is significantly longer than the Bumblebee's 24-cm baseline
(Point Grey 2011). The long baseline allows competitive
accuracy in localizing far-located entities. Under the proposed
method, in the first step, cameras are calibrated to find their intrin-
sic parameters, i.e., the focal length, the principal point, radial dis-
tortions, and tangential distortions of a camera. The second step is
to estimate a relative pose (rotation and translation) of the calibrated
cameras, which is called extrinsic parameters. Once the intrinsic
and extrinsic parameters are known, a 2D tracker is applied to every
video frame of each camera in the third step. Using a kernel-based
2D tracking algorithm, the 2D pixel coordinates of an entity's centroid
are determined. To obtain 3D locations, in the fourth step, 2D
tracking results are triangulated on the basis of the intrinsic and
extrinsic parameters. In every frame, a projection of the determined
centroid is obtained for each camera. Finally, an intersection of the
two projections from two cameras determines the 3D location of a
tracked entity.
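For readers who want a concrete outline of these four steps, a minimal sketch is given below using OpenCV and NumPy. It is an illustration only, not the implementation used in this study; the 2D tracker call (track_2d) and the region-of-interest arguments are hypothetical placeholders.

```python
# Illustrative sketch of the four-step pipeline (not the authors' code).
# track_2d() is a hypothetical placeholder for a kernel-based 2D tracker.
import cv2
import numpy as np

def track_3d(frames_left, frames_right, calib_left, calib_right, R, T,
             track_2d, roi_left, roi_right):
    """Return one 3D point per frame pair for a single tracked entity."""
    K1, dist1 = calib_left            # intrinsic matrix and distortion coefficients
    K2, dist2 = calib_right
    # Projection matrices: left camera at the origin, right camera posed by (R, T)
    P1 = K1 @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K2 @ np.hstack([R, T.reshape(3, 1)])
    points_3d = []
    for f_l, f_r in zip(frames_left, frames_right):
        c_l = track_2d(f_l, roi_left)     # (x, y) centroid in the left view
        c_r = track_2d(f_r, roi_right)    # (x, y) centroid in the right view
        # Remove lens distortion from the centroids, then triangulate
        u_l = cv2.undistortPoints(np.float32([[c_l]]), K1, dist1, P=K1)
        u_r = cv2.undistortPoints(np.float32([[c_r]]), K2, dist2, P=K2)
        X_h = cv2.triangulatePoints(P1, P2, u_l.reshape(2, 1), u_r.reshape(2, 1))
        points_3d.append((X_h[:3] / X_h[3]).ravel())   # homogeneous -> metric
    return np.array(points_3d)
```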
The proposed method is tested on videos recorded at a construction
site. The tests involve three types of entities: a steel plate,
a worker, and a van. Various point matching methods and different
baseline lengths are applied to identify their effects on accuracy.
The resulting error is at most 0.658 m with 95% confidence,
which validates the effectiveness, accuracy, and applicability of the
proposed vision-based 3D tracking approach.
Background
State of Practice in Tracking Technology
Common tracking methods are either on the basis of radio
frequency, which includes several types of technologies such as
GPS, RFID, Bluetooth, wireless fidelity (Wi-Fi), and
ultra-wideband, or they make use of optical remote sensors, such
as 2D image/video cameras and 3D range cameras, e.g., Flash
LADAR.
Global positioning system is an outdoor satellite-based world-
wide navigation system formed by a constellation of satellites and
ground control stations. The 3D position is determined by a GPS
receiver using triangulation on the basis of these satellites. Global
positioning system is an established location technology that offers
a wide range of off-the-shelf solutions in both hardware and soft-
ware. According to Caldas et al. (2004), GPS applications have
been applied to construction practices, such as positioning of equip-
ment and surveying. However, when using only GPS, there is lim-
ited potential in other applications, such as improving the
management of materials on construction job sites. Moreover, it
can only operate outdoors and the accuracy is only approximately
10 m.
Radio frequency identification is used for identifying and
tracking various objects (Ergen et al. 2007). Radio frequency iden-
tification systems are primarily composed of a tag and a reader.
Radio frequency identification technology does not require line
of sight and it is also durable in harsh environments and can be
embedded in concrete. Radio frequency identification enables
efficient automatic data collection because readers can be mounted
on any structure in the reading range and each reader can scan
multiple tags at a given time. However, this technology, unless
combined with other tools (Ergen et al. 2007), can only report
the radius inside which the tracked entity exists, and most impor-
tantly, the near-sighted effect prohibits its use in tracking applica-
tions. Combinations of GPS and RFID technologies have been
recently explored (Song et al. 2004,2006). The advantage of this
combination is that GPS sensors need to only accompany the tag
readers and not the materials. Every time a tag is located, the 3D
coordinates as reported by the GPS can be recorded as the location
of each piece of material at that given time.
Another type of radio technology that can be applied to short-
range communications is UWB. Ultra-wideband is able to detect
time of flight of the radio transmissions at various frequencies,
which enables it to perform effectively in providing precision
localization even in the presence of severe multipath effects
(Fontana et al. 2003). Another advantage is the low average power
requirement that results from the low pulse rate (Fontana 2004).
Teizer et al. (2007b) applied the UWB technology to construction.
It was used for a material location tracking system with primary
applications to active work zone safety. Its ability to provide accu-
rate 3D locations in real time is a definite benefit to tracking in
construction sites.
Vision technologies and laser technologies are attracting
increasing interests for tracking in large-scale, congested sites
because they are free of tags. A 3D range imaging/video camera,
e.g., a Flash LADAR, provides not only the intensity but also the
estimated range of the corresponding image area. When compared
with 3D laser scanners, which have been used in construction, the
device is portable and inexpensive. Testing various kinds of data
filtering, transformation, and clustering algorithms, Gong and
Caldas (2008) used 3D range cameras for spatial modeling. Teizer
et al. (2007a) demonstrated tracking with 3D range cameras and the
potential of its use for site safety enhancement. However, the low
resolution and short range make it difficult to be applied to large-
scale construction sites. Few tests have been executed in outdoor
construction sites in which the environments are more cluttered and
less controlled. Also, it is reported that the reflectance of a surface
varies extremely even in indoor environments (Gächter et al. 2006).
Moreover, when multiple cameras are used, they can interfere with
one another (Fuchs 2010).
Traditional 2D vision trackers are simply on the basis of a
sequence of images and can be a proper alternative to RFID meth-
ods because they remove the need for installing sensors and identity
(ID) tags of any kind on the tracked entity. For this reason, this
technology is (1) highly applicable in dynamic, busy construction
sites in which large numbers of equipment, personnel, and materi-
als are involved; and (2) more desirable from personnel who wish to
avoid being taggedwith sensors. In Gruens research (1997), it is
highly regarded for its capability to measure a large number of par-
ticles with a high level of accuracy. Yang et al. (2010) proposed a
vision tracker that can track multiple construction workers. Gong
and Caldas (2010) showed the applicability of vision tracking to
automated productivity analysis.
Two-dimensional vision trackers can be categorized in kernel-
based, contour-based, and point-based methods, depending on the
way of representing objects. In kernel-based methods, an object is
represented by the color or texture in the region of interest, and its
position in the next frame is estimated on the basis of the region's
color or texture information. In contour-based methods, an object is
represented by silhouettes or contours that determine the boundary
of the object. In point-based methods, an object is represented by a
set of feature points extracted from the region that contains the
object. Out of the three categories, kernel-based methods are the
most suitable for construction-related applications with respect
to construction sites' characteristics, such as illumination
conditions, occlusion, and object types. Park et al. (2011) reported
that kernel-based methods could effectively track construction entities
in various illumination conditions and that they performed
well even on objects occluded by 50% or more. Entities that
fail to be tracked because of severe occlusions can still be recovered
by reinitialization within an object detection process. The 2D
tracker used in this paper is on the basis of the method by Ross et al.
(2008). It tracks an object on the basis of a model template composed
of eigenimages, and it represents the tracked object as an
affine-transformed rectangle that encloses it. Six affine parameters,
the x and y coordinates of the centroid, scale, aspect ratio, rotation, and
skew, are estimated through particle filtering.
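As an illustration only (not the authors' code), the six affine parameters can be assembled into a 2 × 3 warp matrix that maps template coordinates into the image. The exact composition below, from centroid, scale, aspect ratio, rotation, and skew, is an assumed convention rather than the specific parameterization of the cited tracker.

```python
import numpy as np

def affine_warp(cx, cy, scale, aspect, rotation, skew):
    """Build a 2x3 affine warp from six tracked parameters: centroid (cx, cy),
    scale, aspect ratio, rotation (rad), and skew. The composition order used
    here is an assumption for illustration."""
    R = np.array([[np.cos(rotation), -np.sin(rotation)],
                  [np.sin(rotation),  np.cos(rotation)]])
    S = np.array([[scale, scale * skew],
                  [0.0,   scale * aspect]])
    A = R @ S                      # 2x2 linear part (scale, aspect, rotation, skew)
    t = np.array([[cx], [cy]])     # translation (centroid position)
    return np.hstack([A, t])       # 2x3 matrix usable with cv2.warpAffine

# Example: map three template corner points into image coordinates
pts = np.array([[0, 0], [10, 0], [0, 10]], dtype=float).T   # 2x3 points
warp = affine_warp(cx=320, cy=240, scale=1.2, aspect=0.8, rotation=0.1, skew=0.05)
mapped = warp[:, :2] @ pts + warp[:, 2:3]
```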
Stereo View Geometry
Two-dimensional vision-based tracking is not comparable with
other 3D technologies previously described unless it can provide
3D information. To reconstruct the 3D position of an entity, several
steps must be taken to determine the stereo view geometry (Hartley
and Zisserman 2004). Heikkilä and Silvén (1997), Zhang (1999),
and Bouguet (2004) presented and provided standard calibration
tools. The calibration tools reveal intrinsic camera parameters,
including the focal length, the principal point, radial distortions,
and tangential distortions. They use calibration objects that have
specific patterns, such as a checkerboard. In Zhang's calibration
method, tangential distortion is not modeled. Heikkilä and Silvén's
toolbox and Bouguet's toolbox use the same distortion model that
takes into account both radial and tangential distortions. Therefore,
both toolboxes generally result in almost equivalent calibrations.
Bouguet provides additional functions, such as error analysis,
which are useful for recalibrating with revised inputs.
After having calibrated each camera separately, the external
camera system has to be determined (see Fig. 1). For this purpose,
feature points are identified and matched within the two camera
views. The most well-known and robust algorithms commonly
used for this task are the scale-invariant feature transform (SIFT)
(Lowe 2004) and speeded up robust features (SURF) (Bay et al.
2008). Whereas SIFT uses Laplacian of Gaussian (LOG), differ-
ence of Gaussian (DOG), and histograms of local oriented gra-
dients, SURF relies on a Hessian matrix and the distribution of
Haar-wavelet responses for feature point detection and matching,
respectively. Although SIFT turned out to be slightly better in terms
of accuracy, SURF is computationally much more efficient (Bauer
et al. 2007). The algorithms SIFT and SURF provide point matches,
including extreme outliers (mismatches) that have to be removed.
To achieve that, robust algorithms for managing the outliers were
introduced. Random sample consensus (RANSAC) (Hartley and
Zisserman 2004) and maximum a posteriori sample consensus
(MAPSAC) (Torr 2002) are the representative robust methods.
The RANSAC method minimizes the number of outliers by ran-
domly selecting a small subset of the point matches and repeating
the maximization process for different subsets until it reaches a de-
sired confidence in the exclusion of outliers. One of its problems is
the poor estimates associated with a high threshold (Torr 2002).
Working in a similar way to RANSAC, MAPSAC resolved this
problem by minimizing not only the number of outliers but also
the error associated with the inliers.
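As a sketch of this matching and outlier-removal stage (not the implementation used in the cited studies), OpenCV's SIFT detector, Lowe's distance-ratio test, and RANSAC filtering through the fundamental matrix can be combined as follows:

```python
import cv2
import numpy as np

def matched_points(img_left, img_right, ratio=0.6):
    """Detect SIFT features in both views, keep matches passing Lowe's
    distance-ratio test, and remove remaining outliers with RANSAC."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img_left, None)
    kp2, des2 = sift.detectAndCompute(img_right, None)

    # Two nearest neighbors per descriptor for the distance-ratio test
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des1, des2, k=2)
    good = [pair[0] for pair in knn
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance]

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

    # RANSAC on the fundamental matrix flags the remaining mismatches
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
    inliers = inlier_mask.ravel().astype(bool)
    return pts1[inliers], pts2[inliers]
```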
The next step is the estimation of the essential matrix, E, on the
basis of the identified point matches. In general, the normalized
eight point (Hartley 1997), seven point (Hartley and Zisserman
2004), six point (Pizarro et al. 2003), and five point (Nistér
2004) algorithms are used for this purpose. Eight, seven, six,
and five are the minimum numbers of points required to perform
the estimation. Rashidi et al. (2011) compared the resulting accu-
racy of these algorithms in practical civil infrastructure environ-
ments, finding the five-point algorithm to be the best. However,
because of its simplicity and reasonable accuracy the normalized
eight-point algorithm is still the most common one and the second
best according to Brückner et al. (2008). On the basis of the essen-
tial matrix, E, the relative pose of the two cameras (R and T in Fig. 1)
can be derived directly (Hartley and Zisserman 2004).
In the last step, triangulation is performed. On the basis of two
corresponding pixels in the respective view, two lines of sight have
to be intersected to find the 3D position (Fig. 1). However, because
of image noise and slightly incorrect point correspondences, the
two rays may not intersect in space. To address this problem,
Hartley-Sturm optimal triangulation (Hartley and Sturm 1997)
and optimal correction (Kanatani et al. 2008) algorithms are cur-
rently used as standard methods for finding corrected correspond-
ences. They both try to find the minimum displacement through the
geometric error minimization, correct the pixel coordinates accord-
ingly, and intersect the corrected rays to determine 3D coordinates.
Although the latter is faster, the former's results are
more accurate (Fathi and Brilakis 2011).
Several researchers have introduced and applied stereo vision
technologies to construction. Most applications presented so far
are related to 3D modeling of structures for progress monitoring.
Chae and Kano (2007) estimated spatial data for development of a
project control system from stereo images. In another work, Son
and Kim (2010) used a stereo vision system to acquire 3D data
and to recognize 3D structural components. Golparvar-Fard et al.
(2010) presented a sparse 3D representation of a site scene using
daily progress photographs for use as an as-built model. In contrast
to creating 3D geometry models on the basis of static
feature points, the application of stereo vision in this paper locates
moving entities in 3D across time. Furthermore, this paper
measures the accuracy of 3D positioning by comparison with total
station data.
Problem Statement and Objectives
As described in the previous section, the results of general vision-
based tracking are restricted to 2D. The applications of these
results are limited at large-scale, congested construction sites.
Brilakis et al. (2011) introduced a framework for 3D vision
tracking, which employs multiple fixed cameras to calculate the
3D location of an entity. From this framework, this paper aims
to present and validate the method of combining 2D tracking results
with stereo vision geometry for the sake of accurate 3D trajectories
of far-located construction entities. This research aims strictly
at accurate localization of construction entities and not at real-time
processing. Each single step involved in this method should
be optimized to the characteristics of the fixed camera system and
construction sites, such as the various types of construction entities,
the long baseline, and the long distance from the cameras to an entity
that is inevitable at large-scale construction sites.

Fig. 1. Epipolar geometry and centroid relocation
Methodology
The proposed method is shown in Fig. 2 and is composed of four
steps: camera calibration, camera pose estimation, 2D tracking,
and triangulation. To calculate the 3D positions of an object, the
registration of the camera system is required. The camera system in
this method is composed of two cameras located several meters apart
from one another. This system is described by epipolar geometry, as
shown in Fig. 1. This geometry consists of two types of parameters:
intrinsic and extrinsic parameters. Intrinsic parameters determine
the linear system of projecting 3D points onto the image plane
(P1 and P2 in Fig. 1). Bouguet's calibration toolbox (2004) is used
to reveal the intrinsic parameters because of its accuracy, robust
convergence, and convenience.
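A rough OpenCV analogue of this intrinsic calibration step is sketched below. The paper itself uses Bouguet's MATLAB toolbox; the inner-corner count of the checkerboard (derived from the 7 × 9-block board with 65 mm squares reported in the experiments) and the frame directory are assumptions.

```python
import cv2
import numpy as np
import glob

# Inner corners of a 7 x 9-block checkerboard with 65 mm squares
# (8 x 6 inner corners is an assumption about how the board is counted).
PATTERN = (8, 6)
SQUARE = 0.065   # meters

# 3D corner coordinates in the checkerboard's own plane (Z = 0)
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE

obj_points, img_points = [], []
for path in glob.glob("calib_frames/*.png"):     # hypothetical frame directory
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# K holds the focal length and principal point; dist holds the radial and
# tangential distortion coefficients.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```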
The focal point of the left camera becomes the origin of the
coordinate system. Extrinsic parameters represent the relative
pose of the right camera with respect to the left one (the rotation
matrix R and the translation vector T in Fig. 1). The estimation of R
and T involves point matching between the two views. Two combinations
of algorithms are considered in this paper. One uses SURF
(Bay et al. 2008) and RANSAC (Hartley and Zisserman 2004) for the
feature descriptor and outlier removal, respectively. This combination
proved to be fast and accurate enough for point cloud generation
of infrastructure (Fathi and Brilakis 2011). The other uses
SIFT (Lowe 2004) and MAPSAC (Torr 2002), which is slower but
capable of acquiring more matches than the former combination.
Even though the use of SIFT is slower than SURF, this combination
is worth considering in the application for the following
reasons. First, the cameras are fixed in the application, which
requires the camera pose estimation only once at the initial stage of
the framework. Therefore, the longer processing time of using SIFT
can be ignored. Second, as a longer baseline, i.e., the distance
between the two cameras, is used, fewer point matches are obtained
because of the higher disparity between the two camera views. In this
case, SIFT and MAPSAC can be helpful in feeding more inlier matches
and fewer outlier matches to the next step.
The normalized eight point algorithm (Hartley 1997) is selected
to estimate the essential matrix on the basis of intrinsic parameters
and point matches. The selected method is the most widely used
because of its simple implementation and reasonably accurate
results. Although this method is less computationally efficient
and more sensitive to degeneracy problems compared with other
methods (Nistér 2004; Li and Hartley 2006), it is still efficient
and accurate enough to satisfy needs with regard to fixed camera
positions, a long baseline, and the complexity of the construction
sites. Finally, extrinsic parameters, R and T, are recovered directly
from the essential matrix (Hartley and Zisserman 2004). These
parameters together with the intrinsic parameters are used for
triangulating 2D tracking results.
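In OpenCV terms, this pose estimation stage can be sketched as follows. This is an illustration, not the authors' code; it assumes the point matches have already been filtered for outliers and uses an eight-point least-squares fit over all remaining matches.

```python
import cv2
import numpy as np

def relative_pose(pts1, pts2, K1, K2):
    """Estimate the right camera's pose (R, T) relative to the left camera from
    inlier point matches, via an eight-point fit of the fundamental matrix and
    the standard decomposition of the essential matrix."""
    # Eight-point fit over all supplied (already outlier-filtered) matches
    F, _ = cv2.findFundamentalMat(pts1, pts2, cv2.FM_8POINT)
    E = K2.T @ F @ K1                         # essential matrix from F and intrinsics
    # Normalize the image points, then pick the (R, T) pair that places the
    # points in front of both cameras.
    pts1_n = cv2.undistortPoints(pts1.reshape(-1, 1, 2), K1, None)
    pts2_n = cv2.undistortPoints(pts2.reshape(-1, 1, 2), K2, None)
    _, R, T, _ = cv2.recoverPose(E, pts1_n, pts2_n)
    return R, T   # T has unit norm; the metric scale is not determined here
```

Because the essential matrix yields the translation only up to scale, the unit-norm T would have to be scaled by the measured baseline (3.8 or 8.3 m in the experiments reported later) to obtain metric 3D coordinates.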
For each calibrated camera view, an identified construction
entity is tracked across subsequent frames. According to the com-
parative study of Park et al. (2011), a kernel-based 2D tracker,
which is based on the method by Ross et al. (2008), is used. In
this paper, the eigenimage is constructed selectively with grayscale
values or saturation values, depending on the tracked entity's color
characteristics, to enhance the accuracy. Also, in the particle filtering
process, the position translation, delta-x and delta-y between
consecutive frames, is considered instead of the entity location,
the x and y coordinates. This estimation strategy is beneficial for
correctly locating the entity with fewer samples in particle filtering.
The centroid coordinates are updated every frame by accumulating
the estimated translation vector.
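A minimal sketch of this translation-based particle filtering step is given below. The appearance-likelihood function stands in for the eigenimage reconstruction error of the actual tracker and is a hypothetical placeholder; only the state definition (per-frame translation accumulated onto the centroid) is what the text above describes.

```python
import numpy as np

def track_step(frame, prev_centroid, particles, appearance_likelihood,
               motion_std=4.0):
    """One particle-filtering step over the per-frame translation (dx, dy).
    The centroid is updated by accumulating the estimated translation rather
    than by sampling absolute (x, y) positions directly."""
    rng = np.random.default_rng()
    n = len(particles)
    # Propagate: perturb each candidate translation
    particles = particles + rng.normal(0.0, motion_std, size=(n, 2))
    # Weight by appearance: placeholder for the eigenimage reconstruction error
    weights = np.array([appearance_likelihood(frame, prev_centroid + d)
                        for d in particles])
    weights = weights / weights.sum()
    # Weighted-mean translation, accumulated onto the centroid
    delta = weights @ particles
    centroid = prev_centroid + delta
    # Resample to keep the particle set healthy
    particles = particles[rng.choice(n, size=n, p=weights)]
    return centroid, particles
```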
The results obtained in the two previous steps, the epipolar geometry
and the two centroids, are fed into the triangulation step. Generally, the
projections of the two centroid coordinates determined from the two views
do not intersect one another because of camera lens distortions and
errors caused by 2D tracking. Even if the 2D tracker correctly
locates the entity in each frame, the disparity between the two camera
views causes a mismatch of the centroids. To enhance the accuracy of
the triangulation process, the two centroids have to be relocated so
that their projections intersect (see Fig. 1). For this purpose, Hartley
and Sturm's algorithm (Hartley and Sturm 1997) is selected because
accuracy is more critical than processing time in
this application. Intersecting the projections of the modified pair of
centroids for each frame leads to the 3D coordinates of the tracked
entity.
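A sketch of this final step using OpenCV is shown below; cv2.correctMatches implements the Hartley-Sturm optimal correction. It is an illustration, not the authors' implementation, and it assumes the centroids have already been corrected for lens distortion and that the fundamental matrix F consistent with (R, T) is available.

```python
import cv2
import numpy as np

def triangulate_centroids(c_left, c_right, K1, K2, R, T, F):
    """Relocate a centroid pair with the Hartley-Sturm optimal correction,
    then intersect the corrected rays to obtain the 3D point.
    Assumes c_left and c_right are undistorted pixel coordinates."""
    P1 = K1 @ np.hstack([np.eye(3), np.zeros((3, 1))])   # left camera at the origin
    P2 = K2 @ np.hstack([R, T.reshape(3, 1)])            # right camera pose
    pts1 = np.asarray(c_left, dtype=float).reshape(1, 1, 2)
    pts2 = np.asarray(c_right, dtype=float).reshape(1, 1, 2)
    # Minimal geometric-error correction so the two rays intersect exactly
    pts1_c, pts2_c = cv2.correctMatches(F, pts1, pts2)
    X_h = cv2.triangulatePoints(P1, P2, pts1_c.reshape(2, 1), pts2_c.reshape(2, 1))
    return (X_h[:3] / X_h[3]).ravel()                     # homogeneous -> metric
```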
Experiments and Results
The data for validation are collected from a construction site at the
Georgia Institute of Technology. This site is the construction of an
indoor football practice facility managed by Barton Malow Com-
pany. The roof and columns of the steel-framed facility were
already completed when the data were collected. The videos were
taken with two high-definition (HD) camcorders (Canon VIXIA
HF S100, 30 frames per second, 1,920 × 1,080 pixels) located
approximately 4.5 m above the ground on one side of the facility
structure, where the ground area of the facility structure could be
overlooked. One total station (Sokkia SET 230RK3) was used
to acquire ground truths of the entities' trajectories, which are
compared with the obtained results.
Fig. 2. Methodology overview

Figs. 3 and 4 show the positions of the cameras and the entities'
trajectories from a bird's eye view on the basis of the total station
coordinate system and the cameras' views. In Figs. 3 and 4, trajectories
1 and 2 are composed of 10 and eight segments of straight lines,
located approximately 39 and 43 m from the left camera,
respectively. Trajectory 3 is one straight line located 36 m
from the left camera. The total station data include the endpoints of
all segments, i.e., nine, 11, and two points for trajectories 1, 2, and 3,
respectively. The ground-truth trajectories are made by connecting
those points with straight lines. The proposed methodology is
tested on three types of entities: a worker, a steel plate carried
by a worker, and a van. Trajectories 1 and 2 are those of a worker
and a steel plate, and trajectory 3 is of a van. The accuracy of
tracking is quantified by an absolute error that is defined as the
distance between the tracked point and the ground-truth trajectory.
For each frame j, the distance D_j is calculated by the following
equation:

D_j = \frac{\lvert (Q_{i+1} - Q_i) \times (P_j - Q_i) \rvert}{\lvert Q_{i+1} - Q_i \rvert}

where Q_i and Q_{i+1} = endpoints of the ith line segment
L_i = Q_i + t(Q_{i+1} - Q_i) of the ground-truth trajectories on which
the object in frame j lies; and P_j = the jth frame's tracking result,
i.e., a 3D point.
The main causes of error considered in this paper can be classified
into 2D tracker error and camera pose estimation error.
Also, the assumption that an entity moves exactly along a straight
line is another miscellaneous cause of error.
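A direct implementation of the error measure defined above reads as follows (a sketch; the variable names follow the equation).

```python
import numpy as np

def point_to_trajectory_distance(P_j, Q_i, Q_ip1):
    """Distance D_j between a tracked 3D point P_j and the ground-truth line
    through segment endpoints Q_i and Q_{i+1} (all arguments are 3-vectors)."""
    P_j, Q_i, Q_ip1 = map(np.asarray, (P_j, Q_i, Q_ip1))
    d = Q_ip1 - Q_i
    return np.linalg.norm(np.cross(d, P_j - Q_i)) / np.linalg.norm(d)

# Example: a point offset 0.3 m from a segment running along the Z axis
print(point_to_trajectory_distance([0.3, 0.0, 2.0],
                                   [0.0, 0.0, 0.0],
                                   [0.0, 0.0, 5.0]))   # -> 0.3
```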
Camera Calibration and Camera Pose Estimation
For the purpose of camera calibration, a video of a moving checker-
board (7 by 9 blocks of 65 × 65 mm squares) is recorded by each
camera. A total of 26 frames are selected appropriately to have vari-
ous angles of view and are fed into Bouguets calibration toolbox
(Bouguet 2004). Once the checkerboard videos are taken and the
cameras are calibrated, all camera system settings remained the
same through the experiments. All functions that may automati-
cally cause a change in the camera intrinsic parameters, such as
autofocus and automated image stabilization, are disabled. Out
of all the video frames, a pair of corresponding frames of left
and right cameras is used to obtain a large number of point matches.
The point matches and calculated intrinsic parameters are used to
estimate camera poses. Because the positions of the cameras are
fixed in the proposed method, all these procedures are required only
once as a preprocess.
Tracking of Steel Plate
A 0.6-m by 0.3-m steel plate is chosen as the first entity to track.
The plate is carried by a worker walking along trajectories 1 and 2.
The video contains 1,430 frames in total, with 790 and 640 frames
for trajectories 1 and 2, respectively, which indicates that the results have
1,430 tracked 3D coordinates. In this experiment, right camera 1
(Fig. 3) is set to have a 3.8-m baseline. The template model for
the 2D tracker is composed of gray pixel values. The tracker ac-
curately fits the steel plate with an affine-transformed rectangle
in most frames. Therefore, it can be inferred that the errors in this
experiment mostly come from triangulation, including camera pose
estimation. Fig. 5 shows the 3D tracking results obtained using different
Fig. 3. Layout of tests from bird's eye view
Fig. 4. Entities' trajectories: (a) trajectories 1 and 2 from view of right
camera 1; (b) trajectory 3 from view of right camera 2
Fig. 5. Tracking results of steel plate
camera pose estimation methods, and Table 1 summarizes the error
results.
The SURF algorithm is tested with two threshold values of the
distance ratio (DR): 0.8 and 0.6. The distance ratio is the ratio of the
distance to the closest neighbor to that of the second closest neighbor
(Lowe 2004). Discarding feature points that have distance ratios higher
than the threshold is an effective way of reducing false-positive
matches. In the case of DR = 0.8, more point matches are obtained
than with DR = 0.6, but they contain apparent outliers (Fig. 6) that have
adverse effects on the essential matrix estimation. The effect of outliers
is reflected in the large tracking error. Even though SURF with a
DR of 0.6 generates fewer point matches than the others, it reduces
outliers significantly and performs even better than SIFT
(DR = 0.6) and MAPSAC, which provide approximately twice
as many point matches. Assuming the error follows a normal
distribution, it is concluded that the tracking error is less than
0.429 m with 95% confidence.
Table 1. Errors of Tracking Steel Plate

Method             DR    Number of        Total error (m)         Trajectory 1 error (m)  Trajectory 2 error (m)
                         point matches    Max     Mean    STD     Max     Mean    STD     Max     Mean    STD
SIFT plus MAPSAC   0.6   568              0.836   0.252   0.179   0.836   0.314   0.192   0.569   0.177   0.125
SURF plus RANSAC   0.8   423              3.965   1.220   0.911   3.965   1.537   0.983   2.532   0.828   0.620
SURF plus RANSAC   0.6   271              0.631   0.180   0.127   0.631   0.222   0.136   0.429   0.127   0.091

Note: DR = distance ratio; STD = standard deviation.
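The reported 95% bounds are consistent with a normal-distribution bound of the form mean + 1.96 × STD applied to the "Total" columns of the error tables. The exact formula used by the authors is an assumption inferred from the reported numbers; the short check below reproduces the 0.429-m figure from Table 1.

```python
# Sketch: reproducing the 0.429 m bound from the Total column of Table 1
# (SURF plus RANSAC, DR = 0.6). The formula mean + 1.96 * STD is an assumption
# inferred from the reported values, not stated explicitly in the paper.
mean_error = 0.180   # m, Table 1
std_error = 0.127    # m, Table 1
bound_95 = mean_error + 1.96 * std_error
print(f"95% error bound: {bound_95:.3f} m")   # ~0.429 m
```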
Fig. 6. Point matches obtained by SURF plus RANSAC; DR = 0.8
Table 2. Errors of Tracking Van

Method             DR    Number of        Error: Trajectory 3 (m)
                         point matches    Max     Mean    STD
SIFT plus MAPSAC   0.6   230              0.865   0.278   0.194
SURF plus RANSAC   0.8   235              1.239   0.426   0.327
SURF plus RANSAC   0.6   183              0.931   0.289   0.235

Note: STD = standard deviation.
Fig. 8. 2D tracking results in right camera view
Fig. 7. Tracking results of van
Fig. 9. Tracking results of worker with short baseline
Tracking of Van
The second experiment deals with the tracking of a van that is 2-m
wide, 1.95-m high, and 5.13-m long and moving forward and back-
ward along trajectory 3. The video contains a total of 1,034 frames.
A long baseline (8.3 m) is tested in this experiment by placing a
camera at right camera 2 in Fig. 3. Gray pixel values are used for the
templates of the 2D tracker. Fig. 7 displays the obtained trajectories
with the ground truth. Similar to the first experiment, it is observed
that outliers ultimately result in inaccurate depth estimation (SURF
plus RANSAC with DR = 0.8). There is a difference between
the results for forward and backward movement even though they
were on the same trajectory. This disparity is caused exclusively
by the 2D tracking results. Fig. 8 shows the 2D tracking results in
the right camera view, in which the slight difference between forward
and backward trajectories is observable.
The error results are presented in Table 2. The long baseline
yields a smaller number of point matches than the short baseline
because of the greater difference between the left and right
camera views. The number decreases to less than half of that in
the first experiment, the tracking of a steel plate. The algorithms
SIFT plus MAPSAC, which generated 26% more matches than the
algorithms SURF plus RANSAC, performed better in this case.
Assuming the error follows a normal distribution, it is concluded that
the tracking error is less than 0.658 m with 95% confidence.
Tracking of Worker
The third experiment is performed on a worker moving along
trajectories 1 and 2. Two baseline lengths, 3.8 and 8.3 m, are
tested. The videos with the short and the long baseline contain
1,435 and 1,368 frames, respectively. The region of a worker's upper
body, which can be well characterized by the fluorescent colors of a
hard hat and a safety vest, is tracked. Instead of gray pixel values,
saturation values are used for composing the template model.
Figs. 9 and 10 present the trajectory results, in which it is noticeable
that the longer baseline allows more stable and accurate trajectories.
The longer baseline forms a larger angle between the two projections,
P1 and P2, in Fig. 1, which results in a lower error rate. In Table 3,
the errors with the long baseline are approximately half of those with
the short baseline, and SIFT plus MAPSAC produces lower errors than
SURF plus RANSAC.
Table 3. Errors of Tracking Worker

Method             Baseline     Number of        Total error (m)         Trajectory 1 error (m)  Trajectory 2 error (m)
                   length (m)   point matches    Max     Mean    STD     Max     Mean    STD     Max     Mean    STD
SIFT plus MAPSAC   3.8          584              1.959   0.523   0.357   1.959   0.605   0.374   1.490   0.426   0.309
SIFT plus MAPSAC   8.3          215              1.053   0.258   0.193   1.053   0.317   0.211   0.555   0.187   0.140
SURF plus RANSAC   3.8          503              2.549   0.714   0.481   2.549   0.841   0.503   1.791   0.562   0.404
SURF plus RANSAC   8.3          166              1.510   0.381   0.321   1.510   0.455   0.374   0.731   0.292   0.212

Note: STD = standard deviation; DR = distance ratio = 0.6.
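The roughly halved errors with the 8.3-m baseline are consistent with the standard stereo depth-uncertainty relation, in which depth error scales inversely with the baseline for a fixed disparity error. The short check below uses the approximate experiment geometry; the focal length and disparity error are assumed values, so only the ratio between the two baselines is meaningful.

```python
# Sketch of the stereo depth-uncertainty relation dZ ~= Z^2 * dd / (f * B).
# f (focal length in pixels) and dd (disparity error) are assumed values,
# not quantities reported in the paper.
Z = 40.0        # m, approximate range to the tracked entities
f = 2000.0      # px, assumed focal length of the HD camcorder
dd = 1.0        # px, assumed disparity (2D tracking) error

for B in (3.8, 8.3):   # short and long baselines used in the experiments
    dZ = Z ** 2 * dd / (f * B)
    print(f"baseline {B} m -> depth uncertainty ~{dZ:.2f} m")

# The 8.3 m baseline gives about 8.3 / 3.8 ~= 2.2 times smaller depth
# uncertainty, consistent with the roughly halved errors in Table 3.
```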
Fig. 12. 2D tracking results of 693rd frame: (a) left camera; (b) right
camera
Fig. 11. Appearance variations: (a) steel plate; (b) worker
Fig. 10. Tracking results of worker with long baseline
Whenever a worker changes direction, the 2D tracker suffers from
severe variations in the worker's appearance. When compared with
Fig. 11(a), Fig. 11(b) shows more substantial changes in the
distribution of pixel values inside a rectangle. This is why the errors
with a short baseline are higher than the errors of tracking a steel
plate. The error caused by the 2D tracker can be divided into two
cases. The first case is when the determined centroid in each view
does not exactly match the real centroid, i.e., the total station target
point. The second case is when the two centroids from the left and
right cameras do not correspond to one another (Fig. 12). These kinds
of errors are partly compensated by a decrease in triangulation error,
which is achieved by using a long baseline. Assuming the
error follows a normal distribution, it is concluded that the tracking
error is less than 0.636 m with 95% confidence.
Conclusion
In this paper, details of correlating multiple 2D tracking results
were presented. Under this method, camera calibration revealed
intrinsic parameters of cameras by processing video frames of a
checkerboard. The extrinsic parameters of two cameras were esti-
mated using point matches between two corresponding views. A
2D tracker provided 2D pixel coordinates of an entity's centroid
in each calibrated camera view. Epipolar geometry constructed with
the intrinsic and extrinsic parameters triangulated the centroids
from multiple views and retrieved 3D location information.
The proposed method was tested on the videos recorded on a
real construction site. The tests involved three types of entities:
a steel plate, a worker, and a van. A kernel-based 2D tracker
was employed, and different methods of point match extraction
were tested to reveal the effect of errors caused by correlating
multiple views. The algorithms SIFT plus MAPSAC provided a
larger number of point matches, which generally resulted in a good
estimation of extrinsic parameters, especially for long baselines.
For tracking of a steel plate and a van, the maximum errors determined
with 95% confidence were smaller than the entity's width.
The various appearances of a worker from the front, side, and rear
views brought about larger 2D tracking errors than the tracking
of a steel plate. However, the error remains at most 0.658 m with
95% confidence using a long baseline. The results validated that the
vision-based 3D tracking approach can effectively provide accurate
localization of construction site entities at distances of
approximately 40-50 m.
The sole objective of this research is to achieve a competitive
accuracy in 3D positioning, whereas real-time processing is not an
immediate target. At the prototype level, working with high-definition
videos is not a real-time process, which is expected in the current
research work. Incidentally, there are several types of applications
that do not require real-time processing and can be postprocessed,
e.g., productivity measurement, progress monitoring, and activity
sequence analysis. Also, it is expected that real-time commercial
development is attainable through code optimization and parallel
computing. For example, the access to pixel data of a high defini-
tion image, which takes a significant amount of processing time,
can be reduced by discarding static pixel areas. The next step as
a future work is to investigate how visual pattern recognition methods
can be used to automatically recognize and match entities,
which would remove the need for manual entity selection and help
to recover from tracking failures. Furthermore, it is worth researching
a camera network composed of multiple stereo camera
systems. Various viewing angles and the network among them can
reduce failures caused by occlusions.
Acknowledgments
This material is based on work supported by the National
Science Foundation under Grants No. 0933931 and 0904109.
Any opinions, findings, conclusions, or recommendations ex-
pressed in this material are those of the authors and do not neces-
sarily reflect the views of the National Science Foundation. The
authors would also like to thank Keitaro Kamiya, Masoud Gheisari,
and the Barton Malow Company for their help in collecting data for
the experiments.
References
Bauer, J., Sünderhauf, N., and Protzel, P. (2007). "Comparing several implementations of two recently published feature detectors." Proc., Int. Conf. on Intelligent and Autonomous Systems, Institute of Electrical and Electronics Engineers (IEEE), New York.
Bay, H., Tuytelaars, T., and Gool, L. V. (2008). "SURF: Speeded up robust features." Comput. Vis. Image Understanding, 110(3), 346–359.
Bouguet, J. Y. (2004). "Camera calibration toolbox for Matlab." <http://www.vision.caltech.edu/bouguetj/calib_doc> (Apr. 18, 2011).
Brilakis, I., Park, M.-W., and Jog, G. (2011). "Automated vision tracking of project related entities." Adv. Eng. Inf., 25(4), 713–724.
Brückner, M., Bajramovic, F., and Denzler, J. (2008). "Experimental evaluation of relative pose estimation algorithms." Proc., 3rd Int. Conf. on Computer Vision Theory and Applications, Vol. 2, Institute for Systems and Technologies of Information, Control and Communication (INSTICC), Setubal, Portugal, 431–438.
Caldas, C. H., Torrent, D. G., and Haas, C. T. (2004). "Integration of automated data collection technologies for real-time field materials management." Proc., 21st Int. Symp. on Automation and Robotics in Construction, International Association for Automation and Robotics in Construction.
Chae, S., and Kano, N. (2007). "Application of location information by stereo camera images to project progress monitoring." Proc., 24th Int. Symp. on Automation and Robotics in Construction, International Association for Automation and Robotics in Construction, Eindhoven, Netherlands, 89–92.
Ergen, E., Akinci, B., and Sacks, R. (2007). "Tracking and locating components in a precast storage yard utilizing radio frequency identification technology and GPS." Autom. Constr., 16(3), 354–367.
Fathi, H., and Brilakis, I. (2011). "Automated sparse 3D point cloud generation of infrastructure using its distinctive visual features." Adv. Eng. Inf., 25(4), 760–770.
Fontana, R. J. (2004). "Recent system applications of short-pulse ultra-wideband (UWB) technology." IEEE Trans. Microwave Theory Tech., 52(9), 2087–2104.
Fontana, R. J., Richley, E., and Barney, J. (2003). "Commercialization of an ultra wideband precision asset location system." Proc., IEEE Conf. on Ultra Wideband Systems and Technologies, Institute of Electrical and Electronics Engineers (IEEE), New York, 369–373.
Fuchs, S. (2010). "Multipath interference compensation in time-of-flight camera images." Proc., 20th Int. Conf. on Pattern Recognition, IEEE Computer Society, Washington, DC, 3583–3586.
Gächter, S., Nguyen, V., and Siegwart, R. (2006). "Results on range image segmentation for service robots." Proc., IEEE Int. Conf. on Computer Vision Systems, Institute of Electrical and Electronics Engineers (IEEE), New York.
Golparvar-Fard, M., Peña-Mora, F., and Savarese, S. (2010). "Application of D4AR - a 4-dimensional augmented reality model - for automating construction progress monitoring data collection, processing and communication." J. Inf. Technol. Constr., 14, 129–153.
Gong, J., and Caldas, C. H. (2008). "Data processing for real-time construction site spatial modeling." Autom. Constr., 17(5), 526–535.
Gong, J., and Caldas, C. H. (2010). "Computer vision-based video interpretation model for automated productivity analysis of construction operations." J. Comput. Civ. Eng., 24(3), 252–263.
Gruen, A. (1997). "Fundamentals of videogrammetry - a review." Hum. Movement Sci. J., 16(2–3), 155–187.
Hartley, R. (1997). "In defense of the eight-point algorithm." IEEE Trans. Pattern Anal. Mach. Intell., 19(6), 580–593.
Hartley, R., and Sturm, P. (1997). "Triangulation." Comput. Vis. Image Understanding, 68(2), 146–157.
Hartley, R., and Zisserman, A. (2004). Multiple view geometry in computer vision, Cambridge University Press, Cambridge, UK.
Heikkilä, J., and Silvén, O. (1997). "A four-step camera calibration procedure with implicit image correction." Proc., IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, Institute of Electrical and Electronics Engineers (IEEE), New York, 1106–1112.
Kanatani, K., Sugaya, Y., and Niitsuma, H. (2008). "Triangulation from two views revisited: Hartley-Sturm vs. optimal correction." Proc., 19th British Machine Vision Conf., British Machine Vision Association and Society for Pattern Recognition, Malvern, UK, 173–182.
Li, H., and Hartley, R. (2006). "Five-point motion estimation made easy." 18th Int. Conf. on Pattern Recognition (ICPR 2006), Institute of Electrical and Electronics Engineers (IEEE), New York, 630–633.
Lowe, D. G. (2004). "Distinctive image features from scale-invariant keypoints." Int. J. Comput. Vis., 60(2), 91–110.
Nistér, D. (2004). "An efficient solution to the five-point relative pose problem." IEEE Trans. Pattern Anal. Mach. Intell., 26(6), 756–770.
Park, M.-W., Makhmalbaf, A., and Brilakis, I. (2011). "Comparative study of vision tracking methods for tracking of construction site resources." Autom. Constr., 20(7), 905–915.
Pizarro, O., Eustice, R., and Singh, H. (2003). "Relative pose estimation for instrumented, calibrated platforms." Digital image computing: Techniques and applications, Proc., 7th Biennial Australian Pattern Recognition Society Conf., DICTA 2003, C. Sun, H. Talbot, S. Ourselin, and T. Adriaansen, eds., CSIRO, Collingwood, Australia, 601–612.
Point Grey. (2011). Stereo vision camera catalog, Point Grey Research, Richmond, BC, Canada.
Rashidi, A., Dai, F., Brilakis, I., and Vela, P. (2011). "Comparison of camera motion estimation methods for 3D reconstruction of infrastructure." ASCE Int. Workshop on Computing in Civil Engineering, ASCE, Reston, VA.
Ross, D., Lim, J., Lin, R.-S., and Yang, M.-H. (2008). "Incremental learning for robust visual tracking." Int. J. Comput. Vis., 77(1), 125–141.
Son, H., and Kim, C. (2010). "3D structural component recognition and modeling method using color and 3D data for construction progress monitoring." Autom. Constr., 19(7), 844–854.
Song, J., Caldas, C. H., Ergen, E., Haas, C., and Akinci, B. (2004). "Field trials of RFID technology for tracking pre-fabricated pipe spools." Proc., 21st Int. Symp. on Automation and Robotics in Construction, International Association for Automation and Robotics in Construction, Eindhoven, Netherlands.
Song, J., Haas, C., Caldas, C., Ergen, E., and Akinci, B. (2006). "Automating pipe spool tracking in the supply chain." Autom. Constr., 15(2), 166–177.
Teizer, J., Caldas, C. H., and Haas, C. T. (2007a). "Real-time three-dimensional occupancy grid modeling for the detection and tracking of construction resources." J. Constr. Eng. Manage., 133(11), 880–888.
Teizer, J., Lao, D., and Sofer, M. (2007b). "Rapid automated monitoring of construction site activities using ultra-wideband." Proc., 24th Int. Symp. on Automation and Robotics in Construction, International Association for Automation and Robotics in Construction, Eindhoven, Netherlands, 23–28.
Torr, P. H. S. (2002). "Bayesian model estimation and selection for epipolar geometry and generic manifold fitting." Int. J. Comput. Vis., 50(1), 35–61.
Yang, J., Arif, O., Vela, P. A., Teizer, J., and Shi, Z. (2010). "Tracking multiple workers on construction sites using video cameras." Adv. Eng. Inform., 24(4), 428–434.
Zhang, Z. (1999). "Flexible camera calibration by viewing a plane from unknown orientations." Proc., 7th IEEE Int. Conf. on Computer Vision, Vol. 1, Institute of Electrical and Electronics Engineers (IEEE), New York, 666–673.
... Computer vision has received widespread attention in such applications as automatically trackingthe presence of workers, equipment, materials, and plants based on the images or videos taken on a construction site so as to improve the level of construction management such as progress monitoring, productivity control, and safety management [11]- [13], [23], [30]- [32], [34]. For example, tracking construction resources has been proven to be a practical way to automate construction monitoring [30], [32], [41]. ...
... Computer vision has received widespread attention in such applications as automatically trackingthe presence of workers, equipment, materials, and plants based on the images or videos taken on a construction site so as to improve the level of construction management such as progress monitoring, productivity control, and safety management [11]- [13], [23], [30]- [32], [34]. For example, tracking construction resources has been proven to be a practical way to automate construction monitoring [30], [32], [41]. In addition, the trajectories of workers and equipment can be used to assess their productivity in order to avoid time-cost overruns [7], [15], [43], and they are also the pre-requisite for automatic analysis of operations and proactive improvement of construction safety [16], [17], [20], [27], [37], [41]. ...
... Tracking approaches based on 2D images. For example, Park and Brilakis (2012) developed a hybrid approach by combining two methods for locating the wheel loaders in a way that can proactively manage occlusion cases [32]. Similarly, Kim et al. (2016) [20] relied on the Kalman filtering to track the workers and heavy equipment so as to monitor the potential risk of stuck-by. ...
Article
Automatic continuous tracking of objects involved in a construction project is required for such tasks as productivity assessment, unsafe behavior recognition, and progress monitoring. Many computer-vision-based tracking approaches have been investigated and successfully tested on construction sites; however, their practical applications are hindered by the tracking accuracy limited by the dynamic, complex nature of construction sites (i.e. clutter with background, occlusion, varying scale and pose). To achieve better tracking performance, a novel deep-learning-based tracking approach called the Multi-Domain Convolutional Neural Networks (MD-CNN) is proposed and investigated. The proposed approach consists of two key stages: 1) multi-domain representation of learning; and 2) online visual tracking. To evaluate the effectiveness and feasibility of this approach, it is applied to a metro project in Wuhan China, and the results demonstrate good tracking performance in construction scenarios with complex background. The average distance error and F-measure for the MDNet are 7.64 pixels and 67, respectively. The results demonstrate that the proposed approach can be used by site managers to monitor and track workers for hazard prevention in construction sites.
... In recent years, there has been an increase in research works using computer vision in construction environments due to their abundance of visual data and the recent advancements in this area [32]. In the works of [33][34][35], equipment and workers were successfully tracked using computer vision. However, all these works had a common problem to deal with: occlusions. ...
... When an entity is tracked with cameras, it may temporarily hide behind an object and be lost, which leads to failure. In order to mitigate this drawback, the research works of [33][34][35] suggest carefully choosing the positions of cameras, so that their field of view covers as much as possible of the working place. Nevertheless, selecting adequate positions for the cameras demands some planning time and extra infrastructure to place them where desired. ...
Article
Full-text available
In order to reduce the accident risk in road construction and maintenance, this paper proposes a novel solution for road-worker safety based on an untethered real-time locating system (RTLS). This system tracks the location of workers in real time using ultra-wideband (UWB) technology and indicates if they are in a predefined danger zone or not, where the predefined safe zone is delimited by safety cones. Unlike previous works that focus on road-worker safety by detecting vehicles that enter into the working zone, our proposal solves the problem of distracted workers leaving the safe zone. This paper presents a simple-to-deploy safety system. Our UWB anchors do not need any cables for powering, synchronisation, or data transfer. The anchors are placed inside safety cones, which are already available in construction sites. Finally, there is no need to manually measure the positions of anchors and introduce them to the system thanks to a novel self-positioning approach. Our proposal, apart from automatically estimating the anchors’ positions, also defines the limits of safe and danger zones. These features notably reduce the deployment time of the proposed safety system. Moreover, measurements show that all the proposed simplifications are obtained with an accuracy of 97%.
... Related work in this area already uses stereo vision cameras to determine three-dimensional coordinates. For example, they are used to track construction resources such as vehicles, workers, or materials at distances of up to 50 m from the camera with a maximum error of 0.658 m and a reliability of 95% [24]. The method has also been extended over time to generate corresponding 3D trajectories for multiple, different types of construction resources [18,19]. ...
... Several camera components cover each work area with partially overlapping capture areas, enabling reliable coverage even if parts of the work are obscured, e.g., by building parts, machines, or piles of construction material. Stereo vision cameras capture the position of machines and equipment in three-dimensional space [19,24] using two horizontally offset cameras that record two different views. By comparing the images, the depth information can be obtained as a disparity map [16], which encodes the difference in the horizontal coordinates of corresponding pixels. ...
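For a rectified stereo pair, the disparity-to-depth relation mentioned above reduces to Z = f·B/d (focal length in pixels, baseline in metres, disparity in pixels). The sketch below applies it; the numerical values are placeholders, not parameters from the cited systems.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth (metres) of a point from its disparity in a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a valid depth")
    return focal_length_px * baseline_m / disparity_px

# Illustrative values only: 1200 px focal length, 1.0 m baseline, 24 px disparity.
print(depth_from_disparity(24.0, 1200.0, 1.0))   # -> 50.0 m
```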
Conference Paper
Full-text available
Construction sites are dynamic and complex systems with significant potential for time and cost efficiency improvement through digitization and interconnection. Construction 4.0 is the use of modern information and communication technologies known from Industry 4.0 (I4.0) to interconnect construction sites with cyber-physical systems (CPS). In these decentralized systems, construction workers, machines, and processes become smart I4.0 components that can exchange data and information with each other in a decentralized and self-controlled manner. The basis for informed decision-making to control and optimize relevant on-site processes is real-time detection of the current construction progress and the machines used on the construction site. Computer vision (CV)-based tracking systems offer a technical solution that can reliably detect construction workers and machines and track construction progress and processes. These tracking systems generate large amounts of data that must be processed and analysed automatically. The goal is to integrate the tracking system into the CPS as an I4.0 component. Essential for this is the digital twin as a virtual representation of an I4.0 component that centrally collects, processes, and provides data for the respective component. This paper presents an I4.0-based digital twin approach for digitizing and interconnecting the construction site into a CPS. The approach integrates a CV-based tracking system as an I4.0 component to locate construction equipment on the construction site. The tracking system is a multi-camera, multi-object tracking system that uses stereo vision cameras and a real-time capable detector. The asset administration shell (AAS) is used as the platform for the digital twin.
... This integrated BIM 3D model not only supports the entire life cycle of a building (from design to construction to maintenance management) but also makes it possible to streamline the workflow. Generally, BIM integration within outdoor environments needs to be seamless and accurate; however, the currently available IPS solutions used in tandem with BIM have restrictions when it comes to universal traceability of key aspects (such as labor in tunnels and buildings, and the location of construction resources such as vehicles and materials [10][11][12][13]), which hinder the implementation of comprehensive and universal management of buildings and underground spaces in urban regions. ...
Preprint
Full-text available
Indoor positioning system (IPS) technologies have a wide range of applications; however, major limitations associated with currently used IPS technologies are: (1) weak penetration strength of signals to penetrate building materials, inhibiting seamless connection of outdoor coordinates to indoor coordinates, so that these technologies rely on local coordinates and are incompatible with the world geodetic system (WGS84) and universal traceability; and (2) active source signals that require beacons to transmit navigation signals. In contrast, the muometric positioning system utilizes naturally abundant cosmic-ray muon signals to compensate for some of these setbacks. However, its main practical challenges are: (1) the low signal rate (~1 per 10 days for laptop-sized receivers horizontally located 50 m apart from each other) and (2) the requirement for large reference detectors (>4 m²) above the receiver to track cosmic ray precipitation. In this work, an alternative concept called CAT navigation, which relies on the extended air shower time structure for higher-rate positioning (without requiring reference detectors), is first proposed and demonstrated; it located receivers placed on the ground floors of multiple buildings (within WGS84) in conditions where other IPS methods are difficult to apply. The resultant positioning accuracy was 3-4 m (at 50 m apart), which is reasonably accurate for GPS-IPS seamless bridging, and with a laptop-sized receiver the averaged positioning signal update rate was (683 s)⁻¹, which can be improved to (170 s)⁻¹ with a future upgrade of the data-gathering electronics. By integrating CAT receivers into GPS-equipped smartphones, it is anticipated that this GPS-CAT hybrid method will seamlessly connect multiple users' coordinates from outdoor to indoor environments.
... Computer vision and artificial intelligence are frequently used to locate and identify objects and humans. Research has been conducted on detecting traffic cones [12], monitoring construction workers [13][14], and managing resources on construction sites in general [15]. One significant drawback of this approach is the need for precise calibration of the cameras. ...
Article
Full-text available
Despite all efforts to enhance safety, construction sites remain a major location for traffic accidents. Short-term construction sites, in particular, face limitations in implementing extensive safety measures due to their condensed timelines. This paper seeks to enhance safety at short-term construction sites by alerting maintenance personnel and approaching vehicles to potentially dangerous scenarios. Focusing on defining the exact dimensions of static construction sites, this method employs high-precision Real-Time Kinematic GNSS for localizing traffic cones and derives the construction site geometry through respective algorithms. By analyzing the geometry, we can identify situations where maintenance personnel are in close proximity to the active lane or when vehicles enter the construction site. To increase awareness of hazardous situations, we present methods for distributing information to maintenance personnel and vehicles, along with technical solutions for warning those involved. Additionally, we discuss the distribution of the construction site’s geometry among approaching vehicles, which can provide future automated vehicles with crucial information on the site’s exact start and end points.
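One plausible way to derive a site polygon from localized cone positions, as described in the abstract above, is a convex hull of the cone coordinates. The monotone-chain sketch below is an illustrative assumption, not the paper's algorithm.

```python
def convex_hull(points):
    """Andrew's monotone chain: returns hull vertices in counter-clockwise order.
    points: list of (x, y) cone coordinates, e.g. from RTK-GNSS, in a local metric frame."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of the cross product (o->a) x (o->b)
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

# Illustrative use: hull of four cones defining a work-zone corner.
print(convex_hull([(0, 0), (30, 1), (28, 6), (2, 5), (15, 3)]))
```

The resulting polygon could then feed the same kind of inside/outside or proximity checks sketched earlier for the UWB-based system.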
... Based on the one hundred most frequently cited articles found in the Web of Science international database, the most intensively researched areas are the following: collaboration between human and robotic workforces (cobots) [16][17]; construction robots working in teams (swarm robotics) [18][19][20][21][22][23]; automated technologies (e.g., 3D concrete printing); the application of artificial intelligence to construction tasks [46][47][48][49][50][51]; the relationship between BIM and automated construction [52][53][54][55][56][57][58][59][60][61][62][63][64][65][66][67][68]; automated construction machines/robots (STCR) [69][70][71][72][73][74]; automated surveying, building/structure monitoring, and IoT [75][76][77][78][79][80][81][82][83][84][85][86][87][88][89][90][91][92]; comprehensive theoretical analyses of construction automation; sustainable development in automated construction [114][115][116][117]. ...
Article
Full-text available
Due to labour shortages and ever-increasing quality expectations, the construction industry is gradually turning away from traditional technologies, which typically rely on human labour or are directly operated by people, towards automated technologies. The accompanying change can only be smooth if the participants of the construction industry are active parties to the change process. By analysing research related to the field and the technological solutions already in use, this article aims to present the development directions of the construction industry, together with its problems and opportunities, in order to map and illuminate possible changes, further opportunities, and problems in the near future.
... Chi and Caldas [8] proposed a method of using a video camera to automatically identify both on-site personnel and equipment. Park and Brilakis [28] presented a method for detecting construction workers in video frames. Son et al. [29] introduced a vision-based collision warning system based on the automated 3D position estimation of each worker to protect workers from potentially dangerous situations. ...
Article
Full-text available
Image-based techniques have become integral to the construction sector, aiding in project planning, progress monitoring, quality control, and documentation. In this paper, we address two key challenges that limit our ability to fully exploit the potential of images. The first is the “semantic gap” between low-level image features and high-level semantic descriptions. The second is the lack of principled integration between images and other digital systems used in construction, such as construction schedules and building information modeling (BIM). These challenges make it difficult to effectively incorporate images into digital twins of construction (DTC), a critical concept that addresses the construction industry’s need for more efficient project management and decision-making. To address these challenges, we first propose an ontology-based construction image interpretation (CII) framework to formalize the interpretation and integration workflow. Then, the DiCon-SII ontology is developed to provide a formalized vocabulary for visual construction contents and features. DiCon-SII also acts as a bridge between images and other digital systems to help construct an image-involved DTC. To evaluate the practical application of DiCon-SII and CII in supporting construction management tasks and as a precursor to DTC, we conducted a case study involving drywall installation. Via this case study, we demonstrate how the proposed methods can be used to infer the operational stage of a construction process, estimate labor productivity, and retrieve specific images based on user queries.
... [16][17][18][19] Surveillance systems: generally mounted at elevated and distant locations to obtain a wide field of view. Well suited for visualizing the progress of work and for monitoring the quality of work and safety and security issues. ...
Article
Full-text available
The concept of computer vision is as old as six decades. A remarkable surge in real-world applications of computer vision techniques in the realm of civil engineering has emerged over just the past decade and a half. An extensive literature survey by the authors has revealed that the applications are predominantly seen in allied domains of civil engineering such as structural health monitoring, construction safety monitoring, infrastructure inspection, surveillance, data collection, and object detection. Being interdisciplinary, emerging technologies from other engineering fields are being integrated and making inroads into allied civil engineering projects in general, and construction industry related projects in particular. As the existing review publications provide focused or context-specific applications of computer vision in civil engineering, a deep review of the literature will certainly provide a systematic and lucid approach to gain an in-depth understanding. In this context, this paper presents the reported and documented applications of computer vision in structural damage detection, health monitoring, vibration assessment, data anomaly detection, video surveillance applications, and investigation of serviceability conditions. It also provides a deeper insight into current and foreseeable future trends. The intent of this review is threefold. Firstly, to garner a deep understanding of possible research areas and open problems for exploration. Secondly, to assess the role of computer vision as an AI-based technique for aiding smart construction and for increased quality in construction. Finally, to bring awareness and to provide futuristic ideas to prospective research scholars, project students, teachers, and professionals. To an extent, this review will also guide practitioners to arrive at informed decisions.
Article
Full-text available
Vision-based measurement for robots with 3 or more degrees of freedom (DOFs) currently requires on-site calibrations and auxiliary sensors to establish an object coordinate system. This is time consuming and requires extensive computing resources, which hinders the application of vision-based measurement for robots in many fields. To solve these problems, a measurement coordinate system on the robot was established according to the posture of the robotic parts in a preset position, the structure of the robot, and the vision system. Two categories of structural parameters were proposed; posture transformation matrices of robotic parts from the preset posture to an actual posture were built; and the vision-related spatial vectors were expressed by the two categories of structural parameters, optical parameters, and posture matrices in the measurement coordinate system. On this basis, a new forward intersection strategy is proposed for the robotic vision system. The new structural parameters for the vision system on the robot were proven to be reliable. The expressions of the vision-related vectors for the robot were found to be correct and effective. Experimental results demonstrated that the new forward intersection strategy was accurate and efficient, and this strategy can be applied to robotic vision platforms to facilitate control of robots. Index terms: intersection; principal optic axis; robotic part; co-spherical resection; structural parameter; measurement coordinate system. I. INTRODUCTION: The requirements of robotic vision and artificial intelligence technologies push close-range photogrammetry towards high precision, efficiency, and convenience in agriculture [1] and other industrial application fields. Since robots have 3 or more degrees of freedom (DOFs), the forward intersection strategy should change to adapt to the platform. After the structural parameters for a vision system on a 2-DOF platform were defined and calibrated by co-spherical resection [2], forward intersection measurement for a carrier with 3 or more DOFs could proceed. In traditional close-range photogrammetry, the collinearity equations of object points, image points, and perspective centers are used to achieve forward intersection measurement [3,4]. This parameter vector model is expressed by the corresponding optical parameters of the camera, the exterior orientation elements, and the spatial translation vectors between images in a coordinate system established for the object [5]. After a surveying adjustment [6] for parameter calibration and aberration correction of the image points [7], the spatial coordinates of the point to be measured can be obtained. On many occasions, on-site calibrations [8, 9, 10, 11] or other distance-detecting vision sensors [12, 13] were aids ...
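At its core, the forward intersection discussed above amounts to intersecting two viewing rays in 3D. A generic least-squares "midpoint" sketch is given below as an assumption-laden illustration; it is not the paper's parameterization, which expresses the rays through the robot's structural and posture parameters.

```python
import numpy as np

def intersect_rays(c1, d1, c2, d2):
    """Closest-point 'intersection' of two 3D rays x = c + t*d.
    Returns the midpoint of the shortest segment between the rays.
    Note: the system is singular if the rays are parallel."""
    c1, d1 = np.asarray(c1, float), np.asarray(d1, float)
    c2, d2 = np.asarray(c2, float), np.asarray(d2, float)
    # Solve for t1, t2 minimising |(c1 + t1*d1) - (c2 + t2*d2)|^2.
    A = np.array([[d1 @ d1, -(d1 @ d2)],
                  [d1 @ d2, -(d2 @ d2)]])
    b = np.array([(c2 - c1) @ d1, (c2 - c1) @ d2])
    t1, t2 = np.linalg.solve(A, b)
    p1 = c1 + t1 * d1
    p2 = c2 + t2 * d2
    return (p1 + p2) / 2.0

# Illustrative use: two camera centres and viewing directions towards roughly the same point.
print(intersect_rays([0, 0, 0], [0.1, 0.0, 1.0], [1, 0, 0], [-0.1, 0.0, 1.0]))
```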
Conference Paper
Full-text available
Camera motion estimation is one of the most significant steps for structure-from-motion (SFM) with a monocular camera. The normalized 8-point, the 7-point, and the 5-point algorithms are normally adopted to perform the estimation, each of which has distinct performance characteristics. Given the unique needs and challenges associated with civil infrastructure SFM scenarios, selection of the proper algorithm directly impacts the structure reconstruction results. In this paper, a comparison study of the aforementioned algorithms is conducted to identify the most suitable algorithm, in terms of accuracy and reliability, for reconstructing civil infrastructure. The free variables tested are baseline, depth, and motion. A concrete girder bridge was selected as the "test-bed" to reconstruct using an off-the-shelf camera capturing imagery from all possible positions that maximally capture the bridge's features and geometry. The feature points in the images were extracted and matched via the SURF descriptor. Finally, camera motions are estimated based on the corresponding image points by applying the aforementioned algorithms, and the results are evaluated.
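A rough OpenCV sketch of the pipeline described above (feature matching followed by relative pose estimation between two calibrated views) is shown below. ORB is used here in place of SURF, which requires the opencv-contrib build; the intrinsic matrix K and the function name are assumptions for illustration, not the paper's setup.

```python
import cv2
import numpy as np

def relative_pose(img1, img2, K):
    """Estimate relative camera motion (R, t up to scale) between two views of a static scene.
    K is the 3x3 intrinsic matrix of a calibrated camera."""
    orb = cv2.ORB_create(4000)                       # stand-in for the SURF detector/descriptor
    k1, d1 = orb.detectAndCompute(img1, None)
    k2, d2 = orb.detectAndCompute(img2, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)

    pts1 = np.float32([k1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([k2[m.trainIdx].pt for m in matches])

    # Five-point algorithm inside a RANSAC loop (hypothesise-and-test).
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t
```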
Article
An efficient algorithmic solution to the classical five-point relative pose problem is presented. The problem is to find the possible solutions for relative camera motion between two calibrated views given five corresponding points. The algorithm consists of computing the coefficients of a tenth-degree polynomial and subsequently finding its roots. It is the first algorithm well suited for numerical implementation that also corresponds to the inherent complexity of the problem. The algorithm is used in a robust hypothesise-and-test framework to estimate structure and motion in real time.
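The hypothesise-and-test framework referred to above follows the generic RANSAC pattern sketched here. The sample size of 5 matches the five-point minimal solver; `solve_minimal` and `residual` are hypothetical callbacks standing in for the minimal solver and the reprojection/epipolar error, not functions from the paper.

```python
import random

def ransac(correspondences, solve_minimal, residual, threshold, iterations=500):
    """Generic hypothesise-and-test loop.
    solve_minimal(sample) -> iterable of candidate models (hypothetical callback).
    residual(model, correspondence) -> per-point error (hypothetical callback)."""
    best_model, best_inliers = None, []
    for _ in range(iterations):
        sample = random.sample(correspondences, 5)        # minimal sample for the 5-point solver
        for model in solve_minimal(sample):               # up to 10 real roots -> candidate models
            inliers = [c for c in correspondences if residual(model, c) < threshold]
            if len(inliers) > len(best_inliers):
                best_model, best_inliers = model, inliers
    return best_model, best_inliers
```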
Article
Early detection of actual or potential schedule delay in field construction activities is vital to project management. This requires project managers to design, implement, and maintain a systematic approach for construction progress monitoring to promptly identify, process, and communicate discrepancies between actual and as-planned performance. To achieve this goal, this research focuses on exploring the application of unsorted daily progress photograph logs, available on any construction site, as a data collection technique. Our approach is based on computing, from the images themselves, the photographers' locations and orientations, along with a sparse 3D geometric representation of the as-built site using daily progress photographs and superimposition of the reconstructed scene over as-planned 4D models. Within such an environment, progress photographs are registered in the virtual as-planned environment, which allows a large unstructured collection of daily construction images to be sorted, interactively browsed, and explored. In addition, sparse reconstructed scenes superimposed over 4D models allow site images to be geo-registered with the as-planned components and, consequently, location-based image processing techniques to be implemented and progress data to be extracted automatically. The results of progress comparison between as-planned and as-built performance are visualized in the D4AR (4D Augmented Reality) environment using a traffic light metaphor. We present our preliminary results on three ongoing construction projects and discuss implementation, perceived benefits, and future potential enhancement of this new technology in construction, on all fronts of automatic data collection, processing, and communication.
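A minimal sketch of the traffic-light progress metaphor mentioned above, under the assumption that each as-planned element carries a detection status (from registering site photos against the 4D model) and a scheduled finish date; the colour mapping, names, and thresholds are illustrative guesses, not the D4AR implementation.

```python
from datetime import date

def progress_colour(detected_as_built, planned_finish, today=None):
    """Illustrative traffic-light mapping: green = built, red = overdue, grey = not yet due."""
    today = today or date.today()
    if detected_as_built:
        return "green"
    return "red" if today > planned_finish else "grey"

# Illustrative use: an element due on 2012-05-01, not yet detected as built on 2012-06-01.
print(progress_colour(False, date(2012, 5, 1), today=date(2012, 6, 1)))  # -> "red"
```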
Article
This article presents a novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features). SURF approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (specifically, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper encompasses a detailed description of the detector and descriptor and then explores the effects of the most important parameters. We conclude the article with SURF's application to two challenging, yet converse goals: camera calibration as a special case of image registration, and object recognition. Our experiments underline SURF's usefulness in a broad range of topics in computer vision.
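The integral-image trick underlying SURF's box filters, mentioned in the abstract above, can be illustrated as follows; this is a generic summed-area-table sketch, not the SURF implementation itself.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a leading zero row/column: ii[y, x] = sum of img[0:y, 0:x]."""
    ii = np.cumsum(np.cumsum(np.asarray(img, dtype=np.float64), axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def box_sum(ii, top, left, height, width):
    """Sum over img[top:top+height, left:left+width] using only four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

# Illustrative use: any box filter response becomes constant-time regardless of its size.
ii = integral_image(np.ones((9, 9)))
print(box_sum(ii, 2, 2, 3, 3))   # -> 9.0
```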