Three-Dimensional Tracking of Construction Resources
Using an On-Site Camera System
Man-Woo Park, A.M.ASCE1; Christian Koch2; and Ioannis Brilakis, M.ASCE3
Abstract: Vision trackers have been proposed as a promising alternative for tracking at large-scale, congested construction sites. They
provide the location of a large number of entities in a camera view across frames. However, vision trackers provide only two-dimensional
(2D) pixel coordinates, which are not adequate for construction applications. This paper proposes and validates a method that overcomes this
limitation by employing stereo cameras and converting 2D pixel coordinates to three-dimensional (3D) metric coordinates. The proposed
method consists of four steps: camera calibration, camera pose estimation, 2D tracking, and triangulation. Given that the method employs
fixed, calibrated stereo cameras with a long baseline, appropriate algorithms are selected for each step. Once the first two steps reveal camera
system parameters, the third step determines 2D pixel coordinates of entities in subsequent frames. The 2D coordinates are triangulated on
the basis of the camera system parameters to obtain 3D coordinates. The methodology presented in this paper has been implemented
and tested with data collected from a construction site. The results demonstrate the suitability of this method for on-site tracking purposes.
DOI: 10.1061/(ASCE)CP.1943-5487.0000168. © 2012 American Society of Civil Engineers.
CE Database subject headings: Automation; Imaging techniques; Models; Information technology (IT); Cameras; Remote sensing;
Construction management.
Author keywords: Automation; Imaging techniques; Computer-aided vision system; Models; Information technology; Remote sensing.
Introduction
Three-dimensional (3D) object tracking on construction sites has a
wide variety of applications. It allows identification and tracking of
personnel, equipment, and materials to support effective progress
monitoring, activity sequence analysis, productivity measurements,
and asset management and to enhance site safety. In addition,
tracking instantly enables the identification of critical activities
and problems, which allows for on-site project control and
decision-making capabilities. Available tracking solutions are primarily based on radio frequency technologies, including the global positioning system (GPS), radio frequency identification (RFID), and ultra-wideband (UWB) technologies. They all work
under the same principle of having a sensor attached to each entity to be tracked. These technologies have been applied and proven to work excellently for most scenarios in construction management, such as proactive work zone safety and material registration and installation (Teizer et al. 2007b; Ergen et al. 2007; Song et al. 2006). However, when it comes to large-scale and congested construction sites, the installation of the sensor system can be costly and time-consuming because of the large number of items involved. Also, privacy issues can arise from tagging workers. For these specific scenarios, vision-based tracking may have the potential to serve as an efficient alternative.
Vision-based methods have been introduced for tracking entities
on construction sites. Vision-based tracking works by receiving
video streams and estimating an entity’s motion in subsequent
video frames on the basis of the history of their appearance and
location. Its capability of tracking multiple entities without installing sensors on them has great potential in construction applications. It provides two-dimensional (2D) pixel coordinates, x and y, of the entities across time. The 2D results may be useful when predefined measurements of an entity's trajectory are available, as in the research of Gong and Caldas (2010). However, 2D results are generally not enough to extract substantial information for most construction management tasks because it is unknown how far entities are located from the camera. Because of the lack of depth information (z), even approximate distance measurements between two entities, e.g., a worker and mobile equipment, are unreliable, although they are necessary for safety management. Also, any movement along the z axis is not measurable. Brilakis
et al. (2011) proposed a framework for 3D vision-based tracking
that can provide 3D coordinates of entities by deploying stereo
cameras. The framework consists of several processes, including
construction entity detection, 2D tracking, and correlation of 2D
tracking results and calculation of 3D location. Thus far, only the 2D tracking of construction resources from this framework has been validated successfully (Park et al. 2011). Because of the large number of processes involved in the framework, each individual process has not yet been fully detailed and validated.
This paper presents and validates the framework’s method for
correlating 2D tracking results paired with multiple views and cal-
culating 3D location of construction entities. This method employs
stereo vision to provide 3D trajectories of moving entities. In
the current state of research, stereo vision has been applied to
1Ph.D. Candidate, School of Civil and Environmental Engineering,
Georgia Institute of Technology, 130 Hinman Research Building, 723
Cherry St., Atlanta, GA 30332 (corresponding author). E-mail: mw.park@
gatech.edu
2Postdoctoral Associate, Computing in Engineering, Faculty of
Civil and Environmental Engineering, Ruhr-Universität Bochum, Uni-
versitätsstraße 150, 44780 Bochum, Germany. E-mail: koch@inf.bi.rub.de
3Assistant Professor, School of Civil and Environmental Engineering,
Georgia Institute of Technology, 328 Sustainable Education Building, 788
Atlantic Dr. NW, Atlanta, GA 30332. E-mail: brilakis@gatech.edu
Note. This manuscript was submitted on June 10, 2011; approved on
September 16, 2011; published online on September 19, 2011. Discussion
period open until December 1, 2012; separate discussions must be sub-
mitted for individual papers. This paper is part of the Journal of Comput-
ing in Civil Engineering, Vol. 26, No. 4, July 1, 2012. ©ASCE, ISSN
0887-3801/2012/4-541–549/$25.00.
JOURNAL OF COMPUTING IN CIVIL ENGINEERING © ASCE / JULY/AUGUST 2012 / 541
J. Comput. Civ. Eng. 2012.26:541-549.
Downloaded from ascelibrary.org by Christian Koch on 07/10/12. For personal use only.
No other uses without permission. Copyright (c) 2012. American Society of Civil Engineers. All rights reserved.
3D modeling for construction progress monitoring (Chae and Kano
2007;Son and Kim 2010;Golparvar-Fard et al. 2010), focusing on
the retrieval and visualization of static structure components,
whereas the focus of this paper lies in 3D localization of moving
entities and the accuracy of localization measurements. The camera
system consists of two fixed cameras with a baseline of several meters, which is significantly longer than the Bumblebee's 24-cm baseline (Point Grey 2011). The long baseline allows competitive
accuracy in localizing far-located entities. Under the proposed
method, in the first step, cameras are calibrated to find their intrin-
sic parameters, i.e., the focal length, the principal point, radial dis-
tortions, and tangential distortions of a camera. The second step is
to estimate a relative pose (rotation and translation) of the calibrated
cameras, which is called extrinsic parameters. Once the intrinsic
and extrinsic parameters are known, a 2D tracker is applied to every
video frame of each camera in the third step. Using a kernel-based
2D tracking algorithm, the 2D pixel coordinates of an entity’s cent-
roid are determined. To obtain 3D locations, in the fourth step, 2D
tracking results are triangulated on the basis of the intrinsic and
extrinsic parameters. In every frame, a projection of the determined
centroid is obtained for each camera. Finally, an intersection of the
two projections from two cameras determines the 3D location of a
tracked entity.
The proposed method is tested on videos recorded at a construction site. The tests involve three types of entities: a steel plate,
a worker, and a van. Various point matching methods and different
baseline lengths are applied to identify their effects on accuracy.
The results show a maximum error of 0.658 m at a 95% confidence level, which validates the effectiveness, accuracy, and applicability of the proposed vision-based 3D tracking approach.
Background
State of Practice in Tracking Technology
Common tracking methods are either based on radio frequency, which includes several technologies such as GPS, RFID, Bluetooth, wireless fidelity (Wi-Fi), and ultra-wideband, or they make use of optical remote sensors, such as 2D image/video cameras and 3D range cameras, e.g., Flash LADAR.
Global positioning system is an outdoor satellite-based world-
wide navigation system formed by a constellation of satellites and
ground control stations. The 3D position is determined by a GPS
receiver using triangulation on the basis of these satellites. Global
positioning system is an established location technology that offers
a wide range of off-the-shelf solutions in both hardware and soft-
ware. According to Caldas et al. (2004), GPS has been applied to construction practices such as equipment positioning and surveying. However, GPS alone has limited potential in other applications, such as improving the management of materials on construction job sites. Moreover, it can operate only outdoors, and its accuracy is only approximately 10 m.
Radio frequency identification is used for identifying and
tracking various objects (Ergen et al. 2007). Radio frequency iden-
tification systems are primarily composed of a tag and a reader.
Radio frequency identification technology does not require line
of sight and it is also durable in harsh environments and can be
embedded in concrete. Radio frequency identification enables
efficient automatic data collection because readers can be mounted
on any structure in the reading range and each reader can scan
multiple tags at a given time. However, this technology, unless
combined with other tools (Ergen et al. 2007), can only report
the radius inside which the tracked entity exists, and most impor-
tantly, the near-sighted effect prohibits its use in tracking applica-
tions. Combinations of GPS and RFID technologies have recently been explored (Song et al. 2004, 2006). The advantage of this combination is that GPS sensors need only accompany the tag readers, not the materials. Every time a tag is located, the 3D coordinates reported by the GPS can be recorded as the location of that piece of material at that given time.
Another type of radio technology that can be applied to short-
range communications is UWB. Ultra-wideband is able to detect
time of flight of the radio transmissions at various frequencies,
which enables it to perform effectively in providing precision
localization even in the presence of severe multipath effects
(Fontana et al. 2003). Another advantage is the low average power
requirement that results from the low pulse rate (Fontana 2004).
Teizer et al. (2007b) applied the UWB technology to construction.
It was used for a material location tracking system with primary
applications to active work zone safety. Its ability to provide accu-
rate 3D locations in real time is a definite benefit to tracking in
construction sites.
Vision technologies and laser technologies are attracting increasing interest for tracking in large-scale, congested sites because they are free of tags. A 3D range imaging/video camera,
e.g., a Flash LADAR, provides not only the intensity but also the
estimated range of the corresponding image area. When compared
with 3D laser scanners, which have been used in construction, the
device is portable and inexpensive. Testing various kinds of data
filtering, transformation, and clustering algorithms, Gong and
Caldas (2008) used 3D range cameras for spatial modeling. Teizer
et al. (2007a) demonstrated tracking with 3D range cameras and the
potential of its use for site safety enhancement. However, the low resolution and short range make it difficult to apply to large-scale construction sites. Few tests have been executed on outdoor construction sites, where the environments are more cluttered and less controlled. Also, it has been reported that the reflectance of a surface varies extremely even in indoor environments (Gächter et al. 2006). Moreover, when multiple cameras are used, they can interfere with one another (Fuchs 2010).
Traditional 2D vision trackers are based simply on a sequence of images and can be a proper alternative to RFID methods because they remove the need for installing sensors and identity (ID) tags of any kind on the tracked entity. For this reason, this technology is (1) highly applicable in dynamic, busy construction sites in which large numbers of equipment, personnel, and materials are involved; and (2) more desirable for personnel who wish to avoid being "tagged" with sensors. In Gruen's research (1997), it is
highly regarded for its capability to measure a large number of par-
ticles with a high level of accuracy. Yang et al. (2010) proposed a
vision tracker that can track multiple construction workers. Gong
and Caldas (2010) showed the applicability of vision tracking to
automated productivity analysis.
Two-dimensional vision trackers can be categorized into kernel-based, contour-based, and point-based methods, depending on the way of representing objects. In kernel-based methods, an object is
represented by the color or texture in the region of interest, and its
position in the next frame is estimated on the basis of the region’s
color or texture information. In contour-based methods, an object is
represented by silhouettes or contours that determine the boundary
of the object. In point-based methods, an object is represented by a
set of feature points extracted from the region that contains the
object. Out of the three categories, kernel-based methods are the most suitable for construction-related applications with respect to construction site characteristics, such as illumination
condition, occlusion, and object types. Park et al. (2011) reported
that kernel-based methods could effectively track construction en-
tities in various illumination conditions and that they performed
well even on objects occluded by 50% or more. Entities that fail to be tracked because of severe occlusion can still be recovered by reinitialization within an object detection process. The 2D tracker used in this paper is based on the method of Ross et al. (2008). It tracks an object on the basis of a model template composed of eigenimages and represents the tracked object as an affine-transformed rectangle that encloses it. Six affine parameters (the x and y coordinates of the centroid, scale, aspect ratio, rotation, and skew) are estimated through particle filtering.
Stereo View Geometry
Two-dimensional vision-based tracking is not comparable with
other 3D technologies previously described unless it can provide
3D information. To reconstruct the 3D position of an entity, several
steps must be taken to determine the stereo view geometry (Hartley
and Zisserman 2004). Heikkilä and Silvén (1997), Zhang (1999),
and Bouguet (2004) presented and provided standard calibration
tools. The calibration tools reveal intrinsic camera parameters,
including the focal length, the principal point, radial distortions,
and tangential distortions. They use calibration objects that have
specific patterns, such as a checkerboard. In Zhang’s calibration
method, tangential distortion is not modeled. Heikkilä and Silvén’s
toolbox and Bouguet’s toolbox use the same distortion model that
takes into account both radial and tangential distortions. Therefore,
both toolboxes generally result in almost equivalent calibration.
Bouguet provides additional functions, such as error analysis,
which is useful to recalibrate with revised inputs.
After each camera has been calibrated separately, the external geometry of the camera system has to be determined (see Fig. 1). For this purpose,
feature points are identified and matched within the two camera
views. The most well-known and robust algorithms commonly
used for this task are the scale-invariant feature transform (SIFT)
(Lowe 2004) and speeded up robust features (SURF) (Bay et al.
2008). Whereas SIFT uses Laplacian of Gaussian (LOG), differ-
ence of Gaussian (DOG), and histograms of local oriented gra-
dients, SURF relies on a Hessian matrix and the distribution of
Haar-wavelet responses for feature point detection and matching,
respectively. Although SIFT turned out to be slightly better in terms
of accuracy, SURF is computationally much more efficient (Bauer
et al. 2007). The algorithms SIFT and SURF provide point matches,
including extreme outliers (mismatches) that have to be removed.
To achieve that, robust algorithms for managing the outliers were
introduced. Random sample consensus (RANSAC) (Hartley and
Zisserman 2004) and maximum a posteriori sample consensus
(MAPSAC) (Torr 2002) are the representative robust methods.
The RANSAC method minimizes the number of outliers by ran-
domly selecting a small subset of the point matches and repeating
the maximization process for different subsets until it reaches a de-
sired confidence in the exclusion of outliers. One of its problems is
the poor estimates associated with a high threshold (Torr 2002).
Working in a similar way to RANSAC, MAPSAC resolved this
problem by minimizing not only the number of outliers but also
the error associated with the inliers.
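The sample-and-score consensus idea behind these methods can be sketched on a toy problem. The following is a minimal illustration on 2D line fitting with made-up tolerances, not the essential-matrix pipeline itself:

```python
import numpy as np

def ransac_line(points, n_iters=200, inlier_tol=0.05, seed=0):
    """Fit a 2D line to noisy points with RANSAC-style outlier rejection.

    Repeatedly samples a minimal subset (two points), fits a candidate
    line, and keeps the candidate that gathers the largest consensus set.
    """
    rng = np.random.default_rng(seed)
    best_inliers = None
    for _ in range(n_iters):
        p, q = points[rng.choice(len(points), size=2, replace=False)]
        d = q - p
        norm = np.linalg.norm(d)
        if norm < 1e-9:
            continue
        # Perpendicular distance of every point to the candidate line.
        normal = np.array([-d[1], d[0]]) / norm
        dist = np.abs((points - p) @ normal)
        inliers = dist < inlier_tol
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit by least squares on the consensus set (slope/intercept form).
    x, y = points[best_inliers].T
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept, best_inliers

# Synthetic data: y = 2x + 1 plus a few gross outliers.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
pts = np.column_stack([x, 2 * x + 1 + rng.normal(0, 0.01, 50)])
pts[:5] = rng.uniform(0, 10, (5, 2))          # gross outliers
m, b, inl = ransac_line(pts)
```

MAPSAC follows the same loop but scores candidates by the inlier errors as well as the inlier count, rather than the count alone.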
The next step is the estimation of the essential matrix, E, on the
basis of the identified point matches. In general, the normalized
eight point (Hartley 1997), seven point (Hartley and Zisserman
2004), six point (Pizarro et al. 2003), and five point (Nistér
2004) algorithms are used for this purpose. Eight, seven, six, and five are the minimum numbers of points required to perform the estimation. Rashidi et al. (2011) compared the resulting accu-
racy of these algorithms in practical civil infrastructure environ-
ments, finding the five-point algorithm to be the best. However,
because of its simplicity and reasonable accuracy the normalized
eight-point algorithm is still the most common one and the second
best according to Brückner et al. (2008). On the basis of the essential matrix, E, the relative pose of the two cameras (R and T in Fig. 1) can be derived directly (Hartley and Zisserman 2004).
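This direct derivation can be sketched with the standard SVD-based decomposition. The following is a generic illustration of that well-known result, not the authors' implementation; the physically valid pose among the four candidates would normally be chosen by a points-in-front (cheirality) check:

```python
import numpy as np

def decompose_essential(E):
    """Return the four candidate (R, t) poses encoded in an essential matrix.

    Uses the standard result E = [t]x R with SVD E = U diag(1,1,0) V^T:
    R in {U W V^T, U W^T V^T}, t = +/- third column of U.
    """
    U, _, Vt = np.linalg.svd(E)
    # Enforce proper rotations (determinant +1).
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

def skew(t):
    """Cross-product matrix [t]x such that [t]x v = t x v."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Build E from a known pose and decompose it again.
angle = 0.1
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, 0.2, 0.1])
t_true /= np.linalg.norm(t_true)      # E fixes t only up to scale
E = skew(t_true) @ R_true
candidates = decompose_essential(E)
```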
In the last step, triangulation is performed. On the basis of two
corresponding pixels in the respective view, two lines of sight have
to be intersected to find the 3D position (Fig. 1). However, because
of image noise and slightly incorrect point correspondences, the
two rays may not intersect in space. To address this problem,
Hartley-Sturm optimal triangulation (Hartley and Sturm 1997)
and optimal correction (Kanatani et al. 2008) algorithms are cur-
rently used as standard methods for finding corrected correspond-
ences. They both try to find the minimum displacement through the
geometric error minimization, correct the pixel coordinates accord-
ingly, and intersect the corrected rays to determine 3D coordinates.
Although the latter is faster, the former's results are more accurate (Fathi and Brilakis 2011).
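The basic intersection step can be sketched with the simple linear (DLT) triangulation, shown below as an illustrative baseline rather than the optimal correction methods discussed above; the camera matrices and point are hypothetical:

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linearly triangulate one point from two views (DLT).

    P1, P2 are 3x4 projection matrices; x1, x2 are (u, v) coordinates of
    the same point in each view. Solves the homogeneous system AX = 0
    via SVD, taking the singular vector of the smallest singular value.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]      # dehomogenize

def project(P, X):
    """Project a 3D point with a 3x4 camera matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two hypothetical cameras with a sideways baseline (identity intrinsics).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-2.0], [0.0], [0.0]])])  # 2-m baseline
X_true = np.array([1.0, 0.5, 40.0])           # a far-away point, as on site
X_est = triangulate_dlt(P1, P2, project(P1, X_true), project(P2, X_true))
```

With noisy, non-intersecting rays, the optimal methods cited above first correct the two pixel coordinates before this intersection.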
Several researchers have introduced and applied stereo vision
technologies to construction. Most applications presented so far
are related to 3D modeling of structures for progress monitoring.
Chae and Kano (2007) estimated spatial data for development of a
project control system from stereo images. In another work, Son
and Kim (2010) used a stereo vision system to acquire 3D data
and to recognize 3D structural components. Golparvar-Fard et al.
(2010) presented a sparse 3D representation of a site scene using
daily progress photographs for use as an as-built model. In contrast to creating 3D geometry models from static feature points, the application of stereo vision in this paper locates moving entities in 3D across time. Furthermore, this paper measures the accuracy of 3D positioning by comparison with total station data.
Problem Statement and Objectives
As described in the previous section, the results of general vision-
based tracking are restricted to 2D. The applications of these
results are limited at large-scale, congested construction sites.
Brilakis et al. (2011) introduced a framework for 3D vision
tracking, which employs multiple fixed cameras to calculate the
3D location of an entity. From this framework, this paper aims
to present and validate the method of combining 2D tracking results
Fig. 1. Epipolar geometry and centroid relocation
JOURNAL OF COMPUTING IN CIVIL ENGINEERING © ASCE / JULY/AUGUST 2012 / 543
J. Comput. Civ. Eng. 2012.26:541-549.
Downloaded from ascelibrary.org by Christian Koch on 07/10/12. For personal use only.
No other uses without permission. Copyright (c) 2012. American Society of Civil Engineers. All rights reserved.
with stereo view geometry to obtain accurate 3D trajectories of distant construction entities. This research aims strictly at accurate localization of construction entities, not at real-time processing. Each step involved in this method should be optimized to the characteristics of the fixed camera system and of construction sites, such as the various types of construction entities, the long baseline, and the long camera-to-entity distances that are inevitable at large-scale construction sites.
Methodology
The proposed method is shown in Fig. 2 and is composed of four steps: camera calibration, camera pose estimation, 2D tracking, and triangulation. To calculate the 3D positions of an object, the registration of the camera system is required. The camera system in this method is composed of two cameras located several meters apart from one another. This system is described by epipolar geometry, as shown in Fig. 1. This geometry involves two types of parameters: intrinsic and extrinsic. Intrinsic parameters determine the linear system of projecting 3D points onto the image plane (P1 and P2 in Fig. 1). Bouguet's calibration toolbox (2004) is used to reveal the intrinsic parameters because of its accuracy, robust convergence, and convenience.
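The role of the intrinsic parameters can be illustrated with the standard pinhole-plus-distortion (Brown) model that such toolboxes estimate. The numbers below are hypothetical HD-camera values, not the calibrated parameters of this study:

```python
import numpy as np

def project_point(X, fx, fy, cx, cy, k1=0.0, k2=0.0, p1=0.0, p2=0.0):
    """Project a 3D camera-frame point to pixels with the Brown model.

    fx, fy: focal lengths; (cx, cy): principal point; k1, k2: radial
    distortion coefficients; p1, p2: tangential distortion coefficients.
    """
    x, y = X[0] / X[2], X[1] / X[2]            # normalized coordinates
    r2 = x * x + y * y
    radial = 1 + k1 * r2 + k2 * r2 ** 2
    xd = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)
    yd = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
    return np.array([fx * xd + cx, fy * yd + cy])

# A point 40 m away, as on site, seen by a hypothetical 1080p camera.
u, v = project_point(np.array([1.0, 0.5, 40.0]),
                     fx=1500.0, fy=1500.0, cx=960.0, cy=540.0,
                     k1=-0.1, k2=0.01)
```

Calibration solves the inverse problem: given many observed checkerboard corners, it estimates fx, fy, cx, cy, and the distortion coefficients that best explain them.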
The focal point of the left camera becomes the origin of the coordinate system. Extrinsic parameters represent the relative pose of the right camera with respect to the left one (the rotation matrix R and the translation vector T in Fig. 1). The estimation of R and T involves point matching between the two views. Two combinations of algorithms are considered in this paper. One uses SURF (Bay et al. 2008) and RANSAC (Hartley and Zisserman 2004) for the feature descriptor and outlier removal, respectively. This combination proved to be fast and accurate enough for point cloud generation of infrastructure (Fathi and Brilakis 2011). The other uses SIFT (Lowe 2004) and MAPSAC (Torr 2002), which is slower but capable of acquiring more matches than the former combination. Even though SIFT is slower than SURF, this combination is worth considering in the application for the following reasons. First, the cameras are fixed in the application, which requires camera pose estimation only once, at the initial stage of the framework. Therefore, the longer processing time of SIFT can be ignored. Second, as a longer baseline, i.e., distance between the two cameras, is used, fewer point matches are obtained because of the higher disparity between the two camera views. In this case, SIFT and MAPSAC can help feed more inlier matches and fewer outlier matches to the next step.
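Lowe's distance-ratio test, which both descriptor pipelines use to discard ambiguous matches (and which the experiments later revisit with thresholds of 0.6 and 0.8), can be sketched as follows, with random vectors standing in for real SIFT/SURF descriptors:

```python
import numpy as np

def ratio_test_matches(desc1, desc2, ratio=0.6):
    """Match descriptors from two views using Lowe's distance-ratio test.

    A match is kept only if the nearest neighbor in desc2 is clearly
    closer than the second-nearest one (distance ratio below threshold).
    """
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        nearest = np.argsort(dists)[:2]
        if dists[nearest[0]] < ratio * dists[nearest[1]]:
            matches.append((i, int(nearest[0])))
    return matches

# Synthetic example: view-2 descriptors are noisy copies of view 1's.
rng = np.random.default_rng(0)
desc1 = rng.normal(size=(20, 64))
desc2 = desc1 + rng.normal(scale=0.01, size=desc1.shape)
matches = ratio_test_matches(desc1, desc2, ratio=0.6)
```

A lower ratio keeps fewer but cleaner matches, which is exactly the trade-off studied with DR values of 0.6 and 0.8 in the experiments.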
The normalized eight-point algorithm (Hartley 1997) is selected to estimate the essential matrix on the basis of the intrinsic parameters and point matches. The selected method is the most widely used because of its simple implementation and reasonably accurate results. Although this method is less computationally efficient and more sensitive to degeneracy problems than other methods (Nistér 2004; Li and Hartley 2006), it is still efficient and accurate enough to satisfy the needs arising from the fixed camera positions, the long baseline, and the complexity of construction sites. Finally, the extrinsic parameters, R and T, are recovered directly from the essential matrix (Hartley and Zisserman 2004). These parameters, together with the intrinsic parameters, are used for triangulating the 2D tracking results.
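A compact sketch of eight-point estimation on calibrated (normalized) image coordinates follows. It is illustrative only: the full normalized algorithm additionally rescales the coordinates for numerical conditioning (Hartley 1997), which is omitted here, and the synthetic pose is made up:

```python
import numpy as np

def eight_point_essential(x1, x2):
    """Estimate an essential matrix from >= 8 normalized correspondences.

    x1, x2: (N, 2) arrays of image coordinates already multiplied by the
    inverse intrinsic matrix (hence 'essential', not 'fundamental').
    Solves x2h^T E x1h = 0 in least squares, then enforces the rank-2,
    equal-singular-value constraint of an essential matrix.
    """
    x1h = np.column_stack([x1, np.ones(len(x1))])
    x2h = np.column_stack([x2, np.ones(len(x2))])
    # Each correspondence gives one linear equation in the 9 entries of E.
    A = np.einsum('ni,nj->nij', x2h, x1h).reshape(len(x1), 9)
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)
    # Project onto the space of valid essential matrices.
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt

# Synthetic scene: 20 points seen by two cameras with a known pose.
rng = np.random.default_rng(2)
X = rng.uniform([-5, -5, 5], [5, 5, 15], size=(20, 3))
angle = 0.2
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([1.0, 0.1, 0.0])
X2 = X @ R.T + t                      # points in the second camera frame
x1 = X[:, :2] / X[:, 2:]              # normalized image coordinates
x2 = X2[:, :2] / X2[:, 2:]
E = eight_point_essential(x1, x2)
```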
For each calibrated camera view, an identified construction
entity is tracked across subsequent frames. According to the com-
parative study of Park et al. (2011), a kernel-based 2D tracker,
which is based on the method by Ross et al. (2008), is used. In
this paper, the eigenimage is constructed selectively with grayscale values or saturation values, depending on the tracked entity's color characteristics, to enhance accuracy. Also, in the particle filtering process, the position translation (Δx and Δy between consecutive frames) is estimated instead of the entity location (x and y coordinates). This estimation strategy helps to correctly locate the entity with fewer samples in particle filtering. The centroid coordinates are updated every frame by accumulating the estimated translation vector.
The results obtained in the previous steps, the epipolar geometry and the two centroids, are fed into the triangulation step. Generally, the projections of the two centroid coordinates determined from the two views do not intersect one another because of camera lens distortions and errors caused by 2D tracking. Even if the 2D tracker correctly locates the entity in each frame, the disparity between the two camera views causes a mismatch of the centroids. To enhance the accuracy of the triangulation process, the two centroids have to be relocated so that their projections intersect (see Fig. 1). For this purpose, Hartley
and Sturm’s algorithm (Hartley and Sturm 1997) is selected be-
cause the accuracy is more critical than the processing time in
the application. Intersecting projections of the modified pair of
centroids for each frame leads to the 3D coordinate of the tracked
entity.
Experiments and Results
The data for validation are collected from a construction site at the
Georgia Institute of Technology. This site is the construction of an
indoor football practice facility managed by Barton Malow Com-
pany. The roof and columns of the steel-framed facility were
already completed when the data were collected. The videos were taken with two high-definition (HD) camcorders (Canon VIXIA HF S100, 30 frames per second, 1,920 × 1,080 pixels) located approximately 4.5 m above the ground on one side of the facility structure, where the ground area of the facility could be overlooked. One total station (Sokkia SET 230RK3) was used to acquire the ground truth of the entities' trajectories, which is compared with the obtained results.
Figs. 3 and 4 show the positions of the cameras and the entities' trajectories from a bird's eye view on the basis of the total station coordinate system and the cameras' views. In Figs. 3 and 4, trajectories 1 and 2 are composed of 10 and eight segments of straight lines,
Fig. 2. Methodology overview (for each camera: camera calibration yields intrinsic parameters; camera pose estimation yields the essential matrix; 2D tracking yields centroids of entities; triangulation then produces 3D coordinates)
located approximately 39 and 43 m from the left camera, respectively. Trajectory 3 is one straight line located 36 m from the left camera. The total station data include the end points of all segments, i.e., nine, 11, and two points for trajectories 1, 2, and 3,
respectively. The ground-truth trajectories are made by connecting
those points with straight lines. The proposed methodology is
tested on three types of entities: a worker, a steel plate carried
by a worker, and a van. Trajectories 1 and 2 are those of a worker
and a steel plate, and trajectory 3 is of a van. The accuracy of
tracking is quantified by an absolute error that is defined as the
distance between the tracked point and the ground-truth trajectory.
For each frame j, the distance D_j is calculated by the following equation:

D_j = |(Q_{i+1} - Q_i) × (P_j - Q_i)| / |Q_{i+1} - Q_i|

where Q_i and Q_{i+1} = endpoints of the ith line segment L_i = Q_i + t(Q_{i+1} - Q_i) of the ground-truth trajectories on which the object in frame j lies; and P_j = the jth frame's tracking result, i.e., a 3D point. The main causes of error considered in this paper can be classified into 2D tracker error and camera pose estimation error. Also, the assumption that an entity moves exactly along a straight line is a further, miscellaneous cause of error.
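This error metric can be written directly in code (note that the cross-product form measures the distance to the infinite line through the segment's endpoints):

```python
import numpy as np

def trajectory_error(P, Q_start, Q_end):
    """Distance from a tracked 3D point P to the ground-truth line
    through segment endpoints Q_start and Q_end, per the equation above:
    D = |(Q_end - Q_start) x (P - Q_start)| / |Q_end - Q_start|.
    """
    d = Q_end - Q_start
    return np.linalg.norm(np.cross(d, P - Q_start)) / np.linalg.norm(d)

# A tracked point 0.3 m off a segment running along the x-axis.
D = trajectory_error(np.array([2.0, 0.3, 0.0]),
                     np.array([0.0, 0.0, 0.0]),
                     np.array([5.0, 0.0, 0.0]))   # D = 0.3
```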
Camera Calibration and Camera Pose Estimation
For the purpose of camera calibration, a video of a moving checker-
board (7 by 9 blocks of 65 × 65 mm squares) is recorded by each
camera. A total of 26 frames are selected appropriately to have vari-
ous angles of view and are fed into Bouguet’s calibration toolbox
(Bouguet 2004). Once the checkerboard videos are taken and the
cameras are calibrated, all camera system settings remained the
same through the experiments. All functions that may automati-
cally cause a change in the camera intrinsic parameters, such as
autofocus and automated image stabilization, are disabled. Out
of all the video frames, a pair of corresponding frames of left
and right cameras is used to obtain a large number of point matches.
The point matches and calculated intrinsic parameters are used to
estimate camera poses. Because the positions of the cameras are
fixed in the proposed method, all these procedures are required only
once as a preprocess.
Tracking of Steel Plate
A 0.6-m by 0.3-m steel plate is chosen as the first entity to track.
The plate is carried by a worker walking along trajectories 1 and 2. The video contains 1,430 frames in total, with 790 and 640 frames for trajectories 1 and 2, respectively, meaning the results comprise 1,430 tracked 3D coordinates. In this experiment, right camera 1 (Fig. 3) is set to provide a 3.8-m baseline. The template model for the 2D tracker is composed of gray pixel values. The tracker accurately fits the steel plate with an affine-transformed rectangle in most frames. Therefore, it can be inferred that the errors in this experiment mostly come from triangulation, including camera pose estimation. Fig. 5 shows 3D tracking results obtained with different
Fig. 3. Layout of tests from bird's eye view (X-Z plane positions of the total station, the left camera, right camera 1 at a 3.8-m baseline, right camera 2 at an 8.3-m baseline, and trajectories 1, 2, and 3)
Fig. 4. Entities’trajectories: (a) trajectories 1 and 2 from view of right
camera 1; (b) trajectory 3 from view of right camera 2
Fig. 5. Tracking results of steel plate (3D trajectories versus ground truth for SIFT+MAPSAC (DR = 0.6), SURF+RANSAC (DR = 0.8), and SURF+RANSAC (DR = 0.6); axes X, Y, Z in meters)
camera pose estimation methods, and Table 1 summarizes the error results.
The SURF algorithm is tested with two threshold values of distance ratio (DR): 0.8 and 0.6. The distance ratio is the ratio of the distance to the closest neighbor to the distance to the second-closest neighbor (Lowe 2004). Discarding feature points whose distance ratios are higher than the threshold is an effective way of reducing false-positive matches. In the case of DR = 0.8, more point matches are obtained than with DR = 0.6, but they contain apparent outliers (Fig. 6) that adversely affect essential matrix estimation. The effect of these outliers is reflected in the large tracking error. Even though SURF with a DR of 0.6 generates fewer point matches than the others, it reduces outliers significantly and performs even better than SIFT (DR = 0.6) with MAPSAC, which provides approximately twice as many point matches. Assuming the error follows a normal
Table 1. Errors of Tracking Steel Plate (errors in meters)

Method            DR   Point    Total              Trajectory 1       Trajectory 2
                       matches  Max   Mean  STD    Max   Mean  STD    Max   Mean  STD
SIFT plus MAPSAC  0.6  568      0.836 0.252 0.179  0.836 0.314 0.192  0.569 0.177 0.125
SURF plus RANSAC  0.8  423      3.965 1.220 0.911  3.965 1.537 0.983  2.532 0.828 0.620
SURF plus RANSAC  0.6  271      0.631 0.180 0.127  0.631 0.222 0.136  0.429 0.127 0.091

Note: DR = distance ratio; STD = standard deviation.
Fig. 6. Point matches obtained by SURF plus RANSAC; DR = 0.8
Table 2. Errors of Tracking Van

Method            DR   Point    Error: Trajectory 3 (m)
                       matches  Max    Mean   STD
SIFT plus MAPSAC  0.6  230      0.865  0.278  0.194
SURF plus RANSAC  0.8  235      1.239  0.426  0.327
SURF plus RANSAC  0.6  183      0.931  0.289  0.235

Note: STD = standard deviation.
Fig. 8. 2D tracking results in right camera view
Fig. 7. Tracking results of van (ground truth vs. SIFT+MAPSAC with DR=0.6, SURF+RANSAC with DR=0.8, and SURF+RANSAC with DR=0.6; X, Y, and Z axes in meters)
Fig. 9. Tracking results of worker with short baseline (ground truth vs. SIFT+MAPSAC with DR=0.6 and SURF+RANSAC with DR=0.6; X, Y, and Z axes in meters)
distribution, it is concluded that the tracking error is less than
0.429 m with 95% confidence.
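The distance-ratio filtering described above can be sketched in NumPy as follows; the descriptor arrays here are small synthetic placeholders, not actual SIFT or SURF output:

```python
import numpy as np

def ratio_test(desc1, desc2, dr_threshold=0.6):
    """Keep a match for each row of desc1 only if the distance to its
    nearest neighbor in desc2 is less than dr_threshold times the
    distance to the second-nearest neighbor (Lowe 2004)."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        order = np.argsort(dists)
        nearest, second = dists[order[0]], dists[order[1]]
        if nearest < dr_threshold * second:
            matches.append((i, int(order[0])))
    return matches

# Two synthetic 4D descriptors per image: one distinctive, one ambiguous
desc1 = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
desc2 = np.array([[1.0, 0.1, 0.0, 0.0],    # close to desc1[0]
                  [5.0, 5.0, 5.0, 5.0],    # far from everything
                  [0.0, 1.0, 0.1, 0.0],    # close to desc1[1] ...
                  [0.0, 1.0, 0.15, 0.0]])  # ... but so is this one
print(ratio_test(desc1, desc2))  # [(0, 0)]: only the unambiguous match
```

Raising the threshold to 0.8 admits the ambiguous second match as well, mirroring the behavior reported above: more matches, but more potential outliers.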
Tracking of Van
The second experiment deals with tracking a van, 2 m wide, 1.95 m high, and 5.13 m long, moving forward and backward along trajectory 3. The video contains a total of 1,034 frames. A long baseline (8.3 m) is tested in this experiment by placing a camera at “right camera 2” in Fig. 3. Gray pixel values are used for the templates of the 2D tracker. Fig. 7 displays the obtained trajectories together with the ground truth. Similar to the first experiment, it is observed that outliers ultimately result in inaccurate depth estimation (SURF plus RANSAC with DR = 0.8). There is a difference between the results for forward and backward movement even though they follow the same trajectory. This disparity is caused exclusively by the 2D tracking results. Fig. 8 shows the 2D tracking results in the right camera view, in which the slight difference between forward and backward trajectories is observable.
The error results are presented in Table 2. The long baseline yields a smaller number of point matches than the short baseline because of the greater difference between the left and right camera views. The number decreases to less than half of that in the first experiment, the tracking of a steel plate. SIFT plus MAPSAC, which generated 26% more matches than SURF plus RANSAC, performed better in this case. Assuming the error follows a normal distribution, it is concluded that the tracking error is less than 0.658 m with 95% confidence.
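The 95% bounds quoted throughout can be reproduced from the tabulated means and standard deviations under the stated normality assumption, since 95% of a normal distribution lies within 1.96 standard deviations of the mean:

```python
def error_bound_95(mean, std):
    """Upper 95% bound on tracking error, assuming errors ~ N(mean, std^2):
    95% of a normal distribution lies within mean +/- 1.96*std."""
    return mean + 1.96 * std

# Steel plate, SURF plus RANSAC, DR = 0.6 (Table 1): mean 0.180 m, STD 0.127 m
print(round(error_bound_95(0.180, 0.127), 3))  # 0.429 (m)

# Van, SIFT plus MAPSAC (Table 2): mean 0.278 m, STD 0.194 m
print(round(error_bound_95(0.278, 0.194), 3))  # 0.658 (m)
```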
Tracking of Worker
The third experiment is performed on a worker moving along trajectories 1 and 2. Two baseline lengths, 3.8 and 8.3 m, are tested. The videos with the short and long baselines contain 1,435 and 1,368 frames, respectively. The region of the worker’s upper body, which is well characterized by the fluorescent colors of a hard hat and a safety vest, is tracked. Instead of gray pixel values, saturation values are used to compose the template model. Figs. 9 and 10 present the trajectory results, in which it is noticeable that the longer baseline yields more stable and accurate trajectories. The longer baseline forms a larger angle between the two projections, P1 and P2, in Fig. 1, which results in a lower error rate. In Table 3, the errors of the long baseline are approximately half of those of the short
Table 3. Errors of Tracking Worker (errors in meters)

Method            Baseline    Point    Total              Trajectory 1       Trajectory 2
                  length (m)  matches  Max   Mean  STD    Max   Mean  STD    Max   Mean  STD
SIFT plus MAPSAC  3.8         584      1.959 0.523 0.357  1.959 0.605 0.374  1.490 0.426 0.309
SIFT plus MAPSAC  8.3         215      1.053 0.258 0.193  1.053 0.317 0.211  0.555 0.187 0.140
SURF plus RANSAC  3.8         503      2.549 0.714 0.481  2.549 0.841 0.503  1.791 0.562 0.404
SURF plus RANSAC  8.3         166      1.510 0.381 0.321  1.510 0.455 0.374  0.731 0.292 0.212

Note: STD = standard deviation; DR = distance ratio = 0.6.
Fig. 12. 2D tracking results of 693rd frame: (a) left camera; (b) right
camera
Fig. 11. Appearance variations: (a) steel plate; (b) worker
Fig. 10. Tracking results of worker with long baseline (ground truth vs. SIFT+MAPSAC with DR=0.6 and SURF+RANSAC with DR=0.6; X, Y, and Z axes in meters)
baseline, and SIFT plus MAPSAC produces lower errors than SURF plus RANSAC.
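Building the worker’s template from saturation rather than gray values, to exploit the fluorescent hard hat and safety vest, amounts to swapping the channel the template reads. A minimal NumPy sketch of the RGB-to-saturation conversion (the template matching itself is omitted; the pixel values are illustrative):

```python
import numpy as np

def saturation(rgb):
    """HSV-style saturation of an RGB image (floats in [0, 1]):
    S = (max - min) / max per pixel, defined as 0 for black pixels."""
    cmax = rgb.max(axis=-1)
    cmin = rgb.min(axis=-1)
    return np.where(cmax > 0, (cmax - cmin) / np.maximum(cmax, 1e-12), 0.0)

# A fluorescent-orange pixel is far more saturated than gray concrete,
# so the worker's vest stands out in the saturation channel
vest = np.array([[[1.0, 0.4, 0.0]]])      # saturated orange: S = 1.0
concrete = np.array([[[0.6, 0.6, 0.6]]])  # gray: S = 0.0
print(saturation(vest)[0, 0], saturation(concrete)[0, 0])  # 1.0 0.0
```

The saturation channel is largely invariant to the brightness changes that plague gray-value templates outdoors, which is the rationale for this choice.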
Whenever the worker changes direction, the 2D tracker suffers from severe variations in the worker’s appearance. Compared with Fig. 11(a), Fig. 11(b) shows more substantial changes in the distribution of pixel values inside the rectangle. This is why the errors with the short baseline are higher than the errors of tracking the steel plate. The error caused by the 2D tracker can be divided into two cases. The first case is when the determined centroid in each view does not exactly match the real centroid, i.e., the total station target point. The second case is when the two centroids from the left and right cameras do not correspond to one another (Fig. 12). These kinds of errors are partly compensated by the decrease in triangulation error that is achieved by using a long baseline. Assuming the error follows a normal distribution, it is concluded that the tracking error is less than 0.636 m with 95% confidence.
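The triangulation step that turns the pair of 2D centroids into a 3D point can be sketched as a linear (DLT) triangulation in the spirit of Hartley and Sturm (1997); the intrinsics and the 3.8-m stereo geometry below are illustrative, not the calibrated parameters from the experiments:

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation: stack the cross-product constraints
    u*(p3 . X) - (p1 . X) = 0 and v*(p3 . X) - (p2 . X) = 0 from both
    views and take the null vector of the resulting 4x4 system."""
    A = np.vstack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two illustrative cameras with identical intrinsics and a 3.8-m baseline
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])                   # left
P2 = K @ np.hstack([np.eye(3), np.array([[-3.8], [0.0], [0.0]])])   # right

X_true = np.array([2.0, 1.0, 40.0])  # an entity roughly 40 m away
uv1, uv2 = project(P1, X_true), project(P2, X_true)
print(triangulate(P1, P2, uv1, uv2))  # recovers approx. [2. 1. 40.]
```

With noisy centroids the two rays no longer intersect exactly; a longer baseline widens the angle between them, which is why the 8.3-m configuration reduces the depth error.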
Conclusion
In this paper, details of correlating multiple 2D tracking results were presented. Under this method, camera calibration revealed the intrinsic parameters of the cameras by processing video frames of a checkerboard. The extrinsic parameters of the two cameras were estimated using point matches between two corresponding views. A 2D tracker provided the 2D pixel coordinates of an entity’s centroid in each calibrated camera view. The epipolar geometry constructed from the intrinsic and extrinsic parameters was used to triangulate the centroids from multiple views and retrieve 3D location information.
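The extrinsic estimation recapped above rests on recovering the essential matrix from point matches. A minimal, outlier-free sketch of the linear eight-point step (Hartley 1997) on synthetic data follows; the geometry (20 random points, a 0.1-rad rotation, a 3.8-m baseline) is an illustrative assumption, and a RANSAC or MAPSAC wrapper, as used in the experiments, would repeat this estimate over random subsets of matches to reject outliers:

```python
import numpy as np

def essential_eight_point(x1, x2):
    """Linear eight-point estimate of the essential matrix E from
    N >= 8 matches in normalized image coordinates (each N x 2),
    solving x2h^T E x1h = 0 and then enforcing the essential-matrix
    constraint of two equal singular values."""
    A = np.column_stack([
        x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
        x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
        x1[:, 0], x1[:, 1], np.ones(len(x1)),
    ])
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)  # null vector of A, up to scale
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt  # project onto essential manifold

# Synthetic stereo-like ground truth: small rotation, sideways translation
rng = np.random.default_rng(0)
X = rng.uniform([-5.0, -2.0, 20.0], [5.0, 2.0, 50.0], (20, 3))
theta = 0.1
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
C = np.array([3.8, 0.0, 0.0])  # right camera center (3.8-m baseline)
Xc2 = (X - C) @ R.T            # points expressed in the right camera frame
x1 = X[:, :2] / X[:, 2:]       # left camera at the origin
x2 = Xc2[:, :2] / Xc2[:, 2:]

E = essential_eight_point(x1, x2)
x1h = np.column_stack([x1, np.ones(len(x1))])
x2h = np.column_stack([x2, np.ones(len(x2))])
residuals = np.abs(np.sum((x2h @ E) * x1h, axis=1))
print(residuals.max())  # near zero: the epipolar constraint holds
```

On real matches the residual is nonzero, and a single gross outlier can dominate this least-squares step, which is why the robust RANSAC and MAPSAC estimators were compared in the experiments.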
The proposed method was tested on videos recorded on a real construction site. The tests involved three types of entities: a steel plate, a worker, and a van. A kernel-based 2D tracker was employed, and different methods of point match extraction were tested to reveal the effect of errors caused by correlating multiple views. SIFT plus MAPSAC provided a larger number of point matches, which generally resulted in a good estimation of the extrinsic parameters, especially for long baselines. For the tracking of the steel plate and the van, the maximum errors determined with 95% confidence were smaller than the entity’s width. The varied appearance of the worker from the front, side, and rear views brought about larger 2D tracking errors than the tracking of the steel plate. However, the result is at most a 0.658-m error with 95% confidence using a long baseline. The results validated that the vision-based 3D tracking approach can effectively provide accurate localization of construction site entities at distances ranging from approximately 40 to 50 m.
The sole objective of this research is to achieve competitive accuracy in 3D positioning; real-time processing is not an immediate target. At the prototype level expected of the current research, working with high-definition video is not a real-time process. However, several types of applications do not require real-time processing and can be postprocessed, e.g., productivity measurement, progress monitoring, and activity sequence analysis. It is also expected that real-time commercial implementation is attainable through code optimization and parallel computing. For example, access to the pixel data of a high-definition image, which takes a significant amount of processing time, can be reduced by discarding static pixel areas. The next step of future work is to investigate how visual pattern recognition methods can be used to automatically recognize and match entities, which would remove the need for manual entity selection and help recover from tracking failures. Furthermore, it is worthwhile to research camera networks composed of multiple stereo camera systems; various viewing angles and the connections among them can reduce failures caused by occlusions.
Acknowledgments
This material is based on work supported by the National Science Foundation under Grants No. 0933931 and 0904109. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. The authors would also like to thank Keitaro Kamiya, Masoud Gheisari, and the Barton Malow Company for their help in collecting data for the experiments.
References
Bauer, J., Sünderhauf, N., and Protzel, P. (2007). “Comparing several implementations of two recently published feature detectors.” Proc., Int. Conf. on Intelligent and Autonomous Systems, Institute of Electrical and Electronics Engineers (IEEE), New York.
Bay, H., Tuytelaars, T., and Van Gool, L. (2008). “SURF: Speeded up robust features.” Comput. Vis. Image Understanding, 110(3), 346–359.
Bouguet, J. Y. (2004). “Camera calibration toolbox for Matlab.” 〈http://www.vision.caltech.edu/bouguetj/calib_doc〉 (Apr. 18, 2011).
Brilakis, I., Park, M.-W., and Jog, G. (2011). “Automated vision tracking of project related entities.” Adv. Eng. Inf., 25(4), 713–724.
Brückner, M., Bajramovic, F., and Denzler, J. (2008). “Experimental evaluation of relative pose estimation algorithms.” Proc., 3rd Int. Conf. on Computer Vision Theory and Applications, Vol. 2, Institute for Systems and Technologies of Information, Control and Communication (INSTICC), Setubal, Portugal, 431–438.
Caldas, C. H., Torrent, D. G., and Haas, C. T. (2004). “Integration of automated data collection technologies for real-time field materials management.” Proc., 21st Int. Symp. on Automation and Robotics in Construction, International Association for Automation and Robotics in Construction.
Chae, S., and Kano, N. (2007). “Application of location information by stereo camera images to project progress monitoring.” Proc., 24th Int. Symp. on Automation and Robotics in Construction, International Association for Automation and Robotics in Construction, Eindhoven, Netherlands, 89–92.
Ergen, E., Akinci, B., and Sacks, R. (2007). “Tracking and locating components in a precast storage yard utilizing radio frequency identification technology and GPS.” Autom. Constr., 16(3), 354–367.
Fathi, H., and Brilakis, I. (2011). “Automated sparse 3D point cloud generation of infrastructure using its distinctive visual features.” Adv. Eng. Inf., 25(4), 760–770.
Fontana, R. J. (2004). “Recent system applications of short-pulse ultra-wideband (UWB) technology.” IEEE Trans. Microwave Theory Tech., 52(9), 2087–2104.
Fontana, R. J., Richley, E., and Barney, J. (2003). “Commercialization of an ultra wideband precision asset location system.” Proc., IEEE Conf. on Ultra Wideband Systems and Technologies, Institute of Electrical and Electronics Engineers (IEEE), New York, 369–373.
Fuchs, S. (2010). “Multipath interference compensation in time-of-flight camera images.” Proc., 20th Int. Conf. on Pattern Recognition, IEEE Computer Society, Washington, DC, 3583–3586.
Gächter, S., Nguyen, V., and Siegwart, R. (2006). “Results on range image segmentation for service robots.” Proc., IEEE Int. Conf. on Computer Vision Systems, Institute of Electrical and Electronics Engineers (IEEE), New York.
Golparvar-Fard, M., Peña-Mora, F., and Savarese, S. (2010). “Application of D4AR—A 4-dimensional augmented reality model for automating construction progress monitoring data collection, processing and communication.” J. Inf. Technol. Constr., 14, 129–153.
Gong, J., and Caldas, C. H. (2008). “Data processing for real-time construction site spatial modeling.” Autom. Constr., 17(5), 526–535.
Gong, J., and Caldas, C. H. (2010). “Computer vision-based video interpretation model for automated productivity analysis of construction operations.” J. Comput. Civ. Eng., 24(3), 252–263.
Gruen, A. (1997). “Fundamentals of videogrammetry—a review.” Hum. Movement Sci., 16(2–3), 155–187.
Hartley, R. (1997). “In defense of the eight-point algorithm.” IEEE Trans. Pattern Anal. Mach. Intell., 19(6), 580–593.
Hartley, R., and Sturm, P. (1997). “Triangulation.” Comput. Vis. Image Understanding, 68(2), 146–157.
Hartley, R., and Zisserman, A. (2004). Multiple view geometry in computer vision, Cambridge University Press, Cambridge, UK.
Heikkilä, J., and Silvén, O. (1997). “A four-step camera calibration procedure with implicit image correction.” Proc., IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, Institute of Electrical and Electronics Engineers (IEEE), New York, 1106–1112.
Kanatani, K., Sugaya, Y., and Niitsuma, H. (2008). “Triangulation from two views revisited: Hartley-Sturm vs. optimal correction.” Proc., 19th British Machine Vision Conf., British Machine Vision Association and Society for Pattern Recognition, Malvern, UK, 173–182.
Li, H., and Hartley, R. (2006). “Five-point motion estimation made easy.” Proc., 18th Int. Conf. on Pattern Recognition (ICPR 2006), Institute of Electrical and Electronics Engineers (IEEE), New York, 630–633.
Lowe, D. G. (2004). “Distinctive image features from scale-invariant keypoints.” Int. J. Comput. Vis., 60(2), 91–110.
Nistér, D. (2004). “An efficient solution to the five-point relative pose problem.” IEEE Trans. Pattern Anal. Mach. Intell., 26(6), 756–770.
Park, M.-W., Makhmalbaf, A., and Brilakis, I. (2011). “Comparative study of vision tracking methods for tracking of construction site resources.” Autom. Constr., 20(7), 905–915.
Pizarro, O., Eustice, R., and Singh, H. (2003). “Relative pose estimation for instrumented, calibrated platforms.” Digital image computing: Techniques and applications, Proc., 7th Biennial Australian Pattern Recognition Society Conf., DICTA 2003, C. Sun, H. Talbot, S. Ourselin, and T. Adriaansen, eds., CSIRO, Collingwood, Australia, 601–612.
Point Grey. (2011). Stereo vision camera catalog, Point Grey Research, Richmond, BC, Canada.
Rashidi, A., Dai, F., Brilakis, I., and Vela, P. (2011). “Comparison of camera motion estimation methods for 3D reconstruction of infrastructure.” Proc., ASCE Int. Workshop on Computing in Civil Engineering, ASCE, Reston, VA.
Ross, D., Lim, J., Lin, R.-S., and Yang, M.-H. (2008). “Incremental learning for robust visual tracking.” Int. J. Comput. Vis., 77(1), 125–141.
Son, H., and Kim, C. (2010). “3D structural component recognition and modeling method using color and 3D data for construction progress monitoring.” Autom. Constr., 19(7), 844–854.
Song, J., Caldas, C. H., Ergen, E., Haas, C., and Akinci, B. (2004). “Field trials of RFID technology for tracking pre-fabricated pipe spools.” Proc., 21st Int. Symp. on Automation and Robotics in Construction, International Association for Automation and Robotics in Construction, Eindhoven, Netherlands.
Song, J., Haas, C., Caldas, C., Ergen, E., and Akinci, B. (2006). “Automating pipe spool tracking in the supply chain.” Autom. Constr., 15(2), 166–177.
Teizer, J., Caldas, C. H., and Haas, C. T. (2007a). “Real-time three-dimensional occupancy grid modeling for the detection and tracking of construction resources.” J. Constr. Eng. Manage., 133(11), 880–888.
Teizer, J., Lao, D., and Sofer, M. (2007b). “Rapid automated monitoring of construction site activities using ultra-wideband.” Proc., 24th Int. Symp. on Automation and Robotics in Construction, International Association for Automation and Robotics in Construction, Eindhoven, Netherlands, 23–28.
Torr, P. H. S. (2002). “Bayesian model estimation and selection for epipolar geometry and generic manifold fitting.” Int. J. Comput. Vis., 50(1), 35–61.
Yang, J., Arif, O., Vela, P. A., Teizer, J., and Shi, Z. (2010). “Tracking multiple workers on construction sites using video cameras.” Adv. Eng. Inf., 24(4), 428–434.
Zhang, Z. (1999). “Flexible camera calibration by viewing a plane from unknown orientations.” Proc., 7th IEEE Int. Conf. on Computer Vision, Vol. 1, Institute of Electrical and Electronics Engineers (IEEE), New York, 666–673.