Recognizing Diverse Construction Activities in Site Images
via Relevance Networks of Construction-Related Objects
Detected by Convolutional Neural Networks
Xiaochun Luo1; Heng Li2; Dongping Cao3; Fei Dai, M.ASCE4;
JoonOh Seo5; and SangHyun Lee, M.ASCE6
Abstract: Timely and overall knowledge of the states and resource allocation of diverse activities on construction sites is critical to resource
leveling, progress tracking, and productivity analysis. Despite its importance, this task is still performed manually. Previous studies have
taken a significant step forward in introducing computer vision technologies, although they have been oriented toward limited classes of
objects or limited types of activities. Furthermore, they especially focus on single activity recognition, where an image contains only the
execution of an activity by one or a few objects. This paper introduces a two-step method for recognizing diverse construction activities in still
site images. It detects 22 classes of construction-related objects using convolutional neural networks. With objects detected, semantic
relevance representing the likelihood of the cooperation or coexistence between two objects in a construction activity, spatial relevance
representing the two-dimensional pixel proximity in the image coordinates, and activity patterns are defined to recognize 17 types of
construction activities. The advantage of the proposed method is its potential to recognize diverse concurrent construction activities in
a fully automatic way. Therefore, it is possible to save managers' valuable time in manual data collection and concentrate their attention
on solving problems that necessarily demand their expertise. DOI: 10.1061/(ASCE)CP.1943-5487.0000756. © 2018 American Society of
Civil Engineers.
Author keywords: Construction activity recognition; Convolutional neural networks; Semantic relevance; Spatial relevance; Relevance
networks.
Introduction
Construction sites generally have a large scale, and diverse activ-
ities take place there concurrently. Timely and overall awareness
of activities' states and resource allocation is critical to many
project-level management tasks, including resource leveling, prog-
ress tracking, and productivity analysis. Despite its importance,
the manual approach to activity tracking and resource counting,
which relies on managers' experience and diligence, is still the
mainstream in practice. In the last decade, a considerable amount
of literature has been published concerning visual object detection
(Chi and Caldas 2011; Du et al. 2011; Memarzadeh et al. 2013;
Rezazadeh Azar and McCabe 2012a, b; Zhu et al. 2010) and con-
struction activity recognition (Golparvar-Fard et al. 2013; Gong
et al. 2011; Han and Lee 2013; Han et al. 2013; Ray and Teizer
2012; Rezazadeh Azar et al. 2013; Yang et al. 2016; Zou and
Kim 2007). These studies have contributed to taking a significant
step forward in introducing computer vision technologies to the
time-consuming tasks (Yang et al. 2015).
Previous studies, however, have been primarily oriented toward
limited classes of objects, e.g., workers, excavators, and dump
trucks (Memarzadeh et al. 2013; Park and Brilakis 2012;
Rezazadeh Azar and McCabe 2012b), or limited types of construc-
tion activities, e.g., earthwork and concrete pouring (Bugler et al.
2017; Gong and Caldas 2010; Rezazadeh Azar et al. 2013). They
can hardly be extended to analyze other classes of objects or oper-
ations. One of the possible reasons is that most of them use hand-
crafted features to build their detectors, which are trained with
relatively small data sets of limited object classes. Although these
methods have shown satisfying performance in their envisaged set-
tings, the increased number of object classes poses a big challenge
to them. The results of PASCAL VOC Object Detection Challenges
from 2009 to 2012 (Williams 2012) show that detectors based on
handcrafted features arrived at a performance bottleneck. Further-
more, they especially focus on single activity recognition, where an
image contains only one execution of an activity by one or a few
objects.
Deep learning allows computational models that are composed
of multiple processing layers to learn representations of data with
multiple levels of abstraction (LeCun et al. 2015).
1 Senior Research Fellow, Dept. of Building and Real Estate, Hong Kong Polytechnic Univ., Hung Hom, Kowloon 999077, Hong Kong (corresponding author). E-mail: bsericlo@polyu.edu.hk
2 Chair Professor, Dept. of Building and Real Estate, Hong Kong Polytechnic Univ., Hung Hom, Kowloon 999077, Hong Kong. E-mail: bshengli@polyu.edu.hk
3 Assistant Professor, Dept. of Construction Management and Real Estate, School of Economics and Management, Tongji Univ., 1239 Siping Rd., Shanghai 200092, China. E-mail: dongping.cao@tongji.edu.cn
4 Assistant Professor, Dept. of Civil and Environmental Engineering, West Virginia Univ., Morgantown, WV 26506-6103. E-mail: fei.dai@mail.wvu.edu
5 Assistant Professor, Dept. of Building and Real Estate, Hong Kong Polytechnic Univ., Hung Hom, Kowloon 999077, Hong Kong. E-mail: joonoh.seo@polyu.edu.hk
6 Associate Professor, Dept. of Civil and Environmental Engineering, Tishman Construction Management Program, Univ. of Michigan, Ann Arbor, MI 48109. E-mail: shdpm@umich.edu
Note. This manuscript was submitted on May 15, 2017; approved on October 25, 2017; published online on February 16, 2018. Discussion period open until July 16, 2018; separate discussions must be submitted for individual papers. This paper is part of the Journal of Computing in Civil Engineering, © ASCE, ISSN 0887-3801.
These methods have dramatically improved the state of the art in visual object clas-
sification (Krizhevsky et al. 2012) and object detection (Girshick
et al. 2014). The stellar success stories of deep learning from the
computer vision domain motivate the authors to address the previ-
ously mentioned concerns with it.
This paper introduces a method for recognizing diverse con-
struction activities in still site images. It uses convolutional neural
networks (CNNs), with Faster R-CNN (Ren et al. 2015) as the region
proposal network and ResNet-50 (He et al. 2015) as the object
detection network, to detect those frequently observed objects,
establishes relevance networks of the objects, and recognizes activ-
ities by pattern matching. The advantage of the proposed technol-
ogy is its potential to recognize diverse concurrent construction
activities in a fully automatic way. Therefore, it is possible to save
managers' valuable time in manual data collection and concentrate
their attention on solving problems that necessarily demand their
expertise.
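As a rough illustration of this detection step (the authors' original implementation is not reproduced here), the sketch below builds a Faster R-CNN detector with a ResNet-50 backbone using the torchvision library; the library choice, the file name, the confidence threshold of 0.5, and the class count of 23 (the 22 construction classes plus background) are assumptions made for illustration only.

```python
# Minimal sketch: a Faster R-CNN detector with a ResNet-50 backbone built with
# torchvision (assumed >= 0.13). num_classes = 23 assumes the paper's 22 object
# classes plus the background class; all names and thresholds are illustrative.
import torch
import torchvision
from torchvision import transforms
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None, weights_backbone=None, num_classes=23)
# In practice, the model would first be fine-tuned on the annotated site images.
model.eval()

image = Image.open("site_image.jpg").convert("RGB")  # hypothetical file name
tensor = transforms.ToTensor()(image)                # HWC uint8 -> CHW float in [0, 1]

with torch.no_grad():
    prediction = model([tensor])[0]  # dict with 'boxes', 'labels', and 'scores'

keep = prediction["scores"] > 0.5    # assumed confidence threshold
print(prediction["boxes"][keep], prediction["labels"][keep])
```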
In the rest of this paper, the related work on visual object
detection and activity recognition conducted by both the computer
vision community and the construction community is firstly re-
viewed. Then, the method development is presented in three parts:
(1) preparing the data sets for transfer learning of a pretrained
ResNet-50 model, (2) developing the key concepts and rules for
relevance network creation and the patterns for activity recognition,
and (3) evaluating the performance of Faster R-CNN on object
detection and that of the proposed method on diverse construction
activity recognition. Finally, the contribution to knowledge, re-
search limitations, and future work are discussed.
Background and Related Work
Object Detection in Computer Vision
Object detection is one of the central topics in the computer vision
community, and its methodologies for feature engineering have
evolved from handcrafted feature descriptors to deep-learning ar-
chitectures in the last two decades (LeCun et al. 2015). The motive
forces behind the transition include the development of computer
architectures (e.g., graphics processing units and multicore com-
puter systems), the advent of large-scale data sets such as ImageNet
(Deng et al. 2009), and the better designs for modeling and training
deep networks (Hinton 2009).
Among the typical handcrafted features are the scale-invariant
feature transform (SIFT), histogram of oriented gradients (HOG),
and deformable part-based model (DPM). Lowe (1999) introduced
SIFT as a local feature descriptor, which addresses the problem of
image patch comparison and significantly propels the development
of structure from motion (Haming and Peters 2010). Dalal and
Triggs (2005) introduced HOG, which counts occurrences of gra-
dient orientations in localized portions of an image. In combination
with the linear support vector machine (SVM) (Andrew 2013) as
the classifier, it becomes a favorite feature descriptor for object
classification and detection. Felzenszwalb et al. (2008) introduced
DPM, which includes a coarse global template covering an entire
object and higher-resolution part templates represented by HOG.
DPM reinforces the popularity and strength of HOG by addressing
the limitation of HOG in independently representing and detecting
deformable objects such as pedestrians and excavators.
In 2012, interest in CNNs was rekindled by Krizhevsky et al.
(2012) by showing substantially higher image classification accu-
racy on the ImageNet Large Scale Visual Recognition Challenge.
From then on, the most effective algorithms for image classifica-
tion, object recognition, and visual tracking were developed mostly
based on CNNs. For example, Hong et al. (2015) introduced an
online visual tracking algorithm by learning a discriminative
saliency map using a CNN and achieved the best result in the Visual
Object Tracking challenge 2015 (Kristan et al. 2015).
Girshick et al. (2014) introduced regions with CNN features
(R-CNN) for object detection and semantic segmentation by com-
bining region proposals with CNNs. R-CNN achieves a mean aver-
age precision (mAP) of 53.3% on the data set PASCAL VOC
(2012), which improves mAP by more than 30% relative to the
previous best result. A follow-up study (Girshick 2015) introduced
Fast R-CNN to efficiently classify object proposals using deep con-
volutional networks, which improves the training and testing speed.
Ren et al. (2015) introduced Faster R-CNN toward real-time object
detection with region proposal networks, which share full-image
convolutional features with the detection network and thus enable
cost-free region proposals.
Visual Object Detection in Construction
In the past decade, a considerable amount of literature has been
published on vision-based object detection in the construction do-
main (Du et al. 2011; Memarzadeh et al. 2013; Park and Brilakis
2012; Rezazadeh Azar and McCabe 2012a, b; Son et al. 2012; Zhu
et al. 2010). A major family of these studies focused on detecting
construction equipment and workers (Chi and Caldas 2011;
Memarzadeh et al. 2013; Teizer et al. 2007). Park and Brilakis
(2012) introduced a three-step method to detect workers in videos:
using background subtraction to identify moving objects, using
SVM to classify the human-shaped objects according to the HOG
shape features of those objects, and finally using k-NN to detect
workers based on the color histogram of those human-shaped ob-
jects. Similarly, Chi and Caldas (2011) proposed another three-step
method, but with different feature descriptors and classifiers, to de-
tect mobile heavy equipment and workers. In their method, moving
objects are segmented with background subtraction. These seg-
mented regions are factorized with geometric attributes, e.g., the
position of region centroid, occupied area size in pixels, and aspect
ratio. Eventually, these geometric attributes in combination with the
gray information of the area are classified with a Bayes classifier
and an artificial neural network classifier independently.
Background subtraction, however, is based on a static back-
ground hypothesis, which limits its use in moving or jiggling
videos. To address the problem, Memarzadeh et al. (2013) proposed
to use HOG and colors to describe excavators, dump trucks, and
workers in videos and directly detect them by classifying the areas
of sliding windows in each frame with SVM. Rezazadeh Azar and
McCabe (2012b) introduced a DPM-like method to detect hydraulic
excavators in images and videos. The method uses HOG features to
detect the first part of the arm (the boom), searches the adjacent
part (the dipper) to finalize the recognition, and finally determines
the pose of the excavator based on the location of the dipper.
Object detection methods based on HOG features suffer from
their window sliding operation and are computationally expensive.
As noted by Rezazadeh Azar and McCabe (2012a), scanning a
1,024 × 768-pixel image with such a method for detecting dump
trucks from eight viewpoints takes 69 s. To speed up the detection
process, they proposed to use a fast classifier based on Haar-like
features for static images and a movement filter using Bayes deci-
sion rules for video frames to evaluate the confidence level to which
an object of interest exists in the sliding window before using the
expensive HOG detector.
Several other studies focused on detecting construction compo-
nents, e.g., concrete columns in videos (Zhu et al. 2010) and con-
crete structural components in images (Son et al. 2012), with edge
and color features. Zhu et al. (2010) proposed to locate edge points
using column colors, identify the orientation information of edge
points using the Hough transform, and finally detect columns using
an artificial neural network classifier. Son et al. (2012) proposed a
method for transforming the RGB color space to non-RGB color
spaces to raise detection robustness in various illumination condi-
tions, and SVM proved to be the superior classifier for the non-
RGB spaces.
Earlier studies by Brilakis and Soibelman (2006) and Brilakis
et al. (2005) focused on classifying site images by detecting how
much area a specific material (e.g., earth, concrete, and paint) occu-
pies in the image plane using a technique named content-based im-
age retrieval, which detects materials in four steps: decomposing
images into their basic features (e.g., color, texture, structure, etc.)
by applying a series of filters; clustering these regions; computing
region feature signatures; and detecting regions of interest by com-
paring each signature with annotated samples in a material database.
In summary, previous research on visual object detection in
construction has primarily focused on detecting workers and equip-
ment, and the number of object classes that they can detect is lim-
ited. None of the studies reviewed appears to address the detection
problem of diverse objects. Research involving construction mate-
rials aims at image retrieval based on their basic features, e.g., color,
texture, and structure, rather than object detection. One of the rea-
sons could be that they primarily use handcrafted features to build
their classifiers, which are trained with relatively small data sets
with limited object types. Although they have shown satisfying per-
formance in their envisaged settings, it remains a challenge to them
to handle the increased number of object classes. The evidence
from the computer vision community, e.g., the results of PASCAL
VOC Object Detection Challenges (Williams 2012), shows that
detectors based on handcrafted features have arrived at a performance
bottleneck. The best average precisions (APs) of people detection
with the training and test data provided by PASCAL VOC from
2005 to 2012 were 0.013, 0.164, 0.221, 0.420, 0.415, 0.475, 0.516,
and 0.461, respectively (PASCAL VOC 2012).
Human Activity Recognition in Computer Vision
Human activity recognition is an active research topic in the
computer vision community with many important applications,
including human-computer interfaces, content-based video index-
ing, video surveillance, and robotics (Aggarwal and Cai 1999;
Aggarwal and Ryoo 2011; Egnor and Branson 2016; Vrigkas
et al. 2015; Weinland et al. 2011). The relevant methods can be
categorized into four groups: shape-based, space-time, stochastic,
and rule-based (Vrigkas et al. 2015). Shape-based methods re-
present human activities with two- or three-dimensional skeleton
pose models and recognize activities by human pose classification
(Lillo et al. 2014). Space-time methods recognize activities based
on pixel-based space-time features across frames. Optical flow has
been proved to be one of the critical cues in implementing these
methods (Shechtman and Irani 2005;Wang and Schmid 2011,
2013). Stochastic methods model human activities by considering
an activity entity as a stochastically predictable sequence of states.
The primary stochastic techniques used in activity recognition in-
clude hidden Markov models (Murphy and Paskin 2001; Rabiner
and Juang 1986) and conditional random fields (Lafferty et al.
2001; Quattoni et al. 2007). Rule-based approaches model activ-
ities with a set of constraints describing atomic actions or a set of
activity patterns and recognize them by logic reasoning or probabi-
listic pattern matching (Morariu and Davis 2011).
Construction Activity Recognition and Productivity
Analysis
Vrigkas et al. (2015) classified human activities into six levels
regarding their complexity: (1) gestures, (2) atomic actions,
(3) human-to-object or human-to-human interactions, (4) group
actions, (5) behaviors, and (6) events. Since human behaviors and
events are high-level activities that involve emotions, personality,
psychological states, and social roles, this study excludes the
two levels at the current stage and extends the remaining levels to
cover the activities by construction equipment. Table 1 shows the
taxonomy of construction activities of interest.
The existing literature on construction activity recognition is
extensive and focuses particularly on the productivity analysis of
construction equipment (Bugler et al. 2017; Gong and Caldas
2010; Rezazadeh Azar et al. 2013; Yang et al. 2014; Zou and Kim
2007). Pioneering this task, Gong and Caldas (2010) proposed a
rule-based interaction activity recognition method to analyze the
productivity of concrete pouring of a tower crane and concrete
buckets. The method breaks down construction operations into a
variety of working task elements (semantic context) and describes
how these elements unfold in planned locations (spatial context)
and sequences (temporal context). The spatial context is integrated
Table 1. Construction Activity Taxonomy Regarding Activity Complexity
Activity Definition Examples
Gestures Primitive movements of the body parts of an object
that may correspond to an action of this object(a)
Workers: walking; standing with legs upright, one leg upright, legs bent, or one
leg bent; sitting; kneeling on one leg bent; etc.
Excavators: swinging left or right, lowering or raising the boom, closing or
dumping the bucket, sticking out or in, etc.
Atomic actions Movements of an object describing a certain motion
that may be part of more complex activities
Rebar workers: transporting rebars, sorting rebars, placing concrete spacers,
fixing rebars using a hand tool or a mechanical means, etc.
Excavators: digging earth, leveling earth, transporting earth, unloading earth to
a dump truck, etc.
Interactions Activities that involve two or more objects Transporting a prefabricated mesh to working areas with a tower crane, which
involves multiple-stage interactions between workers with the tower crane: a
worker winding and securing the rope, a worker instructing the tower crane to
move, and a worker positioning and unloading the mesh.
Group activities Activities performed by a group of objects Placing concrete: workers preparing concrete areas; transporting concrete to
concrete areas with a crane and a bucket (or a track-mounted concrete pump);
and workers placing, spreading, compacting, and leveling concrete.
(a) Object represents humans and construction equipment for the sake of simplicity.
with video scenes by manually specifying regions in the video
scenes and assigning a working state to each of these regions.
In other words, when the bucket enters a specific region, it signals
that the work task transforms into the assigned working state.
Bugler et al. (2017) proposed a rule-based interaction activity
detection method for analyzing earthwork productivity of excava-
tors and dump trucks. In their method, the activity state (i.e., static,
moving, absent, or filling) is checked when an excavator and a
dump truck are in proximity. They use photogrammetry to estimate
the volume of the removed earth and derive the productivity of
earthwork. Similarly, Rezazadeh Azar et al. (2013) proposed an-
other rule-based method for analyzing dirt loading cycles of exca-
vators and dump trucks. They combine logical reasoning with an
SVM classifier to achieve better detection robustness and accuracy.
The logic reasoning checks equipment orientations for filling, and
then the SVM classifier detects the earthwork state according to the
distances between the base point of the excavator and four corners
of the dump truck.
Publications that concentrate on atomic action recognition
in construction more frequently adopt space-time methods.
Golparvar-Fard et al. (2013) presented a method using spatiotem-
poral features and SVM classifiers to understand the action of
earthmoving equipment (excavators and trucks). Yang et al. (2016)
introduced a study using dense trajectories to recognize workers'
actions.
Yang et al. (2014) presented a stochastic activity recognition
method to infer two-state tower crane activities (concrete pouring
and nonconcrete material movement) using crane jib trajectories
and site layout information. In the method, the jib angle trajectory
is tracked with a 2D-to-3D rigid pose tracking algorithm, and a
probabilistic graph model is introduced to process the tracking
results as well as recognize crane activities.
Previous studies on vision-based construction activity recogni-
tion were likely influenced by the object detection methods. As a
result, they primarily focused on limited types of activities con-
ducted by those objects, which are easy to detect based on the hand-
crafted features. These methods can hardly be extended to analyze
other activities. Furthermore, they especially focus on single activ-
ity recognition, where an image contains only the execution of a
single activity by one or a few objects. However, overall awareness
of the states and issues of project-level tasks, e.g., resource leveling,
progress tracking, and productivity analysis, requires the informa-
tion of diverse, concurrent activities. There is a need for such a
technique that can detect various objects in site images and recog-
nize the construction activities relevant to them.
Method Development
This study addresses the recognition problem of interaction activ-
ities. For the sake of simplicity, the term activities is used to refer
to the interaction activities hereafter. The method for construction
activity recognition consists of two steps: object detection and ac-
tivity recognition. To detect objects in site images, Faster R-CNN
(Ren et al. 2015) is used as the region proposal network and
ResNet-50 (He et al. 2015) as the object detection network.
ResNet won several first places in such tracks as ImageNet Clas-
sification, ImageNet Detection, ImageNet Localization, COCO
Detection, and COCO Segmentation (ImageNet and Microsoft
COCO 2015) and therefore represents the state of the art in these
domains.
Given objects detected with the deep-learning model, this study
introduces relevance networks and activity patterns to recognize
construction activities. A relevance network is created based on
two concepts: semantic relevance and spatial relevance. Semantic
relevance represents the likelihood of the cooperation or coexist-
ence between two objects in a construction activity, while spatial
relevance represents two-dimensional (2D) pixel distances in the
image coordinate. This study establishes 20 activity patterns based
on objects in relevance networks to recognize 17 types of construc-
tion activities. Fig. 1 summarizes the overview of system development and application regarding the three workflows: training Faster R-CNN, testing Faster R-CNN, and testing the method.

Fig. 1. Overview of system development and application
Training and Test Data Sets
Data sets are critical to training and testing deep neural networks.
This study focuses on analyzing images taken at the foundation and structure construction stages of building projects, and a total of 22 classes of objects frequently observed are covered. Table 2
describes the object taxonomy, where a three-tier (categories,
subcategories, and classes) tree-view structure is adopted. First,
the objects are grouped into four categories: workers, materials/
products, equipment, and general vehicles. The second tier (i.e., the
subcategories) unfolds under the first tier. For example, the cat-
egory of materials/products is further divided into four subcate-
gories: concrete related, formwork related, rebar related, and
scaffolding related. The last tier is composed of the classes under
the subcategories. For instance, there are two classes in concrete-
related materials/products: concrete in placing and concrete in
finishing. There is an exception with workers and general vehicles,
which are not further divided into subcategories due to the diffi-
culty of recognizing their trades or activities based on visual
features alone.
This study used three image sources to construct the data sets.
ImageNet is a large-scale ontology of images built upon the back-
bone of the WordNet structure (Deng et al. 2009). The authors col-
lected most of the ordinary objects in the vehicles and equipment
classes, e.g., automobiles, backhoes, and cranes, from ImageNet.
Another important image source is the online image repositories,
including Google Images, Baidu Images, Bing Images, and Yahoo
Flickr. Images were searched with keywords relevant to such con-
struction materials as rebar, formwork, scaffolding, and concrete.
In addition, 763 site images taken from four building construction
sites in Hong Kong were included. Unlike the images from the first
two sources, the images that the researchers took on sites reflected
the actual situations in which the proposed method will be applied.
Fig. 2 shows some samples of them.
Finally, a total of 7,790 images were collected and annotated in
the PASCAL VOC format, which represents objects with bounding
boxes and class labels and saves the annotations with XML files
(Everingham and Winn 2007). The open source toolkit LabelImg
(Tzu 2015) was employed to manually perform the annotating pro-
cess, which took around 200 work hours in 3 weeks. The process
cannot be automated since the images collected by the researchers
are still site images from various sources, including the online
image repositories and local construction projects. Table 2 also
shows the statistics of the annotated objects in these images;
6,232 (80%) images were used to build the training set, and
1,558 (20%) images were used as the test set by randomly selecting
one in every five images out of the general data set.
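For illustration of the data preparation just described, the sketch below reads one LabelImg/PASCAL VOC annotation file and applies the one-in-five train/test split; the directory layout and helper name are hypothetical.

```python
# Sketch: reading a PASCAL VOC (LabelImg) annotation and splitting images 80/20 by
# taking every fifth image as a test sample. Paths are hypothetical.
import glob
import xml.etree.ElementTree as ET

def read_voc_annotation(xml_path):
    """Return a list of (class_label, xmin, ymin, xmax, ymax) tuples."""
    root = ET.parse(xml_path).getroot()
    objects = []
    for obj in root.findall("object"):
        label = obj.find("name").text
        box = obj.find("bndbox")
        coords = [int(box.find(k).text) for k in ("xmin", "ymin", "xmax", "ymax")]
        objects.append((label, *coords))
    return objects

annotation_files = sorted(glob.glob("annotations/*.xml"))
test_set = annotation_files[::5]                                # every fifth image (20%)
train_set = [f for f in annotation_files if f not in test_set]  # remaining 80%
print(len(train_set), "training files,", len(test_set), "test files")
```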
Semantic Relevance
This study introduces semantic relevance Rse to represent the
likelihood of the cooperation or coexistence of two objects in a
construction activity.
Table 2. Object Classes and Statistics of the Annotated Objects in the Training and Test Data Sets
Columns: Code; Class; Total b.b.; Training data set: Num. b.b., Size avg., Size std., Size c.v.; Test data set: Num. b.b., Size avg., Size std., Size c.v.
Workers
WKR(a) Worker 2,994 2,404 109 78 0.71 590 112 75 0.67
Materials/products
Concrete related
M-CCP Concrete in placing 255 205 187 71 0.38 50 186 83 0.45
M-CCF Concrete in finishing 184 144 234 98 0.42 40 228 90 0.39
Formwork related
M-FMW Formwork 256 215 226 168 0.75 41 290 232 0.80
M-FSB Formwork of slabs and beams 249 197 566 421 0.74 52 630 505 0.80
M-FWC Formwork of walls and columns 1,273 991 272 189 0.69 282 260 186 0.72
M-FSS Formwork of stairs 175 144 205 78 0.38 31 226 113 0.50
Rebar related
M-REB Rebar 467 384 280 202 0.72 83 293 222 0.76
M-RSB Rebar of slabs and beams 339 283 330 279 0.85 56 294 256 0.87
M-RWC Rebar of walls and columns 1,323 1,041 297 203 0.68 282 277 190 0.69
Scaffolding related
M-SCF Scaffolding 808 639 433 245 0.57 169 457 245 0.54
M-SSF Scaffolding of slab formwork 190 160 300 115 0.38 30 286 42 0.15
Activity-specific equipment
Earthwork related
E-BKH Backhoe 737 595 435 317 0.73 142 423 294 0.70
E-BDZ Bulldozer 225 181 416 127 0.31 44 422 117 0.28
E-DMP Dump truck 236 186 432 270 0.63 50 468 281 0.60
Concrete related
E-CCB Concrete bucket 467 368 219 97 0.44 99 210 93 0.44
E-CCM Concrete mixer 617 496 401 166 0.41 121 454 245 0.54
E-CCP Concrete pump 411 330 285 124 0.44 81 291 129 0.44
Material delivery related
E-CRA Crane 802 638 416 200 0.48 163 400 162 0.41
E-LRY Lorry 437 352 456 130 0.29 85 473 152 0.32
General vehicles
E-VAN(a) Van 451 363 451 143 0.32 88 483 125 0.26
E-CAR(a) Car 1,088 862 383 219 0.57 226 376 279 0.74
Note: Num. b.b. = number of bounding boxes; Total b.b. = total number of bounding boxes. Sizes of ground-truth bounding boxes are defined with pixel
numbers on their diagonal. Size avg. = average of the sizes of bounding boxes; Size std. = sizes' standard deviation; Size c.v. = sizes' coefficient of variance.
(a) The general classes, which can be present in various construction activities and are of limited activity indication capability.
In this study, Rse scores between classes are quantified with a 5-point Likert scale with intervals of 0.25.
Fig. 3 shows the thumb rules for establishing the scores. Rse = 1.0
represents the homogeneous relevance (i.e., two objects are from
the same class), while Rse = 0 represents an outcome of no semantic
relevance between the two classes. Fig. 4 illustrates the establish-
ment of these scores.
Rse = 0.75 represents the relationship between classes within
each subcategory. In the subcategories of activity-specific equip-
ment, it denotes intrasubcategory alternative or cooperative pos-
sibilities. For example, concrete buckets can be an alternative
to concrete pumps, and there can be cooperative operation be-
tween, e.g., concrete mixers and concrete buckets in concrete plac-
ing. Similarly, in materials/products, the score 0.75 represents that
two classes are within the same subcategory and apt to be
temporally successive. They can be similar temporary products,
e.g., formwork of walls and columns and that of slabs and beams
in formwork construction.
Rse = 0.5 represents a strong relationship between classes of
different subcategories. In materials/products, the score indicates
a supportive relationship between two classes. They can be two
classes of materials/products, among which one is being treated,
and another supports the treatment, e.g., concrete in placing and
formwork of slabs and beams. It can be the direct relationship
of handling and being handled between workers and materials/
products. Workers and their appurtenance can also be quantified
at this level, e.g., workers along with a backhoe leveling land and
tiling concrete panels, and workers placing concrete with a con-
crete bucket. Additionally, it can also represent a kind of material
and a kind of equipment designed to treat that material, e.g., con-
crete in placing and a concrete bucket (or a concrete pump).
Fig. 2. Image samples (images by Xiaochun Luo)
Fig. 3. Semantic relevance rules
Rse = 0.25 represents the weak relationship between classes of dif-
ferent subcategories. In the subcategories of equipment (including
activity-specific equipment and general vehicles), it represents the
potential of indirect cooperation between its subcategories. For ex-
ample, a tower crane can cooperate with a concrete mixer indirectly
through a concrete bucket in a concrete-placing activity. In the sub-
categories of materials/products, the score indicates the subsequent
relationship between two classes. For example, formwork of slabs
and beams directly supporting concrete in placing results in Rse = 0.5.
Concrete in placing subsequently becomes concrete in finishing,
whose Rse with the formwork of slabs and beams turns into 0.25.
Moreover, the 0.25 score can also represent the indirect relationship
of handling and being handled between equipment and materials/
products. For example, a crane is indirectly connected to concrete
in placing by a concrete bucket.
Finally, a semantic relevance matrix is established to quantita-
tively represent the relations between the 22 classes, as shown in
Table 3.
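To make the use of this matrix concrete, the sketch below encodes a small excerpt of Table 3 as a symmetric lookup; only a handful of entries are included, and the helper name is illustrative.

```python
# Sketch: a partial, symmetric lookup of semantic relevance scores (excerpt of Table 3).
R_SE = {
    ("WKR", "M-RWC"): 0.5,    # worker handling rebar of walls and columns
    ("WKR", "E-BKH"): 0.5,    # worker working alongside a backhoe
    ("E-BKH", "E-DMP"): 0.75, # backhoe and dump truck cooperating in earthwork
    ("E-CCB", "E-CCM"): 0.75, # concrete bucket and concrete mixer in concrete placing
    ("M-CCP", "E-CRA"): 0.25, # crane indirectly linked to concrete in placing
}

def semantic_relevance(class_a, class_b):
    """Return R_se for two object classes (1.0 for the same class, 0 if unlisted)."""
    if class_a == class_b:
        return 1.0
    return R_SE.get((class_a, class_b), R_SE.get((class_b, class_a), 0.0))

print(semantic_relevance("E-DMP", "E-BKH"))  # 0.75
```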
Spatial Relevance
Camera angles and distances can significantly influence the spatial
relationship of objects in the 2D image coordinates. For example,
when photographing workers, materials, and equipment on the top
working floor of a building project, a horizontal, or almost horizon-
tal, shooting angle can compact their visual distances in the image
such that some objects seem to occlude or overlap others; that is,
relatively horizontal shooting angles introduce a distance compaction
problem. On the contrary, a vertical angle will result in a more
accurate record of their physical distances.
To represent the complicated physical proximity in the field of
view, the researchers introduce the concept of spatial relevance Rsp.
Three situations in defining the spatial relevance of objects A and B
are considered based on the relation of their bounding boxes: one
bounding box contains, intersects, or is separated from another
(Fig. 5). In the first two situations, the definition of the overlap in
evaluating the object detection effectiveness is referenced. It is de-
fined as the intersection over union (IoU) of the predicted bounding
box and the ground-truth bounding box. Take the situations in Fig. 6
as an example to illustrate the calculation of spatial relevance.

Fig. 4. Illustration of semantic relevance rules
In the first situation, one bounding box contains another. It sig-
nals that their Rsp has achieved the maximum 1.0. This definition
acknowledges that workers (WKR-3, WKR-4, WKR-5, WKR-6,
WKR-10, and WKR-20) working in the same area (M-RSB-11)
should have the same Rsp value (i.e., 1), even though the bounding
box of one worker (WKR-10) is smaller than that of another (WKR-6)
because of the difference in their camera distances.
In the second situation, one bounding box intersects with
another. The spatial relevance between two objects (M-RSB-11
and WKR-0) is represented with the intersection area (in transpar-
ent red) and the minimum bounding box area (WKR-0). This in-
tersection relevance is set within [0.5, 1) to reflect that objects in
this situation are less relevant than those in the first situation. The
first two situations are represented jointly with the first band of
Eq. (1), where the function area derives the area of a bounding box.
In the third situation, there is no intersection between two
bounding boxes. The spatial relevance between these two separated
objects is defined with lengths and distances, rather than areas.
It falls in (0, 0.5) to reflect that objects in this situation are less
relevant than those in the second situation. Its value is defined with
the second band of Eq. (1), where the function side returns the mini-
mum side length of a bounding box, and the function dist computes
the minimum distance between two bounding boxes. In Fig. 6, the
minimum distance between M-RSB-11 and WKR-1 is illustrated
with the yellow line and that between M-RSB-11 and M-FMW
with the blue line.
Consequently, the spatial relevance between any two objects is a scalar in (0, 1] without units:

R_{sp}(A,B) =
\begin{cases}
\dfrac{1}{2}\left(1 + \dfrac{\operatorname{area}(A \cap B)}{\min(\operatorname{area}(A),\, \operatorname{area}(B))}\right), & \text{if } A \cap B \neq \emptyset \\[8pt]
\dfrac{\operatorname{side}(A) + \operatorname{side}(B)}{2\,[\operatorname{side}(A) + \operatorname{side}(B) + \operatorname{dist}(A,B)]}, & \text{otherwise}
\end{cases}
\qquad (1)
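For illustration, Eq. (1) can be evaluated directly from two bounding boxes given as (xmin, ymin, xmax, ymax) pixel tuples, as in the sketch below; the helper names mirror the functions area, side, and dist in the equation.

```python
# Sketch: spatial relevance R_sp of Eq. (1) for two axis-aligned bounding boxes
# given as (xmin, ymin, xmax, ymax) pixel tuples.
def area(b):
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def side(b):
    return min(b[2] - b[0], b[3] - b[1])  # minimum side length of a box

def dist(a, b):
    # Minimum distance between two boxes (0 if they touch or overlap)
    dx = max(b[0] - a[2], a[0] - b[2], 0)
    dy = max(b[1] - a[3], a[1] - b[3], 0)
    return (dx ** 2 + dy ** 2) ** 0.5

def spatial_relevance(a, b):
    # Intersection rectangle of the two boxes
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    if inter > 0:  # containment or intersection: R_sp in [0.5, 1]
        return 0.5 * (1 + inter / min(area(a), area(b)))
    # separated boxes: R_sp in (0, 0.5)
    s = side(a) + side(b)
    return s / (2 * (s + dist(a, b)))

print(spatial_relevance((0, 0, 10, 10), (2, 2, 5, 5)))     # containment -> 1.0
print(spatial_relevance((0, 0, 10, 10), (20, 0, 30, 10)))  # separated -> < 0.5
```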
Relevance Networks
Given the semantic relevance Rse and spatial relevance Rsp of two
objects, their relevance is defined as the product of Rse and Rsp.
A relevance network is composed of nodes and edges. A node
represents a detected object in the form of a circle, whose center
corresponds to the center of its bounding box in the image. An edge
represents that the two connected objects are relevant, and the
relevance score between the two objects determines the width of
the edge.
This study differentiates between workers and nonworker nodes
in creating relevance networks because the direct connection
between workers presents low activity indication capability in
comparison with their connection with nonworker objects.
Table 3. Semantic Relevance Matrix of the 22 Classes
Class WKR M-CCP M-CCF M-FMW M-FSB M-FWC M-FSS M-REB M-RSB M-RWC M-SCF M-SSF E-BKH E-BDZ E-DMP E-CCB E-CCM E-CCP E-CRA E-LRY E-VAN E-CAR
WKR 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.25 0.25 0.5 0.5 0.25 0.25 0.5 0.25 0.25
M-CCP 0.5 1.0 0.75 0 0.5 0.5 0.5 0 0.5 0.5 0 0 0 0.25 0 0.5 0.5 0.5 0.25 0 0 0
M-CCF 0.5 0.75 1.0 0 0.25 0.25 0.25 0 0.25 0.25 0 0 0 0 0 0.25 0.25 0.25 0 0 0 0
M-FMW 0.5 0 0 1.0 0.75 0.75 0.75 0 0.25 0.5 0.25 0.25 0 0 0 0 0 0 0.5 0.5 0 0
M-FSB 0.5 0.5 0.25 0.75 1.0 0.75 0.75 0.5 0.5 0.25 0.25 0.25 0 0 0 0 0 0 0 0 0 0
M-FWC 0.5 0.5 0.25 0.75 0.75 1.0 0.75 0.25 0.25 0.5 0.25 0.25 0 0 0 0 0 0 0 0 0 0
M-FSS 0.5 0.5 0.25 0.75 0.75 0.75 1.0 0.25 0.25 0.25 0.25 0.25 0 0 0 0 0 0 0 0 0 0
M-REB 0.5 0 0 0 0.5 0.25 0.25 1.0 0.75 0.75 0 0 0 0 0 0 0 0 0.5 0.5 0 0
M-RSB 0.5 0.5 0.25 0.25 0.5 0.25 0.25 0.75 1.0 0.75 0 0 0 0 0 0.5 0.5 0.5 0 0 0 0
M-RWC 0.5 0.5 0.25 0.5 0.25 0.5 0.25 0.75 0.75 1.0 0 0 0 0 0 0.5 0.5 0.5 0 0 0 0
M-SCF 0.5 0 0 0.25 0.25 0.25 0.25 0 0 0 1.0 0.75 0 0 0 0 0 0 0 0 0 0
M-SSF 0.5 0 0 0.25 0.25 0.25 0.25 0 0 0 0.75 1.0 0 0 0 0 0 0 0 0 0 0
E-BKH 0.5 0 0 0 0 0 0 0 0 0 0 0 1.0 0.75 0.75 0 0 0 0 0 0 0
E-BDZ 0.25 0.25 0 0 0 0 0 0 0 0 0 0 0.75 1.0 0.75 0 0 0 0 0 0 0
E-DMP 0.25 0 0 0 0 0 0 0 0 0 0 0 0.75 0.75 1.0 0 0 0 0 0 0 0
E-CCB 0.5 0.5 0.25 0 0 0 0 0 0.5 0.5 0 0 0 0 0 1.0 0.75 0.75 0.5 0 0 0
E-CCM 0.5 0.5 0.25 0 0 0 0 0 0.5 0.5 0 0 0 0 0 0.75 1.0 0.75 0.25 0 0 0
E-CCP 0.25 0.5 0.25 0 0 0 0 0 0.5 0.5 0 0 0 0 0 0.75 0.75 1.0 0 0 0 0
E-CRA 0.25 0.25 0 0.5 0 0 0 0.5 0 0 0 0 0 0 0 0.5 0.25 0 1.0 0.75 0 0
E-LRY 0.5 0 0 0.5 0 0 0 0.5 0 0 0 0 0 0 0 0 0 0 0.75 1.0 0 0
E-VAN 0.25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0 0
E-CAR 0.25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0
Fig. 5. Spatial relationship
Therefore, in the beginning, nonworker objects are added into the
network, and the sum of all relevance scores between each object and
its connected nonworker objects is used as the scale of its radius.
Then, worker nodes are added into the previously established
network. A worker node's most relevant nonworker node N is iden-
tified first, and then the worker node is attached to N if their rel-
evance score is higher than the threshold. The radius of the worker
node is scaled by its relevance score with N, or by zero, which
represents that the worker is working independently.
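The construction procedure just described can be sketched as follows; detections are assumed to be (class code, bounding box) pairs, the relevance callable stands for the product of semantic and spatial relevance, and the 0.25 threshold anticipates the value used later in the experiments.

```python
# Sketch: building a relevance network from detections given as (class_code, box)
# pairs. `relevance` is any callable returning the product of semantic and spatial
# relevance for two detections (e.g., composed from the earlier sketches).
def build_relevance_network(detections, relevance, threshold=0.25):
    nonworkers = [d for d in detections if d[0] != "WKR"]
    workers = [d for d in detections if d[0] == "WKR"]
    edges = []  # (node_a, node_b, relevance score)

    # 1) Connect every pair of sufficiently relevant nonworker objects.
    for i, a in enumerate(nonworkers):
        for b in nonworkers[i + 1:]:
            r = relevance(a, b)
            if r >= threshold:
                edges.append((a, b, r))

    # 2) Attach each worker only to its single most relevant nonworker node, if any.
    for w in workers:
        scored = [(n, relevance(w, n)) for n in nonworkers]
        if scored:
            best, r = max(scored, key=lambda item: item[1])
            if r >= threshold:
                edges.append((w, best, r))
    return edges
```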
Activity Recognition Patterns
This study focuses on recognizing construction activities using vis-
ually detected objects and develops 20 activity patterns based on
the activity indication capabilities of these objects. The 20 patterns
are categorized into four groups regarding their combination
modes, as summarized in Table 4. The activity patterns in the first
group use activity-specific construction equipment to indicate con-
struction activities directly according to the equipment's instanta-
neity. For example, the presence of concrete mixers (E-CCM) or
concrete pump trucks (E-CCP) is transient and informative enough
to show that placing concrete is under way. In practice, not all
equipment presence is transient. Therefore, a combination of differ-
ent pieces of large equipment is more reasonable to derive ongoing
construction activities on typically congested jobsites. For example,
the researchers use the combination of a bulldozer (E-BDZ) and a
dump truck (E-DMP) to represent land leveling.
The activity patterns in the second group are composed merely
of construction materials because of their strong activity indication
capability. For example, concrete in placing (M-CCP), if detected,
can immediately suggest that concrete placing is underway. Sim-
ilarly, the key characteristic of these construction materials is the
instantaneity of their visual features (e.g., colors and image intensity
values), which makes them unique and visually detectable in
site images. Finally, only two patterns are established in this group because this
special requirement narrows down the scope of qualifying material types.
In the third group of patterns, construction activities are derived
jointly by workers and equipment. One of the typical examples
in this group is constructing a foundation with workers and an excavator.
Fig. 6. Illustration of spatial relevance calculation
Table 4. Activity Patterns
Code Pattern Activity
Directly by equipment
DE1 E-BDZ + E-DMP Leveling land
DE2 E-BKH + E-DMP Excavating for foundation
DE3 E-CCM Placing concrete
DE4 E-CCP Placing concrete
DE5 E-LRY Shipping materials
Directly by materials
DM1 M-CCP Placing concrete
DM2 M-CCF Finishing concrete
Jointly by workers and equipment
WE1 WKR + E-BKH Installing foundation components
WE2 WKR + E-CCB Placing concrete
WE3 WKR + E-VAN Transporting goods
WE4 WKR + E-CAR Transporting people
Jointly by workers and materials
WM1 WKR + M-FMW Machining or transferring formwork
WM2 WKR + M-FSB Building formwork of slabs and beams
WM3 WKR + M-FWC Building formwork of walls and columns
WM4 WKR + M-FSS Building formwork of stairs
WM5 WKR + M-REB Machining or transferring rebar
WM6 WKR + M-RSB Fixing rebar of slabs and beams
WM7 WKR + M-RWC Erecting rebar of walls and columns
WM8 WKR + M-SCF Building scaffolding systems
WM9 WKR + M-SSF Building scaffolding for slab formwork
In this pattern, an excavator can be used to lift founda-
tion components, e.g., precast slabs or drainage pipes, and workers
work together to direct the installation process and relocate com-
ponents into place.
In the fourth group, it is proposed to use the concurrence of
workers and materials to indicate ongoing activities. Other than the
materials identified in the second group, most of the construction
materials cannot by themselves indicate whether a construction activity
relevant to them is going on. In this case, workers' concurrence can
indicate jointly that an activity is under way; e.g., fixing rebar of
columns and walls can be detected when rebar of columns and
walls is detected to be relevant to at least one worker.
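As an illustrative sketch (covering only a few of the 20 patterns in Table 4), activity recognition can then be phrased as testing the classes connected in a relevance network against each pattern; the edge format follows the earlier network-building sketch.

```python
# Sketch: matching a few of the activity patterns in Table 4 against a relevance
# network. `edges` follows the earlier sketch: (node_a, node_b, relevance) with each
# node a (class_code, box) pair; `detected_classes` is the set of all detected classes.
PATTERNS = {
    "DE1": ({"E-BDZ", "E-DMP"}, "Leveling land"),
    "DE3": ({"E-CCM"}, "Placing concrete"),
    "DM1": ({"M-CCP"}, "Placing concrete"),
    "WE2": ({"WKR", "E-CCB"}, "Placing concrete"),
    "WM7": ({"WKR", "M-RWC"}, "Erecting rebar of walls and columns"),
}

def recognize_activities(edges, detected_classes):
    """Return the set of activities recognized in one image."""
    linked_class_pairs = {frozenset((a[0], b[0])) for a, b, _ in edges}
    activities = set()
    for required_classes, activity in PATTERNS.values():
        if len(required_classes) == 1:
            # Single-object patterns (e.g., DE3, DM1) fire on mere presence.
            if required_classes <= detected_classes:
                activities.add(activity)
        elif frozenset(required_classes) in linked_class_pairs:
            # Two-object patterns require the pair to be connected in the network.
            activities.add(activity)
    return activities
```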
Experiments and Results
Object Detection
Evaluation Metrics
The first experiment evaluated object detection performance with
precision-recall curves. The number of correct detections is de-
noted as true positive (TP), the number of wrong detections as false
positive (FP), and the number of missed objects as false negative
(FN). Given the three definitions, precision is the first metric, which
is the ratio of TP to TP + FP, and recall is the second one, which
is the ratio of TP to TP + FN. Referencing the requirements in
PASCAL VOC object detection challenges (Everingham and Winn
2012), precision-recall curves are obtained by setting the precision
for recall r to the maximum precision obtained for any recall r′ > r.
This operation can be effectively performed by sorting all detec-
tions according to their confidence scores in descending order.
Eventually, the average precision (AP) measure of a specific class
is computed as the area under its curve, and the mAP is defined as
the mean of the APs of all classes.
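For illustration, the AP computation described above (the precision envelope over increasing recall, integrated under the curve) can be sketched as follows; detections are assumed to be (confidence, is-true-positive) pairs.

```python
# Sketch: average precision from a list of (confidence, is_true_positive) detections
# and the number of ground-truth objects, using the precision envelope described above.
def average_precision(detections, num_ground_truth):
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    points = []  # (recall, precision) after each detection, in descending confidence
    for _, is_tp in detections:
        tp, fp = (tp + 1, fp) if is_tp else (tp, fp + 1)
        points.append((tp / num_ground_truth, tp / (tp + fp)))

    ap, prev_recall = 0.0, 0.0
    for i, (recall, _) in enumerate(points):
        # precision for this recall = max precision at any recall r' >= r
        envelope = max(p for r, p in points[i:])
        ap += (recall - prev_recall) * envelope
        prev_recall = recall
    return ap

print(average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truth=2))
```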
Training Faster R-CNN
In the beginning, the training data set was used to fine-tune the
pretrained ResNet-50 (He et al. 2015) in 100,000 iterations, where
the minibatch size was 64, the momentum 0.9, and the starting
learning rate 0.001, which stepped down to 0.1 times the original
after each 25,000 iterations (i.e., 0.001 base learning rate, step
learning policy, 0.1 gamma, and 25,000 step size). The Faster
R-CNN model consists of two modules: one predicts class-specific
scores and another regresses bounding box locations from the initial
recommended boxes, which are referred to as anchors by Ren et al.
(2015). The training loss combines the regression loss and the
classification loss over each minibatch of the training set (Ren et al. 2015).
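The step learning rate policy quoted above (0.001 base learning rate, 0.1 gamma, 25,000-iteration step size) amounts to the schedule sketched below; the function name is illustrative.

```python
# Sketch: the step learning-rate policy described above (base_lr = 0.001,
# gamma = 0.1, step size = 25,000 iterations, 100,000 iterations in total).
def learning_rate(iteration, base_lr=0.001, gamma=0.1, step_size=25_000):
    return base_lr * gamma ** (iteration // step_size)

assert learning_rate(0) == 0.001
assert learning_rate(25_000) == 0.0001
assert abs(learning_rate(99_999) - 1e-6) < 1e-12
```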
The training process with unstable training losses, as shown in
Fig. 7, qualitatively indicates a noisy converging process, probably
because of the inconsistencies and missed annotations in the train-
ing set. Objects with vague edges are susceptible to inconsistent
annotations, while small objects are prone to missing annotations.
The training took around 14 h on a computer with an Ubuntu 16.04
LTS operation system, an NVidia GTX GeForce 1080 graphics
card, 16-GB random access memory, and an Intel i7-6700K
processor.
Results
APs synthesize three critical factors that determine the object de-
tection performance. First, the confidence scores sort the detections
and acknowledge that those with higher scores have higher prob-
abilities of being judged as positive detections. As a result, precision
drops as recall increases in the precision-recall curves in Fig. 8.
Second, the IoU of the ground-truth bounding box and a recom-
mended bounding box determines whether a detection is positive. The
lower its threshold is, the higher the precision and recall will be.
However, a lower value can allow (or result in) a larger localiza-
tion error between the ground-truth bounding box and the recom-
mended bounding box. Third, nonmaximum suppression
(Neubeck and Van Gool 2006) removes the repeated recommenda-
tions for one object. The lower its overlap threshold is, the
more recommendations will be removed. Therefore, bounding
boxes that stand close together are apt to be suppressed, which reduces
recall. This threshold is conventionally set at 0.7 (PASCAL
VOC 2012; Ren et al. 2015), and this value was adopted in the
experiments.
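For completeness, a greedy nonmaximum suppression pass with the conventional 0.7 overlap threshold can be sketched as follows; boxes are assumed to be (xmin, ymin, xmax, ymax) tuples.

```python
# Sketch: greedy nonmaximum suppression with the conventional 0.7 overlap threshold.
def iou(a, b):
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def non_maximum_suppression(boxes, scores, overlap_threshold=0.7):
    """Keep the highest-scoring box and drop others that overlap it too strongly."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < overlap_threshold]
    return keep
```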
Fig. 7. Training process

Fig. 8 shows five precision-recall curves of the 22 classes
with the IoU thresholds for positive detection of 0.1, 0.3, 0.5,
0.7, and 0.9, respectively. The results with the threshold 0.5 are
conventionally used for comparison (PASCAL VOC 2012; Ren
et al. 2015). Accordingly, the mAP of the proposed model is
67.3%, which is slightly higher than 67.0% of Faster R-CNN +
VGG-16 trained and tested with the PASCAL VOC (2012) data
sets and lower than 75.9% of Faster R-CNN + VGG-16 trained
and tested with the COCO, PASCAL VOC 2007, and PASCAL
VOC (2012) data sets (Girshick et al. 2016).
Obviously, comparison bias is apt to occur because the training
and test data sets differ in size and in object characteristics,
e.g., occlusion levels, sizes, and views. Nevertheless, for preliminary
comparison, the mAP and APs
of six classes (three with the best APs and three with the worst) are
listed in Table 5. The proposed model's performance arrives at
the state of the art regarding mAP but presents a higher variance,
i.e., 0.31 coefficient of variance (CV), compared with 0.23 and
0.16 in the two references.
It was found that the worst detection performance of the model
is on raw construction materials. APs of M-FMW, M-CCP, and
M-REB are 23.1, 26.5, and 32.8%, respectively (Table 6). The
possible reason is that those materials are of free form. It is difficult
for human experts to establish their boundaries, and annotations
of them are easily subject to inconsistency. As shown in Table 6,
these APs are sensitive to the IoU threshold for positive detection.
On the contrary, the model presents the best APs on E-CCM,
E-BKH, and M-SSF (i.e., 90.6, 90.3, and 88.8%). These classes
have relatively distinguishable visual features, e.g., clear edges
and stable textures.
Worker detection is critical to the proposed activity recognition
method since 14 (70%) activity patterns use workers as one of the
primary cues.
Table 5. Detection Performance Comparison Regarding APs
Class Result
Faster R-CNN + VGG-16 trained with PASCAL VOC (2012) data set(a)
Cat 87.3
Dog 86.8
Airplane 82.3
Bottle 45.2
Chair 42.2
Plant 34.5
mAP 67.0(b)
Standard deviation 15.9(b)
Coefficient of variance 0.23(b)
Faster R-CNN + VGG-16 trained with COCO, PASCAL VOC 2007, and PASCAL VOC (2012) data sets(a)
Cat 91.3
Dog 89.0
Airplane 87.4
Table 59.0
Chair 54.9
Plant 52.2
mAP 75.9(b)
Standard deviation 12.4(b)
Coefficient of variance 0.16(b)
Faster R-CNN + ResNet-50 trained with the data set in this study
E-CCM 90.6
E-BKH 90.3
M-SSF 88.8
M-REB 32.8
M-CCP 26.5
M-FMW 23.1
mAP 67.3
Standard deviation 20.7
Coefficient of variance 0.31
(a) Data are extracted from Table 7 of Ren et al. (2015).
(b) Results are derived from the APs of the 20 classes of Ren et al. (2015).
Fig. 8. Precision-recall curves with different IoU thresholds for positive detection
The reference APs of person detection are 75.9% using Faster R-CNN + VGG-16 trained with the PASCAL VOC (2012) data set; 82.3% using Faster R-CNN + VGG-16 trained with the COCO, PASCAL VOC 2007, and PASCAL VOC (2012) data sets (Ren et al. 2015); and 79.3% using ResNet-50 on the ImageNet validation set (He et al. 2015). The AP of worker detection
is 60.1%, which is lower than the reference APs of person detection from the computer vision community. In the implementa-
tion of Ren et al. (2015), given the scaled and fixed input images
with the width of 800 pixels and the height of 600 pixels, the mini-
mum objects that the model can detect are determined by the
minimum size of anchors, which is around 87 pixels wide and
175 pixels high. Therefore, this discrepancy could be attributed
to the fact that worker objects in the training and test data sets
are of relatively low resolution; they have the smallest average
sizes, i.e., 109 pixels in the training data set and 112 pixels in the
test data set, as shown in Table 2. In practice, severe occlusion to
workers due to temporary facilities and construction equipment on
cluttered sites can aggravate their undetectability.
Although the results are comparable with those in the computer
vision community, there is still room for improvement from a prac-
tical perspective. Fig. 8 and Table 6 illustrate that lowering the
IoU threshold for positive detection is an immediate, but compro-
mising, solution to improve the performance of the object detec-
tion. For example, when the IoU threshold is lowered to 0.3 and
0.1, the mAP is improved to 76.6 and 78.2% respectively from
67.3%, and the AP of workers (WKR) is improved to 67.1 and
70.5%, respectively, from 60.1% (Table 6).
Activity Recognition
Evaluation Metrics
The evaluation of activity recognition performance depends on
three similar basic definitions of TP, FP, and FN: TP represents
the number of correct recognitions, FP the number of wrong rec-
ognitions, and FN the number of missed activities. If a wrong
recognition occurs, identifying which ground-truth activity raises
the wrong recognition is difficult because diverse activities can be
observed in an image. Therefore, traditional confusion matrices that
are frequently used to evaluate single-mode classification systems,
in which an image contains only one execution of an activity, are
unsuitable. Consequently, the performance of diverse activity rec-
ognition is evaluated in terms of precision (i.e., the ratio of TP to
TP + FP) and recall (i.e., the ratio of TP to TP + FN).
Exemplary Cases of Activity Recognition
Before proceeding to evaluate activity recognition performance,
three cases of activity recognition with the proposed method are
described (Fig. 9). The left column in Fig. 9 shows object detection
results with labels in the form of class code + id + (detection
confidence). The right column shows activity recognition results
in the form of relevance networks and activity patterns. Nonworker
nodes in relevance networks are labeled in the form of class code +
id + (total relevance score), while worker nodes are without the
total relevance score part. The relevance threshold is set to 0.25 to
divide the network. It means that two objects are relevant only
when their relevance is not less than that value. Moreover, division
results in subnetworks and supports identifying group activities
according to the connections among the nonworker nodes in a
subnetwork.
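Dividing the thresholded network into subnetworks, as just described, is a connected-components computation; a minimal sketch (the edge format follows the earlier network-building sketch, with the relevance score dropped) is given below.

```python
# Sketch: splitting a thresholded relevance network into subnetworks (connected
# components) so that group activities can be read off the nonworker nodes they contain.
from collections import defaultdict

def connected_components(edges):
    """edges: iterable of (node_a, node_b) pairs; returns a list of node sets."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components
```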
Case 1: In this case, three activities are recognized. The first
activity is erecting rebar of walls and columns, which involves
two activity entities, i.e., WKR-1 + M-RWC-1 (Pattern WM7)
and WKR-3 + M-RWC-1 (Pattern WM7). The second activity is
building scaffolding systems, which is conducted by WKR-0 +
M-SCF-9 (Pattern WM8). The third activity is building formwork
of walls and columns by WKR-2 + M-FWC-0 (Pattern WM3). The
first and third activities are part of the group activity of build-
ing formwork. They are connected through nonworker nodes
M-RWC-10, M-RWC-6, M-RWC-2, and M-RWC-0.
Case 2: The model detects four workers: WKR-0, WKR-1, WKR-2, and WKR-3. Three of them are relevant to M-FWC-8. As a result, there are three activity entities, i.e., WKR-1 + M-FWC-8, WKR-2 + M-FWC-8, and WKR-3 + M-FWC-8, and one activity, i.e., building formwork of walls and columns. WKR-0 is in the proximity of M-SCF-11 and is believed to be building scaffolding systems according to Pattern WM8. Concrete mixer E-CCM-1 is detected, which matches Pattern DE3 and shows that placing concrete is under way. Additionally, a wrong object detection occurs in this case: container M-FMW-5 is wrongly detected since it is raw formwork material. An FP activity recognition is nevertheless avoided because no worker is found relevant to M-FMW-5; conversely, had a worker been found relevant to it, an FP recognition would have occurred.
Case 3: In this case, three activities are detected. The first activity is placing concrete, which is conducted by WKR-0 + E-CCB-0 in line with Pattern WE2. However, this is an FP recognition because no such activity is taking place: E-CCB-0 was temporarily placed there, and WKR-0 happened to be in its proximity. The second
activity involves WKR-4 + M-FWC-5 and is established as build-
ing formwork of walls and columns. The last activity is building
formwork of slabs and beams, which consists of three activity en-
tities between three workers WKR-1, WKR-2, and WKR-3 and the
product M-FSB-4. A group activity of building formwork by four
workers WKR-1, WKR-2, WKR-3, and WKR-4 can be established
by routing from M-FSB-4 to M-FSB-5 in the subnetwork.
Results
This study focuses on using still site images, in which some ob-
jects cannot be effectively identified even by human experts. The
preliminary evaluation of activity recognition performance was
conducted in four steps. First, 200 images were randomly selected
from the images that the researchers took from building projects in
Hong Kong. It is believed that this consideration is helpful to
increase the external validity of the experimental results. After that, the researchers manually annotated and counted activities as explained in the previous three cases. In this process, single activities were aggregated into specific activity types, and group activities were ignored. Then, the proposed method was used to recognize activities, with an IoU threshold for positive detection of 0.5. Finally, the performance of the method in terms of recall and precision was evaluated. The experiment resulted in 62.4% precision and 87.3% recall (151 TP, 91 FP, and 22 FN recognitions), which indicates that the proposed method holds the potential to recognize construction activities and that there is still room for improvement.

Table 6. Object Detection Performance (AP, %) under Various IoU Thresholds for Positive Detection

Class    IoU=0.1   IoU=0.3   IoU=0.5   IoU=0.7   IoU=0.9
mAP       78.2      76.6      69.7      50.5       7.0
WKR       70.5      67.1      60.1      32.8       0.6
M-CCP     77.8      62.7      26.5      10.7       0.6
M-CCF     83.5      82.1      60.6      36.7       9.1
M-FMW     30.7      28.6      23.1      12.2       0.1
M-FSB     63.5      59.4      49.0      28.2       1.5
M-FWC     82.6      81.5      74.4      45.5       3.0
M-FSS     82.7      77.3      60.2      23.1       0.0
M-REB     53.6      46.4      32.8      17.5       9.1
M-RSB     74.6      69.5      54.8      25.7       2.3
M-RWC     67.4      63.0      51.4      30.4       1.8
M-SCF     81.7      81.0      70.6      49.4       8.3
M-SSF     89.1      88.8      88.8      78.5       6.5
E-BKH     90.4      90.4      90.3      81.2      13.1
E-BDZ     84.2      84.0      81.8      77.9       3.5
E-DMP     80.6      80.5      79.7      71.3       8.2
E-CCB     87.1      83.6      81.5      70.2      15.5
E-CCM     90.6      90.6      90.6      87.9      17.8
E-CCP     89.1      88.5      87.5      69.5       4.5
E-CRA     80.4      76.9      68.6      51.0       4.2
E-LRY     87.4      87.4      86.1      76.6      15.8
E-VAN     85.1      84.9      84.9      78.4      22.6
E-CAR     77.5      76.7      76.4      69.8      13.6
Discussion
Research Challenges and Contribution to Knowledge
Recognizing diverse, concurrent activities executed by multiple ob-
jects in individual images is a challenging task. In a simple case of
activity detection, where an image contains only the execution of a single activity by one or a few objects, the objective of such a system is to correctly classify the image. For example, Golparvar-Fard et al. (2013) used SVM classifiers to classify different actions of an excavator's operations in trimmed video clips, which can be viewed as time-lapse images, and reported an average action recognition accuracy of 86.33%, comparable to the performance of object classification at that time. Yang et al. (2016) investigated using dense trajectories to recognize actions of construction workers in trimmed video clips and reported an average accuracy of 59%. Similarly, they used SVM classifiers to classify workers' actions because each video clip contains only one action. However, in more general cases where multiple objects are present and diverse activities take place concurrently, classification algorithm-based solutions are unsuitable.
A new method that integrates deep learning-based object detection and relevance network-based activity pattern recognition can
tackle this challenge. This study contributes to the body of knowl-
edge in two aspects. First, the state-of-the-art deep-learning tech-
nology was employed to detect the frequently observed 22 classes
of objects in site images. To implement this plan, the researchers
collected and annotated the training data set to fine-tune the Faster
R-CNN model and evaluated the performance of the model on the
test data set. It was found that the deep-learning model presents
consistently high APs on objects with clean boundaries and invari-
ant forms in comparison with those published in the computer
vision domain. However, for those objects with free forms or
ambiguous edges, like raw rebar and formwork materials, the deep-
learning model presents low APs.
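To make the fine-tuning step concrete, the sketch below shows how a Faster R-CNN detector with a ResNet-50 backbone can be adapted to the 22 object classes using the torchvision library. This is a hedged illustration of the general procedure only; the authors' original implementation is not described here and may use a different framework, backbone configuration, and hyperparameters, and `train_loader` is a hypothetical data loader built from the annotated images.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 22 + 1  # 22 construction-related classes plus background


def build_model():
    """Start from a detector pretrained on a generic benchmark and replace its box head."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)
    return model


def fine_tune(model, train_loader, epochs=10, lr=0.005):
    """Fine-tune on batches of (images, targets); each target holds 'boxes' and 'labels'."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
    model.train()
    for _ in range(epochs):
        for images, targets in train_loader:   # hypothetical loader over annotated site images
            loss_dict = model(images, targets)  # returns RPN and box-head losses
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```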
Second, the researchers developed a set of rules for creating relevance networks and 20 activity patterns to recognize construction activities. Semantic relevance and spatial relevance were introduced to build relevance networks. Semantic relevance represents the likelihood that two objects appear together in the same construction activity; spatial relevance is defined with 2D pixel distances in the image coordinates and represents the observable possibility that they are involved in the same activity. Consequently, the relevance of two objects is formulated as the product of their semantic relevance and spatial relevance, as sketched below. Furthermore, relevance networks can serve as a tool to identify latent group activities in site images.
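A minimal sketch of this product formulation follows. The semantic lookup values, the exponential distance decay, and the scale parameter are illustrative assumptions for exposition; the paper's actual semantic relevance table and spatial relevance function are defined in earlier sections and are not reproduced here.

```python
import math

# Hypothetical semantic relevance lookup: likelihood that two classes co-occur in one activity.
SEMANTIC = {("WKR", "M-FWC"): 0.9, ("WKR", "M-SCF"): 0.8, ("WKR", "E-CCB"): 0.7}


def spatial_relevance(box_a, box_b, scale=300.0):
    """Illustrative decay of relevance with the 2D pixel distance between box centers."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return math.exp(-math.hypot(ax - bx, ay - by) / scale)


def relevance(class_a, box_a, class_b, box_b):
    """Overall relevance as the product of semantic and spatial relevance."""
    semantic = SEMANTIC.get((class_a, class_b)) or SEMANTIC.get((class_b, class_a), 0.0)
    return semantic * spatial_relevance(box_a, box_b)


# A worker close to a formwork panel scores high; the same pair far apart scores low.
print(relevance("WKR", (100, 100, 160, 280), "M-FWC", (180, 120, 320, 300)))   # ~0.60
print(relevance("WKR", (100, 100, 160, 280), "M-FWC", (900, 700, 1100, 950)))  # ~0.02
```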
Practical Implications
The method in this study is designed with several practical expectations, i.e., using site images, detecting and analyzing concurrent construction activities, and being fully automatic. Therefore, it is possible to save managers' valuable time in data collection and manipulation for on-site monitoring and concentrate their attention on solving problems that necessarily demand their expertise. More specifically, this method could nourish several potential applications. First, the method can be used to index and classify daily site images, which are usually taken for various management purposes, e.g., quality control, safety management, and progress records, but often without textual descriptions; automated indexing and classification of these images should be helpful. Second, because surveillance videos can be decomposed into time-lapse images, the method can be used to continuously monitor the construction resources involved in specific activities in terms of working hours, as sketched after this paragraph. Third, given site videos, it is possible to detect the states of an activity (i.e., not started, just started, ongoing, and completed); therefore, activity progress deviations against the construction program can be established in real time.
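As a sketch of the second application, the snippet below samples frames from a surveillance video at a fixed interval and accumulates per-activity durations from the recognition results. The `recognize_activities` function stands in for the proposed pipeline and is a placeholder, as are the file name and the one-minute sampling interval; OpenCV is assumed to be available for decoding the footage.

```python
import cv2  # OpenCV, assumed available for reading surveillance footage
from collections import Counter


def estimate_activity_hours(video_path, recognize_activities, interval_s=60):
    """Sample one frame every `interval_s` seconds and credit that interval to the
    activity types recognized in it, returning estimated hours per activity type."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(fps * interval_s))
    durations = Counter()
    frame_index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if frame_index % step == 0:
            for activity in set(recognize_activities(frame)):
                durations[activity] += interval_s / 3600.0  # hours credited to this activity
        frame_index += 1
    capture.release()
    return durations


# Example call (placeholder pipeline and file name):
# hours = estimate_activity_hours("site_cam_01.mp4", recognize_activities)
```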
Research Limitations
As indicated by the experimental results, there is still room for improvement in the detection performance of several classes of objects, e.g., raw rebar, formwork materials, and workers. Besides this, the researchers note two limitations of this study. To implement full automation, 2D pixel distances in the image coordinates, rather than 3D physical distances, are used to define the spatial relevance between objects and workers. Images are assumed to be taken from relatively vertical angles, e.g., by a surveillance camera mounted under the operator cabin of a tower crane. In fact, many images are taken from relatively horizontal angles, which compress distances in 2D images and affect the validity of the spatial relevance calculation. Reconstructing 3D physical positions using multiple images from various viewpoints, which was explored by Brilakis et al. (2011) and Park et al. (2012) to track objects, could be a solution to this limitation.
Also, there is an intrinsic limitation in using still images to detect activities because no temporal information between images is available, making it difficult to differentiate between prolonged activities and transient states. At its current stage, this study primarily investigated how to use site images to recognize construction activities. Future work will focus on site surveillance videos to take advantage of the temporal information across frames, investigate the dynamics of relevance networks, and improve the activity recognition performance.
The performance of activity recognition depends heavily on that
of object detection. Activity recognition is based on relevance net-
works, which in turn are built according to the objects detected.
As a result, the precision of any activity pattern is statistically lower
than or equal to the minimum AP of the object classes used to es-
tablish the pattern. The precision of activity recognition (62.4%) is
lower than the mAP of object detection (67.3%); the latter sets a
ceiling for the former. The limitations mentioned previously nec-
essarily widen their difference. Lowering the IoU threshold for
positive detection, which sacrifices the localization accuracy, is an
immediate solution to improve object detection performance.
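One way to make the preceding ceiling argument explicit, under the simplifying assumptions that an activity pattern is recognized correctly only when every object class it requires is detected correctly, that detections are independent, and that AP is used as a proxy for the per-class probability of a correct detection, is the following heuristic bound (a sketch, not a result derived in the paper):

```latex
\[
  \mathrm{Precision}(p) \;\lesssim\; \prod_{c \in \mathcal{C}_p} \mathrm{AP}_c
  \;\le\; \min_{c \in \mathcal{C}_p} \mathrm{AP}_c ,
\]
% where $\mathcal{C}_p$ denotes the set of object classes used to establish activity pattern $p$.
```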
Conclusion
This paper introduces a new method to recognize diverse, concur-
rent activities executed by multiple objects in still site images, for
which the methods based on image classification algorithms are not
suitable. The method consists of two steps: object detection and
activity recognition. Cutting-edge object detection technologies
(i.e., Faster R-CNN + ResNet-50) were employed to implement
the object detection task. The researchers built the training and test
data sets of 22 classes of construction-related objects to train and
evaluate the convolutional neural networks. The method proves to
be comparable with the state of the art of object detection regarding mAP and the best APs but presents a relatively large AP variance. Free forms, blurred edges, and low resolutions are the possible causes of the low APs. Semantic relevance and spatial relevance were introduced to create relevance networks. Semantic relevance represents the likelihood that any two objects appear together in the same construction activity. Spatial relevance is defined with 2D
pixel distances in the image coordinates, representing the observ-
able possibility that they are involved in the same activity. Based
on relevance networks, a set of activity patterns was defined.
Preliminary experimental results show that the rule-based relevance
networks and activity patterns possess the potential to detect
diverse construction activities in site images.
However, the proposed method is limited by the distance compression caused by single-camera 2D photography, which can be addressed with 3D physical distances obtained by triangulating multiple camera views. Another limitation is the difficulty of differentiating between prolonged activities and transient states due to the intrinsic lack of temporal information between site images. To address this, site surveillance videos will be investigated in the future to implement dynamic relevance networks by detecting and correlating identical objects across consecutive frames.
Acknowledgments
The work was supported by the Innovation and Technology
Commission of Hong Kong, under the platform project Smart
Construction Platform based on Cloud BIM and Image Processing
(ITT/002/16LP).
References
Aggarwal, J. K., and Cai, Q. (1999). "Human motion analysis." Comput. Vision Image Understanding, 73(3), 428–440.
Aggarwal, J. K., and Ryoo, M. S. (2011). "Human activity analysis: A review." ACM Comput. Surv., 43(3), 1–43.
Andrew, A. M. (2013). An introduction to support vector machines and other kernel-based learning methods, Cambridge University Press, Cambridge, U.K.
Brilakis, I., Park, M.-W., and Jog, G. (2011). "Automated vision tracking of project related entities." Adv. Eng. Inf., 25(4), 713–724.
Brilakis, I., and Soibelman, L. (2006). "Multimodal image retrieval from construction databases and model-based systems." J. Constr. Eng. Manage., 10.1061/(ASCE)0733-9364(2006)132:7(777), 777–785.
Brilakis, I., Soibelman, L., and Shinagawa, Y. (2005). "Material-based construction site image retrieval." J. Comput. Civ. Eng., 10.1061/(ASCE)0887-3801(2005)19:4(341), 341–355.
Bugler, M., Borrmann, A., Ogunmakin, G., Vela, P. A., and Teizer, J. (2017). "Fusion of photogrammetry and video analysis for productivity assessment of earthwork processes." Comput.-Aided Civil Infrastruct. Eng., 32(2), 107–123.
Chi, S., and Caldas, C. H. (2011). "Automated object identification using optical video cameras on construction sites." Comput.-Aided Civ. Infrastruct. Eng., 26(5), 368–380.
Dalal, N., and Triggs, B. (2005). "Histograms of oriented gradients for human detection." IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, Vol. 881, IEEE, New York, 886–893.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). "ImageNet: A large-scale hierarchical image database." IEEE Conf. on Computer Vision and Pattern Recognition, IEEE, New York, 248–255.
Du, S., Shehata, M., and Badawy, W. (2011). "Hard hat detection in video sequences based on face features, motion and color information." 3rd Int. Conf. on Computer Research and Development, IEEE, New York, 25–29.
Egnor, S. R., and Branson, K. (2016). "Computational analysis of behavior." Ann. Rev. Neurosci., 39, 217–236.
Everingham, M., and Winn, J. (2007). "The PASCAL Visual Object Classes Challenge 2007 (VOC2007) development kit." https://pjreddie.com/media/files/VOC2007_doc.pdf (Jun. 27, 2017).
Everingham, M., and Winn, J. (2012). "The PASCAL Visual Object Classes Challenge 2012 (VOC2012) development kit." http://host.robots.ox.ac.uk/pascal/VOC/voc2012/devkit_doc.pdf (Jun. 27, 2017).
Felzenszwalb, P. F., McAllester, D., and Ramanan, D. (2008). "A discriminatively trained, multiscale, deformable part model." Proc., Computer Vision and Pattern Recognition, IEEE, New York, 1–8.
Girshick, R. (2015). "Fast R-CNN." Proc., IEEE Int. Conf. on Computer Vision, IEEE, New York, 1440–1448.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). "Rich feature hierarchies for accurate object detection and semantic segmentation." Proc., IEEE Conf. on Computer Vision and Pattern Recognition, IEEE, New York, 580–587.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2016). "Region-based convolutional networks for accurate object detection and segmentation." IEEE Trans. Pattern Anal. Mach. Intell., 38(1), 142–158.
Golparvar-Fard, M., Heydarian, A., and Niebles, J. C. (2013). "Vision-based action recognition of earthmoving equipment using spatio-temporal features and support vector machine classifiers." Adv. Eng. Inf., 27(4), 652–663.
Gong, J., and Caldas, C. H. (2010). "Computer vision-based video interpretation model for automated productivity analysis of construction operations." J. Comput. Civ. Eng., 10.1061/(ASCE)CP.1943-5487.0000027, 252–263.
Gong, J., Caldas, C. H., and Gordon, C. (2011). "Learning and classifying actions of construction workers and equipment using Bag-of-Video-Feature-Words and Bayesian network models." Adv. Eng. Inf., 25(4), 771–782.
Haming, K., and Peters, G. (2010). "The structure-from-motion reconstruction pipeline–A survey with focus on short image sequences." Kybernetika, 46(5), 926–937.
Han, S., and Lee, S. (2013). "A vision-based motion capture and recognition framework for behavior-based safety management." Autom. Constr., 35, 131–141.
Han, S., Lee, S., and Peña-Mora, F. (2013). "Vision-based detection of unsafe actions of a construction worker: Case study of ladder climbing." J. Comput. Civ. Eng., 10.1061/(ASCE)CP.1943-5487.0000279, 635–644.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). "Deep residual learning for image recognition." Preprint, arXiv:1512.03385.
Hinton, G. E. (2009). "Deep belief networks." Scholarpedia, 4(5), 5947.
Hong, S., You, T., Kwak, S., and Han, B. (2015). "Online tracking by learning discriminative saliency map with convolutional neural network." Preprint, arXiv:1502.06796.
ImageNet and Microsoft COCO. (2015). "ImageNet and MS COCO Visual Recognition Challenges Joint Workshop." http://image-net.org/challenges/ilsvrc+mscoco2015 (Jun. 26, 2017).
Kristan, M., Matas, J., Leonardis, A., Felsberg, M., and Cehovin, L. (2015). "The visual object tracking VOT2015 challenge results." Proc., IEEE Int. Conf. on Computer Vision Workshops, IEEE, New York, 1–23.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). "ImageNet classification with deep convolutional neural networks." Proc., Advances in Neural Information Processing Systems, NIPS, La Jolla, CA, 1097–1105.
Lafferty, J., McCallum, A., and Pereira, F. (2001). "Conditional random fields: Probabilistic models for segmenting and labeling sequence data." Proc., 18th Int. Conf. on Machine Learning, IMLS, Stroudsburg, PA, 282–289.
LeCun, Y., Bengio, Y., and Hinton, G. (2015). "Deep learning." Nature, 521(7553), 436–444.
Lillo, I., Soto, A., and Carlos Niebles, J. (2014). "Discriminative hierarchical modeling of spatio-temporally composable human activities." Proc., IEEE Conf. on Computer Vision and Pattern Recognition, IEEE, New York, 812–819.
Lowe, D. G. (1999). "Object recognition from local scale-invariant features." Proc., Int. Conf. on Computer Vision, Vol. 1152, IEEE, New York, 1150–1157.
Memarzadeh, M., Golparvar-Fard, M., and Niebles, J. C. (2013). "Automated 2D detection of construction equipment and workers from site video streams using histograms of oriented gradients and colors." Autom. Constr., 32, 24–37.
Morariu, V. I., and Davis, L. S. (2011). "Multi-agent event recognition in structured scenarios." Proc., IEEE Conf. on Computer Vision and Pattern Recognition, IEEE, New York, 3289–3296.
Murphy, K. P., and Paskin, M. A. (2001). "Linear-time inference in hierarchical HMMs." Proc., NIPS, NIPS, La Jolla, CA, 833–840.
Neubeck, A., and Van Gool, L. (2006). "Efficient non-maximum suppression." Proc., 18th Int. Conf. on Pattern Recognition, IEEE, New York, 850–855.
Park, M. W., and Brilakis, I. (2012). "Construction worker detection in video frames for initializing vision trackers." Autom. Constr., 28, 15–25.
Park, M. W., Koch, C., and Brilakis, I. (2012). "Three-dimensional tracking of construction resources using an on-site camera system." J. Comput. Civ. Eng., 10.1061/(ASCE)CP.1943-5487.0000168, 541–549.
PASCAL VOC. (2012). "The PASCAL visual object classes." http://host.robots.ox.ac.uk/pascal/VOC/ (Jun. 26, 2017).
Quattoni, A., Wang, S., Morency, L.-P., Collins, M., and Darrell, T. (2007). "Hidden conditional random fields." IEEE Trans. Pattern Anal. Mach. Intell., 29(10), 1848–1852.
Rabiner, L., and Juang, B. (1986). "An introduction to hidden Markov models." IEEE ASSP Magazine, 3(1), 4–16.
Ray, S. J., and Teizer, J. (2012). "Real-time construction worker posture analysis for ergonomics training." Adv. Eng. Inf., 26(2), 439–455.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). "Faster R-CNN: Towards real-time object detection with region proposal networks." Proc., Advances in Neural Information Processing Systems, NIPS, La Jolla, CA, 91–99.
Rezazadeh Azar, E., Dickinson, S., and McCabe, B. (2013). "Server-customer interaction tracker: Computer vision-based system to estimate dirt-loading cycles." J. Constr. Eng. Manage., 10.1061/(ASCE)CO.1943-7862.0000652, 785–794.
Rezazadeh Azar, E., and McCabe, B. (2012a). "Automated visual recognition of dump trucks in construction videos." J. Comput. Civ. Eng., 10.1061/(ASCE)CP.1943-5487.0000179, 769–781.
Rezazadeh Azar, E., and McCabe, B. (2012b). "Part based model and spatial-temporal reasoning to recognize hydraulic excavators in construction images and videos." Autom. Constr., 24, 194–202.
Shechtman, E., and Irani, M. (2005). "Space-time behavior based correlation." Proc., IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, IEEE, New York, 405–412.
Son, H., Kim, C., and Kim, C. (2012). "Automated color model-based concrete detection in construction-site images by using machine learning algorithms." J. Comput. Civ. Eng., 10.1061/(ASCE)CP.1943-5487.0000141, 421–433.
Teizer, J., Caldas, C., and Haas, C. (2007). "Real-time three-dimensional occupancy grid modeling for the detection and tracking of construction resources." J. Constr. Eng. Manage., 10.1061/(ASCE)0733-9364(2007)133:11(880), 880–888.
Tzu, T. (2015). "LabelImg: A graphical image annotation tool." https://github.com/tzutalin/labelImg (Dec. 1, 2016).
Vrigkas, M., Nikou, C., and Kakadiaris, I. A. (2015). "A review of human activity recognition methods." Front. Robot. AI, 2, 28.
Wang, H., and Schmid, C. (2011). "Action recognition by dense trajectories." IEEE Conf. on Computer Vision and Pattern Recognition, IEEE, New York, 3169–3176.
Wang, H., and Schmid, C. (2013). "Action recognition with improved trajectories." Proc., IEEE Int. Conf. on Computer Vision, IEEE, New York, 3551–3558.
Weinland, D., Ronfard, R., and Boyer, E. (2011). "A survey of vision-based methods for action representation, segmentation and recognition." Comput. Vision Image Understanding, 115(2), 224–241.
Williams, C. (2012). "Introduction: History and analysis (in Part II: VOC 2005-2012: The VOC years and legacy)." http://host.robots.ox.ac.uk/pascal/VOC/voc2012/workshop/history_analysis.pdf (Jun. 26, 2017).
Yang, J., Park, M.-W., Vela, P. A., and Golparvar-Fard, M. (2015). "Construction performance monitoring via still images, time-lapse photos, and video streams: Now, tomorrow, and the future." Adv. Eng. Inf., 29(2), 211–224.
Yang, J., Shi, Z., and Wu, Z. (2016). "Vision-based action recognition of construction workers using dense trajectories." Adv. Eng. Inf., 30(3), 327–336.
Yang, J., Vela, P., Teizer, J., and Shi, Z. (2014). "Vision-based tower crane tracking for understanding construction activity." J. Comput. Civ. Eng., 10.1061/(ASCE)CP.1943-5487.0000242, 103–112.
Zhu, Z., Ndiour, I. J., Brilakis, I., and Vela, P. A. (2010). "Improvements to concrete column detection in live video." Proc., 27th Int. Symp. on Automation and Robotics in Construction, Faculty of Civil Engineering of the Slovak Univ. of Technology in Bratislava, Bratislava, Slovakia, 25–27.
Zou, J., and Kim, H. (2007). "Using hue, saturation, and value color space for hydraulic excavator idle time analysis." J. Comput. Civ. Eng., 10.1061/(ASCE)0887-3801(2007)21:4(238), 238–246.