Recognizing Diverse Construction Activities in Site Images
via Relevance Networks of Construction-Related Objects
Detected by Convolutional Neural Networks
Xiaochun Luo1; Heng Li2; Dongping Cao3; Fei Dai, M.ASCE4;
JoonOh Seo5; and SangHyun Lee, M.ASCE6
Abstract: Timely and overall knowledge of the states and resource allocation of diverse activities on construction sites is critical to resource
leveling, progress tracking, and productivity analysis. Despite its importance, this task is still performed manually. Previous studies have
taken a significant step forward in introducing computer vision technologies, although they have been oriented toward limited classes of
objects or limited types of activities. Furthermore, they especially focus on single activity recognition, where an image contains only the
execution of an activity by one or a few objects. This paper introduces a two-step method for recognizing diverse construction activities in still
site images. It detects 22 classes of construction-related objects using convolutional neural networks. With objects detected, semantic
relevance representing the likelihood of the cooperation or coexistence between two objects in a construction activity, spatial relevance
representing the two-dimensional pixel proximity in the image coordinates, and activity patterns are defined to recognize 17 types of
construction activities. The advantage of the proposed method is its potential to recognize diverse concurrent construction activities in
a fully automatic way. Therefore, it is possible to save managers' valuable time in manual data collection and concentrate their attention
on solving problems that necessarily demand their expertise. DOI: 10.1061/(ASCE)CP.1943-5487.0000756. © 2018 American Society of
Civil Engineers.
Author keywords: Construction activity recognition; Convolutional neural networks; Semantic relevance; Spatial relevance; Relevance
networks.
Introduction
Construction sites generally have a large scale, and diverse activ-
ities take place there concurrently. Timely and overall awareness
of activities' states and resource allocation is critical to many
project-level management tasks, including resource leveling, prog-
ress tracking, and productivity analysis. Despite its importance,
the manual approach to activity tracking and resource counting,
which relies on managers' experience and diligence, is still the
mainstream in practice. In the last decade, a considerable amount
of literature has been published concerning visual object detection
(Chi and Caldas 2011; Du et al. 2011; Memarzadeh et al. 2013;
Rezazadeh Azar and McCabe 2012a, b; Zhu et al. 2010) and con-
struction activity recognition (Golparvar-Fard et al. 2013; Gong
et al. 2011; Han and Lee 2013; Han et al. 2013; Ray and Teizer
2012; Rezazadeh Azar et al. 2013; Yang et al. 2016; Zou and
Kim 2007). These studies have contributed to taking a significant
step forward in introducing computer vision technologies to the
time-consuming tasks (Yang et al. 2015).
Previous studies, however, have been primarily oriented toward
limited classes of objects, e.g., workers, excavators, and dump
trucks (Memarzadeh et al. 2013; Park and Brilakis 2012;
Rezazadeh Azar and McCabe 2012b), or limited types of construc-
tion activities, e.g., earthwork and concrete pouring (Bugler et al.
2017; Gong and Caldas 2010; Rezazadeh Azar et al. 2013). They
can hardly be extended to analyze other classes of objects or oper-
ations. One of the possible reasons is that most of them use hand-
crafted features to build their detectors, which are trained with
relatively small data sets of limited object classes. Although these
methods have shown satisfying performance in their envisaged set-
tings, the increased number of object classes poses a big challenge
to them. The results of PASCAL VOC Object Detection Challenges
from 2009 to 2012 (Williams 2012) show that detectors based on
handcrafted features arrived at a performance bottleneck. Further-
more, they especially focus on single activity recognition, where an
image contains only one execution of an activity by one or a few
objects.
Deep learning allows computational models that are composed
of multiple processing layers to learn representations of data with
multiple levels of abstraction (LeCun et al. 2015).
1 Senior Research Fellow, Dept. of Building and Real Estate, Hong Kong Polytechnic Univ., Hung Hom, Kowloon 999077, Hong Kong (corresponding author). E-mail: bsericlo@polyu.edu.hk
2 Chair Professor, Dept. of Building and Real Estate, Hong Kong Polytechnic Univ., Hung Hom, Kowloon 999077, Hong Kong. E-mail: bshengli@polyu.edu.hk
3 Assistant Professor, Dept. of Construction Management and Real Estate, School of Economics and Management, Tongji Univ., 1239 Siping Rd., Shanghai 200092, China. E-mail: dongping.cao@tongji.edu.cn
4 Assistant Professor, Dept. of Civil and Environmental Engineering, West Virginia Univ., Morgantown, WV 26506-6103. E-mail: fei.dai@mail.wvu.edu
5 Assistant Professor, Dept. of Building and Real Estate, Hong Kong Polytechnic Univ., Hung Hom, Kowloon 999077, Hong Kong. E-mail: joonoh.seo@polyu.edu.hk
6 Associate Professor, Dept. of Civil and Environmental Engineering, Tishman Construction Management Program, Univ. of Michigan, Ann Arbor, MI 48109. E-mail: shdpm@umich.edu
Note. This manuscript was submitted on May 15, 2017; approved on October 25, 2017; published online on February 16, 2018. Discussion period open until July 16, 2018; separate discussions must be submitted for individual papers. This paper is part of the Journal of Computing in Civil Engineering, © ASCE, ISSN 0887-3801.
These methods have dramatically improved the state of the art in visual object clas-
sification (Krizhevsky et al. 2012) and object detection (Girshick
et al. 2014). The stellar success stories of deep learning from the
computer vision domain motivate the authors to address the previ-
ously mentioned concerns with it.
This paper introduces a method for recognizing diverse con-
struction activities in still site images. It uses convolutional neural
networks (CNNs), with Faster R-CNN (Ren et al. 2015) as the region
proposal network and ResNet-50 (He et al. 2015) as the object
detection network, to detect those frequently observed objects,
establishes relevance networks of the objects, and recognizes activ-
ities by pattern matching. The advantage of the proposed technol-
ogy is its potential to recognize diverse concurrent construction
activities in a fully automatic way. Therefore, it is possible to save
managers' valuable time in manual data collection and concentrate
their attention on solving problems that necessarily demand their
expertise.
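As a rough illustration of this detection step (the authors' original implementation is not reproduced here), the sketch below builds a Faster R-CNN detector with a ResNet-50 backbone using the torchvision library; the library choice, the file name, the confidence threshold of 0.5, and the class count of 23 (the 22 construction classes plus background) are assumptions made for illustration only.

```python
# Minimal sketch: a Faster R-CNN detector with a ResNet-50 backbone built with
# torchvision (assumed >= 0.13). num_classes = 23 assumes the paper's 22 object
# classes plus the background class; all names and thresholds are illustrative.
import torch
import torchvision
from torchvision import transforms
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None, weights_backbone=None, num_classes=23)
# In practice, the model would first be fine-tuned on the annotated site images.
model.eval()

image = Image.open("site_image.jpg").convert("RGB")  # hypothetical file name
tensor = transforms.ToTensor()(image)                # HWC uint8 -> CHW float in [0, 1]

with torch.no_grad():
    prediction = model([tensor])[0]  # dict with 'boxes', 'labels', and 'scores'

keep = prediction["scores"] > 0.5    # assumed confidence threshold
print(prediction["boxes"][keep], prediction["labels"][keep])
```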
In the rest of this paper, the related work on visual object
detection and activity recognition conducted by both the computer
vision community and the construction community is firstly re-
viewed. Then, the method development is presented in three parts:
(1) preparing the data sets for transfer learning of a pretrained
ResNet-50 model, (2) developing the key concepts and rules for
relevance network creation and the patterns for activity recognition,
and (3) evaluating the performance of Faster R-CNN on object
detection and that of the proposed method on diverse construction
activity recognition. Finally, the contribution to knowledge, re-
search limitations, and future work are discussed.
Background and Related Work
Object Detection in Computer Vision
Object detection is one of the central topics in the computer vision
community, and its methodologies for feature engineering have
evolved from handcrafted feature descriptors to deep-learning ar-
chitectures in the last two decades (LeCun et al. 2015). The motive
forces behind the transition include the development of computer
architectures (e.g., graphics processing units and multicore com-
puter systems), the advent of large-scale data sets such as ImageNet
(Deng et al. 2009), and the better designs for modeling and training
deep networks (Hinton 2009).
Among the typical handcrafted features are the scale-invariant
feature transform (SIFT), histogram of oriented gradients (HOG),
and deformable part-based model (DPM). Lowe (1999) introduced
SIFT as a local feature descriptor, which addresses the problem of
image patch comparison and significantly propels the development
of structure from motion (Haming and Peters 2010). Dalal and
Triggs (2005) introduced HOG, which counts occurrences of gra-
dient orientations in localized portions of an image. In combination
with the linear support vector machine (SVM) (Andrew 2013) as
the classifier, it becomes a favorite feature descriptor for object
classification and detection. Felzenszwalb et al. (2008) introduced
DPM, which includes a coarse global template covering an entire
object and higher-resolution part templates represented by HOG.
DPM reinforces the popularity and strength of HOG by addressing
the limitation of HOG in independently representing and detecting
deformable objects such as pedestrians and excavators.
In 2012, interest in CNNs was rekindled by Krizhevsky et al.
(2012) by showing substantially higher image classification accu-
racy on the ImageNet Large Scale Visual Recognition Challenge.
From then on, the most effective algorithms for image classifica-
tion, object recognition, and visual tracking were developed mostly
based on CNNs. For example, Hong et al. (2015) introduced an
online visual tracking algorithm by learning a discriminative
saliency map using a CNN and achieved the best result in the Visual
Object Tracking challenge 2015 (Kristan et al. 2015).
Girshick et al. (2014) introduced regions with CNN features
(R-CNN) for object detection and semantic segmentation by com-
bining region proposals with CNNs. R-CNN achieves a mean aver-
age precision (mAP) of 53.3% on the data set PASCAL VOC
(2012), which improves mAP by more than 30% relative to the
previous best result. A follow-up study (Girshick 2015) introduced
Fast R-CNN to efficiently classify object proposals using deep con-
volutional networks, which improves the training and testing speed.
Ren et al. (2015) introduced Faster R-CNN toward real-time object
detection with region proposal networks, which share full-image
convolutional features with the detection network and thus enable
cost-free region proposals.
Visual Object Detection in Construction
In the past decade, a considerable amount of literature has been
published on vision-based object detection in the construction do-
main (Du et al. 2011; Memarzadeh et al. 2013; Park and Brilakis
2012; Rezazadeh Azar and McCabe 2012a, b; Son et al. 2012; Zhu
et al. 2010). A major family of these studies focused on detecting
construction equipment and workers (Chi and Caldas 2011;
Memarzadeh et al. 2013; Teizer et al. 2007). Park and Brilakis
(2012) introduced a three-step method to detect workers in videos:
using background subtraction to identify moving objects, using
SVM to classify the human-shaped objects according to the HOG
shape features of those objects, and finally using k-NN to detect
workers based on the color histogram of those human-shaped ob-
jects. Similarly, Chi and Caldas (2011) proposed another three-step
method, but with different feature descriptors and classifiers, to de-
tect mobile heavy equipment and workers. In their method, moving
objects are segmented with background subtraction. These seg-
mented regions are factorized with geometric attributes, e.g., the
position of region centroid, occupied area size in pixels, and aspect
ratio. Eventually, these geometric attributes in combination with the
gray information of the area are classified with a Bayes classifier
and an artificial neural network classifier independently.
Background subtraction, however, is based on a static back-
ground hypothesis, which limits its use in moving or jiggling
videos. To address the problem, Memarzadeh et al. (2013) proposed
to use HOG and colors to describe excavators, dump trucks, and
workers in videos and directly detect them by classifying the areas
of sliding windows in each frame with SVM. Rezazadeh Azar and
McCabe (2012b) introduced a DPM-like method to detect hydraulic
excavators in images and videos. The method uses HOG features to
detect the first part of the arm (the boom), searches the adjacent
part (the dipper) to finalize the recognition, and finally determines
the pose of the excavator based on the location of the dipper.
Object detection methods based on HOG features suffer from
their window sliding operation and are computationally expensive.
As noted by Rezazadeh Azar and McCabe (2012a), scanning a
1,024 × 768-pixel image with such a method for detecting dump
trucks from eight viewpoints takes 69 s. To speed up the detection
process, they proposed to use a fast classifier based on Haar-like
features for static images and a movement filter using Bayes deci-
sion rules for video frames to evaluate the confidence level to which
an object of interest exists in the sliding window before using the
expensive HOG detector.
Several other studies focused on detecting construction compo-
nents, e.g., concrete columns in videos (Zhu et al. 2010) and con-
crete structural components in images (Son et al. 2012), with edge
and color features. Zhu et al. (2010) proposed to locate edge points
using column colors, identify the orientation information of edge
points using the Hough transform, and finally detect columns using
an artificial neural network classifier. Son et al. (2012) proposed a
method for transforming the RGB color space to non-RGB color
spaces to raise detection robustness in various illumination condi-
tions, and SVM proved to be the superior classifier for the non-
RGB spaces.
Earlier studies by Brilakis and Soibelman (2006) and Brilakis
et al. (2005) focused on classifying site images by detecting how
much area a specific material (e.g., earth, concrete, and paint) occu-
pies in the image plane using a technique named content-based im-
age retrieval, which detects materials in four steps: decomposing
images into their basic features (e.g., color, texture, structure, etc.)
by applying a series of filters; clustering these regions; computing
region feature signatures; and detecting regions of interest by com-
paring each signature with annotated samples in a material database.
In summary, previous research on visual object detection in
construction has primarily focused on detecting workers and equip-
ment, and the number of object classes that they can detect is lim-
ited. None of the studies reviewed appears to address the detection
problem of diverse objects. Research involving construction mate-
rials aims at image retrieval based on their basic features, e.g., color,
texture, and structure, rather than object detection. One of the rea-
sons could be that they primarily use handcrafted features to build
their classifiers, which are trained with relatively small data sets
with limited object types. Although they have shown satisfying per-
formance in their envisaged settings, it remains a challenge to them
to handle the increased number of object classes. The evidence
from the computer vision community, e.g., the results of PASCAL
VOC Object Detection Challenges (Williams 2012), shows that
detectors based on handcrafted features have arrived at a performance
bottleneck. The best average precisions (APs) of people detection
with the training and test data provided by PASCAL VOC from
2005 to 2012 were 0.013, 0.164, 0.221, 0.420, 0.415, 0.475, 0.516,
and 0.461, respectively (PASCAL VOC 2012).
Human Activity Recognition in Computer Vision
Human activity recognition is an active research topic in the
computer vision community with many important applications,
including human-computer interfaces, content-based video index-
ing, video surveillance, and robotics (Aggarwal and Cai 1999;
Aggarwal and Ryoo 2011; Egnor and Branson 2016; Vrigkas
et al. 2015; Weinland et al. 2011). The relevant methods can be
categorized into four groups: shape-based, space-time, stochastic,
and rule-based (Vrigkas et al. 2015). Shape-based methods re-
present human activities with two- or three-dimensional skeleton
pose models and recognize activities by human pose classification
(Lillo et al. 2014). Space-time methods recognize activities based
on pixel-based space-time features across frames. Optical flow has
been proved to be one of the critical cues in implementing these
methods (Shechtman and Irani 2005;Wang and Schmid 2011,
2013). Stochastic methods model human activities by considering
an activity entity as a stochastically predictable sequence of states.
The primary stochastic techniques used in activity recognition in-
clude hidden Markov models (Murphy and Paskin 2001; Rabiner
and Juang 1986) and conditional random fields (Lafferty et al.
2001; Quattoni et al. 2007). Rule-based approaches model activ-
ities with a set of constraints describing atomic actions or a set of
activity patterns and recognize them by logic reasoning or probabi-
listic pattern matching (Morariu and Davis 2011).
Construction Activity Recognition and Productivity
Analysis
Vrigkas et al. (2015) classified human activities into six levels
regarding their complexity: (1) gestures, (2) atomic actions,
(3) human-to-object or human-to-human interactions, (4) group
actions, (5) behaviors, and (6) events. Since human behaviors and
events are high-level activities that involve emotions, personality,
psychological states, and social roles, this study excludes the
two levels at the current stage and extends the remaining levels to
cover the activities by construction equipment. Table 1 shows the
taxonomy of construction activities of interest.
The existing literature on construction activity recognition is
extensive and focuses particularly on the productivity analysis of
construction equipment (Bugler et al. 2017; Gong and Caldas
2010; Rezazadeh Azar et al. 2013; Yang et al. 2014; Zou and Kim
2007). Pioneering this task, Gong and Caldas (2010) proposed a
rule-based interaction activity recognition method to analyze the
productivity of concrete pouring of a tower crane and concrete
buckets. The method breaks down construction operations into a
variety of working task elements (semantic context) and describes
how these elements unfold in planned locations (spatial context)
and sequences (temporal context). The spatial context is integrated
Table 1. Construction Activity Taxonomy Regarding Activity Complexity
Activity Definition Examples
Gestures Primitive movements of the body parts of an object
that may correspond to an action of this object(a)
Workers: walking; standing with legs upright, one leg upright, legs bent, or one
leg bent; sitting; kneeling on one leg bent; etc.
Excavators: swinging left or right, lowering or raising the boom, closing or
dumping the bucket, sticking out or in, etc.
Atomic actions Movements of an object describing a certain motion
that may be part of more complex activities
Rebar workers: transporting rebars, sorting rebars, placing concrete spacers,
fixing rebars using a hand tool or a mechanical means, etc.
Excavators: digging earth, leveling earth, transporting earth, unloading earth to
a dump truck, etc.
Interactions Activities that involve two or more objects Transporting a prefabricated mesh to working areas with a tower crane, which
involves multiple-stage interactions between workers with the tower crane: a
worker winding and securing the rope, a worker instructing the tower crane to
move, and a worker positioning and unloading the mesh.
Group activities Activities performed by a group of objects Placing concrete: workers preparing concrete areas; transporting concrete to
concrete areas with a crane and a bucket (or a track-mounted concrete pump);
and workers placing, spreading, compacting, and leveling concrete.
(a) Object represents humans and construction equipment for the sake of simplicity.
with video scenes by manually specifying regions in the video
scenes and assigning a working state to each of these regions.
In other words, when the bucket enters a specific region, it signals
that the work task transforms into the assigned working state.
Bugler et al. (2017) proposed a rule-based interaction activity
detection method for analyzing earthwork productivity of excava-
tors and dump trucks. In their method, the activity state (i.e., static,
moving, absent, or filling) is checked when an excavator and a
dump truck are in proximity. They use photogrammetry to estimate
the volume of the removed earth and derive the productivity of
earthwork. Similarly, Rezazadeh Azar et al. (2013) proposed an-
other rule-based method for analyzing dirt loading cycles of exca-
vators and dump trucks. They combine logical reasoning with an
SVM classifier to achieve better detection robustness and accuracy.
The logic reasoning checks equipment orientations for filling, and
then the SVM classifier detects the earthwork state according to the
distances between the base point of the excavator and four corners
of the dump truck.
Publications that concentrate on atomic action recognition
in construction more frequently adopt space-time methods.
Golparvar-Fard et al. (2013) presented a method using spatiotem-
poral features and SVM classifiers to understand the action of
earthmoving equipment (excavators and trucks). Yang et al. (2016)
introduced a study using dense trajectories to recognize workers'
actions.
Yang et al. (2014) presented a stochastic activity recognition
method to infer two-state tower crane activities (concrete pouring
and nonconcrete material movement) using crane jib trajectories
and site layout information. In the method, the jib angle trajectory
is tracked with a 2D-to-3D rigid pose tracking algorithm, and a
probabilistic graph model is introduced to process the tracking
results as well as recognize crane activities.
Previous studies on vision-based construction activity recogni-
tion were likely influenced by the object detection methods. As a
result, they primarily focused on limited types of activities con-
ducted by those objects, which are easy to detect based on the hand-
crafted features. These methods can hardly be extended to analyze
other activities. Furthermore, they especially focus on single activ-
ity recognition, where an image contains only the execution of a
single activity by one or a few objects. However, overall awareness
of the states and issues of project-level tasks, e.g., resource leveling,
progress tracking, and productivity analysis, requires the informa-
tion of diverse, concurrent activities. There is a need for such a
technique that can detect various objects in site images and recog-
nize the construction activities relevant to them.
Method Development
This study addresses the recognition problem of interaction activ-
ities. For the sake of simplicity, the term activities is used to refer
to the interaction activities hereafter. The method for construction
activity recognition consists of two steps: object detection and ac-
tivity recognition. To detect objects in site images, Faster R-CNN
(Ren et al. 2015) is used as the region proposal network and
ResNet-50 (He et al. 2015) as the object detection network.
ResNet won several first places in such tracks as ImageNet Clas-
sification, ImageNet Detection, ImageNet Localization, COCO
Detection, and COCO Segmentation (ImageNet and Microsoft
COCO 2015) and therefore represents the state of the art in these
domains.
Given objects detected with the deep-learning model, this study
introduces relevance networks and activity patterns to recognize
construction activities. A relevance network is created based on
two concepts: semantic relevance and spatial relevance. Semantic
relevance represents the likelihood of the cooperation or coexist-
ence between two objects in a construction activity, while spatial
relevance represents two-dimensional (2D) pixel distances in the
image coordinate. This study establishes 20 activity patterns based
on objects in relevance networks to recognize 17 types of construc-
tion activities. Fig. 1 summarizes the overview of system development and application regarding the three workflows: training Faster R-CNN, testing Faster R-CNN, and testing the method.

Fig. 1. Overview of system development and application
Training and Test Data Sets
Data sets are critical to training and testing deep neural networks.
This study focuses on analyzing images taken at the foundation and structure construction stages of building projects, and a total of 22 classes of objects frequently observed are covered. Table 2
describes the object taxonomy, where a three-tier (categories,
subcategories, and classes) tree-view structure is adopted. First,
the objects are grouped into four categories: workers, materials/
products, equipment, and general vehicles. The second tier (i.e., the
subcategories) unfolds under the first tier. For example, the cat-
egory of materials/products is further divided into four subcate-
gories: concrete related, formwork related, rebar related, and
scaffolding related. The last tier is composed of the classes under
the subcategories. For instance, there are two classes in concrete-
related materials/products: concrete in placing and concrete in
finishing. There is an exception with workers and general vehicles,
which are not further divided into subcategories due to the diffi-
culty of recognizing their trades or activities based on visual
features alone.
This study used three image sources to construct the data sets.
ImageNet is a large-scale ontology of images built upon the back-
bone of the WordNet structure (Deng et al. 2009). The authors col-
lected most of the ordinary objects in the vehicles and equipment
classes, e.g., automobiles, backhoes, and cranes, from ImageNet.
Another important image source is the online image repositories,
including Google Images, Baidu Images, Bing Images, and Yahoo
Flickr. Images were searched with keywords relevant to such con-
struction materials as rebar, formwork, scaffolding, and concrete.
In addition, 763 site images taken from four building construction
sites in Hong Kong were included. Unlike the images from the first
two sources, the images that the researchers took on sites reflected
the actual situations in which the proposed method will be applied.
Fig. 2 shows some samples of them.
Finally, a total of 7,790 images were collected and annotated in
the PASCAL VOC format, which represents objects with bounding
boxes and class labels and saves the annotations with XML files
(Everingham and Winn 2007). The open source toolkit LabelImg
(Tzu 2015) was employed to manually perform the annotating pro-
cess, which took around 200 work hours in 3 weeks. The process
cannot be automated since the images collected by the researchers
are still site images from various sources, including the online
image repositories and local construction projects. Table 2 also
shows the statistics of the annotated objects in these images;
6,232 (80%) images were used to build the training set, and
1,558 (20%) images were used as the test set by randomly selecting
one in every five images out of the general data set.
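For illustration of the data preparation just described, the sketch below reads one LabelImg/PASCAL VOC annotation file and applies the one-in-five train/test split; the directory layout and helper name are hypothetical.

```python
# Sketch: reading a PASCAL VOC (LabelImg) annotation and splitting images 80/20 by
# taking every fifth image as a test sample. Paths are hypothetical.
import glob
import xml.etree.ElementTree as ET

def read_voc_annotation(xml_path):
    """Return a list of (class_label, xmin, ymin, xmax, ymax) tuples."""
    root = ET.parse(xml_path).getroot()
    objects = []
    for obj in root.findall("object"):
        label = obj.find("name").text
        box = obj.find("bndbox")
        coords = [int(box.find(k).text) for k in ("xmin", "ymin", "xmax", "ymax")]
        objects.append((label, *coords))
    return objects

annotation_files = sorted(glob.glob("annotations/*.xml"))
test_set = annotation_files[::5]                                # every fifth image (20%)
train_set = [f for f in annotation_files if f not in test_set]  # remaining 80%
print(len(train_set), "training files,", len(test_set), "test files")
```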
Semantic Relevance
This study introduces semantic relevance Rse to represent the
likelihood of the cooperation or coexistence of two objects in a
construction activity.
Table 2. Object Classes and Statistics of the Annotated Objects in the Training and Test Data Sets
Columns: Code; Class; Total b.b.; Training data set: Num. b.b., Size avg., Size std., Size c.v.; Test data set: Num. b.b., Size avg., Size std., Size c.v.
Workers
WKR(a) Worker 2,994 2,404 109 78 0.71 590 112 75 0.67
Materials/products
Concrete related
M-CCP Concrete in placing 255 205 187 71 0.38 50 186 83 0.45
M-CCF Concrete in finishing 184 144 234 98 0.42 40 228 90 0.39
Formwork related
M-FMW Formwork 256 215 226 168 0.75 41 290 232 0.80
M-FSB Formwork of slabs and beams 249 197 566 421 0.74 52 630 505 0.80
M-FWC Formwork of walls and columns 1,273 991 272 189 0.69 282 260 186 0.72
M-FSS Formwork of stairs 175 144 205 78 0.38 31 226 113 0.50
Rebar related
M-REB Rebar 467 384 280 202 0.72 83 293 222 0.76
M-RSB Rebar of slabs and beams 339 283 330 279 0.85 56 294 256 0.87
M-RWC Rebar of walls and columns 1,323 1,041 297 203 0.68 282 277 190 0.69
Scaffolding related
M-SCF Scaffolding 808 639 433 245 0.57 169 457 245 0.54
M-SSF Scaffolding of slab formwork 190 160 300 115 0.38 30 286 42 0.15
Activity-specific equipment
Earthwork related
E-BKH Backhoe 737 595 435 317 0.73 142 423 294 0.70
E-BDZ Bulldozer 225 181 416 127 0.31 44 422 117 0.28
E-DMP Dump truck 236 186 432 270 0.63 50 468 281 0.60
Concrete related
E-CCB Concrete bucket 467 368 219 97 0.44 99 210 93 0.44
E-CCM Concrete mixer 617 496 401 166 0.41 121 454 245 0.54
E-CCP Concrete pump 411 330 285 124 0.44 81 291 129 0.44
Material delivery related
E-CRA Crane 802 638 416 200 0.48 163 400 162 0.41
E-LRY Lorry 437 352 456 130 0.29 85 473 152 0.32
General vehicles
E-VAN(a) Van 451 363 451 143 0.32 88 483 125 0.26
E-CAR(a) Car 1,088 862 383 219 0.57 226 376 279 0.74
Note: Num. b.b. = number of bounding boxes; Total b.b. = total number of bounding boxes. Sizes of ground-truth bounding boxes are defined with pixel
numbers on their diagonal. Size avg. = average of the sizes of bounding boxes; Size std. = sizes' standard deviation; Size c.v. = sizes' coefficient of variance.
(a) The general classes, which can be present in various construction activities and are of limited activity indication capability.
In this study, Rse scores between classes are quantified with a 5-point Likert scale with intervals of 0.25.
Fig. 3 shows the thumb rules for establishing the scores. Rse = 1.0
represents the homogeneous relevance (i.e., two objects are from
the same class), while Rse = 0 represents an outcome of no semantic
relevance between the two classes. Fig. 4 illustrates the establish-
ment of these scores.
Rse = 0.75 represents the relationship between classes within
each subcategory. In the subcategories of activity-specific equip-
ment, it denotes intrasubcategory alternative or cooperative pos-
sibilities. For example, concrete buckets can be an alternative
to concrete pumps, and there can be cooperative operation be-
tween, e.g., concrete mixers and concrete buckets in concrete plac-
ing. Similarly, in materials/products, the score 0.75 represents that
two classes are within the same subcategory and apt to be
temporally successive. They can be similar temporary products,
e.g., formwork of walls and columns and that of slabs and beams
in formwork construction.
Rse = 0.5 represents a strong relationship between classes of
different subcategories. In materials/products, the score indicates
a supportive relationship between two classes. They can be two
classes of materials/products, among which one is being treated,
and another supports the treatment, e.g., concrete in placing and
formwork of slabs and beams. It can be the direct relationship
of handling and being handled between workers and materials/
products. Workers and their appurtenance can also be quantified
at this level, e.g., workers along with a backhoe leveling land and
tiling concrete panels, and workers placing concrete with a con-
crete bucket. Additionally, it can also represent a kind of material
and a kind of equipment designed to treat that material, e.g., con-
crete in placing and a concrete bucket (or a concrete pump).
Fig. 2. Image samples (images by Xiaochun Luo)
Fig. 3. Semantic relevance rules
Rse = 0.25 represents the weak relationship between classes of dif-
ferent subcategories. In the subcategories of equipment (including
activity-specific equipment and general vehicles), it represents the
potential of indirect cooperation between its subcategories. For ex-
ample, a tower crane can cooperate with a concrete mixer indirectly
through a concrete bucket in a concrete-placing activity. In the sub-
categories of materials/products, the score indicates the subsequent
relationship between two classes. For example, formwork of slabs
and beams directly supporting concrete in placing results in Rse = 0.5.
Concrete in placing subsequently becomes concrete in finishing,
whose Rse with the formwork of slabs and beams turns into 0.25.
Moreover, the 0.25 score can also represent the indirect relationship
of handling and being handled between equipment and materials/
products. For example, a crane is indirectly connected to concrete
in placing by a concrete bucket.
Finally, a semantic relevance matrix is established to quantita-
tively represent the relations between the 22 classes, as shown in
Table 3.
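To make the use of this matrix concrete, the sketch below encodes a small excerpt of Table 3 as a symmetric lookup; only a handful of entries are included, and the helper name is illustrative.

```python
# Sketch: a partial, symmetric lookup of semantic relevance scores (excerpt of Table 3).
R_SE = {
    ("WKR", "M-RWC"): 0.5,    # worker handling rebar of walls and columns
    ("WKR", "E-BKH"): 0.5,    # worker working alongside a backhoe
    ("E-BKH", "E-DMP"): 0.75, # backhoe and dump truck cooperating in earthwork
    ("E-CCB", "E-CCM"): 0.75, # concrete bucket and concrete mixer in concrete placing
    ("M-CCP", "E-CRA"): 0.25, # crane indirectly linked to concrete in placing
}

def semantic_relevance(class_a, class_b):
    """Return R_se for two object classes (1.0 for the same class, 0 if unlisted)."""
    if class_a == class_b:
        return 1.0
    return R_SE.get((class_a, class_b), R_SE.get((class_b, class_a), 0.0))

print(semantic_relevance("E-DMP", "E-BKH"))  # 0.75
```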
Spatial Relevance
Camera angles and distances can significantly influence the spatial
relationship of objects in the 2D image coordinates. For example,
when photographing workers, materials, and equipment on the top
working floor of a building project, a horizontal, or almost horizon-
tal, shooting angle can compact their visual distances in the image
such that some objects seem to occlude or overlap others; that is,
relatively horizontal shooting angles introduce a distance compaction
problem. On the contrary, a vertical angle will result in a more
accurate record of their physical distances.
To represent the complicated physical proximity in the field of
view, the researchers introduce the concept of spatial relevance Rsp.
Three situations in defining the spatial relevance of objects A and B
are considered based on the relation of their bounding boxes: one
bounding box contains, intersects, or is separated from another
(Fig. 5). In the first two situations, the definition of the overlap in
evaluating the object detection effectiveness is referenced. It is de-
fined as the intersection over union (IoU) of the predicted bounding
box and the ground-truth bounding box. Take the situations in Fig. 6
as an example to illustrate the calculation of spatial relevance.

Fig. 4. Illustration of semantic relevance rules
In the first situation, one bounding box contains another. It sig-
nals that their Rsp has achieved the maximum 1.0. This definition
acknowledges that workers (WKR-3, WKR-4, WKR-5, WKR-6,
WKR-10, and WKR-20) working in the same area (M-RSB-11)
should have the same Rsp value (i.e., 1), even though the bounding
box of one worker (WKR-10) is smaller than that of another (WKR-6)
because of the difference in their camera distances.
In the second situation, one bounding box intersects with
another. The spatial relevance between two objects (M-RSB-11
and WKR-0) is represented with the intersection area (in transpar-
ent red) and the minimum bounding box area (WKR-0). This in-
tersection relevance is set within [0.5, 1) to reflect that objects in
this situation are less relevant than those in the first situation. The
first two situations are represented jointly with the first band of
Eq. (1), where the function area derives the area of a bounding box.
In the third situation, there is no intersection between two
bounding boxes. The spatial relevance between these two separated
objects is defined with lengths and distances, rather than areas.
It falls in (0, 0.5) to reflect that objects in this situation are less
relevant than those in the second situation. Its value is defined with
the second band of Eq. (1), where the function side returns the mini-
mum side length of a bounding box, and the function dist computes
the minimum distance between two bounding boxes. In Fig. 6, the
minimum distance between M-RSB-11 and WKR-1 is illustrated
with the yellow line and that between M-RSB-11 and M-FMW
with the blue line.
Consequently, the spatial relevance between any two objects is a scalar in (0, 1] without units:

R_{sp}(A,B) =
\begin{cases}
\dfrac{1}{2}\left(1 + \dfrac{\operatorname{area}(A \cap B)}{\min(\operatorname{area}(A),\, \operatorname{area}(B))}\right), & \text{if } A \cap B \neq \emptyset \\[8pt]
\dfrac{\operatorname{side}(A) + \operatorname{side}(B)}{2\,[\operatorname{side}(A) + \operatorname{side}(B) + \operatorname{dist}(A,B)]}, & \text{otherwise}
\end{cases}
\qquad (1)
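For illustration, Eq. (1) can be evaluated directly from two bounding boxes given as (xmin, ymin, xmax, ymax) pixel tuples, as in the sketch below; the helper names mirror the functions area, side, and dist in the equation.

```python
# Sketch: spatial relevance R_sp of Eq. (1) for two axis-aligned bounding boxes
# given as (xmin, ymin, xmax, ymax) pixel tuples.
def area(b):
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def side(b):
    return min(b[2] - b[0], b[3] - b[1])  # minimum side length of a box

def dist(a, b):
    # Minimum distance between two boxes (0 if they touch or overlap)
    dx = max(b[0] - a[2], a[0] - b[2], 0)
    dy = max(b[1] - a[3], a[1] - b[3], 0)
    return (dx ** 2 + dy ** 2) ** 0.5

def spatial_relevance(a, b):
    # Intersection rectangle of the two boxes
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    if inter > 0:  # containment or intersection: R_sp in [0.5, 1]
        return 0.5 * (1 + inter / min(area(a), area(b)))
    # separated boxes: R_sp in (0, 0.5)
    s = side(a) + side(b)
    return s / (2 * (s + dist(a, b)))

print(spatial_relevance((0, 0, 10, 10), (2, 2, 5, 5)))     # containment -> 1.0
print(spatial_relevance((0, 0, 10, 10), (20, 0, 30, 10)))  # separated -> < 0.5
```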
Relevance Networks
Given the semantic relevance Rse and spatial relevance Rsp of two
objects, their relevance is defined as the product of Rse and Rsp.
A relevance network is composed of nodes and edges. A node
represents a detected object in the form of a circle, whose center
corresponds to the center of its bounding box in the image. An edge
represents that the two connected objects are relevant, and the
relevance score between the two objects determines the width of
the edge.
This study differentiates between workers and nonworker nodes
in creating relevance networks because the direct connection
between workers presents low activity indication capability in
comparison with their connection with nonworker objects.
Table 3. Semantic Relevance Matrix of the 22 Classes
Class WKR M-CCP M-CCF M-FMW M-FSB M-FWC M-FSS M-REB M-RSB M-RWC M-SCF M-SSF E-BKH E-BDZ E-DMP E-CCB E-CCM E-CCP E-CRA E-LRY E-VAN E-CAR
WKR 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.25 0.25 0.5 0.5 0.25 0.25 0.5 0.25 0.25
M-CCP 0.5 1.0 0.75 0 0.5 0.5 0.5 0 0.5 0.5 0 0 0 0.25 0 0.5 0.5 0.5 0.25 0 0 0
M-CCF 0.5 0.75 1.0 0 0.25 0.25 0.25 0 0.25 0.25 0 0 0 0 0 0.25 0.25 0.25 0 0 0 0
M-FMW 0.5 0 0 1.0 0.75 0.75 0.75 0 0.25 0.5 0.25 0.25 0 0 0 0 0 0 0.5 0.5 0 0
M-FSB 0.5 0.5 0.25 0.75 1.0 0.75 0.75 0.5 0.5 0.25 0.25 0.25 0 0 0 0 0 0 0 0 0 0
M-FWC 0.5 0.5 0.25 0.75 0.75 1.0 0.75 0.25 0.25 0.5 0.25 0.25 0 0 0 0 0 0 0 0 0 0
M-FSS 0.5 0.5 0.25 0.75 0.75 0.75 1.0 0.25 0.25 0.25 0.25 0.25 0 0 0 0 0 0 0 0 0 0
M-REB 0.5 0 0 0 0.5 0.25 0.25 1.0 0.75 0.75 0 0 0 0 0 0 0 0 0.5 0.5 0 0
M-RSB 0.5 0.5 0.25 0.25 0.5 0.25 0.25 0.75 1.0 0.75 0 0 0 0 0 0.5 0.5 0.5 0 0 0 0
M-RWC 0.5 0.5 0.25 0.5 0.25 0.5 0.25 0.75 0.75 1.0 0 0 0 0 0 0.5 0.5 0.5 0 0 0 0
M-SCF 0.5 0 0 0.25 0.25 0.25 0.25 0 0 0 1.0 0.75 0 0 0 0 0 0 0 0 0 0
M-SSF 0.5 0 0 0.25 0.25 0.25 0.25 0 0 0 0.75 1.0 0 0 0 0 0 0 0 0 0 0
E-BKH 0.5 0 0 0 0 0 0 0 0 0 0 0 1.0 0.75 0.75 0 0 0 0 0 0 0
E-BDZ 0.25 0.25 0 0 0 0 0 0 0 0 0 0 0.75 1.0 0.75 0 0 0 0 0 0 0
E-DMP 0.25 0 0 0 0 0 0 0 0 0 0 0 0.75 0.75 1.0 0 0 0 0 0 0 0
E-CCB 0.5 0.5 0.25 0 0 0 0 0 0.5 0.5 0 0 0 0 0 1.0 0.75 0.75 0.5 0 0 0
E-CCM 0.5 0.5 0.25 0 0 0 0 0 0.5 0.5 0 0 0 0 0 0.75 1.0 0.75 0.25 0 0 0
E-CCP 0.25 0.5 0.25 0 0 0 0 0 0.5 0.5 0 0 0 0 0 0.75 0.75 1.0 0 0 0 0
E-CRA 0.25 0.25 0 0.5 0 0 0 0.5 0 0 0 0 0 0 0 0.5 0.25 0 1.0 0.75 0 0
E-LRY 0.5 0 0 0.5 0 0 0 0.5 0 0 0 0 0 0 0 0 0 0 0.75 1.0 0 0
E-VAN 0.25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0 0
E-CAR 0.25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0
Fig. 5. Spatial relationship
Therefore, in the beginning, nonworker objects are added into the
network, and the sum of all relevance scores between each object and
its connected nonworker objects is used as the scale of its radius.
Then, worker nodes are added into the previously established
network. A worker node's most relevant nonworker node N is iden-
tified first, and then the worker node is attached to N if their rel-
evance score is higher than the threshold. The radius of the worker
node is scaled by its relevance score with N, or by zero, which
represents that the worker is working independently.
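The construction procedure just described can be sketched as follows; detections are assumed to be (class code, bounding box) pairs, the relevance callable stands for the product of semantic and spatial relevance, and the 0.25 threshold anticipates the value used later in the experiments.

```python
# Sketch: building a relevance network from detections given as (class_code, box)
# pairs. `relevance` is any callable returning the product of semantic and spatial
# relevance for two detections (e.g., composed from the earlier sketches).
def build_relevance_network(detections, relevance, threshold=0.25):
    nonworkers = [d for d in detections if d[0] != "WKR"]
    workers = [d for d in detections if d[0] == "WKR"]
    edges = []  # (node_a, node_b, relevance score)

    # 1) Connect every pair of sufficiently relevant nonworker objects.
    for i, a in enumerate(nonworkers):
        for b in nonworkers[i + 1:]:
            r = relevance(a, b)
            if r >= threshold:
                edges.append((a, b, r))

    # 2) Attach each worker only to its single most relevant nonworker node, if any.
    for w in workers:
        scored = [(n, relevance(w, n)) for n in nonworkers]
        if scored:
            best, r = max(scored, key=lambda item: item[1])
            if r >= threshold:
                edges.append((w, best, r))
    return edges
```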
Activity Recognition Patterns
This study focuses on recognizing construction activities using vis-
ually detected objects and develops 20 activity patterns based on
the activity indication capabilities of these objects. The 20 patterns
are categorized into four groups regarding their combination
modes, as summarized in Table 4. The activity patterns in the first
group use activity-specific construction equipment to indicate con-
struction activities directly according to the equipment's instanta-
neity. For example, the presence of concrete mixers (E-CCM) or
concrete pump trucks (E-CCP) is transient and informative enough
to show that placing concrete is under way. In practice, not all
equipment presence is transient. Therefore, a combination of differ-
ent pieces of large equipment is more reasonable to derive ongoing
construction activities on typically congested jobsites. For example,
the researchers use the combination of a bulldozer (E-BDZ) and a
dump truck (E-DMP) to represent land leveling.
The activity patterns in the second group are composed merely
of construction materials because of their strong activity indication
capability. For example, concrete in placing (M-CCP), if detected,
can immediately suggest that concrete placing is underway. Sim-
ilarly, the key characteristic of these construction materials is the
instantaneity of their visual features (e.g., colors and image intensity
values), which makes them unique and visually detectable in
site images. Finally, only two patterns are established in this group because this
special requirement narrows down the scope of qualifying material types.
In the third group of patterns, construction activities are derived
jointly by workers and equipment. One of the typical examples
in this group is constructing a foundation with workers and an excavator.
Fig. 6. Illustration of spatial relevance calculation
Table 4. Activity Patterns
Code Pattern Activity
Directly by equipment
DE1 E-BDZ + E-DMP Leveling land
DE2 E-BKH + E-DMP Excavating for foundation
DE3 E-CCM Placing concrete
DE4 E-CCP Placing concrete
DE5 E-LRY Shipping materials
Directly by materials
DM1 M-CCP Placing concrete
DM2 M-CCF Finishing concrete
Jointly by workers and equipment
WE1 WKR + E-BKH Installing foundation components
WE2 WKR + E-CCB Placing concrete
WE3 WKR + E-VAN Transporting goods
WE4 WKR + E-CAR Transporting people
Jointly by workers and materials
WM1 WKR + M-FMW Machining or transferring formwork
WM2 WKR + M-FSB Building formwork of slabs and beams
WM3 WKR + M-FWC Building formwork of walls and columns
WM4 WKR + M-FSS Building formwork of stairs
WM5 WKR + M-REB Machining or transferring rebar
WM6 WKR + M-RSB Fixing rebar of slabs and beams
WM7 WKR + M-RWC Erecting rebar of walls and columns
WM8 WKR + M-SCF Building scaffolding systems
WM9 WKR + M-SSF Building scaffolding for slab formwork
In this pattern, an excavator can be used to lift founda-
tion components, e.g., precast slabs or drainage pipes, and workers
work together to direct the installation process and relocate com-
ponents into place.
In the fourth group, it is proposed to use the concurrence of
workers and materials to indicate ongoing activities. Other than the
materials identified in the second group, most of the construction
materials cannot by themselves indicate whether a construction activity
relevant to them is going on. In this case, workers' concurrence can
indicate jointly that an activity is under way; e.g., fixing rebar of
columns and walls can be detected when rebar of columns and
walls is detected to be relevant to at least one worker.
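As an illustrative sketch (covering only a few of the 20 patterns in Table 4), activity recognition can then be phrased as testing the classes connected in a relevance network against each pattern; the edge format follows the earlier network-building sketch.

```python
# Sketch: matching a few of the activity patterns in Table 4 against a relevance
# network. `edges` follows the earlier sketch: (node_a, node_b, relevance) with each
# node a (class_code, box) pair; `detected_classes` is the set of all detected classes.
PATTERNS = {
    "DE1": ({"E-BDZ", "E-DMP"}, "Leveling land"),
    "DE3": ({"E-CCM"}, "Placing concrete"),
    "DM1": ({"M-CCP"}, "Placing concrete"),
    "WE2": ({"WKR", "E-CCB"}, "Placing concrete"),
    "WM7": ({"WKR", "M-RWC"}, "Erecting rebar of walls and columns"),
}

def recognize_activities(edges, detected_classes):
    """Return the set of activities recognized in one image."""
    linked_class_pairs = {frozenset((a[0], b[0])) for a, b, _ in edges}
    activities = set()
    for required_classes, activity in PATTERNS.values():
        if len(required_classes) == 1:
            # Single-object patterns (e.g., DE3, DM1) fire on mere presence.
            if required_classes <= detected_classes:
                activities.add(activity)
        elif frozenset(required_classes) in linked_class_pairs:
            # Two-object patterns require the pair to be connected in the network.
            activities.add(activity)
    return activities
```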
Experiments and Results
Object Detection
Evaluation Metrics
The first experiment evaluated object detection performance with
precision-recall curves. The number of correct detections is de-
noted as true positive (TP), the number of wrong detections as false
positive (FP), and the number of missed objects as false negative
(FN). Given the three definitions, precision is the first metric, which
is the ratio of TP to TP + FP, and recall is the second one, which
is the ratio of TP to TP + FN. Referencing the requirements in
PASCAL VOC object detection challenges (Everingham and Winn
2012), precision-recall curves are obtained by setting the precision
for recall r to the maximum precision obtained for any recall r′ > r.
This operation can be effectively performed by sorting all detec-
tions according to their confidence scores in descending order.
Eventually, the average precision (AP) measure of a specific class
is computed as the area under its curve, and the mAP is defined as
the mean of the APs of all classes.
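For illustration, the AP computation described above (the precision envelope over increasing recall, integrated under the curve) can be sketched as follows; detections are assumed to be (confidence, is-true-positive) pairs.

```python
# Sketch: average precision from a list of (confidence, is_true_positive) detections
# and the number of ground-truth objects, using the precision envelope described above.
def average_precision(detections, num_ground_truth):
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    points = []  # (recall, precision) after each detection, in descending confidence
    for _, is_tp in detections:
        tp, fp = (tp + 1, fp) if is_tp else (tp, fp + 1)
        points.append((tp / num_ground_truth, tp / (tp + fp)))

    ap, prev_recall = 0.0, 0.0
    for i, (recall, _) in enumerate(points):
        # precision for this recall = max precision at any recall r' >= r
        envelope = max(p for r, p in points[i:])
        ap += (recall - prev_recall) * envelope
        prev_recall = recall
    return ap

print(average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truth=2))
```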
Training Faster R-CNN
In the beginning, the training data set was used to fine-tune the
pretrained ResNet-50 (He et al. 2015) in 100,000 iterations, where
the minibatch size was 64, the momentum 0.9, and the starting
learning rate 0.001, which stepped down to 0.1 times the original
after each 25,000 iterations (i.e., 0.001 base learning rate, step
learning policy, 0.1 gamma, and 25,000 step size). The Faster
R-CNN model consists of two modules: one predicts class-specific
scores and another regresses bounding box locations from the initial
recommended boxes, which are referred to as anchors by Ren et al.
(2015). The training loss combines the regression loss and the
classification loss over each minibatch of the training set (Ren et al. 2015).
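The step learning rate policy quoted above (0.001 base learning rate, 0.1 gamma, 25,000-iteration step size) amounts to the schedule sketched below; the function name is illustrative.

```python
# Sketch: the step learning-rate policy described above (base_lr = 0.001,
# gamma = 0.1, step size = 25,000 iterations, 100,000 iterations in total).
def learning_rate(iteration, base_lr=0.001, gamma=0.1, step_size=25_000):
    return base_lr * gamma ** (iteration // step_size)

assert learning_rate(0) == 0.001
assert learning_rate(25_000) == 0.0001
assert abs(learning_rate(99_999) - 1e-6) < 1e-12
```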
The training process with unstable training losses, as shown in
Fig. 7, qualitatively indicates a noisy converging process, probably
because of the inconsistencies and missed annotations in the train-
ing set. Objects with vague edges are susceptible to inconsistent
annotations, while small objects are prone to missing annotations.
The training took around 14 h on a computer with an Ubuntu 16.04
LTS operation system, an NVidia GTX GeForce 1080 graphics
card, 16-GB random access memory, and an Intel i7-6700K
processor.
Results
APs synthesize three critical factors that determine the object de-
tection performance. First, the confidence scores sort the detections
and acknowledge that those with higher scores have higher prob-
abilities of being judged as positive detections. As a result, precision
drops as recall increases in the precision-recall curves in Fig. 8.
Second, the IoU of the ground-truth bounding box and a recom-
mended bounding box determines whether a detection is positive. The
lower its threshold is, the higher the precision and recall will be.
However, a lower value can allow (or result in) a larger localiza-
tion error between the ground-truth bounding box and the recom-
mended bounding box. Third, nonmaximum suppression
(Neubeck and Van Gool 2006) removes the repeated recommenda-
tions for one object. The lower its overlap threshold is, the
more recommendations will be removed. Therefore, bounding
boxes that stand close together are apt to be suppressed, which reduces
recall. This threshold is conventionally set at 0.7 (PASCAL
VOC 2012; Ren et al. 2015), and this value was adopted in the
experiments.
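For completeness, a greedy nonmaximum suppression pass with the conventional 0.7 overlap threshold can be sketched as follows; boxes are assumed to be (xmin, ymin, xmax, ymax) tuples.

```python
# Sketch: greedy nonmaximum suppression with the conventional 0.7 overlap threshold.
def iou(a, b):
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def non_maximum_suppression(boxes, scores, overlap_threshold=0.7):
    """Keep the highest-scoring box and drop others that overlap it too strongly."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < overlap_threshold]
    return keep
```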
Fig. 7. Training process

Fig. 8 shows five precision-recall curves of the 22 classes
with the IoU thresholds for positive detection of 0.1, 0.3, 0.5,
0.7, and 0.9, respectively. The results with the threshold 0.5 are
conventionally used for comparison (PASCAL VOC 2012; Ren
et al. 2015). Accordingly, the mAP of the proposed model is
67.3%, which is slightly higher than 67.0% of Faster R-CNN +
VGG-16 trained and tested with the PASCAL VOC (2012) data
sets and lower than 75.9% of Faster R-CNN + VGG-16 trained
and tested with the COCO, PASCAL VOC 2007, and PASCAL
VOC (2012) data sets (Girshick et al. 2016).
Obviously, comparison bias is apt to occur because the training
and test data sets differ in size and in object characteristics,
e.g., occlusion levels, sizes, and views. Nevertheless, for preliminary
comparison, the mAP and APs
of six classes (three with the best APs and three with the worst) are
listed in Table 5. The proposed model's performance arrives at
the state of the art regarding mAP but presents a higher variance,
i.e., 0.31 coefficient of variance (CV), compared with 0.23 and
0.16 in the two references.
It was found that the worst detection performance of the model
is on raw construction materials. APs of M-FMW, M-CCP, and
M-REB are 23.1, 26.5, and 32.8%, respectively (Table 6). The
possible reason is that those materials are of free form. It is difficult
for human experts to establish their boundaries, and annotations
of them are easily subject to inconsistency. As shown in Table 6,
these APs are sensitive to the IoU threshold for positive detection.
On the contrary, the model presents the best APs on E-CCM,
E-BKH, and M-SSF (i.e., 90.6, 90.3, and 88.8%). These classes
have relatively distinguishable visual features, e.g., clear edges
and stable textures.
Worker detection is critical to the proposed activity recognition
method since 14 (70%) activity patterns use workers as one of the
primary cues.
Table 5. Detection Performance Comparison Regarding APs
Class Result
Faster R-CNN + VGG-16 trained with PASCAL VOC (2012) data set(a)
Cat 87.3
Dog 86.8
Airplane 82.3
Bottle 45.2
Chair 42.2
Plant 34.5
mAP 67.0(b)
Standard deviation 15.9(b)
Coefficient of variance 0.23(b)
Faster R-CNN + VGG-16 trained with COCO, PASCAL VOC 2007, and PASCAL VOC (2012) data sets(a)
Cat 91.3
Dog 89.0
Airplane 87.4
Table 59.0
Chair 54.9
Plant 52.2
mAP 75.9(b)
Standard deviation 12.4(b)
Coefficient of variance 0.16(b)
Faster R-CNN + ResNet-50 trained with the data set in this study
E-CCM 90.6
E-BKH 90.3
M-SSF 88.8
M-REB 32.8
M-CCP 26.5
M-FMW 23.1
mAP 67.3
Standard deviation 20.7
Coefficient of variance 0.31
(a) Data are extracted from Table 7 of Ren et al. (2015).
(b) Results are derived from the APs of the 20 classes of Ren et al. (2015).
Fig. 8. Precision-recall curves with different IoU thresholds for positive detection
The reference APs of person detection are 75.9% using Faster R-CNN + VGG-16 trained with the PASCAL VOC (2012) data set; 82.3% using Faster R-CNN + VGG-16 trained with the COCO, PASCAL VOC 2007, and PASCAL VOC (2012) data sets (Ren et al. 2015); and 79.3% using ResNet-50 on the ImageNet validation set (He et al. 2015). The AP of worker detection
is 60.1%, which is lower than the reference APs of person detection from the computer vision community. In the implementa-
tion of Ren et al. (2015), given the scaled and fixed input images
with the width of 800 pixels and the height of 600 pixels, the mini-
mum objects that the model can detect are determined by the
minimum size of anchors, which is around 87 pixels wide and
175 pixels high. Therefore, this discrepancy could be attributed
to the fact that worker objects in the training and test data sets
are of relatively low resolution; they have the smallest average
sizes, i.e., 109 pixels in the training data set and 112 pixels in the
test data set, as shown in Table 2. In practice, severe occlusion to
workers due to temporary facilities and construction equipment on
cluttered sites can aggravate their undetectability.
Although the results are comparable with those in the computer
vision community, there is still room for improvement from a prac-
tical perspective. Fig. 8 and Table 6 illustrate that lowering the
IoU threshold for positive detection is an immediate, but compro-
mising, solution to improve the performance of the object detec-
tion. For example, when the IoU threshold is lowered to 0.3 and
0.1, the mAP is improved to 76.6 and 78.2% respectively from
67.3%, and the AP of workers (WKR) is improved to 67.1 and
70.5%, respectively, from 60.1% (Table 6).
Activity Recognition
Evaluation Metrics
The evaluation of activity recognition performance depends on
three similar basic definitions of TP, FP, and FN: TP represents
the number of correct recognitions, FP the number of wrong rec-
ognitions, and FN the number of missed activities. If a wrong
recognition occurs, identifying which ground-truth activity raises
the wrong recognition is difficult because diverse activities can be
observed in an image. Therefore, traditional confusion matrices that
are frequently used to evaluate single-mode classification systems,
in which an image contains only one execution of an activity, are
unsuitable. Consequently, the performance of diverse activity rec-
ognition is evaluated in terms of precision (i.e., the ratio of TP to
TP + FP) and recall (i.e., the ratio of TP to TP + FN).
Exemplary Cases of Activity Recognition
Before proceeding to evaluate activity recognition performance,
three cases of activity recognition with the proposed method are
described (Fig. 9). The left column in Fig. 9 shows object detection
results with labels in the form of class code + id + (detection
confidence). The right column shows activity recognition results
in the form of relevance networks and activity patterns. Nonworker
nodes in relevance networks are labeled in the form of class code +
id + (total relevance score), while worker nodes are without the
total relevance score part. The relevance threshold is set to 0.25 to
divide the network. It means that two objects are relevant only
when their relevance is not less than that value. Moreover, division
results in subnetworks and supports identifying group activities
according to the connections among the nonworker nodes in a
subnetwork.
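Dividing the thresholded network into subnetworks, as just described, is a connected-components computation; a minimal sketch (the edge format follows the earlier network-building sketch, with the relevance score dropped) is given below.

```python
# Sketch: splitting a thresholded relevance network into subnetworks (connected
# components) so that group activities can be read off the nonworker nodes they contain.
from collections import defaultdict

def connected_components(edges):
    """edges: iterable of (node_a, node_b) pairs; returns a list of node sets."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components
```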
Case 1: In this case, three activities are recognized. The first
activity is erecting rebar of walls and columns, which involves
two activity entities, i.e., WKR-1 + M-RWC-1 (Pattern WM7)
and WKR-3 + M-RWC-1 (Pattern WM7). The second activity is
building scaffolding systems, which is conducted by WKR-0 +
M-SCF-9 (Pattern WM8). The third activity is building formwork
of walls and columns by WKR-2 + M-FWC-0 (Pattern WM3). The
first and third activities are part of the group activity of build-
ing formwork. They are connected through nonworker nodes
M-RWC-10, M-RWC-6, M-RWC-2, and M-RWC-0.
Case 2: The model detects four workers: WKR-0, WKR-1, WKR-2, and WKR-3. Three of them are relevant to M-FWC-8. As a result, there are three activity entities, i.e., WKR-1 + M-FWC-8, WKR-2 + M-FWC-8, and WKR-3 + M-FWC-8, and one activity, i.e., building formwork of walls and columns. WKR-0 is in the proximity of M-SCF-11 and is believed to be building scaffolding systems according to Pattern WM8. Concrete mixer E-CCM-1 is detected, which matches Pattern DE3 and shows that placing concrete is under way. Additionally, a wrong object detection occurs in this case: container M-FMW-5 is wrongly detected since it is raw formwork material. An FP activity recognition is nevertheless avoided because no worker is found relevant to M-FMW-5; conversely, had a worker been found relevant to it, an FP recognition would have occurred.
Case 3: In this case, three activities are detected. The first activity is placing concrete, which is conducted by WKR-0 + E-CCB-0 in line with Pattern WE2. However, this is an FP recognition because no such activity is taking place: E-CCB-0 was temporarily placed there, and WKR-0 happened to be in its proximity. The second
activity involves WKR-4 + M-FWC-5 and is established as build-
ing formwork of walls and columns. The last activity is building
formwork of slabs and beams, which consists of three activity en-
tities between three workers WKR-1, WKR-2, and WKR-3 and the
product M-FSB-4. A group activity of building formwork by four
workers WKR-1, WKR-2, WKR-3, and WKR-4 can be established
by routing from M-FSB-4 to M-FSB-5 in the subnetwork.
Results
This study focuses on using still site images, in which some ob-
jects cannot be effectively identified even by human experts. The
preliminary evaluation of activity recognition performance was
conducted in four steps. First, 200 images were randomly selected
from the images that the researchers took from building projects in
Hong Kong. It is believed that this consideration is helpful to
increase the external validity of the experimental results. After that, the researchers manually annotated and counted activities as explained in the previous three cases. In this process, single activities were aggregated into specific activity types, and group activities were ignored. Then, the proposed method was used to recognize activities, with an IoU threshold for positive detection of 0.5. Finally, the performance of the method in terms of recall and precision was evaluated. The experiment resulted in 62.4% precision and 87.3% recall (151 TP, 91 FP, and 22 FN recognitions), which indicates that the proposed method holds the potential to recognize construction activities and that there is still room for improvement.

Table 6. Object Detection Performance (AP, %) under Various IoU Thresholds for Positive Detection

Class    IoU=0.1   IoU=0.3   IoU=0.5   IoU=0.7   IoU=0.9
mAP       78.2      76.6      69.7      50.5       7.0
WKR       70.5      67.1      60.1      32.8       0.6
M-CCP     77.8      62.7      26.5      10.7       0.6
M-CCF     83.5      82.1      60.6      36.7       9.1
M-FMW     30.7      28.6      23.1      12.2       0.1
M-FSB     63.5      59.4      49.0      28.2       1.5
M-FWC     82.6      81.5      74.4      45.5       3.0
M-FSS     82.7      77.3      60.2      23.1       0.0
M-REB     53.6      46.4      32.8      17.5       9.1
M-RSB     74.6      69.5      54.8      25.7       2.3
M-RWC     67.4      63.0      51.4      30.4       1.8
M-SCF     81.7      81.0      70.6      49.4       8.3
M-SSF     89.1      88.8      88.8      78.5       6.5
E-BKH     90.4      90.4      90.3      81.2      13.1
E-BDZ     84.2      84.0      81.8      77.9       3.5
E-DMP     80.6      80.5      79.7      71.3       8.2
E-CCB     87.1      83.6      81.5      70.2      15.5
E-CCM     90.6      90.6      90.6      87.9      17.8
E-CCP     89.1      88.5      87.5      69.5       4.5
E-CRA     80.4      76.9      68.6      51.0       4.2
E-LRY     87.4      87.4      86.1      76.6      15.8
E-VAN     85.1      84.9      84.9      78.4      22.6
E-CAR     77.5      76.7      76.4      69.8      13.6
Discussion
Research Challenges and Contribution to Knowledge
Recognizing diverse, concurrent activities executed by multiple ob-
jects in individual images is a challenging task. In a simple case of
activity detection, where an image contains only the execution of a single activity by one or a few objects, the objective of such a system is to correctly classify the image. For example, Golparvar-Fard et al. (2013) used SVM classifiers to classify different actions of an excavator's operations in trimmed video clips, which can be viewed as time-lapse images, and reported an average action recognition accuracy of 86.33%, comparable to the performance of object classification at that time. Yang et al. (2016) investigated using dense trajectories to recognize actions of construction workers in trimmed video clips and reported an average accuracy of 59%. Similarly, they used SVM classifiers to classify workers' actions because each video clip contains only one action. However, in more general cases where multiple objects are present and diverse activities take place concurrently, classification algorithm-based solutions are unsuitable.
A new method that integrates deep learning-based object detection and relevance network-based activity pattern recognition can
tackle this challenge. This study contributes to the body of knowl-
edge in two aspects. First, the state-of-the-art deep-learning tech-
nology was employed to detect the frequently observed 22 classes
of objects in site images. To implement this plan, the researchers
collected and annotated the training data set to fine-tune the Faster
R-CNN model and evaluated the performance of the model on the
test data set. It was found that the deep-learning model presents
consistently high APs on objects with clean boundaries and invari-
ant forms in comparison with those published in the computer
vision domain. However, for those objects with free forms or
ambiguous edges, like raw rebar and formwork materials, the deep-
learning model presents low APs.
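To make the fine-tuning step concrete, the sketch below shows how a Faster R-CNN detector with a ResNet-50 backbone can be adapted to the 22 object classes using the torchvision library. This is a hedged illustration of the general procedure only; the authors' original implementation is not described here and may use a different framework, backbone configuration, and hyperparameters, and `train_loader` is a hypothetical data loader built from the annotated images.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 22 + 1  # 22 construction-related classes plus background


def build_model():
    """Start from a detector pretrained on a generic benchmark and replace its box head."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)
    return model


def fine_tune(model, train_loader, epochs=10, lr=0.005):
    """Fine-tune on batches of (images, targets); each target holds 'boxes' and 'labels'."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
    model.train()
    for _ in range(epochs):
        for images, targets in train_loader:   # hypothetical loader over annotated site images
            loss_dict = model(images, targets)  # returns RPN and box-head losses
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```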
Second, the researchers developed a set of rules for creating relevance networks and 20 activity patterns to recognize construction activities. Semantic relevance and spatial relevance were introduced to build relevance networks. Semantic relevance represents the likelihood that two objects appear together in the same construction activity; spatial relevance is defined with 2D pixel distances in the image coordinates and represents the observable possibility that they are involved in the same activity. Consequently, the relevance of two objects is formulated as the product of their semantic relevance and spatial relevance, as sketched below. Furthermore, relevance networks can serve as a tool to identify latent group activities in site images.
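A minimal sketch of this product formulation follows. The semantic lookup values, the exponential distance decay, and the scale parameter are illustrative assumptions for exposition; the paper's actual semantic relevance table and spatial relevance function are defined in earlier sections and are not reproduced here.

```python
import math

# Hypothetical semantic relevance lookup: likelihood that two classes co-occur in one activity.
SEMANTIC = {("WKR", "M-FWC"): 0.9, ("WKR", "M-SCF"): 0.8, ("WKR", "E-CCB"): 0.7}


def spatial_relevance(box_a, box_b, scale=300.0):
    """Illustrative decay of relevance with the 2D pixel distance between box centers."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return math.exp(-math.hypot(ax - bx, ay - by) / scale)


def relevance(class_a, box_a, class_b, box_b):
    """Overall relevance as the product of semantic and spatial relevance."""
    semantic = SEMANTIC.get((class_a, class_b)) or SEMANTIC.get((class_b, class_a), 0.0)
    return semantic * spatial_relevance(box_a, box_b)


# A worker close to a formwork panel scores high; the same pair far apart scores low.
print(relevance("WKR", (100, 100, 160, 280), "M-FWC", (180, 120, 320, 300)))   # ~0.60
print(relevance("WKR", (100, 100, 160, 280), "M-FWC", (900, 700, 1100, 950)))  # ~0.02
```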
Practical Implications
The method in this study is designed with several practical expectations, i.e., using site images, detecting and analyzing concurrent construction activities, and being fully automatic. Therefore, it is possible to save managers' valuable time in data collection and manipulation for on-site monitoring and concentrate their attention on solving problems that necessarily demand their expertise. More specifically, this method could nourish several potential applications. First, the method can be used to index and classify daily site images, which are usually taken for various management purposes, e.g., quality control, safety management, and progress records, but often without textual descriptions; automated indexing and classification of these images should be helpful. Second, because surveillance videos can be decomposed into time-lapse images, the method can be used to continuously monitor the construction resources involved in specific activities in terms of working hours, as sketched after this paragraph. Third, given site videos, it is possible to detect the states of an activity (i.e., not started, just started, ongoing, and completed); therefore, activity progress deviations against the construction program can be established in real time.
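As a sketch of the second application, the snippet below samples frames from a surveillance video at a fixed interval and accumulates per-activity durations from the recognition results. The `recognize_activities` function stands in for the proposed pipeline and is a placeholder, as are the file name and the one-minute sampling interval; OpenCV is assumed to be available for decoding the footage.

```python
import cv2  # OpenCV, assumed available for reading surveillance footage
from collections import Counter


def estimate_activity_hours(video_path, recognize_activities, interval_s=60):
    """Sample one frame every `interval_s` seconds and credit that interval to the
    activity types recognized in it, returning estimated hours per activity type."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(fps * interval_s))
    durations = Counter()
    frame_index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if frame_index % step == 0:
            for activity in set(recognize_activities(frame)):
                durations[activity] += interval_s / 3600.0  # hours credited to this activity
        frame_index += 1
    capture.release()
    return durations


# Example call (placeholder pipeline and file name):
# hours = estimate_activity_hours("site_cam_01.mp4", recognize_activities)
```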
Research Limitations
As indicated by the experimental results, there is still room for improvement in the detection performance of several classes of objects, e.g., raw rebar, formwork materials, and workers. Besides this, the researchers note two limitations of this study. To implement full automation, 2D pixel distances in the image coordinates, rather than 3D physical distances, are used to define the spatial relevance between objects and workers. Images are assumed to be taken from relatively vertical angles, e.g., by a surveillance camera mounted under the operator cabin of a tower crane. In fact, many images are taken from relatively horizontal angles, which compress distances in 2D images and affect the validity of the spatial relevance calculation. Reconstructing 3D physical positions using multiple images from various viewpoints, which was explored by Brilakis et al. (2011) and Park et al. (2012) to track objects, could be a solution to this limitation.
Also, there is an intrinsic limitation in using still images to detect activities because no temporal information between images is available, making it difficult to differentiate between prolonged activities and transient states. At its current stage, this study primarily investigated how to use site images to recognize construction activities. Future work will focus on site surveillance videos to take advantage of the temporal information across frames, investigate the dynamics of relevance networks, and improve the activity recognition performance.
The performance of activity recognition depends heavily on that
of object detection. Activity recognition is based on relevance net-
works, which in turn are built according to the objects detected.
As a result, the precision of any activity pattern is statistically lower
than or equal to the minimum AP of the object classes used to es-
tablish the pattern. The precision of activity recognition (62.4%) is
lower than the mAP of object detection (67.3%); the latter sets a
ceiling for the former. The limitations mentioned previously nec-
essarily widen their difference. Lowering the IoU threshold for
positive detection, which sacrifices the localization accuracy, is an
immediate solution to improve object detection performance.
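One way to make the preceding ceiling argument explicit, under the simplifying assumptions that an activity pattern is recognized correctly only when every object class it requires is detected correctly, that detections are independent, and that AP is used as a proxy for the per-class probability of a correct detection, is the following heuristic bound (a sketch, not a result derived in the paper):

```latex
\[
  \mathrm{Precision}(p) \;\lesssim\; \prod_{c \in \mathcal{C}_p} \mathrm{AP}_c
  \;\le\; \min_{c \in \mathcal{C}_p} \mathrm{AP}_c ,
\]
% where $\mathcal{C}_p$ denotes the set of object classes used to establish activity pattern $p$.
```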
Conclusion
This paper introduces a new method to recognize diverse, concur-
rent activities executed by multiple objects in still site images, for
which the methods based on image classification algorithms are not
suitable. The method consists of two steps: object detection and
activity recognition. Cutting-edge object detection technologies
(i.e., Faster R-CNN + ResNet-50) were employed to implement
the object detection task. The researchers built the training and test
data sets of 22 classes of construction-related objects to train and
evaluate the convolutional neural networks. The method proves to
be comparable with the state of the art of object detection regarding mAP and the best APs but presents a relatively large AP variance. Free forms, blurred edges, and low resolutions are the possible causes of the low APs. Semantic relevance and spatial relevance were introduced to create relevance networks. Semantic relevance represents the likelihood that any two objects appear together in the same construction activity. Spatial relevance is defined with 2D
pixel distances in the image coordinates, representing the observ-
able possibility that they are involved in the same activity. Based
on relevance networks, a set of activity patterns was defined.
Preliminary experimental results show that the rule-based relevance
networks and activity patterns possess the potential to detect
diverse construction activities in site images.
However, the proposed method is limited by the distance compression caused by single-camera 2D photography, which can be addressed with 3D physical distances obtained by triangulating multiple camera views. Another limitation is the difficulty of differentiating between prolonged activities and transient states due to the intrinsic lack of temporal information between site images. To address this, site surveillance videos will be investigated in the future to implement dynamic relevance networks by detecting and correlating identical objects across consecutive frames.
Acknowledgments
The work was supported by the Innovation and Technology
Commission of Hong Kong, under the platform project Smart
Construction Platform based on Cloud BIM and Image Processing
(ITT/002/16LP).
References
Aggarwal, J. K., and Cai, Q. (1999). "Human motion analysis." Comput. Vision Image Understanding, 73(3), 428–440.
Aggarwal, J. K., and Ryoo, M. S. (2011). "Human activity analysis: A review." ACM Comput. Surv., 43(3), 1–43.
Andrew, A. M. (2013). An introduction to support vector machines and other kernel-based learning methods, Cambridge University Press, Cambridge, U.K.
Brilakis, I., Park, M.-W., and Jog, G. (2011). "Automated vision tracking of project related entities." Adv. Eng. Inf., 25(4), 713–724.
Brilakis, I., and Soibelman, L. (2006). "Multimodal image retrieval from construction databases and model-based systems." J. Constr. Eng. Manage., 10.1061/(ASCE)0733-9364(2006)132:7(777), 777–785.
Brilakis, I., Soibelman, L., and Shinagawa, Y. (2005). "Material-based construction site image retrieval." J. Comput. Civ. Eng., 10.1061/(ASCE)0887-3801(2005)19:4(341), 341–355.
Bugler, M., Borrmann, A., Ogunmakin, G., Vela, P. A., and Teizer, J. (2017). "Fusion of photogrammetry and video analysis for productivity assessment of earthwork processes." Comput.-Aided Civil Infrastruct. Eng., 32(2), 107–123.
Chi, S., and Caldas, C. H. (2011). "Automated object identification using optical video cameras on construction sites." Comput.-Aided Civ. Infrastruct. Eng., 26(5), 368–380.
Dalal, N., and Triggs, B. (2005). "Histograms of oriented gradients for human detection." IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, Vol. 881, IEEE, New York, 886–893.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). "ImageNet: A large-scale hierarchical image database." IEEE Conf. on Computer Vision and Pattern Recognition, IEEE, New York, 248–255.
Du, S., Shehata, M., and Badawy, W. (2011). "Hard hat detection in video sequences based on face features, motion and color information." 3rd Int. Conf. on Computer Research and Development, IEEE, New York, 25–29.
Egnor, S. R., and Branson, K. (2016). "Computational analysis of behavior." Ann. Rev. Neurosci., 39, 217–236.
Everingham, M., and Winn, J. (2007). "The PASCAL Visual Object Classes Challenge 2007 (VOC2007) development kit." https://pjreddie.com/media/files/VOC2007_doc.pdf (Jun. 27, 2017).
Everingham, M., and Winn, J. (2012). "The PASCAL Visual Object Classes Challenge 2012 (VOC2012) development kit." http://host.robots.ox.ac.uk/pascal/VOC/voc2012/devkit_doc.pdf (Jun. 27, 2017).
Felzenszwalb, P. F., McAllester, D., and Ramanan, D. (2008). "A discriminatively trained, multiscale, deformable part model." Proc., Computer Vision and Pattern Recognition, IEEE, New York, 1–8.
Girshick, R. (2015). "Fast R-CNN." Proc., IEEE Int. Conf. on Computer Vision, IEEE, New York, 1440–1448.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). "Rich feature hierarchies for accurate object detection and semantic segmentation." Proc., IEEE Conf. on Computer Vision and Pattern Recognition, IEEE, New York, 580–587.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2016). "Region-based convolutional networks for accurate object detection and segmentation." IEEE Trans. Pattern Anal. Mach. Intell., 38(1), 142–158.
Golparvar-Fard, M., Heydarian, A., and Niebles, J. C. (2013). "Vision-based action recognition of earthmoving equipment using spatio-temporal features and support vector machine classifiers." Adv. Eng. Inf., 27(4), 652–663.
Gong, J., and Caldas, C. H. (2010). "Computer vision-based video interpretation model for automated productivity analysis of construction operations." J. Comput. Civ. Eng., 10.1061/(ASCE)CP.1943-5487.0000027, 252–263.
Gong, J., Caldas, C. H., and Gordon, C. (2011). "Learning and classifying actions of construction workers and equipment using Bag-of-Video-Feature-Words and Bayesian network models." Adv. Eng. Inf., 25(4), 771–782.
Haming, K., and Peters, G. (2010). "The structure-from-motion reconstruction pipeline–A survey with focus on short image sequences." Kybernetika, 46(5), 926–937.
Han, S., and Lee, S. (2013). "A vision-based motion capture and recognition framework for behavior-based safety management." Autom. Constr., 35, 131–141.
Han, S., Lee, S., and Peña-Mora, F. (2013). "Vision-based detection of unsafe actions of a construction worker: Case study of ladder climbing." J. Comput. Civ. Eng., 10.1061/(ASCE)CP.1943-5487.0000279, 635–644.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). "Deep residual learning for image recognition." Preprint, arXiv:1512.03385.
Hinton, G. E. (2009). "Deep belief networks." Scholarpedia, 4(5), 5947.
Hong, S., You, T., Kwak, S., and Han, B. (2015). "Online tracking by learning discriminative saliency map with convolutional neural network." Preprint, arXiv:1502.06796.
ImageNet and Microsoft COCO. (2015). "ImageNet and MS COCO Visual Recognition Challenges Joint Workshop." http://image-net.org/challenges/ilsvrc+mscoco2015 (Jun. 26, 2017).
Kristan, M., Matas, J., Leonardis, A., Felsberg, M., and Cehovin, L. (2015). "The visual object tracking VOT2015 challenge results." Proc., IEEE Int. Conf. on Computer Vision Workshops, IEEE, New York, 1–23.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). "ImageNet classification with deep convolutional neural networks." Proc., Advances in Neural Information Processing Systems, NIPS, La Jolla, CA, 1097–1105.
Lafferty, J., McCallum, A., and Pereira, F. (2001). "Conditional random fields: Probabilistic models for segmenting and labeling sequence data." Proc., 18th Int. Conf. on Machine Learning, IMLS, Stroudsburg, PA, 282–289.
LeCun, Y., Bengio, Y., and Hinton, G. (2015). "Deep learning." Nature, 521(7553), 436–444.
Lillo, I., Soto, A., and Carlos Niebles, J. (2014). "Discriminative hierarchical modeling of spatio-temporally composable human activities." Proc., IEEE Conf. on Computer Vision and Pattern Recognition, IEEE, New York, 812–819.
Lowe, D. G. (1999). "Object recognition from local scale-invariant features." Proc., Int. Conf. on Computer Vision, Vol. 1152, IEEE, New York, 1150–1157.
Memarzadeh, M., Golparvar-Fard, M., and Niebles, J. C. (2013). "Automated 2D detection of construction equipment and workers from site video streams using histograms of oriented gradients and colors." Autom. Constr., 32, 24–37.
Morariu, V. I., and Davis, L. S. (2011). "Multi-agent event recognition in structured scenarios." Proc., IEEE Conf. on Computer Vision and Pattern Recognition, IEEE, New York, 3289–3296.
Murphy, K. P., and Paskin, M. A. (2001). "Linear-time inference in hierarchical HMMs." Proc., NIPS, NIPS, La Jolla, CA, 833–840.
Neubeck, A., and Van Gool, L. (2006). "Efficient non-maximum suppression." Proc., 18th Int. Conf. on Pattern Recognition, IEEE, New York, 850–855.
Park, M. W., and Brilakis, I. (2012). "Construction worker detection in video frames for initializing vision trackers." Autom. Constr., 28, 15–25.
Park, M. W., Koch, C., and Brilakis, I. (2012). "Three-dimensional tracking of construction resources using an on-site camera system." J. Comput. Civ. Eng., 10.1061/(ASCE)CP.1943-5487.0000168, 541–549.
PASCAL VOC. (2012). "The PASCAL visual object classes." http://host.robots.ox.ac.uk/pascal/VOC/ (Jun. 26, 2017).
Quattoni, A., Wang, S., Morency, L.-P., Collins, M., and Darrell, T. (2007). "Hidden conditional random fields." IEEE Trans. Pattern Anal. Mach. Intell., 29(10), 1848–1852.
Rabiner, L., and Juang, B. (1986). "An introduction to hidden Markov models." IEEE ASSP Magazine, 3(1), 4–16.
Ray, S. J., and Teizer, J. (2012). "Real-time construction worker posture analysis for ergonomics training." Adv. Eng. Inf., 26(2), 439–455.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). "Faster R-CNN: Towards real-time object detection with region proposal networks." Proc., Advances in Neural Information Processing Systems, NIPS, La Jolla, CA, 91–99.
Rezazadeh Azar, E., Dickinson, S., and McCabe, B. (2013). "Server-customer interaction tracker: Computer vision-based system to estimate dirt-loading cycles." J. Constr. Eng. Manage., 10.1061/(ASCE)CO.1943-7862.0000652, 785–794.
Rezazadeh Azar, E., and McCabe, B. (2012a). "Automated visual recognition of dump trucks in construction videos." J. Comput. Civ. Eng., 10.1061/(ASCE)CP.1943-5487.0000179, 769–781.
Rezazadeh Azar, E., and McCabe, B. (2012b). "Part based model and spatial-temporal reasoning to recognize hydraulic excavators in construction images and videos." Autom. Constr., 24, 194–202.
Shechtman, E., and Irani, M. (2005). "Space-time behavior based correlation." Proc., IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, IEEE, New York, 405–412.
Son, H., Kim, C., and Kim, C. (2012). "Automated color model-based concrete detection in construction-site images by using machine learning algorithms." J. Comput. Civ. Eng., 10.1061/(ASCE)CP.1943-5487.0000141, 421–433.
Teizer, J., Caldas, C., and Haas, C. (2007). "Real-time three-dimensional occupancy grid modeling for the detection and tracking of construction resources." J. Constr. Eng. Manage., 10.1061/(ASCE)0733-9364(2007)133:11(880), 880–888.
Tzu, T. (2015). "LabelImg: A graphical image annotation tool." https://github.com/tzutalin/labelImg (Dec. 1, 2016).
Vrigkas, M., Nikou, C., and Kakadiaris, I. A. (2015). "A review of human activity recognition methods." Front. Robot. AI, 2, 28.
Wang, H., and Schmid, C. (2011). "Action recognition by dense trajectories." IEEE Conf. on Computer Vision and Pattern Recognition, IEEE, New York, 3169–3176.
Wang, H., and Schmid, C. (2013). "Action recognition with improved trajectories." Proc., IEEE Int. Conf. on Computer Vision, IEEE, New York, 3551–3558.
Weinland, D., Ronfard, R., and Boyer, E. (2011). "A survey of vision-based methods for action representation, segmentation and recognition." Comput. Vision Image Understanding, 115(2), 224–241.
Williams, C. (2012). "Introduction: History and analysis (in Part II: VOC 2005-2012: The VOC years and legacy)." http://host.robots.ox.ac.uk/pascal/VOC/voc2012/workshop/history_analysis.pdf (Jun. 26, 2017).
Yang, J., Park, M.-W., Vela, P. A., and Golparvar-Fard, M. (2015). "Construction performance monitoring via still images, time-lapse photos, and video streams: Now, tomorrow, and the future." Adv. Eng. Inf., 29(2), 211–224.
Yang, J., Shi, Z., and Wu, Z. (2016). "Vision-based action recognition of construction workers using dense trajectories." Adv. Eng. Inf., 30(3), 327–336.
Yang, J., Vela, P., Teizer, J., and Shi, Z. (2014). "Vision-based tower crane tracking for understanding construction activity." J. Comput. Civ. Eng., 10.1061/(ASCE)CP.1943-5487.0000242, 103–112.
Zhu, Z., Ndiour, I. J., Brilakis, I., and Vela, P. A. (2010). "Improvements to concrete column detection in live video." Proc., 27th Int. Symp. on Automation and Robotics in Construction, Faculty of Civil Engineering of the Slovak Univ. of Technology in Bratislava, Bratislava, Slovakia, 25–27.
Zou, J., and Kim, H. (2007). "Using hue, saturation, and value color space for hydraulic excavator idle time analysis." J. Comput. Civ. Eng., 10.1061/(ASCE)0887-3801(2007)21:4(238), 238–246.