Attribute-Based Safety Risk Assessment. II: Predicting
Safety Outcomes Using Generalized Linear Models
Behzad Esmaeili, A.M.ASCE1; Matthew R. Hallowell, A.M.ASCE2; and Balaji Rajagopalan, A.M.ASCE3
Abstract: One of the recent advancements in preconstruction safety management is the identification and quantification of risks associated
with fundamental attributes of construction work environments that cause injuries. The goal of this paper is to test the validity of using these
fundamental risk attributes to predict safety outcomes. The modeling approach required two steps, as follows: (1) a principal component
analysis was performed on the safety attributes to reduce the dimension of the data and remove collinearity among attributes (the principal
component analysis provided insights into the relative importance of the various attributes and provided an orthogonal decomposition
of the data), and (2) the leading principal components (which are orthogonal by definition) were used as potential predictors in a generalized
linear model with a logit link function to model the probability of different accident categories. The predictive power was then assessed using
a rank probability skill score, which quantified the probabilistic skill of the forecasts over the categories. The analysis shows strong predictive
skill, making the models attractive for safety managers to use to skilfully forecast the probability of a safety incident given identifiable
characteristics of planned work. Researchers in the technology domain may find these models useful in predicting safety outcomes during
design, work packaging, and scheduling. DOI: 10.1061/(ASCE)CO.1943-7862.0000981.© 2015 American Society of Civil Engineers.
Author keywords: Safety risk management; Predictive models; Principal component analysis (PCA); Generalized linear models (GLMs);
Labor and personnel issues.
Introduction
Safe completion of a project is the ultimate goal of any contractor.
To achieve this goal, many contractors are in strict accordance with
Occupational Safety and Health Administration (OSHA) regulations,
which involve providing personal protective equipment and protec-
tive measures. Although these prevention strategies are effective, they
are not enough to achieve excellent safety performance (Hinze et al.
2013). Since most of the compliance practices are passive or reactive,
they do not provide early warnings based on the specific character-
istics of a work environment. For this reason, increasing attention is
being paid to proactive safety strategies that identify precursors of an
incident and assess the risk of potential hazards in advance.
Seeking to predict safety performance and provide early warn-
ing, numerous studies have investigated the relationship between
safety-related outcomes and variables that might affect safety
(Zohar 1980; Tam and Fung 1998; Gillen et al. 2002; Cooper and
Phillips 2004; Chen and Yang 2004; Fang et al. 2006; Johnson
2007; Rozenfeld et al. 2010). Different metrics such as number
of injuries, experience modification rate (EMR), or safe behavior
have been used to measure safety outcomes. Additionally, a variety
of factors have also been used to measure the predictor variables,
such as the characteristics of construction firms, safety program
elements employed, and a firm's safety climate. The main limita-
tions of these previous studies are the following: (1) the dynamic
nature of construction projects has been widely ignored, (2) the
predictions are not based on objective or empirical data, and (3) the
safety climate cannot be measured in early stages of a project.
To address these limitations, other studies (Baradan and Usmen
2006; Hallowell et al. 2011; Esmaeili and Hallowell 2013) suggested
assessing the specific safety risks of tasks or trades as a predictor
of the existing level of hazards onsite. However, this approach
has a practical limitation; because the industry is diverse and ever-
evolving, it is impossible to quantify risks for all potential scenarios.
To address these knowledge gaps, Esmaeili (2012) proposed an
attribute-based risk identification and analysis method that helps
practitioners model safety risk independent of specific activities or
construction objects. In this method, the risk of worker injury is con-
sidered to be the direct result of temporal and spatial interactions
among a limited number of fundamental and identifiable attributes
that characterize the work environment. These attributes, which can
be identified in early stages of the project, are mainly related to the
physical conditions of a jobsite such as the presence of open edges,
overhead power lines, and moving equipment in proximity to
workers. The conceivable benefit of an attribute-based risk identifi-
cation system lies in the fact that hazards coinciding with the specific
features (attributes) of a task may be identified and minimized during
the preconstruction phase of the project, thereby alleviating the
workers' exposure to risk during actual construction.
The fundamental attributes of construction environments were
used to serve as predictor variables in probabilistic safety models
forecasting the potential injury outcomes of those tasks. A sound
and reliable mathematical approach used extensively in meteoro-
logical science for weather forecasting contributed to the formation
of this model. Just as measurements of temperature, wind speed,
1Assistant Professor, Durham School of Architectural Engineering and
Construction, Univ. of Nebraska, 113 Nebraska Hall, Lincoln, NE 68588
(corresponding author). E-mail: besmaeili2@unl.edu
2Beavers Endowed Professor of Construction Engineering and Associ-
ate Chair; and Dept. of Civil, Environmental, and Architectural Engineer-
ing, Univ. of Colorado at Boulder (UCB), 428 UCB, 1111 Engineering Dr.,
Boulder, CO 80309-0428. E-mail: matthew.hallowell@colorado.edu
3Professor and Chair, Dept. of Civil, Environmental, and Architectural
Engineering and Cooperative Institute for Research in Environmental
Sciences, Univ. of Colorado at Boulder (UCB), 428 UCB, 1111 Engineer-
ing Dr., Boulder, CO 80309-0428. E-mail: balajir@colorado.edu
Note. This manuscript was submitted on August 4, 2014; approved
on January 6, 2015; published online on March 27, 2015. Discussion
period open until August 27, 2015; separate discussions must be sub-
mitted for individual papers. This paper is part of the Journal of Con-
struction Engineering and Management, © ASCE, ISSN 0733-9364/
04015022(11)/$25.00.
pressure, humidity, and their interrelationships are used to help
forecast weather, safety attributes provide the raw data needed to
drive the prediction of workplace injuries and safety incidents.
In order to limit the scope of the research reported in this paper,
the focus was placed on struck-by accidents, which are one of the leading
causes of construction fatalities (Hinze et al. 2005). A large and
reliable national database of construction accidents provided the
objective data employed in the testing of the mathematical models.
It is expected that this approach and the resulting models will
improve researchers' ability to forecast injuries and anticipate
high-risk periods on a project. Specifically, the predictive models
developed in this paper can help practitioners to choose alternative
means and methods of construction and identify high-risk periods
of a project.
Safety Predictive Models
There are several studies that have attempted to predict safety-
related outcomes (e.g., Tam and Fung 1998). For example, some
researchers (Zohar 1980;Johnson 2007) attempted to relate safety
outcomes (e.g., injury rate) to the factors that affect safe perfor-
mance (e.g., injury prevention practices). Within these studies, a
common independent variable used to forecast safety performance
during construction is safety climate. Safety climate is considered a
subset of organizational climate and can be defined as the so-called
molar perceptions that workers share about the importance of safety
(Zohar 1980). Researchers into this topic searched for empirical
evidence of the relationship between safety climate and safety per-
formance such as the frequency and severity of accidents. In one
of the seminal studies, Zohar (1980) successfully used safety cli-
mate dimensions to predict safety program effectiveness as judged
by safety inspectors in industrial organizations. Glendon and
Litherland (2001) distributed safety climate questionnaires to ex-
amine the relationship between safety climate and safe behavior.
They assumed that safe behavior leads to less-frequent and less-severe
accidents. In one of the more recent studies in this area, Johnson
(2007) examined the predictive validity of safety climate and found
that safety climate was negatively correlated with the number of
lost workdays due to injury. In another study, Fang et al. (2006)
used logistic regression to investigate the relationship between
safety climate and personal characteristics (e.g., education level).
The Fang et al. (2006) approach was different from the previous
studies in the area of safety climate, because they considered safety
climate as a dependent variable and tried to predict it through per-
sonal characteristics (independent variable).
Other researchers examined other predictive variables. Tam and
Fung (1998) studied the relationship between common safety man-
agement strategies in Hong Kong and their accident rates using
multiple regression analysis. They found that four variables are sig-
nificant in determining safety performance, as follows: (1) postac-
cident investigation, (2) the proportion of subcontracted labor,
(3) safety awards, and (4) safety training. In another study,
Gillen et al. (2002) found a relationship between injured construc-
tion workers' perceptions of workplace safety climate, psychologi-
cal job demands, decision latitude, coworker support, and the
severity of injuries sustained by the workers. Their model explained
23% of the variance in injury severities. Cooper and Phillips (2004)
also used multiple regression and found that the perception of
importance of safety training can predict the actual levels of safe
behavior. Some other researchers attempted to predict safety per-
formance using leading indicators. In one of the recent studies,
Hinze et al. (2013) identified a list of construction-safety strategies
and linked them with the project's recordable injury rate (RIR).
In Singapore, Chua and Goh (2005) considered construction
accidents as random events and successfully fitted a Poisson distri-
bution to a data set of accidents. The major limitation of predicting
accident occurrence using a general probability density function
(PDF) is that this type of predictive model does not consider the
unique characteristics of activities or a project. In another study,
Goh and Chua (2013) used neural network analysis to investigate
the relationship between safety performance and occupational safety
and health elements identified from accident reports. They found
that safety performance in a project is mostly impacted by incident
investigation and analysis, emergency preparedness, and group
meetings. While this approach can be used to identify successful
safety practices in general, it heavily relies on soft factors such as
safety policy, safety training, and group meetings, which are diffi-
cult to measure objectively in a real construction project.
Although the mentioned predictive models can be effective tools
in measuring safety status, they have the following limitations:
(1) the reported relationship between safety climate and safety
behavior is largely dependent on subjective self-reporting instru-
ments (Chen and Yang 2004), (2) these models focus on unsafe
behavior and ignore the importance of physically unsafe condi-
tions, and (3) the proposed models cannot be integrated into
preconstruction safety activities because there is no knowledge of
the safety climate or behavioral issues during the design and
preconstruction stages of a project.
Because accidents are infrequent events, some researchers implic-
itly considered risk as a predictive measure for safety performance.
This group of scholars has attempted to predict hazard by quantify-
ing risks produced by different trades (Baradan and Usmen 2006),
activities (Hallowell and Gambatese 2009), or loss-of-control
events (Rozenfeld et al. 2010). Lee and Halpin (2003) presented
a predictive tool to estimate accident risk in utility-trenching oper-
ations using training, supervision, and preplanning as predictive
variables. In order to assess the condition of different predictive
variables, they used the fuzzy input from the user. Outside of the
construction domain, Chen and Yang (2004) used regular observa-
tion of unsafe acts and conditions to develop a predictive risk index
as an indication of safety performance in a process plant.
Although these approaches offer benefits, there are two main
limitations in these types of predictive models, as follows: (1) there
are numerous activities and loss-of-control events, and quantifying
risks for all of them is impractical; and (2) in most research, risk has
been assessed subjectively, thereby limiting the internal and exter-
nal validity of the estimates. The prominent predictive models in
the construction safety domain, as well as their associated response
and predictor variables, are listed in Table 1. After conducting an
extensive literature review, it can be concluded that developing pre-
dictive models using empirical data is important for obtaining more
reliable and robust knowledge in this area.
Contribution to the Body of Knowledge
The research reported in this paper departs from the current body
of knowledge by testing the validity of several statistical models
to predict hazardous situations in the early stages of a project. The
research reported in this paper is the first to employ an objective
large accident database to forecast safety related outcomes of ac-
cidents using a finite number of measurable attributes. This paper
makes several contributions to both theory and practice, not the
least of which is a predictive model of safety outcomes that is
(1) based on a large volume of internally and externally valid em-
pirical data, (2) explored with an efficient and rigorous technique
derived from established meteorological science, (3) robust enough
to predict outcomes for any combination of attributes that may be
encountered in contemporary worksites, and (4) focused on the
unsafe physical conditions instead of unsafe behavior.
As far as planned theoretical contributions of the paper are
concerned, the results of the research reported in this paper enable
a project manager to predict the probability of different injury out-
comes independent of the tasks or other unique features of a
project. In addition, using a finite number of attributes that can be
identified during the early stages of a construction project provides
several opportunities for the project personnel to change the design
in such a way as to mitigate hazards. Furthermore, the individual-
ized predictive models for different types of construction projects
(e.g., family housing, highway, and so on) help practitioners to
adapt models for different types of construction, such as vertical
and horizontal. To summarize, the applied results of the research
reported in this paper bring the current safety practices one step
closer to the vision of so-called zero injury in that if work injuries
can be predicted, they may be prevented.
Research Method
This paper built upon the previously established content analysis
of 1,771 struck-by accident reports that identified the fundamental
attributes that cause accidents. In this process, the fundamental
attributes that lead to struck-by accidents were identified as pre-
dictor variables and the severity of accidents that were caused by
these attributes was recorded as the response variable (Esmaeili
2012). The presence/absence of groups of hazardous attributes
(independent variable) and the various injury types (dependent var-
iable) became the dataset for the research reported in this paper.
The accident reports came from the Occupational Safety and
Health Administration (OSHA) Integrated Management Informa-
tion System (IMIS; OSHA 2013). The scope of the dataset was
limited to two major groups, as follows: (1) building construction
general contractors and operative builders, and (2) heavy construc-
tion other than building construction contractors (which usually
have a higher rate of struck-by accidents). The OSHA IMIS data-
base classifies the injury reports into different project groups, such
as single-family housing, residential, and industrial. Table 2 shows
the detailed breakdown structure of the construction work groups.
In total, 22 attributes that cause struck-by accidents were iden-
tified (Table 3). The output of the content analysis was a matrix
featuring accident reports (rows) and safety attributes (columns);
within the matrix, if attribute j contributed to accident i, then
x_ij = 1; otherwise x_ij = 0. Several cases were omitted as missing
data because they did not have specific accident severity or the de-
scription in the report was less than two lines. Due to its lack of
sufficient accident reports, Standard Industrial Classification (SIC)
1531 was omitted from the analysis.
Injury severity was used to create the categorical dependent
variable (Lee and Halpin 2003). The severity related to each
accident was recorded and resulted in 26 different types of injury
outcomes. Fatality and severe accidents dominated the accident out-
comes, which was expected because the IMIS database includes
OSHA recordable injuries that have severe consequences. How-
ever, this characteristic can cause problems for predictive models
Table 1. Previous Safety Predictive Models
Number Study Dependent variable Independent variable
1 Tam and Fung (1998) Accident rates Safety management strategies
2 Glendon and Litherland (2001) Percent safe behavior Safety climate
3 Gillen et al. (2002) Severity of accidents Perceived safety climate, job demands,
decision latitudes, and coworker support
4 Lee and Halpin (2003) Safety risk Training, supervision, and preplanning
5 Cooper and Phillips (2004) Safe behavior Safety training
6 Fang et al. (2006) Safety climate Personal characteristics
7 Baradan and Usmen (2006) Safety risk Construction trades
8 Johnson (2007) Lost workdays Safety climate
9 Hallowell and Gambatese (2009) Safety risk Formwork activities
10 Rozenfeld et al. (2010) Safety risk Loss-of-control events
11 Esmaeili and Hallowell (2013) Safety risk profiles Highway maintenance and reconstruction tasks
Table 2. Accident Reports Analyzed (counts given as struck-by incidents / after removing missing data)
Major Group 15, building construction general contractors and operative builders:
  SIC 1521  General contractors, single-family houses: 247 / 149
  SIC 1522  General contractors, residential buildings, other than single-family: 111 / 71
  SIC 1531  Operative builders: 19 / 14
  SIC 1541  General contractors, industrial buildings, and warehouses: 105 / 86
  SIC 1542  General contractors, nonresidential buildings, other than industrial buildings and warehouses: 209 / 178
Major Group 16, heavy construction other than building construction contractors:
  SIC 1611  Highway and street construction, except elevated highways: 501 / 463
  SIC 1622  Bridge, tunnel, and elevated highway construction: 116 / 104
  SIC 1623  Water, sewer, pipeline, and communications and power line construction: 280 / 226
  SIC 1629  Heavy construction, not elsewhere classified: 183 / 159
Total: 1,771 / 1,436
Note: SIC = Standard Industrial Classification.
because the fatality or other serious injuries will become the most
common predicted outcome since less-severe outcomes may not be
representatively included in the database. To resolve this challenge,
the response variables were categorized into three main groups,
as follows: (1) the response variables were dichotomized into fatal and
nonfatal injuries; (2) injury outcomes were classified into the cat-
egories of not severe, mild, and severe; and (3) injury outcomes were
classified into five categories, i.e., first aid, medical case, lost work
time, permanent disablement, and fatality. Whereas categorizing response
variables might improve the quality of predictive models, it does not
help researchers to predict less-severe injuries that are underrepre-
sented in the database. The distribution of each injury outcome in
different SIC codes is shown in Table 4.
Principal component analysis (PCA) was used to identify the
linear combination of attributes that had the greatest explanatory
power and also to reduce the dimension of the multivariate data
into fewer orthogonal components, i.e., principal components
(PCs). The PCs became independent variables in a generalized
linear model (GLM) to predict the probability of each category of
injury severity. The performance of the predictive models was as-
sessed using a rank probability skill score. At the end, the Friedman
two-way ANOVA by ranks was used to compare the predictive
power of the models. The specific research methods employed are
discussed next.
For clarification, the different steps conducted in the research
reported in this paper are summarized in Fig. 1.
Principal Component Analysis
The PCA was introduced by Pearson (1901) and refined by
Hotelling (1933). The main objective of this technique is to reduce
the dimensionality of a dataset consisting of a large number of in-
terrelated variables while retaining the maximum possible variance.
This pursuit is achieved by transforming the dataset into a new set
of orthogonal variables, i.e., the principal components. In this pro-
cess the first principal component accounts for the largest amount
of variance in the data, the second principal component accounts
for the next largest amount of variance and is uncorrelated with the
first, and so on. Several applications of PCA have been stated in the
literature, such as data reduction (Wold et al. 1987), modeling
(Palau et al. 2012), outlier detection (Barnett and Lewis 1994), var-
iable selection (Jolliffe 2002), clustering (Saitta et al. 2008), and
prediction (Salas et al. 2011). Principal component analysis is also
widely used in climate research, wherein global climate data in
space and time need to be analyzed to identify coherent spatial
and temporal patterns for diagnosis and prediction (von Storch and
Zwiers 1999).
To explain the mathematical algorithm briefly, suppose that the
results of the content analysis of the OSHA IMIS database are stored in
matrix X of size N rows by M columns, where N is the number of
accident records and M is the number of attributes. Matrix X can
be shown as

              Accident No. 1   | x_11  ...  x_1M |
  X_(N,M) =        ...         |  ...  ...   ... |   (1)
              Accident No. N   | x_N1  ...  x_NM |

where x_ij = 1 if attribute j contributed to accident i, and x_ij = 0
otherwise. The objective is to find a linear transformation W_(M×M) such that

  Z_(N×M) = X_(N×M) × W_(M×M)   (2)

where Z = score matrix whose kth column is Z_k, the kth PC,
k = 1, 2, ..., M; and W = orthogonal matrix, called the loading matrix, that
Table 3. Struck-by Attributes, i.e., Predictor Variables
Number Struck-by attributes
1 Working in swing area of a boomed vehicle
2 Workers on foot and moving equipment (a)
3 Lack of vision or visibility
4 Flagger on the jobsite
5 Site topography
6 Working with heavy equipment
7 Falling out from heavy equipment
8 Nail gun
9 Working with power tools/large tools
10 Equipment backup
11 Working near active roadway
12 Vehicle accident
13 Flying debris/objects
14 Falling objects
15 Structure collapse
16 Material storage
17 Lifting heavy materials (b)
18 Transporting heavy materials horizontally
19 Working at trench
20 Wind
21 Snow
22 Temperature
(a) For example, workers are assigned to an activity in proximity of an
excavator.
(b) Heavy materials are defined as objects that, if they hit a worker, even at low
speed, can cause an injury.
Table 4. Classifying Injury Types and Their Distribution for Each SIC as Response Variable
Category  Type of injury (response variable)   1521   1522   1541   1542   1611   1622   1623   1629   Average
1         Not fatality                         69.8   67.9   55.9   54.6   24.1   42.1   36.3   30.2   46.0
2         Fatality                             30.2   32.1   44.1   45.4   75.9   57.9   63.7   69.8   54.0
1         Not severe                           10.1    9.9    6.5    9.2    3.0    5.6    2.4    7.0    6.7
2         Mild                                 59.7   58.0   49.4   45.4   21.1   36.5   33.9   23.2   39.3
3         Severe                               30.2   32.1   44.1   45.4   75.9   57.9   63.7   69.8   54.0
1         First aid                            10.1    9.9    6.5    9.2    3.0    5.6    2.4    7.0    6.7
2         Medical case                         20.8   21.0   11.8    8.1    3.6    7.5    7.8    2.9    9.3
3         Lost work time                       30.8   29.6   35.5   30.8   13.3   23.4   22.4   16.9   24.8
4         Permanent disablement                 8.2    7.4    2.2    6.5    4.2    5.6    3.7    3.5    5.3
5         Fatality                             30.2   32.1   44.1   45.4   75.9   57.9   63.7   69.8   54.0
Note: Columns 1521 through 1629 are Standard Industrial Classification codes; values are expressed as percentages.
projects X to Z. The PCA aims to find the elements of W such that
the squared sum of X's projection onto the PC directions is
maximized. Jolliffe (2002) showed that the columns of W (w_i) are
the eigenvectors of X's covariance matrix (C_x).
Another common approach to find PCs is to use a correlation
matrix instead of a covariance matrix. However, Chatfield and
Collins (1980) stated that PCs obtained from a correlation matrix
are not the same as PCs obtained from a covariance matrix. One
of the main drawbacks of using a covariance matrix is that PCs
obtained from this method are sensitive to the units of measure-
ment used for each variable. This means that variables with the largest
variance will dominate the first few PCs. In this case, because all
measurements are made in the same units, the covariance matrix
might be more appropriate. The PCA algorithm was implemented
through the prcomp function (Stats Package 2013) in R,
which is an open-source statistical program (R Development Core
Team 2011).
Selecting the number of PCs in the analysis is an important
issue. One of the common rules for selecting PCs is to drop any PC
with variance less than 1, which is known as Kaiser's rule (Kaiser
1960). However, many scholars consider this the most
inaccurate of all methods (Velicer and Jackson 1990). Other meth-
ods of selecting PCs that have proven to be more effective are to
look at the scree plot or the variance retained by the PCs, since the ith
eigenvalue λ_i is a valid measure of the variance accounted for by the
ith PC (Jolliffe 2002). Therefore, the cumulative variance (CumVar)
retained by the first k PCs can be determined as

  CumVar_k = (Σ_{i=1}^{k} λ_i) / (Σ_{j=1}^{n} λ_j)   (3)
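For readers who wish to trace this step, the following sketch illustrates a covariance-matrix PCA in Python on a hypothetical binary attribute matrix (placeholder data, not the study dataset); the study itself used the prcomp function in R.

```python
import numpy as np

# Sketch of the PCA step on a hypothetical binary attribute matrix
# (rows = accident reports, columns = attributes); placeholder data only.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(150, 22)).astype(float)

# PCA on the covariance matrix: center the data, then eigendecompose.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]                # sort eigenvalues descending
eigvals, W = eigvals[order], eigvecs[:, order]   # W = loading matrix

Z = Xc @ W                                       # scores (PCs), as in Eq. (2)

# Cumulative variance retained by the first k PCs, Eq. (3).
cum_var = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(cum_var, 0.78)) + 1      # threshold mirrors the ~78% for SIC 1521
print(k, round(cum_var[k - 1], 2))
```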
Generalized Linear Models
Regression techniques have been widely used in the construction
industry to predict quarterly new orders for housing, commercial,
and industrial construction projects (Akintoye and Skitmore 1994;
Goh 1999), values of total construction activities (Tang et al. 1990),
and cash flow (Park et al. 2005). In general, regression techniques
aim to model the relationships among variables by quantifying the
degree to which a response variable is related to a set of explanatory
variables. The output of the regression model is a forecasting tool
that can be used to evaluate the impact of various alternative inputs
on the response variable (Goh and Teo 2000).
A classical method for evaluating the relationship between a
predictor and response variable is linear regression. One of the
major assumptions of linear regression (LR) is that the response
variables come from a normally distributed population. However,
in reality, many response variables are categorical and violate
this assumption. In order to overcome this barrier, a more general
approach was adopted that does not have this limitation of LR,
called generalized linear models. This modeling technique pro-
vides a very flexible approach for exploring the relationships
among a variety of variables (discrete, categorical, continuous
and positive, and extreme value) as compared to traditional regres-
sion (McCullagh and Nelder 1989). In a GLM, instead of modeling
the mean directly, a one-to-one continuous differentiable transformation
g(μ_i), called a link function, is used. Depending on the assumed
distribution of the response variable (Y), an appropriate link func-
tion can be defined (McCullagh and Nelder 1989). As mentioned
in the section that described data acquisition, the response variables
in the research reported in this paper are dichotomous (fatality/no
fatality) and categorical (e.g., severe/mild/not severe). Thus, a logit
link function was used

  η_i = g(μ_i) = log[μ_i / (1 − μ_i)]   (4)
where μ_i = expected value of the response variable (injury out-
come); and η_i = linear predictor that transforms the expected value
of the response variable such that

  η_i = x_i′β   (5)

where β = vector of regression coefficients; and x = set of predictors,
which includes N observations (accident reports) and P possible
predictor variables (the leading PCs). For two response categories, the
model is logistic regression, and for three and five categories, the
model is multinomial regression. Model parameters in a GLM
are determined in an iterative process called iterated weighted
least-squares (IWLS). In summary, this method finds a set of model
parameters that maximizes the likelihood of reproducing the data
distribution of the training set. Multinomial logistic regression is
a generalization of the two-category case described previously
(Hastie et al. 2002). The binomial and multinomial logistic regres-
sions were implemented using the library VGAM (Yee and Wild
1996) in the open-source statistical package R. After esti-
mating β, one can predict η, and then the values can be transformed
into the original response using the inverse link function. The same
approach extends to multiple categories.
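As an illustration of this modeling step, the sketch below fits a binomial GLM with a logit link to a few leading PCs using Python's statsmodels; the variables are placeholders, and the paper's actual fits were produced with the VGAM library in R.

```python
import numpy as np
import statsmodels.api as sm

# Placeholder PC scores and fatality indicators; not the study data.
rng = np.random.default_rng(1)
Z = rng.normal(size=(150, 5))          # leading PC scores
y = rng.integers(0, 2, size=150)       # 1 = fatality, 0 = no fatality

# Binomial GLM with a logit link, Eqs. (4)-(5): eta = x'beta, mu = inverse-logit(eta).
X = sm.add_constant(Z[:, :2])          # intercept plus the first two PCs
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()   # fitted by IWLS
print(fit.params)                      # estimated beta coefficients
print(fit.predict(X)[:5])              # back-transformed P(fatality)
```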
Model Pruning
One of the common threats to the validity and usefulness of stat-
istical models is overfitting the dataset, which yields a large number
of insignificant variables in the model. Therefore, the predictor var-
iables of the model should be pruned to find a so-called best model
that contains the right quantity of variables. To do that, a stepwise
regression approach was adopted that minimizes the Akaike infor-
mation criterion (AIC) instead of the likelihood function to evaluate
goodness-of-fit in the stepwise search. Minimizing the AIC strikes a
balance between the number of parameters and goodness-of-fit.
This method measures the ability of the predictive model to
reproduce the variance of the observations with the fewest number
of parameters (Wilks 1995). The AIC value can be calculated from

  AIC = 2K − 2 ln(L)   (6)

where K = number of model parameters; and L = maximized value
of the likelihood function for the model. To minimize the AIC, both
Fig. 1. Research steps
forward and backward searches were conducted in the stepwise
regression.
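The following sketch shows one way AIC-guided pruning could be carried out in Python; for brevity it enumerates small subsets of PCs rather than performing the forward/backward stepwise search described above, and the data are placeholders.

```python
import numpy as np
import statsmodels.api as sm
from itertools import combinations

def best_subset_by_aic(y, Z, max_size=3):
    """Return the PC subset whose binomial GLM has the lowest AIC, Eq. (6)."""
    best_aic, best_cols = np.inf, ()
    for size in range(1, max_size + 1):
        for cols in combinations(range(Z.shape[1]), size):
            X = sm.add_constant(Z[:, cols])
            aic = sm.GLM(y, X, family=sm.families.Binomial()).fit().aic
            if aic < best_aic:
                best_aic, best_cols = aic, cols
    return best_aic, best_cols

rng = np.random.default_rng(2)
Z = rng.normal(size=(150, 5))          # placeholder PC scores
y = rng.integers(0, 2, size=150)       # placeholder outcomes
print(best_subset_by_aic(y, Z))
```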
Evaluation of Model Skill
After developing the model, its predictive power should be mea-
sured objectively. In the research reported in this paper, the perfor-
mance of the model was measured against the observed data
through a rank probability score (RPS), which indicates the degree
to which the model predicts the observed data. To calculate the
RPS, two vectors were constructed, as follows: (1) the forecasted
probabilities (P_j), based on the GLM model predictions; and (2) the
observed events (z_j), from the observed data. Then the cumulative
distribution functions (CDFs) of P_j and z_j were constructed,
resulting in the vectors P_CDF,j and z_CDF,j.
The RPS was computed using

  RPS = (1/N) × Σ_j (P_CDF,j − z_CDF,j)²   (7)
Although RPS is quite informative regarding the predictive
power of the model, there is a possibility that the observed data
may be reproduced by pure chance. Therefore, it is necessary to
compare the RPS of the model against the RPS of the random pro-
cess and assess their effectiveness. This test can be done through
the ranked probability skill score (RPSS), which has been used
in various climatological contexts to compare a model's skill in
predicting categorical rainfall and streamflow quantities (Regonda
et al. 2006). A detailed description of the RPSS method is provided
by Wilks (1995). The RPSS is computed by forming a ratio be-
tween the average RPS values of the model and chance

  RPSS = 1 − RPS_model / RPS_chance   (8)
The RPSS compares the accuracy of a model's predictions
against chance. However, in the research reported in this paper,
rather than simply comparing the model against a 50/50 chance for
fatality/no fatality (or a 33/33/33 chance for three-category re-
sponse variables), it was compared to the ratio of response vari-
ables provided by the original data. In other words, instead of pure
chance, a weighted coin was used, which is a more rigorous test
of model performance. The range for RPSS is from minus infinity
to 1, where negative values indicate that the model results are worse
than chance, 0 means that the model results reproduce chance
events, and positive values show that the model results are closer
to the original observations than chance.
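A sketch of this skill computation in Python is given below; the forecasts and observations are synthetic placeholders, and the reference forecast is the weighted coin built from the observed category frequencies, as described above.

```python
import numpy as np

def mean_rps(forecast_probs, observed_cats, n_categories):
    """Mean rank probability score over N forecasts, Eq. (7)."""
    total = 0.0
    for p, obs in zip(forecast_probs, observed_cats):
        p_cdf = np.cumsum(p)                          # forecast CDF over categories
        z_cdf = np.cumsum(np.eye(n_categories)[obs])  # observed (0/1) CDF
        total += np.sum((p_cdf - z_cdf) ** 2)
    return total / len(observed_cats)

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=100)                      # observed fatality / no fatality
model_p = np.column_stack([0.8 - 0.6 * y, 0.2 + 0.6 * y])   # toy model forecasts
climatology = np.bincount(y, minlength=2) / len(y)    # weighted-coin reference
chance_p = np.tile(climatology, (len(y), 1))

rpss = 1 - mean_rps(model_p, y, 2) / mean_rps(chance_p, y, 2)   # Eq. (8)
print(round(rpss, 3))    # > 0 indicates skill better than the reference
```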
Results
As mentioned previously, Esmaeili (2012) conducted content
analysis on 1,771 struck-by accident reports to identify fundamen-
tal safety attributes. The scree plot of fractional variance captured
by the various modes for SIC 1521 is illustrated in Fig. 2. Over 78%
of the fractional variance in the predictor set is captured by the first
five PCs, meaning that the 18-dimensional data (only 18 attributes
existed in SIC 1521) can be effectively represented by the five-
dimensional PCs.
As mentioned previously, PCs are linear combinations of dif-
ferent attributes. The loadings obtained by the PCA can be used to
determine the weight of various attributes (variables). The loadings
for the first five PCs of SIC 1521 are provided in Table 5. The first
PC, which captures approximately 31% of the variance, essentially
represents tasks related to working with tools and nail guns. This is
reasonable, because SIC 1521 includes general contractors, single-
family housing projects in which working with nail guns and power
tools are very common activities. Working with heavy equipment
has the highest loading on the second PC. For the last three PCs,
attributes related to struck-by objects, such as falling objects, struc-
ture collapse, lifting heavy materials, and transporting heavy ma-
terials horizontally, were the most influential attributes. The same
procedure was conducted on PCs for the remaining SICs. The se-
lected number of PCs and variance captured for different categories
of SIC is shown in Table 6.
Once the PCs were selected, GLMs with a logit link function
were fit to the selected PCs to predict the probability of fatality.
A stepwise regression approach found the parameter set that mini-
mized the model AIC. The overall results of generalized linear
models for two-category response variables are shown in Table 7.
Fig. 2. Scree plot of variance captured by each PC for SIC 1521
Table 5. Principal Component Analysis Loadings for the First Five PCs of SIC 1521
Number Attributes PC1 PC2 PC3 PC4 PC5
1 Working in swing area of a boomed vehicle 0.123 0.201 0.119 ——
2 Workers on foot and moving equipment ———0.123
3 Site topography 0.151 0.179 0.129
4 Working with heavy equipment 0.247 0.668 0.192 0.145
5 Nail gun 0.493 0.271 0.138
6 Working with power tools/large tools 0.531 0.257 0.123
7 Equipment backup ————0.199
8 Falling objects 0.379 0.361 0.462 0.383 0.456
9 Structure collapse 0.310 0.457 0.471 0.000 0.473
10 Lifting heavy materials 0.322 0.202 0.466 0.732 0.112
11 Transporting heavy materials horizontally 0.216 0.285 0.421 0.447 0.660
Variance captured (%) 30.9 18.9 10.9 9.1 7.9
By adopting a stepwise variable selection method, the number of
PCs was reduced for most of the SIC categories. By looking at sig-
nificant PCs and their related attributes in each SIC group, more
insight can be obtained into the critical attributes that contrib-
ute to fatalities. For example, two variables [(1) PC1, and (2) PC2]
emphasize the importance of working with power tools (e.g., nail
gun) and working with heavy equipment in causing fatality in SIC
1521. The same procedure was performed for the three [(1) not
severe, (2) mild, and (3) severe] and five [(1) first aid, (2) medical
case, (3) lost work time, (4) permanent disablement, and (5) fatality]
categorical response variables. However, only the results for two-
category response variables (i.e., fatality/no fatality) are presented
because these models had superior performance in comparison
to the other models.
For the two-category injury outcomes (i.e., fatality/no fatality),
since the response to be modeled varies binomially, the logit link
function was used to transform responses, x, into the linear predic-
tor. By estimating parameters (β), link functions (ηi) can be calcu-
lated, and by back-transforming link functions with the inverse
logit, probabilities of fatalities can be obtained. For example, the
underlying formula of the model for SIC 1521 is

  ln[P(fatality) / (1 − P(fatality))] = β_0 + β_1 × PC1 + β_2 × PC2   (9)

By substituting the values β_0 = intercept = −0.908, β_1 = 0.899,
and β_2 = 1.030

  ln[P(fatality) / (1 − P(fatality))] = −0.908 + 0.899 × PC1 + 1.030 × PC2   (10)
Model Validation
To measure the predictive power of the models, the research re-
ported in this paper implemented the widely-used measure for
categorical data, i.e., rank probability skill score. As stated previ-
ously, RPSS is one of the strictest verification measures. The RPSS
can vary from minus infinity (no skill) to 1 (perfect skill), and the
expected value of RPSS is less than zero (Mason 2004), which
means that any value greater than zero indicates superior perfor-
mance of the model to the reference forecast. Unfortunately, there
is no established acceptable range for RPSS; however, these values
can be used to compare the skill performance among different mod-
els. For example, the RPSS values obtained as reported in this paper
can be used as a baseline to compare the performance of future
predictive models in the construction safety domain.
The results of the RPSS for different categories of SIC are
shown in Table 8. The RPSS values for all 27 models demonstrate
a strong model performance. The lowest RPSS values for the
two-category response variable models belong to SIC 1611 (0.047),
SIC 1623 (0.047), and SIC 1542 (0.088), which have a high rate of
fatalities, i.e., 76, 64, and 45%, respectively. Similar patterns can be
observed in the three-category and five-category response variables
models. The best RPSS for all three different types of predictive
models belongs to SIC 1522, which has 0.246 for two-category,
0.226 for three-category, and 0.222 for five-category response
variables. The range of RPSS values is comparable with other pre-
dictive models for the control of mixed-mode buildings (RPSS =
0.154–0.186; May-Ostendorp et al. 2011) and weather forecasting
(RPSS = 0.0–0.50; Clark et al. 2004). In general, the obtained re-
sults indicate strong predictive power for all 27 predictive models.
The RPSS values for different groups were compared to find any
significant difference between them.
Table 6. Number of PCs Selected and Total Variance Captured
SIC code:          1521   1522   1541   1542   1611   1622   1623   1629   Total
Number of PCs:        5      5      8      8     13      9      6      9      12
Total variance:    0.78   0.75   0.89   0.84   0.96   0.92   0.71   0.88    0.88
Note: SIC codes refer to the Standard Industrial Classification; the definition of each category is provided in Table 2.
Table 7. Overall Results of the Stepwise Generalized Linear Models for
Two-Category Response Variables
Number  SIC code (a)  Predictor (b)  Estimate (c)  Standard error (d)  t (e)  Significance (f)
1 1521 Intercept 0.908 0.201 4.53 0.000
PC1 0.899 0.310 2.91 0.004
PC2 1.030 0.311 3.31 0.001
2 1522 Intercept 0.881 0.295 2.98 0.003
PC1 1.294 0.498 2.60 0.009
PC2 0.808 0.470 1.72 0.086
PC4 1.004 0.712 1.41 0.159
3 1541 Intercept 0.247 0.228 1.09 0.278
PC2 0.611 0.419 1.46 0.145
PC7 1.931 0.834 2.31 0.021
4 1542 Intercept 0.316 0.158 2.00 0.046
PC1 0.422 0.263 1.61 0.108
PC2 0.408 0.271 1.51 0.132
PC5 0.818 0.412 1.99 0.047
PC6 1.057 0.493 2.14 0.032
5 1611 Intercept 1.189 0.114 10.42 0.000
PC1 0.442 0.201 2.20 0.028
PC2 0.741 0.267 2.78 0.006
PC4 0.670 0.301 2.23 0.026
6 1622 Intercept 0.437 0.221 1.98 0.048
PC8 3.243 1.029 3.15 0.002
PC9 2.155 1.091 1.98 0.048
7 1623 Intercept 0.599 0.141 4.24 0.000
PC2 0.744 0.298 2.50 0.013
8 1629 Intercept 1.339 0.294 4.55 0.000
PC1 1.880 0.506 3.71 0.000
PC2 0.720 0.506 1.42 0.155
PC3 1.604 0.695 2.31 0.021
PC6 1.805 1.065 1.70 0.090
PC7 2.084 1.145 1.82 0.069
PC8 1.329 0.679 1.96 0.050
PC9 3.026 1.511 2.00 0.045
9 All data (g)
Intercept 0.420 0.057 7.32 0.000
PC1 1.042 0.106 9.80 0.000
PC2 0.536 0.105 5.11 0.000
PC3 0.319 0.142 2.25 0.024
PC6 0.623 0.179 3.48 0.001
PC7 0.738 0.194 3.81 0.000
PC8 0.581 0.202 2.88 0.004
Note: In reference to fatality/not fatality.
(a) SIC = Standard Industrial Classification. The definition of each category is
provided in Table 2.
(b) Independent variables or predictors that were selected to be included in the
predictive model.
(c) Regression coefficient for each variable.
(d) Standard error (σ) of the variable's regression coefficient, which measures the
dispersion of the regression coefficient over the sampling distribution.
(e) Value of the t-statistic, calculated by dividing the value of the coefficient
by its standard error and compared to the theoretical t-distribution.
(f) Significance of the t-statistic (P-value), which is the estimated probability of
obtaining the sample results when the true regression coefficient is zero.
(g) All data includes data points from the eight SIC categories.
Repeated-Measures Design
The repeated-measures ANOVA was selected to compare RPSS
values obtained from different categories of response variables
because the same set of response variables was used under different
conditions. This test has three important univariate statistical as-
sumptions (Vogt 1999), as follows: (1) randomness of the sample,
(2) homogeneity of variance of differences between treatment lev-
els, and (3) normal distribution of the differences between treatment
levels. Assumption 1 implies that the sample should be randomly
selected from the population. In this case, all accident reports in
each SIC category were used to develop the predictive models. Although there would
be some underreporting of accidents (e.g., near-misses and mild
injuries), the data can be considered as a representative sample of
struck-by accidents across the diverse geographical and cultural
regions of the United States.
Assumption 2, which is usually referred to as sphericity, re-
quires the equality of the variances of the differences between treat-
ment levels. Assumption 2, which is similar to homogeneity of
variance in between-group ANOVA, is a more general condition
than compound symmetry. Mauchly's test for sphericity is the most
common way to test the hypothesis that the variances of the differ-
ences between conditions are equal. If data violate the sphericity
assumption, the literature suggests several corrections to modify
the F-ratio, such as Greenhouse–Geisser (Greenhouse and Geisser
1959), lower-bound, and Huynh–Feldt (Huynh and Feldt 1976).
SPSS 17 was used to conduct the analysis (Field 2013). Mauchly's
test indicated that the assumption of sphericity has been met (χ² =
4.250, P-value = 0.119 > 0.05).
Assumption 3 implies that the differences between treatment
levels should be normally distributed. Kolmogorov–Smirnov and
Shapiro–Wilk tests were conducted to test normality, and the results
appear in Table 9. Some of these data are not normally distributed
(P-value < 0.05); therefore, the last assumption is not satisfied.
Because the assumptions of repeated-measures ANOVA are
not satisfied, the research reported in this paper used an alternative
nonparametric test called the Friedman two-way ANOVA by ranks
(Friedman 1937). The Friedman two-way ANOVA by
ranks tests the null hypothesis that the k repeated measures or
matched groups come from the same population or populations
with the same median (Siegel and Castellan 1988). If the result of
the Friedman test is significant, at least one of the groups of the
samples is different from the other samples. The results of the
test indicate that there is a significant difference (χ² = 9.750,
P-value = 0.005 < 0.05) between the RPSS values of the two-,
three-, and five-category response variables.
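These tests can be traced from the Table 8 values; the Python sketch below uses the RPSS of the eight SIC groups (treating the "All data" column as excluded, which is an assumption about how the reported statistics were formed) and reproduces the Friedman statistic of 9.75 quoted above.

```python
from scipy.stats import friedmanchisquare, shapiro, wilcoxon

# RPSS values for the eight SIC groups, taken from Table 8.
two   = [0.149, 0.246, 0.139, 0.088, 0.047, 0.204, 0.047, 0.206]
three = [0.120, 0.226, 0.073, 0.074, 0.034, 0.040, 0.036, 0.158]
five  = [0.189, 0.222, 0.103, 0.086, 0.029, 0.050, 0.037, 0.152]

# Normality of the paired differences (cf. Table 9), then the Friedman test.
print(shapiro([t - h for t, h in zip(two, three)]))
print(friedmanchisquare(two, three, five))    # statistic = 9.75, as reported

# Post hoc pairwise comparison (cf. Table 10), judged against alpha = 0.05/3.
print(wilcoxon(two, three))
```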
Post-Hoc Tests
When the result of the Friedman test is significant, it indicates that
at least one of the categories differs from at least one other category;
however, it does not show how many of the groups or which one
of the groups is different. To supplement this finding, the research
reported in this paper used two post hoc tests. The Wilcoxon signed
ranks test evaluated the relative magnitude as well as direction of
differences between the assorted groups (Table 10). Applying a
Bonferroni correction, all effects were compared with a 0.0167
level of significance. It could thus be concluded that only the differ-
ence between the RPSS values for the two-category and three-
category procedures was significant (P-value = 0.004 < 0.0167).
The second way to conduct multiple comparisons between
groups is to determine the differences |R_u − R_v| for all pairs of
conditions or groups. Then, the significance of individual pairs of
differences can be tested as per Siegel and Castellan (1988)

  |R_u − R_v| ≥ z_(α/[k(k−1)]) × √[N k (k + 1) / 6]   (11)
where R_u = sum of ranks for category u; R_v = sum of ranks for
category v; N = number of cases; k = number of categories;
Table 8. Rank Probability Skill Score Values for Two-Category, Three-Category, and Five-Category Response Variables
Response variables   1521   1522   1541   1542   1611   1622   1623   1629   All data   Mean
Two-category         0.149  0.246  0.139  0.088  0.047  0.204  0.047  0.206  0.116      0.138
Three-category       0.120  0.226  0.073  0.074  0.034  0.040  0.036  0.158  0.085      0.094
Five-category        0.189  0.222  0.103  0.086  0.029  0.050  0.037  0.152  0.083      0.106
Note: Columns 1521 through 1629 are Standard Industrial Classification codes.
Table 9. Test of Normality for Differences between Treatment Levels
Differences between treatment levels   Kolmogorov–Smirnov (statistic, significance)   Shapiro–Wilk (statistic, significance)
Two-category and three-category        0.251, 0.146                                   0.720, 0.004
Two-category and five-category         0.279, 0.067                                   0.750, 0.008
Three-category and five-category       0.339, 0.007                                   0.710, 0.003
Table 10. Results of Wilcoxon Signed Ranks Test
Summary statistics                      Two-category and three-category   Two-category and five-category   Three-category and five-category
Z (a)                                   2.521                             1.680                            1.260
Sum of negative ranks                   36.000                            30.000                           9.000
Sum of positive ranks                   0.000                             6.000                            27.000
Effect size (b)                         0.630                             0.420                            0.315
Asymptotic significance, two-tailed     0.012                             0.093                            0.208
Exact significance, one-tailed          0.004                             0.055                            0.125
(a) Based on positive ranks.
(b) According to Rosenthal (1991), effect size can be calculated as r = Z/√N, in which N = 16.
and z_(α/[k(k−1)]) is the abscissa value of the unit normal distribution
above which lies a proportion α/[k(k−1)] of the distribution. For α =
0.05 and k = 3, z will be equal to 2.394. As a result, the right-hand
side of the inequality will be calculated as 9.576. By looking at
Table 10, if the difference between the sums of ranks is bigger than
or equal to the critical difference, then that difference is significant.
Performing the calculations, it is found that only the difference be-
tween the first and second types of response variables (12) exceeds
the critical difference (9.576), which means that the difference
between these categories is significant.
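The critical difference quoted above can be checked numerically; the sketch assumes N = 8 matched groups (the eight SIC categories), which reproduces the reported value of 9.576, and the rank sums shown are derived from the Table 8 RPSS values for illustration.

```python
from math import sqrt
from scipy.stats import norm

alpha, k, N = 0.05, 3, 8                       # N = 8 SIC groups (assumed)
z = norm.ppf(1 - alpha / (k * (k - 1)))        # abscissa of the unit normal, ~2.394
critical_diff = z * sqrt(N * k * (k + 1) / 6)  # ~9.576, Eq. (11) right-hand side

# Rank sums of the two-, three-, and five-category models over the 8 SIC groups
# (computed from the Table 8 RPSS values, shown here for illustration).
R_two, R_three, R_five = 23, 11, 14
print(round(critical_diff, 3), abs(R_two - R_three) >= critical_diff)   # 12 vs 9.576
```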
In general, the RPSS values for the two-category response
variable models are, on average, better than those of the three-category
or five-category response variable models. This result was expected
because fatalities were the dominant response variable in most
of the SIC groups. In addition, dividing the nonfatal responses in the
three-category or five-category response variables would give a
higher weight to fatality and decrease the predictive power of the
models.
A diagnostic test was also conducted, and the residual plots
(actual Y less predicted Y, plotted against predicted Y) show
a random distribution. This confirms that the assumption about
normality is valid. There was no need to analyze multicollinearity
among variables because the PCs are orthogonal and PCA removes
any multicollinearity.
Practical Implications
While the mathematics behind the models is complicated, the find-
ings can be easily used in practice. For example, to calculate the
probability of a fatality for an activity in SIC 1521, the following
steps should be followed:
1. The list of struck-by attributes should be reviewed by a practi-
tioner to decide which attributes workers would be exposed to
during the activity. For example, assume that there are three
main attributes, as follows: (1) nail gun, (2) falling objects,
and (3) material storage. The matrix of observation can be
constructed by assigning a one to attributes that exist and a
zero to attributes that do not exist. The matrix would appear
like this

  X = [0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0]_(1×22)   (12)
2. The PCs that would be entered into the predictive model
should be calculated. As mentioned previously, PCs can be
calculated from Eq. (2)

  Z(PCs)_(1×22) = X(observations)_(1×22) × W(loadings)_(22×22)   (13)

Having already completed Step 1, the observation matrix is
ready, and the loading matrix has been calculated as shown in
Table 5. Therefore, the PCs can be calculated as

  Z(PCs) = [PC1 = −0.098; PC2 = −0.425; PC3 = 0.690; PC4 = 0.386; PC5 = 0.312; ...]   (14)
3. The resulting PC1 and PC2 can then be inserted into the pre-
dictive model for SIC 1521. The probability of fatality can be
calculated as
  ln[P(fatality) / (1 − P(fatality))] = −0.908 + 0.899 × (−0.098) + 1.030 × (−0.425)
                                      = −1.434
  which yields P(fatality) = 0.192   (15)
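The three steps above can also be scripted; the Python sketch below follows Eqs. (12)-(15), taking the PC scores directly from Eq. (14) because the full 22 × 22 loading matrix W is not reproduced in the paper.

```python
import numpy as np

# Step 1: observation vector for attributes 8 (nail gun), 14 (falling objects),
# and 16 (material storage), Eq. (12).
x = np.zeros(22)
x[[7, 13, 15]] = 1

# Step 2: in practice z = x @ W, with W from the SIC 1521 PCA (Table 5);
# here the leading scores are taken from Eq. (14).
pc1, pc2 = -0.098, -0.425

# Step 3: SIC 1521 model, Eq. (10), back-transformed with the inverse logit.
eta = -0.908 + 0.899 * pc1 + 1.030 * pc2
p_fatality = 1.0 / (1.0 + np.exp(-eta))
print(round(float(p_fatality), 3))   # ~0.192, matching Eq. (15)
```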
There are several practical implications for these results. For
example, empowered with this risk-assessment tool, a designer can
see the effect of different design elements on safety and alter the
design to provide a safer construction environment. If the hazards
cannot be prevented during the design, more attention should be
paid to mitigate them during the construction phase. In addition,
a project manager can compare alternative means and methods to
see which ones provide fewer hazards for the workers. Further-
more, a supervisor can identify hazardous activities or situations to
highlight them during job hazard analysis or toolbox meetings.
Limitations
Although the impact of these results on the current preconstruction
safety management is significant, there are several limitations re-
lated to this paper. First, splitting the data and conducting cross
validation is a robust method to check the validity of the model;
however, more studies should be conducted to test the validity of
the model in predicting hazards in real projects. Second, the exter-
nal validity of the research reported in this paper is limited because
the IMIS database includes only severe accidents that are required
to be reported by OSHA regulations. Therefore, the models are
applicable for accidents with serious outcomes. The predictive
models would likely change if the distribution of the input data
were to change. However, this limitation is intrinsic to the OSHA
IMIS database that was available. Future research should be con-
ducted to investigate predictive models for minor injuries and even
near-misses. Third, the safety risk can be mitigated by implement-
ing different practices. Further study should be conducted to evalu-
ate the effect of such injury-prevention practice implementation on
changing the injury outcome. In addition, the predictive models
developed here are based on classification of whether an attribute
contributed to an accident or not; however, in reality, the level of
contribution of each attribute can vary as a continuous variable be-
tween 0 and 1. By considering the impact of injury prevention prac-
tices on reducing the risk of an attribute, researchers can enter attributes
as continuous variables into predictive models. Fourth, this paper
focused on the safety attributes that can be identified during the
preconstruction phase. These attributes mainly address physically
unsafe conditions in a project, but accidents occur due to the interaction
between unsafe conditions and unsafe behavior. Another study
should be conducted to predict the injury outcomes while consid-
ering both attributes related to unsafe behavior and unsafe physical
conditions. Fifth, these models predict the severity of the outcome
if an incident occurs, but they do not predict the probability of an
event, which is a very small number. Sixth, this model considers
only elements that contribute to the probability of an injury but
does not consider any risk mitigating counterparts. A unified model
is needed. Seventh, while GLM is a strong statistical technique to
develop predictive models, future studies should be conducted to
compare the performance of predictive models developed using GLM
with other nonlinear modeling techniques such as neural networks
and support vector machines. Finally, the practical implications
of the developed models should be tested in real construction projects.
Developing decision support systems to facilitate and automate
adoption of attribute-based safety risk management can be
rewarding.
Despite these limitations, the proposed predictive models offer a practical and straightforward way for designers, jobsite engineers, and safety managers who are not versed in extensive mathematical calculations to reliably predict the level of hazard in a project.
Conclusions
Identifying the level of risk that a project entails before construction begins is an essential step toward implementing proactive safety management and achieving excellent safety performance. Unfortunately, the current models that predict the safety outcomes of projects are not based on objective data, ignore unsafe physical conditions, and cannot be used in the preconstruction phase of a project. To address these limitations, the research reported in this paper used an attribute-based risk analysis approach to develop predictive models that forecast injury outcomes and that may be implemented during the early stages of projects. To achieve this objective, the research built upon a previously established content analysis of 1,771 struck-by accident reports from the OSHA IMIS database (Esmaeili 2012), which identified 22 fundamental attributes (Table 3) that cause struck-by accidents and recorded the severity of the injuries resulting from them. In the research reported in this paper, the safety attributes (e.g., flagger on the jobsite) were the predictor variables and the severity of injury (e.g., medical case) was the response variable. The safety attributes served as inputs to the predictive models, acting as leading indicators of construction safety and forming the basis for injury-prevention activities.
To reduce the dimension of the data and to remove any possible multicollinearity among the variables, the matrix of observations was subjected to principal component analysis. The influential PCs were then entered into GLMs, and three series of mathematical models were developed (for nine different groups of data), as follows: (1) models that predict the probability of fatality; (2) models that predict the probabilities of nonsevere, mild, and severe injuries; and (3) models that predict the probabilities of first aid, medical case, lost work time, permanent disablement, and fatality. To evaluate the predictive power of the models, the RPSS of each model was calculated and the performance of the different models was compared. The RPSS results indicated that the developed models perform better than chance and are valid.
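For readers who wish to reproduce the general workflow, the following minimal R sketch illustrates the three-step pipeline summarized above: PCA on binary attribute indicators, a logit-link GLM fit to the leading PC scores, and an RPSS comparison against a climatological (base-rate) reference. The data are simulated, and the sample size, number of retained components, and variable names are illustrative assumptions rather than the values used in this paper; only the binary fatality model is shown, for which the ranked probability score reduces to the Brier score.

## A minimal sketch, not the authors' original script: simulate binary
## attribute data, reduce dimension with PCA, fit a logit-link GLM on the
## leading PCs, and score the forecasts with an RPSS.

set.seed(1)

## Hypothetical data: rows are accident reports, columns are 0/1 indicators
## of the 22 fundamental attributes (1 = attribute contributed to the case).
n <- 500
X <- matrix(rbinom(n * 22, 1, 0.3), nrow = n,
            dimnames = list(NULL, paste0("attr", 1:22)))
y <- rbinom(n, 1, 0.2)               # 1 = fatality, 0 = nonfatal (illustrative)

## Step 1: principal component analysis (orthogonal, removes collinearity).
pca    <- prcomp(X, scale. = TRUE)
scores <- pca$x[, 1:5]               # number of retained PCs is illustrative

## Step 2: generalized linear model with a logit link on the PC scores.
dat   <- data.frame(y = y, scores)
fit   <- glm(y ~ ., data = dat, family = binomial(link = "logit"))
p_hat <- predict(fit, type = "response")   # forecast probability of fatality

## Step 3: rank probability skill score relative to climatology. With two
## categories, the ranked probability score reduces to the Brier score.
rps    <- function(p, o) mean((p - o)^2)
p_clim <- mean(y)                          # climatological forecast: base rate
rpss   <- 1 - rps(p_hat, y) / rps(rep(p_clim, n), y)
print(rpss)                                # > 0 indicates skill better than chance

For the three- and five-category models, the same skill score would be computed from cumulative category probabilities (Wilks 1995), and, as in the paper, split-sample cross validation rather than in-sample scoring should be used to assess predictive skill.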
The research reported in this paper resulted in several reliable and valid predictive models that can be used by practitioners, managers, supervisors, and researchers to accurately forecast the probability of different types of injuries. Safety predictions are the cornerstone of an effective proactive safety management program. Just as meteorologists build moderately accurate weather forecasts by monitoring information (e.g., temperature and wind speed) from various sources, safety managers can use the fundamental safety risk attributes to predict the probabilities of different injury outcomes. In meteorological terms, the models tell the project manager whether a storm (safety issue) is approaching the project.
Before the research reported in this paper, project managers had little in the way of a systematic method for predicting the possible injury outcomes of a project. With a dataset of sufficient size and quality, it is now possible to apply statistical techniques and create reliable mathematical models. This paper augments recent studies that have recognized the importance of identifying hazards created by unsafe physical conditions in the early stages of a project and lays the foundation for future work on preconstruction safety management and hazard mitigation. It is expected that these predictive models could change the way potential injuries are considered during the planning, project financing, and safety control stages, and will empower managers pursuing the ever-desirable goal of a zero-injury project.
Acknowledgments
The National Science Foundation is thanked for supporting the
research reported in this paper through an Early Career Award
(i.e., the CAREER Program). This paper is based upon work
supported by the National Science Foundation under Grant No.
1253179. Any opinions, findings, and conclusions or recommen-
dations expressed in this material are those of the writers and do not
necessarily reflect the views of the National Science Foundation.
Bentley Systems is also recognized for its financial support of the research reported in this paper, and Mr. Dean Bowman in particular provided invaluable insight into the application of this method.
References
Akintoye, A., and Skitmore, M. (1994). "Models of UK private sector quarterly construction demand." J. Constr. Manage. Econ., 12(1), 3–13.
Baradan, S., and Usmen, M. A. (2006). "Comparative injury and fatality risk analysis of building trades." J. Constr. Eng. Manage., 10.1061/(ASCE)0733-9364(2006)132:5(533), 533–539.
Barnett, V., and Lewis, T. (1994). Outliers in statistical data, 3rd Ed., Wiley, Chichester, U.K.
Chatfield, C., and Collins, A. J. (1980). Introduction to multivariate analysis, Chapman and Hall, New York.
Chen, J. R., and Yang, Y. T. (2004). "A predictive risk index for safety performance in process industries." J. Loss Prev. Process Ind., 17(3), 233–242.
Chua, D. K. H., and Goh, Y. M. (2005). "A Poisson model of construction incident occurrence." J. Constr. Eng. Manage., 10.1061/(ASCE)0733-9364(2005)131:6(715), 715–722.
Clark, M., Gangopadhyay, S., Hay, L., Rajagopalan, B., and Wilby, R. (2004). "The Schaake shuffle: A method for reconstructing space–time variability in forecasted precipitation and temperature fields." J. Hydrometeorol., 5(1), 243–262.
Cooper, M. D., and Phillips, R. A. (2004). "Exploratory analysis of the safety climate and safety behavior relationship." J. Saf. Res., 35(5), 497–512.
Esmaeili, B. (2012). "Identifying and quantifying construction safety risks at the attribute level." Ph.D. thesis, Univ. of Colorado, Boulder, CO.
Esmaeili, B., and Hallowell, M. R. (2013). "Integration of safety risk data with highway construction schedules." J. Constr. Manage. Econ., 31(6), 528–541.
Fang, D. P., Chen, Y., and Louisa, W. (2006). "Safety climate in construction industry: A case study in Hong Kong." J. Constr. Eng. Manage., 10.1061/(ASCE)0733-9364(2006)132:6(573), 573–584.
Field, A. (2013). Discovering statistics using IBM SPSS statistics, 4th Ed., Sage, Los Angeles.
Friedman, M. (1937). "The use of ranks to avoid the assumption of normality implicit in the analysis of variance." J. Am. Stat. Assoc., 32(200), 675–701.
Gillen, M., Baltz, D., Gassel, M., Kirch, L., and Vaccaro, D. (2002). "Perceived safety climate, job demands, and coworker support among union and nonunion injured construction workers." J. Saf. Res., 33(1), 33–51.
Glendon, A. I., and Litherland, D. K. (2001). "Safety climate factors, group differences and safety behavior in road construction." J. Saf. Sci., 39(3), 157–188.
Goh, B. H. (1999). "An evaluation of the accuracy of the multiple regression approach in forecasting sectoral construction demand in Singapore." J. Constr. Manage. Econ., 17(2), 231–241.
Goh, B. H., and Teo, H. P. (2000). "Forecasting construction industry demand, price and productivity in Singapore: The Box–Jenkins approach." J. Constr. Manage. Econ., 18(5), 607–618.
Goh, Y. M., and Chua, D. K. H. (2013). "Neural network analysis of construction safety management systems: A case study in Singapore." J. Constr. Manage. Econ., 31(5), 460–470.
Greenhouse, S. W., and Geisser, S. (1959). "On methods in the analysis of profile data." Psychometrika, 24(2), 95–112.
Hallowell, M. R., Esmaeili, B., and Chinowsky, P. (2011). "Safety risk interactions among highway construction work tasks." J. Constr. Manage. Econ., 29(4), 417–429.
Hallowell, M. R., and Gambatese, J. A. (2009). "Activity-based safety and health risk quantification for formwork construction." J. Constr. Eng. Manage., 10.1061/(ASCE)CO.1943-7862.0000071, 990–998.
Hastie, T., Tibshirani, R., and Friedman, J. (2002). The elements of statistical learning: Data mining, inference, and prediction, 2nd Ed., Springer, Amsterdam, Netherlands.
Hinze, J., Hallowell, M., and Baud, K. (2013). "Construction-safety best practices and relationships to safety performance." J. Constr. Eng. Manage., 10.1061/(ASCE)CO.1943-7862.0000751, 04013006.
Hinze, J., Huang, X., and Terry, L. (2005). "The nature of struck-by accidents." J. Constr. Eng. Manage., 10.1061/(ASCE)0733-9364(2005)131:2(262), 262–268.
Hotelling, H. (1933). "Analysis of a complex of statistical variables into principal components." J. Educ. Psychol., 24(7), 498–520.
Huynh, H., and Feldt, L. S. (1976). "Estimation of the Box correction for degrees of freedom from sample data in randomized block and split plot designs." J. Educ. Stat., 1(1), 69–82.
Johnson, S. E. (2007). "The predictive validity of safety climate." J. Saf. Res., 38(5), 511–521.
Jolliffe, I. T. (2002). Principal component analysis, 2nd Ed., Springer, Berlin.
Kaiser, H. F. (1960). "The application of electronic computers to factor analysis." Educ. Psychol. Meas., 20(1), 141–151.
Lee, S., and Halpin, D. W. (2003). "Predictive tool for estimating accident risk." J. Constr. Eng. Manage., 10.1061/(ASCE)0733-9364(2003)129:4(431), 431–436.
Mason, S. J. (2004). "On using climatology as a reference strategy in the Brier and ranked probability skill scores." Mon. Weather Rev., 132(7), 1891–1895.
May-Ostendorp, P., Henze, G., Corbin, C., Rajagopalan, B., and Felsmann, C. (2011). "Model predictive control of mixed-mode buildings with rule extraction." Build. Environ., 46(2), 428–437.
McCullagh, P., and Nelder, J. A. (1989). Generalized linear models, Chapman and Hall, London.
OSHA (Occupational Safety and Health Administration). (2013). "Statistics and data, standard industry classification (SIC) system search." ⟨http://goo.gl/iC2dV⟩ (Feb. 8, 2015).
Palau, C. V., Arregui, F. J., and Carlos, M. (2012). "Burst detection in water networks using principal component analysis." J. Water Resour. Plann. Manage., 10.1061/(ASCE)WR.1943-5452.0000147, 47–54.
Park, H. K., Han, S. H., and Russell, J. S. (2005). "Cash flow forecasting model for general contractors using moving weights of cost categories." J. Manage. Eng., 10.1061/(ASCE)0742-597X(2005)21:4(164), 164–172.
Pearson, K. (1901). "On lines and planes of closest fit to systems of points in space." Philos. Mag., 2(6), 559–572.
R Development Core Team. (2011). "R: A language and environment for statistical computing." Rep. Prepared for the R Foundation for Statistical Computing, Vienna, Austria.
Regonda, S., Rajagopalan, B., and Clark, M. (2006). "A new method to produce categorical streamflow forecasts." Water Resour. Res., 42(9), W09501.
Rosenthal, R. (1991). Meta-analytic procedures for social research, Sage, Newbury Park, CA.
Rozenfeld, O., Sacks, R., Rosenfeld, Y., and Baum, H. (2010). "Construction job safety analysis." J. Saf. Sci., 48(4), 491–498.
Saitta, S., Kripakaran, P., Raphael, B., and Smith, I. F. C. (2008). "Improving system identification using clustering." J. Comput. Civ. Eng., 10.1061/(ASCE)0887-3801(2008)22:5(292), 292–302.
Salas, J. D., Fu, C., and Rajagopalan, B. (2011). "Long-range forecasting of Colorado streamflows based on hydrologic, atmospheric, and oceanic data." J. Hydrol. Eng., 10.1061/(ASCE)HE.1943-5584.0000343, 508–520.
Siegel, S., and Castellan, N. J., Jr. (1988). Nonparametric statistics: For the behavioral sciences, McGraw-Hill, New York.
SPSS 17 [Computer software]. Cary, NC, SAS Institute.
Stats Package. (2013). "'prcomp' function." ⟨http://goo.gl/3PqhvM⟩ (Feb. 8, 2015).
Tam, C. M., and Fung, I. W. H. (1998). "Effectiveness of safety management strategies on safety performance in Hong Kong." J. Constr. Manage. Econ., 16(1), 49–55.
Tang, J. C. S., Karasudhi, P., and Tachopiyagoon, P. (1990). "Thai construction industry: Demand and projection." J. Constr. Manage. Econ., 8(3), 249–257.
Velicer, W. F., and Jackson, D. N. (1990). "Component analysis versus common factor-analysis: Some further observations." Multivariate Behav. Res., 25(1), 97–114.
Vogt, W. P. (1999). Dictionary of statistics and methodology: A non-technical guide for the social sciences, 2nd Ed., Sage, Thousand Oaks, CA.
von Storch, H., and Zwiers, F. W. (1999). Statistical analysis in climate research, Cambridge University Press, Cambridge, U.K.
Wilks, D. S. (1995). Statistical methods in the atmospheric sciences, Academic, New York.
Wold, S., Esbensen, K., and Geladi, P. (1987). "Principal component analysis." Chemom. Intell. Lab. Syst., 2(1–3), 37–52.
Yee, T. W., and Wild, C. J. (1996). "Vector generalized additive models." J. Roy. Stat. Soc. B, 58, 481–493.
Zohar, D. (1980). "Safety climate in industrial organizations: Theoretical and applied implications." J. Appl. Psychol., 12, 78–85.