Research Article
Transportation Research Record
1–19
© National Academy of Sciences:
Transportation Research Board 2021
Article reuse guidelines:
sagepub.com/journals-permissions
DOI: 10.1177/03611981211033278
journals.sagepub.com/home/trr
Analysis of Fatal Truck-Involved Work Zone Crashes in Florida: Application of Tree-Based Models
Rajesh Gupta¹, Hamidreza Asgari², Ghazaleh Azimi², Alireza Rahimi², and Xia Jin²
Abstract
This paper presents the results of an analysis focusing on large truck-involved work zone fatal crashes using seven-year crash data in the State of Florida. Decision tree/random forest models were applied to specifically detect critical crash patterns that result in a fatality outcome. Because of the imbalanced nature of crash severity data (very low frequency of fatal crashes compared with property damage only or injury), data were treated using random and systematic over-sampling techniques. Marginal effects were addressed using Shapley values to increase model explainability. From a methodological perspective, results showed that the combination of over-sampling techniques with ensemble random forests could significantly improve model performance in predicting fatal crashes (compared with conventional logistic regression models). Primary contributors included pedestrian involvement, lighting conditions, safety equipment, driver condition, driver age, and work zone locations. Pedestrian crashes involving factors such as dark-not lighted conditions, a distracted truck driver, and driver’s age (young drivers outside city limits, senior drivers inside city limits) were highly likely to be fatal. For non-pedestrian crashes, the combination of front airbag deployment with any restraint system other than shoulder and lap belt was quite likely to be fatal. Also, abnormal driver conditions increased the risk of a fatal outcome. Additionally, the presence of female drivers (as the second driver in multiple vehicle crashes) highly decreased crash severity, probably because females typically drive more carefully than males. Interestingly, truck driver actions and maneuvers as well as roadway design and other physical environment features (i.e., number of lanes, median type, roadway grade, and alignment) did not show significant contribution to the model.
By definition, a work zone is an area where roadwork
takes place, and it may involve lane closures, detours,
and moving equipment (1). As roadways start aging,
the increasing need for timely maintenance has resulted
in an increasing number of work zones in recent years.
Considering that work zones are usually accompanied
by traffic stream disruptions, and they usually expose
workers and their machinery to risk in vulnerable loca-
tions, such sites are dangerous places for both workers
and drivers. Statistics showed that work zone crashes
resulted in 667 fatalities and about 40,000 injuries in
2009 in the U.S.A. (2). In addition, according to
National Work Zone Safety, work zone crash fatality
has consistently increased in recent years, showing a
growth of approximately 36% from 2010 to 2017. In
particular, the State of Florida has been ranked as the
second most dangerous state for work zone fatalities
since around 2010, with an average of 67 fatal crashes
per year (3,4).

¹ Department of Statistics, University of Lucknow, Lucknow, Uttar Pradesh, India
² Department of Civil and Environmental Engineering, Florida International University, Miami, FL
Corresponding Author: Xia Jin, xjin1@fiu.edu

With the above in mind, work zone crash analysis has received increasing attention in safety analysis and decisions. There is an abundant body of research in work zone crash analysis focusing on various aspects, from crash frequencies and severity outcomes at the aggregate level (5–8) to disaggregate analyses that predict crash severity levels based on crash-specific attributes (9–11). From a methodological standpoint, parametric model structures have been widely used in crash severity analysis. In particular, a variety of discrete choice and logistic regression models have been formulated in the literature, including multinomial logit models (12–15), ordered response models (16–22), nested logit models (23–25), and random parameter (mixed) logit models (26–33).
While logistic regression models demonstrate an overall
satisfactory performance in establishing relationships
between input parameters and the target variable, their
prediction accuracy might be questionable in the presence
of imbalanced data. This is especially relevant to crash
analysis, given its intrinsic data imbalance
(i.e., the low number of fatal crashes compared with other
crash categories) (34,35). When evaluating models
that deal with imbalanced datasets, it is important to
assess how the model performs on predicting individual
classes, including the minority class, rather than the over-
all performance, which is usually biased to the majority
class. Furthermore, logistic models rely on several restric-
tive assumptions because of their parametric nature,
which may not stand true in real-life conditions.
To address the aforementioned shortcomings of con-
ventional models, some researchers have proposed alter-
native solutions, such as applying machine learning and
data mining techniques in crash severity analysis. In par-
ticular, tree-based models have been used to identify
decision rules (patterns) that lead to more severe crashes
(36–42). In general, decision trees help overcome violations of the statistical assumptions involved in conventional parametric models and potentially offer better treatment of categorical features; however, they have their own disadvantages, including the absence of sensitivity analysis and marginal effects assessment (34,35). To solve
this issue, some researchers have proposed combining
both parametric and non-parametric approaches (37,41).
The research presented in this paper is an effort to
analyze work zone crash severities in the State of
Florida. This study is encouraged by several technical
and practical motivations. First and foremost, it aims to
identify factors associated with work zone fatal crashes
for the purpose of equipping the Florida Department of
Transportation with clear and transparent information
to develop countermeasures and improve work zone
safety. Second, this study focuses on crashes where large
trucks are involved. Large trucks are considered of high
risk in work zones because of their larger sizes and lower
capabilities for fast reactions and preventive maneuvers,
as well as the higher levels of losses involved. Third, both
non-parametric (decision trees) and parametric (logistic
regression) tools are applied for the predictive analytics
and the results directly compared in view of model accu-
racy and prediction errors. In particular, standard
machine learning techniques such as data balancing and
Shapley values are used. The former is used to improve
models’ prediction in view of the minority group, while
the latter explains the marginal effects of each contribut-
ing factor toward crash severity.
Literature Review
Crash Contributing Factors in Work Zones
Several studies have focused on work zone crash severity
and how different factors contribute to the crash out-
come. In particular, researchers have investigated the
impact of several different parameters, including
driver and vehicle attributes, time of day, and roadway
features, as well as weather and lighting conditions
(7–9,30,43–48). In most cases, severity is positively cor-
related with higher posted speed limits, streetlight condi-
tions during nighttime, drug/alcohol influence, and truck
involvement. On the contrary, and as expected, the use
of restraint systems (e.g., seatbelts) and airbag deploy-
ment, along with the presence of work zone control
devices (such as flags, cones, or flashing lights) tend to
reduce the level of severity. Some other studies have gone
further and incorporated drivers’ behavior into severity
analysis (44,49–52). According to these studies, actions
such as violating speed limit, inattentive driving, impro-
per passing or lane changing, and following too closely
significantly lead to higher severity levels. In view of
truck involvement, there is an agreement that involve-
ment of trucks/large trucks will increase crash occurrence
or crash severity (44,49,53–55).
Discussion on Crash Severity Modeling Approaches
A quick review of the literature reveals the predominance
of logistic regression methods in crash severity analysis.
Applications of different forms of binary, multinomial, ordered, and nested logistic regression have been documented. The logistic regression model offers several advantages. In particular, it produces convenient probability scores for individual observations, does not require a linear relationship between the independent features and the target variable, and is computationally efficient in time and memory requirements. Moreover, the model can be easily interpreted based on the coefficients and t-values associated with each of the independent variables (56–58). Also, the model can be specified to
comply with additional assumptions, such as ordered
logit models to account for ordered dependent variables,
nested-structures to incorporate certain types of correla-
tions within different classes, and random parameter
models to incorporate heterogeneity (59,60). Despite the
advantages, logistic regression has its own restrictions. In
particular, the model assumes that there is a linear rela-
tionship between the independent variables and the logit
(log odds) of the dependent variable. Transformations
are required when non-linear relationships are observed.
In addition, the model has a low capability of handling
numerous categorical variables. The presence of categori-
cal variables, lack of linear relationship between the
parameters and the logit of the target variable, and
multi-collinearity are some of the major modeling draw-
backs associated with logistic regression models in crash
and safety analysis (57).
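The linearity assumption discussed above concerns the log odds rather than the outcome itself. As a brief illustration, the standard binary logit form is shown below; the symbols p (probability of the outcome), x_j (predictors), and β_j (coefficients) are notation assumed here and are not taken from the paper, which uses ordered logit models.

```latex
\log\frac{p}{1-p} = \beta_0 + \sum_{j=1}^{J} \beta_j x_j
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + \exp\!\left(-\beta_0 - \sum_{j=1}^{J} \beta_j x_j\right)}
```

The left-hand form is linear in the predictors, while the implied probability p is a non-linear (sigmoid) function of them, which is why a non-linear effect on the log odds requires a transformation of the corresponding predictor.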
To address the limitations, machine learning techniques
were employed as the next step in the literature. In particu-
lar, non-parametric approaches such as decision trees and
support vector machines (SVM) have gained increasing
attention because of their capability in relaxing the restrictions on data distribution properties. Successful instances of machine learning methods in safety analysis have been documented in recent years (61–68). While machine learning approaches tend to increase the predictive power of
the models, they are more of a ‘‘black box’’ in nature and
have usually suffered from a lack of explainability. Some
studies have tried to address the explainability issue by
combining both parametric and non-parametric
approaches. For instance, researchers combined decision
trees with logistic regression models (9) or with probit
models (41). There was an agreement that the combined
approach would improve models’ performance by remov-
ing the interaction effects observed in parametric models
and also resolve the explainability issues associated with
non-parametric structures.
Data Imbalance
One major issue in safety analysis is the natural imbal-
ance observed in crash data. Data imbalance refers to the
situation where observations from one specific class
(also called the minority class) have remarkably lower
frequency compared with other classes in the dataset (69–71).
Data imbalance is a critical issue for several reasons.
First, in most empirical cases, it is the minority class that
the modelers are interested in (e.g., fatal crashes in this
case). Second, the cost of misclassification for the minor-
ity class is usually much higher compared with other
types of misclassification. In the case of crash analysis,
while crash fatalities are quite rare compared with prop-
erty damage only (PDO) or injury crashes, the associated
loss is remarkably higher.
In practice, imbalanced data is the root of one major modeling-related issue, theoretically referred to as the ‘‘accuracy paradox’’ (72). Accordingly, a model can still have a satisfactory overall accuracy while it performs quite poorly on the minority class (34). Considering the importance of data imbalance and its prevalence in the real world, modelers have been looking for appropriate strategies to tackle this challenge. In particular, successful applications of data balancing techniques and consequent model improvements have been recently reported in safety analysis literature (73–76). Ahmadi et al. showed that an SVM model slightly outperformed multinomial and mixed logit models, provided that model parameters are efficiently tuned. They addressed data imbalance by tuning separate cost parameters in their SVM model structure (73). Lamba et al. (74) used a variety of over- and under-sampling techniques and showed that a combination of algorithmic feature selection with random over-sampling provides the best model performance for precision and recall. Yahaya et al. used a variety of variable selection techniques and combined them with the SMOTE over-sampling strategy. They concluded that using balanced data can be a more efficient approach to identify the most prolific predictors of crash injury severity (75). Fiorentini and Losa applied the random under-sampling of the majority class (RUMC) method and showed that the models built on balanced data had significantly higher true positive rates in both logistic regression and machine learning (random tree, random forest, and k-nearest neighbor) models. It was concluded that the use of balanced datasets seems to be essential for correctly predicting crashes with higher severities (76).
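The accuracy paradox described above can be illustrated with a minimal sketch. The 1,000-record sample is hypothetical; it only mirrors the class shares reported in the Data Description section (75.7% PDO, 22.7% injury, 1.6% fatal), and the "model" is a trivial majority-class predictor rather than any model from the study.

```python
# Hypothetical 1,000-record sample mirroring the reported class shares
# (PDO 75.7%, injury 22.7%, fatal 1.6%); not the actual crash records.
actual = ["PDO"] * 757 + ["injury"] * 227 + ["fatal"] * 16

# A trivial "model" that always predicts the majority class.
predicted = ["PDO"] * len(actual)

def accuracy(y_true, y_pred):
    """Share of records whose predicted label matches the actual label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, label):
    """Share of records with the given actual label that were predicted as such."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)

overall_accuracy = accuracy(actual, predicted)
fatal_recall = recall(actual, predicted, "fatal")
print(f"accuracy = {overall_accuracy:.3f}, fatal recall = {fatal_recall:.3f}")
```

Under these assumptions the overall accuracy looks acceptable (about 0.757) even though no fatal crash is identified (recall of 0), which is why class-specific metrics matter for imbalanced data.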
Data Description
The study used a dataset extracted from the Signal Four
Analytics website (https://S4.Geoplan.Ufl.Edu/). The
website has been developed by the Florida Department
of Highway Safety and Motor Vehicles and includes sta-
tewide crash records. The website comprises information
about crashes collected by Florida Highway Patrol offi-
cers. Driver information, vehicle features, crash charac-
teristics, and environment-related information at the
time of a crash is recorded on the website. The focus of
this study is on large truck crashes that occurred in work
zones between 2010 and 2016. Large trucks are defined
as trucks with a gross vehicle weight rating higher than
10,000 lb. The final dataset includes a total of 5,402
records, of which 75.7% were PDO crashes, 22.7% were
injury crashes, and 1.6% were fatal crashes.
For descriptive analysis, this study is focused on the
fatal crashes. Driver characteristics are those associated
with the at-fault drivers. The at-fault driver was deter-
mined and reported by the police officers who responded
to the accidents. Figure 1 illustrates at-fault driver action at the time of the crash for fatal crashes. The predominant category was operating the motor vehicle in a careless or negligent manner (34.8%), followed by other contributing action (12.4%), no contributing action (11.2%), and failure to keep in the proper lane (9.0%).
Table 1 shows the descriptive statistics for fatal crashes. Operating the motor vehicle in a careless or negligent manner (34.8%) was the predominant driver action category. Furthermore, normal conditions (50.6%) had the highest percentage among driver condition categories, and the majority (50.6%) of drivers were not distracted. The majority of crashes happened on straight alignment (88.8%) and level grade (85.4%). Most of the work zone crashes occurred on two-way divided traffic ways with a median barrier (64%) and two-way not divided traffic ways (24.7%). Daylight conditions (49.4%) and clear weather (64%) had the highest percentages for light and weather conditions, respectively. For roadway type, interstate (42.7%), state (25.8%), and U.S. roads (12.4%) had most of the work zone crashes. Also, most crashes (73%) happened outside of the city limits.
As to work zone type, work on shoulder or median (58.4%) and lane closure (24.7%) had the highest frequency among all categories. As to the restraint system, a significant percentage (30%) of the drivers did not use any restraint system. Moreover, rear-end crashes (31.5%) and crashes with a pedestrian (14.6%) were the major crash type categories. Motor vehicle in transport (65.2%) had the highest percentage among most harmful events.
Methodology
The methodology consists of four different steps:
1. Data balancing using well-known resampling techniques
2. Variable selection using random forest feature importance
3. Development of parametric and non-parametric models using both raw and balanced data
4. Assessment of the marginal effects of the contributing factors using Shapley values
Data Resampling
Different resampling techniques have been introduced in theory (77–80), and several cases of their applications both in research and industry have been documented in modeling and data science literature (81–84). This study explored both random over-sampling as the simplest over-sampling method and a more systematic approach, known as the synthetic minority over-sampling technique for nominal and continuous features (SMOTE-NC).
Random over-sampling involves supplementing the training data with multiple copies of some of the minority-class records. One issue with random over-sampling is that it just naively duplicates existing records. Therefore, although classification algorithms are exposed to a greater number of observations from the minority class, they will not learn much more about how to set minority and majority classes apart. In other words, the new dataset does not contain more information about the characteristics of the minority class than the original data.
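Random over-sampling as described above can be sketched in a few lines. The function name `random_oversample`, the toy data, and the choice to raise every class to the size of the largest class are assumptions made for illustration; the paper does not state the target ratio it used.

```python
import random
from collections import Counter

def random_oversample(records, labels, seed=0):
    """Duplicate randomly chosen records until every class matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # assumption: balance up to the largest class
    out_records, out_labels = list(records), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(records, labels) if y == cls]
        for _ in range(target - n):
            out_records.append(rng.choice(pool))  # exact copy of an existing record
            out_labels.append(cls)
    return out_records, out_labels

# Hypothetical toy training set: 6 PDO, 3 injury, 1 fatal.
X = [[i] for i in range(10)]
y = ["PDO"] * 6 + ["injury"] * 3 + ["fatal"]
X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)
print(balanced_counts)
```

The balanced set contains more rows but no new distinct records, which is the limitation noted in the text.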
A more advanced alternative to random over-sampling is SMOTE. Instead of duplicating existing records, it creates new synthetic records based on the existing observations. The SMOTE algorithm is parameterized with k (the number of nearest neighbors it will consider) and the amount of over-sampling required (the number of new points to be created). Each step of the algorithm will:
1. Randomly select a minority point.
2. Randomly select any of its k nearest neighbors belonging to the same class.
3. Randomly specify a lambda value in the range [0, 1].
4. Generate and place a new point on the vector between the two points, located a fraction lambda of the way from the original point.
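The four steps above can be sketched as follows for purely continuous features. This is a minimal illustration, not the implementation used in the study; the function name `smote`, its parameters, and the toy points are hypothetical.

```python
import math
import random

def smote(minority, k=5, n_new=10, seed=0):
    """Generate n_new synthetic points from a list of continuous minority-class points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Step 1: randomly select a minority point.
        base = rng.choice(minority)
        # Step 2: randomly select one of its k nearest minority neighbors.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # Step 3: draw lambda uniformly from [0, 1).
        lam = rng.random()
        # Step 4: place the new point a fraction lambda of the way toward the neighbor.
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Hypothetical minority-class points with two continuous features.
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]]
new_points = smote(minority, k=2, n_new=5)
```

Because each synthetic point is a convex combination of two existing minority points, it always lies on the segment between them, which is why the realism of interpolated records should still be checked.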
Figure 1. At-fault driver action at the time of crash for fatal crashes.
Table 1. Descriptive Analysis for Fatal Crashes

| Variable | Category | % |
|---|---|---|
| Driver gender | Male | 78.7 |
| | Female | 21.3 |
| Driver action | No contributing action | 11.2 |
| | Operated motor vehicle in careless or negligent manner | 34.8 |
| | Failed to yield right-of-way | 6.7 |
| | Improper backing | 2.2 |
| | Improper turn | 1.1 |
| | Ran red light | 1.1 |
| | Drove too fast for conditions | 3.4 |
| | Ran stop sign | 1.1 |
| | Exceeded posted speed | 2.2 |
| | Wrong side or wrong way | 5.6 |
| | Failed to keep in proper lane | 9.0 |
| | Ran off roadway | 4.5 |
| | Disregarded other traffic sign | 2.2 |
| | Over-correcting/over-steering | 1.1 |
| | Operated motor vehicle in erratic, reckless, or aggressive manner | 1.1 |
| | Other contributing action | 12.4 |
| Driver condition | Apparently normal | 50.6 |
| | Asleep or fatigued | 1.1 |
| | Seizure, epilepsy, blackout | 1.1 |
| | Physically impaired | 1.1 |
| | Under the influence of medication/drug/alcohol | 11.2 |
| | Other | 6.7 |
| | Unknown | 28.1 |
| Driver distracted | Not distracted | 50.6 |
| | Other inside the vehicle | 2.2 |
| | External distraction | 1.1 |
| | Inattentive | 4.5 |
| | Unknown | 41.6 |
| Roadway alignment | Straight | 88.8 |
| | Curve right | 4.5 |
| | Curve left | 6.7 |
| Roadway grade | Level | 85.4 |
| | Uphill | 5.6 |
| | Downhill | 7.9 |
| | Sag | 1.1 |
| Type of shoulder | Paved | 65.2 |
| | Unpaved | 25.8 |
| | Curb | 9.0 |
| Traffic-way | Two-way, not divided | 24.7 |
| | Two-way, not divided, with a continuous left turn lane | 1.1 |
| | Two-way, divided, unprotected (painted >4 ft) median | 3.4 |
| | Two-way, divided, positive median barrier | 64.0 |
| | One-way traffic-way | 5.6 |
| | Unknown | 1.1 |
| Light conditions | Daylight | 49.4 |
| | Dusk | 2.2 |
| | Dark-lighted | 20.2 |
| | Dark-not lighted | 28.1 |
| Weather conditions | Clear | 64.0 |
| | Cloudy | 32.6 |
| | Rain | 2.2 |
| | Fog, smog, smoke | 1.1 |
| Road system identifier | Interstate | 42.7 |
| | U.S. | 12.4 |
| | State | 25.8 |
| | County | 4.5 |
| | Local | 7.9 |
| | Turnpike/toll | 6.7 |
| Within city limits | No | 73.0 |
| | Yes | 27.0 |
| Work zone type | Lane closure | 24.7 |
| | Lane shift/crossover | 2.2 |
| | Work on shoulder or median | 58.4 |
| | Intermittent or moving work | 7.9 |
| | Other | 6.7 |
| Restraint system | Not applicable (non-motorist) | 1.2 |
| | None used - motor vehicle occupant | 30.2 |
| | Shoulder and lap belt used | 66.3 |
| | Shoulder belt only used | 1.2 |
| | Other | 1.2 |
| Airbag deployed | Not applicable | 39.3 |
| | Not deployed | 20.2 |
| | Deployed - front | 29.2 |
| | Deployed - side | 1.1 |
| | Deployed - combination | 5.6 |
| | Unknown | 4.5 |
| Crash type | Head on | 5.6 |
| | Left entering | 1.1 |
| | Off road | 5.6 |
| | Opposing sideswipe | 2.2 |
| | Other | 5.6 |
| | Pedestrian | 14.6 |
| | Rear end | 31.5 |
| | Right angle | 9.0 |
| | Rollover | 2.2 |
| | Same direction sideswipe | 5.6 |
| | Unknown | 3.4 |
| | Backed into | 2.2 |
| | Parked vehicle | 11.2 |
| Most harmful event | Overturn/rollover | 1.1 |
| | Fire/explosion | 2.2 |
| | Cargo/equipment loss or shift | 1.1 |
| | Pedestrian | 14.6 |
| | Motor vehicle in transport | 65.2 |
| | Parked motor vehicle | 7.9 |
| | Struck by falling, shifting cargo or anything set in motion by motor vehicle | 1.1 |
| | Bridge pier or support | 1.1 |
| | Cable barrier | 2.2 |
| | Tree (standing) | 1.1 |
| | Utility pole/light support | 1.1 |
| | Ran off roadway, right | 1.1 |
Figure 2 provides an illustration of the SMOTE methodology.
SMOTE’s main advantage over traditional naïve methods is that it creates synthetic observations instead of reusing existing observations, and therefore the classifier is less likely to overfit. However, one should always make sure that the synthetic observations created by SMOTE are realistic and make sense.
The SMOTE algorithm can be further generalized to deal with nominal (categorical) attributes. The new algorithm, known as SMOTE-NC, accounts for differences in nominal features by means of the median of the standard deviations of all continuous features in the minority class. This median is included in the Euclidean distance calculation, as a penalty for each nominal feature that differs, when searching for the k nearest neighbors. Finally, the synthetic sample is populated by generating the continuous features using the same algorithm as SMOTE, and the nominal features are defined by the majority vote of the k nearest neighbors (85,86).
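The distance and voting rules of SMOTE-NC can be sketched as below. This is an illustrative reading of the description above, not the library code used in the study: the helper names, the toy records, and the use of the sample standard deviation are assumptions.

```python
import math
import statistics
from collections import Counter

def smote_nc_distance(a, b, cont_idx, nom_idx, med_std):
    """Euclidean distance in which each differing nominal feature adds med_std**2."""
    sq = sum((a[i] - b[i]) ** 2 for i in cont_idx)
    sq += sum(med_std ** 2 for i in nom_idx if a[i] != b[i])
    return math.sqrt(sq)

def majority_nominal(neighbors, nom_idx):
    """Most frequent value of each nominal feature among the neighbors."""
    return {i: Counter(n[i] for n in neighbors).most_common(1)[0][0] for i in nom_idx}

# Hypothetical minority-class records: (driver age, posted speed, light condition).
minority = [
    (25.0, 55.0, "dark"),
    (40.0, 65.0, "dark"),
    (60.0, 45.0, "daylight"),
    (35.0, 70.0, "dark"),
]
cont_idx, nom_idx = [0, 1], [2]

# Median of the (sample) standard deviations of the continuous features.
med_std = statistics.median(
    [statistics.stdev([r[i] for r in minority]) for i in cont_idx]
)

# Two nearest neighbors of the first record under the penalized distance.
base = minority[0]
neighbors = sorted(
    [r for r in minority if r is not base],
    key=lambda r: smote_nc_distance(base, r, cont_idx, nom_idx, med_std),
)[:2]
nominal_values = majority_nominal(neighbors, nom_idx)
```

In this toy case the record with a different light condition is pushed further away by the penalty term, and the synthetic record would inherit the nominal value shared by the two nearest neighbors.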
Model Structure
Decision trees are among the most popular practical methods in the realm of machine learning and predictive modeling. They are a non-parametric technique used in classifying discretely labeled datasets (87). In particular, and compared with other machine learning techniques, decision trees have some important advantages. From the audience’s point of view, this approach is intuitive and easy to explain. From the analyst’s perspective, it allows for implicit variable screening and feature selection, requires relatively little effort on data preparation, and is not affected by prevalent non-linear relationships across the parameters (88–90). Stepping into the details, decision trees classify instances by sorting them down a tree starting from the root node, traversing through several internal nodes (also known as decision nodes), and finally reaching a leaf (terminal) node, where the instance is labeled into one of the existing class labels (as illustrated in Figure 3).
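The root-to-leaf traversal described above can be illustrated with a small sketch. The tree below is entirely hypothetical: its features, splits, and labels are invented for illustration and are not the fitted model from the study.

```python
# A hypothetical two-level tree stored as nested dictionaries.
tree = {
    "feature": "pedestrian_involved",      # root node
    "branches": {
        True: {
            "feature": "light_condition",  # internal (decision) node
            "branches": {
                "dark-not lighted": {"label": "fatal"},  # leaf nodes
                "daylight": {"label": "injury"},
            },
        },
        False: {"label": "PDO"},
    },
}

def classify(node, record):
    """Walk from the root down to a leaf and return the leaf's class label."""
    while "label" not in node:
        node = node["branches"][record[node["feature"]]]
    return node["label"]

crash = {"pedestrian_involved": True, "light_condition": "dark-not lighted"}
predicted_label = classify(tree, crash)
print(predicted_label)
```

Each path from the root to a leaf reads as a decision rule, which is what makes a single tree easy to explain.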
One of the popular tree development algorithms that has been widely used in recent years is Iterative Dichotomiser 3 (ID3) (91). The ID3 algorithm develops decision trees by building them from the top down, splitting the sample at each node using a certain pre-selected feature and a certain threshold, and continuing until a termination criterion has been reached. With the above in mind, it is evident that the learning algorithm should cover two main issues. First, what are the most effective performance measures that could be applied during attribute evaluation? And second, when and how should the splitting procedure stop?
Figure 2. Schematic view of synthetic minority over-sampling technique (SMOTE) resampling.

With respect to the first question, decision trees generally rely on the ‘‘impurity’’ concept when it comes to splitting the dataset. Impurity can be defined as an index of non-homogeneity, that is, the presence of different target classes in the sample. The goodness-of-split could be computed based on improvements in impurity. Three major indices have been introduced and applied in the literature, namely the entropy index, the information gain index, and the Gini index. The entropy index is a way to measure impurity; the ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogenous, then the entropy is zero, and if the sample is equally divided between two classes, then the entropy is one. Information gain calculates the reduction in the entropy and measures how well a given feature separates or classifies the target classes. The feature with the highest information gain is selected as the best one. The Gini index shows the probability of incorrect classification for a randomly chosen record from the specific node in the data subset. The Gini score gives an idea of how good a split is by how mixed the classes are in the two groups created by the split. A perfect separation results in a Gini score of zero, whereas the worst-case split, which results in 50/50 classes in each group, gives a Gini score of 0.5 for a two-class problem.
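The three impurity measures can be sketched as below, using the standard textbook definitions (base-2 entropy and the Gini impurity). The function names and toy label lists are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of the class labels in a node."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Chance of mislabeling a random record when labels follow the node's class shares."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

pure = ["fatal"] * 4                      # completely homogeneous node
mixed = ["fatal"] * 2 + ["PDO"] * 2       # node equally divided between two classes
perfect_split = [["fatal", "fatal"], ["PDO", "PDO"]]

print(entropy(pure), entropy(mixed))              # 0 and 1 for the two extremes
print(gini(pure), gini(mixed))                    # 0 and 0.5 for the two extremes
print(information_gain(mixed, perfect_split))     # a perfect split removes all entropy
```

For a two-class node, entropy ranges from 0 to 1 and Gini from 0 to 0.5, so the two indices rank candidate splits similarly while differing in scale.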
In response to the second question, the decision tree calls for a stopping condition that terminates the tree growing process. This is especially important given the risk of overfitting. While it might sound ideal that the tree grows until all records in each node belong to the same class (impurity = 0), such a tree tends to capture all the unwanted noise associated with the training data, therefore leading to poor performance on the test datasets. To avoid this, restrictions are usually applied to the tree growth process. This includes a variety of applicable techniques, ranging from simple limitations on the number of tree levels to more complicated pruning criteria (92–94). To address low model performance caused by early termination or pruning, ensemble techniques have been widely used to derive a final best model.
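A simple pre-pruning check of the kind described above might look like the following sketch. The function name `should_stop` and the default thresholds are hypothetical and are not the settings used in the study.

```python
def should_stop(labels, depth, max_depth=5, min_samples_split=10):
    """Return True when a node should become a leaf instead of being split further."""
    if len(set(labels)) <= 1:            # node is already pure (impurity = 0)
        return True
    if depth >= max_depth:               # limit on the number of tree levels
        return True
    if len(labels) < min_samples_split:  # too few records to justify a split
        return True
    return False

print(should_stop(["fatal", "PDO"] * 10, depth=2))  # mixed, shallow, large enough
print(should_stop(["fatal", "PDO"] * 10, depth=5))  # depth limit reached
```

Tighter limits give smaller, more general trees at the cost of some fit to the training data, which is the trade-off that ensemble methods are meant to soften.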
Feature Selection and Hyper Parameter Tuning
One important aspect of tree-based models is feature selection. Given the randomness involved in selecting features at each node, it is essential to run the model with the best subset of variables. To do this, random forests (RF) based on bagging (bootstrap aggregating) algorithms are used to select the most important features as inputs to the decision tree. RFs combine results from multiple decision trees using aggregate measures such as average or majority voting (95,96). RFs randomly develop hundreds or thousands of decision trees and then predict the ultimate outcome by aggregating their results. The ensemble bagging algorithm generally gives better results because it reduces the overall variance of the model and helps to avoid overfitting by giving more generalized results. Successful application of RFs has been widely documented in the literature (97,98).
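The bagging idea (bootstrap samples plus majority voting) can be sketched as follows. To keep the example short, each "tree" is replaced by a hypothetical stand-in that always predicts the most common label of its bootstrap sample; a real random forest would fit a decision tree on each sample and also randomize the features considered at each split.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bootstrap' in bagging)."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows):
    """Hypothetical stand-in for a tree: predicts the sample's most common label."""
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda features: majority

def bagged_predict(models, features):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(m(features) for m in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
# Hypothetical training rows: (features, label).
train = [([i], "PDO") for i in range(8)] + [([i], "fatal") for i in range(8, 10)]
models = [fit_stump(bootstrap_sample(train, rng)) for _ in range(25)]
prediction = bagged_predict(models, [3])
```

Averaging many models fitted on resampled data is what reduces variance relative to a single deep tree.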
Optimizing hyper parameters for machine learning models is another key step in making accurate predictions. Hyper parameters define characteristics of the model that can affect model accuracy and computational efficiency. They are typically set before fitting the model to the data. The cross-validation technique is utilized here along with the random and grid search approaches. Grid search is a widely used approach in fine-tuning of machine learning models (99–101).
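Grid search with k-fold cross-validation can be sketched generically as below. The parameter names (`max_depth`, `min_samples_leaf`), the grid values, and the scoring function are hypothetical placeholders; in practice the scoring function would fit a model on the training fold and score it on the validation fold.

```python
from itertools import product

def k_fold_indices(n, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds over n records."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, valid
        start += size

def grid_search(param_grid, score_fn, n, k=10):
    """Return the parameter combination with the best mean cross-validated score."""
    best_params, best_score = None, float("-inf")
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [score_fn(params, tr, va) for tr, va in k_fold_indices(n, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Hypothetical scoring function standing in for "fit on train, score on validation".
def toy_score(params, train_idx, valid_idx):
    return -abs(params["max_depth"] - 6) - 0.1 * abs(params["min_samples_leaf"] - 5)

grid = {"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 5, 10]}
best_params, best_score = grid_search(grid, toy_score, n=100, k=10)
folds = list(k_fold_indices(10, 3))
```

Grid search evaluates every combination, so its cost grows multiplicatively with the number of values per parameter; random search samples a subset of combinations instead.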
Model Results
The imbalanced sample was randomly split into training
and test sets using a 70/30 ratio (70% for training, 30%
for testing). The training set was then manipulated using
random and SMOTE-NC over-sampling techniques, giv-
ing a total of three different training datasets: the original
imbalanced data, the random resampled data, and the
SMOTE-NC resampled data. The target variable (crash
severity) consisted of three different classes, namely:
PDO, injury, and fatal. A decision tree was initially
developed for the raw imbalanced data (DT_1). The
model was further optimized in three different directions:
(i) feature selection, (ii) data balancing, and (iii) hyper
parameter tuning. For the purpose of feature selection,
RF models were used (Figure 4). For balancing objectives, the models were separately run on the random and SMOTE-NC resampled datasets. Several tuning algorithms, including grid search and random search, were
applied to determine the best-tuned estimators. At each
tuning step, the model accuracy was evaluated based on
a 10-fold cross-validation to minimize model variance on
different test sets and avoid potential overfitting. Three
RF models were developed separately for the raw (RF_1)
and the two resampled datasets (RF_2 and RF_3).
Figure 3. Fundamental elements of a decision tree.

Model performance results are illustrated in Table 2. Precision and recall measures on fatal crashes were considered as the main metrics to evaluate model performance. The recall measure represents the ratio of true positives over the sum of true positives and false negatives. The precision measure refers to the ratio of true positives over the sum of true positives and false positives. For the specific problem discussed here, the recall measure (reducing type II error) is arguably of higher priority. In other words, a low precision measure (i.e., overestimating fatal crashes or type I error) will not be as critical as a low recall measure (i.e., missing a fatal crash or type II error). It is also possible to check both measures simultaneously by assessing the F1 score (defined as the harmonic mean of the precision and recall values).
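The three metrics can be computed for the fatal class as in the sketch below. The label lists are hypothetical and are not results from the study; the function name `fatal_metrics` is likewise an illustrative choice.

```python
def fatal_metrics(y_true, y_pred, positive="fatal"):
    """Precision, recall, and F1 for one class treated as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels: 4 actual fatal crashes, 3 predicted fatal crashes.
y_true = ["fatal", "fatal", "fatal", "fatal", "PDO", "PDO", "injury", "PDO"]
y_pred = ["fatal", "fatal", "PDO", "injury", "fatal", "PDO", "injury", "PDO"]
precision, recall, f1 = fatal_metrics(y_true, y_pred)
print(f"precision = {precision:.3f}, recall = {recall:.3f}, F1 = {f1:.3f}")
```

Here two of the four fatal crashes are missed (false negatives, type II error) and one non-fatal crash is flagged as fatal (false positive, type I error), giving a precision of 2/3, a recall of 1/2, and an F1 score of 4/7.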
For the purpose of comparison, three ordered logistic
regression models were also developed to represent con-
ventional approaches. As shown in Table 2, the logistic
regression on the raw data (OL_1) showed a relatively
high accuracy on test data (78.5%); however, further
investigation reveals that this is more of an accuracy par-
adox since the recall values on the minority group (fatal-
ity) were quite low (0.1), indicating that the model
captured only 10% of the fatal crashes. With random
over-sampling, the logistic regression model (OL_2)
improved: recall on fatal crashes rose to 57%. However,
this gain was accompanied by a sharp drop in precision
(from 0.67 to 0.12), indicating that the
model had a very high type I error (false positives, that is,
overestimating fatal crashes). The logistic regression on
SMOTE-NC resampled
data (OL_3) slightly improved the precision on fatal
crashes (0.15) and therefore provided the best F1 score (0.23)
among the conventional models; however, this came at the expense of a
further reduction in the overall accuracy of the model (61%).
Details of optimized regression models on resampled
datasets are presented in the Appendix (Table A1).
Stepping further into the tree-based models, the deci-
sion tree built on the raw data (DT_1) similarly suffered
from the accuracy paradox, though it provided a good
overall accuracy (78%) on the test data. In particular,
the results were highly biased to the majority class (PDO
crashes) with low recall values on both injury (0.19) and
fatality (0.05) groups, and this was more tangible on the
test data. This indicates that the model built on the
imbalanced data was not able to provide satisfactory
predictions on fatality instances. A similar drawback was
observed in the RF model on imbalanced data (RF_1), which
did not capture any fatal crash in the test
data (precision and recall of 0). This indicates that in the
presence of imbalanced
data, even a sophisticated ensemble algorithm such as
RF might lose its efficiency.
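The accuracy paradox mentioned above can be reproduced with a small, purely hypothetical example: a classifier that labels every crash as PDO reaches a high overall accuracy on an imbalanced sample while capturing no fatal crash. The class counts below are illustrative only and are not the study's data.

```python
def accuracy(y_true, y_pred):
    """Share of observations whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_recall(y_true, y_pred, cls):
    """Share of observations of class `cls` that were predicted as `cls`."""
    preds_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
    if not preds_for_cls:
        return 0.0
    return sum(p == cls for p in preds_for_cls) / len(preds_for_cls)


# Hypothetical imbalanced test set: 780 PDO, 200 injury, 20 fatal.
y_true = ["PDO"] * 780 + ["injury"] * 200 + ["fatal"] * 20
# A degenerate model that always predicts the majority class.
y_pred = ["PDO"] * len(y_true)

acc = accuracy(y_true, y_pred)
fatal_recall = class_recall(y_true, y_pred, "fatal")
print(acc, fatal_recall)
```

This is why class-level recall on fatal crashes, rather than overall accuracy, is the more informative measure for this problem.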
Both RF models on over-sampled data
(RF_2 and RF_3) performed well on the training data
and improved on RF_1 for fatal crashes in the test data. Random
over-sampling (RF_2) outperformed the SMOTE-NC
method (RF_3) on the test data, with a much higher recall
(0.57 versus 0.24) at a comparable precision (0.40 versus 0.42)
and hence a higher F1 score (0.47 versus 0.31). Both models
provided predictions for every class, including the minority class
(fatal crashes). Based on the model performance, a final
model was developed using the decision tree method,
using the random over-sampled data and the top 20 fea-
tures (Figure 4) selected based on the results from RF_3.
The decision tree, unlike black box RFs, allows us to
pursue the decision rules and specific patterns that lead
to fatal crashes. The model performance is shown in
Table 2 as DT-Final. Compared with the initial models
and the conventional models, DT-Final provided better
performance in both recall measure and F1 score (har-
monic mean of precision and recall) on fatal crashes (F1
score = 0.43). In particular, the recall measure (i.e., cor-
rectly predicted fatal crashes as a percentage of total
fatal crashes) increased from 5% to 62%.
The next section focuses on DT-Final to analyze fatal
crash patterns.
Figure 4. Top 20 features based on mean Gini index.
Table 2. Model Performance Comparison
Conventional models:
OL_1: Ordered logistic regression with imbalanced data
OL_2: Ordered logistic regression with random over-sampling
OL_3: Ordered logistic regression with SMOTE-NC over-sampling
Machine learning models:
DT_1: Decision tree with imbalanced data
RF_1: Random forest with imbalanced data
RF_2: Random forest with random over-sampling
RF_3: Random forest with SMOTE-NC over-sampling
DT-Final: Decision tree with random over-sampling (only top 20 features)
Training data
| Model | Overall accuracy (%) | F1 score (on minority group) | PDO precision | PDO recall | Injury precision | Injury recall | Fatality precision | Fatality recall |
|---|---|---|---|---|---|---|---|---|
| OL_1 | 79.50 | 0.26 | 0.82 | 0.96 | 0.62 | 0.28 | 0.69 | 0.16 |
| OL_2 | 62.00 | 0.74 | 0.63 | 0.69 | 0.45 | 0.44 | 0.76 | 0.72 |
| OL_3 | 72.00 | 0.89 | 0.67 | 0.64 | 0.59 | 0.61 | 0.89 | 0.91 |
| DT_1 | 80.00 | 0.39 | 0.80 | 0.98 | 0.70 | 0.22 | 0.78 | 0.26 |
| RF_1 | 82.00 | 0.45 | 0.81 | 1.00 | 0.94 | 0.28 | 1.00 | 0.29 |
| RF_2 | 84.00 | 0.98 | 0.74 | 0.84 | 0.82 | 0.69 | 0.97 | 1.00 |
| RF_3 | 88.00 | 0.97 | 0.84 | 0.87 | 0.84 | 0.82 | 0.97 | 0.97 |
| DT-Final | 76.00 | 0.91 | 0.64 | 0.79 | 0.72 | 0.62 | 0.95 | 0.87 |
Test data
| Model | Overall accuracy (%) | F1 score (on minority group) | PDO precision | PDO recall | Injury precision | Injury recall | Fatality precision | Fatality recall |
|---|---|---|---|---|---|---|---|---|
| OL_1 | 78.50 | 0.17 | 0.80 | 0.96 | 0.63 | 0.27 | 0.67 | 0.10 |
| OL_2 | 63.00 | 0.20 | 0.83 | 0.69 | 0.33 | 0.44 | 0.12 | 0.57 |
| OL_3 | 61.00 | 0.23 | 0.86 | 0.64 | 0.32 | 0.51 | 0.15 | 0.57 |
| DT_1 | 78.00 | 0.08 | 0.79 | 0.97 | 0.61 | 0.19 | 0.25 | 0.05 |
| RF_1 | 79.00 | 0.00 | 0.79 | 0.98 | 0.71 | 0.18 | 0.00 | 0.00 |
| RF_2 | 74.00 | 0.47 | 0.85 | 0.81 | 0.45 | 0.51 | 0.40 | 0.57 |
| RF_3 | 75.00 | 0.31 | 0.83 | 0.85 | 0.47 | 0.44 | 0.42 | 0.24 |
| DT-Final | 72.00 | 0.43 | 0.85 | 0.79 | 0.43 | 0.51 | 0.33 | 0.62 |
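Figure 4 ranks features by mean Gini index. Assuming this refers to the usual impurity-based importance (the decrease in Gini impurity produced by splits on a feature, averaged over the trees of the forest), the quantity for a single split can be sketched as follows. The toy labels and the idea of splitting on pedestrian involvement are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity
    of its two child nodes."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


# Hypothetical node with 4 fatal and 4 PDO crashes, split perfectly
# by a binary feature (for example, pedestrian involvement).
parent = ["fatal"] * 4 + ["PDO"] * 4
left = ["fatal"] * 4
right = ["PDO"] * 4
decrease = gini_decrease(parent, left, right)
print(gini(parent), decrease)
```

A feature that produces large decreases across many trees would rank high in a chart such as Figure 4.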
Contributing Factors
For the remainder of this paper, the term “primary contributors”
is used to refer to these top 20 variables. It
should be noted that no roadway conditions or environ-
mental factors such as traffic-way type, median type,
road type, weather conditions, or vehicle maneuver and
driver actions showed primary contributions.
Further investigation of the decision tree reveals that
the severity of work zone crashes was highly dependent
on lighting conditions and pedestrian involvement.
Therefore, four major types of work zone crashes could
be identified:
1. Pedestrian crashes under dark-not lighted conditions (118
   crashes in the resampled data, five crashes in the original
   sample). The results show that a large truck work zone crash
   involving pedestrians in not lighted conditions always resulted
   in a fatal outcome (100%).
2. Pedestrian crashes with some type of lighting (375 crashes in
   the resampled data, 41 crashes in the original sample). This
   was still a highly dangerous situation, although the presence
   of lighting slightly decreased the probability of a fatal
   outcome, to almost 88%. However, certain factors could still
   increase the crash severity, resulting in a fatal outcome
   (100%). These factors include:
   • When the truck driver was distracted, and the vehicle lacked
     airbags (52 crashes in the resampled data, three crashes in
     the original data).
   • A young driver (younger than 24 years) where the vehicle
     lacked airbag equipment and the work zone was located outside
     city limits (39 cases in the over-sampled data, one case in
     the original data).
   • Older drivers (older than 60 years) where the vehicle lacked
     airbag equipment and the work zone was located within city
     limits (51 cases in the over-sampled data, one case in the
     original data).
   • When the front airbag was deployed (94 cases in the
     over-sampled data, two cases in the original data). This is
     probably an indicator of a very hard impact. Though we do not
     have individual level severity information, this might point
     to a pedestrian fatality.
   • When airbags of any type were deployed, and the truck driver
     was not using any restraint system (e.g., seatbelt) (45 cases
     in the resampled data, one case in the original data).
3. Non-pedestrian crashes under dark-not lighted conditions. Using
   a full restraint system (i.e., shoulder and lap belt) seemed
   essential to avoid severe outcomes. The combination of front
   airbag deployment with any other restraint system would
   indicate a fatal outcome (100%) (205 cases in the over-sampled
   data, six cases in the original data).
4. Non-pedestrian crashes under any type of lighting condition. It
   is evident that driver conditions had a significant impact on
   these types of crashes. Accordingly, an abnormal driver
   condition (fatigue, drug/alcohol influence, etc.) significantly
   increased the probability of a fatal outcome (65.5% versus
   15.8%). A rear-end crash seemed to be the safest of crash types
   in these conditions. Otherwise, any non-pedestrian crash type
   with a front airbag deployment would indicate a 90% fatal
   outcome (479 cases in the over-sampled data, 65 cases in the
   original data).
In addition to the above, one interesting inference of the
proposed model is the role of female drivers in the work zone
area. Accordingly, no fatal crashes were observed in instances of
multiple-vehicle crashes when a female driver (usually as a
not-at-fault driver) was present (153 cases in the over-sampled
data, 126 cases in the original sample). This might indicate that
female drivers drive in a more cautious manner in work zone areas,
for example, by reducing their speed to safe levels, which will
mutually affect other drivers’ speed when traversing through work
zones.
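The top-level split into the four crash types above can be read as a pair of nested decision rules. The sketch below encodes only that first level, under the assumption that pedestrian involvement and lighting are available as a boolean and a text label; the function name and the label strings are hypothetical and do not reproduce the fitted tree.

```python
def crash_pattern(pedestrian_involved, lighting):
    """Return the crash type (1 to 4) from the two top-level splits:
    pedestrian involvement first, then dark-not lighted conditions."""
    dark_unlit = lighting == "dark-not lighted"
    if pedestrian_involved:
        return 1 if dark_unlit else 2
    return 3 if dark_unlit else 4


# Hypothetical records: (pedestrian involved, lighting condition).
examples = [
    (True, "dark-not lighted"),
    (True, "daylight"),
    (False, "dark-not lighted"),
    (False, "dark-lighted"),
]
patterns = [crash_pattern(p, light) for p, light in examples]
print(patterns)
```

Deeper rules, such as those on airbag deployment or driver age, would be further nested conditions inside each branch, which is what makes a single tree easier to trace than an ensemble.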
To summarize, from a planning perspective, the pro-
posed model suggests that fatal crashes could be pre-
dicted as a combination of several critical parameters,
including pedestrian involvement, lighting condition,
vehicle safety equipment, and driver condition.
Pedestrian involvement was highly fatal in work zones,
particularly when combined with dark-not lighted condi-
tions. The impact of age was highly correlated with work
zone location. Accordingly, young drivers were more
likely to be involved in a fatal crash in rural areas, while
senior drivers were associated with fatal crashes occur-
ring inside city limits. Driver distraction was another
critical factor in work zone crashes. With regard to
safety equipment, a full restraint system (i.e., shoulder
and lap belt) was essential to reduce fatality/injury rates.
The combination of any other restraint system with air-
bag deployment was highly fatal. Unlike pedestrian
crashes, sideswipe and rear-end crashes seemed to be less
severe. Interestingly, multiple vehicle crashes with female
drivers decreased crash severity.
Marginal Effects
As discussed earlier in the paper, in this study a relatively
new technique is used, known as Shapley values (also
known as SHAP), to obtain marginal effects of the
machine learning model features (102–104). The concept
was initiated by Shapley (102) and is based on game the-
ory. In simple language, the contribution (Shapley value)
of a feature ($x_i$) can be calculated as the change imposed
on the classification probability when the feature is
added to a pre-defined set of features, $S$, in the model,
that is, $\Delta P_i = P(S \cup \{x_i\}) - P(S)$. Since there are different
permutations of feature subsets, the value needs to be
averaged across all possible permutations. Accordingly,
$$\varphi(x_i) = \sum_{S \subseteq N \setminus \{x_i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \left( P(S \cup \{x_i\}) - P(S) \right)$$
where
$\varphi(x_i)$ = average contribution (Shapley value) associated with $x_i$
$N$ = set of total features
$S$ = arbitrary subset of features in different permutations
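To make the averaging over subsets concrete, the following sketch evaluates the Shapley formula exactly for a toy problem with three binary features. The value function `fatal_probability` and its numbers are hypothetical stand-ins for a model's classification probability; in practice Shapley values for tree models are typically obtained with dedicated software rather than by full enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values by enumerating every subset S of the
    remaining features; `value` maps a frozenset of features to P(S)."""
    n = len(features)
    phi = {}
    for x in features:
        others = [f for f in features if f != x]
        total = 0.0
        for k in range(len(others) + 1):
            # Weight |S|! (|N| - |S| - 1)! / |N|! for subsets of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {x}) - value(s))
        phi[x] = total
    return phi


def fatal_probability(subset):
    """Hypothetical P(S): a base rate plus two main effects and one
    interaction between pedestrian involvement and unlit darkness."""
    p = 0.05
    if "pedestrian" in subset:
        p += 0.40
    if "dark_unlit" in subset:
        p += 0.10
    if {"pedestrian", "dark_unlit"} <= subset:
        p += 0.20
    return p


features = ["pedestrian", "dark_unlit", "rear_end"]
phi = shapley_values(features, fatal_probability)
print(phi)
```

In this toy case the interaction term is shared equally between the two interacting features, the irrelevant feature receives zero, and the values add up to the difference between the probability with all features and the base rate.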
Figure 5 depicts the magnitudes of Shapley values
over all single observations across the sample and uses
the values to show the distribution of the impacts each
feature has on the model output. The color represents
the feature value, with red representing higher values
(compared with the mean) and blue indicating values
lower than the mean. In the case of binary features, red
indicates 1 while blue indicates 0. The horizontal axis
indicates the marginal effect (contribution) of the feature
at each single observation.
Focusing on the red color for binary features, one can
infer that normal driver conditions, daylight,
female drivers, rear-end crashes, sideswipe crashes, and
curb shoulder types tend to decrease the
probability of a fatal outcome. On the other hand, lack
of restraint (or lack of sufficient restraint), presence of
pedestrians, different types of airbag deployment, and
driving on the wrong side or the wrong way increase the
likelihood of a fatal outcome. One can also check the
contribution values on the horizontal axis. For instance,
in certain cases, abnormal driver conditions can increase
the fatality probability by 25%, or lack of restraint sys-
tem can increase it by almost 30%. On the contrary, the
presence of females or sideswipe crashes can reduce the
fatality probability by 10 to 20%.
Figure 5. Distribution of Shapley (SHAP) values.
While Figure 5 shows the spread of Shapley values in
the whole sample data, Figure 6 depicts the average
Shapley values in the decision tree model and compares
them with the corresponding values from the logistic regres-
sion model (OL_3). Accordingly, pedestrian involve-
ment, lack of restraint system, front airbag deployment,
and not lighted conditions show a high average impact
on increasing fatality probability. On the other hand,
normal driver condition and airbags not deployed con-
tribute to lower severity crashes. Compared with mar-
ginal effects from OL_3, results are similar with regard
to the direction of impact in most cases. It should be
noted that two of the variables (distraction of the at-fault
driver, and driver action = wrong side or wrong way) did
not show significant impacts in the conventional model,
and therefore their marginal effects are statistically insig-
nificant. With regard to airbag deployment, two of the
factors (i.e., lack of airbag equipment and front airbag
deployment for at-fault vehicle) show signs that are inconsistent
with the Shapley values. However, readers should note
the endogenous effects of airbag presence (or deploy-
ment) on crash severity as stated in the literature (105).
Since the ordered logistic model does not capture this
endogenous effect, the positive impacts from Shapley val-
ues might be more reliable. Overall, marginal effects
from conventional models tend to be higher (inflated)
compared with Shapley values from the machine learning
model. The authors believe this stems from the different
approaches in calculating the two measures. Marginal
effects only consider the exact set of variables for each
observation and then compute the difference in
probabilities for each outcome class, while Shapley val-
ues consider the average over a full permutation of all
existing features for each observation.
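The contrast drawn above can be illustrated from the marginal-effect side. The sketch below computes an average marginal effect for a binary variable by switching it on and off for each observation while keeping that observation's other variables fixed. It assumes a simplified binary logit in place of the ordered logistic model, and the coefficients, intercept, and records are hypothetical.

```python
from math import exp


def logistic(z):
    return 1.0 / (1.0 + exp(-z))


def predict(obs, coef, intercept):
    """Probability of a fatal outcome under a binary logit (assumed)."""
    return logistic(intercept + sum(coef[k] * obs[k] for k in coef))


def average_marginal_effect(data, coef, intercept, feature):
    """Mean change in predicted probability when `feature` goes from 0
    to 1, holding each observation's other variables at observed values."""
    diffs = []
    for obs in data:
        on = dict(obs, **{feature: 1})
        off = dict(obs, **{feature: 0})
        diffs.append(predict(on, coef, intercept) - predict(off, coef, intercept))
    return sum(diffs) / len(diffs)


# Hypothetical coefficients and observations.
coef = {"pedestrian": 2.0, "daylight": -1.0, "rear_end": 0.0}
intercept = -2.0
data = [
    {"pedestrian": 1, "daylight": 0, "rear_end": 0},
    {"pedestrian": 0, "daylight": 1, "rear_end": 1},
    {"pedestrian": 0, "daylight": 0, "rear_end": 1},
]
ame_ped = average_marginal_effect(data, coef, intercept, "pedestrian")
ame_day = average_marginal_effect(data, coef, intercept, "daylight")
ame_rear = average_marginal_effect(data, coef, intercept, "rear_end")
print(ame_ped, ame_day, ame_rear)
```

Unlike the Shapley computation, no averaging over feature subsets takes place here: each difference is evaluated only at the observed combination of the remaining variables.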
Conclusions
This paper is an effort to explore large truck-involved
work zone crash patterns that result in fatalities. The
motivations for this study are two-fold. First, the escalat-
ing need for regular maintenance and reconstruction of
roadways and the significant investments in roadway
infrastructure have resulted in consistent growth in the
number of work zones, which consequently calls for effi-
cient safety plans that ensure the safety of both workers
and drivers in these construction areas. Second, and
based on the current state of the literature, it seems that
machine learning algorithms such as data resampling
techniques and non-parametric models (e.g., decision
trees and RFs) demonstrate better performance in pre-
dicting more severe classes of crash, such as fatal crashes,
compared with conventional parametric methods.
Figure 6. Comparison of conventional marginal effects and machine learning Shapley (SHAP) values.
Focusing on the large truck-involved work zone
crashes in Florida, both machine learning techniques
and conventional ordered logistic models were applied in
this study. To assess the impact of the data imbalance
issue, the original crash data was resampled using ran-
dom over-sampling and SMOTE-NC techniques. RF
models were developed and tuned. Important variables
were extracted from the RF model based on the average
Gini index. Primary contributors included pedestrian
involvement, lighting conditions, safety equipment,
driver condition, driver age, and work zone locations.
Interestingly, roadway and environmental conditions
such as traffic-way, median type, road system identifier,
weather conditions, vehicle maneuver, and driver actions
did not show any significant contribution to the model.
Consequently, conventional logistic regression models
were built using the identified primary contributors.
Results confirmed that in both machine learning and
logistic regression models, data resampling significantly
improved the model’s performance on the minority class
(fatal crashes). In addition, results showed that the opti-
mized decision tree model outperformed the ordered
logistic regression in terms of fatal crash prediction mea-
sures (F1 score) on both training and test data.
On fatality patterns, results showed that a combina-
tion of different factors could significantly increase the
probability of a fatal outcome. In pedestrian crashes, fac-
tors such as dark-not lighted conditions, distracted truck
driver, airbag deployment, and driver’s age (young driv-
ers outside city limits, senior drivers inside city limits)
were highly likely to be fatal. In non-pedestrian crashes,
the combination of front airbag deployment with any
restraint system other than shoulder and lap belt was quite
likely to be fatal. Also, abnormal driver conditions
greatly increased the risk of a fatal outcome. Results also
showed that the presence of a female driver (in multiple-
vehicle crashes) decreased crash severity, probably
because female drivers tend to drive more carefully than
male drivers.
Finally, this study looked into the Shapley values and
compared them with the marginal effects from the con-
ventional ordered logistic model. Accordingly, pedestrian
involvement, lack of restraint system, front airbag
deployment, and not lighted conditions show a higher
than average impact on increasing fatality probability.
On the other hand, normal driver condition and airbags
not deployed contribute to lower severity crashes.
Results from the machine learning model and the con-
ventional ordered logistic model were similar in most
cases, with values from the machine learning model
being smaller, probably because of different computa-
tional approaches in the two models.
This study and similar pattern recognition studies
using machine learning techniques hold the potential to
provide a better approach to understanding crash contri-
buting factors and developing effective and specific coun-
termeasures, especially when less frequent but more
severe crashes are to be explored. Further research could
focus on the role of temporal variations using time-series
models or expand the analysis by analyzing worker and
driver severity measures separately.
Author Contributions
The authors confirm contribution to the paper as follows: study
conception and design: X. Jin, H. Asgari; data collection: X.
Jin, H. Asgari, G. Azimi, A. Rahimi; analysis and interpreta-
tion of results: R. Gupta and H. Asgari; draft manuscript pre-
paration: R. Gupta, H. Asgari and X. Jin. All authors reviewed
the results and approved the final version of the manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with
respect to the research, authorship, and/or publication of this
article.
Funding
The author(s) received no financial support for the research,
authorship, and/or publication of this article.
ORCID iDs
Rajesh Gupta https://orcid.org/0000-0001-7833-0235
Ghazaleh Azimi https://orcid.org/0000-0001-5646-6908
Xia Jin https://orcid.org/0000-0002-8660-3528
References
1. Work/Construction Zones. https://www.nhtsa.gov/sites/
nhtsa.dot.gov/files/workzones.pdf. Accessed July 15, 2019.
2. FHWA. Safe Driving, Safer Work Zones: National Work
Zone Awareness Week 2011. FOCUS, March 4–5, 2011,
ISSN 1060-6637. https://www.fhwa.dot.gov/publications/
focus/11mar/11mar.pdf.
3. National Work Zone Safety. Information Clearinghouse.
https://www.workzonesafety.org/crash-information/work-
zone-fatal-crashes-fatalities/#national. Accessed July 15, 2019.
4. The Economics Daily, Fatal injuries at road work zones.
Bureau of Labor Statistics, U.S. Department of Labor.
https://www.bls.gov/opub/ted/2017/fatal-injuries-at-road-
work-zones.htm Accessed July 20, 2019.
5. Akepati, S. R., and S. Dissanayake. Characteristics and
Contributory Factors of Work Zone Crashes. Presented at
93rd Annual Meeting of the Transportation Research
Board, Washington, D.C., 2011.
6. Srinivasan, R., G. Ullman, M. Finley, and F. Council. Use
of Empirical Bayesian Methods to Estimate Crash Modifi-
cation Factors for Daytime Versus Nighttime Work Zones.
Transportation Research Record: Journal of the Transporta-
tion Research Board, 2011. 2241: 29–38.
7. Yang, H., K. Ozbay, O. Ozturk, and M. Yildirimoglu.
Modeling Work Zone Crash Frequency by Quantifying
Measurement Errors in Work Zone Length. Accident Anal-
ysis & Prevention, Vol. 55, 2013, pp. 192–201.
8. Ozturk, O., K. Ozbay, and H. Yang. Estimating the Impact
of Work Zones on Highway Safety. Transportation
Research Board 93rd Annual Meeting, Washington, D.C.,
2014.
9. Weng, J., and Q. Meng. Analysis of Driver Casualty Risk
for Different Work Zone Types. Accident Analysis & Pre-
vention, Vol. 43, No. 5, 2011, pp. 1811–1817.
10. Meng, Q., and J. Weng. Evaluation of Rear-End Crash
Risk at Work Zone Using Work Zone Traffic Data. Acci-
dent Analysis & Prevention, Vol. 43, No. 4, 2011,
pp. 1291–1300.
11. Xie, Y., K. Zhao, and N. Huynh. Analysis of Driver Injury
Severity in Rural Single-Vehicle Crashes. Accident Analysis
& Prevention, Vol. 47, 2012, pp. 36–44.
12. Celik, A. K., and E. Oktay. A Multinomial Logit Analysis
of Risk Factors Influencing Road Traffic Injury Severities
in the Erzurum and Kars Provinces of Turkey. Accident
Analysis & Prevention, Vol. 72, 2014, pp. 66–77.
13. Chen, Z., and W. D. Fan. A Multinomial Logit Model of
Pedestrian-Vehicle Crash Severity in North Carolina. Inter-
national Journal of Transportation Science and Technology,
Vol. 8, No. 1, 2019, pp. 43–52.
14. Wahab, L., and H. Jiang. A Multinomial Logit Analysis of
Factors Associated with Severity of Motorcycle Crashes in
Ghana. Traffic Injury Prevention, Vol. 20, No. 5, 2019,
pp. 521–527.
15. Hubbert, K., and M. Doustmohammadi. Multinomial
Logit Analysis of Injury Severity in Crashes Involving
Emotional Drivers. International Journal of Psychology and
Behavioral Sciences, Vol. 9, No. 4, 2019, pp. 63–70.
16. Zhu, X., and S. Srinivasan. Modeling Occupant-Level
Injury Severity: An Application to Large-Truck Crashes.
Accident Analysis & Prevention, Vol. 43, No. 4, 2011,
pp. 1427–1437.
17. Lemp, J. D., K. M. Kockelman, and A. Unnikrishnan.
Analysis of Large Truck Crash Severity Using Heteroske-
dastic Ordered Probit Models. Accident Analysis & Preven-
tion, Vol. 43, No. 1, 2011, pp. 370–380.
18. Linchao, L., and T. Fratrović. Analysis of Factors Influen-
cing the Vehicle Damage Level in Fatal Truck-Related
Accidents and Differences in Rural and Urban Areas. Pro-
met-Traffic & Transportation, Vol. 28, No. 4, 2016,
pp. 331–340.
19. Rezapour, M., M. Moomen, and K. Ksaibati. Ordered
Logistic Models of Influencing Factors on Crash Injury
Severity of Single and Multiple-Vehicle Downgrade
Crashes: A Case Study in Wyoming. Journal of Safety
Research, Vol. 68, 2019, pp. 107–118.
20. Asare, I. O., and A. C. Mensah. Crash Severity Modelling
Using Ordinal Logistic Regression Approach. International
Journal of Injury Control and Safety Promotion, Vol. 27,
No. 4, 2020, pp. 412–419.
21. Yuan, Q., X. Xu, J. Zhao, and Q. Zeng. Investigation of
Injury Severity in Urban Expressway Crashes: A Case
Study from Beijing. PLoS One, Vol. 15, No. 1, 2020, p.
e0227869.
22. Chang, L. Y., and F. Mannering. Analysis of Injury Sever-
ity and Vehicle Occupancy in Truck- and Non-Truck-
Involved Accidents. Accident Analysis & Prevention, Vol.
31, No. 5, 1999, pp. 579–592.
23. Abdel-Aty, M., and H. Abdelwahab. Modeling Rear-End
Collisions Including the Role of Driver’s Visibility and
Light Truck Vehicles Using a Nested Logit Structure. Acci-
dent Analysis & Prevention, Vol. 36, No. 3, 2004,
pp. 447–456.
24. Patil, S., S. R. Geedipally, and D. Lord. Analysis of Crash
Severities Using Nested Logit Model—Accounting for the
Underreporting of Crashes. Accident Analysis & Prevention,
Vol. 45, 2012, pp. 646–653.
25. Razi-Ardakani, H., A. Mahmoudzadeh, and M. Kerman-
shah. A Nested Logit Analysis of the Influence of Distrac-
tion on Types of Vehicle Crashes. European Transport
Research Review, Vol. 10, No. 2, 2018, pp. 1–4.
26. Milton, J. C., V. N. Shankar, and F. L. Mannering. High-
way Accident Severities and the Mixed Logit Model: An
Exploratory Empirical Analysis. Accident Analysis & Pre-
vention, Vol. 40, No. 1, 2008, pp. 260–266.
27. Ye, F., and D. Lord. Investigation of Effects of Underre-
porting Crash Data on Three Commonly Used Traffic
Crash Severity Models: Multinomial Logit, Ordered Pro-
bit, and Mixed Logit. Transportation Research Record:
Journal of the Transportation Research Board, 2011. 2241:
51–58.
28. Chen, F., and S. Chen. Injury Severities of Truck Drivers
in Single- and Multi-Vehicle Accidents on Rural High-
ways. Accident Analysis & Prevention, Vol. 43, No. 5, 2011,
pp. 1677–1688.
29. Wu, Q., F. Chen, G. Zhang, X. C. Liu, H. Wang, and S.
M. Bogus. Mixed Logit Model-Based Driver Injury Sever-
ity Investigations in Single- and Multi-Vehicle Crashes on
Rural Two-Lane Highways. Accident Analysis & Preven-
tion, Vol. 72, 2014, pp. 105–115.
30. Islam, M., N. Alnawmasi, and F. Mannering. Unobserved
Heterogeneity and Temporal Instability in the Analysis of
Work-Zone Crash-Injury Severities. Analytic Methods in
Accident Research, Vol. 28, 2020, p. 100130.
31. Azimi, G., A. Rahimi, H. Asgari, and X. Jin. Severity Anal-
ysis for Large Truck Rollover Crashes Using a Random
Parameter Ordered Logit Model. Accident Analysis & Pre-
vention, Vol. 135, 2020, p. 105355.
32. Islam, S., S. L. Jones, and D. Dye. Comprehensive Analy-
sis of Single- and Multi-Vehicle Large Truck At-Fault
Crashes on Rural and Urban Roadways in Alabama. Acci-
dent Analysis & Prevention, Vol. 67, 2014, pp. 148–158.
33. Wang, J., H. Huang, P. Xu, S. Xie, and S. C. Wong. Ran-
dom Parameter Probit Models to Analyze Pedestrian Red-
Light Violations and Injury Severity in Pedestrian–Motor
Vehicle Crashes at Signalized Crossings. Journal of Trans-
portation Safety & Security, Vol. 12, No. 6, 2020,
pp. 818–837.
34. Chang, L. Y., and H. W. Wang. Analysis of Traffic Injury
Severity: An Application of Non-Parametric Classification
Tree Techniques. Accident Analysis & Prevention, Vol. 38,
No. 5, 2006, pp. 1019–1027.
35. Mujalli, R. O., and J. De Oña. Injury Severity Models for
Motor Vehicle Accidents: A Review. Proceedings of the
Institution of Civil Engineers: Transport, Vol. 166, No. 5,
2013, pp. 255–270.
36. Kashani, A. T., and A. S. Mohaymany. Analysis of the
Traffic Injury Severity on Two-Lane, Two-Way Rural
Roads Based on Classification Tree Models. Safety Sci-
ence, Vol. 49, No. 10, 2011, pp. 1314–1320.
37. Weng, J., Q. Meng, and D. Z. Wang. Tree-Based Logistic
Regression Approach for Work Zone Casualty Risk
Assessment. Risk Analysis, Vol. 33, No. 3, 2013,
pp. 493–504.
38. Abellán, J., G. López, and J. De Oña. Analysis of Traffic
Accident Severity Using Decision Rules via Decision Trees.
Expert Systems with Applications, Vol. 40, No. 15, 2013,
pp. 6047–6054.
39. de Oña, J., G. López, and J. Abellán. Extracting Decision
Rules from Police Accident Reports Through Decision
Trees. Accident Analysis & Prevention, Vol. 50, 2013,
pp. 1151–1160.
40. Chong, M. M., A. Abraham, and M. Paprzycki. Traffic
Accident Analysis Using Decision Trees and Neural Net-
works. arXiv Preprint CS/0405050, May 16, 2004.
41. Ghasemzadeh, A., and M. M. Ahmed. A Probit-Decision
Tree Approach to Analyze Effects of Adverse Weather
Conditions on Work Zone Crash Severity Using Second
Strategic Highway Research Program Roadway Informa-
tion Dataset. Transportation Research Board 96th Annual
Meeting, Washington, D.C., 2017.
42. Moral-García, S., J. G. Castellano, C. J. Mantas, A. Mon-
tella, and J. Abellán. Decision Tree Ensemble Method for
Analyzing Traffic Accidents of Novice Drivers in Urban
Areas. Entropy, Vol. 21, No. 4, 2019, p. 360.
43. Khattak, A. J., A. J. Khattak, and F. M. Council. Effects
of Work Zone Presence on Injury and Non-Injury Crashes.
Accident Analysis & Prevention, Vol. 34, No. 1, 2002,
pp. 19–29.
44. Li, Y., and Y. Bai. Comparison of Characteristics Between
Fatal and Injury Accidents in the Highway Construction
Zones. Safety Science, Vol. 46, No. 4, 2008, pp. 646–660.
45. Li, Y., and Y. Bai. Highway Work Zone Risk Factors and
Their Impact on Crash Severity. Journal of Transportation
Engineering, Vol. 135, No. 10, 2009, pp. 694–701.
46. Khattak, A. J., and F. Targa. Injury Severity and Total
Harm in Truck-Involved Work Zone Crashes. Transporta-
tion Research Record: Journal of the Transportation
Research Board, 2004. 1877: 106–116.
47. Osman, M., R. Paleti, S. Mishra, and M. M. Golias. Anal-
ysis of Injury Severity of Large Truck Crashes in Work
Zones. Accident Analysis & Prevention, Vol. 97, 2016,
pp. 261–273.
48. Zhang, K., and M. Hassan. Crash Severity Analysis of
Nighttime and Daytime Highway Work Zone Crashes.
PLoS One, Vol. 14, No. 8, 2019, p. e0221128.
49. Harb, R., E. Radwan, X. Yan, A. Pande, and M.
Abdel-Aty. Freeway Work-Zone Crash Analysis and Risk
Identification Using Multiple and Conditional Logistic
Regression. Journal of Transportation Engineering, Vol.
134, No. 5, 2008, pp. 203–214.
50. Salem, O. M., A. M. Genaidy, H. Wei, and N. Deshpande.
Spatial Distribution and Characteristics of Accident
Crashes at Work Zones of Interstate Freeways in Ohio.
Proc., 2006 IEEE Intelligent Transportation Systems Con-
ference, IEEE, New York, 2006, pp. 1642–1647.
51. Raub, R. A., O. B. Sawaya, J. L. Schofer, and A. Ziliasko-
poulos. Enhanced Crash Reporting to Explore Workzone
Crash Patterns. Paper No. 01-0166, Northwestern Univer-
sity Center for Public Safety, Evanston, IL, 2001.
52. Sze, N. N., and Z. Song. Factors Contributing to Injury
Severity in Work Zone Related Crashes in New Zealand.
International Journal of Sustainable Transportation, Vol.
13, No. 2, 2019, pp. 148–154.
53. Schrock, S. D., G. L. Ullman, A. S. Cothron, E. Kraus,
and A. P. Voigt. An Analysis of Fatal Work Zone Crashes
in Texas. Report FHWA/TX-05/0-4028-1, Texas Depart-
ment of Transportation, Research and Technology Imple-
mentation Office, October 2004.
54. Qi, Y., R. Srinivasan, H. Teng, and R. Baker. Analysis of
the Frequency and Severity of Rear-End Crashes in Work
Zones. Traffic Injury Prevention, Vol. 14, No. 1, 2013,
pp. 61–72.
55. Wang, Z., J. J. Lu, Q. Wang, L. Lu, and Z. Zhang. Modeling
Injury Severity in Work Zones Using Ordered Probit Regres-
sion. Proc., ICCTP 2010: Integrated Transportation Systems:
Green, Intelligent, Reliable, 2010, pp. 1058–1067, Beijing,
China.
56. Zhang, S., C. Tjortjis, X. Zeng, H. Qiao, I. Buchan, and J.
Keane. Comparing Data Mining Methods with Logistic
Regression in Childhood Obesity Prediction. Information
Systems Frontiers, Vol. 11, No. 4, 2009, pp. 449–460.
57. Kuhle, S., B. Maguire, H. Zhang, D. Hamilton, A. C.
Allen, K. S. Joseph, and V. M. Allen. Comparison of
Logistic Regression with Machine Learning Methods for
the Prediction of Fetal Growth Abnormalities: A Retro-
spective Cohort Study. BMC Pregnancy and Childbirth,
Vol. 18, No. 1, 2018, pp. 1–9.
58. Pua, Y. H., H. Kang, J. Thumboo, R. A. Clark, E. S.
Chew, C. L. Poon, H. C. Chong, and S. J. Yeo. Machine
Learning Methods Are Comparable to Logistic Regression
Techniques in Predicting Severe Walking Limitation Fol-
lowing Total Knee Arthroplasty. Knee Surgery, Sports
Traumatology, Arthroscopy, 2020 28(10): 3207–3216.
59. Hensher, D. A., and W. H. Greene. The Mixed Logit
Model: The State of Practice. Transportation, Vol. 30, No.
2, 2003, pp. 133–176.
60. Hensher, D. A., J. M. Rose, and W. H. Greene.
Applied Choice Analysis: A Primer. Cambridge University
Press, Cambridge, 2005.
61. Jahangiri, A., and H. A. Rakha. Applying Machine Learn-
ing Techniques to Transportation Mode Recognition
Using Mobile Phone Sensor Data. IEEE Transactions on
Intelligent Transportation Systems, Vol. 16, No. 5, 2015,
pp. 2406–2417.
62. Jahangiri, A., H. Rakha, and T. A. Dingus. Red-Light
Running Violation Prediction Using Observational and
Simulator Data. Accident Analysis & Prevention, Vol. 96,
2016, pp. 316–328.
63. Balali, V., and M. Golparvar-Fard. Evaluation of Multi-
class Traffic Sign Detection and Classification Methods for
US Roadway Asset Inventory Management. Journal of
Computing in Civil Engineering, Vol. 30, No. 2, 2016, p.
04015022.
64. Chen, S., W. Wang, and H. Van Zuylen. Construct Sup-
port Vector Machine Ensemble to Detect Traffic Incident.
Expert Systems with Applications, Vol. 36, No. 8, 2009,
pp. 10976–10986.
65. Yu, R., and M. Abdel-Aty. Utilizing Support Vector
Machine in Real-Time Crash Risk Evaluation. Accident
Analysis & Prevention, Vol. 51, 2013, pp. 252–259.
66. Li, Z., P. Liu, W. Wang, and C. Xu. Using Support Vector
Machine Models for Crash Injury Severity Analysis. Acci-
dent Analysis & Prevention, Vol. 45, 2012, pp. 478–486.
67. Chen, C., G. Zhang, Z. Qian, R. A. Tarefder, and Z. Tian.
Investigating Driver Injury Severity Patterns in Rollover
Crashes Using Support Vector Machine Models. Accident
Analysis & Prevention, Vol. 90, 2016, pp. 128–139.
68. Zheng, Z., P. Lu, and B. Lantz. Commercial Truck Crash
Injury Severity Analysis Using Gradient Boosting Data
Mining Model. Journal of Safety Research, Vol. 65, 2018,
pp. 115–124.
69. Japkowicz, N. Learning from Imbalanced Data Sets: A
Comparison of Various Strategies. In AAAI Workshop on
Learning from Imbalanced Data Sets, AAAI Press, Menlo
Park, CA, Vol. 68, 2000, pp. 10–15.
70. Barandela, R., R. M. Valdovinos, J. S. Sánchez, and F. J.
Ferri. The Imbalanced Training Sample Problem: Under
or Over Sampling? In Joint IAPR International Workshops
on Statistical Techniques in Pattern Recognition (SPR) and
Structural and Syntactic Pattern Recognition (SSPR),
Springer, Berlin, Heidelberg, 2004, pp. 806–814.
71. Van Hulse, J., T. M. Khoshgoftaar, and A. Napolitano.
Experimental Perspectives on Learning from Imbalanced
Data. Proc., 24th International Conference on Machine Learning,
Corvallis, Oregon, USA, June 20–24, 2007, pp. 935–942.
72. Gu, J., Y. Zhou, and X. Zuo. Making Class Bias Useful: A
Strategy of Learning from Imbalanced Data. Proc., Inter-
national Conference on Intelligent Data Engineering and
Automated Learning, Springer, Berlin, Heidelberg, 2007,
pp. 287–295.
73. Ahmadi, A., A. Jahangiri, V. Berardi, and S. G. Machiani.
Crash Severity Analysis of Rear-End Crashes in California
Using Statistical and Machine Learning Classification
Methods. Journal of Transportation Safety & Security, Vol.
12, No. 4, 2020, pp. 522–546.
74. Lamba, D., M. Alsadhan, W. Hsu, E. Fitzsimmons, and
G. Newmark. Coping with Class Imbalance in Classification
of Traffic Crash Severity Based on Sensor and Road Data: A
Feature Selection and Data Augmentation Approach. The 6th
International Conference on Artificial Intelligence and Applica-
tions (AIAP-2019), May 25-26, 2019, Vancouver, Canada.
75. Yahaya, M., X. Jiang, C. Fu, K. Bashir, and W. Fan.
Enhancing Crash Injury Severity Prediction on Imbalanced
Crash Data by Sampling Technique with Variable Selec-
tion. Proc., 2019 IEEE Intelligent Transportation Systems
Conference (ITSC), IEEE, New York, 2019, pp. 363–368.
76. Fiorentini, N., and M. Losa. Handling Imbalanced Data
in Road Crash Severity Prediction by Machine Learning
Algorithms. Infrastructures, Vol. 5, No. 7, 2020, p. 61.
77. Elhassan, T., M. Aljurf, F. Al-Mohanna, and M. Shoukri.
Classification of Imbalance Data Using Tomek Link (t-
Link) Combined with Random Under-Sampling (RUS) as
a Data Reduction Method. Journal of Informatics and Data
Mining, Vol. 1, No. 2, 2016. https://doi.org/10.21767/2472-1956.100011
78. Han, H., W. Y. Wang, and B. H. Mao. Borderline-
SMOTE: A New Over-Sampling Method in Imbalanced
Data Sets Learning. Proc., International Conference on
Intelligent Computing, Springer, Berlin, Heidelberg, 2005,
pp. 878–887.
79. Anand, A., G. Pugalenthi, G. B. Fogel, and P. N.
Suganthan. An Approach for Classification of Highly
Imbalanced Data Using Weighting and Undersampling.
Amino Acids, Vol. 39, No. 5, 2010, pp. 1385–1391.
80. Ng, W. W., J. Hu, D. S. Yeung, S. Yin, and F. Roli. Diver-
sified Sensitivity-Based Undersampling for Imbalance Clas-
sification Problems. IEEE Transactions on Cybernetics,
Vol. 45, No. 11, 2014, pp. 2402–2412.
81. Naik, B., L. R. Rilett, J. Appiah, and L. F. Walubita.
Resampling Methods for Estimating Travel Time Uncer-
tainty: Application of the Gap Bootstrap. Transportation
Research Record: Journal of the Transportation Research
Board, 2018. 2672: 137–147.
82. Parsa, A. B., H. Taghipour, S. Derrible, and A. K.
Mohammadian. Real-Time Accident Detection: Coping
with Imbalanced Data. Accident Analysis & Prevention,
Vol. 129, 2019, pp. 202–210.
83. Ambühl, L., A. Loder, M. C. Bliemer, M. Menendez, and
K. W. Axhausen. Introducing a Re-Sampling Methodology
for the Estimation of Empirical Macroscopic Fundamental
Diagrams. Transportation Research Record: Journal of the
Transportation Research Board, 2018. 2672: 239–248.
84. Kitali, A. E., P. Alluri, T. Sando, and W. Wu. Identifica-
tion of Secondary Crash Risk Factors Using Penalized
Logistic Regression Model. Transportation Research
Record: Journal of the Transportation Research Board,
2019. 2673: 901–914.
85. Li, D. C., H. Y. Chen, and Q. S. Shi. Learning from Small
Datasets Containing Nominal Attributes. Neurocomputing,
Vol. 291, 2018, pp. 226–236.
86. Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P.
Kegelmeyer. SMOTE: Synthetic Minority Over-Sampling
Technique. Journal of Artificial Intelligence Research, Vol.
16, 2002, pp. 321–357.
87. Mitchell, T. M. Machine Learning, McGraw Hill, 1997,
ISBN 0070428077.
88. Kingsford, C., and S. L. Salzberg. What Are Decision
Trees? Nature Biotechnology, Vol. 26, No. 9, 2008,
pp. 1011–1013.
89. Chen, Y. L., and L. T. Hung. Using Decision Trees to Sum-
marize Associative Classification Rules. Expert Systems
with Applications, Vol. 36, No. 2, 2009, pp. 2338–2351.
90. Ghasemzadeh, A., B. E. Hammit, M. M. Ahmed, and R.
K. Young. Parametric Ordinal Logistic Regression and
Non-Parametric Decision Tree Approaches for Assessing
the Impact of Weather Conditions on Driver Speed Selec-
tion Using Naturalistic Driving Data. Transportation
Research Record: Journal of the Transportation Research
Board, 2018. 2672: 137–147.
91. Quinlan, J. R. Induction of Decision Trees. Machine Learn-
ing, Vol. 1, No. 1, 1986, pp. 81–106.
92. Rokach, L., and O. Maimon. Top-Down Induction of
Decision Trees Classifiers: A Survey. IEEE Transactions on
Systems, Man, and Cybernetics, Part C (Applications and
Reviews), Vol. 35, No. 4, 2005, pp. 476–487.
93. Mingers, J. An Empirical Comparison of Pruning Methods
for Decision Tree Induction. Machine Learning, Vol. 4, No.
2, 1989, pp. 227–243.
94. Bradford, J. P., C. Kunz, R. Kohavi, C. Brunk, and C. E.
Brodley. Pruning Decision Trees with Misclassification
Costs. Proc., European Conference on Machine Learning,
April 21, Springer, Berlin, Heidelberg, 1998, pp. 131–136.
95. Breiman, L. Random Forests. Machine Learning, Vol. 45,
No. 1, 2001, pp. 5–32.
96. Liaw, A., and M. Wiener. Classification and Regression
by Random Forest. R News, Vol. 2, No. 3, 2002, pp.
18–22.
97. Mafi, S., Y. Abdelrazig, and R. Doczy. Machine Learning
Methods to Analyze Injury Severity of Drivers from Dif-
ferent Age and Gender Groups. Transportation Research
Record: Journal of the Transportation Research Board,
2018. 2672: 171–183.
98. Arbabzadeh, N., M. Jalayer, and M. Jafari. Predicting
Traffic Safety Risk Factors Using an Ensemble Classifier.
In Data Analytics for Smart Cities (Alavi, A. H. and W. G.
Buttlar, eds.), Auerbach Publications, Boca Raton, FL,
2018, pp. 201–216.
99. Fayed, H. A., and A. F. Atiya. Speed Up Grid-Search for
Parameter Selection of Support Vector Machines. Applied
Soft Computing, Vol. 80, 2019, pp. 202–210.
100. Soleimani, S., S. R. Mousa, J. Codjoe, and M. Leitner.
A Comprehensive Railroad-Highway Grade Crossing
Consolidation Model: A Machine Learning Approach.
Accident Analysis & Prevention, Vol. 128, 2019, pp. 65–77.
101. Probst, P., A. L. Boulesteix, and B. Bischl. Tunability:
Importance of Hyperparameters of Machine Learning
Algorithms. The Journal of Machine Learning Research,
Vol. 20, No. 1, 2019, pp. 1934–1965.
102. Shapley, L. S. A Value for n-Person Games. In Contributions
to the Theory of Games II (Kuhn, H. W., and A. W.
Tucker, eds.), Princeton University Press, Princeton, New
Jersey, USA, 1953.
103. Lundberg, S., and S. I. Lee. A Unified Approach to Inter-
preting Model Predictions. arXiv Preprint
arXiv:1705.07874, May 22, 2017.
104. Lundberg, S. M., G. Erion, H. Chen, A. DeGrave, J. M.
Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, and
S. I. Lee. From Local Explanations to Global Under-
standing with Explainable AI for Trees. Nature Machine
Intelligence, Vol. 2, No. 1, 2020, pp. 56–67.
105. Angel, A., and M. Hickman. Analysis of the Factors
Affecting the Severity of Two-Vehicle Crashes. Ingeniería
y Desarrollo, Vol. 24, 2008, pp. 176–194.
Appendix A

Table A1. Ordered Logistic Regression Models

Ordered logistic regression on random over-sampled data

| Variable name | Value | Standard error | t Value | Marginal effect on property damage only | Marginal effect on injury | Marginal effect on fatality |
|---|---|---|---|---|---|---|
| Crash type_Sideswipe | -0.882 | 0.064 | -13.711 | 0.166 | -0.019 | -0.148 |
| Crash type_Pedestrian | 2.104 | 0.225 | 9.349 | -0.189 | -0.292 | 0.482 |
| Crash type_rear end | 1.336 | 0.125 | 10.665 | -0.148 | -0.161 | 0.309 |
| No airbag equipment | -0.488 | 0.105 | -4.662 | 0.084 | 0.007 | -0.09 |
| Airbag not deployed | -1.143 | 0.102 | -11.151 | 0.191 | 0.024 | -0.215 |
| Front airbag deployed for at-fault vehicle | 0.226 | 0.113 | 2.009 | -0.035 | -0.01 | 0.046 |
| Front airbag deployed for not-at-fault vehicle | 1.736 | 0.107 | 16.157 | -0.174 | -0.229 | 0.403 |
| Restraint not used for at-fault driver | 1.815 | 0.140 | 12.929 | -0.193 | -0.222 | 0.415 |
| Restraint_Shoulder and lap belt for at-fault driver | 0.086 | 0.104 | 0.833 | -0.014 | -0.002 | 0.017 |
| Lighting = dark not_lighted | 0.631 | 0.086 | 7.354 | -0.09 | -0.045 | 0.134 |
| Lighting = daylight | 0.087 | 0.065 | 1.338 | -0.014 | -0.003 | 0.017 |
| Driver action = wrong side of wrong way | -0.017 | 0.544 | -0.031 | 0.003 | 0.001 | -0.003 |
| Driver at fault's condition normal | -1.835 | 0.074 | -24.657 | 0.222 | 0.184 | -0.406 |
| Not distracted (at-fault driver) | -0.427 | 0.055 | -7.708 | 0.066 | 0.019 | -0.086 |
| Driver at fault age | 0.008 | 0.002 | 5.410 | -0.001 | 0 | 0.002 |
| Driver not at fault age | -0.008 | 0.001 | -5.923 | 0.001 | 0 | -0.001 |
| Driver not at fault = female | 0.355 | 0.058 | 6.072 | -0.059 | -0.008 | 0.068 |
| HARMFUL_EVT1_1_Pedestrian | 2.244 | 0.245 | 9.171 | -0.194 | -0.314 | 0.509 |
| Crash location = within city limits | -0.789 | 0.051 | -15.336 | 0.135 | 0.011 | -0.146 |
| Shoulder type = curb | -0.217 | 0.063 | -3.425 | 0.037 | 0.004 | -0.041 |

Ordered logistic regression on SMOTE-NC over-sampled data

| Variable name | Value | Standard error | t Value | Marginal effect on property damage only | Marginal effect on injury | Marginal effect on fatality |
|---|---|---|---|---|---|---|
| Crash type_Sideswipe | -1.212 | 0.084 | -14.417 | 0.189 | -0.025 | -0.164 |
| Crash type_Pedestrian | 2.392 | 0.755 | 3.167 | -0.135 | -0.398 | 0.533 |
| Crash type_rear end | 0.696 | 0.060 | 11.689 | -0.078 | -0.048 | 0.125 |
| No airbag equipment | -2.091 | 0.098 | -21.411 | 0.32 | -0.027 | -0.293 |
| Airbag not deployed | -2.686 | 0.094 | -28.684 | 0.367 | 0.057 | -0.424 |
| Front airbag deployed for at-fault vehicle | -1.735 | 0.111 | -15.586 | 0.307 | -0.105 | -0.202 |
| Front airbag deployed for not-at-fault vehicle | 3.489 | 0.148 | 23.548 | -0.158 | -0.538 | 0.696 |
| Restraint not used for at-fault driver | 1.827 | 0.164 | 11.112 | -0.12 | -0.292 | 0.412 |
| Restraint_Shoulder and lap belt for at-fault driver | -2.286 | 0.476 | -4.805 | 0.467 | -0.269 | -0.199 |
| Lighting = dark not_lighted | 0.409 | 0.099 | 4.150 | -0.042 | -0.035 | 0.078 |
| Lighting = daylight | 0.477 | 0.071 | 6.729 | -0.059 | -0.02 | 0.079 |
| Driver action = wrong side of wrong way | -1.600 | 0.911 | -1.756 | 0.299 | -0.128 | -0.171 |
| Driver at fault's condition normal | -2.707 | 0.100 | -27.191 | 0.193 | 0.379 | -0.572 |
| Not distracted (at-fault driver) | 0.025 | 0.073 | 0.347 | -0.003 | -0.001 | 0.004 |
| Driver at fault age | -0.006 | 0.002 | -3.894 | 0.001 | 0 | -0.001 |
| Driver not at fault age | -0.003 | 0.001 | -2.590 | 0 | 0 | -0.001 |
| Driver not at fault = female | -0.308 | 0.079 | -3.912 | 0.039 | 0.011 | -0.05 |
| HARMFUL_EVT1_1_Pedestrian | 4.043 | 0.763 | 5.302 | -0.164 | -0.581 | 0.744 |
| Crash location = within city limits | -1.021 | 0.057 | -17.822 | 0.136 | 0.024 | -0.16 |
| Shoulder type = curb | -1.024 | 0.085 | -12.085 | 0.156 | -0.014 | -0.142 |
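As a reading aid for Table A1, the following sketch shows how a single coefficient in an ordered logit model translates into marginal effects on the three severity levels. The cutpoints, the baseline linear predictor, and all function names are hypothetical, since the table does not report thresholds; only the coefficient 2.104 (Crash type_Pedestrian, random over-sampled model) is taken from the table. The effect of a dummy variable is computed here as a difference in predicted probabilities, which is one common convention and not necessarily the one the authors used.

```python
import math


def logistic(z):
    """Standard logistic CDF."""
    return 1.0 / (1.0 + math.exp(-z))


def ordered_logit_probs(xb, cutpoints):
    """Return P(y = j) for each ordered category, given linear predictor xb.

    Uses P(y <= j) = logistic(cutpoint_j - xb) for increasing cutpoints.
    """
    cdf = [logistic(tau - xb) for tau in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for value in cdf:
        probs.append(value - previous)
        previous = value
    return probs


def dummy_marginal_effects(beta, base_xb, cutpoints):
    """Change in each category probability when a 0/1 variable switches on."""
    p_off = ordered_logit_probs(base_xb, cutpoints)
    p_on = ordered_logit_probs(base_xb + beta, cutpoints)
    return [on - off for off, on in zip(p_off, p_on)]


# Hypothetical thresholds separating PDO | injury | fatality.
CUTPOINTS = [0.0, 2.0]
# Coefficient for Crash type_Pedestrian from Table A1 (random over-sampled data).
BETA_PEDESTRIAN = 2.104

effects = dummy_marginal_effects(BETA_PEDESTRIAN, base_xb=0.0, cutpoints=CUTPOINTS)
for name, effect in zip(["PDO", "injury", "fatality"], effects):
    print(f"{name}: {effect:+.3f}")
```

Because the category probabilities always sum to one, the three marginal effects in each row must sum to zero, which matches the pattern in the table up to rounding.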
... WZ crash analysis has received enormous attention from transport safety researchers due to the increase in fatalities in recent years (Gupta et al. 2021). Many researchers have investigated WZ crash types and frequency, the severity of the accidents, and the factors responsible for WZ crash around the world (Gupta et al. 2021;Akepati and Dissanayake 2011;Ghasemzadeh and Ahmed 2019;Ashqar et al. 2021;Li and Bai 2006;K. Zhang and Hassan 2019). ...
... Zhang and Hassan 2019). Most of the previous studies collected data from police reports to conduct their studies (Gupta et al. 2021;Ghasemzadeh and Ahmed 2019;B. Zhang 2018;Ashqar et al. 2021;Khattak, J. Liu, and M. Zhang 2015), DOTs (Akepati and Dissanayake 2011;B. ...
Preprint
The construction and maintenance activities of roadway infrastructure positively contribute to social and economic development and improve traffic safety. However, roadway work zones (WZs) cause safety issues for construction workers and travelers and adversely affect vehicular movement. This study aims to explore public perceptions about WZs and identify factors that influence crashes and public experience at WZs by collecting and analyzing Twitter data. In this paper, we have employed several machine learning methods to classify WZ tweets and then performed exploratory, sentiment, and emotion analysis on the classified tweets. We have then verified our Twitter-related research outcomes with police crash reports. The sentiment and emotion analysis using classified tweets (with 92% classification accuracy and 0.68 F-score) showed somewhat negative emotion on roadway WZs and onsite physical elements. However, the overall sentiment and emotion scores support positive outcomes from WZs' activities. We also found a temporal relationship and a strong correlation between WZ-related tweets and fatalities. A cross-analysis of tweets and crash reports revealed that some physical elements (e.g., signs, barriers, barrels, closures, and workers) are significantly associated with severe crashes at WZs. The results of this research may help policymakers to make appropriate policy decisions in improving driving experiences and reducing WZ-related traffic accidents.
... However, using traditional resampling techniques like RUMC can result in the loss of crucial information [23,32]. Consequently, it is vital to explore synthetic resampling strategies as an alternative approach to effectively tackle this issue [33][34][35][36]. Despite the potential benefits of synthetic resampling strategies, comprehensive comparative analyses assessing the predictive power of parametric and non-parametric machine learning techniques in conjunction with such strategies remain scarce. ...
... SMOTE can help improve the performance of machine learning algorithms by increasing the representation of minority class samples in the training data. This approach has been shown to improve the accuracy and robustness of the models trained on imbalanced data [29,34,35]. ...
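The over-sampling idea described above can be sketched as follows. This is a minimal illustration of plain SMOTE-style interpolation for numeric features only, written with the standard library; it is not SMOTE-NC (which also handles nominal attributes) and not the implementation used in any of the cited studies. The function name `smote_samples`, the parameter defaults, and the toy points are assumptions.

```python
import random


def smote_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by interpolating toward neighbours.

    Each synthetic point lies on the segment between a randomly chosen minority
    point and one of its k nearest minority neighbours (squared Euclidean distance).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by distance to the chosen base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j])))
        neighbour = minority[rng.choice(others[:k])]
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy minority class: four points on the corners of the unit square.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_samples(minority_points, n_new=5)
print(new_points)
```

Synthetic points should be generated from the training split only, so that the test set keeps the original class distribution.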
Article
As the global elderly population continues to rise, the risk of severe crashes among elderly drivers has become a pressing concern. This study presents a comprehensive examination of crash severity among this demographic, employing machine learning models and data gathered from Virginia, United States of America, between 2014 and 2021. The analysis integrates parametric models, namely logistic regression and linear discriminant analysis (LDA), as well as non-parametric models like random forest (RF) and extreme gradient boosting (XGBoost). Central to this study is the application of resampling techniques, specifically, random over-sampling examples (ROSE) and the synthetic minority over-sampling technique (SMOTE), to address the dataset’s inherent imbalance and enhance the models’ predictive performance. Our findings reveal that the inclusion of these resampling techniques significantly improves the predictive power of parametric models, notably increasing the true positive rate for severe crash prediction from 6% to 60% and boosting the geometric mean from 25% to 69% in logistic regression. Likewise, employing SMOTE resulted in a notable improvement in the non-parametric models’ performance, leading to a true positive rate increase from 8% to 36% in XGBoost. Moreover, the study established the superiority of parametric models over non-parametric counterparts when balanced resampling techniques are utilized. Beyond predictive modeling, the study delves into the effects of various contributing factors on crash severity, enhancing the understanding of how these factors influence elderly road safety. Ultimately, these findings underscore the immense potential of machine learning models in analyzing complex crash data, pinpointing factors that heighten crash severity, and informing targeted interventions to mitigate the risks of elderly driving.
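The true positive rate and geometric mean mentioned in this abstract can be illustrated with a short sketch. The label vectors below are toy data invented for the example, and the function names are hypothetical; the point is only that a classifier that ignores the minority class can have high accuracy while its geometric mean of per-class recalls is zero.

```python
import math


def recall_per_class(y_true, y_pred):
    """Recall (true positive rate) for every class present in y_true."""
    recalls = {}
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls[cls] = hits / total
    return recalls


def geometric_mean(recalls):
    """Geometric mean of the per-class recalls."""
    return math.prod(recalls.values()) ** (1.0 / len(recalls))


# Toy data: 8 non-severe (0) and 2 severe (1) crashes.
y_true = [0] * 8 + [1] * 2
y_majority_only = [0] * 10                 # never predicts the severe class
y_rebalanced = [0] * 6 + [1] * 2 + [1, 0]  # trades some specificity for recall

accuracy_majority = sum(t == p for t, p in zip(y_true, y_majority_only)) / len(y_true)
g_majority = geometric_mean(recall_per_class(y_true, y_majority_only))
g_rebalanced = geometric_mean(recall_per_class(y_true, y_rebalanced))
print(accuracy_majority, g_majority, g_rebalanced)
```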
... For example, while the aggregate model may suggest that higher speeds may contribute to injury severity in truck-involved crashes in work zones, its effect may vary under different speed values. That is, crashes in work zones with speed limits above a threshold increase the likelihood of injury (e.g., [7,8]), while crashes in work zones with speed limits below a threshold have no effect on the likelihood of injury [11]. As such, disaggregating truck-involved crashes in work zones by speed can provide additional insights on the effect of speed on truck-involved work zone crashes. ...
... Further, they observed that the risk of rear-end crashes was greater during nighttime and on weekends. Gupta et al. [11] used a random forest model to characterize the most important factors that impact fatality in truck-involved work zone crashes. Unlike most of the other work in this space, they did not find speed to be among the most important factors. ...
Article
This study investigates factors contributing to the injury severity of truck-involved work zone crashes in South Carolina (SC). The outcome of interest is injury or property damage only crashes, and the explanatory factors examined include the occupant, vehicle, collision, roadway, temporal, and environmental characteristics. Two mixed (random parameter) logit models are developed, one for non-interstates with speed limits less than 60 miles per hour (mph) and one for interstates with speed limits greater than or equal to 60 mph, using South Carolina statewide truck-involved work zone crash data from 2014 to 2020. Results of log-likelihood ratio tests indicate that separate speed models are warranted. The factors that were found to contribute to injury at the 90% confidence level in both models (interstate and non-interstate) are (1) dark lighting conditions, (2) female (at-fault) drivers, and (3) driving too fast for roadway conditions. Significant factors that apply only to non-interstates are SC or US primary roadways, activity area of the work zone, at-fault drivers under 35, sideswipe collision, presence of workers in the work zone, and collision with fixed objects. Significant factors that apply only to interstates are three or more vehicles, rear-end collision, location before the first work zone sign, and weekdays.
... These methodologies include random parameter approaches (Venkataraman et al. 2013; Islam and Hernandez 2013a, 2013b, 2016), latent class models (Xiong and Mannering 2013; Behnood et al. 2014; Yasmin et al. 2014), a fusion of both (Islam and Mannering 2021), and consideration of heterogeneity in means and variances (Hang et al. 2022; Islam 2022a, 2022b, 2023a, 2023b, 2024a; Islam and Bertini 2023; Mannering 2020, 2021; Islam and Pande 2020; Nasrollahzadeh et al. 2021). Table A2 (supplementary material) presents some of the methodologies used in the work zone safety studies (Ashqar et al. 2021; Gupta et al. 2021; Hang et al., 2022; Islam 2022a ... In this specific investigation, this study adopts a random parameters logit model. This model is designed to accommodate potential heterogeneity in both the means and variances of the random parameters, offering a robust solution for addressing unobserved heterogeneity. ...
Article
Objective: Work zones are unique in geometry and traffic management, utilizing special traffic signs, standard channelizing devices, appropriate barriers, and pavement markings. These configurations can introduce unexpected driving conditions, potentially posing risks to drivers. This analysis aims to explore potential differences in contributing factors between work-zone crashes where geometry was identified as a factor and those where it was not. To gain insights into driver injury severities in single-vehicle work-zone crashes, this study analyzed work zone crash data from Florida. Method: This study employed random parameters logit models, accommodating potential variations in parameter estimates’ means and variances. The dataset encompassed a wide array of factors known to influence driver injury severity, encompassing crash characteristics, vehicle attributes, roadway features, prevailing traffic volume, driver profiles, and spatial and temporal considerations. Results: This analysis yielded significantly distinct parameters for work-zone crashes, distinguishing between geometry-related and non-geometry-related factors (primarily the human factors). This distinction suggests a complex interplay between these factors. Notably, the marginal effects of individual parameter estimates exhibited marked differences between these two categories (geometry and non-geometry factors). Conclusion: These findings contribute to the growing body of research indicating that geometric restrictions within work zones introduce a distinct set of risk factors compared to non-geometry related factors. Recognizing the significance of geometric restrictions, beyond typical driving conditions, has implications for enhancing safety within various work zone configurations and offers valuable insights for crash scene investigators to pinpoint contributing factors accurately.
... The choice of adopting SMOTE, consistent with Orsini et al. (3), was mainly because of the limited sample size available, which would have made RUMC unfeasible, in particular for short data collection durations. It is also worth noting that the practice of using SMOTE is becoming quite established in road safety modeling, and many recent works have applied it with good results (22)(23)(24)(25)(26)(27). ...
Article
Conflict-based approaches to real-time road safety analysis can provide several benefits over traditional crash-based models. In particular, as traffic conflicts are much more frequent than crashes, models can be trained with significantly shorter collection periods. Since existing literature has not investigated the sensitivity of real-time conflict prediction models (RTConfPM) to data collection duration, here we aim to fill this gap and discuss the implications for model resilience. A real-world highway case study was analyzed. Methodologically, various traffic variables aggregated into 5 min intervals were selected as predictors, synthetic minority oversampling technique (SMOTE) was applied to deal with unbalanced classification issues, and support vector machine (SVM) was chosen as classifier. The dichotomous response variable separated safe and unsafe intervals into two classes; the latter were defined considering a minimum number of rear-end conflicts within the interval, which were identified using a surrogate measure of safety (SMS), that is, time-to-collision. Several RTConfPMs were trained and tested, considering different data collection durations and different criteria to define the unsafe situation class. The results, which were shown to be robust with respect to the machine learning classifier used, indicate that the models were able to provide reliable predictions with just three to five days of data, and that the increase in performance with collection periods longer than 10 to 15 days was negligible. These findings can be generalized by considering the number of unsafe situations corresponding to the data collection period of each tested model; they highlight the relevance of RTConfPM as a more flexible and resilient alternative to the crash-based approach.
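The labelling rule described in this abstract (an interval is unsafe when it contains at least a minimum number of rear-end conflicts identified by time-to-collision) can be sketched as below. The 2.0 s threshold, the minimum of two conflicts, the function names, and the observations are all assumptions chosen for illustration; the abstract does not state the values actually used.

```python
def time_to_collision(gap_m, v_follow_mps, v_lead_mps):
    """Time-to-collision in seconds for a follower closing in on a leader.

    Returns infinity when the follower is not faster than the leader,
    because no collision would occur at constant speeds.
    """
    closing_speed = v_follow_mps - v_lead_mps
    if closing_speed <= 0:
        return float("inf")
    return gap_m / closing_speed


def is_unsafe_interval(observations, ttc_threshold_s=2.0, min_conflicts=2):
    """Label an interval unsafe if enough vehicle pairs fall below the TTC threshold.

    observations: iterable of (gap_m, v_follow_mps, v_lead_mps) tuples.
    Both the threshold and the minimum count are hypothetical defaults.
    """
    conflicts = sum(
        1
        for gap, v_follow, v_lead in observations
        if time_to_collision(gap, v_follow, v_lead) < ttc_threshold_s
    )
    return conflicts >= min_conflicts


# Toy 5 min interval with three car-following pairs.
busy_interval = [(10.0, 30.0, 20.0), (15.0, 28.0, 18.0), (50.0, 25.0, 25.0)]
calm_interval = [(100.0, 30.0, 20.0)]
print(is_unsafe_interval(busy_interval), is_unsafe_interval(calm_interval))
```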
... Regarding the influence of weather and lighting conditions on crash severity, the results in the literature are inconsistent. It should be noted that while some studies ranked the weather and lighting conditions as one of the main predictors of the crash severity outcome [19,31,75], others are in accordance with the findings of this paper, indicating the insignificance of these variables on crash severity [23,35]. In addition, the significance of the temporal factors has been confirmed in the literature on the subject, as Zheng et al. [19] identified the importance of time of day and weekday in the severity levels of crashes involving trucks. ...
Article
While the cost of road traffic fatalities in the U.S. surpasses $240 billion a year, the availability of high-resolution datasets allows meticulous investigation of the contributing factors to crash severity. In this paper, the dataset for Trucks Involved in Fatal Accidents in 2010 (TIFA 2010) is utilized to classify truck-involved crash severity, a task that presents several issues including missing values, imbalanced classes, and high dimensionality. First, a decision tree-based algorithm, the Synthetic Minority Oversampling Technique (SMOTE), and the Random Forest (RF) feature importance approach are employed for missing value imputation, minority class oversampling, and dimensionality reduction, respectively. Afterward, a variety of classification algorithms, including RF, K-Nearest Neighbors (KNN), Multi-Layer Perceptron (MLP), Gradient-Boosted Decision Trees (GBDT), and Support Vector Machine (SVM) are developed to reveal the influence of the introduced data preprocessing framework on the output quality of ML classifiers. The results show that the GBDT model outperforms all the other competing algorithms for the non-preprocessed crash data based on the G-mean performance measure, but the RF makes the most accurate prediction for the treated dataset. This finding indicates that after the feature selection is conducted to alleviate the computational cost of the machine learning algorithms, bagging (bootstrap aggregating) of decision trees in RF leads to a better model than boosting them via GBDT. Besides, the adopted feature importance approach decreases the overall accuracy by only up to 5% in most of the estimated models. Moreover, the worst class recall value of the RF algorithm without prior oversampling is only 34.4% compared to the corresponding value of 90.3% in the up-sampled model, which validates the proposed multi-step preprocessing scheme. This study also identifies the temporal and spatial (roadway) attributes, as well as crash characteristics, and Emergency Medical Service (EMS) as the most critical factors in truck crash severity.
Article
Crash severity is undoubtedly a fundamental aspect of a crash event. Although machine learning algorithms for predicting crash severity have recently gained interest in the academic community, there is a significant trend towards neglecting the fact that crash datasets are acutely imbalanced. Overlooking this fact generally leads to weak classifiers for predicting the minority class (crashes with higher severity). In this paper, in order to handle imbalanced accident datasets and provide a better prediction for the minority class, the random undersampling the majority class (RUMC) technique is used. By employing an imbalanced and a RUMC-based balanced training set, we propose the calibration, validation, and evaluation of four different crash severity predictive models, including random tree, k-nearest neighbor, logistic regression, and random forest. Accuracy, true positive rate (recall), false positive rate, true negative rate, precision, F1-score, and the confusion matrix have been calculated to assess the performance. Outcomes show that RUMC-based models provide an enhancement in the reliability of the classifiers for detecting fatal crashes and those causing injury. Indeed, in imbalanced models, the true positive rate for predicting fatal crashes and those causing injury spans from 0% (logistic regression) to 18.3% (k-nearest neighbor), while for the RUMC-based models, it spans from 52.5% (RUMC-based logistic regression) to 57.2% (RUMC-based k-nearest neighbor). Organizations and decision-makers could make use of RUMC and machine learning algorithms in predicting the severity of a crash occurrence, managing the present, and planning the future of their works.
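Random undersampling of the majority class, as described in this abstract, can be sketched in a few lines. This is an illustrative standard-library version under the assumption that every class is reduced to the size of the smallest class; the function name, the class labels, and the toy counts are hypothetical and not taken from the cited study.

```python
import random
from collections import Counter


def random_undersample_majority(rows, labels, seed=0):
    """Randomly drop rows so that every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(members) for members in by_class.values())
    kept_rows, kept_labels = [], []
    for label, members in by_class.items():
        # Sampling without replacement; the minority class is kept in full.
        kept_rows.extend(rng.sample(members, target))
        kept_labels.extend([label] * target)
    return kept_rows, kept_labels


# Toy training set: 14 property-damage-only, 4 injury, and 2 fatal crashes.
rows = list(range(20))
labels = ["PDO"] * 14 + ["injury"] * 4 + ["fatal"] * 2
balanced_rows, balanced_labels = random_undersample_majority(rows, labels)
print(Counter(balanced_labels))
```

The discarded majority rows are the information loss that the over-sampling alternatives try to avoid, so undersampling tends to be more attractive when the minority class is still reasonably large.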
Article
Tree-based machine learning models such as random forests, decision trees and gradient boosted trees are popular nonlinear predictive models, yet comparatively little attention has been paid to explaining their predictions. Here we improve the interpretability of tree-based models through three main contributions. (1) A polynomial time algorithm to compute optimal explanations based on game theory. (2) A new type of explanation that directly measures local feature interaction effects. (3) A new set of tools for understanding global model structure based on combining many local explanations of each prediction. We apply these tools to three medical machine learning problems and show how combining many high-quality local explanations allows us to represent global structure while retaining local faithfulness to the original model. These tools enable us to (1) identify high-magnitude but low-frequency nonlinear mortality risk factors in the US population, (2) highlight distinct population subgroups with shared risk characteristics, (3) identify nonlinear interaction effects among risk factors for chronic kidney disease and (4) monitor a machine learning model deployed in a hospital by identifying which features are degrading the model’s performance over time. Given the popularity of tree-based machine learning models, these improvements to their interpretability have implications across a broad set of domains. Tree-based machine learning models are widely used in domains such as healthcare, finance and public services. The authors present an explanation method for trees that enables the computation of optimal local explanations for individual predictions, and demonstrate their method on three medical datasets.
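To make the game-theoretic idea behind these explanations concrete, the sketch below computes exact Shapley values by brute-force enumeration of coalitions. It is not the polynomial-time tree algorithm described in the abstract; it is an exponential-time illustration on a hypothetical three-feature model, where features outside a coalition are replaced by an assumed all-zero baseline.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a coalition value function.

    value_fn takes a frozenset of feature indices and returns the model output
    when only those features are "present". Cost grows as 2**n_features.
    """
    n = n_features
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                phi[i] += weight * (value_fn(coalition | {i}) - value_fn(coalition))
    return phi


def toy_model(x):
    """Hypothetical model with one additive term and one interaction term."""
    return 2.0 * x[0] + x[1] * x[2]


INSTANCE = (1.0, 1.0, 1.0)   # the prediction being explained
BASELINE = (0.0, 0.0, 0.0)   # assumed reference values for absent features


def coalition_value(coalition):
    """Evaluate the toy model with absent features set to the baseline."""
    masked = [INSTANCE[j] if j in coalition else BASELINE[j] for j in range(3)]
    return toy_model(masked)


phi = shapley_values(coalition_value, n_features=3)
print(phi)
```

The attributions add up to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split evenly between the two features that produce it.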
Article
Urban expressways are the main arteries of the traffic network, and an in-depth analysis of crashes is crucial for improving the traffic safety level of expressways. This study addresses the injury severity of crashes on expressways in Beijing by proposing a Bayesian ordered logistic regression model. Crash data were collected from urban express rings and expressways in 2015 and 2016. The results showed that crash location, time, and crash season are significant variables influencing injury severity. The findings revealed that the proposed model can address the ordinal feature of injury severity, while accommodating data with small sample sizes that may not adequately represent population characteristics. The conclusions can provide the management departments with valuable suggestions for injury prevention and safety improvement on urban expressways.
Article
Road traffic accidents are one of the major problems facing the world. The carnage on Ghana’s roads has raised road accidents to the status of a ‘public health’ threat. The objective of the study is to identify factors that contribute to accident severity by fitting an ordinal regression model to data extracted from the database of the Motor Traffic and Transport Department for 1989 to 2019. The results of the ordinal logistic regression analyses show that the nature of cars, national roads, over speeding, and location (urban or rural) are significant indicators of crash severity. Strategies to reduce crash injuries should include physical enforcement through greater police presence on our roads as well as technology. There is also the need to train drivers to be more vigilant in their travels, especially on the national roads and in the urban areas. The recommendation is that well thought out and contextualised written laws and sanctioned schemes to monitor and enforce strict compliance with road traffic rules should be put in place.
Article
In the state of Florida, work-zone-related crashes and their resulting injury severities have been increasing recently, particularly over the 2015 to 2017 time period. In the current study, we seek to provide insights into the factors that have been influencing this trend. Using work zone data from the 2012 to 2017 time period, resulting driver-injury severities in single-vehicle work zone crashes were studied using random parameters logit models that allow for possible heterogeneity in the means and variances of parameter estimates. The available data included a wide variety of factors known to influence driver injury severity, including data related to the crash characteristics, vehicle characteristics, roadway attributes, prevailing traffic volume, driver characteristics, and spatial and temporal characteristics. The model estimates produced significantly different parameters for each year from 2012 to 2017, and a fundamental shift in unobserved heterogeneity, suggesting statistically significant temporal instability. In addition, in several key instances, the marginal effects of individual parameter estimates show marked differences between one year and the next. However, these findings may not be the sole result of variations in driver behavior over time, as has been argued in past research that has found temporal instability. This is because each work zone has a unique set of characteristics and, with the sample of work zones changing from one year to the next as highway maintenance and construction are undertaken in different locations, this work-zone sample variation could be a substantial source of the observed temporal instability.
Article
Purpose: Machine-learning methods are flexible prediction algorithms with potential advantages over conventional regression. This study aimed to use machine learning methods to predict post-total knee arthroplasty (TKA) walking limitation, and to compare their performance with that of logistic regression. Methods: From the department's clinical registry, a cohort of 4026 patients who underwent elective, primary TKA between July 2013 and July 2017 was identified. Candidate predictors included demographics and preoperative clinical, psychosocial, and outcome measures. The primary outcome was severe walking limitation at 6 months post-TKA, defined as a maximum walk time ≤ 15 min. Eight common regression (logistic, penalized logistic, and ordinal logistic with natural splines) and ensemble machine learning (random forest, extreme gradient boosting, and SuperLearner) methods were implemented to predict the probability of severe walking limitation. Models were compared on discrimination and calibration metrics. Results: At 6 months post-TKA, 13% of patients had severe walking limitation. Machine learning and logistic regression models performed moderately [mean area under the ROC curve (AUC) 0.73-0.75]. Overall, the ordinal logistic regression model performed best, while the SuperLearner performed best among machine learning methods, with negligible differences between them (Brier score difference, < 0.001; 95% CI [-0.0025, 0.002]). Conclusions: When predicting post-TKA physical function, several machine learning methods did not outperform logistic regression, in particular ordinal logistic regression that does not assume linearity in its predictors. Level of evidence: Prognostic level II.
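The two metrics named in this abstract can be illustrated with a small sketch: AUC (discrimination) as the fraction of positive/negative pairs ranked correctly, and the Brier score as the mean squared error of predicted probabilities. The labels and probabilities below are made up for demonstration and have no connection to the study's data. The function names are likewise illustrative.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Quadratic in sample size, so suitable only for small illustrations.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes and predicted probabilities.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, y_prob)
brier = brier_score(y_true, y_prob)
```

A lower Brier score is better, while a higher AUC is better. The Brier score responds to both ranking and calibration, whereas AUC depends only on how the predictions are ordered.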
Article
Large truck rollover crashes present significant financial, industrial, and social impacts. This paper presents an effort to investigate the contributing factors to large truck rollover crashes. Specific focus was placed on exploring the role of heterogeneity and the potential sources of heterogeneity regarding their impacts on injury severity outcomes. The data used in this study contained large truck rollover crashes that occurred between 2007 and 2016 in the state of Florida. A random parameter ordered logit (RPOL) model was applied. Various driver, vehicle, roadway, and crash attributes were explored as potential predictors in the model. Their impacts were examined for the presence of heterogeneity. Interaction effects were then added to the random variables in order to detect potential sources of heterogeneity. Model results showed that the impacts of lighting conditions and driving speed had significant variation across observations, and this variation could be attributed to driver actions and driver conditions at the time of the crash, as well as driver vision obstruction. Findings from this study shed light on the direction, magnitude, and randomness of the factors that contribute to large truck rollover crashes. Findings associated with heterogeneity could help develop more effective and targeted countermeasures to improve freight safety. Driver education programs could be planned more efficiently, and advisory and warning signs could be designed in a more insightful manner by taking into account specific roadway attributes, such as sandy surfaces, downhill, curved alignment, unpaved shoulders, and lighting conditions.