
Analysis of Fatal Truck-Involved Work Zone Crashes in Florida: Application of Tree-Based Models

August 2021
Transportation Research Record Journal of the Transportation Research Board 2675(4):036119812110332


DOI:10.1177/03611981211033278

Authors:

Rajesh Gupta

Ford Motor Company

Hamidreza Asgari

Florida International University

Ghazaleh Azimi

Florida International University

Alireza Rahimi

Florida International University


This paper presents the results of an analysis focusing on large truck-involved work zone fatal crashes using seven-year crash data in the State of Florida. Decision tree/random forest models were applied to specifically detect critical crash patterns that result in a fatality outcome. Because of the imbalanced nature of crash severity data (very low frequency of fatal crashes compared with property damage only or injury), data were treated using random and systematic over-sampling techniques. Marginal effects were addressed using Shapley values to increase model explainability. From a methodological perspective, results showed that the combination of over-sampling techniques with ensemble random forests could significantly improve model performance in predicting fatal crashes (compared with conventional logistic regression models). Primary contributors included pedestrian involvement, lighting conditions, safety equipment, driver condition, driver age, and work zone locations. For pedestrian crashes, factors such as dark-not lighted conditions, distracted truck driver, and driver’s age (young drivers outside city limits, senior drivers inside city limits) were highly likely to be fatal. For non-pedestrian crashes, the combination of front airbag deployment with any restraint system other than shoulder and belt was quite likely to be fatal. Also, abnormal driver conditions increased the risk of a fatal outcome. Additionally, the presence of female drivers (as the second driver in multiple vehicle crashes) highly decreased crash severity, probably because females typically drive more carefully than males. Interestingly, truck driver actions and maneuvers as well as roadway design and other physical environment features (i.e., number of lanes, median type, roadway grade, and alignment) did not show significant contribution to the model.


Research Article

Transportation Research Record

1–19

© National Academy of Sciences: Transportation Research Board 2021


Analysis of Fatal Truck-Involved Work Zone Crashes in Florida: Application of Tree-Based Models

Rajesh Gupta, Hamidreza Asgari, Ghazaleh Azimi, Alireza Rahimi, and Xia Jin


Department of Statistics, University of Lucknow, Lucknow, Uttar Pradesh, India

Department of Civil and Environmental Engineering, Florida International University, Miami, FL

Corresponding Author: Xia Jin, xjin1@fiu.edu

By definition, a work zone is an area where roadwork takes place, and it may involve lane closures, detours, and moving equipment (1). As roadways age, the growing need for timely maintenance has increased the number of work zones in recent years. Because work zones usually disrupt the traffic stream and expose workers and their machinery to risk in vulnerable locations, such sites are dangerous places for both workers and drivers. Statistics showed that work zone crashes resulted in 667 fatalities and about 40,000 injuries in 2009 in the U.S.A. (2). In addition, according to National Work Zone Safety, work zone crash fatalities have consistently increased in recent years, growing by approximately 36% from 2010 to 2017. In particular, the State of Florida has been ranked as the second most dangerous state for work zone fatalities since around 2010, with an average of 67 fatal crashes per year (3,4).

With the above in mind, work zone crash analysis has received increasing attention in safety analysis and decisions. There is an abundant body of research in work zone crash analysis focusing on various aspects, from crash frequencies and severity outcomes at the aggregate level (5–8) to disaggregate analyses that predict crash severity levels based on crash-specific attributes (9–11). From a methodological standpoint, parametric model structures have been widely used in crash severity analysis. In particular, a variety of discrete choice and logistic regression models have been formulated in the literature, including multinomial logit models (12–15), ordered response models (16–22), nested logit models (23–25), and random parameter (mixed) logit models (26–33). While logistic regression models demonstrate an overall satisfactory performance in establishing relationships between input parameters and the target variable, their prediction accuracy might be questionable in the presence of imbalanced data. This is particularly important in crash analysis given its intrinsic data imbalance issue (i.e., low number of fatalities compared with other accident categories) (34,35). When evaluating models that deal with imbalanced datasets, it is important to assess how the model performs on predicting individual classes, including the minority class, rather than the overall performance, which is usually biased toward the majority class. Furthermore, logistic models rely on several restrictive assumptions because of their parametric nature, which may not hold true in real-life conditions.

To address the aforementioned shortcomings of conventional models, some researchers have proposed alternative solutions, such as applying machine learning and data mining techniques in crash severity analysis. In particular, tree-based models have been used to identify decision rules (patterns) that lead to more severe crashes (36–42). In general, decision trees help avoid violations of the statistical assumptions involved in conventional parametric models and potentially offer a better treatment of categorical features; however, they have their own disadvantages, including the absence of sensitivity analysis and marginal effects assessment (34,35). To solve this issue, some researchers have proposed combining both parametric and non-parametric approaches (37,41).

The research presented in this paper is an effort to analyze work zone crash severities in the State of Florida. This study has several technical and practical motivations. First and foremost, it aims to identify factors associated with work zone fatal crashes for the purpose of equipping the Florida Department of Transportation with clear and transparent information to develop countermeasures and improve work zone safety. Second, this study focuses on crashes where large trucks are involved. Large trucks are considered high risk in work zones because of their larger sizes and lower capabilities for fast reactions and preventive maneuvers, as well as the higher levels of losses involved. Third, both non-parametric (decision trees) and parametric (logistic regression) tools are applied for the predictive analytics, and the results are directly compared in view of model accuracy and prediction errors. In particular, standard machine learning techniques such as data balancing and Shapley values are used. The former is used to improve the models’ predictions for the minority group, while the latter explains the marginal effects of each contributing factor toward crash severity.

Literature Review

Crash Contributing Factors in Work Zones

Several studies have focused on work zone crash severity and how different factors contribute to the crash outcome. In particular, researchers have investigated the impact of several different parameters, including driver and vehicle attributes, time of day, and roadway features, as well as weather and lighting conditions (7–9,30,43–48). In most cases, severity is positively correlated with higher posted speed limits, streetlight conditions during nighttime, drug/alcohol influence, and truck involvement. In contrast, and as expected, the use of restraint systems (e.g., seatbelts) and airbag deployment, along with the presence of work zone control devices (such as flags, cones, or flashing lights), tend to reduce the level of severity. Some other studies have gone further and incorporated drivers’ behavior into severity analysis (44,49–52). According to these studies, actions such as violating the speed limit, inattentive driving, improper passing or lane changing, and following too closely lead to significantly higher severity levels. With respect to truck involvement, there is agreement that the involvement of trucks/large trucks increases crash occurrence or crash severity (44,49,53–55).

Discussion on Crash Severity Modeling Approaches

A quick review of the literature reveals the predominance of logistic regression methods in crash severity analysis. Applications of different forms of binary, multinomial, ordered, and nested logistic regression have been documented. The logistic regression model offers several advantages. In particular, it establishes convenient probability scores for different observations, does not require a linear relationship between independent features and the target variable, and is computationally efficient in time and memory requirements. Moreover, the model can be easily interpreted based on the coefficients and t-values associated with each of the independent variables (56–58). Also, the model can be specified to comply with additional assumptions, such as ordered logit models to account for ordered dependent variables, nested structures to incorporate certain types of correlations within different classes, and random parameter models to incorporate heterogeneity (59,60). Despite these advantages, logistic regression has its own restrictions. In particular, the model assumes that there is a linear relationship between the independent variables and the logit (log odds) of the dependent variable. Transformations are required when non-linear relationships are observed. In addition, the model has a low capability of handling numerous categorical variables. The presence of categorical variables, the lack of a linear relationship between the parameters and the logit of the target variable, and multi-collinearity are some of the major modeling drawbacks associated with logistic regression models in crash and safety analysis (57).

To address these limitations, machine learning techniques were employed as the next step in the literature. In particular, non-parametric approaches such as decision trees and support vector machines (SVM) have gained increasing attention because of their capability to relax the restrictions on data distribution properties. Successful instances of machine learning methods in safety analysis have been documented in recent years (61–68). While machine learning approaches tend to increase the predictive power of the models, they are more of a “black box” in nature and have usually suffered from a lack of explainability. Some studies have tried to address the explainability issue by combining both parametric and non-parametric approaches. For instance, researchers combined decision trees with logistic regression models (9) or with probit models (41). There was agreement that the combined approach would improve the models’ performance by removing the interaction effects observed in parametric models and also resolve the explainability issues associated with non-parametric structures.

Data Imbalance

One major issue in safety analysis is the natural imbalance observed in crash data. Data imbalance refers to the situation where observations from one specific class (also called the minority class) have remarkably lower frequency compared with other classes in the dataset (69–71). Data imbalance is a critical issue for several reasons. First, in most empirical cases, it is the minority class that the modelers are interested in (e.g., fatal crashes in this case). Second, the cost of misclassification for the minority class is usually much higher compared with other types of misclassification. In the case of crash analysis, while crash fatalities are quite rare compared with property damage only (PDO) or injury crashes, the associated loss is remarkably higher.

In practice, imbalanced data is the root of one major modeling-related issue, theoretically referred to as the “accuracy paradox” (72). Accordingly, a model can still have a satisfactory overall accuracy while it performs quite poorly on the minority class (34). Considering the importance of data imbalance and its prevalence in the real world, modelers have been looking for appropriate strategies to tackle this challenge. In particular, successful applications of data balancing techniques and consequent model improvements have been recently reported in the safety analysis literature (73–76). Ahmadi et al. showed that an SVM model slightly outperformed multinomial and mixed logit models, provided that model parameters are efficiently tuned. They addressed data imbalance by tuning separate cost parameters in their SVM model structure (73). Lamba et al. (74) used a variety of over- and under-sampling techniques and showed that a combination of algorithmic feature selection with random over-sampling provides the best model performance for precision and recall. Yahaya et al. used a variety of variable selection techniques and combined them with the SMOTE over-sampling strategy. They concluded that using balanced data can be a more efficient approach to identify the most prolific predictors of the crash injury severity (75). Fiorentini and Losa applied the random under-sampling of the majority class (RUMC) method and showed that the models built on balanced data had significantly higher true positive rates in both logistic regression and machine learning (random tree, random forest, and k-nearest neighbors) models. It was concluded that the use of balanced datasets seems to be essential for correctly predicting crashes with higher severities (76).

Data Description

The study used a dataset extracted from the Signal Four Analytics website (https://S4.Geoplan.Ufl.Edu/). The website has been developed by the Florida Department of Highway Safety and Motor Vehicles and includes statewide crash records. It comprises information about crashes collected by Florida Highway Patrol officers. Driver information, vehicle features, crash characteristics, and environment-related information at the time of a crash are recorded on the website. The focus of this study is on large truck crashes that occurred in work zones between 2010 and 2016. Large trucks are defined as trucks with a gross vehicle weight rating higher than 10,000 lb. The final dataset includes a total of 5,402 records, of which 75.7% were PDO crashes, 22.7% were injury crashes, and 1.6% were fatal crashes.
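
The class shares reported above make the accuracy paradox concrete. The following is a minimal sketch, not part of the study: the counts are approximations rounded from the reported percentages, and the always-PDO classifier is a hypothetical baseline. It shows that a classifier that never predicts a fatality can still reach an accuracy of roughly 76% while its recall on fatal crashes is zero.

```python
# Approximate class counts reconstructed from the reported shares of 5,402 records.
# These are rounded illustrations, not the study's exact counts.
TOTAL = 5402
shares = {"PDO": 0.757, "injury": 0.227, "fatal": 0.016}
counts = {label: round(TOTAL * share) for label, share in shares.items()}

def majority_baseline(counts):
    """Score a hypothetical classifier that always predicts the largest class."""
    majority = max(counts, key=counts.get)
    n = sum(counts.values())
    accuracy = counts[majority] / n
    # Every majority-class record is found; no record of any other class is.
    recall = {label: (1.0 if label == majority else 0.0) for label in counts}
    return majority, accuracy, recall

majority, accuracy, recall = majority_baseline(counts)
print(majority, round(accuracy, 3), recall["fatal"])
```

This is why the evaluation in this paper looks at per-class measures for fatal crashes rather than overall accuracy alone.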

For descriptive analysis, this study is focused on the fatal crashes. Driver characteristics are those associated with the at-fault drivers. The at-fault driver was determined and reported by the police officers who responded to the accidents. Figure 1 illustrates at-fault driver action at the time of the crash for fatal crashes. The predominant category was operating the motor vehicle in a careless or negligent manner (34.8%), followed by other contributing action (12.4%), no contributing action (11.2%), and failure to keep in the proper lane (9.0%).

Table 1 shows the descriptive statistics for fatal crashes. Operating the motor vehicle in a careless or negligent manner (34.8%) was the predominant driver action category. Furthermore, normal conditions (50.6%) had the highest percentage among driver condition categories, and the majority (50.6%) of drivers were not distracted. The majority of crashes happened on straight alignment (88.8%) and level grade (85.4%). Most of the work zone crashes occurred on two-way divided traffic ways with a positive median barrier (64%) and two-way undivided traffic ways (24.7%). Daylight conditions (49.4%) and clear weather (64%) had the highest percentages for light and weather conditions, respectively. For roadway type, interstate (42.7%), state (25.8%), and U.S. roads (12.4%) had most of the work zone crashes. Also, most crashes (73%) happened outside of the city limits. As to work zone type, work on shoulder or median (58.4%) and lane closure (24.7%) had the highest frequency among all categories. As to the restraint system, a significant percentage (30%) of the drivers did not use any restraint system. Moreover, rear-end crashes (31.5%) and crashes with a pedestrian (14.6%) were the major crash type categories. Motor vehicle in transport (65.2%, per Table 1) had the highest percentage among most harmful events.

Methodology

The methodology consists of four steps:

1. Data balancing using well-known resampling techniques
2. Variable selection using random forest feature importance
3. Development of parametric and non-parametric models using both raw and balanced data
4. Use of Shapley values to assess the marginal effects of the contributing factors

Data Resampling

Different resampling techniques have been introduced in theory (77–80), and several cases of their applications both in research and industry have been documented in the modeling and data science literature (81–84). This study explored both random over-sampling, as the simplest over-sampling method, and a more systematic approach known as the synthetic minority over-sampling technique for nominal and continuous features (SMOTE-NC).

Random over-sampling involves supplementing the training data with multiple copies of some of the minority-class records. One issue with random over-sampling is that it just naively duplicates existing records. Therefore, although classification algorithms are exposed to a greater number of observations from the minority class, they will not learn much more about how to set minority and majority classes apart. In other words, the new dataset does not contain more information about the characteristics of the minority class than the original data.
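
Random over-sampling as described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `random_oversample`, the toy records, and the choice to bring every class up to the size of the largest class are all assumptions made for the example.

```python
import random
from collections import Counter

def random_oversample(records, labels, seed=0):
    """Duplicate randomly chosen records of each smaller class until every
    class is as frequent as the largest one (sampling with replacement)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_records, out_labels = list(records), list(labels)
    for label, count in counts.items():
        pool = [r for r, y in zip(records, labels) if y == label]
        for _ in range(target - count):
            out_records.append(rng.choice(pool))  # exact copy, no new information
            out_labels.append(label)
    return out_records, out_labels

# Toy data (hypothetical): 6 PDO, 3 injury, and 1 fatal record.
X = [[i] for i in range(10)]
y = ["PDO"] * 6 + ["injury"] * 3 + ["fatal"]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

In the toy data the single fatal record is simply repeated, which illustrates the limitation noted above: the balanced set contains more fatal rows but no new information about them.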

A more advanced alternative to random over-sampling is SMOTE. Instead of duplicating existing records, it creates new synthetic records based on the existing observations. The SMOTE algorithm is parameterized with k (the number of nearest neighbors it will consider) and the amount of over-sampling required (the number of new points to be created). Each step of the algorithm will:

1. Randomly select a minority point.
2. Randomly select any of its k nearest neighbors belonging to the same class.
3. Randomly specify a lambda value in the range [0, 1].
4. Generate and place a new point on the vector between the two points, located a fraction lambda of the way from the original point toward the selected neighbor.
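
The four steps above can be sketched as follows. This is a minimal illustration for continuous features only, assuming Euclidean distance and a plain list-of-lists data layout; the function name `smote` and its parameters are chosen for the example and do not refer to any particular library.

```python
import math
import random

def smote(minority, k=5, n_new=10, seed=0):
    """Generate n_new synthetic points from continuous minority-class records."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Step 1: randomly select a minority point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Step 2: randomly select one of its k nearest minority neighbors.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # Step 3: draw lambda uniformly from [0, 1).
        lam = rng.random()
        # Step 4: place the new point on the segment from base toward neighbor.
        synthetic.append([b + lam * (q - b) for b, q in zip(base, neighbor)])
    return synthetic

# Toy minority class (hypothetical): the four corners of the unit square.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(points, k=2, n_new=5)
```

With k=2 each corner's nearest neighbors are its two adjacent corners, so every synthetic point in this toy case falls on an edge of the square rather than on an existing record.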

Figure 1. At-fault driver action at the time of crash for fatal crashes.


Table 1. Descriptive Analysis for Fatal Crashes

| Variable | Category | % |
|---|---|---|
| Driver gender | Male | 78.7 |
| | Female | 21.3 |
| Driver action | No contributing action | 11.2 |
| | Operated motor vehicle in careless or negligent manner | 34.8 |
| | Failed to yield right-of-way | 6.7 |
| | Improper backing | 2.2 |
| | Improper turn | 1.1 |
| | Ran red light | 1.1 |
| | Drove too fast for conditions | 3.4 |
| | Ran stop sign | 1.1 |
| | Exceeded posted speed | 2.2 |
| | Wrong side or wrong way | 5.6 |
| | Failed to keep in proper lane | 9.0 |
| | Ran off roadway | 4.5 |
| | Disregarded other traffic sign | 2.2 |
| | Over-correcting/over-steering | 1.1 |
| | Operated motor vehicle in erratic, reckless, or aggressive manner | 1.1 |
| | Other contributing action | 12.4 |
| Driver condition | Apparently normal | 50.6 |
| | Asleep or fatigued | 1.1 |
| | Seizure, epilepsy, blackout | 1.1 |
| | Physically impaired | 1.1 |
| | Under the influence of medication/drug/alcohol | 11.2 |
| | Other | 6.7 |
| | Unknown | 28.1 |
| Driver distracted | Not distracted | 50.6 |
| | Other inside the vehicle | 2.2 |
| | External distraction | 1.1 |
| | Inattentive | 4.5 |
| | Unknown | 41.6 |
| Roadway alignment | Straight | 88.8 |
| | Curve right | 4.5 |
| | Curve left | 6.7 |
| Roadway grade | Level | 85.4 |
| | Uphill | 5.6 |
| | Downhill | 7.9 |
| | Sag | 1.1 |
| Type of shoulder | Paved | 65.2 |
| | Unpaved | 25.8 |
| | Curb | 9.0 |
| Traffic-way | Two-way, not divided | 24.7 |
| | Two-way, not divided, with a continuous left turn lane | 1.1 |
| | Two-way, divided, unprotected (painted >4 ft) median | 3.4 |
| | Two-way, divided, positive median barrier | 64.0 |
| | One-way traffic-way | 5.6 |
| | Unknown | 1.1 |
| Light conditions | Daylight | 49.4 |
| | Dusk | 2.2 |
| | Dark-lighted | 20.2 |
| | Dark-not lighted | 28.1 |
| Weather conditions | Clear | 64.0 |
| | Cloudy | 32.6 |
| | Rain | 2.2 |
| | Fog, smog, smoke | 1.1 |
| Road system identifier | Interstate | 42.7 |
| | U.S. | 12.4 |
| | State | 25.8 |
| | County | 4.5 |
| | Local | 7.9 |
| | Turnpike/toll | 6.7 |
| Within city limits | No | 73.0 |
| | Yes | 27.0 |
| Work zone type | Lane closure | 24.7 |
| | Lane shift/crossover | 2.2 |
| | Work on shoulder or median | 58.4 |
| | Intermittent or moving work | 7.9 |
| | Other | 6.7 |
| Restraint system | Not applicable (non-motorist) | 1.2 |
| | None used - motor vehicle occupant | 30.2 |
| | Shoulder and lap belt used | 66.3 |
| | Shoulder belt only used | 1.2 |
| | Other | 1.2 |
| Airbag deployed | Not applicable | 39.3 |
| | Not deployed | 20.2 |
| | Deployed - front | 29.2 |
| | Deployed - side | 1.1 |
| | Deployed - combination | 5.6 |
| | Unknown | 4.5 |
| Crash type | Head on | 5.6 |
| | Left entering | 1.1 |
| | Off road | 5.6 |
| | Opposing sideswipe | 2.2 |
| | Other | 5.6 |
| | Pedestrian | 14.6 |
| | Rear end | 31.5 |
| | Right angle | 9.0 |
| | Rollover | 2.2 |
| | Same direction sideswipe | 5.6 |
| | Unknown | 3.4 |
| | Backed into | 2.2 |
| | Parked vehicle | 11.2 |
| Most harmful event | Overturn/rollover | 1.1 |
| | Fire/explosion | 2.2 |
| | Cargo/equipment loss or shift | 1.1 |
| | Pedestrian | 14.6 |
| | Motor vehicle in transport | 65.2 |
| | Parked motor vehicle | 7.9 |
| | Struck by falling, shifting cargo or anything set in motion by motor vehicle | 1.1 |
| | Bridge pier or support | 1.1 |
| | Cable barrier | 2.2 |
| | Tree (standing) | 1.1 |
| | Utility pole/light support | 1.1 |
| | Ran off roadway, right | 1.1 |

Figure 2 provides an illustration of the SMOTE methodology.

SMOTE’s main advantage over traditional naïve methods is that it creates synthetic observations instead of reusing existing observations, and therefore the classifier is less likely to overfit. However, one should always make sure that the synthetic observations created by SMOTE are realistic and make sense.

The SMOTE algorithm can be further generalized to deal with nominal (categorical) attributes. The new algorithm, known as SMOTE-NC, penalizes a mismatch in a nominal feature by the median of the standard deviations of all continuous features in the minority class. This median is then used in calculating the Euclidean distance when searching for the k nearest neighbors. Finally, the synthetic sample is populated by generating the continuous features using the same algorithm as SMOTE, and the nominal features are defined by the majority vote of the k nearest neighbors (85,86).
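
The distance and voting rules described above can be sketched as follows. This is a hedged illustration rather than the study's code: it assumes that each mismatched nominal feature adds the squared median to the squared Euclidean distance, and it uses the population standard deviation, which is an assumption. The feature layout (age, speed, lighting) and all function names are hypothetical.

```python
import math
import statistics

def median_std(minority, cont_idx):
    """Median of the standard deviations of the continuous features."""
    return statistics.median(
        [statistics.pstdev([row[i] for row in minority]) for i in cont_idx]
    )

def smote_nc_distance(a, b, cont_idx, nom_idx, med):
    """Euclidean distance in which each differing nominal feature adds med**2."""
    total = sum((a[i] - b[i]) ** 2 for i in cont_idx)
    total += sum(med ** 2 for i in nom_idx if a[i] != b[i])
    return math.sqrt(total)

def nominal_vote(neighbors, nom_idx):
    """Most frequent category among the neighbors for each nominal feature."""
    return {i: statistics.mode([row[i] for row in neighbors]) for i in nom_idx}

# Hypothetical minority records: [driver age, speed, lighting condition].
rows = [[25.0, 45.0, "dark"], [35.0, 55.0, "dark"], [45.0, 65.0, "daylight"]]
med = median_std(rows, cont_idx=[0, 1])
d_same = smote_nc_distance(rows[0], rows[1], [0, 1], [2], med)  # lighting matches
d_diff = smote_nc_distance(rows[1], rows[2], [0, 1], [2], med)  # lighting differs
vote = nominal_vote(rows, nom_idx=[2])
```

The two pairs are equally far apart on the continuous features, so the larger value of `d_diff` comes entirely from the nominal mismatch penalty.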

Model Structure

Decision trees are among the most popular practical methods in the realm of machine learning and predictive modeling. A decision tree is a non-parametric technique used in classifying discretely labeled datasets (87). Compared with other machine learning techniques, decision trees have some important advantages. From the audience’s point of view, this approach is intuitive and easy to explain. From the analyst’s perspective, it allows for implicit variable screening and feature selection, requires relatively little effort on data preparation, and is not affected by prevalent non-linear relationships across the parameters (88–90). In more detail, decision trees classify instances by sorting them down a tree starting from the root node, traversing through several internal nodes (also known as decision nodes), and finally reaching a leaf (terminal) node, where the instance is assigned one of the existing class labels (as illustrated in Figure 3).

One of the popular tree development algorithms that has been widely used in recent years is Iterative Dichotomiser 3 (ID3) (91). The ID3 algorithm develops decision trees by building them from the top down, splitting the sample at each node using a certain pre-selected feature and a certain threshold, and continuing until a termination criterion has been reached. With the above in mind, it is evident that the learning algorithm should cover two main issues. First, what are the most effective performance measures that could be applied during attribute evaluation? And second, when and how should the splitting procedure stop?

Figure 2. Schematic view of synthetic minority over-sampling technique (SMOTE) resampling.

With respect to the first question, decision trees generally rely on the concept of “impurity” when splitting the dataset. Impurity can be defined as an index of non-homogeneity, that is, the presence of different target classes in the sample. The goodness-of-split can be computed based on improvements in impurity. Three major indices have been introduced and applied in the literature, namely the entropy index, the information gain index, and the Gini index. The entropy index is a way to measure impurity; the ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous, then the entropy is zero, and if the sample is equally divided between two classes, then the entropy is one. Information gain calculates the reduction in entropy and measures how well a given feature separates or classifies the target classes. The feature with the highest information gain is selected as the best one. The Gini index shows the probability of incorrect classification for a randomly chosen record from the specific node in the data subset. The Gini score indicates how good a split is by how mixed the classes are in the two groups created by the split. A perfect separation results in a Gini score of zero, whereas the worst-case split for two classes, which leaves a 50/50 class mix in each group, results in a Gini score of 0.5.
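
The three impurity measures discussed above can be written down directly from their standard definitions. The sketch below is illustrative only; the function names are chosen for this example, and the entropy uses a base-2 logarithm so that a 50/50 two-class node has entropy one.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of the class labels in a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: chance of mislabeling a random record drawn from the node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

# Toy node (hypothetical): two fatal and two non-fatal records.
node = ["fatal", "fatal", "other", "other"]
perfect_split = [["fatal", "fatal"], ["other", "other"]]
gain = information_gain(node, perfect_split)
```

For the toy node, the entropy is one and the Gini impurity is 0.5; the perfect split leaves two pure children, so the information gain equals the full parent entropy.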

In response to the second question, the decision tree calls for a stopping condition that terminates the tree growing process. This is especially important given the risk of overfitting. While it might sound ideal that the tree grows until all records in each node belong to the same class (impurity = 0), such a tree tends to capture all the unwanted noise associated with the training data, therefore leading to poor performance on the test datasets. To avoid this, restrictions are usually applied to the tree growth process. This includes a variety of applicable techniques, ranging from simple limitations on the number of tree levels to more complicated pruning criteria (92–94). To address low model performance caused by early termination or pruning, ensemble techniques have been widely used to derive a final best model.

Feature Selection and Hyper Parameter Tuning

One important aspect of tree-based models is feature selection. Given the randomness involved in selecting features at each node, it is essential to run the model with the best subset of variables. To do this, random forests (RF) based on bagging (bootstrap aggregating) algorithms are used to select the most important features as inputs to the decision tree. RFs combine results from multiple decision trees using aggregate measures such as average or majority voting (95,96). RFs randomly develop hundreds or thousands of decision trees and then predict the ultimate outcome by aggregating their results. The ensemble bagging algorithm generally gives better results because it reduces the overall variance of the model and helps to avoid overfitting by giving more generalized results. Successful application of RFs has been widely documented in the literature (97,98).
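
The bagging idea described above, bootstrap resampling followed by vote aggregation, can be sketched as follows. This is a deliberately simplified illustration and not a random forest: `fit_majority_stub` is a hypothetical stand-in for a fitted decision tree that only remembers the most common label in its bag, and all names and toy data are assumptions.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bag)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the most common label among the individual predictions."""
    return Counter(predictions).most_common(1)[0][0]

def fit_majority_stub(bag):
    """Stand-in for a decision tree: always predicts the bag's most common label."""
    label = majority_vote([y for _, y in bag])
    return lambda x: label

def bagged_predict(models, x):
    """Aggregate the votes of all fitted models for one record x."""
    return majority_vote([model(x) for model in models])

# Toy data (hypothetical): 8 PDO and 2 fatal records as (features, label) pairs.
rng = random.Random(0)
data = [([i], "PDO") for i in range(8)] + [([i], "fatal") for i in range(8, 10)]
models = [fit_majority_stub(bootstrap_sample(data, rng)) for _ in range(25)]
prediction = bagged_predict(models, [3])
```

In a real random forest each model would be a tree grown on its bag with a random subset of features considered at each split; only the resampling and voting steps are shown here.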

Optimizing hyper parameters for machine learning models is another key step in making accurate predictions. Hyper parameters define characteristics of the model that can affect model accuracy and computational efficiency. They are typically set before fitting the model to the data. Cross-validation is used here along with random search and grid search. Grid search is a widely used approach for fine-tuning machine learning models (99–101).
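
Grid search combined with k-fold cross-validation can be sketched as below. This is a generic illustration under stated assumptions, not the tuning code used in the study: the parameter names `max_depth` and `min_samples_leaf`, the candidate values, and the `toy_score` function are hypothetical, and the folds are formed by simple interleaving without shuffling.

```python
import itertools

def k_fold_indices(n, k=10):
    """Yield (train_idx, valid_idx) pairs that partition range(n) into k folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, valid in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, valid

def grid_search(grid, score_fn, n, k=10):
    """Return the parameter combination with the best mean validation score."""
    best_params, best_score = None, float("-inf")
    keys = sorted(grid)
    for values in itertools.product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        scores = [score_fn(params, tr, va) for tr, va in k_fold_indices(n, k)]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score

def toy_score(params, train_idx, valid_idx):
    """Hypothetical score; a real one would fit on train_idx and score on valid_idx."""
    return -abs(params["max_depth"] - 5) - abs(params["min_samples_leaf"] - 2)

grid = {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 2, 4]}
best_params, best_score = grid_search(grid, toy_score, n=100)
```

Random search follows the same loop but samples a fixed number of parameter combinations instead of enumerating the full grid.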

Model Results

The imbalanced sample was randomly split into training and test sets using a 70/30 ratio (70% for training, 30% for testing). The training set was then resampled using random and SMOTE-NC over-sampling techniques, giving a total of three different training datasets: the original imbalanced data, the randomly resampled data, and the SMOTE-NC resampled data. The target variable (crash severity) consisted of three classes: PDO, injury, and fatal. A decision tree was initially developed for the raw imbalanced data (DT_1). The model was further optimized in three different directions: (i) feature selection, (ii) data balancing, and (iii) hyperparameter tuning. For feature selection, RF models were used (Figure 4). For data balancing, the models were run separately on the random and SMOTE-NC resampled datasets. Several tuning algorithms, including grid search and random search, were applied to determine the best-tuned estimators. At each tuning step, model accuracy was evaluated using 10-fold cross-validation to minimize model variance across different test sets and to avoid potential overfitting. Three RF models were developed separately for the raw data (RF_1) and the two resampled datasets (RF_2 and RF_3).
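The order of operations matters here: the split comes first, and only the training portion is resampled, so the test set keeps the original class imbalance. A minimal sketch of that workflow with random over-sampling follows. The class counts are invented for illustration and the helper names are assumptions; SMOTE-NC, which synthesizes new records rather than duplicating existing ones, is not shown.

```python
import random
from collections import Counter


def train_test_split(rows, test_share=0.30, seed=0):
    """Shuffle a copy of the rows and hold out test_share of them."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


def random_oversample(train_rows, label_of, seed=0):
    """Duplicate randomly chosen records of the smaller classes until every
    class has as many records as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in train_rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced


# Invented toy sample (label, record id); the real class shares differ.
rows = ([("PDO", i) for i in range(700)]
        + [("injury", i) for i in range(250)]
        + [("fatal", i) for i in range(50)])

train, test = train_test_split(rows)           # 70/30 split on the raw data
balanced = random_oversample(train, label_of=lambda r: r[0])
class_counts = Counter(label for label, _ in balanced)
```

Every record in `balanced` is a copy of a training record, so no information from the held-out test set leaks into model fitting.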

Figure 3. Fundamental elements of a decision tree.

Model performance results are presented in Table 2. Precision and recall on fatal crashes were considered the main metrics for evaluating model performance. Recall is the ratio of true positives to the sum of true positives and false negatives. Precision is the ratio of true positives to the sum of true positives and false positives. For the specific problem discussed here, recall (reducing type II error) arguably takes higher priority. In other words, a low precision (i.e., overestimating fatal crashes, or type I error) is not as critical as a low recall (i.e., missing a fatal crash, or type II error). Both measures can also be checked simultaneously through the F1 score (defined as the harmonic mean of precision and recall).
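The three definitions translate directly into code. The sketch below uses hypothetical helper names and, as a check, recomputes the F1 score from the test-set fatality precision and recall reported in Table 2 for four of the models.

```python
def precision(tp, fp):
    """True positives over all predicted positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """True positives over all actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Test-set (precision, recall) on the fatality class, as reported in Table 2.
reported = {
    "OL_1": (0.67, 0.10),
    "RF_1": (0.00, 0.00),
    "RF_2": (0.40, 0.57),
    "DT_Final": (0.33, 0.62),
}
f1_by_model = {name: round(f1_score(p, r), 2) for name, (p, r) in reported.items()}
```

The harmonic mean punishes imbalance between the two measures: OL_1 has the highest fatality precision of the four but, with a recall of only 0.10, its F1 score stays low.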

For comparison, three ordered logistic regression models were also developed to represent conventional approaches. As shown in Table 2, the logistic regression on the raw data (OL_1) showed a relatively high accuracy on the test data (78.5%); however, further investigation reveals an accuracy paradox, since recall on the minority group (fatality) was quite low (0.1), indicating that the model captured only 10% of the fatal crashes. With random over-sampling, the logistic regression model (OL_2) improved: in particular, recall on fatal crashes increased to 57%. However, this was accompanied by a large drop in precision (from 0.67 to 0.12), indicating that the model had a very high type I error (false positives, overestimating fatal crashes). The logistic regression on SMOTE-NC resampled data (OL_3) slightly improved the precision on fatal crashes and therefore provided the best F1 score among the conventional models; however, this came at the expense of a reduction in the overall accuracy of the model. Details of the optimized regression models on the resampled datasets are presented in the Appendix (Table A1).

Stepping further into the tree-based models, the deci-

sion tree built on the raw data (DT_1) similarly suffered

from the accuracy paradox, though it provided a good

overall accuracy (78%) on the test data. In particular,

the results were highly biased to the majority class (PDO

crashes) with low recall values on both injury (0.19) and

fatality (0.05) groups, and this was more tangible on the

test data. This indicates that the model built on the

imbalanced data was not able to provide satisfactory

predictions on fatality instances. A similar drawback is

observed in the RF model on imbalanced data, where

the model does not capture any fatal crash on the test

data. This indicates that in the presence of imbalanced

data, even a sophisticated ensemble algorithm such as

RF might lose its efficiency.
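The accuracy paradox is easy to reproduce with a degenerate classifier. In the sketch below, the class counts are hypothetical (chosen only to resemble a heavily imbalanced severity distribution) and the "model" predicts PDO for every crash.

```python
def accuracy(y_true, y_pred):
    """Share of records whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_recall(y_true, y_pred, cls):
    """Share of records of class cls that were predicted as cls."""
    actual = sum(t == cls for t in y_true)
    hits = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    return hits / actual if actual else 0.0


# Hypothetical test set: 79% PDO, 18% injury, 3% fatal.
y_true = ["PDO"] * 790 + ["injury"] * 180 + ["fatal"] * 30
# Degenerate classifier that always predicts the majority class.
y_pred = ["PDO"] * len(y_true)

acc = accuracy(y_true, y_pred)                       # high despite learning nothing
fatal_recall = class_recall(y_true, y_pred, "fatal")  # no fatal crash is captured
```

Overall accuracy simply mirrors the majority-class share in this case, which is why class-level recall on fatal crashes is the more informative measure for imbalanced data.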

Results of both RF models on over-sampled data (RF_2 and RF_3) provided high performance measures for both training and test data. On fatal crashes in the test data, random over-sampling outperformed the SMOTE-NC method, with a much higher recall (0.57 versus 0.24) at a comparable precision (0.40 versus 0.42). Both models provided good predictions on different classes, including the minority class (fatal crashes). Based on the model performance, a final

model was developed using the decision tree method,

using the random over-sampled data and the top 20 fea-

tures (Figure 4) selected based on the results from RF_3.

The decision tree, unlike black box RFs, allows us to pursue the decision rules and specific patterns that lead to fatal crashes. The model performance is shown in Table 2 as DT_Final. Compared with the initial models and the conventional models, DT_Final provided better performance in both recall and F1 score (harmonic mean of precision and recall) on fatal crashes (F1 score = 0.43). In particular, the recall measure (i.e., correctly predicted fatal crashes as a percentage of total fatal crashes) increased from 5% to 62%.

Figure 4. Top 20 features based on mean Gini index.

Table 2. Model Performance Comparison

| Model | Group | Description |
|---|---|---|
| OL_1 | Conventional | Ordered logistic regression with imbalanced data |
| OL_2 | Conventional | Ordered logistic regression with random over-sampling |
| OL_3 | Conventional | Ordered logistic regression with SMOTE-NC over-sampling |
| DT_1 | Machine learning | Decision tree with imbalanced data |
| RF_1 | Machine learning | Random forest with imbalanced data |
| RF_2 | Machine learning | Random forest with random over-sampling |
| RF_3 | Machine learning | Random forest with SMOTE-NC over-sampling |
| DT_Final | Machine learning | Decision tree with random over-sampling (only top 20 features) |

Training data

| Model | Overall accuracy (%) | F1 score (on minority group) | PDO precision | PDO recall | Injury precision | Injury recall | Fatality precision | Fatality recall |
|---|---|---|---|---|---|---|---|---|
| OL_1 | 79.50 | 0.26 | 0.82 | 0.96 | 0.62 | 0.28 | 0.69 | 0.16 |
| OL_2 | 62.00 | 0.74 | 0.63 | 0.69 | 0.45 | 0.44 | 0.76 | 0.72 |
| OL_3 | 72.00 | 0.89 | 0.67 | 0.64 | 0.59 | 0.61 | 0.89 | 0.91 |
| DT_1 | 80.00 | 0.39 | 0.8 | 0.98 | 0.7 | 0.22 | 0.78 | 0.26 |
| RF_1 | 82.00 | 0.45 | 0.81 | 1 | 0.94 | 0.28 | 1 | 0.29 |
| RF_2 | 84.00 | 0.98 | 0.74 | 0.84 | 0.82 | 0.69 | 0.97 | 1 |
| RF_3 | 88.00 | 0.97 | 0.84 | 0.87 | 0.84 | 0.82 | 0.97 | 0.97 |
| DT_Final | 76.00 | 0.91 | 0.64 | 0.79 | 0.72 | 0.62 | 0.95 | 0.87 |

Test data

| Model | Overall accuracy (%) | F1 score (on minority group) | PDO precision | PDO recall | Injury precision | Injury recall | Fatality precision | Fatality recall |
|---|---|---|---|---|---|---|---|---|
| OL_1 | 78.50 | 0.17 | 0.8 | 0.96 | 0.63 | 0.27 | 0.67 | 0.1 |
| OL_2 | 63.00 | 0.20 | 0.83 | 0.69 | 0.33 | 0.44 | 0.12 | 0.57 |
| OL_3 | 61.00 | 0.23 | 0.86 | 0.64 | 0.32 | 0.51 | 0.15 | 0.57 |
| DT_1 | 78.00 | 0.08 | 0.79 | 0.97 | 0.61 | 0.19 | 0.25 | 0.05 |
| RF_1 | 79.00 | 0.00 | 0.79 | 0.98 | 0.71 | 0.18 | 0 | 0 |
| RF_2 | 74.00 | 0.47 | 0.85 | 0.81 | 0.45 | 0.51 | 0.4 | 0.57 |
| RF_3 | 75.00 | 0.31 | 0.83 | 0.85 | 0.47 | 0.44 | 0.42 | 0.24 |
| DT_Final | 72.00 | 0.43 | 0.85 | 0.79 | 0.43 | 0.51 | 0.33 | 0.62 |

The next section focuses on DT_Final to analyze fatal crash patterns.

Contributing Factors

For the remainder of this paper, the term “primary contributors” refers to these top 20 variables. It should be noted that roadway and environmental factors such as traffic-way type, median type, road type, and weather conditions, as well as vehicle maneuvers and driver actions, did not appear among the primary contributors.

Further investigation of the decision tree reveals that

the severity of work zone crashes was highly dependent

on lighting conditions and pedestrian involvement.

Therefore, four major types of work zone crashes could

be identified:

1. Pedestrian crashes under dark-not lighted conditions (118 crashes in the resampled data, five crashes in the original sample). The results show that a large truck work zone crash involving pedestrians in not lighted conditions always resulted in a fatal outcome (100%).

2. Pedestrian crashes with some type of lighting (375 crashes in the resampled data, 41 crashes in the original sample). This was still a highly dangerous situation, although the presence of lighting slightly decreased the probability of a fatal outcome, to almost 88%. However, certain factors could still increase the crash severity, resulting in a fatal outcome (100%). These factors include:

   • The truck driver was distracted and the vehicle lacked airbags (52 crashes in the resampled data, three crashes in the original data).

   • A young driver (younger than 24 years) where the vehicle lacked airbag equipment and the work zone was located outside city limits (39 cases in the over-sampled data, one case in the original data).

   • An older driver (older than 60 years) where the vehicle lacked airbag equipment and the work zone was located within city limits (51 cases in the over-sampled data, one case in the original data).

   • The front airbag was deployed (94 cases in the over-sampled data, two cases in the original data). This is probably an indicator of a very hard impact. Though we do not have individual level severity information, this might point to a pedestrian fatality.

   • Airbags of any type were deployed and the truck driver was not using any restraint system (e.g., seatbelt) (45 cases in the resampled data, one case in the original data).

3. Non-pedestrian crashes under dark-not lighted conditions. Using a full restraint system (i.e., shoulder and lap belt) seemed essential to avoid severe outcomes. The combination of front airbag deployment with any other restraint system would indicate a fatal outcome (100%) (205 cases in the over-sampled data, six cases in the original data).

4. Non-pedestrian crashes under any type of lighting condition. It is evident that driver conditions had a significant impact on these types of crashes. Accordingly, an abnormal driver condition (fatigue, drug/alcohol influence, etc.) significantly increased the probability of a fatal outcome (65.5% versus 15.8%). A rear-end crash seemed to be the safest of crash types in these conditions. Otherwise, any non-pedestrian crash type with a front airbag deployment would indicate a 90% fatal outcome (479 cases in the over-sampled data, 65 cases in the original data).

In addition to the above, one interesting inference of the proposed model concerns the role of female drivers in the work zone area. No fatal crashes were observed in multiple-vehicle crashes when a female driver (usually as a not-at-fault driver) was present (153 cases in the over-sampled data, 126 cases in the original sample). This might indicate that female drivers drive more cautiously in work zone areas, for example, by reducing their speed to safe levels, which in turn affects other drivers’ speed when traversing work zones.

To summarize, from a planning perspective, the pro-

posed model suggests that fatal crashes could be pre-

dicted as a combination of several critical parameters,

including pedestrian involvement, lighting condition,

vehicle safety equipment, and driver condition.

Pedestrian involvement was highly fatal in work zones,

particularly when combined with dark-not lighted condi-

tions. The impact of age was highly correlated with work

zone location. Accordingly, young drivers were more

likely to be involved in a fatal crash in rural areas, while

senior drivers were associated with fatal crashes occur-

ring inside city limits. Driver distraction was another

critical factor in work zone crashes. With regard to

safety equipment, a full restraint system (i.e., shoulder

and lap belt) was essential to reduce fatality/injury rates.

The combination of any other restraint system with airbag deployment was highly fatal. Unlike pedestrian crashes, sideswipe and rear-end crashes seemed to be less severe. Interestingly, multiple-vehicle crashes involving female drivers were associated with lower crash severity.

Marginal Effects

As discussed earlier in the paper, in this study a relatively

new technique is used, known as Shapley values (also

known as SHAP), to obtain marginal effects of the

machine learning model features (102–104). The concept

was initiated by Shapley (102) and is based on game the-

ory. In simple language, the contribution (Shapley value)

of a feature (xi) can be calculated as the change imposed

on the classification probability when the feature is

added to a pre-defined set of features, S, in the model,

that is, $\Delta P_i = P(S \cup \{x_i\}) - P(S)$. Since there are different permutations of feature subsets, the value needs to be averaged across all possible permutations. Accordingly,

$$\varphi(x_i) = \sum_{S \subseteq N \setminus \{x_i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \left( P(S \cup \{x_i\}) - P(S) \right)$$

where

$\varphi(x_i)$ = average contribution (Shapley value) associated with $x_i$

$N$ = set of total features

$S$ = arbitrary subset of features in different permutations
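For a small number of features, the averaging over subsets can be carried out exactly. The sketch below is only an illustration of the definition: the value function `toy_probability`, its feature names, and its numeric effects are invented, and practical SHAP implementations rely on approximations or tree-specific algorithms rather than full enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_value(feature, features, value_fn):
    """Exact Shapley value of `feature`: the weighted average of its marginal
    contribution value_fn(S + feature) - value_fn(S) over all subsets S of
    the remaining features."""
    others = [f for f in features if f != feature]
    n = len(features)
    total = 0.0
    for size in range(len(others) + 1):
        # Weight |S|! (|N| - |S| - 1)! / |N|! for subsets of this size.
        weight = factorial(size) * factorial(n - size - 1) / factorial(n)
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += weight * (value_fn(s | {feature}) - value_fn(s))
    return total


def toy_probability(subset):
    """Hypothetical fatality probability given the set of 'active' features."""
    p = 0.10                                    # baseline with no features
    if "pedestrian" in subset:
        p += 0.40
    if "dark_not_lighted" in subset:
        p += 0.10
    if {"pedestrian", "dark_not_lighted"} <= subset:
        p += 0.30                               # interaction effect
    if "full_restraint" in subset:
        p -= 0.05
    return p


features = ["pedestrian", "dark_not_lighted", "full_restraint"]
phi = {f: shapley_value(f, features, toy_probability) for f in features}
```

In this toy setting the interaction term is split equally between the two interacting features, and the three values sum to the difference between the full-feature probability and the baseline, which is the additivity property that makes Shapley values convenient for explaining individual predictions.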

Figure 5 depicts the magnitudes of Shapley values

over all single observations across the sample and uses

the values to show the distribution of the impacts each

feature has on the model output. The color represents

the feature value, with red representing higher values

(compared with the mean) and blue indicating values

lower than the mean. In the case of binary features, red

indicates 1 while blue indicates 0. The horizontal axis

indicates the marginal effect (contribution) of the feature

at each single observation.

Focusing on the red color for binary features, one can infer that normal driver conditions, daylight, female drivers, rear-end crashes, sideswipe crashes, and curb shoulder types tend to decrease the probability of a fatal outcome. On the other hand, lack of restraint (or lack of sufficient restraint), presence of pedestrians, different types of airbag deployment, and driving on the wrong side or wrong way increase the likelihood of a fatal outcome. The contribution values can also be read from the horizontal axis. For instance, in certain cases, abnormal driver conditions can increase the fatality probability by 25%, and lack of a restraint system can increase it by almost 30%. Conversely, the presence of female drivers or sideswipe crashes can reduce the fatality probability by 10 to 20%.

Figure 5. Distribution of Shapley (SHAP) values.

While Figure 5 shows the spread of Shapley values across the whole sample, Figure 6 depicts the average Shapley values in the decision tree model and compares them with the corresponding values from the logistic regression model (OL_3). Pedestrian involvement, lack of restraint system, front airbag deployment, and not lighted conditions show a high average impact on increasing fatality probability. On the other hand, normal driver condition and airbags not deployed contribute to lower severity crashes. Compared with the marginal effects from OL_3, results are similar with regard to the direction of impact in most cases. It should be noted that two of the variables (distraction of driver at fault, and driver action = wrong side or wrong way) did not show significant impacts in the conventional model, and therefore their marginal effects are statistically insignificant. With regard to airbag deployment, two of the factors (i.e., lack of airbag equipment and front airbag deployment for the at-fault vehicle) show signs inconsistent with the Shapley values. However, readers should note the endogenous effects of airbag presence (or deployment) on crash severity, as stated in the literature (105). Since the ordered logistic model does not capture this endogenous effect, the positive impacts from Shapley values might be more reliable. Overall, marginal effects from conventional models tend to be higher (inflated) compared with Shapley values from the machine learning model. The authors believe this stems from the different approaches to calculating the two measures. Marginal effects only consider the exact set of variables for each observation and then compute the difference in probabilities for each outcome class, while Shapley values consider the average over a full permutation of all existing features for each observation.

Conclusions

This paper is an effort to explore large truck-involved

work zone crash patterns that result in fatalities. The

motivations for this study are two-fold. First, the escalat-

ing need for regular maintenance and reconstruction of

roadways and the significant investments in roadway

infrastructure have resulted in consistent growth in the

number of work zones, which consequently calls for effi-

cient safety plans that ensure the safety of both workers

and drivers in these construction areas. Second, and

based on the current state of the literature, it seems that

machine learning algorithms such as data resampling

techniques and non-parametric models (e.g., decision

trees and RFs) demonstrate better performance in pre-

dicting more severe classes of crash, such as fatal crashes,

compared with conventional parametric methods.

Figure 6. Comparison of conventional marginal effects and machine learning Shapley (SHAP) values.

Focusing on the large truck-involved work zone crashes in Florida, both machine learning techniques and conventional ordered logistic models were applied in this study. To assess the impact of the data imbalance issue, the original crash data were resampled using random over-sampling and SMOTE-NC techniques. RF models were developed and tuned. Important variables were extracted from the RF model based on the average Gini index. Primary contributors included pedestrian involvement, lighting conditions, safety equipment, driver condition, driver age, and work zone locations. Interestingly, roadway and environmental conditions such as traffic-way, median type, road system identifier, and weather conditions, as well as vehicle maneuver and driver actions, did not show any significant contribution to the model. Consequently, conventional logistic regression models were built using the identified primary contributors. Results confirmed that in both machine learning and logistic regression models, data resampling significantly improved the model’s performance on the minority class (fatal crashes). In addition, results showed that the optimized decision tree model outperformed the ordered logistic regression on fatal crash prediction (F1 score) on both training and test data.

On fatality patterns, results showed that a combination of different factors could significantly increase the probability of a fatal outcome. In pedestrian crashes, factors such as dark-not lighted conditions, a distracted truck driver, airbag deployment, and driver’s age (young drivers outside city limits, senior drivers inside city limits) were highly likely to result in a fatality. In non-pedestrian crashes, the combination of front airbag deployment with any restraint system other than a shoulder and lap belt was quite likely to be fatal. Also, abnormal driver conditions would greatly increase the risk of a fatal outcome. Results also showed that the presence of a female driver (in multiple-vehicle crashes) decreased crash severity, probably because female drivers drive more carefully than male drivers.

Finally, this study looked into the Shapley values and

compared them with the marginal effects from the con-

ventional ordered logistic model. Accordingly, pedestrian

involvement, lack of restraint system, front airbag

deployment, and not lighted conditions show a higher

than average impact on increasing fatality probability.

On the other hand, normal driver condition and airbags

not deployed contribute to lower severity crashes.

Results from the machine learning model and the con-

ventional ordered logistic model were similar in most

cases, with values from the machine learning model

being smaller, probably because of different computa-

tional approaches in the two models.

This study and similar pattern recognition studies

using machine learning techniques hold the potential to

provide a better approach to understanding crash contri-

buting factors and developing effective and specific coun-

termeasures, especially when less frequent but more

severe crashes are to be explored. Further research could

focus on the role of temporal variations using time-series

models or expand the analysis by analyzing worker and

driver severity measures separately.

Author Contributions

The authors confirm contribution to the paper as follows: study

conception and design: X. Jin, H. Asgari; data collection: X.

Jin, H. Asgari, G. Azimi, A. Rahimi; analysis and interpreta-

tion of results: R. Gupta and H. Asgari; draft manuscript pre-

paration: R. Gupta, H. Asgari and X. Jin. All authors reviewed

the results and approved the final version of the manuscript.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with

respect to the research, authorship, and/or publication of this

article.

Funding

The author(s) received no financial support for the research,

authorship, and/or publication of this article.

ORCID iDs

Rajesh Gupta https://orcid.org/0000-0001-7833-0235

Ghazaleh Azimi https://orcid.org/0000-0001-5646-6908

Xia Jin https://orcid.org/0000-0002-8660-3528

References

1. Work/Construction Zones. https://www.nhtsa.gov/sites/

nhtsa.dot.gov/files/workzones.pdf. Accessed July 15, 2019.

2. FHWA. Safe Driving, Safer Work Zones: National Work

Zone Awareness Week 2011. FOCUS, March 4–5, 2011,

ISSN 1060-6637. https://www.fhwa.dot.gov/publications/

focus/11mar/11mar.pdf.

3. National Work Zone Safety. Information Clearinghouse.

https://www.workzonesafety.org/crash-information/work-

zone-fatal-crashes-fatalities/#national. Accessed July 15, 2019.

4. The Economics Daily, Fatal injuries at road work zones.

Bureau of Labor Statistics, U.S. Department of Labor.

https://www.bls.gov/opub/ted/2017/fatal-injuries-at-road-

work-zones.htm Accessed July 20, 2019.

5. Akepati, S. R., and S. Dissanayake. Characteristics and

Contributory Factors of Work Zone Crashes. Presented at

93rd Annual Meeting of the Transportation Research

Board, Washington, D.C., 2011.

6. Srinivasan, R., G. Ullman, M. Finley, and F. Council. Use

of Empirical Bayesian Methods to Estimate Crash Modifi-

cation Factors for Daytime Versus Nighttime Work Zones.

Transportation Research Record: Journal of the Transporta-

tion Research Board, 2011. 2241: 29–38.

7. Yang, H., K. Ozbay, O. Ozturk, and M. Yildirimoglu.

Modeling Work Zone Crash Frequency by Quantifying

Measurement Errors in Work Zone Length. Accident Anal-

ysis & Prevention, Vol. 55, 2013, pp. 192–201.


8. Ozturk, O., K. Ozbay, and H. Yang. Estimating the Impact

of Work Zones on Highway Safety. Transportation

Research Board 93rd Annual Meeting, Washington, D.C.,

2014.

9. Weng, J., and Q. Meng. Analysis of Driver Casualty Risk

for Different Work Zone Types. Accident Analysis & Pre-

vention, Vol. 43, No. 5, 2011, pp. 1811–1817.

10. Meng, Q., and J. Weng. Evaluation of Rear-End Crash

Risk at Work Zone Using Work Zone Traffic Data. Acci-

dent Analysis & Prevention, Vol. 43, No. 4, 2011,

pp. 1291–1300.

11. Xie, Y., K. Zhao, and N. Huynh. Analysis of Driver Injury

Severity in Rural Single-Vehicle Crashes. Accident Analysis

& Prevention, Vol. 47, 2012, pp. 36–44.

12. Celik, A. K., and E. Oktay. A Multinomial Logit Analysis

of Risk Factors Influencing Road Traffic Injury Severities

in the Erzurum and Kars Provinces of Turkey. Accident

Analysis & Prevention, Vol. 72, 2014, pp. 66–77.

13. Chen, Z., and W. D. Fan. A Multinomial Logit Model of

Pedestrian-Vehicle Crash Severity in North Carolina. Inter-

national Journal of Transportation Science and Technology,

Vol. 8, No. 1, 2019, pp. 43–52.

14. Wahab, L., and H. Jiang. A Multinomial Logit Analysis of

Factors Associated with Severity of Motorcycle Crashes in

Ghana. Traffic Injury Prevention, Vol. 20, No. 5, 2019,

pp. 521–527.

15. Hubbert, K., and M. Doustmohammadi. Multinomial

Logit Analysis of Injury Severity in Crashes Involving

Emotional Drivers. International Journal of Psychology and

Behavioral Sciences, Vol. 9, No. 4, 2019, pp. 63–70.

16. Zhu, X., and S. Srinivasan. Modeling Occupant-Level

Injury Severity: An Application to Large-Truck Crashes.

Accident Analysis & Prevention, Vol. 43, No. 4, 2011,

pp. 1427–1437.

17. Lemp, J. D., K. M. Kockelman, and A. Unnikrishnan.

Analysis of Large Truck Crash Severity Using Heteroske-

dastic Ordered Probit Models. Accident Analysis & Preven-

tion, Vol. 43, No. 1, 2011, pp. 370–380.

18. Linchao, L., and T. Fratrović. Analysis of Factors Influencing the Vehicle Damage Level in Fatal Truck-Related Accidents and Differences in Rural and Urban Areas. Promet-Traffic & Transportation, Vol. 28, No. 4, 2016, pp. 331–340.

19. Rezapour, M., M. Moomen, and K. Ksaibati. Ordered

Logistic Models of Influencing Factors on Crash Injury

Severity of Single and Multiple-Vehicle Downgrade

Crashes: A Case Study in Wyoming. Journal of Safety

Research, Vol. 68, 2019, pp. 107–118.

20. Asare, I. O., and A. C. Mensah. Crash Severity Modelling

Using Ordinal Logistic Regression Approach. International

Journal of Injury Control and Safety Promotion, Vol. 27,

No. 4, 2020, pp. 412–419.

21. Yuan, Q., X. Xu, J. Zhao, and Q. Zeng. Investigation of

Injury Severity in Urban Expressway Crashes: A Case

Study from Beijing. PLoS One, Vol. 15, No. 1, 2020, p.

e0227869.

22. Chang, L. Y., and F. Mannering. Analysis of Injury Sever-

ity and Vehicle Occupancy in Truck- and Non-Truck-

Involved Accidents. Accident Analysis & Prevention, Vol.

31, No. 5, 1999, pp. 579–592.

23. Abdel-Aty, M., and H. Abdelwahab. Modeling Rear-End

Collisions Including the Role of Driver’s Visibility and

Light Truck Vehicles Using a Nested Logit Structure. Acci-

dent Analysis & Prevention, Vol. 36, No. 3, 2004,

pp. 447–456.

24. Patil, S., S. R. Geedipally, and D. Lord. Analysis of Crash

Severities Using Nested Logit Model—Accounting for the

Underreporting of Crashes. Accident Analysis & Prevention,

Vol. 45, 2012, pp. 646–653.

25. Razi-Ardakani, H., A. Mahmoudzadeh, and M. Kerman-

shah. A Nested Logit Analysis of the Influence of Distrac-

tion on Types of Vehicle Crashes. European Transport

Research Review, Vol. 10, No. 2, 2018, pp. 1–4.

26. Milton, J. C., V. N. Shankar, and F. L. Mannering. High-

way Accident Severities and the Mixed Logit Model: An

Exploratory Empirical Analysis. Accident Analysis & Pre-

vention, Vol. 40, No. 1, 2008, pp. 260–266.

27. Ye, F., and D. Lord. Investigation of Effects of Underre-

porting Crash Data on Three Commonly Used Traffic

Crash Severity Models: Multinomial Logit, Ordered Pro-

bit, and Mixed Logit. Transportation Research Record:

Journal of the Transportation Research Board, 2011. 2241:

51–58.

28. Chen, F., and S. Chen. Injury Severities of Truck Drivers

in Single- and Multi-Vehicle Accidents on Rural High-

ways. Accident Analysis & Prevention, Vol. 43, No. 5, 2011,

pp. 1677–1688.

29. Wu, Q., F. Chen, G. Zhang, X. C. Liu, H. Wang, and S.

M. Bogus. Mixed Logit Model-Based Driver Injury Sever-

ity Investigations in Single- and Multi-Vehicle Crashes on

Rural Two-Lane Highways. Accident Analysis & Preven-

tion, Vol. 72, 2014, pp. 105–115.

30. Islam, M., N. Alnawmasi, and F. Mannering. Unobserved

Heterogeneity and Temporal Instability in the Analysis of

Work-Zone Crash-Injury Severities. Analytic Methods in

Accident Research, Vol. 28, 2020, p. 100130.

31. Azimi, G., A. Rahimi, H. Asgari, and X. Jin. Severity Anal-

ysis for Large Truck Rollover Crashes Using a Random

Parameter Ordered Logit Model. Accident Analysis & Pre-

vention, Vol. 135, 2020, p. 105355.

32. Islam, S., S. L. Jones, and D. Dye. Comprehensive Analy-

sis of Single- and Multi-Vehicle Large Truck At-Fault

Crashes on Rural and Urban Roadways in Alabama. Acci-

dent Analysis & Prevention, Vol. 67, 2014, pp. 148–158.

33. Wang, J., H. Huang, P. Xu, S. Xie, and S. C. Wong. Ran-

dom Parameter Probit Models to Analyze Pedestrian Red-

Light Violations and Injury Severity in Pedestrian–Motor

Vehicle Crashes at Signalized Crossings. Journal of Trans-

portation Safety & Security, Vol. 12, No. 6, 2020,

pp. 818–837.

34. Chang, L. Y., and H. W. Wang. Analysis of Traffic Injury

Severity: An Application of Non-Parametric Classification

Tree Techniques. Accident Analysis & Prevention, Vol. 38,

No. 5, 2006, pp. 1019–1027.

35. Mujalli, R. O., and J. De Oña. Injury Severity Models for Motor Vehicle Accidents: A Review. Proceedings of the Institution of Civil Engineers: Transport, Vol. 166, No. 5, 2013, pp. 255–270.

36. Kashani, A. T., and A. S. Mohaymany. Analysis of the

Traffic Injury Severity on Two-Lane, Two-Way Rural

Roads Based on Classification Tree Models. Safety Sci-

ence, Vol. 49, No. 10, 2011, pp. 1314–1320.

37. Weng, J., Q. Meng, and D. Z. Wang. Tree-Based Logistic

Regression Approach for Work Zone Casualty Risk

Assessment. Risk Analysis, Vol. 33, No. 3, 2013,

pp. 493–504.

38. Abellán, J., G. López, and J. De Oña. Analysis of Traffic Accident Severity Using Decision Rules via Decision Trees. Expert Systems with Applications, Vol. 40, No. 15, 2013, pp. 6047–6054.

39. de Oña, J., G. López, and J. Abellán. Extracting Decision Rules from Police Accident Reports Through Decision Trees. Accident Analysis & Prevention, Vol. 50, 2013, pp. 1151–1160.

40. Chong, M. M., A. Abraham, and M. Paprzycki. Traffic

Accident Analysis Using Decision Trees and Neural Net-

works. arXiv Preprint CS/0405050, May 16, 2004.

41. Ghasemzadeh, A., and M. M. Ahmed. A Probit-Decision

Tree Approach to Analyze Effects of Adverse Weather

Conditions on Work Zone Crash Severity Using Second

Strategic Highway Research Program Roadway Informa-

tion Dataset. Transportation Research Board 96th Annual

Meeting, Washington, D.C., 2017.

42. Moral-García, S., J. G. Castellano, C. J. Mantas, A. Montella, and J. Abellán. Decision Tree Ensemble Method for Analyzing Traffic Accidents of Novice Drivers in Urban Areas. Entropy, Vol. 21, No. 4, 2019, p. 360.

43. Khattak, A. J., A. J. Khattak, and F. M. Council. Effects

of Work Zone Presence on Injury and Non-Injury Crashes.

Accident Analysis & Prevention, Vol. 34, No. 1, 2002,

pp. 19–29.

44. Li, Y., and Y. Bai. Comparison of Characteristics Between

Fatal and Injury Accidents in the Highway Construction

Zones. Safety Science, Vol. 46, No. 4, 2008, pp. 646–660.

45. Li, Y., and Y. Bai. Highway Work Zone Risk Factors and

Their Impact on Crash Severity. Journal of Transportation

Engineering, Vol. 135, No. 10, 2009, pp. 694–701.

46. Khattak, A. J., and F. Targa. Injury Severity and Total

Harm in Truck-Involved Work Zone Crashes. Transporta-

tion Research Record: Journal of the Transportation

Research Board, 2004. 1877: 106–116.

47. Osman, M., R. Paleti, S. Mishra, and M. M. Golias. Anal-

ysis of Injury Severity of Large Truck Crashes in Work

Zones. Accident Analysis & Prevention, Vol. 97, 2016,

pp. 261–273.

48. Zhang, K., and M. Hassan. Crash Severity Analysis of

Nighttime and Daytime Highway Work Zone Crashes.

PLoS One, Vol. 14, No. 8, 2019, p. e0221128.

49. Harb, R., E. Radwan, X. Yan, A. Pande, and M.

Abdel-Aty. Freeway Work-Zone Crash Analysis and Risk

Identification Using Multiple and Conditional Logistic

Regression. Journal of Transportation Engineering, Vol.

134, No. 5, 2008, pp. 203–214.

50. Salem, O. M., A. M. Genaidy, H. Wei, and N. Deshpande.

Spatial Distribution and Characteristics of Accident

Crashes at Work Zones of Interstate Freeways in Ohio.

Proc., 2006 IEEE Intelligent Transportation Systems Con-

ference, IEEE, New York, 2006, pp. 1642–1647.

51. Raub, R. A., O. B. Sawaya, J. L. Schofer, and A. Ziliasko-

poulos. Enhanced Crash Reporting to Explore Workzone

Crash Patterns. Paper No. 01-0166, Northwestern Univer-

sity Center for Public Safety, Evanston, IL, 2001.

52. Sze, N. N., and Z. Song. Factors Contributing to Injury

Severity in Work Zone Related Crashes in New Zealand.

International Journal of Sustainable Transportation, Vol.

13, No. 2, 2019, pp. 148–154.

53. Schrock, S. D., G. L. Ullman, A. S. Cothron, E. Kraus, and A. P. Voigt. An Analysis of Fatal Work Zone Crashes in Texas. Report FHWA/TX-05/0-4028-1, Texas Department of Transportation, Research and Technology Implementation Office, October 2004.

54. Qi, Y., R. Srinivasan, H. Teng, and R. Baker. Analysis of

the Frequency and Severity of Rear-End Crashes in Work

Zones. Traffic Injury Prevention, Vol. 14, No. 1, 2013,

pp. 61–72.

55. Wang, Z., J. J. Lu, Q. Wang, L. Lu, and Z. Zhang. Modeling

Injury Severity in Work Zones Using Ordered Probit Regres-

sion. Proc., ICCTP 2010: Integrated Transportation Systems:

Green, Intelligent, Reliable, 2010, pp. 1058–1067, Beijing,

China.

56. Zhang, S., C. Tjortjis, X. Zeng, H. Qiao, I. Buchan, and J.

Keane. Comparing Data Mining Methods with Logistic

Regression in Childhood Obesity Prediction. Information

Systems Frontiers, Vol. 11, No. 4, 2009, pp. 449–460.

57. Kuhle, S., B. Maguire, H. Zhang, D. Hamilton, A. C.

Allen, K. S. Joseph, and V. M. Allen. Comparison of

Logistic Regression with Machine Learning Methods for

the Prediction of Fetal Growth Abnormalities: A Retro-

spective Cohort Study. BMC Pregnancy and Childbirth,

Vol. 18, No. 1, 2018, pp. 1–9.

58. Pua, Y. H., H. Kang, J. Thumboo, R. A. Clark, E. S.

Chew, C. L. Poon, H. C. Chong, and S. J. Yeo. Machine

Learning Methods Are Comparable to Logistic Regression

Techniques in Predicting Severe Walking Limitation Fol-

lowing Total Knee Arthroplasty. Knee Surgery, Sports

Traumatology, Arthroscopy, 2020 28(10): 3207–3216.

59. Hensher, D. A., and W. H. Greene. The Mixed Logit

Model: The State of Practice. Transportation, Vol. 30, No.

2, 2003, pp. 133–176.

60. Hensher, D. A., J. M. Rose, J. M. Rose, and W. H. Greene.

. Applied Choice Analysis: A Primer. Cambridge University

Press, Cambridge, 2005.

61. Jahangiri, A., and H. A. Rakha. Applying Machine Learn-

ing Techniques to Transportation Mode Recognition

Using Mobile Phone Sensor Data. IEEE Transactions on

Intelligent Transportation Systems, Vol. 16, No. 5, 2015,

pp. 2406–2417.

62. Jahangiri, A., H. Rakha, and T. A. Dingus. Red-Light

Running Violation Prediction Using Observational and

Simulator Data. Accident Analysis & Prevention, Vol. 96,

2016, pp. 316–328.

63. Balali, V., and M. Golparvar-Fard. Evaluation of Multi-

class Traffic Sign Detection and Classification Methods for

US Roadway Asset Inventory Management. Journal of

Gupta et al 15

Computing in Civil Engineering, Vol. 30, No. 2, 2016, p.

04015022.

64. Chen, S., W. Wang, and H. Van Zuylen. Construct Sup-

port Vector Machine Ensemble to Detect Traffic Incident.

Expert Systems with Applications, Vol. 36, No. 8, 2009,

pp. 10976–10986.

65. Yu, R., and M. Abdel-Aty. Utilizing Support Vector

Machine in Real-Time Crash Risk Evaluation. Accident

Analysis & Prevention, Vol. 51, 2013, pp. 252–259.

66. Li, Z., P. Liu, W. Wang, and C. Xu. Using Support Vector

Machine Models for Crash Injury Severity Analysis. Acci-

dent Analysis & Prevention, Vol. 45, 2012, pp. 478–486.

67. Chen, C., G. Zhang, Z. Qian, R. A. Tarefder, and Z. Tian.

Investigating Driver Injury Severity Patterns in Rollover

Crashes Using Support Vector Machine Models. Accident

Analysis & Prevention, Vol. 90, 2016, pp. 128–139.

68. Zheng, Z., P. Lu, and B. Lantz. Commercial Truck Crash

Injury Severity Analysis Using Gradient Boosting Data

Mining Model. Journal of Safety Research, Vol. 65, 2018,

pp. 115–124.

69. Japkowicz, N. Learning from Imbalanced Data Sets: A

Comparison of Various Strategies. In AAAI Workshop on

Learning from Imbalanced Data Sets, AAAI Press, Menlo

Park, CA, Vol. 68, 2000, pp. 10–15.

70. Barandela, R., R. M. Valdovinos, J. S. Sa

´nchez, and F. J.

Ferri. The Imbalanced Training Sample Problem: Under

or Over Sampling? In Joint IAPR International Workshops

on Statistical Techniques in Pattern Recognition (SPR) and

Structural and Syntactic Pattern Recognition (SSPR),

Springer, Berlin, Heidelberg, 2004, pp. 806–814.

71. Van Hulse, J., T. M. Khoshgoftaar, and A. Napolitano.

Experimental Perspectives on Learning from Imbalanced

Data. Proc., 24th International Conference on Machine Learn-

ing, Corvalis Oregon USA, June 20-24, 2007, pp. 935–942.

72. Gu, J., Y. Zhou, and X. Zuo. Making Class Bias Useful: A

Strategy of Learning from Imbalanced Data. Proc., Inter-

national Conference on Intelligent Data Engineering and

Automated Learning, Springer, Berlin, Heidelberg, 2007,

pp. 287–295.

73. Ahmadi, A., A. Jahangiri, V. Berardi, and S. G. Machiani.

Crash Severity Analysis of Rear-End Crashes in California

Using Statistical and Machine Learning Classification

Methods. Journal of Transportation Safety & Security, Vol.

12, No. 4, 2020, pp. 522–546.

74. Lamba, D., M. Alsadhan, W. Hsu, E. Fitzsimmons, and

G. Newm ark. Coping with Class Imbalance in Classification

of Traffic Crash Severity Based on Sensor and Road Data: A

Feature Selection and Data Augmentation Approach. The 6th

International Conference on Artificial Intelligence and Applica-

tions (AIAP-2019), May 25-26, 2019, Vancouver, Canada.

75. Yahaya, M., X. Jiang, C. Fu, K. Bashir, and W. Fan.

Enhancing Crash Injury Severity Prediction on Imbalanced

Crash Data by Sampling Technique with Variable Selec-

tion. Proc., 2019 IEEE Intelligent Transportation Systems

Conference (ITSC), IEEE, New York, 2019, pp. 363–368.

76. Fiorentini, N., and M. Losa. Handling Imbalanced Data

in Road Crash Severity Prediction by Machine Learning

Algorithms. Infrastructures, Vol. 5, No. 7, 2020, p. 61.

77. Elhassan, T., M. Aljurf, F. Al-Mohanna, and M. Shoukri.

Classification of Imbalance Data Using Tomek Link (t-

Link) Combined with Random Under-Sampling (RUS) as

a Data Reduction Method. Journal of Informatics and Data

Mining, Vol 1 (2), 2016, https://doi.org/10.21767/2472-

1956.100011

78. Han, H., W. Y. Wang, and B. H. Mao. Borderline-

SMOTE: A New Over-Sampling Method in Imbalanced

Data Sets Learning. Proc., International Conference on

Intelligent Computing, Springer, Berlin, Heidelberg, 2005,

pp. 878–887.

79. Anand, A., G. Pugalenthi, G. B. Fogel, and P. N.

Suganthan. An Approach for Classification of Highly

Imbalanced Data Using Weighting and Undersampling.

Amino Acids, Vol. 39, No. 5, 2010, pp. 1385–1391.

80. Ng, W. W., J. Hu, D. S. Yeung, S. Yin, and F. Roli. Diver-

sified Sensitivity-Based Undersampling for Imbalance Clas-

sification Problems. IEEE Transactions on Cybernetics,

Vol. 45, No. 11, 2014, pp. 2402–2412.

81. Naik, B., L. R. Rilett, J. Appiah, and L. F. Walubita.

Resampling Methods for Estimating Travel Time Uncer-

tainty: Application of the Gap Bootstrap. Transportation

Research Record: Journal of the Transportation Research

Board, 2018. 2672: 137–147.

82. Parsa, A. B., H. Taghipour, S. Derrible, and A. K.

Mohammadian. Real-Time Accident Detection: Coping

with Imbalanced Data. Accident Analysis & Prevention,

Vol. 129, 2019, pp. 202–210.

83. Ambu

¨hl, L., A. Loder, M. C. Bliemer, M. Menendez, and

K. W. Axhausen. Introducing a Re-Sampling Methodology

for the Estimation of Empirical Macroscopic Fundamental

Diagrams. Transportation Research Record: Journal of the

Transportation Research Board, 2018. 2672: 239–248.

84. Kitali, A. E., P. Alluri, T. Sando, and W. Wu. Identifica-

tion of Secondary Crash Risk Factors Using Penalized

Logistic Regression Model. Transportation Research

Record: Journal of the Transportation Research Board,

2019. 2673: 901–914.

85. Li, D. C., H. Y. Chen, and Q. S. Shi. Learning from Small

Datasets Containing Nominal Attributes. Neurocomputing,

Vol. 291, 2018, pp. 226–236.

86. Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P.

Kegelmeyer. SMOTE: Synthetic Minority Over-Sampling

Technique. Journal of Artificial Intelligence Research, Vol.

16, 2002, pp. 321–357.

87. Mitchell, T. M. Machine Learning, McGraw Hill, 1997,

ISBN 0070428077.

88. Kingsford, C., and S. L. Salzberg. What Are Decision

Trees? Nature Biotechnology, Vol. 26, No. 9, 2008,

pp. 1011–1013.

89. Chen, Y. L., and L. T. Hung. Using Decision Trees to Sum-

marize Associative Classification Rules. Expert Systems

with Applications, Vol. 36, No. 2, 2009, pp. 2338–2351.

90. Ghasemzadeh, A., B. E. Hammit, M. M. Ahmed, and R.

K. Young. Parametric Ordinal Logistic Regression and

Non-Parametric Decision Tree Approaches for Assessing

the Impact of Weather Conditions on Driver Speed Selec-

tion Using Naturalistic Driving Data. Transportation

16 Transportation Research Record 00(0)

Research Record: Journal of the Transportation Research

Board, 2018. 2672: 137–147.

91. Quinlan, J. R. Induction of Decision Trees. Machine Learn-

ing, Vol. 1, No. 1, 1986, pp. 81–106.

92. Rokach, L., and O. Maimon. Top-Down Induction of

Decision Trees Classifiers: A Survey. IEEE Transactions on

Systems, Man, and Cybernetics, Part C (Applications and

Reviews), Vol. 35, No. 4, 2005, pp. 476–487.

93. Mingers, J. An Empirical Comparison of Pruning Methods

for Decision Tree Induction. Machine Learning, Vol. 4, No.

2, 1989, pp. 227–243.

94. Bradford, J. P., C. Kunz, R. Kohavi, C. Brunk, and C. E.

Brodley. Pruning Decision Trees with Misclassification

Costs. Proc., European Conference on Machine Learning,

April 21, Springer, Berlin, Heidelberg, 1998, pp. 131–136.

95. Breiman, L. Random Forests. Machine Learning, Vol. 45,

No. 1, 2001, pp. 5–32.

96. Liaw, A., and M. Wiener. Classification and Regression

by Random Forest. R News, Vol. 2, No. 3, 2002, pp.

18–22.

97. Mafi, S., Y. Abdelrazig, and R. Doczy. Machine Learning

Methods to Analyze Injury Severity of Drivers from Dif-

ferent Age and Gender Groups. Transportation Research

Record: Journal of the Transportation Research Board,

2018. 2672: 171–183.

98. Arbabzadeh, N., M. Jalayer, and M. Jafari. Predicting

Traffic Safety Risk Factors Using an Ensemble Classifier.

In Data Analytics for Smart Cities (Alavi, A. H. and W. G.

Buttlar, eds.), Auerbach Publications, Boca Raton, FL,

2018, pp. 201–216.

99. Fayed, H. A., and A. F. Atiya. Speed Up Grid-Search for

Parameter Selection of Support Vector Machines. Applied

Soft Computing, Vol. 80, 2019, pp. 202–210.

100. Soleimani, S., S. R. Mousa, J. Codjoe, and M. Leitner.

A Comprehensive Railroad-Highway Grade Crossing

Consolidation Model: A Machine Learning Approach.

Accident Analysis & Prevention, Vol. 128, 2019, pp. 65–77.

101. Probst, P., A. L. Boulesteix, and B. Bischl. Tunability:

Importance of Hyperparameters of Machine Learning

Algorithms. The Journal of Machine Learning Research,

Vol. 20, No. 1, 2019, pp. 1934–1965.

102. Shapley, L. S. A Value for n-Person Games. In Contribu-

tions to the Theory of Games II (Kuhn, A. W., and H. W.

Tucker, eds.), Princeton University Press, Princeton, New

Jersey, USA, 1953.

103. Lundberg, S., and S. I. Lee. A Unified Approach to Inter-

preting Model Predictions. arXiv Preprint

arXiv:1705.07874, May 22, 2017.

104. Lundberg, S. M., G. Erion, H. Chen, A. DeGrave, J. M.

Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, and

S. I. Lee. From Local Explanations to Global Under-

standing with Explainable AI for Trees. Nature Machine

Intelligence, Vol. 2, No. 1, 2020, pp. 56–67.

105. Angel, A., and M. Hickman. Analysis of the Factors

Affecting the Severity of Two-Vehicle Crashes. Ingenierı´a

y Desarrollo, Vol. 24, 2008, pp. 176–194.

Gupta et al 17

Appendix A

Table A1. Ordered Logistic Regression Models

Ordered logistic regression on random over-sampled data

| Variable name | Value | Standard error | t Value | Marginal effect on property damage only | Marginal effect on injury | Marginal effect on fatality |
|---|---|---|---|---|---|---|
| Crash type_Sideswipe | -0.882 | 0.064 | -13.711 | 0.166 | -0.019 | -0.148 |
| Crash type_Pedestrian | 2.104 | 0.225 | 9.349 | -0.189 | -0.292 | 0.482 |
| Crash type_rear end | 1.336 | 0.125 | 10.665 | -0.148 | -0.161 | 0.309 |
| No airbag equipment | -0.488 | 0.105 | -4.662 | 0.084 | 0.007 | -0.09 |
| Airbag not deployed | -1.143 | 0.102 | -11.151 | 0.191 | 0.024 | -0.215 |
| Front airbag deployed for at-fault vehicle | 0.226 | 0.113 | 2.009 | -0.035 | -0.01 | 0.046 |
| Front airbag deployed for not-at-fault vehicle | 1.736 | 0.107 | 16.157 | -0.174 | -0.229 | 0.403 |
| Restraint not used for at-fault driver | 1.815 | 0.140 | 12.929 | -0.193 | -0.222 | 0.415 |
| Restraint_Shoulder and lap belt for at-fault driver | 0.086 | 0.104 | 0.833 | -0.014 | -0.002 | 0.017 |
| Lighting = dark not_lighted | 0.631 | 0.086 | 7.354 | -0.09 | -0.045 | 0.134 |
| Lighting = daylight | 0.087 | 0.065 | 1.338 | -0.014 | -0.003 | 0.017 |
| Driver action = wrong side of wrong way | -0.017 | 0.544 | -0.031 | 0.003 | 0.001 | -0.003 |
| Driver at fault's condition normal | -1.835 | 0.074 | -24.657 | 0.222 | 0.184 | -0.406 |
| Not distracted (at-fault driver) | -0.427 | 0.055 | -7.708 | 0.066 | 0.019 | -0.086 |
| Driver at fault age | 0.008 | 0.002 | 5.410 | -0.001 | 0 | 0.002 |
| Driver not at fault age | -0.008 | 0.001 | -5.923 | 0.001 | 0 | -0.001 |
| Driver not at fault = female | 0.355 | 0.058 | 6.072 | -0.059 | -0.008 | 0.068 |
| HARMFUL_EVT1_1_Pedestrian | 2.244 | 0.245 | 9.171 | -0.194 | -0.314 | 0.509 |
| Crash location = within city limits | -0.789 | 0.051 | -15.336 | 0.135 | 0.011 | -0.146 |
| Shoulder type = curb | -0.217 | 0.063 | -3.425 | 0.037 | 0.004 | -0.041 |

Ordered logistic regression on SMOTE-NC over-sampled data

| Variable name | Value | Standard error | t Value | Marginal effect on property damage only | Marginal effect on injury | Marginal effect on fatality |
|---|---|---|---|---|---|---|
| Crash type_Sideswipe | -1.212 | 0.084 | -14.417 | 0.189 | -0.025 | -0.164 |
| Crash type_Pedestrian | 2.392 | 0.755 | 3.167 | -0.135 | -0.398 | 0.533 |
| Crash type_rear end | 0.696 | 0.060 | 11.689 | -0.078 | -0.048 | 0.125 |
| No airbag equipment | -2.091 | 0.098 | -21.411 | 0.32 | -0.027 | -0.293 |
| Airbag not deployed | -2.686 | 0.094 | -28.684 | 0.367 | 0.057 | -0.424 |
| Front airbag deployed for at-fault vehicle | -1.735 | 0.111 | -15.586 | 0.307 | -0.105 | -0.202 |
| Front airbag deployed for not-at-fault vehicle | 3.489 | 0.148 | 23.548 | -0.158 | -0.538 | 0.696 |
| Restraint not used for at-fault driver | 1.827 | 0.164 | 11.112 | -0.12 | -0.292 | 0.412 |
| Restraint_Shoulder and lap belt for at-fault driver | -2.286 | 0.476 | -4.805 | 0.467 | -0.269 | -0.199 |
| Lighting = dark not_lighted | 0.409 | 0.099 | 4.150 | -0.042 | -0.035 | 0.078 |
| Lighting = daylight | 0.477 | 0.071 | 6.729 | -0.059 | -0.02 | 0.079 |
| Driver action = wrong side of wrong way | -1.600 | 0.911 | -1.756 | 0.299 | -0.128 | -0.171 |
| Driver at fault's condition normal | -2.707 | 0.100 | -27.191 | 0.193 | 0.379 | -0.572 |
| Not distracted (at-fault driver) | 0.025 | 0.073 | 0.347 | -0.003 | -0.001 | 0.004 |
| Driver at fault age | -0.006 | 0.002 | -3.894 | 0.001 | 0 | -0.001 |
| Driver not at fault age | -0.003 | 0.001 | -2.590 | 0 | 0 | -0.001 |
| Driver not at fault = female | -0.308 | 0.079 | -3.912 | 0.039 | 0.011 | -0.05 |
| HARMFUL_EVT1_1_Pedestrian | 4.043 | 0.763 | 5.302 | -0.164 | -0.581 | 0.744 |
| Crash location = within city limits | -1.021 | 0.057 | -17.822 | 0.136 | 0.024 | -0.16 |
| Shoulder type = curb | -1.024 | 0.085 | -12.085 | 0.156 | -0.014 | -0.142 |
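Two internal consistency properties of Table A1 can be checked by hand or with a short script. First, each t value should be close to the coefficient divided by its standard error. Second, because the predicted probabilities of the three severity outcomes sum to one, the three marginal effects in each row should sum to approximately zero. The sketch below is illustrative only and is not part of the authors' analysis: it uses four rows from the random over-sampled model, with negative signs as reported in the published table, and the function name `check_row` and the tolerances are assumptions chosen to absorb rounding in the printed values.

```python
# Rows taken from Table A1 (random over-sampled model):
# (variable, coefficient, standard error, reported t value,
#  marginal effect on PDO, on injury, on fatality)
ROWS = [
    ("Crash type_Sideswipe", -0.882, 0.064, -13.711, 0.166, -0.019, -0.148),
    ("Crash type_Pedestrian", 2.104, 0.225, 9.349, -0.189, -0.292, 0.482),
    ("Driver at fault's condition normal", -1.835, 0.074, -24.657, 0.222, 0.184, -0.406),
    ("Crash location = within city limits", -0.789, 0.051, -15.336, 0.135, 0.011, -0.146),
]


def check_row(row, t_rel_tol=0.02, me_abs_tol=0.002):
    """Check one table row for internal consistency.

    t_rel_tol and me_abs_tol are hypothetical tolerances that allow for
    the three-decimal rounding of the printed table.
    """
    name, coef, se, t_reported, me_pdo, me_injury, me_fatal = row
    # The t statistic is the coefficient divided by its standard error.
    t_recomputed = coef / se
    t_ok = abs(t_recomputed - t_reported) <= t_rel_tol * abs(t_reported)
    # Outcome probabilities sum to one, so their changes sum to zero.
    me_sum = me_pdo + me_injury + me_fatal
    me_ok = abs(me_sum) <= me_abs_tol
    return {"name": name, "t_recomputed": t_recomputed, "t_ok": t_ok,
            "me_sum": me_sum, "me_ok": me_ok}


results = [check_row(r) for r in ROWS]

if __name__ == "__main__":
    for r in results:
        print(f"{r['name']}: t = {r['t_recomputed']:.3f} "
              f"(consistent: {r['t_ok']}), "
              f"sum of marginal effects = {r['me_sum']:.3f} "
              f"(consistent: {r['me_ok']})")
```

The same check can be extended to the remaining rows and to the SMOTE-NC model by appending tuples to `ROWS`.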

Machine Learning based Public Sentiment Analytics on Roadway Work-zone Tweets

Preprint

Full-text available

Jan 2023

The construction and maintenance activities of roadway infrastructure positively contribute to social and economic development and improve traffic safety. However, the roadway work zones (WZs) cause safety issues for construction workers and travelers and adversely affect vehicular movement. This study aims to explore public perceptions about WZs and identify factors that influence crashes and public experience at WZs by collecting and analyzing Twitter data. In this paper, we have employed several machine learning methods to classify WZs tweets and then performed exploratory, sentiment, and emotion analysis on the classified tweets. We have then verified our Twitter-related research outcome with police crash reports. The sentiment and emotion analysis using classified tweets (with 92% classification accuracy and 0.68 F-score) showed somewhat negative emotion on roadway WZs and onsite physical elements. However, the overall sentiment and emotion scores support positive outcomes from WZs' activities. We also found a temporal relationship and a strong correlation between WZ-related tweets and fatalities. A cross-analysis of tweets and crash reports revealed that some physical elements (e.g., signs, barriers, barrels, closures, and workers, etc.) are significantly associated with severe crashes at WZs. The results of this research may help policymakers to make appropriate policy decisions in improving driving experiences and reducing WZ-related traffic accidents.

Comparative Analysis of Parametric and Non-Parametric Data-Driven Models to Predict Road Crash Severity among Elderly Drivers Using Synthetic Resampling Techniques

Article

Full-text available

Jun 2023

As the global elderly population continues to rise, the risk of severe crashes among elderly drivers has become a pressing concern. This study presents a comprehensive examination of crash severity among this demographic, employing machine learning models and data gathered from Virginia, United States of America, between 2014 and 2021. The analysis integrates parametric models, namely logistic regression and linear discriminant analysis (LDA), as well as non-parametric models like random forest (RF) and extreme gradient boosting (XGBoost). Central to this study is the application of resampling techniques, specifically, random over-sampling examples (ROSE) and the synthetic minority over-sampling technique (SMOTE), to address the dataset’s inherent imbalance and enhance the models’ predictive performance. Our findings reveal that the inclusion of these resampling techniques significantly improves the predictive power of parametric models, notably increasing the true positive rate for severe crash prediction from 6% to 60% and boosting the geometric mean from 25% to 69% in logistic regression. Likewise, employing SMOTE resulted in a notable improvement in the non-parametric models’ performance, leading to a true positive rate increase from 8% to 36% in XGBoost. Moreover, the study established the superiority of parametric models over non-parametric counterparts when balanced resampling techniques are utilized. Beyond predictive modeling, the study delves into the effects of various contributing factors on crash severity, enhancing the understanding of how these factors influence elderly road safety. Ultimately, these findings underscore the immense potential of machine learning models in analyzing complex crash data, pinpointing factors that heighten crash severity, and informing targeted interventions to mitigate the risks of elderly driving.

Analysis of Injury Severity of Work Zone Truck-Involved Crashes in South Carolina for Interstates and Non-Interstates

Article

Full-text available

Apr 2023

This study investigates factors contributing to the injury severity of truck-involved work zones crashes in South Carolina (SC). The outcome of interest is injury or property damage only crashes, and the explanatory factors examined include the occupant, vehicle, collision, roadway, temporal, and environmental characteristics. Two mixed (random parameter) logit models are developed, one for non-interstates with speed limits less than 60 miles per hour (mph) and one for interstates with speed limits greater than or equal to 60 mph, using South Carolina statewide truck-involved work zone crash data from 2014 to 2020. Results of log-likelihood ratio tests indicate that separate speed models are warranted. The factors that were found to contribute to injury at the 90% confidence level in both models (interstate and non-interstate) are (1) dark lighting conditions, (2) female (at-fault) drivers, and (3) driving too fast for roadway conditions. Significant factors that apply only to non-interstates are SC or US primary roadways, activity area of the work zone, at-fault drivers under 35, sideswipe collision, presence of workers in the work zone, and collision with fixed objects. Significant factors that apply only to interstates are three or more vehicles, rear-end collision, location before the first work zone sign, and weekdays.

Assessing the effects of geometry and non-geometry related factors in work-zone crashes

Article

Mar 2024
Traffic Inj Prev

Mouyid Bin Islam

Objective: Work zones are unique in geometry and traffic management, utilizing special traffic signs, standard channelizing devices, appropriate barriers, and pavement markings. These configurations can introduce unexpected driving conditions, potentially posing risks to drivers. This analysis aims to explore potential differences in contributing factors between work-zone crashes where geometry was identified as a factor and those where it was non-geometry factor. To gain insights into driver injury severities in single-vehicle work-zone crashes, this study analyzed work zone crash data from Florida. Method: This study employed random parameters logit models, accommodating potential variations in parameter estimates’ means and variances. The dataset encompassed a wide array of factors known to influence driver injury severity, encompassing crash characteristics, vehicle attributes, roadway features, prevailing traffic volume, driver profiles, and spatial and temporal considerations. Results: This analysis yielded significantly distinct parameters for work-zone crashes, distinguishing between geometry-related and non-geometry-related factors (primarily the human factors). This distinction suggests a complex interplay between these factors. Notably, the marginal effects of individual parameter estimates exhibited marked differences between these two categories – geometry and non-geometry factors. Conclusion: These findings contribute to the growing body of research indicating that geometric restrictions within work zones introduce a distinct set of risk factors compared to non-geometry related factors. Recognizing the significance of geometric restrictions, beyond typical driving conditions, holds the implications for enhancing safety within various work zone configurations and offers valuable insights for crash scene investigators to pinpoint contributing factors accurately.

Conflict-Based Real-Time Road Safety Analysis: Sensitivity to Data Collection Duration and its Implications for Model Resilience

Article

Full-text available

May 2023

Conflict-based approaches to real-time road safety analysis can provide several benefits over traditional crash-based models. In particular, as traffic conflicts are much more frequent than crashes, models can be trained with significantly shorter collection periods. Since existing literature has not investigated the sensitivity of real-time conflict prediction models (RTConfPM) to data collection duration, here we aim to fill this gap and discuss the implications for model resilience. A real-world highway case study was analyzed. Methodologically, various traffic variables aggregated into 5 min intervals were selected as predictors, synthetic minority oversampling technique (SMOTE) was applied to deal with unbalanced classification issues, and support vector machine (SVM) was chosen as classifier. The dichotomous response variable separated safe and unsafe intervals into two classes; the latter were defined considering a minimum number of rear-end conflicts within the interval, which were identified using a surrogate measure of safety (SMS), that is, time-to-collision. Several RTConfPMs were trained and tested, considering different data collection durations and different criteria to define the unsafe situation class. The results, which were shown to be robust with respect to the machine learning classifier used, indicate that the models were able to provide reliable predictions with just three to five days of data, and that the increase in performance with collection periods longer than 10 to 15 days was negligible. These findings can be generalized by considering the number of unsafe situations corresponding to the data collection period of each tested model; they highlight the relevance of RTConfPM as a more flexible and resilient alternative to the crash-based approach.

Classification of truck-involved crash severity: Dealing with missing, imbalanced, and high dimensional safety data

Article

Full-text available

Mar 2023
PLOS ONE

While the cost of road traffic fatalities in the U.S. surpasses $240 billion a year, the availability of high-resolution datasets allows meticulous investigation of the contributing factors to crash severity. In this paper, the dataset for Trucks Involved in Fatal Accidents in 2010 (TIFA 2010) is utilized to classify the truck-involved crash severity where there exist different issues including missing values, imbalanced classes, and high dimensionality. First, a decision tree-based algorithm, the Synthetic Minority Oversampling Technique (SMOTE), and the Random Forest (RF) feature importance approach are employed for missing value imputation, minority class oversampling, and dimensionality reduction, respectively. Afterward, a variety of classification algorithms, including RF, K-Nearest Neighbors (KNN), Multi-Layer Perceptron (MLP), Gradient-Boosted Decision Trees (GBDT), and Support Vector Machine (SVM) are developed to reveal the influence of the introduced data preprocessing framework on the output quality of ML classifiers. The results show that the GBDT model outperforms all the other competing algorithms for the non-preprocessed crash data based on the G-mean performance measure, but the RF makes the most accurate prediction for the treated dataset. This finding indicates that after the feature selection is conducted to alleviate the computational cost of the machine learning algorithms, bagging (bootstrap aggregating) of decision trees in RF leads to a better model rather than boosting them via GBDT. Besides, the adopted feature importance approach decreases the overall accuracy by only up to 5% in most of the estimated models. Moreover, the worst class recall value of the RF algorithm without prior oversampling is only 34.4% compared to the corresponding value of 90.3% in the up-sampled model which validates the proposed multi-step preprocessing scheme. 
This study also identifies the temporal and spatial (roadway) attributes, as well as crash characteristics, and Emergency Medical Service (EMS) as the most critical factors in truck crash severity.

Handling Imbalanced Data in Road Crash Severity Prediction by Machine Learning Algorithms

Article

Full-text available

Jul 2020

Crash severity is undoubtedly a fundamental aspect of a crash event. Although machine learning algorithms for predicting crash severity have recently gained interest by the academic community, there is a significant trend towards neglecting the fact that crash datasets are acutely imbalanced. Overlooking this fact generally leads to weak classifiers for predicting the minority class (crashes with higher severity). In this paper, in order to handle imbalanced accident datasets and provide a better prediction for the minority class, the random undersampling the majority class (RUMC) technique is used. By employing an imbalanced and a RUMC-based balanced training set, we propose the calibration, validation, and evaluation of four different crash severity predictive models, including random tree, k-nearest neighbor, logistic regression, and random forest. Accuracy, true positive rate (recall), false positive rate, true negative rate, precision, F 1-score, and the confusion matrix have been calculated to assess the performance. Outcomes show that RUMC-based models provide an enhancement in the reliability of the classifiers for detecting fatal crashes and those causing injury. Indeed, in imbalanced models, the true positive rate for predicting fatal crashes and those causing injury spans from 0% (logistic regression) to 18.3% (k-nearest neighbor), while for the RUMC-based models, it spans from 52.5% (RUMC-based logistic regression) to 57.2% (RUMC-based k-nearest neighbor). Organizations and decision-makers could make use of RUMC and machine learning algorithms in predicting the severity of a crash occurrence, managing the present, and planning the future of their works.

From Local Explanations to Global Understanding with Explainable AI for Trees

Article

Full-text available

Jan 2020

Tree-based machine learning models such as random forests, decision trees and gradient boosted trees are popular nonlinear predictive models, yet comparatively little attention has been paid to explaining their predictions. Here we improve the interpretability of tree-based models through three main contributions. (1) A polynomial time algorithm to compute optimal explanations based on game theory. (2) A new type of explanation that directly measures local feature interaction effects. (3) A new set of tools for understanding global model structure based on combining many local explanations of each prediction. We apply these tools to three medical machine learning problems and show how combining many high-quality local explanations allows us to represent global structure while retaining local faithfulness to the original model. These tools enable us to (1) identify high-magnitude but low-frequency nonlinear mortality risk factors in the US population, (2) highlight distinct population subgroups with shared risk characteristics, (3) identify nonlinear interaction effects among risk factors for chronic kidney disease and (4) monitor a machine learning model deployed in a hospital by identifying which features are degrading the model’s performance over time. Given the popularity of tree-based machine learning models, these improvements to their interpretability have implications across a broad set of domains. Tree-based machine learning models are widely used in domains such as healthcare, finance and public services. The authors present an explanation method for trees that enables the computation of optimal local explanations for individual predictions, and demonstrate their method on three medical datasets.

Investigation of injury severity in urban expressway crashes: A case study from Beijing

Article

Full-text available

Jan 2020
PLOS ONE

Urban expressway is the main artery of traffic network, and an in-depth analysis of the crashes is crucial for improving the traffic safety level of expressways. This study intended to address the injury severity of expressways in Beijing by proposing Bayesian ordered logistic regression model. Crash data were collected from urban express rings and expressways in 2015 and 2016. The results showed that crash location, time and crash season are significant variables influencing injury severity. The findings revealed that the proposed model can address the ordinal feature of injury severity, while accommodating the data with small sample sizes that may not adequately represent population characteristics. The conclusions can provide the management departments with valuable suggestions for the injury prevention and safety improvement on the urban expressways.

17. A Value for n-Person Games

Chapter

Dec 1953

L. S. Shapley

Predicting Traffic Safety Risk Factors Using an Ensemble Classifier

Chapter

Oct 2018

Crash severity modelling using ordinal logistic regression approach

Article

Jul 2020

Road traffic accident is one of the major problems facing the world. The carnage on Ghana’s roads has raised road accidents to the status of a ‘public health’ threat. The objective of the study is to identify factors that contribute to accident severity using an ordinal regression model to fit a suitable model using the dataset extracted from the database of Motor Traffic and Transport Department, from 1989 to 2019. The results of the ordinal logistic regression analyses show that the nature of cars, National roads, over speeding, and location (urban or rural) are significant indicators of crash severity. Strategies to reduce crash injuries should physical enforcement through greater Police presence on our roads as well as technology. There is also the need to train drivers to be more vigilant in their travels especially on the national roads and in the urban areas. The Recommendation is, a well thought out and contextualised written laws and sanctioned schemes to monitor and enforce strict compliance with road traffic rules should be put in place.

Unobserved Heterogeneity and Temporal Instability in the Analysis of Work-Zone Crash-Injury Severities

Article

Jun 2020

In the state of Florida, work-zone related crashes and their resulting injury severities have been increasing recently, particularly over the 2015 to 2017 time period. In the current study, we seek to provide insights into the factors that have been influencing this trend. Using work zone data from the 2012 to 2017 time period, resulting driver-injury severities in single-vehicle work zone crashes were studied using random parameters logit models that allow for possible heterogeneity in the means and variances of parameter estimates. The available data included a wide variety of factors known to influence driver injury severity including data related to the crash characteristics, vehicle characteristics, roadway attributes, prevailing traffic volume, driver characteristics, and spatial and temporal characteristics. The model estimates produced significantly different parameters for each of the year from 2012 to 2017, and a fundamental shift in unobserved heterogeneity, suggesting statistically significant temporal instability. In addition, in several key instances, the marginal effects of individual parameter estimates show marked differences between one year and the next. However, these findings may not be the sole result of variations in driver behavior over time as has been argued in past research that has found temporal instability. This is because each work zone has a unique set of characteristics and, with the sample of work zones changing from one year to the next as highway maintenance and construction is undertaken in different locations, this work-zone sample variation could be a substantial source of the observed temporal instability.

Machine learning methods are comparable to logistic regression techniques in predicting severe walking limitation following total knee arthroplasty

Article

Dec 2019
KNEE SURG SPORT TR A

Purpose: Machine-learning methods are flexible prediction algorithms with potential advantages over conventional regression. This study aimed to use machine learning methods to predict post-total knee arthroplasty (TKA) walking limitation, and to compare their performance with that of logistic regression. Methods: From the department's clinical registry, a cohort of 4026 patients who underwent elective, primary TKA between July 2013 and July 2017 was identified. Candidate predictors included demographics and preoperative clinical, psychosocial, and outcome measures. The primary outcome was severe walking limitation at 6 months post-TKA, defined as a maximum walk time ≤ 15 min. Eight common regression (logistic, penalized logistic, and ordinal logistic with natural splines) and ensemble machine learning (random forest, extreme gradient boosting, and SuperLearner) methods were implemented to predict the probability of severe walking limitation. Models were compared on discrimination and calibration metrics. Results: At 6 months post-TKA, 13% of patients had severe walking limitation. Machine learning and logistic regression models performed moderately [mean area under the ROC curves (AUC) 0.73-0.75]. Overall, the ordinal logistic regression model performed best while the SuperLearner performed best among machine learning methods, with negligible differences between them (Brier score difference, < 0.001; 95% CI [-0.0025, 0.002]). Conclusions: When predicting post-TKA physical function, several machine learning methods did not outperform logistic regression, in particular ordinal logistic regression that does not assume linearity in its predictors. Level of evidence: Prognostic level II.

Severity analysis for large truck rollover crashes using a random parameter ordered logit model

Article

Dec 2019
ACCIDENT ANAL PREV

Large truck rollover crashes present significant financial, industrial, and social impacts. This paper presents an effort to investigate the contributing factors to large truck rollover crashes. Specific focus was placed on exploring the role of heterogeneity and the potential sources of heterogeneity regarding their impacts on injury severity outcomes. The data used in this study contained large truck rollover crashes that occurred between 2007 and 2016 in the state of Florida. A random parameter ordered logit (RPOL) model was applied. Various driver, vehicle, roadway, and crash attributes were explored as potential predictors in the model. Their impacts were examined for the presence of heterogeneity. Interaction effects were then added to the random variables in order to detect potential sources of heterogeneity. Model results showed that the impacts of lighting conditions and driving speed had significant variation across observations, and this variation could be attributed to driver actions and driver conditions at the time of the crash, as well as driver vision obstruction. Findings from this study shed light on the direction, magnitude, and randomness of the factors that contribute to large truck rollover crashes. Findings associated with heterogeneity could help develop more effective and targeted countermeasures to improve freight safety. Driver education programs could be planned more efficiently, and advisory and warning signs could be designed in a more insightful manner by taking into account specific roadway attributes, such as sandy surfaces, downhill, curved alignment, unpaved shoulders, and lighting conditions.

Enhancing Crash Injury Severity Prediction on Imbalanced Crash Data by Sampling Technique with Variable Selection

Conference Paper

Oct 2019
