Advancing Proactive Crash Prediction: A Discretized Duration Approach for Predicting Crashes and Severity

Diwas Thapa 1, Sabyasachee Mishra 1,*, Nagendra R. Velaga 2, Gopal R. Patil 2

1 Department of Civil Engineering, University of Memphis, Memphis, TN 38152, United States
2 Department of Civil Engineering, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India

* Corresponding author. Tel: +1-(901)-678-5043. Email addresses: dthapa@memphis.edu (D. Thapa), smishra3@memphis.edu (S. Mishra), n.r.velaga@iitb.ac.in (N. R. Velaga), gpatil@iitb.ac.in (G. R. Patil)
Abstract
Driven by advancements in data-driven methods, recent developments in proactive crash prediction models
have primarily focused on implementing machine learning and artificial intelligence. However, from a
causal perspective, statistical models are preferred for their ability to estimate effect sizes using variable
coefficients and elasticity effects. Most statistical framework-based crash prediction models adopt a case-
control approach, matching crashes to non-crash events. However, accurately defining the crash-to-non-
crash ratio and incorporating crash severities pose challenges. Few studies have ventured beyond the case-
control approach to develop proactive crash prediction models, such as the duration-based framework. This
study extends the duration-based modeling framework to create a novel framework for predicting crashes
and their severity. Addressing the increased computational complexity resulting from incorporating crash
severities, we explore a tradeoff between model performance and estimation time. Results indicate that a
15% sample drawn at the epoch level achieves a balanced approach, reducing data size while maintaining
reasonable predictive accuracy. Furthermore, stability analysis of predictor variables across different
samples reveals that variables such as Time of day (Early afternoon), Weather condition (Clear), Lighting
condition (Daytime), Illumination (Illuminated), and Volume require larger samples for more accurate
coefficient estimation. Conversely, Time of day (Early morning, Late morning, Late afternoon), Lighting
condition (Dark lighted), Terrain (Flat), Land use (Commercial, Rural), Number of lanes, and Speed
converge towards true estimates with small incremental increases in sample size. The validation reveals
that the model performs better in highway segments experiencing more frequent crashes (segments where
the duration between crashes is less than 100 hours, or approximately 4 days).
Keywords: crash likelihood; crash severity likelihood; survival model; choice model; proactive safety
performance function; predictor stability; real-time prediction validation
1. Introduction
Crash prediction models can be categorized into two main types: diagnostic crash prediction models, also
known as reactive crash prediction models, and proactive or real-time crash prediction models. These two
types of prediction models differ in their application and the variables they incorporate. Reactive crash
prediction models rely on historical crash data, as well as static covariates (variables that do not change
over time) and dynamic covariates (variables that do change over time), aggregated over a specific period.
Examples of such dynamic covariates include Average Annual Daily Traffic and average speed. These
models are valuable for developing safety performance functions, which help identify the precursors of
crashes and evaluate the impact of safety interventions and policies on highway safety (Yasmin et al., 2018).
On the other hand, proactive crash prediction models refer to real-time crash prediction models that utilize
historical crash data and static covariates, such as roadway condition and roadway geometry, along with
disaggregated dynamic covariates that vary with time. These dynamic covariates can include traffic volume,
speed, and weather conditions collected in near real-time. By incorporating dynamic predictors, these
models can account for changing traffic and weather conditions, allowing for the forecasting of the
likelihood of future crashes in real time. This, in turn, enables the implementation of crash mitigation
strategies.
Proactive crash prediction models have garnered significant attention from researchers in recent
years due to their potential to forecast and prevent future crashes. The availability of granular traffic flow
data, such as near real-time traffic flow data collected at small time intervals, from Intelligent
Transportation System infrastructure, coupled with the computational performance of modern computers,
has played a crucial role in increasing the popularity of these models. Modern data-driven methods, such
as Machine Learning (ML), have gained popularity as they replace traditional statistical models which are
often relatively more difficult to fit (Mannering et al., 2020). Data-driven methods have demonstrated
superior data fit and predictive capabilities as they are not constrained by assumptions inherent to traditional
econometric frameworks, such as statistical distribution and variable correlation. However, data-driven
methods have their own limitations too. They struggle with problems related to model transferability,
generalization, and the inability to quantify variable effects. In this context, statistical econometric
frameworks, through variable coefficients and elasticities, can quantify variable effects and provide model
transferability and generalization. In these respects, statistical models can be considered superior to data-
driven methods.
Due to the benefits offered by statistical econometric frameworks, there are ongoing efforts to
enhance and refine traditional statistical approaches to address their limitations and apply them to proactive
modeling. For instance, researchers have extended standard econometric frameworks by incorporating
flexible structures to develop mixed and generalized models. These models can account for unobserved
heterogeneity and hierarchical structures for variable correlations and dependencies. More recently,
researchers developed and implemented a new crash prediction framework (Thapa et al., 2022): a
duration-based crash prediction model that combines elements of the survival model and the Multinomial
Logit (MNL) model. In this modeling approach, the time duration between
crashes is divided into 1-hour epochs, which are further subdivided into four 15-minute time intervals. Each
epoch between two consecutive crashes is treated as a separate observation, with the time intervals serving
as choice alternatives. By adopting this approach, the framework can forecast the likelihood of future
crashes by considering two types of covariates. Firstly, static covariates associated with crashes, such as
highway geometry and environmental conditions, are repeated over each epoch. Secondly, dynamic
covariates, such as traffic flow and speed, change across epochs and within the 15-minute time intervals.
The authors of the study discovered that the duration-based model could generate reasonably accurate
estimates even when dealing with small sample sizes.
The current study builds upon the duration-based model by incorporating crash severities. While
prediction of crash occurrence has already been addressed in previous research, forecasting the likelihood of
different crash severities is crucial from multiple perspectives, including safety, economic, and planning
considerations. The costs associated with crashes vary significantly depending on their severity. For
instance, the comprehensive unit cost of a Property Damage Only (PDO) crash in the US was estimated to
be around $12,000 in 2016, whereas a fatal crash was estimated to exceed $11 million (Harmon et al.,
2018). Additionally, crash severities are linked to road user costs. Studies have indicated that more severe
crashes require more time to clear, resulting in higher road user costs (Golob et al., 1987; J.-T. Lee & Fazio,
2005). Therefore, prioritizing the identification and addressing of factors contributing to more severe
crashes is crucial from both safety and economic perspectives. Furthermore, from a planning standpoint,
the ability to forecast crash severities provides transportation agencies with valuable insights. Agencies are
often constrained with limited resources and personnel, making it necessary to identify critical segments in
advance and proactively address adverse traffic flow conditions. By forecasting crash severities, agencies
can prioritize the allocation and deployment of resources and personnel to prevent severe crashes and
mitigate their impacts, contributing to more efficient and effective traffic operations and planning.
2. Literature review
Research in crash prediction has focused on forecasting both crash occurrences and severities. In the
following sections, we provide a literature review of prediction models based on the specific outcomes they
forecast. While we will discuss both proactive and reactive crash prediction models, this review will place
greater emphasis on proactive crash prediction models, as they align with the scope of our study.
2.1. Crash prediction models
The first group of studies focuses on real-time forecasting of future crashes, employing both data-
driven and statistical methods. Researchers have utilized various approaches to develop these models. Data-
driven methods have gained popularity in the literature, with several notable examples including Support
Vector Machines (Sun & Sun, 2016; Yu & Abdel-Aty, 2013), decision trees and random forests (Beshah et
al., 2011; Pham et al., 2010), neural networks (Li et al., 2020), and Bayesian statistics (Hossain &
Muromachi, 2012; Zheng & Sayed, 2020). These data-driven methods have proven effective in capturing
complex relationships and patterns in crash data, allowing for real-time forecasting of future crash
occurrences.
On the statistical side, the case-control design approach has been the most popular method for
developing proactive crash prediction models (Hossain et al., 2019). In this approach, crashes are matched
with non-crash events based on specific variables such as location and time of the crash (Abdel-Aty et al.,
2004). The resulting dataset, with binary outcomes indicating crash or non-crash events, is well-suited for
binary logistic regression. However, researchers have also explored the use of data-driven methods and
Bayesian statistics to enhance the modeling capabilities of this approach (Hossain et al., 2019). In addition
to the traditional case-control approach, alternative methodologies have been proposed. For example,
Yasmin et al. (2018) developed an MNL model that considered 5-minute intervals for the next 30 days as choice
alternatives, representing the occurrence of crashes in future time intervals. Given the substantial number
of choice alternatives, the authors employed sampling techniques (selecting 29 randomly sampled time
intervals and 1 interval with a crash) from the 30-day period.
More recently, researchers implemented a real-time crash prediction model by combining the survival
model with the MNL model. Survival models, or duration models, have been employed to model traffic
crashes using static data (e.g., Jovanis & Chang, 1989; Thapa & Mishra, 2021); however, they are
incapable of incorporating time-varying covariates. The researchers developed a new method to restructure
the crash data by creating forecasting epochs and time-intervals that can be associated with the dynamic
covariates (Thapa et al., 2022).
2.2. Crash severity prediction models
The second group of studies focuses on predicting crash severity. Data-driven methods have been used
more often to forecast crash severities, with various approaches utilized in different studies. Deep learning
methods have been applied in crash severity prediction (Rahim & Hassan, 2021), while Support Vector
Machines have been utilized in studies by Chen et al. (2016) and Iranitalab & Khattak (2017). Random forests
have also been used as a predictive technique for crash severity forecasting (Iranitalab & Khattak, 2017).
Other methods such as neural networks and decision trees have been explored in some studies (J. Lee et al.,
2019; Ospina-Mateus et al., 2021; C. Zhang et al., 2020). In recent years, a significant focus has been placed
on comparing the performance of these algorithms in crash severity prediction (Santos et al., 2022). It is
important to note that most prediction models within this group are reactive in nature, aiming to predict
crash severity based on historical data and established patterns.
The most common statistical approach for developing crash severity prediction models is applying
discrete choice models, specifically multinomial and ordered response logit/probit models. However, more
advanced statistical models such as random parameter mixed models have gained popularity among
researchers in recent years, as they offer solutions to the fixed parameter restriction imposed by choice
models. Uncorrelated random parameter models (Fountas & Anastasopoulos, 2017) correlated random
parameter models (Ahmed et al., 2021; Fountas & Anastasopoulos, 2017), and generalized ordered response
models (Osman et al., 2019; Osman, Mishra, et al., 2018; Osman, Paleti, et al., 2018; Yasmin et al., 2014)
are some of the examples of these advanced statistical models. These models enable researchers to account
for parameter variations across different observations, providing more flexibility in capturing the
complexity of crash severity prediction. Another approach for crash severity prediction involves the use of
sequential models that can account for the dependency between various levels of crash severities. Studies
have explored the application of sequential models in crash severity prediction, allowing for the
consideration of dependencies between crash severities (Dissanayake & Lu, 2002; Jung et al., 2010).
With the advent of advanced models, researchers have conducted studies to examine and compare
their predictive performance. For instance, Yasmin & Eluru (2013) compared different generalized and
mixed models within the frameworks of ordered and unordered choice modeling. Their findings indicated
that mixed generalized ordered logit and mixed MNL models showed promise in predicting crash injury
severity. In a study by J. Zhang et al. (2018), various statistical and machine learning methods were
compared, and it was found that machine learning algorithms exhibited better performance. This
difference could be attributed to restrictions of the statistical models, such as the linear utility function
and parametric assumptions regarding the error term. Cerwick et al. (2014) conducted a comparison between mixed MNL and latent
class MNL models. Their analysis revealed that the former model provided better average predictions across
different severity levels.
2.3. Models predicting crash frequency and severity
The final group of studies focuses on forecasting both crashes and their severity. However, it is important
to note that most of these models are primarily designed to forecast crash frequencies rather than the
presence or absence of crashes.
Multivariate count data models are commonly employed in these studies (Jonathan et al., 2016; Ma &
Kockelman, 2006; Park & Lord, 2007). Additionally, random parameter count data models have been used
to account for spatial and temporal heterogeneity (Barua et al., 2016; Cheng et al., 2017; Dong et al., 2014).
Other studies have implemented joint models with two components: (i) a crash prediction component
utilizing count data models, and (ii) a crash severity component employing discrete choice models to
predict crash counts by severity. This approach has been employed by, among others, Afghari et al. (2020),
Pei et al. (2011), and Yasmin & Eluru (2018).
The sequential logit model has also been used to predict the likelihood and severity of crashes. Xu
et al. (2013) developed a model using sequential binary logit models, where crashes were modeled in three
stages: Stage 1 (crash vs. non-crash), Stage 2 (property damage only vs. higher severities), and Stage 3
(non-capacitating vs. higher severities). However, a significant drawback of the sequential logit model in
the context of proactive crash prediction is that the estimation of multiple models can be computationally
demanding and time-consuming, making it impractical for large datasets.
3. Study contributions
Only a limited number of statistical approaches have been developed to date for proactive crash prediction,
apart from the commonly used case-control approach. This study introduces a duration-based prediction
model for both crash occurrence and crash severity. The model framework involves dividing the time
duration between historical crashes into distinct time periods to create forecasting epochs and time intervals.
This allows the model to incorporate dynamic covariates and ascertain the probability of crashes occurring
in future epochs and time intervals (Thapa et al., 2022). While this modeling approach has previously been
demonstrated for crash prediction, the current study extends the framework to incorporate crash severities.
The major contributions of this paper can be summarized as follows.
1. We expand upon the duration-based proactive crash prediction model by introducing a novel modeling
approach that can forecast both crash occurrence and severity. Our model framework is one of a handful of
statistical approaches for proactive crash prediction that do not rely on the case-control approach
(Thapa et al., 2022). Unlike the original model, which solely predicts the likelihood of crashes for
discrete future time intervals, our proposed model can also predict the corresponding crash severities.
Furthermore, the proposed model is implemented using a larger dataset. Specifically, the model
is applied to crash data collected from interstates in two cities in Tennessee, thereby achieving a broader
geographical coverage in comparison to the previous study that focused on a single city. This expanded
geographical scope enhances the generalizability of the crash predictors, as it ensures adequate
representation of diverse roadway conditions and traffic patterns across the study areas.
2. The proposed modeling framework demands discretizing the time duration between crashes to create
forecasting epochs (more on this in the next section). Consequently, the size of the initial crash
data expands significantly. Prior studies have indicated that appropriate sampling techniques can
address estimation complexities arising from large data size, thereby allowing for parameter estimation
with a reasonable degree of accuracy (Thapa et al., 2022). However, the incorporation of crash
severities adds an additional layer of complexity to the model estimation process.
Therefore, this study aims to investigate the influence of sample size on variable coefficients
and identify variables that are sensitive to changes in sample size. Understanding the variables that are
particularly impacted by sample size variations is crucial for the implementation of the model.
Additionally, this information will play a pivotal role in assessing the reliability of the model and
guiding future data collection efforts.
4. Methodology
In this section, we present the methodology under three distinct subsections: the duration-based prediction
framework, the nested logit model, and the estimation of the nested logit model. First, we describe the
duration-based prediction framework and the process of creating forecasting epochs. This section is
followed by the introduction of the two-level nested logit model and its relationship with the duration-based
crash prediction framework. Finally, we discuss the estimation processes used in this study to estimate the
parameters of the models.
4.1. Duration based prediction framework
In the duration-based crash prediction model, the occurrence of a crash at any time interval dt can be
modeled using the MNL framework with alternatives, n, and the hazard rate, h, given by
$h(t) = P(t \le T < t + dt \mid T \ge t)$ (Thapa et al., 2022). By utilizing this relationship, the latent
propensity function for each time interval can be expressed as a function of static and dynamic covariates
(time-varying factors). The application of this concept is illustrated in the following example.
Example:
Consider the duration between crashes in a highway segment, denoted as s, which is discretized into epochs,
denoted as e, each with time intervals, denoted as i, and each interval has a duration of dt. Using these
indices, we can examine historical crash data for a roadway segment, s=1, where three consecutive crashes,
denoted as A1, A2, and A3, were observed with durations of 2.5 hours and 0.5 hours apart (see Table 1(a)).
Additionally, dynamic covariates, speed and volume, are available for the segment and the crash year at a
temporal resolution of dt, as shown in Table 1(b). These covariates, as depicted, exhibit time-varying
characteristics.
For discretization, let us choose e=1 hour and dt=0.25 hours. Therefore, the number of time
intervals in an epoch is C=4, each interval identified by the index i = (1, 2, 3, 4). After discretization, the
forecasting epochs are created as shown in Table 1(c). Each epoch consists of four 15-minute intervals, and
an additional C+1th column called "Next epoch" is added, indicating whether the next crash occurred in the
current or future epoch (0 if in the current epoch, 1 if in future epochs). Based on the table, we can express
the time elapsed since the previous crash using the equation $t = (e - 1)(1) + i \cdot dt$, where 1 is the epoch
length in hours. For example, the time between crashes A1 and A2 can be determined as
$t = (3 - 1)(1) + 2(0.25) = 2.5$ hours.
As shown in the table, the dynamic covariate Speed varies across different time periods. The static covariate
Terrain, in this example, is not simply repeated across the time intervals of a crash. Instead, to account for
the effect of time, the variable is multiplied by the elapsed time, t. For instance, the Terrain variable for the
first time-interval is 0.25 multiplied by 1, for the second time interval it is 0.5 multiplied by 1, and so on.
Therefore, all variables vary across epochs and time-intervals. The final data obtained after the creation of
forecasting epochs takes the form of panel data with repeated observations for each crash corresponding to
the forecasting epochs.
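For concreteness, this restructuring can be sketched in R, the language used for the model computations in this study. The function below is a minimal illustration, not the authors' implementation; it returns the expanded records in long format, with one row per 15-minute interval rather than one row per epoch as in Table 1(c).

```r
# Expand one inter-crash duration into forecasting epochs and 15-min intervals
make_epochs <- function(duration_hr, epoch_hr = 1, dt_hr = 0.25) {
  n_intervals    <- epoch_hr / dt_hr                 # C = 4 intervals per epoch
  crash_epoch    <- ceiling(duration_hr / epoch_hr)  # epoch containing the next crash
  crash_interval <- (duration_hr - (crash_epoch - 1) * epoch_hr) / dt_hr
  do.call(rbind, lapply(seq_len(crash_epoch), function(e) {
    ind <- integer(n_intervals)
    if (e == crash_epoch) ind[crash_interval] <- 1   # interval in which the crash falls
    data.frame(epoch = e, interval = seq_len(n_intervals),
               crash = ind, next_epoch = as.integer(e < crash_epoch))
  }))
}

make_epochs(2.5)  # crash A1 to A2: 3 epochs x 4 intervals, crash in epoch 3, interval 2
```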
A few observations can be made from Table 1(c), particularly regarding the increase in data size
after the creation of forecasting epochs. The final data size is influenced by three factors. The first factor is
the size of the original crash data. The more crashes are observed, the larger the data size will be after
creating forecasting epochs. The second factor is the choice of discretization. When a smaller time
discretization is chosen, more detailed information regarding traffic flow can be obtained. However, this
also leads to a considerable increase in data size. The third factor is the distribution of inter-crash duration.
If the inter-crash durations are longer, more forecasting epochs will be created, resulting in a larger data
size. Considering these factors, implementing a model for a wide geographical area with small
discretization can become computationally demanding. Even a slight reduction in time discretization
significantly increases computational complexity. To reduce computational complexity, it is suggested to
use a smaller sample of the expanded data drawn at the epoch level for model training (Thapa et al., 2022).
Now, based on the example provided, the latent propensity function for crash severities, k, observed
at a particular time interval, i, can be represented as a function of time since crash and the static and
dynamic covariates using the utility function $U_{ki}$ in equation (1).

$$U_{ki} = \alpha_k t + \boldsymbol{\beta}_k' \mathbf{X}_{ei} + \varepsilon_{ki} \quad (1)$$
In equation (1), the coefficient $\alpha_k$ represents the impact of duration on crash severity. The vector of
covariates, $\mathbf{X}_{ei}$, captures the effect of covariates, with its values varying across epochs and time intervals.
The corresponding vector of coefficients is denoted by $\boldsymbol{\beta}_k'$. Similarly, if we assume that the latent propensity
function for crash occurrences at any time interval, i, consists of only an intercept term, the utility equation
for each alternative can be formulated using equation (2).

$$U_i = \delta_i + \varepsilon_i \quad (2)$$
It is worth noting here that, as shown in Table 1(c), the occurrence of a crash at a specific time interval is
dependent on crashes not occurring in previous time intervals. This conditional probability of observing a
crash in a particular time interval within an epoch can be expressed using a random variable $T_s$ as follows.

$$P(T_s = i \mid T_s \ge e) = \frac{\exp(U_i)}{\sum_{j=1}^{C+1} \exp(U_j)} \quad (3)$$

The resulting unconditional probability of a crash at any time interval can be obtained by multiplying the
conditional probability in equation (3) with the cumulative product of the probabilities of the (C+1)th
alternative for all epochs preceding epoch e, as represented by equation (4).

$$P(T_s = e, i) = \frac{\exp(U_i)}{\sum_{j=1}^{C+1} \exp(U_j)} \prod_{m=1}^{e-1} \frac{\exp(U_{C+1})}{\sum_{j=1}^{C+1} \exp(U_j)} \quad (4)$$
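As a computational illustration (not the authors' code), equations (3) and (4) translate directly into R; the inputs are assumed to be a matrix U of interval utilities with one row per epoch, and a vector U_next of utilities for the (C+1)th alternative, both hypothetical.

```r
# Conditional probability of a crash in interval i of epoch e, equation (3)
p_conditional <- function(U, U_next, e, i) {
  exp(U[e, i]) / (sum(exp(U[e, ])) + exp(U_next[e]))
}

# Unconditional probability, equation (4): survive epochs 1..e-1, then crash at (e, i)
p_unconditional <- function(U, U_next, e, i) {
  survive <- function(m) exp(U_next[m]) / (sum(exp(U[m, ])) + exp(U_next[m]))
  p_survive <- if (e > 1) prod(sapply(seq_len(e - 1), survive)) else 1
  p_survive * p_conditional(U, U_next, e, i)
}
```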
Table 1(a). Historical crash data with static covariates.

| Crash | Date of crash | Time of crash | Severity | Terrain (Flat=1, Rolling=0) |
|---|---|---|---|---|
| A1 | 1/1/2023 | 00:00 | Fatal | 1 |
| A2 | 1/1/2023 | 02:30 | PDO | 1 |
| A3 | 1/1/2023 | 03:00 | Injury | 1 |

Table 1(b). Dynamic covariates averaged for 15-min intervals: Vehicle speed (in mph).

| Date and time | Speed | Date and time | Speed |
|---|---|---|---|
| 1/1/2023 00:15 | 49 | 1/1/2023 02:15 | 51 |
| 1/1/2023 00:30 | 51 | 1/1/2023 02:30 | 50 |
| 1/1/2023 00:45 | 50 | 1/1/2023 02:45 | 50 |
| 1/1/2023 01:00 | 49 | 1/1/2023 03:00 | 51 |
| 1/1/2023 01:15 | 47 | 1/1/2023 03:15 | 49 |
| 1/1/2023 01:30 | 50 | 1/1/2023 03:30 | 48 |
| 1/1/2023 01:45 | 48 | 1/1/2023 03:45 | 47 |
| 1/1/2023 02:00 | 49 | 1/1/2023 04:00 | 48 |
Table 1(c). Final crash data after creating forecasting epochs.

| s | ID | Time to crash (hr) | Epoch | 15-min intervals (1, 2, 3, 4) | Next epoch | Speed, mph (1, 2, 3, 4) | Severity | Terrain × t (1, 2, 3, 4) |
|---|---|---|---|---|---|---|---|---|
| 1 | A1 | 2.5 | 1 | 0, 0, 0, 0 | 1 | 49, 51, 50, 49 | Fatal | 0.25, 0.50, 0.75, 1.00 |
| 1 | A1 | 2.5 | 2 | 0, 0, 0, 0 | 1 | 47, 50, 48, 49 | Fatal | 1.25, 1.50, 1.75, 2.00 |
| 1 | A1 | 2.5 | 3 | 0, 1, 0, 0 | 0 | 51, 50, 50, 51 | Fatal | 2.25, 2.50, 2.75, 3.00 |
| 1 | A2 | 0.5 | 1 | 0, 1, 0, 0 | 0 | 49, 48, 47, 48 | PDO | 0.25, 0.50, 0.75, 1.00 |
4.2. Nested logit model
As discussed previously, the crash outcomes in the example are characterized by: (i) the occurrence of crashes, or the
time interval when a crash happens, and (ii) the severity of the crash that happened at a certain interval.
These outcomes can be effectively modeled using a two-level nested logit model, as depicted in Fig 1. In
this model, the time intervals, i and an additional alternative (C+1) serve as nodes representing the upper-
level choice alternatives, while the crash severities correspond to the lower-level alternatives. It is important
to note that the crash severities at each time interval are conditional upon the occurrence of a crash within
that interval. For simplicity, assume the severity levels comprise two categories, denoted by k =
(F/I, PDO), where F/I represents Fatal or Injury crashes, and PDO represents Property Damage Only
crashes. The conditional choice probability of the lower-level alternatives, k given the upper-level
alternatives, i, can be expressed as follows.

$$P(k, i) = P(k \mid i) \, P(i) \quad (5)$$

where,

$$P(k \mid i) = \frac{\exp\!\left(U_{ki}/\lambda\right)}{\sum_{k'} \exp\!\left(U_{k'i}/\lambda\right)} \quad (6)$$

$$P(i) = \frac{\exp\!\left(\delta_i + \lambda I_i\right)}{\sum_{j=1}^{C} \exp\!\left(\delta_j + \lambda I_j\right) + \exp\!\left(U_{C+1}\right)} \quad (7)$$

$$I_i = \ln \sum_{k} \exp\!\left(U_{ki}/\lambda\right) \quad (8)$$
The parameter $\lambda$ in equations (6), (7), and (8) represents the logsum parameter or nesting coefficient, which
captures the underlying correlations for alternatives within a nest. $I_i$ in equation (8) is the inclusive value
for nodes in the upper level. However, the (C+1)th alternative, Next epoch, lacks the logsum parameter due
to its degenerate branch. Consequently, the probability of this alternative can be determined using the
following equation.

$$P(C+1) = \frac{\exp\!\left(U_{C+1}\right)}{\sum_{j=1}^{C} \exp\!\left(\delta_j + \lambda I_j\right) + \exp\!\left(U_{C+1}\right)} \quad (9)$$
The probability of F/I crashes in equation (6) can be obtained by substituting the value of $U_{ki}$ from equation
(1), assuming PDO crashes as the reference case. Similarly, equation (7) gives the probability of the upper-level
alternatives, which is equivalent to equation (3) and can be rewritten as equation (10).

$$P(T_s = e, i) = \frac{\exp\!\left(\delta_i + \lambda I_i\right)}{\sum_{j=1}^{C} \exp\!\left(\delta_j + \lambda I_j\right) + \exp\!\left(U_{C+1}\right)} \prod_{m=1}^{e-1} \frac{\exp\!\left(U_{C+1}\right)}{\sum_{j=1}^{C} \exp\!\left(\delta_j + \lambda I_j\right) + \exp\!\left(U_{C+1}\right)} \quad (10)$$
Assuming each row in the crash data after the creation of forecasting epochs is represented using the superscript
n, the log-likelihood function for the two-level nested logit model can be expressed as the sum of two
components using equation (11). The first and second components of the equation are associated with the
lower- and upper-level alternatives, respectively (Brownstone & Small, 1989). The parameters of the two-level
nested logit model are estimated by maximizing this equation.

$$LL = \sum_{n} \ln P(k \mid i)^n + \sum_{n} \ln P(i)^n \quad (11)$$
Fig 1. Two-level nested structure of crash occurrence and severity.
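To make the nested structure concrete, the probabilities in equations (5) through (9) can be evaluated jointly. The sketch below is illustrative only; it assumes a matrix U_sev of severity utilities (one row per interval, one column per severity), a vector delta of interval intercepts, a scalar U_next for the degenerate branch, and a common logsum parameter lambda.

```r
nested_probs <- function(U_sev, delta, U_next, lambda) {
  IV  <- log(rowSums(exp(U_sev / lambda)))            # inclusive values, equation (8)
  den <- sum(exp(delta + lambda * IV)) + exp(U_next)  # shared denominator
  P_i    <- exp(delta + lambda * IV) / den            # upper level, equation (7)
  P_next <- exp(U_next) / den                         # degenerate branch, equation (9)
  P_k_i  <- exp(U_sev / lambda) / rowSums(exp(U_sev / lambda))  # equation (6)
  list(P_interval = P_i, P_next_epoch = P_next,
       P_joint = P_i * P_k_i)                         # equation (5)
}

# Example: C = 4 intervals, two severities (F/I, PDO), illustrative parameter values
nested_probs(U_sev = matrix(0, 4, 2), delta = rep(-8.7, 4), U_next = 0, lambda = 4.36)
```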
4.3. Estimation of the nested logit model
There are several methods available for estimating parameters in nested logit models, with sequential
estimation and simultaneous estimation being the most cited approaches. In sequential estimation, the first
component of the log-likelihood function (equation 11) is maximized to estimate the parameters in the
lower level. This step provides estimates of the coefficients scaled by their respective nesting parameter $\lambda_i$.
To simplify the process, the nesting parameters can be assumed to be constant for all nodes, represented as
$\lambda$. In the next step, inclusive values are calculated for each node using the scaled estimates obtained
from the lower level. These inclusive values are then used in the second component of the log-likelihood
function, which is maximized to obtain the value of $\lambda$ and the intercepts for the upper level. It is important
to note that while sequential estimation allows for the estimation of parameters in a stepwise manner, the
estimates obtained are consistent but not efficient, because the scaled parameters from the lower level are
substituted to find parameters in the upper level. An alternative approach is simultaneous estimation, where
parameters in both levels are estimated simultaneously using a non-linear maximization algorithm. This
method is more rigorous compared to sequential estimation, and the estimates obtained are both consistent and efficient.
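For concreteness, the sequential procedure can be sketched as a thin wrapper around R's optim(); the two negative log-likelihood functions and the inclusive-value routine are user-supplied placeholders corresponding to the components of equation (11), not the authors' implementation.

```r
sequential_estimate <- function(negLL_lower, negLL_upper, inclusive_values,
                                start_lower, start_upper) {
  # Step 1: maximize the lower-level component; coefficients come out scaled by lambda
  fit_lower <- optim(start_lower, negLL_lower, method = "BFGS")
  # Step 2: inclusive value for every upper-level node from the scaled estimates
  IV <- inclusive_values(fit_lower$par)
  # Step 3: maximize the upper-level component, holding the inclusive values fixed,
  # to recover lambda and the interval intercepts
  fit_upper <- optim(start_upper, function(p) negLL_upper(p, IV), method = "BFGS")
  list(lower = fit_lower, upper = fit_upper)
}
```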
5. Data
5.1. Data source and preparation
The estimation and validation of the two-level nested logit model were carried out using data gathered from
two main sources. First, historical crash data for the year 2019 was obtained from the Enhanced Tennessee
Roadway Information Management System (ETRIMS). This dataset provided information on various crash
characteristics such as the date, time, severity, and coordinates of the crash location, as well as details on
static covariates such as highway geometry, weather conditions, lighting conditions, land use, and terrain
characteristics. The dynamic covariates for the study, namely traffic flow and speed, were obtained from
the Radar Data System (RDS) stations located along the highway segments from which the historical crash
data was collected. Since our study aimed to implement a practical time discretization with 15-minute
intervals, the RDS data was collected specifically for these 15-minute intervals. To match the RDS data
with the corresponding crashes, a geospatial mapping approach was employed, aligning the RDS stations
with their respective highway segments.
It is important to note that RDS coverage in Tennessee is limited to its major cities, including
Memphis, Nashville, Chattanooga, and Knoxville. Therefore, for the purposes of this study, the segments
of interstates within the city limits of Memphis and Chattanooga were considered. Specifically, the selected
segments included I-40 and I-55 in Memphis, and I-24 and I-75 in Chattanooga.
For this study, the interstates were divided into segments based on four criteria including the
direction of traffic, number of lanes, posted speed limit, and terrain type. The segmentation details of the
interstates are provided in Table 2. The table includes information on the total number of segments, their
lengths in both directions, and the frequency of crashes observed within each segment. In total, the dataset
consisted of 2,375 crashes. Table 3 presents a breakdown of the crash frequencies based on various
categorical variables. Additionally, the table includes descriptive statistics for the continuous variables in
the dataset. The table provides a comprehensive overview of the data, highlighting the distribution of
crashes across different segments and variable categories.
In this study, the 15-minute traffic volumes were scaled to a range between 0 (minimum value) and
1 (maximum value). This scaling process was applied to avoid the potential influence of larger volumes on
the model training process. The duration between crashes exhibited a right-skewed distribution, as indicated
by the mean of 516.67 hours (about 3 weeks) being greater than the median of 230.46 hours (about 1 and a
half weeks). This right skew indicates that most inter-crash durations are relatively short, with occasional
long gaps inflating the mean. A visual representation of the distribution of inter-crash duration for the four
interstates is presented by a density plot in Fig 2. The density plot provides a graphical representation of
the distribution, highlighting the shape and spread of the duration between crashes for each interstate.
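The two preparation steps described above, min-max scaling of the 15-minute volumes and computation of inter-crash durations from ordered crash timestamps, reduce to a few lines of R; the sketch below reuses the timestamps from the Section 4.1 example.

```r
# Min-max scaling of traffic volumes to the [0, 1] range
scale01 <- function(x) (x - min(x)) / (max(x) - min(x))
scale01(c(120, 300, 480))        # -> 0.0, 0.5, 1.0

# Inter-crash durations (hours) on one segment from ordered crash timestamps
times <- as.POSIXct(c("2023-01-01 00:00", "2023-01-01 02:30", "2023-01-01 03:00"),
                    tz = "UTC")
diff(as.numeric(times)) / 3600   # -> 2.5, 0.5 (the Section 4.1 example)
```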
Table 2. Summary of interstate segmentation.

| Interstate | City | Number of segments | Length (mi) | Number of crashes |
|---|---|---|---|---|
| I-40 | Memphis | 146 | 21.51 | 905 |
| I-55 | Memphis | 94 | 12.28 | 268 |
| I-24 | Chattanooga | 48 | 14.71 | 675 |
| I-75 | Chattanooga | 70 | 13.29 | 527 |
| Total | | 358 | 61.79 | 2,375 |
Table 3. Descriptive statistics of crash characteristics.

| Categorical variables | Frequency of crashes | Relative abundance |
|---|---|---|
| Time of day | | |
| Early morning (6 a.m. to 9 a.m.) | 447 | 18.82% |
| Late morning (9 a.m. to 12 p.m.) | 262 | 11.03% |
| Early afternoon (12 p.m. to 3 p.m.) | 351 | 14.78% |
| Late afternoon (3 p.m. to 6 p.m.) | 586 | 24.67% |
| Evening (6 p.m. to 12 a.m.) | 392 | 16.51% |
| Night (12 a.m. to 6 a.m.) | 337 | 14.19% |
| Weather condition | | |
| Clear | 1,733 | 72.97% |
| Others (Cloudy, rain, fog, or snow) | 642 | 27.03% |
| Lighting condition | | |
| Daylight | 1,612 | 67.87% |
| Dark lighted | 463 | 19.49% |
| Dark, not lighted | 300 | 12.63% |
| Illumination type | | |
| Illuminated | 1,780 | 74.95% |
| Not illuminated | 595 | 25.05% |
| Terrain | | |
| Flat | 715 | 30.11% |
| Rolling | 1,660 | 69.89% |
| Land use | | |
| Commercial | 1,187 | 49.98% |
| Rural | 765 | 32.21% |
| Mixed | 423 | 17.81% |
| Crash severities | | |
| Fatal or injury | 451 | 18.99% |
| Property Damage Only | 1,924 | 81.01% |

| Continuous variables | Min | Q1 | Median | Q3 | Max | Mean | SD |
|---|---|---|---|---|---|---|---|
| Traffic flow characteristics | | | | | | | |
| Speed (mph) | 1.00 | 59.06 | 63.47 | 66.96 | 91.00 | 61.41 | 10.46 |
| Volume (scaled between 0-minimum and 1-maximum) | 0.0002 | 0.12 | 0.28 | 0.46 | 1.00 | 0.31 | 0.22 |
| Highway geometry | | | | | | | |
| Number of lanes (both directions) | 3 | 6 | 8 | 8 | 12 | 7.18 | 1.78 |
| Inter-crash duration (hours) | 0.00 | 68.05 | 230.46 | 627.17 | 7,683.03 | 516.67 | 783.18 |
From the plot, it can be observed that I-40 has the highest peak, indicating a higher concentration
of crashes compared to the other interstates. Furthermore, the density plot reveals that the distribution of
crashes on I-40 is less spread out compared to the other interstates. This means that the duration between
crashes on I-40 is shorter, indicating a higher frequency of crashes occurring within a shorter period. In
terms of increasing spread, the interstates can be ranked as follows: I-40, I-55, I-24, and I-75. This implies
that the duration between crashes is longer and more spread out on I-75 compared to the other interstates.
Fig 2. Distribution of inter-crash duration for the interstates.
5.2. Data sampling
In this study, the models were calibrated using training data and evaluated on testing data. The process of
creating training and testing data involved splitting the historical crash data in a 9:1 ratio, where 90% of the
data was allocated for training and the remaining 10% for testing. To create forecasting epochs, both the
training and testing crashes were expanded. The training data was further sampled at 5% increments up to
25% to investigate whether any sample size below 25% would provide accurate parameter estimates. Thus,
the samples used for parameter estimation were 5%, 10%, 15%, 20%, and 25% of the training data. This sampling
approach is called epoch level sampling (Thapa et al., 2022). The sampled training data, along with the
complete training data, were used to estimate the parameters for the models. For comparison purposes, the
parameter estimates obtained from the complete training data (100% training data) were considered as the
"true" estimates.
To evaluate the performance of the trained models, the predicted log-likelihood values were
calculated on the testing data. In this context, the predicted log-likelihood provided a basis for comparing how
well the models generalized beyond the data used for estimation.
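A sketch of the 9:1 split and the epoch-level sampling described above is shown below; the data frame epochs and its columns crash_id and epoch are hypothetical stand-ins for the expanded crash data.

```r
set.seed(1)                                  # seed value is illustrative
crash_ids <- unique(epochs$crash_id)
train_ids <- sample(crash_ids, round(0.9 * length(crash_ids)))  # 9:1 split on crashes
train <- epochs[epochs$crash_id %in% train_ids, ]
test  <- epochs[!(epochs$crash_id %in% train_ids), ]

# Epoch-level sampling: draw whole epochs so each epoch's four intervals stay together
sample_epochs <- function(data, frac) {
  keys <- unique(data[, c("crash_id", "epoch")])
  keep <- keys[sample(nrow(keys), round(frac * nrow(keys))), ]
  merge(data, keep)                          # e.g., frac = 0.15 for the 15% sample
}
```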
6. Results
All model computations, including estimation and validation, in this study were conducted using R version
4.2.3 on a computer equipped with an Intel Core i7-11700K processor and 16 GB of memory. We initially
estimated the model parameters using the complete training data, employing both simultaneous and
sequential estimation techniques. The objective of estimating with the complete training data was to obtain
"true" parameter estimates and compare the results obtained from different estimation techniques. The
estimation results are presented in Table 4. In Table 4, the first column displays the variable groups in the
model, along with the corresponding variable categories considered as the base in the models. The second
column lists the variables included in the model. The estimation results are then presented, showing the
parameter estimates and their respective t-statistics for both simultaneous and sequential estimation. The
parameter estimates obtained from both estimation methods are comparable, indicating consistency in the
results. Additionally, the average values of predicted log-likelihood are also similar between the two
methods. When considering estimation complexity, which refers to the time taken for the model to converge
from a null model, it was found that sequential estimation offers a considerable advantage. Specifically,
using simultaneous estimation, the model took 51.09 hours (about 2 days) to converge, which was
approximately six times the time taken by sequential estimation, which was 8.63 hours. Therefore,
sequential estimation may provide consistent estimates with a significant reduction in computational
complexity.
The parameters obtained from simultaneous estimation, as shown in the table, can be utilized to
express the propensity function for F/I crashes in any time interval using the following utility equation.

$$\begin{aligned} U_{F/I,i} = {} & -6.46\,t + 1.96\,\text{Early morning} + 3.50\,\text{Late morning} + 3.48\,\text{Early afternoon} \\ & + 2.47\,\text{Late afternoon} + 1.92\,\text{Clear} + 0.52\,\text{Daytime} + 3.47\,\text{Dark lighted} \\ & - 0.88\,\text{Illuminated} + 0.28\,\text{Flat} - 1.06\,\text{Commercial} - 2.01\,\text{Rural} \\ & + 0.88\,\text{Number of lanes} - 0.08\,\text{Speed} - 1.27\,\text{Volume} \end{aligned}$$

For example, the utility equation for the first time-interval is obtained by setting t = 0.25 hours and
evaluating the dynamic covariates at that interval.
Table 4. Results from estimation of model using complete training data.

| Variable groups | Variables | Simultaneous estimate | t-stat | Sequential estimate | t-stat |
|---|---|---|---|---|---|
| Upper level | | | | | |
| Duration dynamics | Time since previous crash | -6.46 | -22.60 | -7.22 | -91.82 |
| Time of day (Evening 6 p.m. to 12 a.m., Night 12 a.m. to 6 a.m.) | Early morning (6 a.m. to 9 a.m.) | 1.96 | 20.75 | 2.19 | 44.87 |
| | Late morning (9 a.m. to 12 p.m.) | 3.50 | 22.27 | 3.91 | 70.82 |
| | Early afternoon (12 p.m. to 3 p.m.) | 3.48 | 22.39 | 3.89 | 75.27 |
| | Late afternoon (3 p.m. to 6 p.m.) | 2.47 | 21.74 | 2.76 | 58.19 |
| Weather conditions (Others) | Clear | 1.92 | 22.22 | 2.15 | 72.37 |
| Lighting condition (Dark, not lighted) | Daytime | 0.52 | 9.79 | 0.58 | 10.80 |
| | Dark lighted | 3.47 | 22.11 | 3.88 | 67.84 |
| Illumination type (Not illuminated) | Illuminated | -0.88 | -19.04 | -0.99 | -32.94 |
| Terrain type (Rolling) | Flat | 0.28 | 9.58 | 0.32 | 10.50 |
| Land use (Mixed) | Commercial | -1.06 | -18.82 | -1.18 | -31.78 |
| | Rural | -2.01 | -21.60 | -2.24 | -56.49 |
| Highway geometry | Number of lanes | 0.88 | 22.23 | 0.98 | 72.09 |
| Traffic flow characteristics | Speed | -0.08 | -23.20 | -0.09 | -506.52 |
| | Volume | -1.27 | -21.70 | -1.42 | -55.39 |
| Lower level | | | | | |
| Intercepts (Next epoch) | First 15-min interval | -8.71 | -131.18 | -8.85 | -128.09 |
| | Second 15-min interval | -8.74 | -130.92 | -8.88 | -127.94 |
| | Third 15-min interval | -8.77 | -130.53 | -8.90 | -127.68 |
| | Fourth 15-min interval | -8.71 | -131.35 | -8.85 | -128.26 |
| Nesting coefficient | λ | 4.36 | 23.22 | 4.87 | 24.80 |

Goodness of fit: Number of observations (Training) = 1,103,104; Average initial LL = -213.98; Average LL at convergence = -2.052; Number of observations (Testing) = 140,591; Predicted LL = -1.879. Estimation complexity: Time (hours) = 51.09 (simultaneous), 8.63 (sequential).
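To illustrate the utility expression above, it can be evaluated for one hypothetical interval using the simultaneous estimates from Table 4; the covariate values below are invented for the example and carry no empirical meaning.

```r
# First 15-min interval after a crash (t = 0.25 hr), early afternoon, clear weather,
# daytime, illuminated, flat terrain, commercial land use, 8 lanes, 62 mph, volume 0.3
U_FI <- -6.46 * 0.25 +   # time since previous crash
         3.48 +          # early afternoon
         1.92 +          # clear weather
         0.52 +          # daytime lighting
        -0.88 +          # illuminated
         0.28 +          # flat terrain
        -1.06 +          # commercial land use
         0.88 * 8 +      # number of lanes
        -0.08 * 62 +     # speed (mph)
        -1.27 * 0.3      # scaled volume
U_FI  # larger values imply a higher F/I (vs. PDO) propensity at this interval
```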
The analysis reveals interesting findings regarding the factors influencing F/I crashes. The duration
dynamics coefficient indicates that as the duration between crashes increases, the likelihood of F/I crashes
decreases. Moreover, F/I crashes are more likely to occur between 9 am and 3 pm. Clear weather conditions
are associated with a higher likelihood of F/I crashes compared to adverse weather conditions such as
clouds, rain, fog, or snow. Dark lighted conditions result in more severe crashes, followed by daytime and
dark unlighted conditions. Non-illuminated locations are more prone to F/I crashes compared to illuminated
locations. Additionally, locations with flat terrain have a higher likelihood of F/I crashes compared to those
with rolling terrain. Higher traffic volume leads to a decrease in F/I crashes, due to stop-and-go conditions
during congested conditions. Similarly, higher speeds are associated with a lower likelihood of F/I crashes,
although the effect size is small. The coefficients for the upper-level nodes, , have similar magnitudes.
The nesting parameter has a value of 4.36, indicating cross nesting of alternatives. It is worth noting that
the training data increased significantly after the creation of forecasting epochs, with the original 2,137
crashes expanding to 1,103,104 observations.
Next, we proceeded to estimate parameters using sampled data to explore the tradeoff between
model performance and estimation complexity. The results of this estimation can be found in Table 5, which
presents the obtained parameter values along with their respective t-statistics. Upon visual inspection, it is
apparent that the parameter values obtained using the 25% sample are much closer to the true values
compared to the 5% sample. This finding aligns with a previous study by Thapa et al. (2022).
However, it is also crucial to investigate the impact of sample size within the range of 5% to 25% to
determine the sample that offers the optimal balance between model performance and estimation
complexity. To address this, we estimated parameters at 5% increments, ranging from 5% to 25%. Fig 3
presents a graphical representation of estimation complexity and predicted log-likelihood for the various
samples. Notably, the figure indicates a significant improvement in prediction performance beyond the 10%
sample. Furthermore, the models demonstrate similar performance for the 15%, 20%, and 25% samples.
As expected, estimation complexity increases linearly with the sample size. For instance, the model
required 2.48 hours to train on the 5% sample, while it took approximately 20 times longer, or 51.09 hours (about
2 days) for the full 100% dataset. Based on the findings depicted in the figure, it is evident that using a 15%
sample can yield comparable estimates and predictive performance to the 25% sample, while reducing the
estimation complexity to 60% of that offered by the 25% sample. This suggests that the 15% sample size
strikes a favorable balance between model performance and estimation complexity.
Table 5. Results from simultaneous model estimation using samples drawn at the epoch level. Entries are estimates with t-statistics in parentheses.

| Variable groups | Variables | 5% sample | 10% sample | 15% sample | 20% sample | 25% sample |
|---|---|---|---|---|---|---|
| Upper level | | | | | | |
| Duration dynamics | Time since previous crash | -5.05 (-4.21) | -6.55 (-7.15) | -6.60 (-8.65) | -6.99 (-10.10) | -6.94 (-11.36) |
| Time of day (Evening 6 p.m. to 12 a.m., Night 12 a.m. to 6 a.m.) | Early morning (6 a.m. to 9 a.m.) | 1.58 (3.93) | 1.81 (6.39) | 1.77 (7.73) | 1.78 (8.97) | 1.84 (10.13) |
| | Late morning (9 a.m. to 12 p.m.) | 2.56 (4.13) | 3.20 (6.97) | 3.43 (8.51) | 3.40 (9.87) | 3.45 (11.03) |
| | Early afternoon (12 p.m. to 3 p.m.) | 2.46 (4.14) | 3.18 (7.00) | 3.60 (8.58) | 3.76 (10.03) | 3.70 (11.20) |
| | Late afternoon (3 p.m. to 6 p.m.) | 1.71 (4.02) | 2.40 (6.83) | 2.48 (8.32) | 2.50 (9.66) | 2.52 (10.81) |
| Weather conditions (Others) | Clear | 1.43 (4.12) | 1.87 (7.00) | 1.97 (8.51) | 1.98 (9.91) | 1.99 (11.08) |
| Lighting condition (Dark, not lighted) | Daytime | 0.49 (2.27) | 0.42 (2.50) | 0.31 (2.37) | 0.39 (3.33) | 0.36 (3.53) |
| | Dark lighted | 2.97 (4.15) | 3.31 (6.92) | 3.36 (8.41) | 3.46 (9.81) | 3.52 (11.06) |
| Illumination type (Not illuminated) | Illuminated | -0.66 (-3.60) | -0.90 (-5.98) | -0.80 (-6.99) | -0.72 (-7.74) | -0.75 (-8.81) |
| Terrain type (Rolling) | Flat | 0.49 (3.22) | 0.85 (5.89) | 0.80 (7.05) | 0.70 (7.65) | 0.63 (8.15) |
| Land use (Mixed) | Commercial | -1.22 (-3.94) | -1.28 (-6.27) | -1.16 (-7.40) | -0.98 (-8.05) | -0.99 (-9.01) |
| | Rural | -2.03 (-4.14) | -2.15 (-6.87) | -2.08 (-8.28) | -1.98 (-9.54) | -1.96 (-10.63) |
| Highway geometry | Number of lanes | 0.77 (4.19) | 0.96 (7.09) | 0.86 (8.48) | 0.89 (9.89) | 0.89 (11.11) |
| Traffic flow characteristics | Speed | -0.07 (-4.29) | -0.08 (-7.34) | -0.08 (-8.86) | -0.08 (-10.32) | -0.08 (-11.55) |
| | Volume | -14.94 (-4.26) | -18.77 (-7.26) | -17.74 (-8.77) | -17.98 (-10.23) | -18.02 (-11.47) |
| Lower level | | | | | | |
| Intercepts (Next epoch) | First 15-min interval | -8.68 (-28.18) | -8.94 (-39.92) | -8.89 (-49.03) | -8.90 (-56.71) | -8.85 (-63.84) |
| | Second 15-min interval | -9.01 (-27.42) | -8.61 (-42.27) | -8.56 (-51.90) | -8.59 (-59.88) | -8.62 (-66.90) |
| | Third 15-min interval | -8.56 (-28.57) | -8.83 (-40.44) | -8.77 (-49.69) | -8.75 (-57.60) | -8.83 (-64.04) |
| | Fourth 15-min interval | -8.60 (-28.37) | -8.73 (-40.63) | -8.77 (-49.53) | -8.80 (-57.19) | -8.70 (-64.55) |
| Nesting coefficient | λ | 3.68 (4.30) | 4.49 (7.35) | 4.42 (8.87) | 4.46 (10.34) | 4.45 (11.58) |
| Goodness of fit | | | | | | |
| | Number of observations (Training) | 55,062 | 110,187 | 165,378 | 220,662 | 275,787 |
| | Average initial LL | -212.70 | -212.87 | -212.94 | -213.01 | -213.08 |
| | Average LL at convergence | -2.052 | -2.053 | -2.054 | -2.053 | -2.051 |
| | Number of observations (Testing) | 140,591 | 140,591 | 140,591 | 140,591 | 140,591 |
| | Predicted log-likelihood | -1.967 | -1.968 | -1.962 | -1.961 | -1.961 |
| | Estimation complexity: Time (hours) | 2.48 | 4.22 | 7.49 | 9.86 | 13.25 |
Fig 3. Improvement in model performance with increase in data size/estimation complexity.
6.1. Effect of sampling on coefficients
Based on the parameter estimates, it is evident that certain predictor variables are particularly sensitive to
sampling. A notable example is the Volume variable, where the coefficients exhibit significant differences
between the sampled data and the complete data (refer to Fig 4). This discrepancy can be attributed to the
sampling approach and the scaling of traffic volumes. Since the volumes are scaled between 0 and 1,
random sampling can lead to the exclusion of several observations, resulting in considerable variations in
the parameter estimates for this variable. On the other hand, coefficients for the Speed variable demonstrate
consistency. This consistency may be attributed to the fact that the values of the variable do not fluctuate
significantly, as indicated by its descriptive statistics, and are less affected by sampling.
Considering these observations, we aim to identify and report variables that are sensitive to
sampling. To visualize this, a bar plot in Fig 4 presents the variable coefficients obtained from the sampled
and complete data. From the plot, it can be observed that smaller samples are more likely to overestimate
the effect of some variables, for example, Time since crash, Time of day-Early afternoon, Terrain-Flat,
Land Use-Commercial, and Volume. Conversely, variables such as Time of day-Early Morning and Late
Morning, and Lighting-Daytime are more likely to be underestimated when smaller samples are used.
Overall, these findings emphasize the importance of considering the impact of sampling on parameter
estimates, particularly for variables that exhibit sensitivity to sampling.
Considering the impact of sampling, we identified variables that are unlikely to converge toward
the true value when small samples are used and those that are more likely to do so. Identification of these
variables is crucial from a practical standpoint, especially when analysts and planners seek greater accuracy
for specific variables. In the following figures, we present two groups of predictors. The first group consists
of variables which are less likely to converge to actual values with small increments in sample size. These
variables would require larger samples to achieve more accurate estimation. It is important to recognize the
limitations in estimating the coefficients for these variables with smaller sample sizes. The second group
comprises variables whose coefficients converge closer to the actual values as the sample size increases.
This group includes variables whose coefficients can be obtained with reasonable accuracy, even with small
increments in sample size. The findings will be useful in identifying variables that become more stable
and reliable as the sample size grows. These findings serve as valuable insights for researchers and
practitioners, allowing them to prioritize their data collection efforts and allocate resources effectively
based on the sensitivity of different predictors to sample size.
The variables which are less likely to converge to true values despite an increase in sample size,
ranging from 5% to 25%, compared to the full data are Time since crash, Time of day-Early afternoon,
Weather Condition-Clear, Lighting Condition-Daytime, Illumination-Illuminated, and Volume. These
variables are presented in Fig 5, indicating the percentage difference of the coefficients from the complete
training data. On the other hand, coefficients for Time of day-Early morning, Time of day-Late morning,
Time of day-Late afternoon, Lighting-Dark lighted, Terrain-Flat, Land Use-Commercial, Land Use-Rural,
Number of lanes, and Speed converge quicker to the actual values as the sample size increases. These
variables are displayed in Fig 6, illustrating the percentage difference compared to the complete training
data. These findings highlight the sensitivity of different variables to sample size and provide valuable
insights into the accuracy and stability of their coefficient estimates.
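The stability measure underlying Figs 5 and 6 is simply the percentage difference of each sampled coefficient from the complete-data estimate; a minimal sketch, applied here to the Time since previous crash coefficients from Table 5, follows.

```r
pct_diff <- function(beta_sample, beta_full) {
  100 * (beta_sample - beta_full) / abs(beta_full)
}

# 5%, 10%, 15%, 20%, 25% samples vs. the complete-data ("true") estimate of -6.46
pct_diff(c(-5.05, -6.55, -6.60, -6.99, -6.94), -6.46)
```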
Fig 4. Coefficient of variables for different training samples.

Fig 5. Variables unlikely to converge to their actual values despite incremental increases in sample size.

Fig 6. Variables likely to converge to their actual values with incremental increases in sample size.
7. Validation
The validation of the proposed nested logit model was carried out to assess its predictive capabilities. All
validations were conducted using the simultaneous model trained on 15% data drawn at the epoch level
since our analysis suggested that it provided the best tradeoff between accuracy and estimation complexity.
As discussed previously, 10% of the sample was held out for testing. The test sample consisted of 236
crashes, including 39 F/I crashes and 197 PDO crashes. This test sample was used for validation. Similar
to the model's two-level structure, validation was conducted to assess predictive abilities for the outcomes considered
at the lower and upper levels. These results are discussed in the following subsections.
7.1. Upper level: Crashes at epoch level
One of the primary objectives of the proposed framework is to predict the occurrence of future crashes.
Therefore, it is crucial to evaluate the temporal accuracy of the predicted crashes. To evaluate this, we
measured the proximity between the predicted crash epoch and the actual epoch at which crashes were
observed, by introducing a metric called Predicted Temporal Proximity (PTP), represented by equation
(12). This metric quantifies how closely the predicted crash epochs align with the observed epochs.
Furthermore, we also investigated whether the number of epochs impacted the model's performance
in terms of PTP. To accomplish this, we calculated the PTP for different subsets of the testing data by
excluding crashes with a substantial number of epochs. This was accomplished by creating subsets of the
test data to include crashes with fewer than 100 to 1000 epochs, with intervals of 100 epochs. The average
values of PTP for these subsets of testing data are depicted in Fig 7.

$$\text{PTP} = \frac{\left|e_{\text{predicted}} - e_{\text{observed}}\right|}{e_{\text{observed}}} \quad (12)$$
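Under the reconstruction of equation (12) above, PTP can be computed for each crash and averaged over a test subset; the predicted and observed epochs in the sketch below are invented for illustration.

```r
ptp <- function(epoch_pred, epoch_obs) abs(epoch_pred - epoch_obs) / epoch_obs
mean(ptp(epoch_pred = c(30, 80), epoch_obs = c(50, 60)))  # average PTP ~ 0.37
```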
It is important to note that, according to the definition of PTP, a smaller value is desired as it
indicates that the predicted crash epoch is closer to the observed epoch. The results depicted in Fig 7 indicate
that when there is a substantial number of epochs (i.e., a large inter-crash duration), the value of PTP
increases. This suggests that epoch-level prediction is more accurate when the duration between crashes is
smaller. In other words, the prediction of crash epochs is more reliable for highway segments that
experience crashes more frequently. For example, based on the figure, for crashes with inter-crash durations
less than 100 hours (approximately 4 days), the predicted crash epoch is within 60% of the actual epoch,
compared to 74% when crashes with inter-crash durations of up to 1,000 hours are included.
Fig 7. Average PTP for different subsets of test samples.
7.2. Upper level: Crashes in predicted time-intervals
The accuracy of predicting crash occurrences at specific time intervals can be assessed from two
perspectives: i) the accuracy of predicting crashes (true positives), and ii) the accuracy of predicting 'no
crashes' (true negatives). Therefore, we relied on the metrics of Specificity and Sensitivity to evaluate the
model's predictions. Specificity measures the model's ability to correctly predict 'no crashes' (true negatives)
and is defined by equation (13); it quantifies the proportion of correctly identified negative cases in relation
to the actual negative cases. On the other hand, Sensitivity measures the model's ability to correctly
predict crashes (true positives) and is defined by equation (14); it quantifies the proportion of correctly
identified positive cases in relation to the actual positive cases.
The model’s prediction accuracy for crash and severity were evaluated using these metrics. The
results are summarized in Table 6 and described as follows.
$$\text{Specificity} = \frac{TN}{TN + FP} \quad (13)$$

$$\text{Sensitivity} = \frac{TP}{TP + FN} \quad (14)$$
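Both metrics follow directly from the confusion counts; the snippet below reproduces the crash-occurrence values reported in Table 6.

```r
specificity <- function(TN, FP) TN / (TN + FP)   # equation (13)
sensitivity <- function(TP, FN) TP / (TP + FN)   # equation (14)

specificity(TN = 539, FP = 173)  # 0.76 for crash occurrence
sensitivity(TP = 63,  FN = 169)  # 0.27 for crash occurrence
```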
Table 6. Values of Specificity and Sensitivity from the model predictions.

| Predictions | TN | TP | FP | FN | Specificity | Sensitivity |
|---|---|---|---|---|---|---|
| Crash occurrence | 539 | 63 | 173 | 169 | 0.76 | 0.27 |
| Crash severity: F/I crashes | 887 | 9 | 20 | 28 | 0.97 | 0.24 |
| Crash severity: PDO crashes | 557 | 41 | 192 | 154 | 0.74 | 0.21 |
The model predictions for the time intervals resulted in the following counts: True Negatives (TN)
= 539, True Positives (TP) = 63, False Positives (FP) = 173, and False Negatives (FN) = 169. The Specificity
is calculated to be 0.76, indicating a high value. This high value suggests a low rate of false positive
predictions. Therefore, the model is reliable in screening out intervals without crashes; in other words, the
likelihood of classifying a time interval without a crash as a time interval experiencing a crash is low. On
the other hand, the Sensitivity is calculated to be 0.27, indicating a low value. This low value suggests a
high rate of false negatives, or in other words, the chances of classifying true crash intervals as having no
crash is high.
7.3. Lower level: Crash severity for crashes in predicted time-intervals
The Specificity and Sensitivity measures were also utilized to evaluate the model's ability to predict crash
severities at each time interval. For F/I crashes, the following results were obtained: TN = 887, TP = 9, FP
= 20, FN = 28, resulting in a Specificity of 0.97 and a Sensitivity of 0.24. Similarly, for PDO crashes, the
values obtained were TN = 557, TP = 41, FP = 192, and FN = 154, with a Specificity of 0.74 and a
Sensitivity of 0.21.
The results indicate that for both severity types, the Specificity values are high. This suggests that
the model is capable of reliably predicting both F/I and PDO crashes with a lower chance of false positive
predictions. However, it should be noted that the model also exhibits low Sensitivity values, indicating that
the model may not always accurately classify the severity types with a high degree of certainty, leading to
a higher occurrence of false negative predictions. This outcome results from the much higher
prevalence of time-intervals without crashes (0s) in comparison to those with crashes (1s). Future research
can improve upon the model by addressing this imbalance in the frequency of outcomes (e.g., see Morris
& Yang, 2021).
8. Conclusion
This study developed a duration-based model to predict crash occurrence and severity using historical crash
and traffic flow data from four interstates in Tennessee. The framework involved the reformulation of crash
data to create forecasting epochs and time-intervals, which were used to calculate crash and severity
likelihoods. The creation of forecasting epochs significantly increased the data size and estimation
complexity. Additionally, the adoption of a nested structure further contributed to the complexity of model
estimation. To address the computational challenges, we suggested sampling the data at the epoch level to
reduce estimation complexity. We aimed to find the optimal sampling strategy by considering the tradeoff
between model performance and estimation complexity. After evaluating various samples, we determined
that a 15% sample drawn at the epoch level provided the best balance between reduced data size and retained predictive accuracy. Furthermore,
we investigated the impact of sampling on the coefficients of predictor variables to identify those most
sensitive to changes in sample sizes. Variables such as Time since crash, Time of day-Early afternoon, Late
afternoon, Terrain-Flat, Land Use-Commercial, Number of lanes, and Volume were found to be more likely
to be overestimated by smaller samples. Conversely, variables including Time of day-Early Morning, Late
Morning, Lighting-Daytime and Dark lighted were more likely to be underestimated.
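For readers implementing the framework, the following Python sketch illustrates epoch-level sampling as described above. The file and column names (reformulated_crash_data.csv, epoch_id) are hypothetical placeholders rather than the study's actual data schema:

import pandas as pd

# Reformulated data: one row per (forecasting epoch, time-interval) pair.
df = pd.read_csv("reformulated_crash_data.csv")  # hypothetical file name

# Epoch-level sampling: draw 15% of the epochs, then retain every time-interval
# row belonging to a sampled epoch so that each epoch's structure stays intact.
sampled_epochs = df["epoch_id"].drop_duplicates().sample(frac=0.15, random_state=42)
train_sample = df[df["epoch_id"].isin(sampled_epochs)]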
When investigating the stability of coefficients for the predictors, it was found that Time since
crash, Time of day-Early afternoon, Weather Condition-Clear, Lighting Condition-Daytime, Illumination-
Illuminated, and Volume exhibited a higher degree of instability. Consistent estimation of these coefficients
required larger sample sizes. On the other hand, coefficients for Time of day-Early morning, Late morning,
Late afternoon, Lighting-Dark lighted, Terrain-Flat, Land Use-Commercial and Rural, Number of lanes,
and Speed demonstrated a tendency to converge towards true estimates with incremental increases in
sample size. These findings are crucial for obtaining consistent and reliable estimates when utilizing
samples for model estimation and clarify the challenges and considerations associated with implementing
the duration-based model, including the impact of data sampling on estimation outcomes and the sensitivity
of certain variables to changes in sample sizes.
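To make the notion of coefficient stability concrete, the sketch below applies a simple convergence check to illustrative (not estimated) coefficient values for a single predictor re-estimated on increasing sample fractions:

import numpy as np

# Illustrative coefficient estimates for one predictor across increasing
# epoch-level sample fractions (values fabricated for exposition).
fractions = np.array([0.05, 0.10, 0.15, 0.25, 0.50, 1.00])
beta = np.array([0.82, 0.61, 0.55, 0.52, 0.51, 0.50])

# Treat the coefficient as stable once the relative change between
# successive sample sizes falls below a 5% tolerance.
rel_change = np.abs(np.diff(beta)) / np.abs(beta[:-1])
stable_from = fractions[1:][rel_change < 0.05]
if stable_from.size:
    print(f"Coefficient stabilizes from a {stable_from[0]:.0%} sample onward")
else:
    print("Coefficient has not stabilized; a larger sample is needed")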
The proposed framework's validation provided satisfactory results. The measure, Predicted
Temporal Proximity (PTP), suggests that the model performs better when implemented on segments where
crashes are more frequent. For context, the model, trained on a 15% epoch-level sample, was able to predict
crashes within 60% (i.e., average PTP=60%) of the actual epoch for crashes occurring within 100 epochs,
or approximately 4 days, of each other. By comparison, the average PTP was 74% for crashes
occurring within 1,000 epochs of each other. This finding also sheds light on the practical implications of
the model, as it is often impractical to predict crashes too far into the future due to potential changes in
traffic, weather, and driving conditions. Similarly, the estimated model displayed a satisfactory value of
Specificity, indicating a low rate of false positives. In other words, the model is less likely to falsely predict
time intervals without crashes as having experienced crashes. This is particularly important as a reasonable
degree of certainty is desired to ensure effective allocation of limited safety resources to critical segments.
The value of Sensitivity was comparatively smaller, implying a higher rate of false negatives or missed
detections. However, it should also be noted that the frequency of time intervals without crashes is several
multiples larger than the frequency of time intervals with crashes (preponderance of 0s compared to 1s).
Therefore, the low value of Sensitivity is expected in this case.
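The validation logic can be sketched as follows. Because the exact PTP formula is defined earlier in the paper, the computation below is one plausible reading (the predicted epoch's deviation from the actual crash epoch, expressed as a share of the actual epoch), and all records are fabricated for illustration:

import pandas as pd

# Hypothetical validation records: actual crash epoch vs. predicted epoch.
val = pd.DataFrame({
    "actual_epoch":    [40, 85, 210, 640, 950],
    "predicted_epoch": [25, 60, 120, 380, 525],
})

# Assumed reading of PTP: deviation of the predicted epoch from the actual
# epoch as a share of the actual epoch (smaller values = closer predictions).
val["ptp"] = (val["actual_epoch"] - val["predicted_epoch"]).abs() / val["actual_epoch"]

# Average PTP for crash subsets defined by inter-crash duration, as in Fig 7.
for cutoff in [100, 500, 1000]:
    subset = val[val["actual_epoch"] < cutoff]
    if not subset.empty:
        print(f"< {cutoff} epochs: average PTP = {subset['ptp'].mean():.0%}")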
9. Study Limitations and Future Research
Future research offers opportunities for notable improvements to the proposed model. Firstly, it would be
valuable to investigate alternative nesting structures to determine if they provide a better fit, especially
considering that the nesting parameter suggests the presence of alternative nests. More complex nesting
structures based on distinct categories such as time of day, weather conditions, and other relevant factors
could be explored. Additionally, in this study, the upper level is assumed to be an MNL model without considering the effect of time. Future research should address this when investigating
alternative structures. Secondly, the model estimates could be enhanced by incorporating random effects.
Since the reformulated data, after the creation of forecasting epochs, takes the form of panel data with
repeated observations for crashes and road segments, accounting for segment and crash-specific
heterogeneity could lead to more accurate model estimates. Furthermore, data balancing techniques such
as the Synthetic Minority Over-sampling Technique (SMOTE) can be used to balance the frequency of outcomes and to study the impact of balancing on model estimates (see the sketch at the end of this section). Finally, alternative estimation techniques leveraging parallel and
distributed computing can be implemented to reduce estimation time while still retaining information from
the complete training dataset. Addressing these limitations would contribute to a more comprehensive
understanding of crash prediction and severity estimation and improve the accuracy and applicability of the
model in real-world scenarios.
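As an example of the balancing step suggested above, the sketch below applies SMOTE from the imbalanced-learn package to synthetic data; the covariates and the roughly 5% positive rate are fabricated for illustration:

import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))             # illustrative interval-level covariates
y = (rng.random(1000) < 0.05).astype(int)  # ~5% crash intervals (imbalanced outcome)

# Oversample the minority class (crash intervals) with synthetic examples
# until the two classes are balanced.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(f"Positive share before: {y.mean():.1%}; after: {y_res.mean():.1%}")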
Declaration of competing interests
None.
Acknowledgements
This research was partially supported by a Fulbright Fellowship to the second author at the Indian Institute of
Technology (IIT) Bombay, National Science Foundation award #2222699 and the Center for Transportation
Innovations in Education and Research (C-TIER) at the University of Memphis. Any findings and opinions
expressed in this paper are those of the authors and do not necessarily reflect the view of the aforementioned
agencies.
References
Abdel-Aty, M., Uddin, N., Pande, A., Abdalla, M. F., & Hsia, L. (2004). Predicting Freeway Crashes
from Loop Detector Data by Matched Case-Control Logistic Regression. Transportation
Research Record: Journal of the Transportation Research Board, 1897(1), 88–95.
https://doi.org/10.3141/1897-12
Afghari, A. P., Haque, M. M., & Washington, S. (2020). Applying a joint model of crash count and crash
severity to identify road segments with high risk of fatal and serious injury crashes. Accident
Analysis & Prevention, 144, 105615. https://doi.org/10.1016/j.aap.2020.105615
Ahmed, S. S., Cohen, J., & Anastasopoulos, P. Ch. (2021). A correlated random parameters with
heterogeneity in means approach of deer-vehicle collisions and resulting injury-severities.
Analytic Methods in Accident Research, 30, 100160. https://doi.org/10.1016/j.amar.2021.100160
Barua, S., El-Basyouny, K., & Islam, Md. T. (2016). Multivariate random parameters collision count data
models with spatial heterogeneity. Analytic Methods in Accident Research, 9, 1–15.
https://doi.org/10.1016/j.amar.2015.11.002
Beshah, T., Ejigu, D., Abraham, A., Snasel, V., & Kromer, P. (2011). Pattern recognition and knowledge
discovery from road traffic accident data in Ethiopia: Implications for improving road safety.
2011 World Congress on Information and Communication Technologies, 1241–1246.
https://doi.org/10.1109/WICT.2011.6141426
Brownstone, D., & Small, K. A. (1989). Efficient Estimation of Nested Logit models. Journal of Business
& Economic Statistics, 7(1), 67–74. https://doi.org/10.1080/07350015.1989.10509714
Cerwick, D. M., Gkritza, K., Shaheed, M. S., & Hans, Z. (2014). A comparison of the mixed logit and
latent class methods for crash severity analysis. Analytic Methods in Accident Research, 3–4, 11–27. https://doi.org/10.1016/j.amar.2014.09.002
Chen, C., Zhang, G., Qian, Z., Tarefder, R. A., & Tian, Z. (2016). Investigating driver injury severity
patterns in rollover crashes using support vector machine models. Accident Analysis &
Prevention, 90, 128–139. https://doi.org/10.1016/j.aap.2016.02.011
Cheng, W., Gill, G. S., Dasu, R., Xie, M., Jia, X., & Zhou, J. (2017). Comparison of Multivariate Poisson
lognormal spatial and temporal crash models to identify hot spots of intersections based on crash
types. Accident Analysis & Prevention, 99, 330–341. https://doi.org/10.1016/j.aap.2016.11.022
Dissanayake, S., & Lu, J. (2002). Analysis of Severity of Young Driver Crashes: Sequential Binary
Logistic Regression Modeling. Transportation Research Record: Journal of the Transportation
Research Board, 1784(1), 108–114. https://doi.org/10.3141/1784-14
Dong, C., Clarke, D. B., Yan, X., Khattak, A., & Huang, B. (2014). Multivariate random-parameters zero-
inflated negative binomial regression model: An application to estimate crash frequencies at
intersections. Accident Analysis & Prevention, 70, 320–329.
https://doi.org/10.1016/j.aap.2014.04.018
Fountas, G., & Anastasopoulos, P. Ch. (2017). A random thresholds random parameters hierarchical
ordered probit analysis of highway accident injury-severities. Analytic Methods in Accident
Research, 15, 1–16. https://doi.org/10.1016/j.amar.2017.03.002
Golob, T. F., Recker, W. W., & Leonard, J. D. (1987). An analysis of the severity and incident duration of
truck-involved freeway accidents. Accident Analysis & Prevention, 19(5), 375–395.
https://doi.org/10.1016/0001-4575(87)90023-6
Harmon, T., Bahar, G., & Gross, F. (2018). Crash Costs for Highway Safety Analysis (Report No. FHWA-SA-17-071). Federal Highway Administration.
Hossain, M., Abdel-Aty, M., Quddus, M. A., Muromachi, Y., & Sadeek, S. N. (2019). Real-time crash
prediction models: State-of-the-art, design pathways and ubiquitous requirements. Accident
Analysis & Prevention, 124, 66–84. https://doi.org/10.1016/j.aap.2018.12.022
Hossain, M., & Muromachi, Y. (2012). A Bayesian network based framework for real-time crash
prediction on the basic freeway segments of urban expressways. Accident Analysis & Prevention,
45, 373–381. https://doi.org/10.1016/j.aap.2011.08.004
Iranitalab, A., & Khattak, A. (2017). Comparison of four statistical and machine learning methods for
crash severity prediction. Accident Analysis & Prevention, 108, 27–36.
https://doi.org/10.1016/j.aap.2017.08.008
Aguero-Valverde, J., Wu, K.-F., & Donnell, E. T. (2016). A multivariate spatial crash frequency model for identifying sites with promise based on crash types. Accident Analysis & Prevention, 87, 8–16. https://doi.org/10.1016/j.aap.2015.11.006
Jovanis, P. P., & Chang, H. L. (1989). Disaggregate model of highway accident occurrence using survival
theory. Accident Analysis and Prevention, 21(5), 445–458. https://doi.org/10.1016/0001-
4575(89)90005-5
Jung, S., Qin, X., & Noyce, D. A. (2010). Rainfall effect on single-vehicle crash severities using
polychotomous response models. Accident Analysis & Prevention, 42(1), 213–224.
https://doi.org/10.1016/j.aap.2009.07.020
Lee, J., Yoon, T., Kwon, S., & Lee, J. (2019). Model Evaluation for Forecasting Traffic Accident Severity
in Rainy Seasons Using Machine Learning Algorithms: Seoul City Study. Applied Sciences,
10(1), 129. https://doi.org/10.3390/app10010129
Lee, J.-T., & Fazio, J. (2005). Influential Factors in Freeway Crash Response and Clearance Times by
Emergency Management Services in Peak Periods. Traffic Injury Prevention, 6(4), 331–339.
https://doi.org/10.1080/15389580500255773
Li, P., Abdel-Aty, M., & Yuan, J. (2020). Real-time crash risk prediction on arterials based on LSTM-
CNN. Accident Analysis & Prevention, 135, 105371. https://doi.org/10.1016/j.aap.2019.105371
Ma, J., & Kockelman, K. M. (2006). Poisson Regression for Models of Injury Count, by Severity.
Transportation Research Record: Journal of the Transportation Research Board, 1950, 24–34.
Mannering, F., Bhat, C. R., Shankar, V., & Abdel-Aty, M. (2020). Big data, traditional data and the
tradeoffs between prediction and causality in highway-safety analysis. Analytic Methods in
Accident Research, 25, 100113. https://doi.org/10.1016/j.amar.2020.100113
Morris, C., & Yang, J. J. (2021). Effectiveness of resampling methods in coping with imbalanced crash
data: Crash type analysis and predictive modeling. Accident Analysis & Prevention, 159, 106240.
https://doi.org/10.1016/j.aap.2021.106240
Osman, M., Mishra, S., & Paleti, R. (2018). Injury severity analysis of commercially-licensed drivers in
single-vehicle crashes: Accounting for unobserved heterogeneity and age group differences.
Accident Analysis & Prevention, 118, 289–300. https://doi.org/10.1016/j.aap.2018.05.004
Osman, M., Mishra, S., Paleti, R., & Golias, M. (2019). Impacts of Work Zone Component Areas on
Driver Injury Severity. Journal of Transportation Engineering, Part A: Systems, 145(8),
04019032. https://doi.org/10.1061/jtepbs.0000253
Osman, M., Paleti, R., & Mishra, S. (2018). Analysis of passenger-car crash injury severity in different
work zone configurations. Accident Analysis and Prevention, 111, 161–172.
https://doi.org/10.1016/j.aap.2017.11.026
Ospina-Mateus, H., Quintana Jiménez, L. A., Lopez-Valdes, F. J., Berrio Garcia, S., Barrero, L. H., &
Sana, S. S. (2021). Extraction of decision rules using genetic algorithms and simulated annealing
for prediction of severity of traffic accidents by motorcyclists. Journal of Ambient Intelligence
and Humanized Computing, 12(11), 10051–10072. https://doi.org/10.1007/s12652-020-02759-5
Park, E. S., & Lord, D. (2007). Multivariate poisson-lognormal models for jointly modeling crash
frequency by severity. Transportation Research Record, 2019, 1–6. https://doi.org/10.3141/2019-
01
Pei, X., Wong, S. C., & Sze, N. N. (2011). A joint-probability approach to crash prediction models.
Accident Analysis & Prevention, 43(3), 1160–1166. https://doi.org/10.1016/j.aap.2010.12.026
Pham, M.-H., Bhaskar, A., Chung, E., & Dumont, A.-G. (2010). Random forest models for identifying
motorway Rear-End Crash Risks using disaggregate data. 13th International IEEE Conference on
Intelligent Transportation Systems, 468–473. https://doi.org/10.1109/ITSC.2010.5625003
Rahim, M. A., & Hassan, H. M. (2021). A deep learning based traffic crash severity prediction
framework. Accident Analysis & Prevention, 154, 106090.
https://doi.org/10.1016/j.aap.2021.106090
Santos, K., Dias, J. P., & Amado, C. (2022). A literature review of machine learning algorithms for crash
injury severity prediction. Journal of Safety Research, 80, 254–269.
https://doi.org/10.1016/j.jsr.2021.12.007
Sun, J., & Sun, J. (2016). Real‐time crash prediction on urban expressways: Identification of key
variables and a hybrid support vector machine model. IET Intelligent Transport Systems, 10(5),
331–337. https://doi.org/10.1049/iet-its.2014.0288
Thapa, D., & Mishra, S. (2021). Using worker’s naturalistic response to determine and analyze work zone
crashes in the presence of work zone intrusion alert systems. Accident Analysis and Prevention,
156, 106125. https://doi.org/10.1016/j.aap.2021.106125
Thapa, D., Paleti, R., & Mishra, S. (2022). Overcoming challenges in crash prediction modeling using
discretized duration approach: An investigation of sampling approaches. Accident Analysis &
Prevention, 169, 106639. https://doi.org/10.1016/j.aap.2022.106639
Xu, C., Tarko, A., Wang, W., & Liu, P. (2013). Predicting crash likelihood and severity on freeways with
real-time loop detector data. Accident Analysis and Prevention, 57, 30–39.
http://dx.doi.org/10.1016/j.aap.2013.03.035
Yasmin, S., & Eluru, N. (2013). Evaluating alternate discrete outcome frameworks for modeling crash
injury severity. Accident Analysis & Prevention, 59, 506–521.
https://doi.org/10.1016/j.aap.2013.06.040
Yasmin, S., & Eluru, N. (2018). A joint econometric framework for modeling crash counts by severity.
Transportmetrica A: Transport Science, 14(3), 230–255.
https://doi.org/10.1080/23249935.2017.1369469
Yasmin, S., Eluru, N., Bhat, C. R., & Tay, R. (2014). A latent segmentation based generalized ordered
logit model to examine factors influencing driver injury severity. Analytic Methods in Accident
Research, 1, 23–38. https://doi.org/10.1016/j.amar.2013.10.002
Yasmin, S., Eluru, N., Wang, L., & Abdel-Aty, M. A. (2018). A joint framework for static and real-time
crash risk analysis. Analytic Methods in Accident Research, 18, 45–56.
https://doi.org/10.1016/j.amar.2018.04.001
Yu, R., & Abdel-Aty, M. (2013). Utilizing support vector machine in real-time crash risk evaluation.
Accident Analysis & Prevention, 51, 252–259. https://doi.org/10.1016/j.aap.2012.11.027
Zhang, C., He, J., Wang, Y., Yan, X., Zhang, C., Chen, Y., Liu, Z., & Zhou, B. (2020). A Crash Severity
Prediction Method Based on Improved Neural Network and Factor Analysis. Discrete Dynamics
in Nature and Society. https://doi.org/10.1155/2020/4013185
Zhang, J., Li, Z., Pu, Z., & Xu, C. (2018). Comparing Prediction Performance for Crash Injury Severity
Among Various Machine Learning and Statistical Methods. IEEE Access, 6, 60079–60087.
https://doi.org/10.1109/ACCESS.2018.2874979
Zheng, L., & Sayed, T. (2020). A novel approach for real time crash prediction at signalized intersections.
Transportation Research Part C: Emerging Technologies, 117, 102683.
https://doi.org/10.1016/j.trc.2020.102683