Research Article
Transportation Research Record
1–13
© National Academy of Sciences:
Transportation Research Board 2019
Article reuse guidelines:
sagepub.com/journals-permissions
DOI: 10.1177/0361198119839347
journals.sagepub.com/home/trr
Clustering Approach toward Large Truck
Crash Analysis
Alireza Rahimi¹, Ghazaleh Azimi¹, Hamidreza Asgari¹, and Xia Jin¹
Abstract
Heterogeneity in crash data masks underlying crash patterns and complicates crash analysis. This paper explores an advanced high-dimensional clustering approach for investigating heterogeneity in large datasets. Detailed records of crashes involving large trucks in the state of Florida between 2007 and 2016 were examined to identify truck crash patterns and the significant conditions contributing to them. The block clustering method was applied to more than 220,000 crash records with nearly 200 attributes. The analysis showed promising results in segmenting a large heterogeneous dataset into meaningful subgroups (with a 95.72% average degree of homogeneity for selected blocks). Goodness of fit was evaluated, and both the integrated completed likelihood (ICL) and pseudo-likelihood values improved significantly (by 20.8% and 21.1%, respectively). Attribute clustering showed distinct characteristics for each cluster. Crash clustering revealed significant differences among the clusters and suggested that this crash dataset could be partitioned into same-direction, opposing-direction, and single-vehicle crashes. Individual blocks defined by both row and column clustering were further investigated to better understand the sets of conditions that contribute to large truck crashes. Major features of each of the three major crash types were analyzed, which may provide additional insights for developing countermeasures and strategies that target specific segments. The clustering approach could be used as a preanalysis method to identify homogeneous subgroups for further analysis, which would help enhance the effectiveness of safety programs.
The number of crashes involving large trucks has been increasing in the United States. In 2010, large trucks were involved in about 58,000 injuries and 3,494 fatal crashes. By 2016, the number of injury crashes involving large trucks had almost doubled, and fatal crashes had increased by more than 20% compared with 2010 (1). Large truck crashes impose an enormous loss on society. In addition to increased traffic congestion and property damage, they put roadway users at high risk of injury and fatality. There are also adverse consequences for the prosperity of industry, including delay-related costs, additional operating costs, and productivity loss. The cost of commercial vehicle crashes has been estimated to be over $99 billion annually (2).
In this regard, many studies have focused on large truck crash analysis and the identification of countermeasures to improve truck safety (3,4). One particular challenge in investigating the contributing factors of large truck crashes is the presence of heterogeneity, which refers to the correlation between unobserved factors and observed variables (5). In other words, the impacts of specific factors might vary among the observations, so that parameter coefficients follow a random distribution rather than taking fixed values. The presence of heterogeneity in different aspects of travel behavior and traffic data has been widely discussed (6,7). Traffic accident data are often heterogeneous because the occurrence and severity of a crash result from multiple contributing factors acting at the same time (8). Several heterogeneity issues may need to be addressed (9). First, some of the contributing factors may remain hidden. For instance, factors that are highly influential for a group of crashes might not be significant for the whole dataset (8–11). Second, the degree of effect of specific contributing factors might differ between the whole dataset and its subgroups (12,13). Moreover, certain contributing factors might have completely different effects on different groups, such as increasing the risk of fatal crashes for men and decreasing it for women (12).
¹Department of Civil and Environmental Engineering, Florida International University, Miami, FL
Corresponding Author:
Address correspondence to Xia Jin: xjin1@fiu.edu
Heterogeneity of crash data masks the underlying crash patterns and complicates crash analysis (8). Various approaches, such as mixed logit models or generalized structures (14–19), have been used to investigate heterogeneity in crash data. However, various phenomena may arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions). The major issue with high-dimensional datasets is that data points become increasingly sparse as the dimensionality increases (20), traditional techniques begin to fail, and the quality of results deteriorates (21). High-dimensional crash data require more robust methods to fully discover hidden patterns (22). In addition, recent efforts on heterogeneity have mostly focused on pedestrian or passenger vehicle crashes, with very few studies investigating truck crash heterogeneity. This paper, therefore, aims to explore an advanced high-dimensional approach, block clustering, to investigate heterogeneity in large datasets. Detailed records of crashes involving large trucks that occurred in the state of Florida between 2007 and 2016 were examined to identify truck crash patterns and the significant conditions contributing to them.
The next section provides a brief overview of existing
studies that have investigated heterogeneity in crash data-
sets. The methodology used in this research is described
in the third section. The fourth section describes the data.
The fifth section analyzes the results and the final section
concludes the study.
Literature Review
Two general approaches have been taken to investigate heterogeneity in crash data (8). Data constraining is a common approach that focuses on a very specific segment of the crash dataset. For example, Kitali et al. focused on multiple-vehicle crashes (23); Hadi et al. and Ghasemzadeh et al. analyzed incidents in work zones (24,25); and other specific subjects, such as crashes on rural or urban arterials (26,27) and crashes involving volatile or older adult drivers (28,29), have also been investigated.
Although data constraining is a useful approach in
analyzing a specific type of crash, it cannot be generalized
to other crash types, and therefore has limited applicabil-
ity. The second approach to discovering heterogeneity is
clustering. Different from classification problems in
which each observation is associated with a group and
the objective is to place a new observation in one of the
groups, cluster analysis seeks to discover the number and
composition of groups that best describes the characteris-
tics of the observations. Cluster analysis uses distance
measures over various dimensions in the dataset to dis-
cover clusters of similar objects (22,30). Crash data clus-
tering has been investigated mainly using two major
approaches, partitional and model-based clustering.
Partitional clustering algorithms optimize a particular objective function to identify clustering patterns that are present in the data and iteratively improve the quality of the partitions. Partitional clustering is also called prototype-based clustering because it requires parameters to be used as prototype points that represent each cluster (31). K-means clustering (32,33) is the most widely used partitional clustering algorithm in crash data analysis. In the first step, K representative points are selected as the initial centroids. Using a proximity measure, each point in the dataset is then assigned to the closest centroid. The centroid of each cluster is then updated based on the newly formed clusters. These two steps (assignment and update) are repeated until the centroids no longer change or another convergence criterion is met. Anderson used K-means clustering to profile road accident hotspots (34). Iranitalab et al. used K-means clustering for crash severity prediction (5). Mauro et al. applied K-means clustering to examine patterns of vehicle crashes before and after infrastructural interventions to improve road safety (35). Zhang et al. utilized K-means clustering in crash causality and severity analysis (36). K-medoids is another partitional clustering algorithm, one that is more resilient to outliers than K-means (34). The K-medoids algorithm seeks clustering solutions that minimize a specific objective function. It is more robust to noise and outliers in the data because actual data points are chosen as the prototypes (34). Nitsche et al. used K-medoids to investigate pre-crash scenarios at road junctions (37). It is worth mentioning that partitional methods are nondeterministic in nature and need a user-predefined number of clusters to obtain a solution (31).
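The assignment and update steps of K-means described above can be sketched in a few lines. This is a minimal illustration only, not the implementation used in any of the cited studies; the function name `kmeans`, the squared Euclidean proximity measure, and the toy points are assumptions made for the example.

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Plain K-means on a list of equal-length tuples; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initial centroids: k distinct observations drawn at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest centroid
        # (squared Euclidean distance is the assumed proximity measure).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an emptied cluster
        if new_centroids == centroids:  # convergence: centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated groups of 2-D points (illustrative only).
toy = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centroids = kmeans(toy, k=2)
```

The result depends on the random initial centroids (here fixed through `seed`) and on the user-supplied `k`, which mirrors the two caveats noted above for partitional methods.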
In the model-based clustering approach, the observations are assumed to follow a specific model. The model, which is often a statistical distribution, may be user-specified and might change during the process (38,39). Latent class clustering (LCC) is a probabilistic model-based clustering method that assumes a mixture of several probability densities within the data (40). Mohamed et al. used LCC in injury severity analysis of pedestrian–vehicle crashes (41). Iranitalab et al. used LCC for crash severity prediction (5). Depaire et al. segmented traffic accidents by means of LCC (8).
Despite the efficiency and simplicity of these clustering methods, they have several limitations. First, all existing studies have focused on segmenting the observations (i.e., crash records) and disregarded potential clusters among the explanatory variables. As a result, the focus falls on individual variable impacts rather than on the intertwined effects of a subset of factors that contribute to different types of crashes. Although traditional clustering methods could be applied to observations and variables separately, their very high computational complexity makes them unsuitable for high-dimensional datasets (42). In addition, the possible level of noise and the large amount of meaningless information in a large crash dataset require a robust clustering method (43).
In summary, two approaches can be used to analyze
heterogeneous crash data: data constraining and cluster-
ing. Although data constraining is simple, it has limited
applicability and might be subjective. The clustering
approach provides unbiased results for segmenting a
dataset and can enhance the homogeneity significantly.
This paper aims to apply a robust clustering method
to overcome the limitations of traditional clustering
methods. Given the above thoughts, block clustering,
also referred to as coclustering or biclustering, holds the
promise of addressing heterogeneity in high-dimensional
datasets. The next section presents a detailed description
of the block clustering algorithm employed for this
study.
Methodology
Coclustering utilizes the duality of clustering and dis-
covers hidden latent patterns by generating a compact
representation of the dataset. The goal is to cluster the
sets of rows and columns simultaneously to obtain
homogeneous blocks (44). This method has attracted
much attention in recent years for text mining (clustering
of documents and words simultaneously) (45), bioinfor-
matics (clustering of genes and tissues simultaneously)
(43) and social network analysis (46).
Block clustering considers the two sets, observations and variables, simultaneously and organizes the data into homogeneous blocks. If $X$ denotes an $n \times d$ data matrix defined by $X = \{(x_{ij});\ i \in I,\ j \in J\}$, where $I$ is a set of $n$ objects (rows, observations, crashes) and $J$ is a set of $d$ variables (columns, variables, attributes), the main objective of this method is to make permutations or rearrangements of observations and attributes to construct a correspondence structure on $I \times J$. An important advantage of block clustering is the transformation of the initial data matrix $X$ into a simpler and smaller data matrix with the same structure (42,44). Moreover, block clustering is fast and requires far less computation than that needed to process the two sets separately and consecutively, as in the well-known K-means algorithm (42).
Figure 1 illustrates an example of block clustering. Array (a) in Figure 1 presents a binary dataset consisting of $n = 10$ objects, $I = \{A, B, C, D, E, F, G, H, I, J\}$, and a set of $d = 7$ binary variables, $J = \{1, 2, 3, 4, 5, 6, 7\}$. To obtain a homogeneous dataset, array (a) can be reorganized either by a partition on $I$ or by partitions on $I$ and $J$ simultaneously. Array (b) consists of data reorganized by a partition of $I$ into $g = 3$ clusters, $a = \{A, C, H\}$, $b = \{B, F, J\}$, and $c = \{D, G, I, E\}$. Array (c) consists of data reorganized by the same partition of $I$ and a partition of $J$ into $m = 3$ clusters, $\mathrm{I} = \{1, 4\}$, $\mathrm{II} = \{3, 5, 7\}$, and $\mathrm{III} = \{2, 6\}$. Compared with array (b), array (c) clearly reveals an interesting pattern (42). The block clustering approach takes advantage of partitioning on $I$ and $J$ simultaneously and results in a more homogeneous dataset compared with traditional clustering models like K-means. Another advantage of the block clustering method is that it reduces the initial data matrix $X$ into a simpler data matrix having the same structure. In the example, the initial $(10 \times 7)$ binary data matrix is reduced to a $(g \times m) = (3 \times 3)$ summary binary data matrix (Figure 1d).
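The reorganization and summary steps illustrated by Figure 1 can be sketched as follows. The sketch assumes the row and column partitions are already known. The small 4 × 4 matrix, the labels, and the function names `reorder` and `block_summary` are invented for illustration and are not the data of Figure 1.

```python
def reorder(x, row_labels, col_labels):
    """Permute rows and columns so that members of the same cluster are adjacent."""
    row_order = sorted(range(len(x)), key=lambda i: row_labels[i])
    col_order = sorted(range(len(x[0])), key=lambda j: col_labels[j])
    return [[x[i][j] for j in col_order] for i in row_order]

def block_summary(x, row_labels, col_labels, g, m):
    """Return the g x m summary matrix holding the most frequent value of each block."""
    ones = [[0] * m for _ in range(g)]
    cells = [[0] * m for _ in range(g)]
    for i, row in enumerate(x):
        for j, value in enumerate(row):
            k, l = row_labels[i], col_labels[j]
            ones[k][l] += value
            cells[k][l] += 1
    # A block is summarized as 1 when at least half of its cells are 1.
    return [[1 if cells[k][l] and 2 * ones[k][l] >= cells[k][l] else 0
             for l in range(m)] for k in range(g)]

# Hypothetical binary data: 4 objects by 4 variables.
x = [[1, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 1]]
row_labels = [0, 1, 0, 1]   # assumed partition of the objects
col_labels = [0, 1, 0, 1]   # assumed partition of the variables

reordered = reorder(x, row_labels, col_labels)
summary = block_summary(x, row_labels, col_labels, g=2, m=2)
```

After reordering, the scattered 1s form two solid diagonal blocks, and the 4 × 4 matrix is summarized by a 2 × 2 binary matrix, in the same spirit as arrays (c) and (d).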
Different approaches can be applied for coclustering
and these approaches can differ in the pattern they seek
and the types of data they apply to. Govaert and Nadif
proposed a general framework to formalize the hypoth-
eses of coclustering algorithms (42). They introduced a
latent block model to solve the coclustering problem and
overcome the defects of classical coclustering methods.
They suggested a block clustering framework which uti-
lizes parsimonious models and allows a rigorous simula-
tion. This section presents a block clustering approach
based on the work of Govaert and Nadif (42) and Bhatia
et al. (47).
Figure 1. Block clustering, showing (a) binary data set, (b) data reorganized by a partition on $I$, (c) data reorganized by partitions on $I$ and $J$ simultaneously, and (d) summary binary data.
Mixture Models
A fundamental assumption of model-based clustering is that the data originate from a mixture of underlying probability distributions, where each component $k$ of the mixture indicates a cluster. Therefore, the matrix dataset $X = \{x_i;\ i \in (1, \ldots, n)\}$ is supposed to be independent and identically distributed and to arise from a probability distribution with density (42,44):
$$f(x; \theta) = \prod_{i} \sum_{k} \pi_k f_k(x_i; \alpha) \tag{1}$$
where
$f_k$ denotes the density function for the $k$th component,
$\alpha$ is the corresponding class parameter,
$\pi_k$ gives the probability that an observation belongs to the $k$th component, with $k = (1, \ldots, g)$ and $g$ assumed to be known, and
$\theta$ is the vector $(\pi_1, \ldots, \pi_g, \alpha)$.
Govaert and Nadif (42) showed that the density function can be rewritten as:
$$f(x; \theta) = \sum_{z \in Z} p(z) f(x \mid z; \alpha) \tag{2}$$
$$f(x \mid z; \alpha) = \prod_{i} f_{z_i}(x_i; \alpha) \tag{3}$$
$$p(z) = \prod_{i} \pi_{z_i} \tag{4}$$
where $Z$ stands for the set of all possible partitions of $I$ into $g$ clusters. Therefore, according to this function, the data matrix is supposed to be a sample of size 1 from a random $(n, d)$ matrix.
Latent Block Model
The set $I$ can be partitioned into $g$ clusters by $z = (z_{11}, \ldots, z_{ng})$, with $z_{ik} = 1$ if $i$ belongs to cluster $k$ and $z_{ik} = 0$ otherwise; $z_i = k$ if $z_{ik} = 1$; and $z_{.k} = \sum_i z_{ik}$ is the cardinality (the number of elements in a set) of row cluster $k$. Likewise, $J$ can be divided into $m$ clusters by $w = (w_{11}, \ldots, w_{dm})$, with $w_{jl} = 1$ if $j$ belongs to cluster $l$ and $w_{jl} = 0$ otherwise; $w_j = l$ if $w_{jl} = 1$; and $w_{.l} = \sum_j w_{jl}$ is the cardinality of column cluster $l$.
To investigate block clustering, Govaert and Nadif extended the mixture model density function and assumed that the labelings of $I$ and $J$ are independent of each other (42). The resulting latent block mixture model can be defined by the following probability density function (PDF):
$$f(x; \theta) = \sum_{(z, w) \in Z \times W} \prod_{i} \pi_{z_i} \prod_{j} \rho_{w_j} \prod_{i,j} f_{z_i w_j}(x_{ij}; \alpha) \tag{5}$$
where
$Z$ and $W$ denote the sets of all possible labelings $z$ of $I$ and $w$ of $J$, respectively,
$f_{z_i w_j}(x; \alpha)$ is the PDF defined on the real set $\mathbb{R}$, and
$\theta = (\pi, \rho, \alpha)$, where $\pi = (\pi_1, \ldots, \pi_g)$ and $\rho = (\rho_1, \ldots, \rho_m)$ are the vectors of probabilities $\pi_k$ and $\rho_l$ that a row and a column belong to the $k$th row component and to the $l$th column component, respectively.
According to the above formulation, the randomized data generation method can be described as follows (42,44):
Row labeling: Generate the labeling $z = (z_1, \ldots, z_n)$ according to the distribution $\pi = (\pi_1, \ldots, \pi_g)$.
Column labeling: Generate the labeling $w = (w_1, \ldots, w_d)$ according to the distribution $\rho = (\rho_1, \ldots, \rho_m)$.
Data generation: Generate, for $i = (1, \ldots, n)$ and $j = (1, \ldots, d)$, a value $x_{ij}$ according to the density distribution $f_{z_i w_j}(\cdot\,; \alpha)$.
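The three generation steps above can be sketched for the special case of binary data. The Bernoulli block density, the function name `sample_latent_block`, and all parameter values below are assumptions chosen for illustration; the formulation above leaves the block density general.

```python
import random

def sample_latent_block(n, d, pi, rho, alpha, seed=0):
    """Draw (x, z, w) from a latent block model with Bernoulli blocks.

    pi, rho: row- and column-cluster proportions (each should sum to 1).
    alpha[k][l]: probability of a 1 in block (k, l) (assumed Bernoulli density).
    """
    rng = random.Random(seed)
    # Row labeling: z_i drawn according to pi.
    z = rng.choices(range(len(pi)), weights=pi, k=n)
    # Column labeling: w_j drawn according to rho.
    w = rng.choices(range(len(rho)), weights=rho, k=d)
    # Data generation: x_ij drawn from the density of block (z_i, w_j).
    x = [[1 if rng.random() < alpha[z[i]][w[j]] else 0 for j in range(d)]
         for i in range(n)]
    return x, z, w

# Hypothetical noise-free blocks: each block is entirely 0 or entirely 1.
block_prob = [[0.0, 1.0], [1.0, 0.0]]
x, z, w = sample_latent_block(n=8, d=5, pi=[0.5, 0.5], rho=[0.4, 0.6], alpha=block_prob)
```

With block probabilities strictly between 0 and 1, the same function yields noisy blocks, which is the situation the estimation procedure has to handle.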
Model Parameter Estimation
EM-based algorithms (42,44,47) can be used to approximate the model parameters by maximizing the observed data log-likelihood. The complete data log-likelihood can be defined by the following function:
$$L_c(z, w, \theta) = \sum_{k} z_{.k} \log \pi_k + \sum_{l} w_{.l} \log \rho_l + \sum_{i,j,k,l} z_{ik} w_{jl} \log f_{kl}(x_{ij}; \alpha) \tag{6}$$
In this method, the conditional expectation $Q(\theta, \theta^{(c)})$ of the complete data log-likelihood, given the current estimate $\theta^{(c)}$ and $x$, is maximized to iteratively maximize the log-likelihood:
$$Q(\theta, \theta^{(c)}) = \sum_{i,k} t_{ik}^{(c)} \log \pi_k + \sum_{j,l} r_{jl}^{(c)} \log \rho_l + \sum_{i,j,k,l} e_{ikjl}^{(c)} \log f_{kl}(x_{ij}; \alpha) \tag{7}$$
where
$t_{ik}^{(c)} = p(z_{ik} = 1 \mid x, \theta^{(c)})$,
$r_{jl}^{(c)} = p(w_{jl} = 1 \mid x, \theta^{(c)})$, and
$e_{ikjl}^{(c)} = p(z_{ik} w_{jl} = 1 \mid x, \theta^{(c)})$.
Because of the dependence structure in the model, Govaert and Nadif (42) proposed an approximate solution using the interpretation of the EM algorithm by Hathaway (48) and Neal and Hinton (49). Therefore, the fuzzy clustering criterion for the latent block model can be defined as follows, in which $L_c$ is the fuzzy complete data log-likelihood associated with the latent block model:
$$\tilde{F}_c(t, r; \theta) = L_c(t, r, \theta) + H(t) + H(r) \tag{8}$$
where
$$\begin{cases} H(t) = -\sum_{i,k} t_{ik} \log t_{ik} \\ H(r) = -\sum_{j,l} r_{jl} \log r_{jl} \\ L_c(t, r; \theta) = \sum_{k} t_{.k} \log \pi_k + \sum_{l} r_{.l} \log \rho_l + \sum_{i,j,k,l} t_{ik} r_{jl} \log f_{kl}(x_{ij}; \alpha) \end{cases} \tag{9}$$
Algorithms
Govaert and Nadif (42) proposed a block expectation maximization (BEM) algorithm to maximize the fuzzy clustering criterion using the following steps.
E-Step: The conditional row and column class probabilities are computed, on the log scale and up to an additive normalizing constant, respectively as
$$\log t_{ik} = \log \pi_k + \sum_{j,l} r_{jl} \log f_{kl}(x_{ij}; \alpha) \tag{10}$$
$$\log r_{jl} = \log \rho_l + \sum_{i,k} t_{ik} \log f_{kl}(x_{ij}; \alpha) \tag{11}$$
The values are then exponentiated and normalized so that $\sum_k t_{ik} = 1$ and $\sum_l r_{jl} = 1$.
M-Step: The row proportions $\pi$, the column proportions $\rho$, and the model parameter $\alpha$ are calculated by maximizing $\sum_k t_{.k} \log \pi_k$, $\sum_l r_{.l} \log \rho_l$, and $\sum_{i,j,k,l} t_{ik} r_{jl} \log f_{kl}(x_{ij}; \alpha)$, which are the first, second, and last terms in $L_c$, respectively. The estimation of $\alpha$ depends on the PDF $f_{kl}$, which is discussed later for binary data.
Therefore, the BEM algorithm suggested by Govaert and Nadif (42) to maximize the fuzzy clustering criterion can be described as:
1. Initialize $t^{(0)}$, $r^{(0)}$, and $\theta^{(0)} = (\pi^{(0)}, \rho^{(0)}, \alpha^{(0)})$.
2. Compute $t^{(c+1)}$, $\pi^{(c+1)}$, and $\alpha^{(c+1/2)}$ by using the EM algorithm for the data matrix $u_{il} = \sum_j r_{jl} x_{ij}$, starting from $\pi^{(c)}$, $\rho^{(c)}$, and $\alpha^{(c)}$.
3. Compute $r^{(c+1)}$, $\rho^{(c+1)}$, and $\alpha^{(c+1)}$ by using the EM algorithm for the data matrix $v_{jk} = \sum_i t_{ik} x_{ij}$, starting from $r^{(c)}$, $\rho^{(c)}$, and $\alpha^{(c+1/2)}$.
4. Iterate steps 2 and 3 until convergence.
Block Mixture Models for Binary Datasets
This section summarizes the methodology and describes the final clustering model used, based on the blockcluster R package (47). The crash dataset used in this study included categorical and binary variables. Categorical variables were converted to dummy variables, and the following block mixture model was used to solve the binary block clustering problem. Govaert and Nadif (42) discussed how the Bernoulli probability distribution function, which is needed to find the model parameter $\alpha$, can be described as:
$$f_{kl}(x_{ij}; \alpha) = (\varepsilon_{kl})^{|x_{ij} - a_{kl}|} (1 - \varepsilon_{kl})^{1 - |x_{ij} - a_{kl}|} \tag{12}$$
$$\begin{cases} a_{kl} = 0,\ \varepsilon_{kl} = \rho_{kl} & \text{if } \rho_{kl} < 0.5 \\ a_{kl} = 1,\ \varepsilon_{kl} = 1 - \rho_{kl} & \text{if } \rho_{kl} > 0.5 \end{cases} \tag{13}$$
where
$\rho = (\rho_{kl})$, with $\rho_{kl} \in [0, 1]$, holds the probability of observing a 1 in block $(k, l)$, and
$a_{kl}$ and $\varepsilon_{kl}$ characterize the center and dispersion of block $(k, l)$, respectively. $a_{kl}$ represents the most frequent binary value, and $\varepsilon_{kl}$ gives the probability of having a value different from the center, for each block.
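Based on this Bernoulli probability distribution function, both the E and M steps can be redefined.
E-Step: The conditional row and column class probabilities can be found, on the log scale and up to an additive normalizing constant, by:
$$\log t_{ik} = \log \pi_k + \sum_{l} |u_{il} - r_{.l} a_{kl}| \log \frac{\varepsilon_{kl}}{1 - \varepsilon_{kl}} + \sum_{l} r_{.l} \log(1 - \varepsilon_{kl}) \tag{14}$$
$$\log r_{jl} = \log \rho_l + \sum_{k} |v_{jk} - t_{.k} a_{kl}| \log \frac{\varepsilon_{kl}}{1 - \varepsilon_{kl}} + \sum_{k} t_{.k} \log(1 - \varepsilon_{kl}) \tag{15}$$
$$u_{il} = \sum_{j} r_{jl} x_{ij} \tag{16}$$
$$v_{jk} = \sum_{i} t_{ik} x_{ij} \tag{17}$$
M-Step: The model parameter $\alpha$ is calculated as:
$$\begin{cases} a_{kl} = 0,\ \varepsilon_{kl} = \dfrac{y_{kl}}{t_{.k} r_{.l}} & \text{if } \dfrac{y_{kl}}{t_{.k} r_{.l}} < 0.5 \\ a_{kl} = 1,\ \varepsilon_{kl} = 1 - \dfrac{y_{kl}}{t_{.k} r_{.l}} & \text{otherwise} \end{cases} \tag{18}$$
where $y_{kl} = \sum_{i,j} t_{ik} r_{jl} x_{ij}$.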
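To make the alternating row and column updates for binary data concrete, a deliberately simplified sketch follows. It is not the blockcluster implementation and not the exact BEM schedule. It assumes one row update and one column update per iteration rather than inner EM runs, stores the probability of a 1 per block directly (converted to a center and a dispersion at the end), and starts from hard labels supplied by the caller. All function and variable names are hypothetical.

```python
import math

def softmax(log_weights):
    """Exponentiate and normalize log-scale weights so they sum to 1."""
    top = max(log_weights)
    weights = [math.exp(v - top) for v in log_weights]
    total = sum(weights)
    return [v / total for v in weights]

def log_f(value, p):
    """Bernoulli log-density of a 0/1 value when P(value = 1) = p."""
    return math.log(p) if value == 1 else math.log(1.0 - p)

def m_step(x, t, r, g, m, floor=1e-6):
    """Update pi, rho and the per-block probability of a 1 from soft memberships."""
    n, d = len(x), len(x[0])
    t_k = [sum(t[i][k] for i in range(n)) for k in range(g)]
    r_l = [sum(r[j][l] for j in range(d)) for l in range(m)]
    pi = [max(v / n, floor) for v in t_k]
    rho = [max(v / d, floor) for v in r_l]
    prob = [[0.5] * m for _ in range(g)]
    for k in range(g):
        for l in range(m):
            mass = t_k[k] * r_l[l]
            if mass > 0:
                y_kl = sum(t[i][k] * r[j][l] * x[i][j]
                           for i in range(n) for j in range(d))
                # Clip so that log(p) and log(1 - p) stay finite.
                prob[k][l] = min(max(y_kl / mass, floor), 1.0 - floor)
    return pi, rho, prob

def fit_binary_blocks(x, init_rows, init_cols, g, m, n_iter=20):
    """Alternate row and column updates, starting from hard initial labels."""
    n, d = len(x), len(x[0])
    t = [[1.0 if init_rows[i] == k else 0.0 for k in range(g)] for i in range(n)]
    r = [[1.0 if init_cols[j] == l else 0.0 for l in range(m)] for j in range(d)]
    pi, rho, prob = m_step(x, t, r, g, m)
    for _ in range(n_iter):
        # Row update: log t_ik = log pi_k + sum_{j,l} r_jl log f_kl(x_ij), then normalize.
        t = [softmax([math.log(pi[k]) + sum(r[j][l] * log_f(x[i][j], prob[k][l])
                                            for j in range(d) for l in range(m))
                      for k in range(g)]) for i in range(n)]
        pi, rho, prob = m_step(x, t, r, g, m)
        # Column update: log r_jl = log rho_l + sum_{i,k} t_ik log f_kl(x_ij), then normalize.
        r = [softmax([math.log(rho[l]) + sum(t[i][k] * log_f(x[i][j], prob[k][l])
                                             for i in range(n) for k in range(g))
                      for l in range(m)]) for j in range(d)]
        pi, rho, prob = m_step(x, t, r, g, m)
    rows = [max(range(g), key=lambda k: t[i][k]) for i in range(n)]
    cols = [max(range(m), key=lambda l: r[j][l]) for j in range(d)]
    # Convert each block probability to a center (most frequent value) and a dispersion.
    centers = [[1 if prob[k][l] >= 0.5 else 0 for l in range(m)] for k in range(g)]
    eps = [[min(prob[k][l], 1.0 - prob[k][l]) for l in range(m)] for k in range(g)]
    return rows, cols, centers, eps

# Hypothetical 6 x 6 data with two clean blocks; the third row starts mislabeled.
data = [[1, 1, 1, 0, 0, 0]] * 3 + [[0, 0, 0, 1, 1, 1]] * 3
rows, cols, centers, eps = fit_binary_blocks(
    data, init_rows=[0, 0, 1, 1, 1, 1], init_cols=[0, 0, 0, 1, 1, 1], g=2, m=2)
```

In this toy run the mislabeled third row moves to the first row cluster after the first row update, because its pattern matches that cluster's blocks far better. On real data the outcome depends on the initialization, so several starting points would normally be compared.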
Data Description
The data used in this study were extracted from the
Florida statewide crash database through the Signal
Four Analytics portal (50). The data were coded from
police crash reports including driver, vehicle, crash, and
citation information. Each crash involved at least one
large truck. Roadway network information was also
integrated in the database. Irrelevant information was
removed. The final dataset contains more than 200 attri-
butes. Categorical variables were recoded into dummy
variables as the applied methodology required binary
inputs.
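The recoding of categorical variables into dummy variables can be illustrated with a short sketch. The field name `light` and its levels are hypothetical and are not taken from the crash database; the helper `dummy_code` is likewise an assumption for the example.

```python
def dummy_code(records, column):
    """Expand one categorical column into one 0/1 indicator per observed level."""
    levels = sorted({rec[column] for rec in records})
    return [
        {f"{column}={level}": int(rec[column] == level) for level in levels}
        for rec in records
    ]

# Hypothetical crash records with a single categorical attribute.
crashes = [{"light": "daylight"}, {"light": "dark"}, {"light": "daylight"}]
coded = dummy_code(crashes, "light")
```

Each record then carries one binary attribute per level, which is the form of input a binary block model works with.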
The database recorded around 200 variables, describ-
ing the characteristics of the drivers, vehicles, crash
events, roadway geometry, lighting, and environment
conditions. The total sample involved 220,932 crashes
that occurred between 2007 and 2016, involving 228,180
large trucks, 180,702 non-truck motor vehicles, 1,902
fatalities, and 58,976 injuries.
Roadway problems were present in 1.9% of the two-
vehicle cases, and adverse weather and light conditions
were present in approximately 8.2% and 20.3% of the
crashes, respectively. Interruption in the traffic flow (pre-
vious crash, work zone, peak hour congestion, etc.) was
coded in almost 2.3% of the two-vehicle crashes.
About 74% of the crashes occurred on local roads,
state highways, interstate, or county roads. In 80% of the
accidents, crash severity was reported as property dam-
age only, but injury and fatality were coded in 19% and
1% of accidents, respectively. Hit-and-run and school
bus–related crashes were reported 6,538 and 1,056 times
respectively.
Table 1 below presents crash type by severity level.
Results Analysis
The very first step for block clustering is finding the opti-
mum number of clusters for rows and columns.
Biernacki et al. (51) suggested integrated completed like-
lihood (ICL), a criterion which can effectively maximize
the complete data likelihood, and has proven to be more
robust than the Bayesian information criterion (BIC) for
mixture models. For a detailed discussion of the ICL cri-
terion, readers are referred to Biernacki et al. (51) and
Bertoletti et al. (52).
A variety of combinations of row number (1 through
10) and column number (1 through 10) were tried, to find
the optimum number of blocks. ICL and pseudo-
likelihood values for each model were evaluated. The
optimum number of blocks was found to be 30 with 3
rows (K) and 10 columns (L), as neither ICL nor pseudo-
likelihood improved much when the numbers further
increased. The ICL and pseudo-likelihood values for all
models can be found in Figures 2 and 3. The results show
that ICL and pseudo-likelihood values for the optimum
number of blocks improved 20.8% and 21.1%, respec-
tively, compared with the initial dataset.
For the model with $K = (1, 2, 3)$ and $L = (1, \ldots, 10)$ clusters, the row proportions $\pi$ and column proportions $\rho$ are shown in Table 2. The first row-cluster, $K = 1$, covers 31.8% of all observations (crashes), and $K = 2$ and $K = 3$ contain 39.9% and 28.3% of accidents, respectively. The first column cluster, $L = 1$, consists of 10.3% of all variables included in the dataset.
Table 1. Crash Type by Severity for Large Truck Involved Crashes
Crash type | Property damage only: crashes | Property damage only: % | Injury: crashes | Injury: % | Fatality: crashes | Fatality: % | Total
1. Bicycle | 124 | 17.8% | 510 | 73.4% | 61 | 8.8% | 695
2. Head-on | 1,875 | 57.3% | 1,172 | 35.8% | 223 | 6.8% | 3,270
3. Left entering | 3,352 | 62.0% | 1,972 | 36.5% | 86 | 1.6% | 5,410
4. Left leaving | 1,368 | 56.4% | 978 | 40.3% | 79 | 3.3% | 2,425
5. Left rear | 1,954 | 67.3% | 930 | 32.0% | 20 | 0.7% | 2,904
6. Off-roadway | 18,864 | 87.3% | 2,639 | 12.2% | 95 | 0.4% | 21,598
7. Opposing sideswipe | 2,502 | 82.8% | 500 | 16.5% | 21 | 0.7% | 3,023
8. Other | 17,954 | 81.6% | 3,918 | 17.8% | 128 | 0.6% | 22,000
9. Pedestrian | 148 | 12.7% | 816 | 70.2% | 199 | 17.1% | 1,163
10. Rear-end | 34,074 | 68.6% | 15,240 | 30.7% | 380 | 0.8% | 49,694
11. Right angle | 4,887 | 60.7% | 3,004 | 37.3% | 160 | 2.0% | 8,051
12. Right/left | 418 | 89.3% | 50 | 10.7% | 0 | 0.0% | 468
13. Right/through | 3,667 | 79.5% | 928 | 20.1% | 19 | 0.4% | 4,614
14. Right/U-turn | 29 | 85.3% | 5 | 14.7% | 0 | 0.0% | 34
15. Rollover | 1,628 | 49.2% | 1,643 | 49.7% | 38 | 1.1% | 3,309
16. Same-direction sideswipe | 33,799 | 88.4% | 4,393 | 11.5% | 28 | 0.1% | 38,220
17. Unknown | 3,970 | 81.8% | 856 | 17.6% | 27 | 0.6% | 4,853
18. Single-vehicle | 7,609 | 87.6% | 993 | 11.4% | 83 | 1.0% | 8,685
19. Parked-vehicle | 26,076 | 95.6% | 1,127 | 4.1% | 73 | 0.3% | 27,276
20. Backed into | 11,560 | 92.5% | 926 | 7.4% | 14 | 0.1% | 12,500
21. Animal | 679 | 91.8% | 59 | 8.0% | 2 | 0.3% | 740
Total | 176,537 | 79.9% | 42,659 | 19.3% | 1,736 | 0.8% | 220,932
Column Clusters (Attribute Clustering)
Detailed results for variable clustering can be found in Table 3. The results show distinct characteristics for each cluster. It can be seen that driver age, crash location, vehicle condition, weather condition, roadway type, vehicle maneuver, and driver action were involved in defining the clusters. It may not always be obvious which are the most significant features for each cluster, but the clustering presents a helpful way to identify potential associations between the variables. A cluster of columns is a subset of columns that exhibit similar behavior across the rows (crashes) (43). Column clustering identifies coexistence between variables and implies that all attributes within a cluster will either occur together or not occur for a specific group of crashes (row-cluster).
For example, column cluster 2 represents collisions
with non-fixed objects on the roadway; column cluster 4
mostly contains crashes in parking lots; column cluster 5
mostly involves weekend crashes while driving above the
speed limit; cluster 7 may involve distracted drivers, and
cluster 10 involves female drivers and those changing
lanes. The degree of occurrence depends on the $\varepsilon_{kl}$ value, which will be discussed in the block clustering section.
Figure 2. Integrated complete likelihood values by number of blocks.
Figure 3. Pseudo-likelihood values by number of blocks.
Table 2. Row and Column Proportions for Block Cluster Model
L | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
ρ (%) | 10.3 | 3.4 | 5.2 | 6.9 | 8.6 | 6.9 | 19.0 | 19.0 | 12.1 | 8.6
K | 1 | 2 | 3 | na | na | na | na | na | na | na
π (%) | 31.8 | 39.9 | 28.3 | na | na | na | na | na | na | na
Note: na = not applicable.
Row Clusters (Crash Clustering)
A cluster of rows is a subset of rows that exhibit similar behavior across the columns (attributes) (43). The model identified three distinct row clusters. To further investigate the clusters, several variables, such as crash type, crash severity, crash time, manner of collision, most harmful events, and so forth, were evaluated to identify the latent patterns. Among all tested variables, the results revealed significant patterns only in conjunction with crash type. Table 4 below shows row clusters by crash type. Z-tests were conducted to examine the significance of the differences among the clusters.
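The text does not state which form of Z-test was applied. As one plausible reading, a pooled two-proportion Z-test comparing the share of a crash type between two row clusters could be sketched as follows. The counts in the example are hypothetical and are not taken from the study.

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion Z-test; returns (z statistic, two-sided p-value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical counts: 60 of 100 crashes in one cluster vs. 40 of 100 in another.
z_stat, p_value = two_proportion_z(60, 100, 40, 100)
```

A large absolute `z_stat` with a small `p_value` would indicate that the two clusters differ in their share of the crash type being compared.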
Results show that the first row-cluster (K= 1) mostly
contains rear-end and same-direction sideswipe crashes.
These two types of crashes are very similar to each other
in the sense that the involved vehicles are traveling in the
same direction. For the second row-cluster, K= 2, the
most dominant crashes are angle, head-on, and opposing
sideswipe crashes, which are again very similar to each
other as the vehicles involved are traveling in opposing
directions. Lastly, for K= 3, the most prevalent acci-
dents are park/off-roadway and single-vehicle crashes.
This cluster mostly includes crashes like rollover, collid-
ing with animal, pedestrian, bicyclist, fixed objects, or
parked vehicles. The results suggest that this crash data-
set can be generally categorized as same-direction
crashes, multi-direction crashes, and single-vehicle
crashes.
The above analysis of row clusters and column clus-
ters indicates that this clustering approach is able to iden-
tify relatively homogeneous groups within the dataset
that are meaningful and reliable (with robust statistical
foundations).
Table 3. Block Clustering Model Result: Column Clusters
L | Variables
1 | Driver age between 36 and 50 years old; Crash at intersection; Driver action, aggressive/careless maneuver; Vehicle year before 2000; Weather condition not clear; Vehicle maneuver action, straight ahead
2 | First harmful event, collision non-fixed object; First harmful event location, on roadway
3 | Vehicle at-fault, passenger car; Total lane 4 or more; Traffic way, two-way divided
4 | First harmful event, collision with fixed object; First harmful event location, off-roadway; Road system identifier, parking lot; Vehicle maneuver action, backing
5 | Above posted speed; Crash time, weekend; Driver action, other contributing action; Road surface condition, not dry; Vehicle maneuver action, other
6 | Crash within city limits; Estimated speed [0,25] mph; Unpaved or curb shoulder; Vehicle at-fault, medium/heavy trucks
7 | Estimated speed [76,100] mph; Vision obstructed; Traffic way, one-way; Driver age more than 66 years old; Driver action, illegal maneuver; Driver distracted; Road system identifier, U.S. Highway; Vehicle maneuver action stopped or slowing in traffic; Vehicle with defect; Vehicle at-fault, body type other; Vehicle at-fault, pickup truck
8 | Driver age between 16 and 20 years old; Driver condition at time of crash, not normal; Driving under the influence; First harmful event, non-collision; Road system identifier, local road; Road system identifier, forest or private road; Roadway alignment, curve; Roadway grade, not level; Vehicle at-fault, bus; Vehicle at-fault, light trucks; Vehicle at-fault, utility vehicle
9 | Driver age between 21 and 35 years old; Driver age between 51 and 65 years old; Driver action, improper maneuver; Driver action, no contributing action; Light condition, not daylight; Traffic way, two-way not divided; Vehicle maneuver action, turn
10 | Estimated speed [26,50] mph; Estimated speed [51,75] mph; At-fault driver gender, female; Road system identifier, interstate; Vehicle maneuver action change lane
Table 4. Crash Type by Row-Cluster
K | Rear-end | Same-direction sideswipe | Park/off-roadway | Single-vehicle | Angle | Head-on | Opposing sideswipe
1 | 44.2% | 56.6% | 13.1% | 22.3% | 23.3% | 30.7% | 24.1%
2 | 52.7% | 40.3% | 28.3% | 28.8% | 71.1% | 55.9% | 62.0%
3 | 3.1% | 3.2% | 58.7% | 48.9% | 5.6% | 13.4% | 13.9%
Total | 100% | 100% | 100% | 100% | 100% | 100% | 100%
Block Clusters (Both Attribute and Crash Clustering)
To further investigate which groups of attributes are more likely to be associated with which groups of crashes, this subsection focuses on the individual blocks, defined by both row and column clustering. A block cluster defines a subset of rows (crashes) that exhibit similar behavior across a subset of columns (attributes), and vice versa (43).
Figure 4 depicts the original and the clustered data with K = (1, 2, 3) and L = (1, ..., 10). Block clustering segmented the dataset very well. The figure shows 30 blocks (3 rows by 10 columns), with the green lines representing the block boundaries. As noted earlier, each block has two parameters: $a_{kl}$, the center of the block (its most frequent binary value), and $e_{kl}$, the dispersion (the probability that a cell takes a value different from the center). Therefore, $e_{kl}$ indicates how homogeneous each block is. Table 5 shows the $a_{kl}$ and $e_{kl}$ values. For instance, Figure 4 shows that the block with K = 1 and L = 2 mostly contains 1s (white squares) rather than 0s (black squares); correspondingly, Table 5 reports a center of True (i.e., 1) and a dispersion of 4.8% for this block, which means that 95.2% of its cells have the value 1.
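The two block parameters can be illustrated with a short sketch that computes the empirical center and dispersion of one block of a binary matrix, given a partition of its rows and columns. The paper estimates these quantities through the block clustering model; the function `block_parameters`, the toy matrix, and the cluster labels below are hypothetical and only show what the two numbers summarize.

```python
import numpy as np


def block_parameters(data, row_labels, col_labels, k, l):
    """Return (center, dispersion) for the block of row cluster k and column cluster l."""
    block = data[np.ix_(row_labels == k, col_labels == l)]
    share_of_ones = block.mean()
    # Center: the most frequent binary value in the block.
    center = bool(share_of_ones >= 0.5)
    # Dispersion: the share of cells that differ from the center.
    dispersion = float(min(share_of_ones, 1.0 - share_of_ones))
    return center, dispersion


# Hypothetical 4 x 5 binary matrix: rows are crashes, columns are attributes.
data = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 1, 1],
    [0, 0, 1, 1, 1],
])
row_labels = np.array([1, 1, 2, 2])     # hypothetical row clusters (K)
col_labels = np.array([1, 1, 2, 2, 2])  # hypothetical column clusters (L)

center, dispersion = block_parameters(data, row_labels, col_labels, k=1, l=2)
print(center, round(dispersion, 3))
```

In this toy block, one of six cells equals 1, so the center is False and the dispersion is one sixth; the degree of homogeneity is one minus the dispersion.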
To better understand the set of conditions contributing to each type of crash, the investigation focuses on blocks that have an acceptably low $e_{kl}$ and, therefore, a dominant $a_{kl}$, as highlighted in Table 5. The idea is to identify the subset of attributes relevant to each of the three subgroups of crashes. For instance, for same-direction crashes (K = 1), four blocks showed a high degree of homogeneity ($1 - e_{kl}$): the blocks (K, L) = {(1, 2), (1, 4), (1, 7), (1, 8)} were each more than 90% homogeneous. The associated subsets of attributes (column clusters 2, 4, 7, and 8) can be obtained from the column clustering result (Table 3) to describe this type of crash. Same-direction crashes are usually associated with the attributes in column cluster 2 (center True), but not with the attributes in column clusters 4, 7, and 8 (center False). The average degree of homogeneity for the selected blocks is 95.72%, which implies the robustness of the model.
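The block selection described above can be sketched as follows. The dispersion values are copied from Table 5; the 90% homogeneity cutoff generalizes the threshold quoted for the K = 1 blocks and is an assumption, as are the names `DISPERSION`, `MIN_HOMOGENEITY`, and `select_blocks`.

```python
# Dispersion (%) of each block, copied from Table 5 (rows K = 1..3, columns L = 1..10).
DISPERSION = {
    1: [33.2, 4.8, 47.3, 1.0, 10.9, 32.8, 6.5, 4.3, 19.1, 31.3],
    2: [36.4, 0.0, 16.4, 1.9, 10.8, 35.1, 7.3, 1.3, 22.6, 8.1],
    3: [27.9, 44.2, 6.7, 43.0, 15.4, 32.6, 5.3, 2.9, 25.6, 5.6],
}
MIN_HOMOGENEITY = 90.0  # assumed cutoff (%), not stated as a general rule in the text


def select_blocks(dispersion, min_homogeneity):
    """Return {(k, l): homogeneity} for blocks whose homogeneity exceeds the cutoff."""
    selected = {}
    for k, row in dispersion.items():
        for l, e in enumerate(row, start=1):
            homogeneity = 100.0 - e  # degree of homogeneity = 1 - dispersion
            if homogeneity > min_homogeneity:
                selected[(k, l)] = homogeneity
    return selected


selected = select_blocks(DISPERSION, MIN_HOMOGENEITY)
average = sum(selected.values()) / len(selected)
print(sorted(selected))
print(round(average, 2))
```

Under this assumed cutoff, 13 blocks are selected, including the four K = 1 blocks named in the text, and their average homogeneity rounds to 95.72%, which agrees with the reported value.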
The results indicate that for same-direction crashes, which include rear-end and same-direction sideswipe crashes, the first harmful events were most likely reported as a collision with a non-fixed object and happened on the roadway. They were not likely to take place on one-way streets, in parking lots, or on U.S. highways or local roads. These crashes were not likely to be caused by vision obstruction, backing, a stopped vehicle, or slowing in traffic. Trucks carrying hazardous materials were more likely to be involved in same-direction crashes. Work zones seemed to witness more same-direction crashes. The analysis also revealed that same-direction crashes were the most dangerous, usually resulting in more than one fatality or more than two injuries, and that female drivers were more likely to be involved in same-direction crashes than in other crash types (although they were still involved in fewer crashes than male drivers).

Figure 4. Original and clustered dataset using block clustering approach.

Table 5. Block Clustering Model Result

$a_{kl}$ value for each block
K/L  1      2      3      4      5      6      7      8      9      10
1    False  True   True   False  False  False  False  False  False  False
2    False  True   False  False  False  True   False  False  False  False
3    False  False  False  False  False  True   False  False  False  False

$e_{kl}$ value for each block
K/L  1      2      3      4      5      6      7      8      9      10
1    33.2%  4.8%   47.3%  1.0%   10.9%  32.8%  6.5%   4.3%   19.1%  31.3%
2    36.4%  0.0%   16.4%  1.9%   10.8%  35.1%  7.3%   1.3%   22.6%  8.1%
3    27.9%  44.2%  6.7%   43.0%  15.4%  32.6%  5.3%   2.9%   25.6%  5.6%
For opposing-direction crashes, which include angle,
head-on, and opposing sideswipe crashes, female and
senior drivers were rarely cited as at-fault and the esti-
mated speed was not likely to be above 25 mph. Similar
to same-direction crashes, the first harmful events were
most likely reported as collision with a non-fixed object
and happened on the roadway. They were not likely to
take place on one-way streets, in parking lots, or on U.S. highways or local roads. These crashes were not likely to be caused
by vision obstruction, backing, a stopped vehicle, or
slowing in traffic. It was revealed that pedestrians, bikes,
and mopeds were most commonly involved in opposing-
direction vehicle crashes and least frequently involved in
same-direction crashes. Applying a raised median that
prevents opposing-direction vehicle crashes could drasti-
cally diminish non-motorist crashes. School bus–related
accidents were more likely to occur in opposing-direction
crashes. Therefore, it seems beneficial to inform school
bus drivers about the high risk of this type of crash and
specially instruct them to prevent angle, head-on, and
opposing sideswipe crashes.
Last but not least, single-vehicle crashes and those involving parked vehicles or occurring off the roadway (including rollovers and collisions with animals, pedestrians, bicyclists, or fixed objects) were not likely to take place on two-way divided roadways with four or more lanes, and the estimated speed was not likely to be above 25 mph. By definition, these single-vehicle crashes involved trucks only. In these crashes, restraint systems (shoulder or lap belts) were more likely to go unused by motorists. Therefore, educating truck drivers on the benefits of restraint systems could help improve safety. A majority of the drivers held driver's licenses issued outside Florida. This indicates the need to notify or educate non-resident truck drivers who are unfamiliar with the roads in Florida about the high risk of rollovers and of collisions with animals, pedestrians, bicyclists, fixed objects, or parked vehicles.
Interestingly, some variables were found to be com-
mon among all three types of crashes which implies that
these attributes were general among large truck crashes.
Drivers were rarely found to be distracted, driving above 76 mph, or driving under the influence (DUI). Illegal maneuvers, vehicle defects, and driver
vision obstruction were not a significant cause for large
truck crashes. U.S. highway was found to be the safest
roadway for large trucks. Moreover, there are several
types of variables which were not found to be significant
in any of the clusters implying that they are not
contributing factors in large truck crashes; these include
driver age, vehicle age, weather condition, and type of
shoulder.
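The attributes shared by all three crash groups, and those that appear in none, can be traced back to the column clusters with a small set computation. The sketch below assumes that a block counts as selected when its dispersion in Table 5 is below 10%; the names `SELECTED_COLUMNS` and `ALL_COLUMNS` are illustrative.

```python
# Column clusters (L) whose blocks have dispersion below 10% in Table 5,
# listed per row cluster (K). The 10% cutoff is an assumption.
SELECTED_COLUMNS = {
    1: {2, 4, 7, 8},      # same-direction crashes
    2: {2, 4, 7, 8, 10},  # opposing-direction crashes
    3: {3, 7, 8, 10},     # single-vehicle crashes
}
ALL_COLUMNS = set(range(1, 11))

# Column clusters selected for every crash group: attributes shared by all three.
common = set.intersection(*SELECTED_COLUMNS.values())
# Column clusters selected for no crash group: attributes with no dominant pattern.
never_selected = ALL_COLUMNS - set.union(*SELECTED_COLUMNS.values())

print(sorted(common))
print(sorted(never_selected))
```

The shared column clusters come out as L = 7 and L = 8, whose centers are False for every K in Table 5. The unselected clusters are L = 1, 5, 6, and 9, which in Table 3 include driver age ranges, vehicle year, weather condition, and shoulder type.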
The findings from this study using clustering methods
showed very similar results to another study of heavy
truck crashes in Florida (53). In that study, the dataset
was initially segmented into seven categories, including
pedestrian, run-off-road/single-vehicle, same-direction,
opposite-direction, change-traffic-way/turning, intersect-
ing paths, and other, without using a clustering method.
Their results showed that same-direction and opposite-
direction crashes had distinct patterns, whereas the other
five categories revealed a similar pattern and were not
significantly different. This confirms the study findings
and implies that the proposed block clustering method is
able to produce reliable and meaningful results.
Conclusion
This study presents an effort to employ an advanced
high-dimensional clustering approach to large truck
crash analysis. A block clustering method was applied to
more than 220,000 crash records with nearly 200
attributes. The analysis showed promising results in seg-
menting the large heterogeneous dataset into meaningful
subgroups that provide additional insights for crash
analysis.
Attribute clustering showed distinct characteristics for
each cluster; driver age, crash location, vehicle condition,
weather condition, roadway type, vehicle maneuver, and
driver action were involved in defining the clusters.
Utilizing column clustering provides comprehensive
insights for crash study as the approach considers a
group of attributes that are likely to occur at the same
time rather than analyzing attributes individually.
Crash clustering revealed significant differences among the clusters and suggested that this crash dataset could be partitioned into same-direction (including rear-end and same-direction sideswipe), opposing-direction (including angle, head-on, and opposing sideswipe), and single-vehicle (including rollovers and collisions with animals, pedestrians, bicyclists, fixed objects, or parked vehicles) crashes.
Individual blocks, defined by both row and column clustering, were further investigated to better understand the contributing set of conditions that lead to large truck crashes. The average degree of homogeneity for the selected blocks is 95.72%, which implies the robustness of the model. Major features for each of the three major types of crashes were analyzed, which may provide insights to develop potential countermeasures for specific segments.
In particular, raised medians to target non-motorists’
crashes, notifying school bus drivers about the high risk
of opposing-direction crashes, and programs targeting
non-Florida truckers may help improve safety.
The suggested clustering approach can be used as a
preanalysis method for heterogeneous crash data. The
block clustering approach can lead to more robust mod-
els to segment the crash data for further analysis. In this
paper, the homogeneity improved significantly as ICL
and pseudo-likelihood values increased by 20.8% and
21.1% respectively in the optimized dataset. Findings of
the clustering method were confirmed by another study
(which employed a conventional segmenting approach)
conducted in the same area. This shows the potential of
clustering methods to produce meaningful results.
Although the dataset used was high-dimensional and
contained many crashes, it was limited to the state of
Florida and had limited attributes. Researchers are
encouraged to apply the methodology to more compre-
hensive datasets to obtain more general results. The
method can also be incorporated to improve the accu-
racy of truck crash prediction models as it provides
robust and statistically significant criteria to segment the
dataset.
Acknowledgments
This work is funded by the research office of the Florida
Department of Transportation (BDV29 977-31). Data were
extracted from the Signal Four Analytics database provided by
Ilir Bejleri and Liang Zhai at the University of Florida.
Author Contributions
The authors confirm contribution to the paper as follows: study
conception and design: AR and XJ; data processing: HA, AR,
and GA; analysis and interpretation of results: AR and XJ;
draft manuscript preparation: AR and XJ. All authors reviewed
the results and approved the final version of the manuscript.
References
1. Large Truck and Bus Crash Facts 2016. Analysis Division
Federal Motor Carrier Safety Administration. FMCSA-
RRA-17-016. U.S. Department of Transportation.
Washington, D.C., 2018.
2. Large Truck and Bus Crash Facts 2014. Analysis Division
Federal Motor Carrier Safety Administration. FMCSA-
RRA-16-001. U.S. Department of Transportation.
Washington, D.C., 2016.
3. Haleem, K., and A. Gan. Effect of Driver’s Age and Side
of Impact on Crash Severity along Urban Freeways: A
Mixed Logit Approach. Journal of Safety Research, Vol.
46, 2012, pp. 67–76.
4. Anastasopoulos, P., and F. Mannering. An Empirical
Assessment of Fixed and Random Parameter Logit
Models using Crash and Non-Crash-Specific Injury Data.
Accident Analysis and Prevention, Vol. 43, No. 3, 2011,
pp. 1140–1147.
5. Iranitalab, A., and A. Khattak. Comparison of Four Sta-
tistical and Machine Learning Methods for Crash Severity
Prediction. Accident Analysis and Prevention, Vol. 108,
Supplement C, 2017, pp. 27–36.
6. Shams, K., X. Jin, R. Fitzgerald, H. Asgari, and M. S.
Hossan. Value of Reliability for Road Freight Transporta-
tion: Evidence from a Stated Preference Survey in Florida.
Transportation Research Record: Journal of the Transporta-
tion Research Board, 2017. 2610: 35–43.
7. Jin, X., M. S. Hossan, H. Asgari, and K. Shams. Incorpor-
ating Attitudinal Aspects in Roadway Pricing Analysis.
Transport Policy, Vol. 62, 2018, pp. 38–47.
8. Depaire, B., G. Wets, and K. Vanhoof. Traffic Accident
Segmentation by Means of Latent Class Clustering.
Accident Analysis and Prevention, Vol. 40, 2008,
pp. 1257–1266.
9. Sasidharan, L., K. Wu, and M. Menendez. Exploring
the Application of Latent Class Cluster Analysis for Inves-
tigating Pedestrian Crash Injury Severities in Switzerland.
Accident Analysis and Prevention, Vol. 85, 2015,
pp. 219–228.
10. Valent, F., F. Schiava, C. Savonitto, T. Gallo, S. Brusa-
ferro, and F. Barbone. Risk Factors for Fatal Road Traffic
Accidents in Udine, Italy. Accident Analysis and Preven-
tion, Vol. 34, No. 1, 2002, pp. 71–84.
11. Yau, K. K. W. Risk Factors Affecting the Severity of Sin-
gle Vehicle Traffic Accidents in Hong Kong. Accident Anal-
ysis and Prevention, Vol. 36, No. 3, 2004, pp. 333–340.
12. Ulfarsson, G. F., and F. L. Mannering. Difference in Male
and Female Injury Severities in Sport-Utility Vehicle, Mini-
van, Pickup and Passenger Car. Accident Analysis and Pre-
vention, Vol. 36, No. 2, 2004, pp. 135–147.
13. Islam, S., and F. L. Mannering. Driver Aging and its Effect
on Male and Female Single-Vehicle Accident Injuries:
Some Additional Evidence. Accident Analysis and Preven-
tion, Vol. 37, No. 2, 2006, pp. 267–276.
14. Moore, D., W. Schneider, P. Savolainen, and M. Farzaneh.
Mixed Logit Analysis of Bicyclist Injury Severity Resulting
from Motor Vehicle Crashes at Inter-Section and Non-
Intersection Location. Accident Analysis and Prevention,
Vol. 43, 2011, pp. 621–630.
15. Shaheed, M. S., K. Gkritza, W. Zhang, and Z. Hans. A
Mixed Logit Analysis of Two-Vehicle Crash Severities
Involving a Motorcycle. Accident Analysis and Prevention,
Vol. 61, 2013, pp. 119–128.
16. Zeng, Z., W. Zhu, R. Ke, J. Ash, Y. Wang, J. Xu, and X.
Xu. A Generalized Nonlinear Model-Based Mixed Multinomial Logit Approach for Crash Data Analysis. Accident
Analysis and Prevention, Vol. 99, 2017, pp. 51–65.
17. Milton, J., V. Shankar, and F. L. Mannering. Highway
Accident Severities and the Mixed Logit Model: An
Exploratory Empirical Analysis. Accident Analysis and
Prevention, Vol. 40, No. 1, 2008, pp. 260–266.
18. Wu, Q., F. Chen, G. Zhang, X. C. Liu, H. Wang, and S.
M. Bogus. Mixed Logit Model-Based Driver Injury Sever-
ity Investigations in Single- and Multi-Vehicle Crashes on
Rural Two-Lane Highways. Accident Analysis and Preven-
tion, Vol. 72, 2014, pp. 105–115.
19. Cerwick, D., K. Gkritza, and M. Shaheed. A Comparison
of the Mixed Logit and Latent Class Methods for Crash
Severity Analysis. Analytic Methods in Accident Research,
Vol. 3–4, 2014, pp. 11–27.
20. Steinbach, M., L. Ertöz, and V. Kumar. The Challenges of Clustering High Dimensional Data. In New Directions in Statistical Physics (L. T. Wille, ed.), Springer Verlag, Berlin, Heidelberg, Germany, 2004, pp. 273–309.
21. Parsons, L. Subspace Clustering for High Dimensional
Data: A Review. ACM SIGKDD Explorations Newsletter:
Special Issue on Learning from Imbalanced Datasets, Vol.
6, No. 1, 2004, pp. 90–105.
22. Jain, A. K., M. N. Murty, and P. J. Flynn. Data Clustering:
A Review. ACM Computing Surveys (CSUR), Vol. 31, No.
3, 1999, pp. 264–323.
23. Kitali, A. E., E. Kidando, P. Martz, P. Alluri, T. Sando,
R. Moses, and R. Lentz. Evaluating Factors Influencing
the Severity of Three-Plus Multiple-Vehicle Crashes using
Real-Time Traffic Data. Transportation Research Record:
Journal of the Transportation Research Board, 2018.
2672(38): 128–137.
24. Hadi, M., Y. Xiao, T. Wang, S. F. Qom, L. Azizi, J. Jia, A.
Massahi, and M. S. Iqbal. Framework for Multi-Resolution
Analyses of Advanced Traffic Management Strategies. Tech-
nical Report. Lehman Center of Transportation Research
Florida International University, Miami, FL, 2016.
25. Ghasemzadeh, A., and M. M. Ahmed. A Tree-Based
Ordered Probit Approach to Identify Factors Affecting
Work Zone Weather-Related Crashes Severity in North
Carolina using the Highway Safety Information System
Dataset. Presented at 96th Annual Meeting of the Trans-
portation Research Board, Washington, D.C., 2017.
26. Haghighi, N., X. C. Liu, G. Zhang, and R. J. Porter.
Impact of Roadway Geometric Features on Crash Severity
on Rural Two-Lane Highways. Accident Analysis and Pre-
vention, Vol. 111, 2018, pp. 34–42.
27. Najaf, P., V. R. Duddu, and S. S. Pulugurtha. Predictabil-
ity and Interpretability of Hybrid Link-Level Crash
Frequency Models for Urban Arterials Compared to Clus-
ter-Based and General Negative Binomial Regression
Models. International Journal of Injury Control and Safety
Promotion, Vol. 25, No. 1, 2017, pp. 3–13.
28. Kamrani, M., A. J. Khattak, and T. Li. A Framework to
Process and Analyze Driver, Vehicle and Road Infrastruc-
ture Volatilities in Real-Time. Presented at 97th Annual
Meeting of the Transportation Research Board, Washing-
ton, D.C., 2018.
29. Motamedi, S., and J. H. Wang. Older Adult Drivers’ Chal-
lenges and In-Vehicle Technology Acceptance. Interna-
tional Journal for Traffic and Transport Engineering, Vol. 7,
No. 4, 2017, pp. 498–515.
30. Han, J., M. Kamber, and J. Pei. Data Mining:
Concepts and Techniques. Morgan Kaufmann, Waltham,
MA, 2001, pp. 335–393.
31. Aggarwal, C. C., and C. K. Reddy. Data Clustering Algorithms
and Applications. Chapman and Hall/CRC, Boca Raton, FL,
2013.
32. MacQueen, J. Some Methods for Classification and Analy-
sis of Multivariate Observations. In Proc., 5th Berkeley
Symposium on Mathematical Statistics and Probability.
University of California Press, Berkeley, CA, 1967,
pp. 281–297.
33. Lloyd, S. Least Squares Quantization in PCM. IEEE
Transactions on Information Theory, Vol. 28, No. 2, 1982,
pp. 129–137.
34. Anderson, T. K. Kernel Density Estimation and K-Means
Clustering to Profile Road Accident Hotspots. Accident
Analysis and Prevention, Vol. 41, 2009, pp. 359–364.
35. Mauro, R., M. D. Luca, and G. Dell’Acqua. Using a K-
Means Clustering Algorithm to Examine Patterns of
Vehicle Crashes in Before-After Analysis. Modern Applied
Science, Vol. 7, 2013, pp. 11–19.
36. Zhang, C., J. N. Ivan, and T. Jonsson. Collision Type Cate-
gorization Based on Crash Causality and Severity Analysis.
Presented at 86th Annual Meeting of the Transportation
Research Board, Washington, D.C., 2007.
37. Nitsche, P. Pre-Crash Scenarios at Road Junctions: A Clus-
tering Method for Car Crash Data. Accident Analysis and
Prevention, Vol. 107, 2017, pp. 137–151.
38. Brown, D. Efficient Functional Clustering of Protein
Sequences using the Dirichlet Process. Bioinformatics, Vol.
24, No. 16, 2008, pp. 1765–1771.
39. Berkhin, P. A Survey of Clustering Data Mining Techniques. In Grouping Multidimensional Data (J. Kogan, C. Nicholas, and M. Teboulle, eds.), Springer, Berlin, Heidelberg, Germany, 2006, pp. 25–71.
40. Vermunt, J. K., and J. Magidson. Latent Class Cluster
Analysis. Applied Latent Class Analysis. Cambridge Uni-
versity Press, Cambridge, UK, 2002, pp. 89–106.
41. Mohamed, M. G., N. Saunier, L. F. Miranda-Moreno,
and S. V. Ukkusuri. A Clustering Regression Approach: A
Comprehensive Injury Severity Analysis of Pedestrian-
Vehicle Crashes in New York, US and Montreal, Canada.
Safety Science, Vol. 54, 2013, pp. 27–37.
42. Govaert, G., and M. Nadif. Block Clustering with Ber-
noulli Mixture Models: Comparison of Different
Approaches. Computational Statistics and Data Analysis,
Vol. 52, No. 6, 2008, pp. 3233–3245.
43. Madeira, S. C., and A. L. Oliveira. Biclustering Algorithms
for Biological Data Analysis: A Survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 1, No. 1, 2004, pp. 24–45.
44. Govaert, G., and M. Nadif. Co-Clustering: Models, Algorithms
and Applications, 1st ed. Wiley-IEEE Press, Hoboken, NJ,
2013.
45. Dhillon, I. S. Co-Clustering Documents and Words using
Bipartite Spectral Graph Partitioning. Proceedings 7th
ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, KDD ’01, San Francisco, CA,
2001, pp. 269–274.
46. Wang, F., S. Lin, and P. S. Yu. Collaborative Co-Cluster-
ing across Multiple Social Media. Proc., 17th IEEE Inter-
national Conference on Mobile Data Management, IEEE,
Porto, Portugal, 2016.
47. Bhatia, P., S. Iovleff, and G. Govaert. Blockcluster: An R
Package for Model Based Co-Clustering. Journal of Statis-
tical Software, Vol. VV, No. II, 2014.
48. Hathaway, R. Another Interpretation of the EM Algo-
rithm for Mixture Distributions. Statistics and Probability
Letters, Vol. 4, No. 2, 1986, pp. 53–56.
49. Neal, R., and G. Hinton. A View of the EM Algorithm
That Justifies Incremental, Sparse, and Other Variants.
Learning in Graphical Models, 1998, pp. 355–368.
50. The GeoPlan Center. Signal Four Analytics. Department
of Urban & Regional Planning, University of Florida,
Gainesville, FL. https://s4.geoplan.ufl.edu/.
51. Biernacki, C., G. Celeux, and G. Govaert. Assessing a Mix-
ture Model for Clustering with the Integrated Classification
Likelihood. RR-3521. INRIA, 1998. https://hal.inria.fr/
inria-00073163/document.
52. Bertoletti, M., N. Friel, and R. Rastelli. Choosing the
Number of Clusters in a Finite Mixture Model using an
Exact Integrated Completed Likelihood Criterion.
METRON, Vol. 73, No. 2, 2015, pp. 177–199.
53. Spainhour, L. K., D. Brill, J. O. Sobanjo, J. Wekezer, and
P. V. Mtenga. Evaluation of Traffic Crash Fatality Causes
and Effects: A Study of Fatal Traffic Crashes in Florida
from 1998–2000 Focusing on Heavy Truck Crashes. Final
Report. Project No. BD-050. Florida Department of
Transportation, Tallahassee, FL, 2005.
The Standing Committee on Artificial Intelligence and
Advanced Computing Applications (ABJ70) peer-reviewed this
paper (19-02466).
The opinions, findings and conclusions expressed in this paper are
those of the authors and not necessarily those of the Florida
Department of Transportation or the U.S. Department of
Transportation.
Rahimi et al 13
... Kaiyang freeway imi et al. [36] 2019 ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ USA Florida ng et al. [7] 2019 ✓ ✓ ✓ ✓ China Shanxi pour et al. [6] 2018 ...
... Kaiyang freeway Rahimi et al. [36] 2019 ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ USA Florida Zhang et al. [7] 2019 ✓ ✓ ✓ ✓ China Shanxi Rezapour et al. [6] 2018 ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ USA Wyoming Choi et al. [22] 2014 ...
Article
Full-text available
With the growing demand for transportation and cargo between cities, the proportion of heavy vehicles in freeway traffic has been increasing in Iran and worldwide during the past decade. The impact of heavy vehicles on crash severity has long been a concern in the crash analysis literature for the prevalence of crashes in freeway traffic. The purpose of this study is to investigate the contribution of heavy vehicles to freeway crashes and uncover other causal factors. Using the comprehensive crash and traffic data from the Qazvin–Tehran freeway in Iran, from 2013 to 2018, 1350 crashes involving heavy vehicles were extracted regarding the weather conditions, weekday, main cause of the crash, driver gender, and culprit side. Considering crash severity calculation, the applied coefficient weights in this study for a person were considered as 3 for an accident resulting in injury and 5 for a fatal crash. A binary logit model was estimated using the data to determine if there was a significant correlation between recognized factors and the likelihood of the crash. The logit modeling results clearly illustrate important relationships between various risk factors and occupant injury, in which heavy vehicles were recognized as one of the most important factors in this study. Other variables associated with crash severity were weather conditions and driver attention. Results indicate that the number of crashes is simultaneously dependent on the total vehicle volume and average speed of heavy vehicles.
... One approach to analyzing the factors affecting the severity and frequency of large truck crashes is data clustering or dividing into sub-groups. The data clustering approach to large truck crashes in Florida for the 2007-2016 period suggested that crash datasets could be sub-grouped into same-direction, opposite-direction, and single-vehicle crash datasets to obtain a better understanding of the effects of contributing factors [37]. The dividing approach of the analysis period on urban freeways in Texas between 2006 and 2010 using different logit models suggests that the contributing factors related to injury severity varied over five periods per day [38]. ...
Article
Full-text available
Freight transportation, dominated by trucks, is an integral part of trade and production in the USA. Given the prevalence of large truck crashes, a comprehensive investigation is imperative to ascertain the underlying causes. This study analyzed 2017–2021 Texas crash data to identify factors impacting large truck crash rates and injury severity and to locate high-risk zones for severe incidents. Logistic regression models and bivariate analysis were utilized to assess the impacts of various crash-related variables individually and collectively. Heat maps and hotspot analysis were employed to pinpoint areas with a high frequency of both minor and severe large truck crashes. The findings of the investigation highlighted night-time no-passing zones and marked lanes as primary road traffic control, highway or FM roads, a higher posted road speed limit, dark lighting conditions, male and older drivers, and curved road alignment as prominent contributing factors to large truck crashes. Furthermore, in cases where the large truck driver was determined not to be at fault, the likelihood of severe collisions significantly increased. The study’s findings urge policymakers to prioritize infrastructure improvements like dual left-turn lanes and extended exit ramps while advocating for wider adoption of safety technologies like lane departure warnings and autonomous emergency braking. Additionally, public awareness campaigns aimed at reducing distracted driving and drunk driving, particularly among truck drivers, could significantly reduce crashes. By implementing these targeted solutions, we can create safer roads for everyone in Texas.
... Association rules were also used to study the characteristics of expressway traffic accidents and to analyze the influencing factors and causes of injuries and fatalities leading to such traffic crashes (Chen et al. 2020). Finally, the block clustering method was applied to investigate the heterogeneity in large truck crash datasets and to provide additional insights to develop potential countermeasures and strategies (Rahimi et al. 2019). ...
Chapter
With the prevalence of large amount of information, there is an increasing need to digitalize and automate the associated data handling techniques. This is particularly important to enable reliable, effective, efficient, quick, and optimal decision-making processes. Coupled with the increase in computational power and the wide availability of cloud‐driven capabilities, artificial intelligence has offered unprecedented opportunities to retrieve and reveal remarkable patterns, trends, relationships, and knowledge from big data sets. For many decades, artificial intelligence has been applied to address challenges and provide solutions in different research areas and application domains. In relation to that, the goal of this chapter is to offer an introduction to the fundamental and essential methods, principles, and applications of artificial intelligence. To this end, this chapter focuses on the state-of-the-art research in the area of artificial intelligence with specific consideration to the following three domains: (1) Civil Infrastructure Applications (which include (i) bridges, pavements, and transportation systems; (ii) underground facilities; and, (iii) water systems, resilience, and electrical/power systems; (2) Construction Engineering and Management Applications (which include (i)construction-related activities; (ii) planning; and, (iii) facility management), and (3) Safety Applications (which include (i) construction safety management; (ii) accident analysis, and (iii) fire safety).
... For the KM method, the nearest objects to the mean of data are defined as the cluster centers in one cluster. If each data point could precisely define the position of the system, the KM method could render a decent classification (Rahimi et al., 2019). ...
Article
In this work, a combination of isotopic and hydrogeochemical data of a karstic region was clustered with four distinct clustering analysis (CA) methods to study water evolution in a vulnerable karstic region to improve protection, sustainability, and enhanced water resource management. Four CA methods, including hierarchical cluster analysis (HCA), K-means (KM), and fuzzy logic CA methods, fuzzy C-mean (FCM), and genetic K‐means (GKM), have been utilized to analyze hydrochemical, chemical, and isotopic datasets, including dissolved inorganic carbon (DIC), δ¹³C-DIC, δ¹⁸O, and δ²H datasets of water resources of Paveh-Javanrud (PV-JR) karstic region, located at the western border of Iran and Iraq countries. The utilized dataset contains 34 water samples with varied origination to evaluate the performance of each model and find the best method based on a meaningful categorization of geological, hydrogeochemical, and isotopic characteristics. Finally, the best model results were matched graphically with developed geospatial graphs to visualize the correlation between the region's water resources. Accordingly, the FCM and GKM methods represent the same, yet meaningful results and have the best performance among the four methods. It was also identified that the PV-JR water resources could be generally categorized into five distinct clusters, including FC1 to FC5 and GK1 to GK5, of which two clusters that have mixing, two clusters with solo-origination and no sign of mixing, and finally, a seasonal spring which is categorized as a separate cluster. 
Potentially, studying water resources via theoretical methods combined with considering isotope hydrology is of particular interest since solving the environmental issues related to karstic regions and their water resource management are shared concerns in most arid and semi-arid countries, especially in the Middle East as this study, thus could lay a basis for the following scientific attempts involving hydrogeochemical studies and advanced statistical analysis.
Article
Full-text available
Background Highway safety remains a significant issue, with road crashes being a leading cause of fatalities and injuries. While several studies have been conducted on crash severity, few have analyzed and predicted specific types of crashes, such as fatal crashes. Identifying the key factors associated with fatal crashes and predicting their occurrence can help develop effective preventative measures. Objective This study intended to develop cluster analysis and ML-based models using crash data to extract the prominent factors behind fatal crash occurrences and analyze the inherent pattern of variables contributing to fatal crashes. Methods Several branches and categories of supervised ML models have been implemented for fatality prediction and their results have been compared. SHAP analysis was conducted using the ML model to explore the contributing factors of fatal crashes. Additionally, the underlying hidden patterns of fatal crashes have been evaluated using K-means clustering, and specific fatal crash scenarios have been extracted. Results The deep neural networks model achieved 85% accuracy in predicting fatal crashes in Kansas. Factors, such as speed limits, nighttime, darker road conditions, two-lane highways, highway interchange areas, motorcycle and tractor-trailer involvement, and head-on collisions were found to be influential. Moreover, the clusters were able to discern certain scenarios of fatal crashes. Conclusion The study can provide a clear image of the important factors related to fatal crashes, which can be utilized to create new safety protocols and countermeasures to reduce fatal crashes. The results from cluster analysis can facilitate transportation professionals with representative scenarios, which will benefit in identifying potential fatal crash conditions.
Article
This research examines the injury severity of single-vehicle large-truck crashes in Florida while exploring the role of heterogeneity. A random parameter ordered logit (RPOL) model was applied to 27,505 single-vehicle large-truck crashes from 2007 to 2016 in Florida, and the contributing factors were identified. Random parameters and interaction effects were introduced to the model to determine the heterogeneity and its potential sources. The results suggested that driving speed of 76-120 mph and defective tires were the most influential factors in crash injury severity, increasing the probability of severe crashes. Regarding truckers' attributes, asleep or fatigued conditions and driving under the influence were correlated with a higher possibility of severe crashes. Interestingly, the results showed that truckers from outside the state of Florida were less likely to cause severe single-vehicle large-truck crashes compared to their Floridian counterparts. Y-intersections were also found to be high-risk locations for single-vehicle large-truck crashes, leading to more severe outcomes. Regarding heterogeneity, the results indicated that the impacts of driving speed (26-50 mph) and light condition (dark-not lighted) significantly varied among the observations, and these variations could be attributed to driver action, vision obstruction, driver distraction, roadway type and roadway alignment.
Conference Paper
Despite extensive research on traffic injury severities, relatively little is known about the factors contributing to truck-involved crashes in developing countries, especially in the context of Bangladesh. Due to the unavailability of authentic crash data sources, this study collected data from alternative sources such as online English news media reports. The study prepared a database of 144 truck-involved fatal crash reports covering a twelve-month period (January 2021 to December 2021). The crash reports contain a bag of 15,300 words. Several state-of-the-art text mining tools were utilized to identify crash patterns, including word cloud analysis, word frequency analysis, word co-occurrence network analysis, rapid automatic keyword extraction, and topic modeling. The analysis revealed several important crash contributing factors such as the type of vehicle involved (auto-rickshaw, bus, van, motorcycle), manner of collision (head-on), time of the day (morning, night), driver behavior (speeding, overtaking, wrong-way driving), and environmental factors (dense fog). In addition, ‘coming from opposite direction’ and ‘head-on collision’ are two important sequences of events in truck-involved crashes. Trucks are also involved in crashes with trains at rail crossings. The findings of this research can assist policymakers in identifying crash avoidance strategies to lower truck-related crashes in Bangladesh.
Article
To effectively fight against traffic accidents, it is of great importance to analyse and understand the conditions that are linked with accidents. Such an analysis can serve as the basis to (i) develop reactive measures by finding the links between the pre-accident conditions and (ii) devise proactive strategies that will prevent the occurrence of accidents by making the vehicles safer. This paper contributes to the advancement of both approaches. For (i), one needs to identify the patterns in accidents. For (ii), the introduction of Connected and Automated Vehicles (CAVs) is a promising solution. However, CAVs need to be tested under numerous traffic scenarios to prove their safety before their deployment on public roads. This creates a great demand for high-quality test scenarios for CAVs. This paper achieves two goals. First, it analyses past traffic accidents (UK’s STATS19 database) to identify trends in the heterogeneous accident data and unravel the relationships between pre-accident conditions. This is done using a clustering algorithm (ROCK). Seven distinct large clusters emerge as a result. Each of these clusters is then further analysed for its meaning using frequency analysis and geometric analysis. Second, the paper underpins the proactive route (ii) by systematically developing, using the information in each cluster, test-case scenarios for CAVs which reflect the risk-prone conditions of the respective clusters. This is done using a data mining method (Market Basket algorithm) and further geometric interpretation of clusters. In this way, explicit scenarios are developed that carry the characteristics of the clusters they come from.
Article
This study explores the crash injury severity of large truck-involved crashes, where the truck driver was identified as the at-fault driver. The paper focuses on vehicle-in-motion crashes that occurred on Florida’s state highways between 2007 and 2016. A random parameter ordered logit (RPOL) model was developed to identify random parameters and interaction effects. Results indicated that not using restraint systems, running a red light, wrong-way driving, failing to yield the right of way, tire or brake defects, and dark conditions had positive associations with higher levels of crash injury severity. The random variables—straight alignment, paved shoulders, and unpaved shoulders—showed significant random effects among the observations. For straight alignment, running red lights, following too closely, vision obstruction caused by fixed objects, and vision obstruction caused by fog were the sources of heterogeneity. Unpaved shoulders, running red lights, wrong-way driving, and the presence of parked or stopped vehicles were found as interaction effects. Results showed that accounting for heterogeneity and interaction effects significantly improved the goodness of fit of the model. This study provides more comprehensive knowledge of the influencing factors of large truck crashes by considering the role of heterogeneity and its potential sources in crash injury severity.
Article
Driving is an essential activity in living a fulfilling lifestyle. Older adults, like the rest of the population, require a means of transportation to participate in important lifestyle choices; however, declines in their sensory, motor, perceptual, and cognitive abilities limit their driving capabilities. These limitations motivated this study to investigate older adult drivers' driving challenges by administering a questionnaire. The in-vehicle technologies which mitigate these challenges were identified. The acceptance of the identified technologies was then explored through a second questionnaire, which considered a four-dimensional model comprising perceived usefulness, perceived ease of use, perceived safety, and perceived anxiety. In total, 250 older adult drivers participated in these questionnaires. The responses obtained from both questionnaires identified potential challenges that they were facing and whether they intend to use the identified in-vehicle technologies. Having more information about the acceptance of these technologies can help engineers better understand the factors that make technologies useful to older adult drivers, and thus improve their driving safety.
Conference Paper
Work zone crashes are still on the rise due to the aging of US roads and the increase in traffic demand. Investigating crash characteristics and determining contributing factors in work zones is one of the most important issues in many traffic safety studies. The effect of work zones on traffic safety can be exacerbated by weather conditions. A sudden reduction in visibility may intensify the severity of work zone crashes. Although many studies have investigated work zone crashes, research that investigates the impact of adverse weather conditions on work zone crashes is lacking. In this study, the Highway Safety Information System database for North Carolina was used to identify the characteristics of work zone weather-related crashes. A Tree-based Ordered Probit, a relatively recent and promising combination of nonparametric machine learning (decision tree) and classical statistics (ordered probit) techniques, was utilized to gain a better understanding of the effects of various factors on work zone injury and crash severity in adverse weather conditions. The results showed that the Tree-based Ordered Probit model performs better than the conventional Ordered Probit model. Lighting conditions, number of vehicles involved in a crash, road characteristics, number of occupants, land use, presence of traffic control devices, and two types of crashes (sideswipe and rear-end crashes) were identified as the most important factors in work zone weather-related crash severity.
Article
This paper presents the findings of a study recently conducted in Florida to quantify freight users’ willingness to pay (WTP) for the improvement of transportation-related attributes, particularly reliability. A stated preference survey was developed and administered between January and May 2016. The survey collected responses from 150 shippers, carriers, and forwarders. Econometric models, including mixed and multinomial logit models, were developed to estimate the users’ WTP and to investigate the presence of user heterogeneity. The value of time and the value of reliability were estimated separately for the various user groups. The results indicated that carriers showed the lowest WTP when their WTP was compared with that of other freight users. Shippers without transportation (that is, shippers who contracted out their shipping) exhibited more interest in travel time savings, whereas shippers with transportation showed more sensitivity to reliability. Preference heterogeneity was also explored by commodity group and product type. The results confirmed the findings from past studies and showed significant differences in WTP values when the sources of heterogeneity were considered. This paper contributes to the literature by providing empirical evidence of the quantification of the value of reliability in road freight transportation and the impacts of user heterogeneity. The study results will help advance understanding of the impacts of the performance of transportation systems on the freight industry.
Article
Multiple-vehicle crashes involving at least two vehicles constitute over 70% of fatal and injury crashes in the U.S. Moreover, multiple-vehicle crashes involving three or more vehicles (3+) are usually more severe compared with the crashes involving only two vehicles. This study focuses on developing 3+ multiple-vehicle crash severity models for a freeway section using real-time traffic data and crash data for the years 2014–2016. The study corridor is a 111-mile section on I-4 in Orlando, Florida. Crash injury severity was classified as a binary outcome (fatal/severe injury and minor/no injury crashes). To identify a reliable relationship between severe 3+ multiple-vehicle crashes and the identified explanatory variables, a binary probit model with a Dirichlet random effect parameter was used. More specifically, the Dirichlet random effect model was introduced to account for unobserved heterogeneity in the crash data. The probit model was implemented using a Bayesian framework and the ratios of the Monte Carlo errors were monitored to achieve parameter estimation convergence. The following variables were found significant at the 95% Bayesian credible interval: logarithm of average vehicle speed, logarithm of average equivalent 10-minute hourly volume, alcohol involvement, lighting condition, and number of vehicles involved (3, or >3) in multiple-vehicle crashes. Further analysis involved analyzing the posterior probability distributions of these significant variables. The study findings can be used to associate certain traffic conditions with severe injury crashes involving 3+ multiple vehicles, and can help develop effective crash injury reduction strategies based on real-time traffic data.
Article
Given the recent advancements in autonomous driving functions, one of the main challenges is safe and efficient operation in complex traffic situations such as road junctions. There is a need for comprehensive testing, either in virtual simulation environments or on real-world test tracks. This paper presents a novel data analysis method including the preparation, analysis and visualization of car crash data, to identify the critical pre-crash scenarios at T- and four-legged junctions as a basis for testing the safety of automated driving systems. The presented method employs k-medoids to cluster historical junction crash data into distinct partitions and then applies the association rules algorithm to each cluster to specify the driving scenarios in more detail. The dataset used consists of 1056 junction crashes in the UK, which were exported from the in-depth "On-the-Spot" database. The study resulted in thirteen crash clusters for T-junctions, and six crash clusters for crossroads. Association rules revealed common crash characteristics, which were the basis for the scenario descriptions. The results support existing findings on road junction accidents and provide benchmark situations for safety performance tests in order to reduce the possible number of parameter combinations.
Article
Crash severity prediction models enable different agencies to predict the severity of a reported crash with unknown severity or the severity of crashes that may be expected to occur sometime in the future. This paper had three main objectives: comparing the performance of four statistical and machine learning methods, namely Multinomial Logit (MNL), Nearest Neighbor Classification (NNC), Support Vector Machines (SVM) and Random Forests (RF), in predicting traffic crash severity; developing a crash costs-based approach for comparing crash severity prediction methods; and investigating the effects of two data clustering methods, K-means Clustering (KC) and Latent Class Clustering (LCC), on the performance of crash severity prediction models. The 2012-2015 reported crash data from Nebraska, United States was obtained and two-vehicle crashes were extracted as the analysis data. The dataset was split into training/estimation (2012-2014) and validation (2015) subsets. The four prediction methods were trained/estimated using the training/estimation dataset and the correct prediction rates for each crash severity level, overall correct prediction rate and a proposed crash costs-based accuracy measure were obtained for the validation dataset. The correct prediction rates and the proposed approach showed that NNC had the best prediction performance overall and for more severe crashes. RF and SVM were the next best performers and MNL was the weakest method. Data clustering did not affect the prediction results of SVM, but KC improved the prediction performance of MNL, NNC and RF, while LCC improved MNL and RF but weakened the performance of NNC. Overall correct prediction rate had almost the exact opposite results compared to the proposed approach, showing that neglecting the crash costs can lead to misjudgment in choosing the right prediction method.
Article
The impacts of behavioral attitudes are rarely explored when it comes to roadway pricing strategies. The existing literature mainly focuses on observed traveler or trip characteristics and is less likely to capture latent preferences or heterogeneity of roadway users. To address this knowledge gap, this study examines how underlying behavioral attitudes affect drivers' choices in utilizing managed lane facilities. Using data from the South Florida Expressway Stated Preference Survey, factor analysis was conducted on ten attitudinal statements, and four latent attitudinal factors were identified: willingness to pay, willingness to shift travel schedule, utility (cost/time) sensitivity, and congestion tolerance. In order to assess the utility of managed lanes for drivers, two sets of multinomial logit (MNL) models were developed using combined revealed preference (RP) and stated preference (SP) data, with and without these attitudinal factors. Results indicated a significant contribution of attitudinal parameters to the model, both in terms of coefficients and model performance. The factors were further used in a cluster analysis which identified major segments of roadway users. Such market segmentation is expected to provide valuable insights in capturing travelers' behavior while accounting for attitudinal aspects, which could enhance transportation planning efforts and policy making procedures.
Article
Machine learning (ML) techniques have higher prediction accuracy compared to conventional statistical methods for crash frequency modelling. However, their black-box nature limits their interpretability. The objective of this research is to combine ML and statistical methods to develop hybrid link-level crash frequency models with high predictability and interpretability. For this purpose, the M5′ model trees method (M5′) is introduced and applied to classify the crash data and then calibrate a model for each homogeneous class. Data for 1134 and 345 randomly selected links on urban arterials in the city of Charlotte, North Carolina were used to develop and validate the models, respectively. The outputs from the hybrid approach are compared with the outputs from cluster-based negative binomial regression (NBR) and general NBR models. Findings indicate that M5′ has high predictability and is reliable for interpreting the role of different attributes in crash frequency compared to the other developed models.