Research Article
Transportation Research Record
1–13
© National Academy of Sciences:
Transportation Research Board 2019
Article reuse guidelines:
sagepub.com/journals-permissions
DOI: 10.1177/0361198119839347
journals.sagepub.com/home/trr
Clustering Approach toward Large Truck Crash Analysis
Alireza Rahimi¹, Ghazaleh Azimi¹, Hamidreza Asgari¹, and Xia Jin¹
Abstract
Heterogeneity of crash data masks the underlying crash patterns and perplexes crash analysis. This paper aims to explore an
advanced high-dimensional clustering approach to investigate heterogeneity in large datasets. Detailed records of crashes
involving large trucks occurring in the state of Florida between 2007 and 2016 were examined to identify truck crash patterns
and significant conditions contributing to the patterns. The block clustering method was applied to more than 220,000 crash
records with nearly 200 attributes. The analysis showed promising results in segmenting a large heterogeneous dataset into
meaningful subgroups (with 95.72% average degree of homogeneity for selected blocks). The goodness of fit for clustering
methods is evaluated and both integrated completed likelihood (ICL) and pseudo-likelihood values improved significantly
(20.8% and 21.1% respectively). Attribute clustering showed distinct characteristics for each cluster. Crash clustering revealed
significant differences among the clusters and suggested that this crash dataset could be partitioned into same-direction,
opposing-direction, and single-vehicle crashes. Individual blocks defined by both row and column clustering were further
investigated to better understand the contributing set of conditions that lead to large truck crashes. Major features for each of the
three major types of crashes were analyzed, which may provide additional insights to develop potential countermeasures and
strategies that target specific segments. The clustering approach could be used as a preanalysis method to identify homoge-
neous subgroups for further analysis, which will help enhance the effectiveness of safety programs.
The number of crashes involving large trucks has been
increasing in the United States. In 2010, large trucks
were involved in about 58,000 injuries and 3,494 fatal
crashes. In 2016, the number of injury
crashes involving large trucks had almost doubled, and fatal
crashes had increased by more than 20% compared with
2010 (1). Large truck crashes impose an enormous
amount of loss on society. In addition to increased traffic
congestion and property damage, they put roadway users
at high risk of injury and fatality. There are also adverse
consequences for the prosperity of industry, including
delay-related cost, additional operations costs, and pro-
ductivity loss. The cost of commercial vehicle crashes has
been estimated to be over $99 billion annually (2).
In this regard, many studies have focused on large
truck crash analysis and identification of countermea-
sures to improve truck safety (3,4). One particular chal-
lenge in investigating contributing factors of large truck
crashes is the presence of heterogeneity, which refers to
the correlation between unobserved factors and observed
variables (5). In other words, the impacts of specific fac-
tors might vary among the observations, leading to ran-
dom distribution of parameter coefficients rather than
fixed impacts. Presence of heterogeneity in different
aspects of travel behavior and traffic data has been
widely discussed (6,7). Traffic accident data are often
heterogeneous considering that the occurrence and sever-
ity of a crash is the result of multiple contributing factors
at the same time (8). There may exist several heterogene-
ity issues that need to be addressed (9). First, some of the
contributing factors may remain hidden. For instance,
highly influential factors for a group of crashes might
not be significant for the whole dataset (8–11). The
degree of effect for specific contributing factors might be
different for the whole dataset and for subgroups (12,
13). Moreover, certain contributing factors might have a
completely different effect on different groups, such as
increasing the risk of fatal crashes for men and decreas-
ing it for women (12).
Heterogeneity of crash data masks the underlying
crash patterns and perplexes crash analysis (8). Although
various approaches have been undertaken to investigate
¹Department of Civil and Environmental Engineering, Florida International University, Miami, FL
Corresponding Author:
Address correspondence to Xia Jin: xjin1@fiu.edu
heterogeneity associated with crash data, such as mixed
logit models or generalized structures (14–19), various
phenomena may arise when analyzing and organizing
data in high-dimensional spaces (often with hundreds or
thousands of dimensions). The major issue with high-
dimensional datasets is that data points become
increasingly sparse as the dimensionality increases (20),
traditional techniques begin to fail, and the quality of
results deteriorates (21). High-dimensional crash data
require more robust methods to fully discover hidden
patterns (22). In addition, recent efforts on heterogeneity
have mostly focused on pedestrian or passenger
vehicle crashes, and very few studies have investigated truck crash
heterogeneity. This paper, then, aims to explore an
advanced high-dimensional approach, block clustering,
to investigate heterogeneity in large datasets. Detailed
records of crashes involving large trucks that occurred in
the state of Florida between 2007 and 2016 have been
examined to identify truck crash patterns and significant
conditions contributing to the patterns.
The next section provides a brief overview of existing
studies that have investigated heterogeneity in crash data-
sets. The methodology used in this research is described
in the third section. The fourth section describes the data.
The fifth section analyzes the results and the final section
concludes the study.
Literature Review
There are two general approaches that have been taken
to investigate heterogeneity in crash data (8). Data con-
straining is a common approach that focuses on a very
specific segment of the crash dataset. For example,
Kitali et al. focused on multiple-vehicle crashes (23);
Hadi et al. and Ghasemzadeh et al. analyzed incidents in
work zones (24,25); other specific subjects like crashes on
rural or urban arterials (26,27) and crashes involving
volatile or older adult drivers (28,29) have also been
investigated.
Although data constraining is a useful approach in
analyzing a specific type of crash, it cannot be generalized
to other crash types, and therefore has limited applicabil-
ity. The second approach to discovering heterogeneity is
clustering. Different from classification problems in
which each observation is associated with a group and
the objective is to place a new observation in one of the
groups, cluster analysis seeks to discover the number and
composition of groups that best describes the characteris-
tics of the observations. Cluster analysis uses distance
measures over various dimensions in the dataset to dis-
cover clusters of similar objects (22,30). Crash data clus-
tering has been investigated mainly using two major
approaches, partitional and model-based clustering.
Partitional clustering algorithms optimize a particular
objective function to identify clustering patterns that are
present in the data and iteratively improve the quality of
the partitions. Partitional clustering is also called
prototype-based clustering because it requires parameters
to be used as prototype points that represent each cluster
(31). K-means clustering (32,33) is the most widely used
partitional clustering algorithm in crash data analysis. K
representative points are selected as the initial centroids
in the first step. Using a proximity measure, each point
in the data set is then assigned to the closest centroid.
The centroids for each cluster are then updated based on
the newly formed clusters. These two steps are repeated
iteratively until no changes in centroids are observed or
until an alternative convergence criterion is met.
Anderson used K-means clustering to profile road acci-
dent hotspots (34). Iranitalab et al. used K-means cluster-
ing for crash severity prediction (5). Mauro et al. applied
K-means clustering to examine patterns of vehicle crashes
before and after infrastructural interventions to improve
road safety (35). Zhang et al. utilized K-means clustering in
crash causality and severity analysis (36). K-medoids is
another partitional clustering algorithm which is more resi-
lient to outliers compared with K-means (34). The K-
medoids algorithm seeks to minimize specific objective
functions by finding clustering solutions. This method is
more robust in addressing noise and outliers in the data
because actual data points are chosen as the prototypes
(34). Nitsche et al. used K-medoids to investigate pre-crash
scenarios at road junctions (37). It is worth mentioning
that partitional methods are nondeterministic in nature
and need a user-predefined number of clusters to obtain a
solution (31).
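The two K-means steps described above (assignment to the closest centroid, then centroid update) can be illustrated with a minimal sketch. The function names, the toy points, and the choice of the first K points as initial centroids are assumptions made for reproducibility; they are not taken from the cited studies, which typically select initial centroids at random.

```python
def squared_distance(p, q):
    """Squared Euclidean distance, used here as the proximity measure."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    # Assumption: the first k points serve as initial centroids (deterministic).
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # convergence: centroids unchanged
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical two-dimensional points forming two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Note that the number of clusters K must still be supplied by the user, which is the limitation raised in the text.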
In the model-based clustering approach, the objectives
are assumed to match a specific model. The model, which
is often a statistical distribution, may be user-specified
and might change during the process (38,39). Latent
class clustering (LCC) is a probabilistic model-based clus-
tering and assumes a mixture of several probability densi-
ties within the data (40). Mohamed et al. used LCC in
injury severity analysis of pedestrian–vehicle crashes (41).
Iranitalab et al. used LCC for crash severity prediction
(5). Depaire et al. segmented traffic accidents by means of
LCC (8).
Despite the efficiency and simplicity of these clustering
methods, there are several limitations. First, all existing
studies have focused on segmenting the observations (i.e.,
crash records) and disregarded potential clusters among
the explanatory variables. This issue leads to the focus on
individual variable impacts rather than considering the
intertwining effects among a subset of factors that con-
tribute to different types of crashes. Although traditional
clustering methods could be applied to clustering for
both observations and variables separately, their very
high computational complexity makes these clustering
methods unsuitable for high-dimensional datasets (42).
In addition, the possible level of noise and the large
amount of uninformative data in a large crash dataset
call for a robust clustering method (43).
In summary, two approaches can be used to analyze
heterogeneous crash data: data constraining and cluster-
ing. Although data constraining is simple, it has limited
applicability and might be subjective. The clustering
approach provides unbiased results for segmenting a
dataset and can enhance the homogeneity significantly.
This paper aims to apply a robust clustering method
to overcome the limitations of traditional clustering
methods. Given these considerations, block clustering,
also referred to as coclustering or biclustering, holds the
promise of addressing heterogeneity in high-dimensional
datasets. The next section presents a detailed description
of the block clustering algorithm employed for this
study.
Methodology
Coclustering utilizes the duality of clustering and dis-
covers hidden latent patterns by generating a compact
representation of the dataset. The goal is to cluster the
sets of rows and columns simultaneously to obtain
homogeneous blocks (44). This method has attracted
much attention in recent years for text mining (clustering
of documents and words simultaneously) (45), bioinfor-
matics (clustering of genes and tissues simultaneously)
(43) and social network analysis (46).
Block clustering considers the two sets, observations
and variables, simultaneously and organizes the data
into homogeneous blocks. If $X$ denotes an $n \times d$ data
matrix defined by $X = \{(x_{ij});\ i \in I,\ j \in J\}$, where $I$ is a set
of $n$ objects (rows, observations, crashes) and $J$ is a set of
$d$ variables (columns, variables, attributes), the main
objective of this method is to make permutations or
rearrangements of observations and attributes to construct a
correspondence structure on $I \times J$. An important
advantage of block clustering is the transformation of the
initial data matrix $X$ into a simpler and smaller data matrix
with the same structure (42,44). Moreover, block
clustering is fast and requires far less computation than that
needed to process the two sets separately and
consecutively as in the well-known K-means algorithm (42).
Figure 1 illustrates an example of block clustering.
Array (a) in Figure 1 presents a binary dataset consisting
of $n = 10$ objects, $I = \{A, B, C, D, E, F, G, H, I, J\}$, and
a set of $d = 7$ binary variables, $J = \{1, 2, 3, 4, 5, 6, 7\}$. To
obtain a homogeneous dataset, array (a) can be
reorganized either by a partition on $I$ or by partitions on $I$ and
$J$ simultaneously. Array (b) consists of data reorganized
by a partition of $I$ into $g = 3$ clusters, $a = \{A, C, H\}$,
$b = \{B, F, J\}$ and $c = \{D, G, I, E\}$. Array (c) consists of
data reorganized by the same partition of $I$ and a partition
of $J$ into $m = 3$ clusters, $\mathrm{I} = \{1, 4\}$, $\mathrm{II} = \{3, 5, 7\}$ and
$\mathrm{III} = \{2, 6\}$. Compared with array (b), array (c) clearly
reveals an interesting pattern (42). The block clustering
approach takes advantage of partitioning on $I$ and $J$
simultaneously and results in a more homogeneous dataset
compared with traditional clustering models like K-means.
Another advantage of the block clustering method is that
it reduces the initial data matrix $X$ into a simpler
data matrix having the same structure. In the example, the
initial $(10 \times 7)$ binary data matrix is reduced to a
$(g \times m) = (3 \times 3)$ summary binary data matrix
(Figure 1d).
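The reorganization described for Figure 1 can be sketched in a few lines. The row and column partitions below are the ones stated in the text; the block centers in `pattern` and the single noisy cell are hypothetical, since the cell values of Figure 1 are not reproduced here. The summary matrix is obtained, as an assumption, by majority vote within each block.

```python
rows = list("ABCDEFGHIJ")
cols = [1, 2, 3, 4, 5, 6, 7]

# Partitions of I and J as given in the Figure 1 example.
row_clusters = {"a": ["A", "C", "H"], "b": ["B", "F", "J"], "c": ["D", "G", "I", "E"]}
col_clusters = {"I": [1, 4], "II": [3, 5, 7], "III": [2, 6]}

# Hypothetical block centers (NOT the values printed in Figure 1).
pattern = {
    ("a", "I"): 1, ("a", "II"): 0, ("a", "III"): 1,
    ("b", "I"): 0, ("b", "II"): 1, ("b", "III"): 0,
    ("c", "I"): 1, ("c", "II"): 1, ("c", "III"): 0,
}

row_of = {r: k for k, members in row_clusters.items() for r in members}
col_of = {c: l for l, members in col_clusters.items() for c in members}

# Array (a): data in the original row and column order.
data = {(r, c): pattern[(row_of[r], col_of[c])] for r in rows for c in cols}
data[("A", 1)] = 0  # one hypothetical noisy cell

# Array (c): rows and columns permuted so that cluster members are adjacent.
row_order = [r for members in row_clusters.values() for r in members]
col_order = [c for members in col_clusters.values() for c in members]
for r in row_order:
    print(r, " ".join(str(data[(r, c)]) for c in col_order))

# Array (d): (g x m) summary matrix, one majority value per block.
summary = {}
for k, r_members in row_clusters.items():
    for l, c_members in col_clusters.items():
        cells = [data[(r, c)] for r in r_members for c in c_members]
        summary[(k, l)] = 1 if 2 * sum(cells) >= len(cells) else 0
print(summary)
```

Despite the noisy cell, the 3 × 3 summary recovers the block centers, which is the compression from a (10 × 7) matrix to a (3 × 3) matrix that the example illustrates.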
Different approaches can be applied for coclustering
and these approaches can differ in the pattern they seek
and the types of data they apply to. Govaert and Nadif
proposed a general framework to formalize the hypoth-
eses of coclustering algorithms (42). They introduced a
latent block model to solve the coclustering problem and
overcome the defects of classical coclustering methods.
They suggested a block clustering framework which uti-
lizes parsimonious models and allows a rigorous simula-
tion. This section presents a block clustering approach
based on the work of Govaert and Nadif (42) and Bhatia
et al. (47).
Figure 1. Block clustering, showing (a) binary data set, (b) data
reorganized by a partition on I, (c) data reorganized by partitions
on I and J simultaneously, and (d) summary binary data.
Mixture Models
A fundamental assumption of model-based clustering is
that the data has originated from a mixture of underlying
probability distributions, where each component $k$ of the
mixture indicates a cluster. Therefore, the matrix dataset
$X = \{x_i;\ i \in (1, \ldots, n)\}$ is supposed to be independent and
identically distributed and arises from a probability
distribution with density (42,44):
$$f(x;\theta) = \prod_{i} \sum_{k} \pi_k f_k(x_i;\alpha) \tag{1}$$
where
$f_k$ denotes the density function for the $k$th component,
$\alpha$ is the corresponding class parameter,
$\pi_k$ is the probability that an observation belongs
to the $k$th component with $k = (1, \ldots, g)$ and for which $g$
is assumed to be known, and
$\theta$ is the vector of $(\pi_1, \ldots, \pi_g, \alpha)$. Govaert and Nadif (42)
showed that the density function can be rewritten as:
$$f(x;\theta) = \sum_{z \in Z} p(z)\, f(x \mid z;\alpha) \tag{2}$$
$$f(x \mid z;\alpha) = \prod_{i} f_{z_i}(x_i;\alpha) \tag{3}$$
$$p(z) = \prod_{i} \pi_{z_i} \tag{4}$$
where $Z$ stands for the set of all possible partitions of $I$ in
$g$ clusters. Therefore, according to this function, the data
matrix is supposed to be a sample of size 1 from a
random $(n, d)$ matrix.
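As an illustration of Equation 1, the sketch below evaluates the logarithm of the mixture density for a small binary matrix. The component density is assumed here to be a product of independent Bernoulli terms, and the data, proportions, and parameters are hypothetical values chosen only for the example.

```python
import math


def bernoulli_component(x_i, alpha_k):
    """f_k(x_i; alpha): assumed product of independent Bernoulli terms."""
    prob = 1.0
    for value, a in zip(x_i, alpha_k):
        prob *= a if value == 1 else 1.0 - a
    return prob


def mixture_log_density(x, pi, alpha):
    """log f(x; theta) = sum_i log sum_k pi_k f_k(x_i; alpha), the log of Equation 1."""
    total = 0.0
    for x_i in x:
        total += math.log(
            sum(p * bernoulli_component(x_i, a_k) for p, a_k in zip(pi, alpha))
        )
    return total


# Hypothetical data: 3 observations, 3 binary variables, g = 2 components.
x = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]
pi = [0.6, 0.4]
alpha = [[0.9, 0.8, 0.1], [0.1, 0.2, 0.9]]
print(mixture_log_density(x, pi, alpha))
```

Working on the log scale turns the product over observations into a sum, which avoids numerical underflow when the number of observations is large.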
Latent Block Model
The set $I$ can be partitioned into $g$ clusters by $z = (z_{11}, \ldots, z_{ng})$
with $z_{ik} = 1$ if $i$ is a part of cluster $k$ and $z_{ik} = 0$
otherwise, $z_i = k$ if $z_{ik} = 1$, and $z_{.k} = \sum_i z_{ik}$ is the
cardinality (the number of elements in a set) of row cluster $k$.
Likewise, $J$ can be divided into $m$ clusters with $w = (w_{11}, \ldots, w_{dm})$
with $w_{jl} = 1$ if $j$ fits into cluster $l$ and $w_{jl} = 0$
otherwise, $w_j = l$ if $w_{jl} = 1$, and $w_{.l} = \sum_j w_{jl}$ is the
cardinality of column cluster $l$.
To investigate block clustering, Govaert and Nadif
extended the mixture model density function and
assumed that the labeling of Iand Jare independent of
each other (42). The obtained latent block mixture model
can be defined by the following probability density func-
tion (PDF):
$$f(x;\theta) = \sum_{(z,w) \in Z \times W} \prod_{i,j} \pi_{z_i}\, \rho_{w_j}\, f_{z_i w_j}(x_{ij};\alpha) \tag{5}$$
where
$Z$ and $W$ denote the sets of all possible labelings $z$ of
$I$ and $w$ of $J$, respectively,
$f_{z_i w_j}(x;\alpha)$ is the PDF defined on the real set $\mathbb{R}$, and
$\theta = (\pi, \rho, \alpha)$, with $\pi = (\pi_1, \ldots, \pi_g)$ and $\rho = (\rho_1, \ldots, \rho_m)$
the vectors of probabilities $\pi_k$ and $\rho_l$ that a row
and a column belong to the $k$th row component and to
the $l$th column component, respectively.
According to the above formulation, the
randomized data generation method can be described as
follows (42,44):
Row labeling: Generate the labeling $z = (z_1, \ldots, z_n)$
according to the distribution $\pi = (\pi_1, \ldots, \pi_g)$.
Column labeling: Generate the labeling $w = (w_1, \ldots, w_d)$
according to the distribution $\rho = (\rho_1, \ldots, \rho_m)$.
Data generation: Generate for $i = (1, \ldots, n)$ and $j = (1, \ldots, d)$
a value $x_{ij}$ according to the density distribution
$f_{z_i w_j}(\cdot;\alpha)$.
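The three generation steps above can be sketched as a small simulation. This is a minimal illustration that assumes Bernoulli block densities, with `alpha[k][l]` taken as the hypothetical probability of observing a 1 in block (k, l); the function name and all numeric values are assumptions for the example.

```python
import random


def sample_latent_block(n, d, pi, rho, alpha, seed=0):
    """Draw row labels, column labels, and binary data from a latent block model."""
    rng = random.Random(seed)
    # Row labeling: z_i drawn according to pi = (pi_1, ..., pi_g).
    z = rng.choices(range(len(pi)), weights=pi, k=n)
    # Column labeling: w_j drawn according to rho = (rho_1, ..., rho_m).
    w = rng.choices(range(len(rho)), weights=rho, k=d)
    # Data generation: x_ij drawn from the (assumed Bernoulli) density of block (z_i, w_j).
    x = [
        [1 if rng.random() < alpha[z[i]][w[j]] else 0 for j in range(d)]
        for i in range(n)
    ]
    return z, w, x


# Hypothetical parameters: g = 2 row clusters, m = 3 column clusters.
pi = [0.5, 0.5]
rho = [0.3, 0.3, 0.4]
alpha = [[0.9, 0.1, 0.8], [0.2, 0.95, 0.1]]
z, w, x = sample_latent_block(n=8, d=6, pi=pi, rho=rho, alpha=alpha)
for label, row in zip(z, x):
    print(label, row)
```

Because the row and column labels are drawn independently, the sketch mirrors the independence assumption between the labelings of $I$ and $J$ stated for the latent block model.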
Model Parameter Estimation
EM-based algorithms (42,44,47) can be used to
approximate model parameters by maximizing observed data
log-likelihood. The complete data log-likelihood can be
defined by the following function:
$$L_C(z, w, \theta) = \sum_{k} z_{.k} \log \pi_k + \sum_{l} w_{.l} \log \rho_l + \sum_{i,j,k,l} z_{ik}\, w_{jl} \log f_{kl}(x_{ij};\alpha) \tag{6}$$
In this method, the conditional expectation $Q(\theta, \theta^{(c)})$ of
the complete data log-likelihood is maximized given the
current estimate $\theta^{(c)}$ and $x$ to iteratively
maximize the log-likelihood:
$$Q(\theta, \theta^{(c)}) = \sum_{i,k} t_{ik}^{(c)} \log \pi_k + \sum_{j,l} r_{jl}^{(c)} \log \rho_l + \sum_{i,j,k,l} e_{ikjl}^{(c)} \log f_{kl}(x_{ij};\alpha) \tag{7}$$
where
$t_{ik}^{(c)} = p(z_{ik} = 1 \mid x, \theta^{(c)})$,
$r_{jl}^{(c)} = p(w_{jl} = 1 \mid x, \theta^{(c)})$, and
$e_{ikjl}^{(c)} = p(z_{ik} w_{jl} = 1 \mid x, \theta^{(c)})$.
Because of the dependence structure in the model,
Govaert and Nadif (42) proposed an approximate solution
using the interpretation of the EM algorithm by Hathaway
(48) and Neal and Hinton (49). Therefore the fuzzy
clustering criterion for the latent block model can be defined as
follows, in which $L_C$ is the fuzzy complete data
log-likelihood associated with the block latent model:
$$\tilde{F}_C(t, r; \theta) = L_C(t, r; \theta) + H(t) + H(r) \tag{8}$$
where
$$\begin{cases}
H(t) = -\sum_{i,k} t_{ik} \log t_{ik} \\
H(r) = -\sum_{j,l} r_{jl} \log r_{jl} \\
L_C(t, r; \theta) = \sum_{k} t_{.k} \log \pi_k + \sum_{l} r_{.l} \log \rho_l + \sum_{i,j,k,l} t_{ik}\, r_{jl} \log f_{kl}(x_{ij};\alpha)
\end{cases} \tag{9}$$
Algorithms
Govaert and Nadif (42) proposed a block expectation
maximization (BEM) algorithm to maximize the fuzzy
clustering criterion using the following steps.
E-Step: The conditional row and column class
probabilities are computed respectively as
$$t_{ik} = \log \pi_k + \sum_{j,l} r_{jl} \log f_{kl}(x_{ij};\alpha) \tag{10}$$
$$r_{jl} = \log \rho_l + \sum_{i,k} t_{ik} \log f_{kl}(x_{ij};\alpha) \tag{11}$$
M-Step: The row proportions $\pi$, column
proportions $\rho$, and the model parameter $\alpha$ are
calculated by maximizing $\sum_k t_{.k} \log \pi_k$, $\sum_l r_{.l} \log \rho_l$ and
$\sum_{i,j,k,l} t_{ik}\, r_{jl} \log f_{kl}(x_{ij};\alpha)$, which are the first, second,
and last terms in $L_C$, respectively. The estimation of $\alpha$
depends on the PDF $f_{kl}$, which will be discussed later
for binary data.
Therefore, the BEM algorithm suggested by Govaert
and Nadif (42) to maximize the fuzzy clustering criterion
can be described as:
1. Initialize $t^{(0)}$, $r^{(0)}$ and $\theta^{(0)} = (\pi^{(0)}, \rho^{(0)}, \alpha^{(0)})$.
2. Compute $t^{(c+1)}$, $\pi^{(c+1)}$, $\alpha^{(c+1/2)}$ by using the EM
algorithm for the data matrix $u_{il} = \sum_j r_{jl} x_{ij}$ and
starting from $\pi^{(c)}$, $r^{(c)}$, $\alpha^{(c)}$.
3. Compute $r^{(c+1)}$, $\rho^{(c+1)}$, $\alpha^{(c+1)}$ by using the EM
algorithm for the data matrix $v_{jk} = \sum_i t_{ik} x_{ij}$ and
starting from $r^{(c)}$, $\rho^{(c)}$, $\alpha^{(c+1/2)}$.
4. Iterate steps (2) and (3) until convergence.
Block Mixture Models for Binary Datasets
This section summarizes the methodology and describes
the final clustering model used based on the blockcluster
R Package (47). The crash dataset used in this study
included categorical and binary variables. Categorical
variables were converted to dummy variables and the fol-
lowing block mixture model was used to solve the binary
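block clustering problem. Govaert and Nadif (42)
discussed how the Bernoulli probability distribution
function, which is needed to find model parameter $\alpha$,
can be described as:
$$f_{kl}(x_{ij};\alpha) = (\varepsilon_{kl})^{|x_{ij} - a_{kl}|}\, (1 - \varepsilon_{kl})^{1 - |x_{ij} - a_{kl}|} \tag{12}$$
$$\begin{cases}
a_{kl} = 0,\ \varepsilon_{kl} = p_{kl} & \text{if } p_{kl} < 0.5 \\
a_{kl} = 1,\ \varepsilon_{kl} = 1 - p_{kl} & \text{if } p_{kl} > 0.5
\end{cases} \tag{13}$$
where
$p = (p_{kl})$ is the matrix of Bernoulli parameters of the binary data set, with $p_{kl} \in [0, 1]$, and
$a_{kl}$ and $\varepsilon_{kl}$ characterize the center and dispersion of the
block $(k, l)$, respectively. $a_{kl}$ represents the most frequent
binary value and $\varepsilon_{kl}$ gives the probability of having a
different value than the center for each block.
Based on this Bernoulli probability distribution
function, both E and M steps can be redefined.
E-Step: The conditional row and column class
probabilities can be found by:
$$t_{ik} = \log \pi_k + \sum_{l} \left| u_{il} - r_{.l}\, a_{kl} \right| \log \frac{\varepsilon_{kl}}{1 - \varepsilon_{kl}} + \sum_{l} r_{.l} \log(1 - \varepsilon_{kl}) \tag{14}$$
$$r_{jl} = \log \rho_l + \sum_{k} \left| v_{jk} - t_{.k}\, a_{kl} \right| \log \frac{\varepsilon_{kl}}{1 - \varepsilon_{kl}} + \sum_{k} t_{.k} \log(1 - \varepsilon_{kl}) \tag{15}$$
$$u_{il} = \sum_{j} r_{jl}\, x_{ij} \tag{16}$$
$$v_{jk} = \sum_{i} t_{ik}\, x_{ij} \tag{17}$$
M-Step: The model parameter $\alpha$ is calculated as:
$$\begin{cases}
a_{kl} = 0,\ \varepsilon_{kl} = \dfrac{y_{kl}}{t_{.k}\, r_{.l}} & \text{if } \dfrac{y_{kl}}{t_{.k}\, r_{.l}} < 0.5 \\
a_{kl} = 1,\ \varepsilon_{kl} = 1 - \dfrac{y_{kl}}{t_{.k}\, r_{.l}} & \text{otherwise}
\end{cases} \tag{18}$$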
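The binary E and M steps (Equations 14 to 18) can be illustrated with a compact pure-Python sketch. It is not the blockcluster implementation: it alternates a single row update and a single column update per iteration, it assumes $y_{kl} = \sum_{i,j} t_{ik} r_{jl} x_{ij}$ (this quantity is not defined in the text), it normalizes the log-scores of Equations 14 and 15 with a softmax to obtain probabilities, and it clamps $\varepsilon_{kl}$ away from zero. The toy matrix and the initial labels are hypothetical.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(mat):
    return [list(col) for col in zip(*mat)]


def m_step(x, t, r):
    """Proportions plus block centers a_kl and dispersions eps_kl (Equation 18)."""
    n, d, g, m = len(x), len(x[0]), len(t[0]), len(r[0])
    t_sum = [sum(t[i][k] for i in range(n)) for k in range(g)]  # t_.k
    r_sum = [sum(r[j][l] for j in range(d)) for l in range(m)]  # r_.l
    pi = [max(s / n, 1e-10) for s in t_sum]
    rho = [max(s / d, 1e-10) for s in r_sum]
    a = [[0] * m for _ in range(g)]
    eps = [[0.5] * m for _ in range(g)]
    for k in range(g):
        for l in range(m):
            # Assumed definition: y_kl = sum_ij t_ik r_jl x_ij
            y = sum(t[i][k] * r[j][l] * x[i][j] for i in range(n) for j in range(d))
            p = y / max(t_sum[k] * r_sum[l], 1e-10)
            a[k][l] = 0 if p < 0.5 else 1
            # Clamp away from 0 so the logarithms in the E-step stay finite.
            eps[k][l] = min(max(p if p < 0.5 else 1.0 - p, 1e-6), 0.5)
    return pi, rho, a, eps, t_sum, r_sum


def e_step(x, other, prop, a, eps, other_sum):
    """Equations 14 and 16 for rows; called on transposed inputs for Equations 15 and 17."""
    result = []
    for line in x:
        u = [sum(other[j][l] * line[j] for j in range(len(line)))
             for l in range(len(other_sum))]
        scores = []
        for k in range(len(prop)):
            s = math.log(prop[k])
            for l in range(len(other_sum)):
                s += abs(u[l] - other_sum[l] * a[k][l]) * math.log(eps[k][l] / (1 - eps[k][l]))
                s += other_sum[l] * math.log(1 - eps[k][l])
            scores.append(s)
        result.append(softmax(scores))  # normalization is an added assumption
    return result


def bem_binary(x, row_init, col_init, g, m, n_iter=20):
    t = [[1.0 if lab == k else 0.0 for k in range(g)] for lab in row_init]
    r = [[1.0 if lab == l else 0.0 for l in range(m)] for lab in col_init]
    for _ in range(n_iter):
        pi, rho, a, eps, t_sum, r_sum = m_step(x, t, r)
        t = e_step(x, r, pi, a, eps, r_sum)  # row update
        pi, rho, a, eps, t_sum, r_sum = m_step(x, t, r)
        r = e_step(transpose(x), t, rho, transpose(a), transpose(eps), t_sum)  # column update
    pi, rho, a, eps, _, _ = m_step(x, t, r)
    z = [ti.index(max(ti)) for ti in t]
    w = [rj.index(max(rj)) for rj in r]
    return z, w, a, eps


# Hypothetical 6 x 4 binary matrix with two row blocks and two column blocks;
# the third row is deliberately mislabeled in the initialization.
x = [[1, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 3
z, w, a, eps = bem_binary(x, row_init=[0, 0, 1, 1, 1, 1], col_init=[0, 0, 1, 1], g=2, m=2)
print(z, w, a)
```

In this toy case the mislabeled row is reassigned after the first row update, and the estimated centers $a_{kl}$ reproduce the two-by-two block pattern of the data.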
Data Description
The data used in this study were extracted from the
Florida statewide crash database through the Signal
Four Analytics portal (50). The data were coded from
police crash reports including driver, vehicle, crash, and
citation information. Each crash involved at least one
large truck. Roadway network information was also
integrated in the database. Irrelevant information was
removed. The final dataset contains more than 200 attri-
butes. Categorical variables were recoded into dummy
variables as the applied methodology required binary
inputs.
The database recorded around 200 variables, describ-
ing the characteristics of the drivers, vehicles, crash
events, roadway geometry, lighting, and environment
conditions. The total sample involved 220,932 crashes
that occurred between 2007 and 2016, involving 228,180
large trucks, 180,702 non-truck motor vehicles, 1,902
fatalities, and 58,976 injuries.
Roadway problems were present in 1.9% of the two-
vehicle cases, and adverse weather and light conditions
were present in approximately 8.2% and 20.3% of the
crashes, respectively. Interruption in the traffic flow (pre-
vious crash, work zone, peak hour congestion, etc.) was
coded in almost 2.3% of the two-vehicle crashes.
About 74% of the crashes occurred on local roads,
state highways, interstate, or county roads. In 80% of the
accidents, crash severity was reported as property dam-
age only, but injury and fatality were coded in 19% and
1% of accidents, respectively. Hit-and-run and school
bus–related crashes were reported 6,538 and 1,056 times
respectively.
Table 1 below presents crash type by severity level.
Results Analysis
The very first step for block clustering is finding the opti-
mum number of clusters for rows and columns.
Biernacki et al. (51) suggested integrated completed like-
lihood (ICL), a criterion which can effectively maximize
the complete data likelihood, and has proven to be more
robust than the Bayesian information criterion (BIC) for
mixture models. For a detailed discussion of the ICL cri-
terion, readers are referred to Biernacki et al. (51) and
Bertoletti et al. (52).
A variety of combinations of row number (1 through
10) and column number (1 through 10) were tried, to find
the optimum number of blocks. ICL and pseudo-
likelihood values for each model were evaluated. The
optimum number of blocks was found to be 30 with 3
rows (K) and 10 columns (L), as neither ICL nor pseudo-
likelihood improved much when the numbers further
increased. The ICL and pseudo-likelihood values for all
models can be found in Figures 2 and 3. The results show
that ICL and pseudo-likelihood values for the optimum
number of blocks improved 20.8% and 21.1%, respec-
tively, compared with the initial dataset.
For the model with $K = (1, 2, 3)$ and $L = (1, \ldots, 10)$
clusters, the row proportions $\pi$ and column proportions
$\rho$ are shown in Table 2. The first row-cluster, $K = 1$,
covers 31.8% of all observations (crashes), and $K = 2$ and
$K = 3$ contain 39.9% and 28.3% of accidents,
respectively. The first column cluster, $L = 1$, consists of 10.3%
of all variables included in the dataset.
Column Clusters (Attribute Clustering)
Detailed results for variable clustering can be found in
Table 3. The results show distinct characteristics for each
cluster. It can be seen that driver age, crash location,
vehicle condition, weather condition, roadway type,
Table 1. Crash Type by Severity for Large Truck Involved Crashes
Crash type | Property damage only: crashes | Property damage only: % | Injury: crashes | Injury: % | Fatality: crashes | Fatality: % | Total
1. Bicycle | 124 | 17.8% | 510 | 73.4% | 61 | 8.8% | 695
2. Head-on | 1,875 | 57.3% | 1,172 | 35.8% | 223 | 6.8% | 3,270
3. Left entering | 3,352 | 62.0% | 1,972 | 36.5% | 86 | 1.6% | 5,410
4. Left leaving | 1,368 | 56.4% | 978 | 40.3% | 79 | 3.3% | 2,425
5. Left rear | 1,954 | 67.3% | 930 | 32.0% | 20 | 0.7% | 2,904
6. Off-roadway | 18,864 | 87.3% | 2,639 | 12.2% | 95 | 0.4% | 21,598
7. Opposing sideswipe | 2,502 | 82.8% | 500 | 16.5% | 21 | 0.7% | 3,023
8. Other | 17,954 | 81.6% | 3,918 | 17.8% | 128 | 0.6% | 22,000
9. Pedestrian | 148 | 12.7% | 816 | 70.2% | 199 | 17.1% | 1,163
10. Rear-end | 34,074 | 68.6% | 15,240 | 30.7% | 380 | 0.8% | 49,694
11. Right angle | 4,887 | 60.7% | 3,004 | 37.3% | 160 | 2.0% | 8,051
12. Right/left | 418 | 89.3% | 50 | 10.7% | 0 | 0.0% | 468
13. Right/through | 3,667 | 79.5% | 928 | 20.1% | 19 | 0.4% | 4,614
14. Right/U-turn | 29 | 85.3% | 5 | 14.7% | 0 | 0.0% | 34
15. Rollover | 1,628 | 49.2% | 1,643 | 49.7% | 38 | 1.1% | 3,309
16. Same-direction sideswipe | 33,799 | 88.4% | 4,393 | 11.5% | 28 | 0.1% | 38,220
17. Unknown | 3,970 | 81.8% | 856 | 17.6% | 27 | 0.6% | 4,853
18. Single-vehicle | 7,609 | 87.6% | 993 | 11.4% | 83 | 1.0% | 8,685
19. Parked-vehicle | 26,076 | 95.6% | 1,127 | 4.1% | 73 | 0.3% | 27,276
20. Backed into | 11,560 | 92.5% | 926 | 7.4% | 14 | 0.1% | 12,500
21. Animal | 679 | 91.8% | 59 | 8.0% | 2 | 0.3% | 740
Total | 176,537 | 79.9% | 42,659 | 19.3% | 1,736 | 0.8% | 220,932
vehicle maneuver, and driver action were involved in
defining the clusters. It may not always be obvious which
are the most significant features for each cluster, but it
presents a helpful way to identify potential associations
between the variables. A cluster of columns is a subset of
columns that exhibit similar behavior across the rows
(crashes) (43). Column clustering identifies coexistence
between variables and implies that all attributes within a
cluster will either occur together or not occur for a spe-
cific group of crashes (row-cluster).
For example, column cluster 2 represents collisions
with non-fixed objects on the roadway; column cluster 4
mostly contains crashes in parking lots; column cluster 5
mostly involves weekend crashes while driving above the
speed limit; cluster 7 may involve distracted drivers, and
cluster 10 involves female drivers and those changing
lanes. The degree of occurrence depends on the $\varepsilon_{kl}$ value,
which will be discussed in the block clustering section.
Row Clusters (Crash Clustering)
A cluster of rows is a subset of rows that exhibit similar
behavior across the columns (attributes) (43). The model
identified three distinct row clusters. To further investi-
gate the clusters, several variables, such as crash type,
crash severity, crash time, manner of collision, most
Figure 2. Integrated completed likelihood values by number of blocks.
Figure 3. Pseudo-likelihood values by number of blocks.
Table 2. Row and Column Proportions for Block Cluster Model
L | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
ρ (%) | 10.3 | 3.4 | 5.2 | 6.9 | 8.6 | 6.9 | 19.0 | 19.0 | 12.1 | 8.6
K | 1 | 2 | 3 | na | na | na | na | na | na | na
π (%) | 31.8 | 39.9 | 28.3 | na | na | na | na | na | na | na
Note: na = not applicable.
harmful events, and so forth, were evaluated to identify
the latent patterns. Among all tested variables, the result
revealed significant patterns only in conjunction with
crash type. Table 4 below shows row clusters by crash
type. Z-tests were conducted to examine the significance
of the differences among the clusters.
Results show that the first row-cluster (K= 1) mostly
contains rear-end and same-direction sideswipe crashes.
These two types of crashes are very similar to each other
in the sense that the involved vehicles are traveling in the
same direction. For the second row-cluster, K= 2, the
most dominant crashes are angle, head-on, and opposing
sideswipe crashes, which are again very similar to each
other as the vehicles involved are traveling in opposing
directions. Lastly, for K= 3, the most prevalent acci-
dents are park/off-roadway and single-vehicle crashes.
This cluster mostly includes crashes like rollover, collid-
ing with animal, pedestrian, bicyclist, fixed objects, or
parked vehicles. The results suggest that this crash data-
set can be generally categorized as same-direction
crashes, multi-direction crashes, and single-vehicle
crashes.
The above analysis of row clusters and column clus-
ters indicates that this clustering approach is able to iden-
tify relatively homogeneous groups within the dataset
that are meaningful and reliable (with robust statistical
foundations).
Block Clusters (Both Attribute and Crash Clustering)
To further investigate which groups of attributes are
more likely to be associated with which groups of
Table 3. Block Clustering Model Result: Column Clusters
L | Variables
1 | Driver age between 36 and 50 years old; Crash at intersection; Driver action, aggressive/careless maneuver; Vehicle year before 2000; Weather condition not clear; Vehicle maneuver action, straight ahead
2 | First harmful event, collision non-fixed object; First harmful event location, on roadway
3 | Vehicle at-fault, passenger car; Total lane 4 or more; Traffic way, two-way divided
4 | First harmful event, collision with fixed object; First harmful event location, off-roadway; Road system identifier, parking lot; Vehicle maneuver action, backing
5 | Above posted speed; Crash time, weekend; Driver action, other contributing action; Road surface condition, not dry; Vehicle maneuver action, other
6 | Crash within city limits; Estimated speed [0,25] mph; Unpaved or curb shoulder; Vehicle at-fault, medium/heavy trucks
7 | Estimated speed [76,100] mph; Vision obstructed; Traffic way, one-way; Driver age more than 66 years old; Driver action, illegal maneuver; Driver distracted; Road system identifier, U.S. Highway; Vehicle maneuver action stopped or slowing in traffic; Vehicle with defect; Vehicle at-fault, body type other; Vehicle at-fault, pickup truck
8 | Driver age between 16 and 20 years old; Driver condition at time of crash, not normal; Driving under the influence; First harmful event, non-collision; Road system identifier, local road; Road system identifier, forest or private road; Roadway alignment, curve; Roadway grade, not level; Vehicle at-fault, bus; Vehicle at-fault, light trucks; Vehicle at-fault, utility vehicle
9 | Driver age between 21 and 35 years old; Driver age between 51 and 65 years old; Driver action, improper maneuver; Driver action, no contributing action; Light condition, not daylight; Traffic way, two-way not divided; Vehicle maneuver action, turn
10 | Estimated speed [26,50] mph; Estimated speed [51,75] mph; At-fault driver gender, female; Road system identifier, interstate; Vehicle maneuver action change lane
Table 4. Crash Type by Row-Cluster
K | Rear-end | Same-direction sideswipe | Park/off-roadway | Single-vehicle | Angle | Head-on | Opposing sideswipe
1 | 44.2% | 56.6% | 13.1% | 22.3% | 23.3% | 30.7% | 24.1%
2 | 52.7% | 40.3% | 28.3% | 28.8% | 71.1% | 55.9% | 62.0%
3 | 3.1% | 3.2% | 58.7% | 48.9% | 5.6% | 13.4% | 13.9%
Total | 100% | 100% | 100% | 100% | 100% | 100% | 100%
crashes, this subsection focuses on the individual blocks,
defined by both row and column clustering. A block
cluster defines a subset of rows (crashes) that exhibit sim-
ilar behavior across a subset of columns (attributes), and
vice versa (43).
Figure 4 depicts the original and the clustered data with K = (1, 2, 3) and L = (1, ..., 10). Block clustering segmented the dataset well. The figure shows 30 blocks (3 rows by 10 columns), with green lines marking the block boundaries. As mentioned earlier, each block has two parameters: $a_{kl}$, the center of the block (its most frequent binary value), and $e_{kl}$, the dispersion (the probability that a cell differs from the center). Therefore, $e_{kl}$ indicates how homogeneous each block is. Table 5 shows the $a_{kl}$ and $e_{kl}$ values. For instance, Figure 4 shows that the block with K = 1 and L = 2 consists mostly of 1s (white squares) rather than 0s (black squares); correspondingly, Table 5 reports for this block (K = 1, L = 2) a center of True (that is, 1) and a dispersion of 4.8%, which means that 95.2% of the cells in this block have value 1.
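To make the two block parameters concrete, the following minimal Python sketch computes an empirical center and dispersion for every block of a small binary matrix, given row-cluster and column-cluster labels. It is an illustration only: the function name `block_parameters`, the toy matrix, and the labels are hypothetical, and the direct counting used here is an assumption that stands in for the model-based estimation used in the study; it does not reproduce that estimation.

```python
def block_parameters(matrix, row_labels, col_labels):
    """Return {(k, l): (center, dispersion)} for a binary matrix.

    center     -- most frequent binary value in block (k, l), as a bool
    dispersion -- share of cells in the block that differ from the center
    """
    counts = {}  # (k, l) -> [number of ones, number of cells]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (row_labels[i], col_labels[j])
            ones_and_total = counts.setdefault(key, [0, 0])
            ones_and_total[0] += value
            ones_and_total[1] += 1

    params = {}
    for key, (ones, total) in counts.items():
        share_of_ones = ones / total
        center = share_of_ones >= 0.5  # ties are resolved toward True (assumption)
        dispersion = min(share_of_ones, 1 - share_of_ones)
        params[key] = (center, dispersion)
    return params


# Hypothetical toy data: 3 crashes (rows) by 4 binary attributes (columns).
toy = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
]
rows = [1, 1, 2]     # row-cluster label K of each crash
cols = [1, 1, 2, 2]  # column-cluster label L of each attribute

params = block_parameters(toy, rows, cols)
for block in sorted(params):
    center, dispersion = params[block]
    print(block, center, f"{dispersion:.1%}")
```

In this toy case, block (1, 2) holds three 0s and one 1, so its center is False and its dispersion is 25%, read the same way as the entries of Table 5.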
To better understand the set of conditions contributing to each type of crash, the investigation focuses on blocks that have an acceptably low $e_{kl}$ and, therefore, a dominant $a_{kl}$, as highlighted in Table 5. The idea is to identify the subset of attributes relevant to each of the three subgroups of crashes. For instance, for same-direction crashes (K = 1), four blocks showed a high degree of homogeneity ($1 - e_{kl}$): the blocks with (K, L) = {(1, 2), (1, 4), (1, 7), (1, 8)} were each more than 90% homogeneous. The associated attributes in these blocks (columns 2, 4, 7, and 8) can be obtained from the column clustering result (Table 4) to describe this type of crash. The block centers show that same-direction crashes are usually associated with the attributes in column 2 (True), but not with the attributes in columns 4, 7, and 8 (False). It should be noted that the average degree of homogeneity for the selected blocks is 95.72%, which supports the robustness of the model.
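The block selection described above can be sketched in a few lines of Python. The dispersion values are copied from Table 5; the 90% cut-off is stated in the text only for the K = 1 blocks, so applying it to all three row clusters is an assumption, and the names `E_KL`, `THRESHOLD`, and `select_blocks` are hypothetical. Under that assumption the sketch selects 13 blocks whose mean homogeneity matches the reported 95.72%.

```python
# Dispersion e_kl in percent, copied from Table 5.
# Keys are row clusters K = 1..3; list positions are column clusters L = 1..10.
E_KL = {
    1: [33.2, 4.8, 47.3, 1.0, 10.9, 32.8, 6.5, 4.3, 19.1, 31.3],
    2: [36.4, 0.0, 16.4, 1.9, 10.8, 35.1, 7.3, 1.3, 22.6, 8.1],
    3: [27.9, 44.2, 6.7, 43.0, 15.4, 32.6, 5.3, 2.9, 25.6, 5.6],
}
THRESHOLD = 90.0  # assumed homogeneity cut-off in percent, applied to every K


def select_blocks(e_kl, threshold):
    """Return {(k, l): homogeneity} for blocks with homogeneity above threshold."""
    selected = {}
    for k, row in e_kl.items():
        for l, dispersion in enumerate(row, start=1):
            homogeneity = 100.0 - dispersion  # degree of homogeneity, 1 - e_kl
            if homogeneity > threshold:
                selected[(k, l)] = homogeneity
    return selected


selected = select_blocks(E_KL, THRESHOLD)
average = sum(selected.values()) / len(selected)

print(sorted(selected))   # block indices (K, L) that pass the cut-off
print(round(average, 2))  # mean homogeneity of the selected blocks, in percent
```

For K = 1 the selected column clusters are 2, 4, 7, and 8, the same four blocks named in the text.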
Figure 4. Original and clustered dataset using block clustering approach.

Table 5. Block Clustering Model Result

$a_{kl}$ value for each block

| K/L | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | False | True | True | False | False | False | False | False | False | False |
| 2 | False | True | False | False | False | True | False | False | False | False |
| 3 | False | False | False | False | False | True | False | False | False | False |

$e_{kl}$ value for each block

| K/L | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 33.2% | 4.8% | 47.3% | 1.0% | 10.9% | 32.8% | 6.5% | 4.3% | 19.1% | 31.3% |
| 2 | 36.4% | 0.0% | 16.4% | 1.9% | 10.8% | 35.1% | 7.3% | 1.3% | 22.6% | 8.1% |
| 3 | 27.9% | 44.2% | 6.7% | 43.0% | 15.4% | 32.6% | 5.3% | 2.9% | 25.6% | 5.6% |

The results indicate that for same-direction crashes, which include rear-end and same-direction sideswipe crashes, the first harmful event was most likely reported as a collision with a non-fixed object and happened on the roadway. These crashes were not likely to take place on one-way streets, in parking lots, on U.S. highways, or on local roads, and they were not likely to be caused by vision obstruction, backing, a stopped vehicle, or slowing in traffic. Trucks carrying hazardous materials were more likely to be involved in same-direction crashes, and work zones appeared to see more same-direction crashes. Same-direction crashes were also the most dangerous, usually resulting in more than one fatality or more than two injuries. Female drivers were more likely to be involved in same-direction crashes than in other crash types, although they were still involved in fewer crashes than male drivers.
For opposing-direction crashes, which include angle, head-on, and opposing sideswipe crashes, female and senior drivers were rarely cited as at fault, and the estimated speed was not likely to be above 25 mph. As with same-direction crashes, the first harmful event was most likely reported as a collision with a non-fixed object and happened on the roadway. These crashes were not likely to take place on one-way streets, in parking lots, on U.S. highways, or on local roads, and they were not likely to be caused by vision obstruction, backing, a stopped vehicle, or slowing in traffic. Pedestrians, bicycles, and mopeds were most commonly involved in opposing-direction crashes and least frequently involved in same-direction crashes. Installing raised medians, which prevent opposing-direction crashes, could therefore substantially reduce non-motorist crashes. School bus–related crashes were more likely to be opposing-direction crashes. It therefore seems beneficial to inform school bus drivers about the high risk of this crash type and to train them specifically to avoid angle, head-on, and opposing sideswipe crashes.
Finally, single-vehicle crashes and those involving parked vehicles or off-road events (including rollovers and collisions with animals, pedestrians, bicyclists, or fixed objects) were not likely to take place on two-way divided roadways with more than four lanes, and the estimated speed was not likely to be above 25 mph. By definition, these single-vehicle crashes involved trucks only. In these crashes, restraint systems (shoulder or lap belts) were more likely to go unused by motorists; educating truck drivers on the benefits of restraint systems could therefore help improve safety. A majority of the drivers held a driver’s license issued outside Florida. This indicates a need to notify or educate non-resident truck drivers who are unfamiliar with Florida roads about the high risk of rollovers and of collisions with animals, pedestrians, bicyclists, fixed objects, or parked vehicles.
Interestingly, some variables were common to all three types of crashes, which implies that these attributes are general to large truck crashes. Drivers were rarely found to be distracted, driving above 76 mph, or driving under the influence (DUI). Illegal maneuvers, vehicle defects, and driver vision obstruction were not significant causes of large truck crashes. U.S. highways were found to be the safest roadways for large trucks. Moreover, several types of variables were not significant in any of the clusters, implying that they are not contributing factors in large truck crashes; these include driver age, vehicle age, weather condition, and type of shoulder.
The findings of this study, obtained with clustering methods, are very similar to those of another study of heavy truck crashes in Florida (53). In that study, the dataset was initially segmented into seven categories (pedestrian, run-off-road/single-vehicle, same-direction, opposite-direction, change-traffic-way/turning, intersecting paths, and other) without using a clustering method. Its results showed that same-direction and opposite-direction crashes had distinct patterns, whereas the other five categories revealed a similar pattern and were not significantly different. This agreement supports the findings of the present study and implies that the proposed block clustering method is able to produce reliable and meaningful results.
Conclusion
This study presents an effort to apply an advanced high-dimensional clustering approach to large truck crash analysis. A block clustering method was applied to more than 220,000 crash records with nearly 200 attributes. The analysis showed promising results in segmenting the large heterogeneous dataset into meaningful subgroups that provide additional insights for crash analysis.
Attribute clustering showed distinct characteristics for
each cluster; driver age, crash location, vehicle condition,
weather condition, roadway type, vehicle maneuver, and
driver action were involved in defining the clusters.
Utilizing column clustering provides comprehensive
insights for crash study as the approach considers a
group of attributes that are likely to occur at the same
time rather than analyzing attributes individually.
Crash clustering revealed significant differences among the clusters and suggested that this crash dataset could be partitioned into same-direction crashes (including rear-end and same-direction sideswipe), opposing-direction crashes (including angle, head-on, and opposing sideswipe), and single-vehicle crashes (including rollovers and collisions with animals, pedestrians, bicyclists, fixed objects, or parked vehicles).
Individual blocks, defined by both row and column clustering, were further investigated to better understand the set of conditions that contribute to large truck crashes. The average degree of homogeneity for the selected blocks is 95.72%, which supports the robustness of the model. Major features of each of the three major types of crashes were analyzed, which may provide insights for developing countermeasures for specific segments. In particular, raised medians to reduce non-motorist crashes, notifying school bus drivers about the high risk of opposing-direction crashes, and programs targeting non-Florida truckers may help improve safety.
The suggested clustering approach can be used as a pre-analysis method for heterogeneous crash data. The block clustering approach can lead to more robust models for segmenting crash data for further analysis. In this paper, homogeneity improved significantly: the ICL and pseudo-likelihood values increased by 20.8% and 21.1%, respectively, in the optimized dataset. The findings of the clustering method were consistent with those of another study conducted in the same area, which employed a conventional segmenting approach. This shows the potential of clustering methods to produce meaningful results.
Although the dataset used was high-dimensional and
contained many crashes, it was limited to the state of
Florida and had limited attributes. Researchers are
encouraged to apply the methodology to more compre-
hensive datasets to obtain more general results. The
method can also be incorporated to improve the accu-
racy of truck crash prediction models as it provides
robust and statistically significant criteria to segment the
dataset.
Acknowledgments
This work is funded by the research office of the Florida
Department of Transportation (BDV29 977-31). Data were
extracted from the Signal Four Analytics database provided by
Ilir Bejleri and Liang Zhai at the University of Florida.
Author Contributions
The authors confirm contribution to the paper as follows: study
conception and design: AR and XJ; data processing: HA, AR,
and GA; analysis and interpretation of results: AR and XJ;
draft manuscript preparation: AR and XJ. All authors reviewed
the results and approved the final version of the manuscript.
References
1. Large Truck and Bus Crash Facts 2016. Analysis Division
Federal Motor Carrier Safety Administration. FMCSA-
RRA-17-016. U.S. Department of Transportation.
Washington, D.C., 2018.
2. Large Truck and Bus Crash Facts 2014. Analysis Division
Federal Motor Carrier Safety Administration. FMCSA-
RRA-16-001. U.S. Department of Transportation.
Washington, D.C., 2016.
3. Haleem, K., and A. Gan. Effect of Driver’s Age and Side
of Impact on Crash Severity along Urban Freeways: A
Mixed Logit Approach. Journal of Safety Research, Vol.
46, 2012, pp. 67–76.
4. Anastasopoulos, P., and F. Mannering. An Empirical
Assessment of Fixed and Random Parameter Logit
Models using Crash and Non-Crash-Specific Injury Data.
Accident Analysis and Prevention, Vol. 43, No. 3, 2011,
pp. 1140–1147.
5. Iranitalab, A., and A. Khattak. Comparison of Four Sta-
tistical and Machine Learning Methods for Crash Severity
Prediction. Accident Analysis and Prevention, Vol. 108,
Supplement C, 2017, pp. 27–36.
6. Shams, K., X. Jin, R. Fitzgerald, H. Asgari, and M. S.
Hossan. Value of Reliability for Road Freight Transporta-
tion: Evidence from a Stated Preference Survey in Florida.
Transportation Research Record: Journal of the Transporta-
tion Research Board, 2017. 2610: 35–43.
7. Jin, X., M. S. Hossan, H. Asgari, and K. Shams. Incorpor-
ating Attitudinal Aspects in Roadway Pricing Analysis.
Transport Policy, Vol. 62, 2018, pp. 38–47.
8. Depaire, B., G. Wets, and K. Vanhoof. Traffic Accident
Segmentation by Means of Latent Class Clustering.
Accident Analysis and Prevention, Vol. 40, 2008,
pp. 1257–1266.
9. Sasidharan, L., K. Wu, and M. Menendez. Exploring
the Application of Latent Class Cluster Analysis for Inves-
tigating Pedestrian Crash Injury Severities in Switzerland.
Accident Analysis and Prevention, Vol. 85, 2015,
pp. 219–228.
10. Valent, F., F. Schiava, C. Savonitto, T. Gallo, S. Brusa-
ferro, and F. Barbone. Risk Factors for Fatal Road Traffic
Accidents in Udine, Italy. Accident Analysis and Preven-
tion, Vol. 34, No. 1, 2002, pp. 71–84.
11. Yau, K. K. W. Risk Factors Affecting the Severity of Sin-
gle Vehicle Traffic Accidents in Hong Kong. Accident Anal-
ysis and Prevention, Vol. 36, No. 3, 2004, pp. 333–340.
12. Ulfarsson, G. F., and F. L. Mannering. Difference in Male
and Female Injury Severities in Sport-Utility Vehicle, Mini-
van, Pickup and Passenger Car. Accident Analysis and Pre-
vention, Vol. 36, No. 2, 2004, pp. 135–147.
13. Islam, S., and F. L. Mannering. Driver Aging and its Effect
on Male and Female Single-Vehicle Accident Injuries:
Some Additional Evidence. Accident Analysis and Preven-
tion, Vol. 37, No. 2, 2006, pp. 267–276.
14. Moore, D., W. Schneider, P. Savolainen, and M. Farzaneh.
Mixed Logit Analysis of Bicyclist Injury Severity Resulting
from Motor Vehicle Crashes at Intersection and
Non-Intersection Location. Accident Analysis and Prevention,
Vol. 43, 2011, pp. 621–630.
15. Shaheed, M. S., K. Gkritza, W. Zhang, and Z. Hans. A
Mixed Logit Analysis of Two-Vehicle Crash Severities
Involving a Motorcycle. Accident Analysis and Prevention,
Vol. 61, 2013, pp. 119–128.
16. Zeng, Z., W. Zhu, R. Ke, J. Ash, Y. Wang, J. Xu, and X.
Xu. A Generalized Nonlinear Model-Based Mixed Multinomial
Logit Approach for Crash Data Analysis. Accident
Analysis and Prevention, Vol. 99, 2017, pp. 51–65.
17. Milton, J., V. Shankar, and F. L. Mannering. Highway
Accident Severities and the Mixed Logit Model: An
Exploratory Empirical Analysis. Accident Analysis and
Prevention, Vol. 40, No. 1, 2008, pp. 260–266.
18. Wu, Q., F. Chen, G. Zhang, X. C. Liu, H. Wang, and S.
M. Bogus. Mixed Logit Model-Based Driver Injury Sever-
ity Investigations in Single- and Multi-Vehicle Crashes on
Rural Two-Lane Highways. Accident Analysis and Preven-
tion, Vol. 72, 2014, pp. 105–115.
19. Cerwick, D., K. Gkritza, and M. Shaheed. A Comparison
of the Mixed Logit and Latent Class Methods for Crash
Severity Analysis. Analytic Methods in Accident Research,
Vol. 3–4, 2014, pp. 11–27.
20. Steinbach, M., L. Ertöz, and V. Kumar. The Challenges of
Clustering High Dimensional Data. In New Directions in
Statistical Physics (L. T. Wille, ed.), Springer Verlag,
Berlin, Heidelberg, Germany, 2004, pp. 273–309.
21. Parsons, L. Subspace Clustering for High Dimensional
Data: A Review. ACM SIGKDD Explorations Newsletter:
Special Issue on Learning from Imbalanced Datasets, Vol.
6, No. 1, 2004, pp. 90–105.
22. Jain, A. K., M. N. Murty, and P. J. Flynn. Data Clustering:
A Review. ACM Computing Surveys (CSUR), Vol. 31, No.
3, 1999, pp. 264–323.
23. Kitali, A. E., E. Kidando, P. Martz, P. Alluri, T. Sando,
R. Moses, and R. Lentz. Evaluating Factors Influencing
the Severity of Three-Plus Multiple-Vehicle Crashes using
Real-Time Traffic Data. Transportation Research Record:
Journal of the Transportation Research Board, 2018.
2672(38): 128–137.
24. Hadi, M., Y. Xiao, T. Wang, S. F. Qom, L. Azizi, J. Jia, A.
Massahi, and M. S. Iqbal. Framework for Multi-Resolution
Analyses of Advanced Traffic Management Strategies. Tech-
nical Report. Lehman Center of Transportation Research
Florida International University, Miami, FL, 2016.
25. Ghasemzadeh, A., and M. M. Ahmed. A Tree-Based
Ordered Probit Approach to Identify Factors Affecting
Work Zone Weather-Related Crashes Severity in North
Carolina using the Highway Safety Information System
Dataset. Presented at 96th Annual Meeting of the Trans-
portation Research Board, Washington, D.C., 2017.
26. Haghighi, N., X. C. Liu, G. Zhang, and R. J. Porter.
Impact of Roadway Geometric Features on Crash Severity
on Rural Two-Lane Highways. Accident Analysis and Pre-
vention, Vol. 111, 2018, pp. 34–42.
27. Najaf, P., V. R. Duddu, and S. S. Pulugurtha. Predictabil-
ity and Interpretability of Hybrid Link-Level Crash
Frequency Models for Urban Arterials Compared to Clus-
ter-Based and General Negative Binomial Regression
Models. International Journal of Injury Control and Safety
Promotion, Vol. 25, No. 1, 2017, pp. 3–13.
28. Kamrani, M., A. J. Khattak, and T. Li. A Framework to
Process and Analyze Driver, Vehicle and Road Infrastruc-
ture Volatilities in Real-Time. Presented at 97th Annual
Meeting of the Transportation Research Board, Washing-
ton, D.C., 2018.
29. Motamedi, S., and J. H. Wang. Older Adult Drivers’ Chal-
lenges and In-Vehicle Technology Acceptance. Interna-
tional Journal for Traffic and Transport Engineering, Vol. 7,
No. 4, 2017, pp. 498–515.
30. Han, J., M. Kamber, and J. Pei. Data Mining:
Concepts and Techniques. Morgan Kaufmann, Waltham,
MA, 2001, pp. 335–393.
31. Aggarwal, C. C., and C. K. Reddy. Data Clustering Algorithms
and Applications. Chapman and Hall/CRC, Boca Raton, FL,
2013.
32. MacQueen, J. Some Methods for Classification and Analy-
sis of Multivariate Observations. In Proc., 5th Berkeley
Symposium on Mathematical Statistics and Probability.
University of California Press, Berkeley, CA, 1967,
pp. 281–297.
33. Lloyd, S. Least Squares Quantization in PCM. IEEE
Transactions on Information Theory, Vol. 28, No. 2, 1982,
pp. 129–137.
34. Anderson, T. K. Kernel Density Estimation and K-Means
Clustering to Profile Road Accident Hotspots. Accident
Analysis and Prevention, Vol. 41, 2009, pp. 359–364.
35. Mauro, R., M. D. Luca, and G. Dell’Acqua. Using a K-
Means Clustering Algorithm to Examine Patterns of
Vehicle Crashes in Before-After Analysis. Modern Applied
Science, Vol. 7, 2013, pp. 11–19.
36. Zhang, C., J. N. Ivan, and T. Jonsson. Collision Type Cate-
gorization Based on Crash Causality and Severity Analysis.
Presented at 86th Annual Meeting of the Transportation
Research Board, Washington, D.C., 2007.
37. Nitsche, P. Pre-Crash Scenarios at Road Junctions: A Clus-
tering Method for Car Crash Data. Accident Analysis and
Prevention, Vol. 107, 2017, pp. 137–151.
38. Brown, D. Efficient Functional Clustering of Protein
Sequences using the Dirichlet Process. Bioinformatics, Vol.
24, No. 16, 2008, pp. 1765–1771.
39. Berkhin, P. A Survey of Clustering Data Mining Techniques.
In Grouping Multidimensional Data (J. Kogan,
C. Nicholas, and M. Teboulle, eds.), Springer, Berlin,
Heidelberg, Germany, 2006, pp. 25–71.
40. Vermunt, J. K., and J. Magidson. Latent Class Cluster
Analysis. Applied Latent Class Analysis. Cambridge Uni-
versity Press, Cambridge, UK, 2002, pp. 89–106.
41. Mohamed, M. G., N. Saunier, L. F. Miranda-Moreno,
and S. V. Ukkusuri. A Clustering Regression Approach: A
Comprehensive Injury Severity Analysis of Pedestrian-
Vehicle Crashes in New York, US and Montreal, Canada.
Safety Science, Vol. 54, 2013, pp. 27–37.
42. Govaert, G., and M. Nadif. Block Clustering with Ber-
noulli Mixture Models: Comparison of Different
Approaches. Computational Statistics and Data Analysis,
Vol. 52, No. 6, 2008, pp. 3233–3245.
43. Madeira, S. C., and A. L. Oliveira. Biclustering Algorithms
for Biological Data Analysis: A Survey. IEEE/ACM Transactions
on Computational Biology and Bioinformatics, Vol. 1,
No. 1, 2004, pp. 24–45.
44. Govaert, G., and M. Nadif. Co-Clustering: Models, Algorithms
and Applications, 1st ed. Wiley-IEEE Press, Hoboken, NJ,
2013.
45. Dhillon, I. S. Co-Clustering Documents and Words using
Bipartite Spectral Graph Partitioning. Proceedings 7th
ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, KDD ’01, San Francisco, CA,
2001, pp. 269–274.
46. Wang, F., S. Lin, and P. S. Yu. Collaborative Co-Cluster-
ing across Multiple Social Media. Proc., 17th IEEE Inter-
national Conference on Mobile Data Management, IEEE,
Porto, Portugal, 2016.
47. Bhatia, P., S. Iovleff, and G. Govaert. Blockcluster: An R
Package for Model Based Co-Clustering. Journal of Statis-
tical Software, Vol. VV, No. II, 2014.
48. Hathaway, R. Another Interpretation of the EM Algo-
rithm for Mixture Distributions. Statistics and Probability
Letters, Vol. 4, No. 2, 1986, pp. 53–56.
49. Neal, R., and G. Hinton. A View of the EM Algorithm
That Justifies Incremental, Sparse, and Other Variants.
Learning in Graphical Models, 1998, pp. 355–368.
50. The GeoPlan Center. Signal Four Analytics. Department
of Urban & Regional Planning, University of Florida,
Gainesville, FL. https://s4.geoplan.ufl.edu/.
51. Biernacki, C., G. Celeux, and G. Govaert. Assessing a Mix-
ture Model for Clustering with the Integrated Classification
Likelihood. RR-3521. INRIA, 1998. https://hal.inria.fr/
inria-00073163/document.
52. Bertoletti, M., N. Friel, and R. Rastelli. Choosing the
Number of Clusters in a Finite Mixture Model using an
Exact Integrated Completed Likelihood Criterion.
METRON, Vol. 73, No. 2, 2015, pp. 177–199.
53. Spainhour, L. K., D. Brill, J. O. Sobanjo, J. Wekezer, and
P. V. Mtenga. Evaluation of Traffic Crash Fatality Causes
and Effects: A Study of Fatal Traffic Crashes in Florida
from 1998–2000 Focusing on Heavy Truck Crashes. Final
Report. Project No. BD-050. Florida Department of
Transportation, Tallahassee, FL, 2005.
The Standing Committee on Artificial Intelligence and
Advanced Computing Applications (ABJ70) peer-reviewed this
paper (19-02466).
The opinions, findings and conclusions expressed in this paper are
those of the authors and not necessarily those of the Florida
Department of Transportation or the U.S. Department of
Transportation.