
Clustering Approach toward Large Truck Crash Analysis

April 2019
Transportation Research Record: Journal of the Transportation Research Board


DOI:10.1177/0361198119839347

Authors:

Alireza Rahimi

Florida International University

Ghazaleh Azimi

Florida International University

Hamidreza Asgari

Florida International University

Xia Jin

Florida International University

Heterogeneity of crash data masks the underlying crash patterns and complicates crash analysis. This paper aims to explore an advanced high-dimensional clustering approach to investigate heterogeneity in large datasets. Detailed records of crashes involving large trucks occurring in the state of Florida between 2007 and 2016 were examined to identify truck crash patterns and significant conditions contributing to the patterns. The block clustering method was applied to more than 220,000 crash records with nearly 200 attributes. The analysis showed promising results in segmenting a large heterogeneous dataset into meaningful subgroups (with 95.72% average degree of homogeneity for selected blocks). The goodness of fit of the clustering models was evaluated, and both integrated completed likelihood (ICL) and pseudo-likelihood values improved significantly (20.8% and 21.1%, respectively). Attribute clustering showed distinct characteristics for each cluster. Crash clustering revealed significant differences among the clusters and suggested that this crash dataset could be partitioned into same-direction, opposing-direction, and single-vehicle crashes. Individual blocks defined by both row and column clustering were further investigated to better understand the set of contributing conditions that lead to large truck crashes. Major features for each of the three major types of crashes were analyzed, which may provide additional insights to develop potential countermeasures and strategies that target specific segments. The clustering approach could be used as a preanalysis method to identify homogeneous subgroups for further analysis, which will help enhance the effectiveness of safety programs.



Research Article

Transportation Research Record, 1–13

© National Academy of Sciences: Transportation Research Board 2019

Article reuse guidelines: sagepub.com/journals-permissions

journals.sagepub.com/home/trr


The number of crashes involving large trucks has been increasing in the United States. In 2010, large trucks were involved in about 58,000 injuries and 3,494 fatal crashes. In 2016, the number of injury crashes involving large trucks almost doubled, and fatal crashes increased by more than 20% compared with 2010 (1). Large truck crashes impose an enormous loss on society. In addition to increased traffic congestion and property damage, they put roadway users at high risk of injury and fatality. There are also adverse consequences for the prosperity of industry, including delay-related costs, additional operations costs, and productivity loss. The cost of commercial vehicle crashes has been estimated to be over $99 billion annually (2).

In this regard, many studies have focused on large truck crash analysis and identification of countermeasures to improve truck safety (3,4). One particular challenge in investigating contributing factors of large truck crashes is the presence of heterogeneity, which refers to the correlation between unobserved factors and observed variables (5). In other words, the impacts of specific factors might vary among the observations, leading to a random distribution of parameter coefficients rather than fixed impacts. The presence of heterogeneity in different aspects of travel behavior and traffic data has been widely discussed (6,7). Traffic accident data are often heterogeneous, considering that the occurrence and severity of a crash are the result of multiple contributing factors acting at the same time (8). Several heterogeneity issues may need to be addressed (9). First, some of the contributing factors may remain hidden. For instance, highly influential factors for a group of crashes might not be significant for the whole dataset (8–11). Second, the degree of effect of specific contributing factors might differ between the whole dataset and its subgroups (12,13). Moreover, certain contributing factors might have a completely different effect on different groups, such as increasing the risk of fatal crashes for men and decreasing it for women (12).

Heterogeneity of crash data masks the underlying crash patterns and complicates crash analysis (8). Although various approaches have been undertaken to investigate heterogeneity associated with crash data, such as mixed logit models or generalized structures (14–19), various phenomena may arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions). The major issue with high-dimensional datasets is that data points become increasingly sparse as the dimensionality increases (20), traditional techniques begin to fail, and the quality of results deteriorates (21). High-dimensional crash data require more robust methods to fully discover hidden patterns (22). In addition, recent efforts on heterogeneity have mostly focused on pedestrian or passenger vehicle crashes, with very few studies investigating truck crash heterogeneity. This paper, then, aims to explore an advanced high-dimensional approach, block clustering, to investigate heterogeneity in large datasets. Detailed records of crashes involving large trucks that occurred in the state of Florida between 2007 and 2016 have been examined to identify truck crash patterns and significant conditions contributing to the patterns.

The next section provides a brief overview of existing studies that have investigated heterogeneity in crash datasets. The methodology used in this research is described in the third section. The fourth section describes the data. The fifth section analyzes the results and the final section concludes the study.

Department of Civil and Environmental Engineering, Florida International University, Miami, FL

Corresponding author: Xia Jin, xjin1@fiu.edu

Literature Review

Two general approaches have been taken to investigate heterogeneity in crash data (8). Data constraining is a common approach that focuses on a very specific segment of the crash dataset. For example, Kitali et al. focused on multiple-vehicle crashes (23); Hadi et al. and Ghasemzadeh et al. analyzed incidents in work zones (24,25); other specific subjects, such as crashes on rural or urban arterials (26,27) and crashes involving volatile or older adult drivers (28,29), have also been investigated.

Although data constraining is a useful approach for analyzing a specific type of crash, it cannot be generalized to other crash types, and therefore has limited applicability. The second approach to discovering heterogeneity is clustering. Unlike classification problems, in which each observation is associated with a group and the objective is to place a new observation in one of the groups, cluster analysis seeks to discover the number and composition of groups that best describe the characteristics of the observations. Cluster analysis uses distance measures over various dimensions in the dataset to discover clusters of similar objects (22,30). Crash data clustering has been investigated mainly using two major approaches, partitional and model-based clustering.

Partitional clustering algorithms optimize a particular objective function to identify clustering patterns that are present in the data and iteratively improve the quality of the partitions. Partitional clustering is also called prototype-based clustering because it requires parameters to be used as prototype points that represent each cluster (31). K-means clustering (32,33) is the most widely used partitional clustering algorithm in crash data analysis. In the first step, K representative points are selected as the initial centroids. Using a proximity measure, each point in the dataset is then assigned to the closest centroid, and the centroid of each cluster is updated based on the newly formed clusters. These two steps are repeated until the centroids no longer change or until an alternative convergence criterion is met. Anderson used K-means clustering to profile road accident hotspots (34). Iranitalab et al. used K-means clustering for crash severity prediction (5). Mauro et al. applied K-means clustering to examine patterns of vehicle crashes before and after infrastructural interventions to improve road safety (35). Zhang et al. utilized K-means clustering in crash causality and severity analysis (36). K-medoids is another partitional clustering algorithm, which is more resilient to outliers than K-means (34). The K-medoids algorithm seeks clustering solutions that minimize a specific objective function. It is more robust to noise and outliers in the data because actual data points are chosen as the prototypes (34). Nitsche et al. used K-medoids to investigate pre-crash scenarios at road junctions (37). It is worth mentioning that partitional methods are nondeterministic in nature and need a user-predefined number of clusters to obtain a solution (31).
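The two alternating K-means steps described above (assignment to the closest centroid, then centroid update) can be sketched in a few lines. This is a minimal illustration in plain Python, assuming squared Euclidean distance as the proximity measure and random selection of the initial centroids; the function name `kmeans`, the `seed` argument, and the toy points are illustrative assumptions and are not taken from the cited studies.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means sketch: alternate assignment and centroid update."""
    rng = random.Random(seed)
    # Step 0: pick k data points as the initial centroids (random choice is an assumption).
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign each point to the closest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Convergence criterion: stop when no assignment changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 2: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Toy data (hypothetical): two well-separated groups of points.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(toy_points, k=2)
print(labels)
```

Because the initial centroids are drawn at random, a different `seed` can change the cluster labels (and, on less clean data, the solution itself), which is the nondeterminism noted in the text. The number of clusters `k` must be supplied by the user.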

In the model-based clustering approach, the observations are assumed to follow a specific model. The model, which is often a statistical distribution, may be user-specified and might change during the process (38,39). Latent class clustering (LCC) is a probabilistic model-based clustering method that assumes a mixture of several probability densities within the data (40). Mohamed et al. used LCC in injury severity analysis of pedestrian–vehicle crashes (41). Iranitalab et al. used LCC for crash severity prediction (5). Depaire et al. segmented traffic accidents by means of LCC (8).

Despite the efficiency and simplicity of these clustering methods, there are several limitations. First, all existing studies have focused on segmenting the observations (i.e., crash records) and disregarded potential clusters among the explanatory variables. This leads to a focus on individual variable impacts rather than on the intertwining effects among a subset of factors that contribute to different types of crashes. Although traditional clustering methods could be applied to observations and variables separately, their very high computational complexity makes them unsuitable for high-dimensional datasets (42). In addition, the possible level of noise and the large amount of meaningless information in a large crash dataset require a robust method for clustering (43).

In summary, two approaches can be used to analyze heterogeneous crash data: data constraining and clustering. Although data constraining is simple, it has limited applicability and might be subjective. The clustering approach provides unbiased results for segmenting a dataset and can enhance the homogeneity significantly.

This paper aims to apply a robust clustering method to overcome the limitations of traditional clustering methods. Given the above, block clustering, also referred to as coclustering or biclustering, holds the promise of addressing heterogeneity in high-dimensional datasets. The next section presents a detailed description of the block clustering algorithm employed for this study.

Methodology

Coclustering utilizes the duality of clustering and discovers hidden latent patterns by generating a compact representation of the dataset. The goal is to cluster the sets of rows and columns simultaneously to obtain homogeneous blocks (44). This method has attracted much attention in recent years for text mining (clustering of documents and words simultaneously) (45), bioinformatics (clustering of genes and tissues simultaneously) (43), and social network analysis (46).

Block clustering considers the two sets, observations and variables, simultaneously and organizes the data into homogeneous blocks. If $X$ denotes an $n \times d$ data matrix defined by $X = \{(x_{ij});\ i \in I \text{ and } j \in J\}$, where $I$ is a set of $n$ objects (rows, observations, crashes) and $J$ is a set of $d$ variables (columns, variables, attributes), the main objective of this method is to make permutations or rearrangements of observations and attributes to construct a correspondence structure on $I \times J$. An important advantage of block clustering is the transformation of the initial data matrix $X$ into a simpler and smaller data matrix with the same structure (42,44). Moreover, block clustering is fast and requires far less computation than that needed to process the two sets separately and consecutively, as in the well-known K-means algorithm (42).

Figure 1 illustrates an example of block clustering. Array (a) in Figure 1 presents a binary dataset consisting of $n = 10$ objects, $I = \{A, B, C, D, E, F, G, H, I, J\}$, and a set of $d = 7$ binary variables, $J = \{1, 2, 3, 4, 5, 6, 7\}$. To obtain a homogeneous dataset, array (a) can be reorganized either by a partition on $I$ or by partitions on $I$ and $J$ simultaneously. Array (b) consists of data reorganized by a partition of $I$ into $g = 3$ clusters, $a = \{A, C, H\}$, $b = \{B, F, J\}$, and $c = \{D, G, I, E\}$. Array (c) consists of data reorganized by the same partition of $I$ and a partition of $J$ into $m = 3$ clusters, $\mathrm{I} = \{1, 4\}$, $\mathrm{II} = \{3, 5, 7\}$, and $\mathrm{III} = \{2, 6\}$. Compared with array (b), array (c) clearly reveals an interesting pattern (42). The block clustering approach takes advantage of partitioning on $I$ and $J$ simultaneously and results in a more homogeneous dataset compared with traditional clustering models like K-means. Another advantage of the block clustering method is that it reduces the initial data matrix $X$ to a simpler data matrix having the same structure. In the example, the initial $(10 \times 7)$ binary data matrix is reduced to a $(g \times m) = (3 \times 3)$ summary binary data matrix (Figure 1d).
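The reduction from a full binary matrix to a summary matrix can be illustrated with a short sketch. The row and column partitions below are the ones quoted in the text, but the cell values are hypothetical because the entries of Figure 1 are not reproduced here; the block pattern, the single flipped cell, and the definition of block homogeneity as the share of cells equal to the block's majority value are all assumptions made for illustration.

```python
# Row and column partitions quoted in the text for Figure 1.
row_clusters = {"a": ["A", "C", "H"], "b": ["B", "F", "J"], "c": ["D", "G", "I", "E"]}
col_clusters = {"I": [1, 4], "II": [3, 5, 7], "III": [2, 6]}

# Hypothetical block pattern (NOT the values printed in Figure 1).
pattern = {("a", "I"): 1, ("a", "II"): 0, ("a", "III"): 0,
           ("b", "I"): 0, ("b", "II"): 1, ("b", "III"): 0,
           ("c", "I"): 0, ("c", "II"): 1, ("c", "III"): 1}

def build_data(row_clusters, col_clusters, pattern):
    """Create x[(object, variable)] following the block pattern, plus one noisy cell."""
    x = {}
    for k, rows in row_clusters.items():
        for l, cols in col_clusters.items():
            for i in rows:
                for j in cols:
                    x[(i, j)] = pattern[(k, l)]
    x[("D", 3)] = 1 - x[("D", 3)]  # flip one cell so that one block is not perfectly pure
    return x

def summarize(x, row_clusters, col_clusters):
    """Return the majority value and the homogeneity of every (row, column) block."""
    summary, homogeneity = {}, {}
    for k, rows in row_clusters.items():
        for l, cols in col_clusters.items():
            cells = [x[(i, j)] for i in rows for j in cols]
            ones = sum(cells)
            majority = 1 if 2 * ones > len(cells) else 0
            agree = ones if majority == 1 else len(cells) - ones
            summary[(k, l)] = majority
            homogeneity[(k, l)] = agree / len(cells)
    return summary, homogeneity

x = build_data(row_clusters, col_clusters, pattern)
summary, homogeneity = summarize(x, row_clusters, col_clusters)
for k in row_clusters:
    # One line per row cluster: the g x m summary matrix.
    print(k, [summary[(k, l)] for l in col_clusters])
```

The 70 cells of the 10 × 7 matrix collapse to nine majority values, one per block, which is the kind of compact representation shown in panel (d).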

Different approaches can be applied for coclustering, and these approaches can differ in the pattern they seek and the types of data they apply to. Govaert and Nadif proposed a general framework to formalize the hypotheses of coclustering algorithms (42). They introduced a latent block model to solve the coclustering problem and overcome the defects of classical coclustering methods. They suggested a block clustering framework which utilizes parsimonious models and allows a rigorous simulation. This section presents a block clustering approach based on the work of Govaert and Nadif (42) and Bhatia et al. (47).

Figure 1. Block clustering, showing (a) binary data set, (b) data reorganized by a partition on I, (c) data reorganized by partitions on I and J simultaneously, and (d) summary binary data.

Mixture Models

A fundamental assumption of model-based clustering is that the data originate from a mixture of underlying probability distributions, where each component $k$ of the mixture represents a cluster. The data matrix $X = \{x_i;\ i \in (1, \dots, n)\}$ is therefore assumed to be independent and identically distributed and to arise from a probability distribution with density (42,44):

$$f(x;\theta) = \prod_{i} \sum_{k} \pi_k f_k(x_i;\alpha) \tag{1}$$

where

$f_k$ denotes the density function for the $k$th component,

$\alpha$ is the corresponding class parameter,

$\pi_k$ is the probability that an observation belongs to the $k$th component, with $k = (1, \dots, g)$ and $g$ assumed to be known, and

$\theta$ is the vector $(\pi_1, \dots, \pi_g, \alpha)$.

Govaert and Nadif (42) showed that the density function can be rewritten as:

$$f(x;\theta) = \sum_{z \in Z} p(z)\, f(x \mid z;\alpha) \tag{2}$$

$$f(x \mid z;\alpha) = \prod_{i} f_{z_i}(x_i;\alpha) \tag{3}$$

$$p(z) = \prod_{i} \pi_{z_i} \tag{4}$$

where $Z$ stands for the set of all possible partitions of $I$ into $g$ clusters. Under this formulation, the data matrix is treated as a sample of size 1 from a random $(n, d)$ matrix.

Latent Block Model

The set $I$ can be partitioned into $g$ clusters by $z = (z_{11}, \dots, z_{ng})$, with $z_{ik} = 1$ if $i$ belongs to cluster $k$ and $z_{ik} = 0$ otherwise; $z_i = k$ if $z_{ik} = 1$, and $z_{.k}$ is the cardinality (the number of elements) of row cluster $k$. Likewise, $J$ can be divided into $m$ clusters by $w = (w_{11}, \dots, w_{dm})$, with $w_{jl} = 1$ if $j$ belongs to cluster $l$ and $w_{jl} = 0$ otherwise; $w_j = l$ if $w_{jl} = 1$, and $w_{.l}$ is the cardinality of column cluster $l$.

To investigate block clustering, Govaert and Nadif extended the mixture model density function and assumed that the labelings of $I$ and $J$ are independent of each other (42). The resulting latent block mixture model is defined by the following probability density function (PDF):

$$f(x;\theta) = \sum_{(z,w) \in Z \times W} \prod_{i} \pi_{z_i} \prod_{j} \rho_{w_j} \prod_{i,j} f_{z_i w_j}(x_{ij};\alpha) \tag{5}$$

where

$Z$ and $W$ are the sets of all possible labelings $z$ of $I$ and $w$ of $J$, respectively,

$f_{z_i w_j}(x;\alpha)$ is the PDF defined on the real set $\mathbb{R}$, and

$\theta = (\pi, \rho, \alpha)$, with $\pi = (\pi_1, \dots, \pi_g)$ and $\rho = (\rho_1, \dots, \rho_m)$ the vectors of probabilities $\pi_k$ and $\rho_l$ that a row and a column belong to the $k$th row cluster and to the $l$th column cluster, respectively.

According to the above formulation, the randomized data generation method can be described as follows (42,44):

Row labeling: Generate the labeling $z = (z_1, \dots, z_n)$ according to the distribution $\pi = (\pi_1, \dots, \pi_g)$.

Column labeling: Generate the labeling $w = (w_1, \dots, w_d)$ according to the distribution $\rho = (\rho_1, \dots, \rho_m)$.

Data generation: Generate for $i = (1, \dots, n)$ and $j = (1, \dots, d)$ a value $x_{ij}$ according to the density $f_{z_i w_j}(\cdot\,;\alpha)$.
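The three generation steps above can be sketched as a small simulation. This is an illustrative sketch only, assuming binary data in which each block density is a Bernoulli distribution with its own probability of a one; the function name `simulate_latent_block`, the argument `block_prob`, and the example column proportions and block probabilities are hypothetical. The row proportions in the usage example are the rounded values reported later in Table 2.

```python
import random

def simulate_latent_block(n, d, pi, rho, block_prob, seed=0):
    """Simulate binary data from a latent block model (Bernoulli blocks assumed)."""
    rng = random.Random(seed)
    g, m = len(pi), len(rho)
    # Row labeling: draw z_i in {0, ..., g-1} with probabilities pi.
    z = rng.choices(range(g), weights=pi, k=n)
    # Column labeling: draw w_j in {0, ..., m-1} with probabilities rho.
    w = rng.choices(range(m), weights=rho, k=d)
    # Data generation: x_ij equals 1 with probability block_prob[z_i][w_j].
    x = [[1 if rng.random() < block_prob[z[i]][w[j]] else 0 for j in range(d)]
         for i in range(n)]
    return z, w, x

# Usage with three row clusters and two column clusters.
# Column proportions and block probabilities are hypothetical.
z, w, x = simulate_latent_block(
    n=8, d=6,
    pi=[0.318, 0.399, 0.283],
    rho=[0.5, 0.5],
    block_prob=[[0.9, 0.1], [0.1, 0.9], [0.5, 0.05]],
    seed=1,
)
for row_label, row in zip(z, x):
    print(row_label, row)
```

Clustering reverses this process: only `x` is observed, and the labels `z` and `w` together with the parameters have to be inferred.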

Model Parameter Estimation

EM-based algorithms (42,44,47) can be used to estimate the model parameters by maximizing the observed-data log-likelihood. The complete-data log-likelihood is defined by the following function:

$$L_c(z, w, \theta) = \sum_{k} z_{.k} \log \pi_k + \sum_{l} w_{.l} \log \rho_l + \sum_{i,j,k,l} z_{ik} w_{jl} \log f_{kl}(x_{ij};\alpha) \tag{6}$$

In this method, the conditional expectation $Q(\theta, \theta^{(c)})$ of the complete-data log-likelihood, given the current estimate $\theta^{(c)}$ and $x$, is maximized iteratively to increase the log-likelihood:

$$Q(\theta, \theta^{(c)}) = \sum_{i,k} t_{ik}^{(c)} \log \pi_k + \sum_{j,l} r_{jl}^{(c)} \log \rho_l + \sum_{i,j,k,l} e_{ikjl}^{(c)} \log f_{kl}(x_{ij};\alpha) \tag{7}$$

where

$t_{ik}^{(c)} = P(z_{ik} = 1 \mid x, \theta^{(c)})$,

$r_{jl}^{(c)} = P(w_{jl} = 1 \mid x, \theta^{(c)})$, and

$e_{ikjl}^{(c)} = P(z_{ik} w_{jl} = 1 \mid x, \theta^{(c)})$.

Because of the dependence structure in the model, Govaert and Nadif (42) proposed an approximate solution using the interpretation of the EM algorithm by Hathaway (48) and Neal and Hinton (49). The fuzzy clustering criterion for the latent block model can therefore be defined as follows, in which $L_c$ is the fuzzy complete-data log-likelihood associated with the latent block model:

$$F_c(t, r; \theta) = L_c(t, r, \theta) + H(t) + H(r) \tag{8}$$

where

$$H(t) = -\sum_{i,k} t_{ik} \log t_{ik}, \qquad H(r) = -\sum_{j,l} r_{jl} \log r_{jl},$$

$$L_c(t, r; \theta) = \sum_{k} t_{.k} \log \pi_k + \sum_{l} r_{.l} \log \rho_l + \sum_{i,j,k,l} t_{ik} r_{jl} \log f_{kl}(x_{ij};\alpha). \tag{9}$$

Algorithms

Govaert and Nadif (42) proposed a block expectation maximization (BEM) algorithm to maximize the fuzzy clustering criterion using the following steps.

E-Step: The conditional row and column class probabilities are computed, respectively, from

$$\log t_{ik} = \log \pi_k + \sum_{j,l} r_{jl} \log f_{kl}(x_{ij};\alpha) + \text{const} \tag{10}$$

$$\log r_{jl} = \log \rho_l + \sum_{i,k} t_{ik} \log f_{kl}(x_{ij};\alpha) + \text{const} \tag{11}$$

where the constants normalize the probabilities so that $\sum_k t_{ik} = 1$ for every row and $\sum_l r_{jl} = 1$ for every column.

M-Step: The row proportions $\pi$, column proportions $\rho$, and the model parameter $\alpha$ are calculated by maximizing $\sum_k t_{.k} \log \pi_k$, $\sum_l r_{.l} \log \rho_l$, and $\sum_{i,j,k,l} t_{ik} r_{jl} \log f_{kl}(x_{ij};\alpha)$, which are the first, second, and last terms in $L_c$, respectively. The estimation of $\alpha$ depends on the PDF $f_{kl}$, which is discussed later for binary data.

Therefore, the BEM algorithm suggested by Govaert and Nadif (42) to maximize the fuzzy clustering criterion can be described as:

1. Initialize $t^{(0)}$, $r^{(0)}$, and $\theta^{(0)} = (\pi^{(0)}, \rho^{(0)}, \alpha^{(0)})$.

2. Compute $t^{(c+1)}$, $\pi^{(c+1)}$, $\alpha^{(c+1/2)}$ by using the EM algorithm for the data matrix $u_{il} = \sum_j r_{jl} x_{ij}$, starting from $\pi^{(c)}$, $\rho^{(c)}$, $\alpha^{(c)}$.

3. Compute $r^{(c+1)}$, $\rho^{(c+1)}$, $\alpha^{(c+1)}$ by using the EM algorithm for the data matrix $v_{jk} = \sum_i t_{ik} x_{ij}$, starting from $r^{(c)}$, $\rho^{(c)}$, $\alpha^{(c+1/2)}$.

4. Iterate steps (2) and (3) until convergence.

Block Mixture Models for Binary Datasets

This section summarizes the methodology and describes the final clustering model used, based on the blockcluster R package (47). The crash dataset used in this study included categorical and binary variables. Categorical variables were converted to dummy variables, and the following block mixture model was used to solve the binary block clustering problem. Govaert and Nadif (42) discussed how the Bernoulli probability distribution function, which is needed to find the model parameter $\alpha$, can be written as:

$$f_{kl}(x_{ij};\alpha) = (\varepsilon_{kl})^{|x_{ij} - a_{kl}|} (1 - \varepsilon_{kl})^{1 - |x_{ij} - a_{kl}|} \tag{12}$$

$$\begin{cases} a_{kl} = 0,\ \varepsilon_{kl} = p_{kl} & \text{if } p_{kl} < 0.5 \\ a_{kl} = 1,\ \varepsilon_{kl} = 1 - p_{kl} & \text{if } p_{kl} > 0.5 \end{cases} \tag{13}$$

where $p = (p_{kl})$ is the matrix of Bernoulli parameters of the binary dataset, with $p_{kl} \in [0, 1]$, and $a_{kl}$ and $\varepsilon_{kl}$ characterize the center and dispersion of block $(k, l)$, respectively. $a_{kl}$ represents the most frequent binary value, and $\varepsilon_{kl}$ gives the probability of having a different value than the center for each block.
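A quick numerical check helps confirm that the center and dispersion form is only a re-expression of the ordinary Bernoulli distribution. The sketch below writes the block Bernoulli probability as `p` and assigns the boundary case `p = 0.5` to center 1, which is an assumption (the inequalities above leave that case open); all function names are illustrative.

```python
def bernoulli_pmf(x, p):
    """Ordinary Bernoulli probability of observing x in {0, 1}."""
    return p if x == 1 else 1 - p

def center_dispersion(p):
    """Map a Bernoulli probability to (center a, dispersion eps) as in Equation 13.

    The case p == 0.5 is sent to center 1 here; that tie-break is an assumption.
    """
    return (0, p) if p < 0.5 else (1, 1 - p)

def reparam_pmf(x, a, eps):
    """Center/dispersion form: eps^|x - a| * (1 - eps)^(1 - |x - a|)."""
    diff = abs(x - a)
    return eps ** diff * (1 - eps) ** (1 - diff)

# The two forms agree for every x and p tried here.
for p in (0.1, 0.3, 0.7, 0.9):
    a, eps = center_dispersion(p)
    for x in (0, 1):
        assert abs(bernoulli_pmf(x, p) - reparam_pmf(x, a, eps)) < 1e-12
    print(p, a, round(eps, 3))
```

Because the dispersion never exceeds 0.5, a small value indicates a block in which almost every cell equals the center, that is, a homogeneous block.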

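Based on this Bernoulli probability distribution function, both the E and M steps can be redefined.

E-Step: The conditional row and column class probabilities can be found from:

$$\log t_{ik} = \log \pi_k + \sum_{l} \left[ \left| u_{il} - r_{.l}\, a_{kl} \right| \log \frac{\varepsilon_{kl}}{1 - \varepsilon_{kl}} + r_{.l} \log(1 - \varepsilon_{kl}) \right] + \text{const} \tag{14}$$

$$\log r_{jl} = \log \rho_l + \sum_{k} \left[ \left| v_{jk} - t_{.k}\, a_{kl} \right| \log \frac{\varepsilon_{kl}}{1 - \varepsilon_{kl}} + t_{.k} \log(1 - \varepsilon_{kl}) \right] + \text{const} \tag{15}$$

$$u_{il} = \sum_{j} r_{jl} x_{ij} \tag{16}$$

$$v_{jk} = \sum_{i} t_{ik} x_{ij} \tag{17}$$

where the constants normalize the probabilities so that $\sum_k t_{ik} = 1$ and $\sum_l r_{jl} = 1$.

M-Step: The model parameter $\alpha$ is calculated as:

$$\begin{cases} a_{kl} = 0,\ \varepsilon_{kl} = \dfrac{y_{kl}}{t_{.k}\, r_{.l}} & \text{if } \dfrac{y_{kl}}{t_{.k}\, r_{.l}} < 0.5 \\[2ex] a_{kl} = 1,\ \varepsilon_{kl} = 1 - \dfrac{y_{kl}}{t_{.k}\, r_{.l}} & \text{otherwise} \end{cases} \tag{18}$$

with $y_{kl} = \sum_{i,j} t_{ik} r_{jl} x_{ij}$ denoting the weighted count of ones in block $(k, l)$.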
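To make the binary E and M steps concrete, the following sketch alternates them on a small matrix. It is a simplified illustration rather than the nested BEM scheme or the blockcluster implementation: it works directly with the block Bernoulli probability $y_{kl}/(t_{.k} r_{.l})$ (from which the center and dispersion follow), uses a random soft initialization, and runs a fixed number of iterations. All function names, the `floor` safeguard against log(0), and the toy matrix are assumptions.

```python
import math
import random

def normalize_log_rows(log_rows):
    """Turn each row of log-scores into probabilities that sum to one."""
    out = []
    for row in log_rows:
        top = max(row)
        weights = [math.exp(v - top) for v in row]
        total = sum(weights)
        out.append([wt / total for wt in weights])
    return out

def m_step(x, t, r, floor=1e-6):
    """Update proportions and block probabilities p_kl = y_kl / (t_.k * r_.l)."""
    n, d, g, m = len(x), len(x[0]), len(t[0]), len(r[0])
    t_sum = [sum(t[i][k] for i in range(n)) for k in range(g)]
    r_sum = [sum(r[j][l] for j in range(d)) for l in range(m)]
    # u_il = sum_j r_jl * x_ij
    u = [[sum(r[j][l] * x[i][j] for j in range(d)) for l in range(m)] for i in range(n)]
    p = []
    for k in range(g):
        row = []
        for l in range(m):
            y_kl = sum(t[i][k] * u[i][l] for i in range(n))
            value = y_kl / max(t_sum[k] * r_sum[l], floor)
            row.append(min(max(value, floor), 1 - floor))  # keep logs finite
        p.append(row)
    pi = [max(s / n, floor) for s in t_sum]
    rho = [max(s / d, floor) for s in r_sum]
    return pi, rho, p

def e_step_rows(x, r, pi, p):
    """Row memberships t_ik given column memberships r (log form, then normalized)."""
    d, g, m = len(x[0]), len(pi), len(p[0])
    r_sum = [sum(r[j][l] for j in range(d)) for l in range(m)]
    log_t = []
    for x_i in x:
        u_i = [sum(r[j][l] * x_i[j] for j in range(d)) for l in range(m)]
        log_t.append([
            math.log(pi[k]) + sum(
                u_i[l] * math.log(p[k][l]) + (r_sum[l] - u_i[l]) * math.log(1 - p[k][l])
                for l in range(m))
            for k in range(g)])
    return normalize_log_rows(log_t)

def fit_bernoulli_blocks(x, g, m, n_iter=50, seed=0):
    """Alternate row and column updates; returns hard labels and parameters."""
    rng = random.Random(seed)
    n, d = len(x), len(x[0])
    x_t = [list(col) for col in zip(*x)]  # transposed data for the column update
    t = normalize_log_rows([[rng.random() for _ in range(g)] for _ in range(n)])
    r = normalize_log_rows([[rng.random() for _ in range(m)] for _ in range(d)])
    for _ in range(n_iter):
        pi, rho, p = m_step(x, t, r)
        t = e_step_rows(x, r, pi, p)
        pi, rho, p = m_step(x, t, r)
        # The column update is the row update applied to the transposed problem.
        r = e_step_rows(x_t, t, rho, [list(col) for col in zip(*p)])
    pi, rho, p = m_step(x, t, r)
    z = [max(range(g), key=lambda k: row[k]) for row in t]
    w = [max(range(m), key=lambda l: row[l]) for row in r]
    return z, w, pi, rho, p

# Toy binary matrix (hypothetical): two row groups with opposite column patterns.
toy = [[1, 1, 1, 0, 0, 0]] * 4 + [[0, 0, 0, 1, 1, 1]] * 4
z, w, pi, rho, p = fit_bernoulli_blocks(toy, g=2, m=2)
print(z, w)
```

As with other EM-type procedures, the result depends on the initialization, so in practice several random starts would be compared on the fitted criterion.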

Data Description

The data used in this study were extracted from the Florida statewide crash database through the Signal Four Analytics portal (50). The data were coded from police crash reports including driver, vehicle, crash, and citation information. Each crash involved at least one large truck. Roadway network information was also integrated in the database. Irrelevant information was removed. The final dataset contains more than 200 attributes. Categorical variables were recoded into dummy variables, as the applied methodology required binary inputs.
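The recoding of categorical variables into binary dummy variables can be sketched as follows. The attribute names and levels in the example records are hypothetical stand-ins (the actual coding scheme of the crash database is not shown here), and `to_dummies` is an illustrative helper rather than the procedure used in the study.

```python
def to_dummies(records, categorical):
    """Expand each categorical attribute into one 0/1 column per observed level."""
    levels = {col: sorted({rec[col] for rec in records}) for col in categorical}
    columns = [f"{col}={lvl}" for col in categorical for lvl in levels[col]]
    rows = [
        [1 if rec[col] == lvl else 0 for col in categorical for lvl in levels[col]]
        for rec in records
    ]
    return columns, rows

# Hypothetical crash records with two categorical attributes.
records = [
    {"light_condition": "daylight", "road_surface": "dry"},
    {"light_condition": "dark", "road_surface": "wet"},
    {"light_condition": "daylight", "road_surface": "wet"},
]
columns, rows = to_dummies(records, ["light_condition", "road_surface"])
print(columns)
for row in rows:
    print(row)
```

Every level becomes its own column, so the number of binary attributes can be considerably larger than the number of original categorical fields.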

The database recorded around 200 variables, describing the characteristics of the drivers, vehicles, crash events, roadway geometry, lighting, and environment conditions. The total sample involved 220,932 crashes that occurred between 2007 and 2016, involving 228,180 large trucks, 180,702 non-truck motor vehicles, 1,902 fatalities, and 58,976 injuries.

Roadway problems were present in 1.9% of the two-vehicle cases, and adverse weather and light conditions were present in approximately 8.2% and 20.3% of the crashes, respectively. Interruption in the traffic flow (previous crash, work zone, peak hour congestion, etc.) was coded in almost 2.3% of the two-vehicle crashes.

About 74% of the crashes occurred on local roads, state highways, interstates, or county roads. In 80% of the accidents, crash severity was reported as property damage only, whereas injury and fatality were coded in 19% and 1% of accidents, respectively. Hit-and-run and school bus–related crashes were reported 6,538 and 1,056 times, respectively.

Table 1 below presents crash type by severity level.

Results Analysis

The first step in block clustering is finding the optimum number of clusters for rows and columns. Biernacki et al. (51) suggested integrated completed likelihood (ICL), a model selection criterion based on the complete-data likelihood that has proven to be more robust than the Bayesian information criterion (BIC) for mixture models. For a detailed discussion of the ICL criterion, readers are referred to Biernacki et al. (51) and Bertoletti et al. (52).

Combinations of 1 through 10 row clusters and 1 through 10 column clusters were tried to find the optimum number of blocks. ICL and pseudo-likelihood values for each model were evaluated. The optimum number of blocks was found to be 30, with 3 row clusters (K) and 10 column clusters (L), as neither ICL nor pseudo-likelihood improved much when the numbers increased further. The ICL and pseudo-likelihood values for all models can be found in Figures 2 and 3. The results show that ICL and pseudo-likelihood values for the optimum number of blocks improved by 20.8% and 21.1%, respectively, compared with the initial dataset.

For the model with K = (1, 2, 3) and L = (1, ..., 10) clusters, the row proportions $\pi$ and column proportions $\rho$ are shown in Table 2. The first row cluster, K = 1, covers 31.8% of all observations (crashes), and K = 2 and K = 3 contain 39.9% and 28.3% of accidents, respectively. The first column cluster, L = 1, consists of 10.3% of all variables included in the dataset.

Column Clusters (Attribute Clustering)

Table 1. Crash Type by Severity for Large Truck Involved Crashes

| Crash type | Property damage only: crashes | Property damage only: percentage | Injury: crashes | Injury: percentage | Fatality: crashes | Fatality: percentage | Total |
|---|---|---|---|---|---|---|---|
| 1. Bicycle | 124 | 17.8% | 510 | 73.4% | 61 | 8.8% | 695 |
| 2. Head-on | 1,875 | 57.3% | 1,172 | 35.8% | 223 | 6.8% | 3,270 |
| 3. Left entering | 3,352 | 62.0% | 1,972 | 36.5% | 86 | 1.6% | 5,410 |
| 4. Left leaving | 1,368 | 56.4% | 978 | 40.3% | 79 | 3.3% | 2,425 |
| 5. Left rear | 1,954 | 67.3% | 930 | 32.0% | 20 | 0.7% | 2,904 |
| 6. Off-roadway | 18,864 | 87.3% | 2,639 | 12.2% | 95 | 0.4% | 21,598 |
| 7. Opposing sideswipe | 2,502 | 82.8% | 500 | 16.5% | 21 | 0.7% | 3,023 |
| 8. Other | 17,954 | 81.6% | 3,918 | 17.8% | 128 | 0.6% | 22,000 |
| 9. Pedestrian | 148 | 12.7% | 816 | 70.2% | 199 | 17.1% | 1,163 |
| 10. Rear-end | 34,074 | 68.6% | 15,240 | 30.7% | 380 | 0.8% | 49,694 |
| 11. Right angle | 4,887 | 60.7% | 3,004 | 37.3% | 160 | 2.0% | 8,051 |
| 12. Right/left | 418 | 89.3% | 50 | 10.7% | 0 | 0.0% | 468 |
| 13. Right/through | 3,667 | 79.5% | 928 | 20.1% | 19 | 0.4% | 4,614 |
| 14. Right/U-turn | 29 | 85.3% | 5 | 14.7% | 0 | 0.0% | 34 |
| 15. Rollover | 1,628 | 49.2% | 1,643 | 49.7% | 38 | 1.1% | 3,309 |
| 16. Same-direction sideswipe | 33,799 | 88.4% | 4,393 | 11.5% | 28 | 0.1% | 38,220 |
| 17. Unknown | 3,970 | 81.8% | 856 | 17.6% | 27 | 0.6% | 4,853 |
| 18. Single-vehicle | 7,609 | 87.6% | 993 | 11.4% | 83 | 1.0% | 8,685 |
| 19. Parked-vehicle | 26,076 | 95.6% | 1,127 | 4.1% | 73 | 0.3% | 27,276 |
| 20. Backed into | 11,560 | 92.5% | 926 | 7.4% | 14 | 0.1% | 12,500 |
| 21. Animal | 679 | 91.8% | 59 | 8.0% | 2 | 0.3% | 740 |
| Total | 176,537 | 79.9% | 42,659 | 19.3% | 1,736 | 0.8% | 220,932 |

Detailed results for variable clustering can be found in Table 3. The results show distinct characteristics for each cluster. It can be seen that driver age, crash location, vehicle condition, weather condition, roadway type, vehicle maneuver, and driver action were involved in defining the clusters. It may not always be obvious which are the most significant features for each cluster, but the clustering presents a helpful way to identify potential associations between the variables. A cluster of columns is a subset of columns that exhibit similar behavior across the rows (crashes) (43). Column clustering identifies coexistence between variables and implies that all attributes within a cluster will either occur together or not occur for a specific group of crashes (row cluster).

For example, column cluster 2 represents collisions with non-fixed objects on the roadway; column cluster 4 mostly contains crashes in parking lots; column cluster 5 mostly involves weekend crashes while driving above the speed limit; cluster 7 may involve distracted drivers; and cluster 10 involves female drivers and those changing lanes. The degree of occurrence depends on the $\varepsilon_{kl}$ value, which will be discussed in the block clustering section.

Row Clusters (Crash Clustering)

Figure 2. Integrated complete likelihood values by number of blocks.

Figure 3. Pseudo-likelihood values by number of blocks.

Table 2. Row and Column Proportions for Block Cluster Model

| L | L = 1 | L = 2 | L = 3 | L = 4 | L = 5 | L = 6 | L = 7 | L = 8 | L = 9 | L = 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| ρ (%) | 10.3 | 3.4 | 5.2 | 6.9 | 8.6 | 6.9 | 19.0 | 19.0 | 12.1 | 8.6 |
| K | 1 | 2 | 3 | na | na | na | na | na | na | na |
| π (%) | 31.8 | 39.9 | 28.3 | na | na | na | na | na | na | na |

Note: na = not applicable.

A cluster of rows is a subset of rows that exhibit similar behavior across the columns (attributes) (43). The model identified three distinct row clusters. To further investigate the clusters, several variables, such as crash type, crash severity, crash time, manner of collision, most harmful events, and so forth, were evaluated to identify the latent patterns. Among all tested variables, the results revealed significant patterns only in conjunction with crash type. Table 4 below shows row clusters by crash type. Z-tests were conducted to examine the significance of the differences among the clusters.
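The text does not state which form of Z-test was applied. One common choice for comparing the share of a crash type between two clusters is the pooled two-proportion z-test, sketched below under that assumption. The function name and the example counts are hypothetical and are not values from Table 4.

```python
import math

def two_proportion_z(count1, n1, count2, n2):
    """Pooled two-proportion z-test; returns (z statistic, two-sided p-value)."""
    p1, p2 = count1 / n1, count2 / n2
    pooled = (count1 + count2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: crashes of one type out of all crashes in two clusters.
z_stat, p_value = two_proportion_z(count1=60, n1=100, count2=40, n2=100)
print(round(z_stat, 3), round(p_value, 4))
```

With three row clusters, such a test would be repeated for each pair of clusters and each crash type, so some adjustment for multiple comparisons may be worth considering.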

Results show that the first row cluster (K = 1) mostly contains rear-end and same-direction sideswipe crashes. These two types of crashes are very similar to each other in the sense that the involved vehicles are traveling in the same direction. For the second row cluster, K = 2, the most dominant crashes are angle, head-on, and opposing sideswipe crashes, which are again very similar to each other as the vehicles involved are traveling in opposing directions. Lastly, for K = 3, the most prevalent accidents are parked-vehicle, off-roadway, and single-vehicle crashes. This cluster mostly includes crashes such as rollovers and collisions with animals, pedestrians, bicyclists, fixed objects, or parked vehicles. The results suggest that this crash dataset can be generally categorized into same-direction crashes, opposing-direction crashes, and single-vehicle crashes.

The above analysis of row clusters and column clusters indicates that this clustering approach is able to identify relatively homogeneous groups within the dataset that are meaningful and reliable (with robust statistical foundations).

Block Clusters (Both Attribute and Crash Clustering)

To further investigate which groups of attributes are more likely to be associated with which groups of crashes, this subsection focuses on the individual blocks, defined by both row and column clustering. A block cluster defines a subset of rows (crashes) that exhibit similar behavior across a subset of columns (attributes), and vice versa (43).

Table 3. Block Clustering Model Result: Column Clusters

L   Variables
1   Driver age between 36 and 50 years old
    Crash at intersection
    Driver action, aggressive/careless maneuver
    Vehicle year before 2000
    Weather condition not clear
    Vehicle maneuver action, straight ahead
2   First harmful event, collision non-fixed object
    First harmful event location, on roadway
3   Vehicle at-fault, passenger car
    Total lane 4 or more
    Traffic way, two-way divided
4   First harmful event, collision with fixed object
    First harmful event location, off-roadway
    Road system identifier, parking lot
    Vehicle maneuver action, backing
5   Above posted speed
    Crash time, weekend
    Driver action, other contributing action
    Road surface condition, not dry
    Vehicle maneuver action, other
6   Crash within city limits
    Estimated speed [0,25] mph
    Unpaved or curb shoulder
    Vehicle at-fault, medium/heavy trucks
7   Estimated speed [76,100] mph
    Vision obstructed
    Traffic way, one-way
    Driver age more than 66 years old
    Driver action, illegal maneuver
    Driver distracted
    Road system identifier, U.S. Highway
    Vehicle maneuver action, stopped or slowing in traffic
    Vehicle with defect
    Vehicle at-fault, body type other
    Vehicle at-fault, pickup truck
8   Driver age between 16 and 20 years old
    Driver condition at time of crash, not normal
    Driving under the influence
    First harmful event, non-collision
    Road system identifier, local road
    Road system identifier, forest or private road
    Roadway alignment, curve
    Roadway grade, not level
    Vehicle at-fault, bus
    Vehicle at-fault, light trucks
    Vehicle at-fault, utility vehicle
9   Driver age between 21 and 35 years old
    Driver age between 51 and 65 years old
    Driver action, improper maneuver
    Driver action, no contributing action
    Light condition, not daylight
    Traffic way, two-way not divided
    Vehicle maneuver action, turn
10  Estimated speed [26,50] mph
    Estimated speed [51,75] mph
    At-fault driver gender, female
    Road system identifier, interstate
    Vehicle maneuver action, change lane

Table 4. Crash Type by Row-Cluster

K      Rear-end  Same-direction sideswipe  Park/off-roadway  Single-vehicle  Angle  Head-on  Opposing sideswipe
1      44.2%     56.6%                     13.1%             22.3%           23.3%  30.7%    24.1%
2      52.7%     40.3%                     28.3%             28.8%           71.1%  55.9%    62.0%
3      3.1%      3.2%                      58.7%             48.9%           5.6%   13.4%    13.9%
Total  100%      100%                      100%              100%            100%   100%     100%

Figure 4 depicts the original as well as the clustered data with K = (1, 2, 3) and L = (1, ..., 10). The dataset was segmented very well by block clustering. The figure shows 30 blocks (3 rows by 10 columns), with the green lines representing the boundaries. As noted earlier, each block has two features: α, which shows the center of the block (its most frequent binary value), and ε, which represents the dispersion, that is, the probability of a cell having a value different from the center. Therefore, ε can be used to assess how homogeneously the blocks are clustered. Table 5 shows the α and ε values. For instance, Figure 4 shows that the block with K = 1 and L = 2 mostly contains 1 (white squares) rather than 0 (black squares), and Table 5 correspondingly reports that the center of this block (K = 1, L = 2) is True (which means 1) and its dispersion is 4.8% (which means that 95.2% of the cells in this block have the value 1).
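The meaning of the block center and dispersion can be illustrated with a small empirical calculation. The sketch below takes a binary matrix with given row and column cluster labels, extracts one block, and reports its majority value and the share of cells that differ from it. It is a simplified illustration on hypothetical toy data: the function name is invented here, and the model in the paper estimates these parameters within the block clustering algorithm rather than by direct counting.

```python
def block_center_and_dispersion(X, row_labels, col_labels, k, l):
    """Empirical center (alpha) and dispersion (epsilon) of block (k, l).

    X is a binary matrix (list of rows); row_labels and col_labels give the
    cluster of each row and column.
    """
    cells = [
        X[i][j]
        for i, row_cluster in enumerate(row_labels) if row_cluster == k
        for j, col_cluster in enumerate(col_labels) if col_cluster == l
    ]
    share_of_ones = sum(cells) / len(cells)
    alpha = share_of_ones >= 0.5                      # most frequent value
    epsilon = 1 - share_of_ones if alpha else share_of_ones
    return alpha, epsilon


# Hypothetical toy data: 4 crashes (rows) by 4 binary attributes (columns).
X = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
row_labels = [1, 1, 2, 2]
col_labels = [1, 1, 2, 2]

alpha, epsilon = block_center_and_dispersion(X, row_labels, col_labels, k=1, l=1)
print(alpha, epsilon)  # block (1, 1) holds three 1s and one 0
```

In this toy case, block (1, 1) has center True and dispersion 0.25, so its degree of homogeneity (1 − ε) is 75%.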

To better understand the set of conditions that contribute to each type of crash, the investigation focuses on blocks that have an acceptably low ε and, therefore, a dominant α, as highlighted in Table 5. The idea is to investigate the significant subset of attributes relevant to each of the three subgroups of crashes. For instance, for same-direction crashes (K = 1), four blocks showed significant degrees of homogeneity (1 − ε). Blocks with (K, L) = {(1, 2), (1, 4), (1, 7), (1, 8)} showed more than 90% homogeneity. The associated subsets of attributes in these blocks (columns 2, 4, 7, and 8) can be obtained from the column clustering result (Table 3) to describe this type of crash. It shows that same-direction crashes are usually associated with the attributes in column 2 (true), but not with the attributes in columns 4, 7, and 8 (false). It should be noted that the average degree of homogeneity for the selected blocks is 95.72%, which implies the robustness of the model.
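The block selection described above can be checked against the dispersion values in Table 5. The sketch below assumes that a block is "selected" when its dispersion is below 10% (that is, more than 90% homogeneity), applied to all three row clusters. The threshold is inferred from the text and the variable names are illustrative.

```python
# Dispersion values (epsilon, in %) from Table 5, indexed by row cluster K;
# list positions correspond to column clusters L = 1..10.
epsilon = {
    1: [33.2, 4.8, 47.3, 1.0, 10.9, 32.8, 6.5, 4.3, 19.1, 31.3],
    2: [36.4, 0.0, 16.4, 1.9, 10.8, 35.1, 7.3, 1.3, 22.6, 8.1],
    3: [27.9, 44.2, 6.7, 43.0, 15.4, 32.6, 5.3, 2.9, 25.6, 5.6],
}

THRESHOLD = 10.0  # assumed cutoff: more than 90% homogeneity

# Keep (K, L, epsilon) for every block below the cutoff.
selected = [
    (k, l, eps)
    for k, row in epsilon.items()
    for l, eps in enumerate(row, start=1)
    if eps < THRESHOLD
]

# Degree of homogeneity is 100 - epsilon (in %).
mean_homogeneity = sum(100 - eps for _, _, eps in selected) / len(selected)

print([(k, l) for k, l, _ in selected])
print(round(mean_homogeneity, 2))
```

Under this assumed cutoff, 13 blocks are selected, including (1, 2), (1, 4), (1, 7), and (1, 8) for same-direction crashes, and their mean homogeneity rounds to 95.72%, consistent with the value reported in the text.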

The results indicate that for same-direction crashes, which include rear-end and same-direction sideswipe crashes, the first harmful event was most likely reported as a collision with a non-fixed object and happened on the roadway. These crashes were not likely to take place on one-way streets, in parking lots, on U.S. highways, or on local roads, and they were not likely to be caused by vision obstruction, backing, a stopped vehicle, or slowing in traffic. Trucks carrying hazardous materials were more likely to be involved in same-direction crashes. Work zones seemed to witness more same-direction crashes. The analysis also revealed that same-direction crashes were the most dangerous crashes, usually resulting in more than one fatality or more than two injuries, and that female drivers were more likely to be involved in same-direction crashes than in other crashes (although the number of such crashes was still lower than that for male drivers).

Figure 4. Original and clustered dataset using block clustering approach.

Table 5. Block Clustering Model Result

α value for each block

K/L  1      2      3      4      5      6      7      8      9      10
1    False  True   True   False  False  False  False  False  False  False
2    False  True   False  False  False  True   False  False  False  False
3    False  False  False  False  False  True   False  False  False  False

ε value for each block

K/L  1      2      3      4      5      6      7      8      9      10
1    33.2%  4.8%   47.3%  1.0%   10.9%  32.8%  6.5%   4.3%   19.1%  31.3%
2    36.4%  0.0%   16.4%  1.9%   10.8%  35.1%  7.3%   1.3%   22.6%  8.1%
3    27.9%  44.2%  6.7%   43.0%  15.4%  32.6%  5.3%   2.9%   25.6%  5.6%

For opposing-direction crashes, which include angle, head-on, and opposing sideswipe crashes, female and senior drivers were rarely cited as at fault, and the estimated speed was not likely to be above 25 mph. Similar to same-direction crashes, the first harmful event was most likely reported as a collision with a non-fixed object and happened on the roadway. These crashes were not likely to take place on one-way streets, in parking lots, on U.S. highways, or on local roads, and they were not likely to be caused by vision obstruction, backing, a stopped vehicle, or slowing in traffic. Pedestrians, bicycles, and mopeds were most commonly involved in opposing-direction crashes and least frequently involved in same-direction crashes. Installing raised medians that prevent opposing-direction vehicle crashes could therefore drastically reduce non-motorist crashes. School bus–related crashes were also more likely to be opposing-direction crashes. It therefore seems beneficial to inform school bus drivers about the high risk of this type of crash and to train them specifically to avoid angle, head-on, and opposing sideswipe crashes.

Finally, single-vehicle crashes and those involving parked vehicles or off-roadway events (including rollovers and collisions with animals, pedestrians, bicyclists, or fixed objects) were not likely to take place on two-way divided roadways with four or more lanes, and the estimated speed was not likely to be above 25 mph. By definition, these single-vehicle crashes involved trucks only. In these crashes, restraint systems (shoulder or lap belts) were more likely not to be used by motorists. Therefore, educating truck drivers on the benefits of restraint systems could help improve safety. A majority of the drivers held driver’s licenses issued outside Florida. This indicates the need to notify or educate non-resident truck drivers who are unfamiliar with Florida roads about the high risk of rollovers and of collisions with animals, pedestrians, bicyclists, fixed objects, or parked vehicles.

Interestingly, some variables were found to be common among all three types of crashes, which implies that these attributes are general among large truck crashes. Drivers were rarely found to be distracted, driving above 76 mph, or driving under the influence (DUI). Illegal maneuvers, vehicle defects, and driver vision obstruction were not significant causes of large truck crashes. U.S. highways were found to be the safest roadways for large trucks. Moreover, several types of variables were not found to be significant in any of the clusters, implying that they are not contributing factors in large truck crashes; these include driver age, vehicle age, weather condition, and type of shoulder.

The findings from this study using clustering methods are very similar to those of another study of heavy truck crashes in Florida (53). In that study, the dataset was initially segmented into seven categories (pedestrian, run-off-road/single-vehicle, same-direction, opposite-direction, change-traffic-way/turning, intersecting paths, and other) without using a clustering method. Its results showed that same-direction and opposite-direction crashes had distinct patterns, whereas the other five categories revealed a similar pattern and were not significantly different. This agrees with the findings of the present study and suggests that the proposed block clustering method is able to produce reliable and meaningful results.

Conclusion

This study presents an effort to apply an advanced high-dimensional clustering approach to large truck crash analysis. A block clustering method was applied to more than 220,000 crash records with nearly 200 attributes. The analysis showed promising results in segmenting the large heterogeneous dataset into meaningful subgroups that provide additional insights for crash analysis.

Attribute clustering showed distinct characteristics for each cluster; driver age, crash location, vehicle condition, weather condition, roadway type, vehicle maneuver, and driver action were involved in defining the clusters. Column clustering provides comprehensive insights for crash studies because the approach considers groups of attributes that are likely to occur together rather than analyzing attributes individually.

Crash clustering revealed significant differences among the clusters and suggested that this crash dataset could be partitioned into same-direction (including rear-end and same-direction sideswipe), opposing-direction (including angle, head-on, and opposing sideswipe), and single-vehicle (including rollovers and collisions with animals, pedestrians, bicyclists, fixed objects, or parked vehicles) crashes.

Individual blocks, defined by both row and column clustering, were further investigated to better understand the contributing set of conditions that lead to large truck crashes. The average degree of homogeneity for the selected blocks is 95.72%, which implies the robustness of the model. Major features for each of the three major types of crashes were analyzed, which may provide insights for developing potential countermeasures for specific segments. In particular, raised medians to target non-motorist crashes, notifying school bus drivers about the high risk of opposing-direction crashes, and programs targeting non-Florida truckers may help improve safety.

The suggested clustering approach can be used as a preanalysis method for heterogeneous crash data. The block clustering approach can lead to more robust models for segmenting crash data for further analysis. In this paper, the homogeneity improved significantly, as the ICL and pseudo-likelihood values increased by 20.8% and 21.1%, respectively, in the optimized dataset. The findings of the clustering method were confirmed by another study (which employed a conventional segmenting approach) conducted in the same area. This shows the potential of clustering methods to produce meaningful results.

Although the dataset used was high-dimensional and contained many crashes, it was limited to the state of Florida and had limited attributes. Researchers are encouraged to apply the methodology to more comprehensive datasets to obtain more general results. The method can also be incorporated to improve the accuracy of truck crash prediction models, as it provides robust and statistically significant criteria to segment the dataset.

Acknowledgments

This work is funded by the research office of the Florida

Department of Transportation (BDV29 977-31). Data were

extracted from the Signal Four Analytics database provided by

Ilir Bejleri and Liang Zhai at the University of Florida.

Author Contributions

The authors confirm contribution to the paper as follows: study

conception and design: AR and XJ; data processing: HA, AR,

and GA; analysis and interpretation of results: AR and XJ;

draft manuscript preparation: AR and XJ. All authors reviewed

the results and approved the final version of the manuscript.

References

1. Large Truck and Bus Crash Facts 2016. Analysis Division

Federal Motor Carrier Safety Administration. FMCSA-

RRA-17-016. U.S. Department of Transportation.

Washington, D.C., 2018.

2. Large Truck and Bus Crash Facts 2014. Analysis Division

Federal Motor Carrier Safety Administration. FMCSA-

RRA-16-001. U.S. Department of Transportation.

Washington, D.C., 2016.

3. Haleem, K., and A. Gan. Effect of Driver’s Age and Side

of Impact on Crash Severity along Urban Freeways: A

Mixed Logit Approach. Journal of Safety Research, Vol.

46, 2012, pp. 67–76.

4. Anastasopoulos, P., and F. Mannering. An Empirical

Assessment of Fixed and Random Parameter Logit

Models using Crash and Non-Crash-Specific Injury Data.

Accident Analysis and Prevention, Vol. 43, No. 3, 2011,

pp. 1140–1147.

5. Iranitalab, A., and A. Khattak. Comparison of Four Sta-

tistical and Machine Learning Methods for Crash Severity

Prediction. Accident Analysis and Prevention, Vol. 108,

Supplement C, 2017, pp. 27–36.

6. Shams, K., X. Jin, R. Fitzgerald, H. Asgari, and M. S.

Hossan. Value of Reliability for Road Freight Transporta-

tion: Evidence from a Stated Preference Survey in Florida.

Transportation Research Record: Journal of the Transporta-

tion Research Board, 2017. 2610: 35–43.

7. Jin, X., M. S. Hossan, H. Asgari, and K. Shams. Incorpor-

ating Attitudinal Aspects in Roadway Pricing Analysis.

Transport Policy, Vol. 62, 2018, pp. 38–47.

8. Depaire, B., G. Wets, and K. Vanhoof. Traffic Accident

Segmentation by Means of Latent Class Clustering.

Accident Analysis and Prevention, Vol. 40, 2008,

pp. 1257–1266.

9. Sasidharan, L., K. Wu, and M. Menendez. Exploring

the Application of Latent Class Cluster Analysis for Inves-

tigating Pedestrian Crash Injury Severities in Switzerland.

Accident Analysis and Prevention, Vol. 85, 2015,

pp. 219–228.

10. Valent, F., F. Schiava, C. Savonitto, T. Gallo, S. Brusa-

ferro, and F. Barbone. Risk Factors for Fatal Road Traffic

Accidents in Udine, Italy. Accident Analysis and Preven-

tion, Vol. 34, No. 1, 2002, pp. 71–84.

11. Yau, K. K. W. Risk Factors Affecting the Severity of Sin-

gle Vehicle Traffic Accidents in Hong Kong. Accident Anal-

ysis and Prevention, Vol. 36, No. 3, 2004, pp. 333–340.

12. Ulfarsson, G. F., and F. L. Mannering. Difference in Male

and Female Injury Severities in Sport-Utility Vehicle, Mini-

van, Pickup and Passenger Car. Accident Analysis and Pre-

vention, Vol. 36, No. 2, 2004, pp. 135–147.

13. Islam, S., and F. L. Mannering. Driver Aging and its Effect

on Male and Female Single-Vehicle Accident Injuries:

Some Additional Evidence. Accident Analysis and Preven-

tion, Vol. 37, No. 2, 2006, pp. 267–276.

14. Moore, D., W. Schneider, P. Savolainen, and M. Farzaneh.

Mixed Logit Analysis of Bicyclist Injury Severity Resulting

from Motor Vehicle Crashes at Inter-Section and Non-

Intersection Location. Accident Analysis and Prevention,

Vol. 43, 2011, pp. 621–630.

15. Shaheed, M. S., K. Gkritza, W. Zhang, and Z. Hans. A

Mixed Logit Analysis of Two-Vehicle Crash Severities

Involving a Motorcycle. Accident Analysis and Prevention,

Vol. 61, 2013, pp. 119–128.

16. Zeng, Z., W. Zhu, R. Ke, J. Ash, Y. Wang, J. Xu, and X.

Xu. A Generalized Nonlinear Model-Based Mixed Multinomial Logit Approach for Crash Data Analysis. Accident

Analysis and Prevention, Vol. 99, 2017, pp. 51–65.

17. Milton, J., V. Shankar, and F. L. Mannering. Highway

Accident Severities and the Mixed Logit Model: An

Exploratory Empirical Analysis. Accident Analysis and

Prevention, Vol. 40, No. 1, 2008, pp. 260–266.


18. Wu, Q., F. Chen, G. Zhang, X. C. Liu, H. Wang, and S.

M. Bogus. Mixed Logit Model-Based Driver Injury Sever-

ity Investigations in Single- and Multi-Vehicle Crashes on

Rural Two-Lane Highways. Accident Analysis and Preven-

tion, Vol. 72, 2014, pp. 105–115.

19. Cerwick, D., K. Gkritza, and M. Shaheed. A Comparison

of the Mixed Logit and Latent Class Methods for Crash

Severity Analysis. Analytic Methods in Accident Research,

Vol. 3–4, 2014, pp. 11–27.

20. Steinbach, M., L. Ertöz, and V. Kumar. The Challenges of Clustering High Dimensional Data. In New Directions in Statistical Physics (L. T. Wille, ed.), Springer Verlag, Berlin, Heidelberg, Germany, 2004, pp. 273–309.

21. Parsons, L. Subspace Clustering for High Dimensional

Data: A Review. ACM SIGKDD Explorations Newsletter:

Special Issue on Learning from Imbalanced Datasets, Vol.

6, No. 1, 2004, pp. 90–105.

22. Jain, A. K., M. N. Murty, and P. J. Flynn. Data Clustering:

A Review. ACM Computing Surveys (CSUR), Vol. 31, No.

3, 1999, pp. 264–323.

23. Kitali, A. E., E. Kidando, P. Martz, P. Alluri, T. Sando,

R. Moses, and R. Lentz. Evaluating Factors Influencing

the Severity of Three-Plus Multiple-Vehicle Crashes using

Real-Time Traffic Data. Transportation Research Record:

Journal of the Transportation Research Board, 2018.

2672(38): 128–137.

24. Hadi, M., Y. Xiao, T. Wang, S. F. Qom, L. Azizi, J. Jia, A.

Massahi, and M. S. Iqbal. Framework for Multi-Resolution

Analyses of Advanced Traffic Management Strategies. Tech-

nical Report. Lehman Center of Transportation Research

Florida International University, Miami, FL, 2016.

25. Ghasemzadeh, A., and M. M. Ahmed. A Tree-Based

Ordered Probit Approach to Identify Factors Affecting

Work Zone Weather-Related Crashes Severity in North

Carolina using the Highway Safety Information System

Dataset. Presented at 96th Annual Meeting of the Trans-

portation Research Board, Washington, D.C., 2017.

26. Haghighi, N., X. C. Liu, G. Zhang, and R. J. Porter.

Impact of Roadway Geometric Features on Crash Severity

on Rural Two-Lane Highways. Accident Analysis and Pre-

vention, Vol. 111, 2018, pp. 34–42.

27. Najaf, P., V. R. Duddu, and S. S. Pulugurtha. Predictabil-

ity and Interpretability of Hybrid Link-Level Crash

Frequency Models for Urban Arterials Compared to Clus-

ter-Based and General Negative Binomial Regression

Models. International Journal of Injury Control and Safety

Promotion, Vol. 25, No. 1, 2017, pp. 3–13.

28. Kamrani, M., A. J. Khattak, and T. Li. A Framework to

Process and Analyze Driver, Vehicle and Road Infrastruc-

ture Volatilities in Real-Time. Presented at 97th Annual

Meeting of the Transportation Research Board, Washing-

ton, D.C., 2018.

29. Motamedi, S., and J. H. Wang. Older Adult Drivers’ Chal-

lenges and In-Vehicle Technology Acceptance. Interna-

tional Journal for Traffic and Transport Engineering, Vol. 7,

No. 4, 2017, pp. 498–515.

30. Han, J., M. Kamber, and J. Pei. Data Mining:

Concepts and Techniques. Morgan Kaufmann, Waltham,

MA, 2001, pp. 335–393.

31. Aggarwal, C. C., and C. K. Reddy. Data Clustering Algorithms

and Applications. Chapman and Hall/CRC, Boca Raton, FL,

2013.

32. MacQueen, J. Some Methods for Classification and Analy-

sis of Multivariate Observations. In Proc., 5th Berkeley

Symposium on Mathematical Statistics and Probability.

University of California Press, Berkeley, CA, 1967,

pp. 281–297.

33. Lloyd, S. Least Squares Quantization in PCM. IEEE

Transactions on Information Theory, Vol. 28, No. 2, 1982,

pp. 129–137.

34. Anderson, T. K. Kernel Density Estimation and K-Means

Clustering to Profile Road Accident Hotspots. Accident

Analysis and Prevention, Vol. 41, 2009, pp. 359–364.

35. Mauro, R., M. D. Luca, and G. Dell’Acqua. Using a K-

Means Clustering Algorithm to Examine Patterns of

Vehicle Crashes in Before-After Analysis. Modern Applied

Science, Vol. 7, 2013, pp. 11–19.

36. Zhang, C., J. N. Ivan, and T. Jonsson. Collision Type Cate-

gorization Based on Crash Causality and Severity Analysis.

Presented at 86th Annual Meeting of the Transportation

Research Board, Washington, D.C., 2007.

37. Nitsche, P. Pre-Crash Scenarios at Road Junctions: A Clus-

tering Method for Car Crash Data. Accident Analysis and

Prevention, Vol. 107, 2017, pp. 137–151.

38. Brown, D. Efficient Functional Clustering of Protein

Sequences using the Dirichlet Process. Bioinformatics, Vol.

24, No. 16, 2008, pp. 1765–1771.

39. Berkhin, P. A Survey of Clustering Data Mining Techniques. In Grouping Multidimensional Data (J. Kogan, C. Nicholas, and M. Teboulle, eds.), Springer, Berlin, Heidelberg, Germany, 2006, pp. 25–71.

40. Vermunt, J. K., and J. Magidson. Latent Class Cluster

Analysis. Applied Latent Class Analysis. Cambridge Uni-

versity Press, Cambridge, UK, 2002, pp. 89–106.

41. Mohamed, M. G., N. Saunier, L. F. Miranda-Moreno,

and S. V. Ukkusuri. A Clustering Regression Approach: A

Comprehensive Injury Severity Analysis of Pedestrian-

Vehicle Crashes in New York, US and Montreal, Canada.

Safety Science, Vol. 54, 2013, pp. 27–37.

42. Govaert, G., and M. Nadif. Block Clustering with Ber-

noulli Mixture Models: Comparison of Different

Approaches. Computational Statistics and Data Analysis,

Vol. 52, No. 6, 2008, pp. 3233–3245.

43. Madeira, S. C., and A. L. Oliveira. Biclustering Algorithms

for Biological Data Analysis: A Survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 1,

No. 1, 2004, pp. 24–45.

44. Govaert, G., and M. Nadif. Co-Clustering: Models, Algorithms

and Applications, 1st ed. Wiley-IEEE Press, Hoboken, NJ,

2013.

45. Dhillon, I. S. Co-Clustering Documents and Words using

Bipartite Spectral Graph Partitioning. Proceedings 7th

ACM SIGKDD International Conference on Knowledge


Discovery and Data Mining, KDD ’01, San Francisco, CA,

2001, pp. 269–274.

46. Wang, F., S. Lin, and P. S. Yu. Collaborative Co-Cluster-

ing across Multiple Social Media. Proc., 17th IEEE Inter-

national Conference on Mobile Data Management, IEEE,

Porto, Portugal, 2016.

47. Bhatia, P., S. Iovleff, and G. Govaert. Blockcluster: An R

Package for Model Based Co-Clustering. Journal of Statis-

tical Software, Vol. VV, No. II, 2014.

48. Hathaway, R. Another Interpretation of the EM Algo-

rithm for Mixture Distributions. Statistics and Probability

Letters, Vol. 4, No. 2, 1986, pp. 53–56.

49. Neal, R., and G. Hinton. A View of the EM Algorithm

That Justifies Incremental, Sparse, and Other Variants.

Learning in Graphical Models, 1998, pp. 355–368.

50. The GeoPlan Center. Signal Four Analytics. Department

of Urban & Regional Planning, University of Florida,

Gainesville, FL. https://s4.geoplan.ufl.edu/.

51. Biernacki, C., G. Celeux, and G. Govaert. Assessing a Mix-

ture Model for Clustering with the Integrated Classification

Likelihood. RR-3521. INRIA, 1998. https://hal.inria.fr/

inria-00073163/document.

52. Bertoletti, M., N. Friel, and R. Rastelli. Choosing the

Number of Clusters in a Finite Mixture Model using an

Exact Integrated Completed Likelihood Criterion.

METRON, Vol. 73, No. 2, 2015, pp. 177–199.

53. Spainhour, L. K., D. Brill, J. O. Sobanjo, J. Wekezer, and

P. V. Mtenga. Evaluation of Traffic Crash Fatality Causes

and Effects: A Study of Fatal Traffic Crashes in Florida

from 1998–2000 Focusing on Heavy Truck Crashes. Final

Report. Project No. BD-050. Florida Department of

Transportation, Tallahassee, FL, 2005.

The Standing Committee on Artificial Intelligence and

Advanced Computing Applications (ABJ70) peer-reviewed this

paper (19-02466).

The opinions, findings and conclusions expressed in this paper are

those of the authors and not necessarily those of the Florida

Department of Transportation or the U.S. Department of

Transportation.

Rahimi et al 13

Investigation of Factors Associated with Heavy Vehicle Crashes in Iran (Tehran–Qazvin Freeway)

Article

Full-text available

Jul 2023

With the growing demand for transportation and cargo between cities, the proportion of heavy vehicles in freeway traffic has been increasing in Iran and worldwide during the past decade. The impact of heavy vehicles on crash severity has long been a concern in the crash analysis literature for the prevalence of crashes in freeway traffic. The purpose of this study is to investigate the contribution of heavy vehicles to freeway crashes and uncover other causal factors. Using the comprehensive crash and traffic data from the Qazvin–Tehran freeway in Iran, from 2013 to 2018, 1350 crashes involving heavy vehicles were extracted regarding the weather conditions, weekday, main cause of the crash, driver gender, and culprit side. Considering crash severity calculation, the applied coefficient weights in this study for a person were considered as 3 for an accident resulting in injury and 5 for a fatal crash. A binary logit model was estimated using the data to determine if there was a significant correlation between recognized factors and the likelihood of the crash. The logit modeling results clearly illustrate important relationships between various risk factors and occupant injury, in which heavy vehicles were recognized as one of the most important factors in this study. Other variables associated with crash severity were weather conditions and driver attention. Results indicate that the number of crashes is simultaneously dependent on the total vehicle volume and average speed of heavy vehicles.

Statistical and Spatial Analysis of Large Truck Crashes in Texas (2017–2021)

Article

Full-text available

Mar 2024

Freight transportation, dominated by trucks, is an integral part of trade and production in the USA. Given the prevalence of large truck crashes, a comprehensive investigation is imperative to ascertain the underlying causes. This study analyzed 2017–2021 Texas crash data to identify factors impacting large truck crash rates and injury severity and to locate high-risk zones for severe incidents. Logistic regression models and bivariate analysis were utilized to assess the impacts of various crash-related variables individually and collectively. Heat maps and hotspot analysis were employed to pinpoint areas with a high frequency of both minor and severe large truck crashes. The findings of the investigation highlighted night-time no-passing zones and marked lanes as primary road traffic control, highway or FM roads, a higher posted road speed limit, dark lighting conditions, male and older drivers, and curved road alignment as prominent contributing factors to large truck crashes. Furthermore, in cases where the large truck driver was determined not to be at fault, the likelihood of severe collisions significantly increased. The study’s findings urge policymakers to prioritize infrastructure improvements like dual left-turn lanes and extended exit ramps while advocating for wider adoption of safety technologies like lane departure warnings and autonomous emergency braking. Additionally, public awareness campaigns aimed at reducing distracted driving and drunk driving, particularly among truck drivers, could significantly reduce crashes. By implementing these targeted solutions, we can create safer roads for everyone in Texas.

State-of-the-Art Research in the Area of Artificial Intelligence with Specific Consideration to Civil Infrastructure, Construction Engineering and Management, and Safety

Chapter

Sep 2022

With the prevalence of large amount of information, there is an increasing need to digitalize and automate the associated data handling techniques. This is particularly important to enable reliable, effective, efficient, quick, and optimal decision-making processes. Coupled with the increase in computational power and the wide availability of cloud‐driven capabilities, artificial intelligence has offered unprecedented opportunities to retrieve and reveal remarkable patterns, trends, relationships, and knowledge from big data sets. For many decades, artificial intelligence has been applied to address challenges and provide solutions in different research areas and application domains. In relation to that, the goal of this chapter is to offer an introduction to the fundamental and essential methods, principles, and applications of artificial intelligence. To this end, this chapter focuses on the state-of-the-art research in the area of artificial intelligence with specific consideration to the following three domains: (1) Civil Infrastructure Applications (which include (i) bridges, pavements, and transportation systems; (ii) underground facilities; and, (iii) water systems, resilience, and electrical/power systems; (2) Construction Engineering and Management Applications (which include (i)construction-related activities; (ii) planning; and, (iii) facility management), and (3) Safety Applications (which include (i) construction safety management; (ii) accident analysis, and (iii) fire safety).

Delineation of isotopic and hydrochemical evolution of karstic aquifers with different cluster-based (HCA, KM, FCM and GKM) methods

Article

Jun 2022
J HYDROL

In this work, a combination of isotopic and hydrogeochemical data of a karstic region was clustered with four distinct clustering analysis (CA) methods to study water evolution in a vulnerable karstic region to improve protection, sustainability, and enhanced water resource management. Four CA methods, including hierarchical cluster analysis (HCA), K-means (KM), and fuzzy logic CA methods, fuzzy C-mean (FCM), and genetic K‐means (GKM), have been utilized to analyze hydrochemical, chemical, and isotopic datasets, including dissolved inorganic carbon (DIC), δ¹³C-DIC, δ¹⁸O, and δ²H datasets of water resources of Paveh-Javanrud (PV-JR) karstic region, located at the western border of Iran and Iraq countries. The utilized dataset contains 34 water samples with varied origination to evaluate the performance of each model and find the best method based on a meaningful categorization of geological, hydrogeochemical, and isotopic characteristics. Finally, the best model results were matched graphically with developed geospatial graphs to visualize the correlation between the region's water resources. Accordingly, the FCM and GKM methods represent the same, yet meaningful results and have the best performance among the four methods. It was also identified that the PV-JR water resources could be generally categorized into five distinct clusters, including FC1 to FC5 and GK1 to GK5, of which two clusters that have mixing, two clusters with solo-origination and no sign of mixing, and finally, a seasonal spring which is categorized as a separate cluster. 
Potentially, studying water resources via theoretical methods combined with considering isotope hydrology is of particular interest since solving the environmental issues related to karstic regions and their water resource management are shared concerns in most arid and semi-arid countries, especially in the Middle East as this study, thus could lay a basis for the following scientific attempts involving hydrogeochemical studies and advanced statistical analysis.

Hazardous traffic scenarios for motorcyclists in Indonesia: a comprehensive insight from police accident data and self-reports

Article

Apr 2024
Int J Inj Contr Saf Promot

Fatal Crash Occurrence Prediction and Pattern Evaluation by Applying Machine Learning Techniques

Article

Feb 2024
Open Transport J

Background: Highway safety remains a significant issue, with road crashes being a leading cause of fatalities and injuries. While several studies have been conducted on crash severity, few have analyzed and predicted specific types of crashes, such as fatal crashes. Identifying the key factors associated with fatal crashes and predicting their occurrence can help develop effective preventative measures. Objective: This study intended to develop cluster analysis and ML-based models using crash data to extract the prominent factors behind fatal crash occurrences and analyze the inherent pattern of variables contributing to fatal crashes. Methods: Several branches and categories of supervised ML models have been implemented for fatality prediction and their results have been compared. SHAP analysis was conducted using the ML model to explore the contributing factors of fatal crashes. Additionally, the underlying hidden patterns of fatal crashes have been evaluated using K-means clustering, and specific fatal crash scenarios have been extracted. Results: The deep neural networks model achieved 85% accuracy in predicting fatal crashes in Kansas. Factors such as speed limits, nighttime, darker road conditions, two-lane highways, highway interchange areas, motorcycle and tractor-trailer involvement, and head-on collisions were found to be influential. Moreover, the clusters were able to discern certain scenarios of fatal crashes. Conclusion: The study can provide a clear image of the important factors related to fatal crashes, which can be utilized to create new safety protocols and countermeasures to reduce fatal crashes. The results from cluster analysis can provide transportation professionals with representative scenarios, which will help in identifying potential fatal crash conditions.

Injury severity of single-vehicle large-truck crashes: accounting for heterogeneity

Article

Jul 2023
Int J Inj Contr Saf Promot

This research examines the injury severity of single-vehicle large-truck crashes in Florida while exploring the role of heterogeneity. A random parameter ordered logit (RPOL) model was applied to 27,505 single-vehicle large-truck crashes from 2007 to 2016 in Florida, and the contributing factors were identified. Random parameters and interaction effects were introduced to the model to determine the heterogeneity and its potential sources. The results suggested that driving speed of 76-120 mph and defective tires were the most influential factors in crash injury severity, increasing the probability of severe crashes. Regarding truckers' attributes, asleep or fatigued conditions and driving under the influence were correlated with a higher possibility of severe crashes. Interestingly, the results showed that truckers from outside the state of Florida were less likely to cause severe single-vehicle large-truck crashes compared to their Floridian counterparts. Y-intersections were also found to be high-risk locations for single-vehicle large-truck crashes, leading to more severe outcomes. Regarding heterogeneity, the results indicated that the impacts of driving speed (26-50 mph) and light condition (dark-not lighted) varied significantly among the observations, and these variations could be attributed to driver action, vision obstruction, driver distraction, roadway type, and roadway alignment.

Crash Contributing Factors and Patterns Associated with Fatal Truck-involved Crashes in Bangladesh: Findings from Text Mining Approach

Conference Paper

Jan 2023

Despite extensive research on traffic injury severities, relatively little is known about the factors contributing to truck-involved crashes in developing countries, especially in the context of Bangladesh. Due to the unavailability of authentic crash data sources, this study collected data from alternative sources such as online English news media reports. The current study prepared a database of 144 truck-involved fatal crash reports during the period of twelve months (January 2021 to December 2021). The crash reports contain a bag of 15,300 words. Several state-of-the-art text mining tools were utilized to identify crash patterns, including word cloud analysis, word frequency analysis, word co-occurrence network analysis, rapid automatic keyword extraction, and topic modeling. The analysis revealed several important crash contributing factors such as the type of vehicle involved (auto-rickshaw, bus, van, motorcycle), manner of collision (head-on), time of the day (morning, night), driver behavior (speeding, overtaking, wrong-way driving), and environmental factors (dense fog). In addition, ‘coming from opposite direction’ and ‘head-on collision’ are two important sequences of events in truck-involved crashes. Truck drivers are also involved in crashes with trains at the rail crossing. The findings of this research can assist policymakers in identifying crash avoidance strategies to lower truck-related crashes in Bangladesh.

A Data Mining Approach for Traffic Accidents, Pattern Extraction and Test Scenario Generation for Autonomous Vehicles

Article

Oct 2022

To effectively fight against traffic accidents, it is of great importance to analyse and understand the conditions that are linked with accidents. Such an analysis can serve as the basis to (i) develop reactive measures by finding the links between the pre-accident conditions and (ii) devise proactive strategies that will prevent the occurrence of accidents by making the vehicles safer. This paper contributes to the advancement of both approaches. For (i), one needs to identify the patterns in accidents. For (ii), the introduction of Connected and Automated Vehicles (CAVs) is a promising solution. However, CAVs need to be tested under numerous traffic scenarios to prove their safety before their deployment on public roads. This creates a great demand for high-quality test scenarios for CAVs. This paper achieves two goals. First, it analyses past traffic accidents (UK’s STATS19 database) to identify trends in the heterogeneous accident data and unravel the relationships between pre-accident conditions. This is done using a clustering algorithm (ROCK). Seven distinct large clusters emerge as a result. Each of these clusters is then further analysed for its meaning using frequency analysis and geometric analysis. Second, the paper underpins the proactive route (ii) by systematically developing, using the information in each cluster, test-case scenarios for CAVs which reflect the risk-prone conditions of the respective clusters. This is done using a data mining method (Market Basket algorithm) and further geometric interpretation of clusters. In this way, explicit scenarios are developed that carry the characteristics of the clusters they come from.

Injury Severity Analysis for Large Truck-Involved Crashes: Accounting for Heterogeneity

Article

May 2022

This study explores the crash injury severity of large truck-involved crashes, where the truck driver was identified as the at-fault driver. The paper focuses on vehicle-in-motion crashes that occurred on Florida’s state highways between 2007 and 2016. A random parameter ordered logit (RPOL) model was developed to identify random parameters and interaction effects. Results indicated that not using restraint systems, running a red light, wrong-way driving, failing to yield the right of way, tire or brake defects, and dark conditions had positive associations with higher levels of crash injury severity. The random variables—straight alignment, paved shoulders, and unpaved shoulders—showed significant random effects among the observations. For straight alignment, running red lights, following too closely, vision obstruction caused by fixed objects, and vision obstruction caused by fog were the sources of heterogeneity. Unpaved shoulders, running red lights, wrong-way driving, and the presence of parked or stopped vehicles were found as interaction effects. Results showed that accounting for heterogeneity and interaction effects significantly improved the goodness of fit of the model. This study provides more comprehensive knowledge of the influencing factors of large truck crashes by considering the role of heterogeneity and its potential sources in crash injury severity.

DATA CLUSTERING Algorithms and Applications

Book

Aug 2013

OLDER ADULT DRIVERS' CHALLENGES AND IN-VEHICLE TECHNOLOGY ACCEPTANCE

Article

Oct 2017

Driving is an essential activity in living a fulfilling lifestyle. Older adults, like the rest of the population, require a means of transportation to participate in important lifestyle choices; however, declines in their sensory, motor, perceptual, and cognitive abilities limit their driving capabilities. These limitations motivated this study to investigate older adult drivers' driving challenges by conducting a questionnaire. The in-vehicle technologies which mitigate these challenges were identified. In this study, the acceptance of the identified technologies is explored by conducting a second questionnaire. A four dimensional model which included perceived usefulness, perceived ease of use, perceived safety, and perceived anxiety is considered in the second questionnaire. In total, 250 older adult drivers participated in these questionnaires. The responses obtained from both questionnaires identified potential challenges that they were facing and whether they intend to use the identified in-vehicle technologies. Having more information about the acceptance of these technologies can help engineers better understand the factors that make technologies useful to older adult drivers, and thus improve their driving safety.

A Tree-Based Ordered Probit Approach to Identify Factors Affecting Work Zone Weather-Related Crashes Severity in North Carolina Using the Highway Safety Information System Dataset

Conference Paper

Jan 2017

Work zone crashes are still on the rise due to the aging of US roads and the increase in traffic demand. Investigation of crash characteristics and determining contributing factors in work zones is one of the most important issues in many traffic safety studies. The effect of work zones on traffic safety can be exacerbated by weather conditions. A sudden reduction in visibility may intensify the severity of work zone crashes. Although many studies have investigated work zone crashes, research that investigates the impact of adverse weather conditions on work zone crashes is lacking. In this study, the Highway Safety Information System database for North Carolina was used to identify the characteristics of work zone weather-related crashes. A Tree-based Ordered Probit, a relatively recent and promising combination of nonparametric machine learning (decision tree) and classical statistics (ordered probit) techniques, was utilized to gain a better understanding of the effects of various factors on work zone crash-related injury and crash severity in adverse weather conditions. The results showed that the Tree-based Ordered Probit model performs better than the conventional Ordered Probit model. Lighting conditions, number of vehicles involved in a crash, road characteristics, number of occupants, land use, presence of traffic control devices, and two types of crashes (sideswipe and rear-end crashes) were identified as the most important factors in work zone weather-related crash severity.

Value of Reliability for Road Freight Transportation: Evidence from a Stated Preference Survey in Florida

Article

Jan 2017

This paper presents the findings of a study recently conducted in Florida to quantify freight users’ willingness to pay (WTP) for the improvement of transportation-related attributes, particularly reliability. A stated preference survey was developed and administered between January and May 2016. The survey collected responses from 150 shippers, carriers, and forwarders. Econometric models, including mixed and multinomial logit models, were developed to estimate the users’ WTP and to investigate the presence of user heterogeneity. The value of time and the value of reliability were estimated separately for the various user groups. The results indicated that carriers showed the lowest WTP when their WTP was compared with that of other freight users. Shippers without transportation (that is, shippers who contracted out their shipping) exhibited more interest in travel time savings, whereas shippers with transportation showed more sensitivity to reliability. Preference heterogeneity was also explored by commodity group and product type. The results confirmed the findings from past studies and showed significant differences in WTP values when the sources of heterogeneity were considered. This paper contributes to the literature by providing empirical evidence of the quantification of the value of reliability in road freight transportation and the impacts of user heterogeneity. The study results will help advance understanding of the impacts of the performance of transportation systems on the freight industry.

Evaluating Factors Influencing the Severity of Three-Plus Multiple-Vehicle Crashes using Real-Time Traffic Data

Article

Jul 2018

Multiple-vehicle crashes involving at least two vehicles constitute over 70% of fatal and injury crashes in the U.S. Moreover, multiple-vehicle crashes involving three or more vehicles (3+) are usually more severe compared with the crashes involving only two vehicles. This study focuses on developing 3+ multiple-vehicle crash severity models for a freeway section using real-time traffic data and crash data for the years 2014–2016. The study corridor is a 111-mile section on I-4 in Orlando, Florida. Crash injury severity was classified as a binary outcome (fatal/severe injury and minor/no injury crashes). To identify a reliable relationship between the 3+ severe multiple-vehicle crashes and the identified explanatory variables, a binary probit model with a Dirichlet random effect parameter was used. More specifically, the Dirichlet random effect model was introduced to account for unobserved heterogeneity in the crash data. The probit model was implemented using a Bayesian framework and the ratios of the Monte Carlo errors were monitored to achieve parameter estimation convergence. The following variables were found significant at the 95% Bayesian credible interval: logarithm of average vehicle speed, logarithm of average equivalent 10-minute hourly volume, alcohol involvement, lighting condition, and number of vehicles involved (3, or >3) in multiple-vehicle crashes. Further analysis involved analyzing the posterior probability distributions of these significant variables. The study findings can be used to associate certain traffic conditions with severe injury crashes involving 3+ multiple vehicles, and can help develop effective crash injury reduction strategies based on real-time traffic data.

Impact of roadway geometric features on crash severity on rural two-lane highways

Article

Nov 2017

Pre-crash scenarios at road junctions: A clustering method for car crash data

Article

Aug 2017

Given the recent advancements in autonomous driving functions, one of the main challenges is safe and efficient operation in complex traffic situations such as road junctions. There is a need for comprehensive testing, either in virtual simulation environments or on real-world test tracks. This paper presents a novel data analysis method, including the preparation, analysis, and visualization of car crash data, to identify the critical pre-crash scenarios at T- and four-legged junctions as a basis for testing the safety of automated driving systems. The presented method employs k-medoids to cluster historical junction crash data into distinct partitions and then applies the association rules algorithm to each cluster to specify the driving scenarios in more detail. The dataset used consists of 1056 junction crashes in the UK, which were exported from the in-depth "On-the-Spot" database. The study resulted in thirteen crash clusters for T-junctions and six crash clusters for crossroads. Association rules revealed common crash characteristics, which were the basis for the scenario descriptions. The results support existing findings on road junction accidents and provide benchmark situations for safety performance tests in order to reduce the number of possible parameter combinations.
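
The clustering step described in this abstract can be sketched as follows. This is a minimal illustration of k-medoids on categorical crash records, not the paper's procedure: it uses the simple alternating (assign, then update) variant rather than PAM, Hamming distance as the dissimilarity, and a deterministic initialization from the first `k` records. The function names, the field layout (junction type, light, surface), and the records themselves are hypothetical.

```python
def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def assign(records, medoids):
    """Label each record with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda j: hamming(r, records[medoids[j]]))
        for r in records
    ]

def k_medoids(records, k, max_iter=100):
    """Alternating k-medoids; returns (medoid record indices, labels)."""
    medoids = list(range(k))  # assumption: start from the first k records
    for _ in range(max_iter):
        labels = assign(records, medoids)
        new_medoids = []
        for j in range(k):
            members = [i for i, lab in enumerate(labels) if lab == j]
            if not members:
                # Keep the old medoid if a cluster ends up empty.
                new_medoids.append(medoids[j])
                continue
            # The new medoid is the member with the smallest total
            # distance to all other members of its cluster.
            new_medoids.append(min(
                members,
                key=lambda c: sum(hamming(records[c], records[i]) for i in members),
            ))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(records, medoids)

# Hypothetical records: (junction type, light, road surface).
crashes = [
    ("T", "day", "dry"),
    ("X", "night", "wet"),
    ("T", "day", "wet"),
    ("X", "night", "dry"),
    ("T", "day", "dry"),
    ("X", "night", "wet"),
]
medoids, labels = k_medoids(crashes, k=2)
print(medoids, labels)
```

Because each cluster is represented by an actual record (the medoid), the result can be read directly as a representative crash, which suits categorical data where a mean has no natural interpretation.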

Comparison of four statistical and machine learning methods for crash severity prediction

Article

Aug 2017

Crash severity prediction models enable different agencies to predict the severity of a reported crash with unknown severity or the severity of crashes that may be expected to occur sometime in the future. This paper had three main objectives: comparing the performance of four statistical and machine learning methods, namely Multinomial Logit (MNL), Nearest Neighbor Classification (NNC), Support Vector Machines (SVM), and Random Forests (RF), in predicting traffic crash severity; developing a crash costs-based approach for comparing crash severity prediction methods; and investigating the effects of two data clustering methods, K-means Clustering (KC) and Latent Class Clustering (LCC), on the performance of crash severity prediction models. The 2012-2015 reported crash data from Nebraska, United States, was obtained and two-vehicle crashes were extracted as the analysis data. The dataset was split into training/estimation (2012-2014) and validation (2015) subsets. The four prediction methods were trained/estimated using the training/estimation dataset, and the correct prediction rates for each crash severity level, the overall correct prediction rate, and a proposed crash costs-based accuracy measure were obtained for the validation dataset. The correct prediction rates and the proposed approach showed that NNC had the best prediction performance overall and for more severe crashes. RF and SVM performed next best, and MNL was the weakest method. Data clustering did not affect the prediction results of SVM, but KC improved the prediction performance of MNL, NNC, and RF, while LCC improved MNL and RF but weakened the performance of NNC. The overall correct prediction rate gave almost exactly the opposite results compared to the proposed approach, showing that neglecting crash costs can lead to misjudgment in choosing the right prediction method.

Incorporating attitudinal aspects in roadway pricing analysis

Article

Apr 2017
TRANSPORT POLICY

The impacts of behavioral attitudes are rarely explored when it comes to roadway pricing strategies. The existing literature mainly focuses on observed traveler or trip characteristics and is less likely to capture latent preferences or heterogeneity of roadway users. To address this knowledge gap, this study examines how underlying behavioral attitudes affect drivers' choices in utilizing managed lane facilities. Using data from the South Florida Expressway Stated Preference Survey, factor analysis was conducted on ten attitudinal statements, and four latent attitudinal factors were identified: willingness to pay, willingness to shift travel schedule, utility (cost/time) sensitivity, and congestion tolerance. In order to assess the utility of managed lanes for drivers, two sets of multinomial logit (MNL) models were developed using combined revealed preference (RP) and stated preference (SP) data, with and without these attitudinal factors. Results indicated a significant contribution of attitudinal parameters in the model, both in terms of coefficients and model performance. The factors were further used in a cluster analysis which identified major segments of roadway users. Such market segmentation is expected to provide valuable insights in capturing travelers' behavior while accounting for attitudinal aspects, which could enhance transportation planning efforts and policy making procedures.

Predictability and interpretability of hybrid link-level crash frequency models for urban arterials compared to cluster-based and general negative binomial regression models

Article

Feb 2017

Machine learning (ML) techniques have higher prediction accuracy compared to conventional statistical methods for crash frequency modelling. However, their black-box nature limits their interpretability. The objective of this research is to combine both ML and statistical methods to develop hybrid link-level crash frequency models with high predictability and interpretability. For this purpose, the M5′ model trees method (M5′) is introduced and applied to classify the crash data and then calibrate a model for each homogeneous class. Data for 1134 and 345 randomly selected links on urban arterials in the city of Charlotte, North Carolina, were used to develop and validate the models, respectively. The outputs from the hybrid approach are compared with the outputs from cluster-based negative binomial regression (NBR) and general NBR models. Findings indicate that M5′ has high predictability and, compared to the other developed models, is very reliable for interpreting the role of different attributes in crash frequency.
