Investigating the effects of the fixed and varying dispersion parameters of Poisson-gamma models on empirical Bayes estimates

August 2008
Accident Analysis & Prevention 40(4):1441-57

DOI:10.1016/j.aap.2008.03.014

Source
PubMed

Authors:

Peter Park

York University

Traditionally, transportation safety analysts have used the empirical Bayes (EB) method to improve the estimate of the long-term mean of individual sites; to correct for the regression-to-the-mean (RTM) bias in before–after studies; and to identify hotspots or high risk locations. The EB method combines two different sources of information: (1) the expected number of crashes estimated via crash prediction models, and (2) the observed number of crashes at individual sites. Crash prediction models have traditionally been estimated using a negative binomial (NB) (or Poisson-gamma) modeling framework due to the over-dispersion commonly found in crash data. A weight factor is used to assign the relative influence of each source of information on the EB estimate. This factor is estimated using the mean and variance functions of the NB model. With recent trends that illustrated the dispersion parameter to be dependent upon the covariates of NB models, especially for traffic flow-only models, as well as varying as a function of different time-periods, there is a need to determine how these models may affect EB estimates.

Investigating the effects of the fixed and varying dispersion parameters of

Poisson-gamma models on empirical Bayes estimates

Dominique Lord, Ph.D., P.Eng.*

Assistant Professor

Department of Civil Engineering

Texas A&M University

3136 TAMU

College Station, TX 77843-3136

Tel. (979) 458-3949

Fax. (979) 845-6481

Email : d-lord@tamu.edu

Peter Young-Jin Park, Ph.D., P.Eng.

Transportation Engineer

iTRANS Consulting Inc.

100 York Boulevard, Suite 300

Richmond Hill, ON, Canada L4B 1J8

Tel: (905) 882-4100 ext.5264

Fax: (905) 882-1557

Email: ppark@itransconsulting.com

Paper submitted for publication

May 5th, 2007

* Corresponding author

ABSTRACT

Traditionally, transportation safety analysts have used the empirical Bayes (EB) method to

improve the estimate of the long-term mean of individual sites; to correct for the regression-to-the-mean

(RTM) bias in before-after studies; and to identify hotspots or high-risk locations.

The EB method combines two different sources of information: 1) the expected number of

crashes estimated via crash prediction models, and 2) the observed number of crashes at

individual sites. Crash prediction models have traditionally been estimated using a negative

binomial (NB) (or Poisson-gamma) modeling framework due to the over-dispersion

commonly found in crash data. A weight factor is used to assign the relative influence of each

source of information on the EB estimate. This factor is estimated using the mean and

variance functions of the NB model. Because recent studies have shown that the dispersion

parameter can depend on the covariates of NB models and can vary as a function of

different time periods, there is a need to determine how this may affect EB estimates.

The objectives of this study are to examine how commonly used functional forms as

well as fixed and time-varying dispersion parameters affect the EB estimates. To accomplish

the study objectives, several crash prediction models were estimated using a sample of rural

three-legged intersections located in California. Two types of aggregated and time-specific

models were produced: 1) the traditional NB model with a fixed dispersion parameter and 2)

the generalized NB model (GNB) with a time-varying dispersion parameter, which is also

dependent upon the covariates of the model.

The results of the study show that the selection of the functional form of NB models

has an important effect on EB estimates both in terms of estimated values and dispersion

parameter. Time-specific models with a varying dispersion parameter provide better statistical

performance in terms of goodness-of-fit (GOF) than aggregated multi-year models. The

performance is even better when GNB models are used for both time-specific and aggregated models.

Similar to past study findings, there might be no apparent benefits of introducing varying

dispersion parameters for identifying hotspots using the EB method. The study concludes that

transportation safety analysts should not automatically assume that existing functional forms

are adequate for modeling motor vehicle crashes and rigorous analyses should be used to

estimate the most appropriate functional form for linking crashes to explanatory variables,

including traffic flow.

Keywords: crash prediction models, dispersion parameter, empirical Bayes estimates,

negative binomial, rural intersections

INTRODUCTION

Statistical models or crash prediction models have been a very popular method for estimating

the safety performance of various transportation elements. The most common statistical

models used by transportation safety analysts are the Poisson and Negative Binomial (NB)

(or Poisson-gamma) regression models (Miaou, 1994; Poch and Mannering, 1996; Lord et al.,

2005a). NB models are usually the model of choice and have been applied extensively in

various types of highway safety studies, from the identification of hotspots or hazardous sites,

the prediction of motor vehicle collisions, to the development of accident modification factors

via the coefficients of the model (Harwood et al., 2000; Miaou, 1996; Vogt, 1999; Lord and

Bonneson, 2006). There are two main reasons why NB models are favored over Poisson

models for modeling motor vehicle collisions. First, the variance of the response variable (i.e.

crashes per unit of time) commonly exceeds the mean value of the variable, which violates

the main assumption associated with the Poisson model (i.e., “over-dispersion” phenomenon).

As a result, if a Poisson distribution is assumed in estimating the expected number of crashes,

larger discrepancies between the observed and the predicted crashes may be observed (Hauer,

2001). In addition, a mis-specified Poisson model may lead to the inclusion of covariates that

have been erroneously identified as being significant when, in fact, they are not (Park and

Lord, 2006). It has been reported that the over-dispersion is caused by some unmeasured

uncertainties associated with the unobserved or unobservable variables, resulting in the

omitted variable problem. However, although the latter problem can contribute to the over-

dispersion, it is mainly attributed to the nature of the crash process, namely the fact that

crashes are the product of Bernoulli trials with unequal probability of events (this is also

known as Poisson trials). Lord et al. (2005a) have reported that as the number of trials

increases and becomes very large, the distribution may be approximated by a Poisson process

(hence the use of Poisson-based or mixed-Poisson models), where the magnitude of the over-

dispersion is dependent on the characteristics of the Poisson trials. (Note: the over-dispersion

can be minimized using appropriate mean structures of statistical models, as discussed in

Miaou and Song, 2005, Mitra and Washington, 2007, and in the conclusions of this paper). In

short, the NB model can efficiently reduce these unmeasured uncertainties by allowing an

error term to capture the unmeasured heterogeneity in a study dataset (Miaou and Lord, 2003).

Therefore, in order to take into account the over-dispersion problem in a given study dataset,

transportation safety analysts normally adopt the NB modeling framework for developing

crash prediction models.

In highway safety, the dispersion parameter of NB models (note: some researchers

use the term over-dispersion parameter instead of the dispersion parameter) takes a central

role for calculating empirical Bayes (EB) estimates. These estimates are used to smooth the

random fluctuation of crash counts and generate a more accurate estimate of the long-term

mean at a given site. Inasmuch as the EB estimates are one of the main inputs for a sound EB

before-after study, the accuracy of EB estimates will definitely affect the precision of the

analysis output. The EB estimates can also be used to identify hotspots (see Saccomanno et

al., 2001) or “sites with promise” (see Hauer, 1996) by ranking crash-prone locations by order

of magnitude or by computing the difference between the output of predictive models and the

EB estimate. As a result, rigorous statistical models based on an appropriate NB modeling

framework must be developed to obtain reliable EB estimates and to maximize the safety

benefit per dollar spent.

As discussed by Hauer (1997), the long-term mean for a site i over a period t can be

estimated using the EB method:

$$\hat{\hat{\mu}}_{it} = \gamma_{it}\,\hat{\mu}_{it} + (1-\gamma_{it})\,y_{it} \qquad (1)$$

where,

$\gamma_{it}$ = weight factor for given site i and year t;

$y_{it}$ = observed number of crashes for given site i and year t;

$\hat{\mu}_{it}$ = the estimated number of crashes by crash prediction models for given site i and

year t (usually estimated using a NB model).

The weight factor $\gamma_{it}$ is given as follows:

$$\gamma_{it} = \frac{1}{1+\alpha\,\hat{\mu}_{it}} \qquad (2)$$

where,

$\alpha$ = the dispersion parameter for the given dataset [note: in the safety literature,

analysts have also used the inverse dispersion parameter $\phi = 1/\alpha$].
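The way the weight factor trades off the two sources of information can be illustrated with a minimal sketch. It assumes the standard EB form, in which the weight is 1/(1 + α·µ̂) and the EB estimate is the weighted average of the model prediction and the observed count; the function names and the numerical inputs are hypothetical and are not taken from the study.

```python
def eb_weight(mu_hat: float, alpha: float) -> float:
    """Weight placed on the model prediction: 1 / (1 + alpha * mu_hat)."""
    return 1.0 / (1.0 + alpha * mu_hat)


def eb_estimate(y: float, mu_hat: float, alpha: float) -> float:
    """EB estimate: weight * prediction + (1 - weight) * observed count."""
    gamma = eb_weight(mu_hat, alpha)
    return gamma * mu_hat + (1.0 - gamma) * y


# Hypothetical site: the model predicts 0.5 crashes, 3 crashes were observed,
# and the dispersion parameter is 0.4 (illustrative values only).
gamma = eb_weight(0.5, 0.4)          # 1 / 1.2, about 0.833
estimate = eb_estimate(3, 0.5, 0.4)  # about 0.917, between 0.5 and 3
print(round(gamma, 3), round(estimate, 3))
```

A larger dispersion parameter or a larger predicted mean lowers the weight, so the EB estimate moves toward the observed count; with α = 0 the estimate equals the model prediction.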

Up until very recently, researchers did not estimate time-varying EB estimates (as

currently defined in equations (1) and (2)); instead transportation safety analysts produced an

average EB estimate for the sites under study for the entire period by relying on a traditional

NB model (Harwood et al. 2000; Persaud et al., 2001; Vogt, 1999). The traditional NB model

uses a fixed dispersion parameter that is applied to the entire dataset in the study (Miaou,

1996). However, as pointed out by Hauer (2001), there is no tangible rationale that all sites in

the dataset should have a constant dispersion parameter over a given study period. Several

other researchers have also questioned the hypothesis that the dispersion parameter has a

fixed value over different sites and time-periods (Heydecker and Wu, 2001; Miaou and Lord,

2003; Lord et al., 2005b; Miranda-Moreno et al., 2005; El-Basyouny and Sayed, 2006).

Heydecker and Wu (2001) attempted to estimate varying dispersion parameters as a function

of sites’ covariates, such as AADT, lane and shoulder widths among others. They asserted that

the NB model with a varying dispersion parameter (henceforth defined as generalized NB

model or GNB) can better represent the nature of crash dataset than the traditional NB model

with a fixed dispersion parameter. The approach proposed by Heydecker and Wu (2001) was

also used by Lord et al. (2005) for modeling the safety performance of freeways as a function

of traffic flow characteristics. An exception is Lyon et al. (2005b) who introduced a time-

varying dispersion parameter using the traditional NB model. The dispersion parameter for

each year was estimated outside the model estimating process using the maximum likelihood

method.

There are many different functional forms that have been proposed to link crashes to

the explanatory variables of traditional regression models for segments (Martin, 2002; Abbas,

2004; Lord et al., 2005b) and intersections (Nicholson and Turner, 1996; Turner and

Nicholson, 1998; Mountain and Fawaz, 1998; Miaou and Lord, 2003). In the past, these

functional forms were adopted to determine the model that provided the best statistical fit

without considering the relationship between different functional forms and the dispersion

parameter (with the exception of Miaou and Lord, 2003). However, if the selection of a

functional form can influence the precision of the model estimates [i.e. the estimated number

of crashes; $\hat{\mu}_{it}$ in equation (1)], then it may also influence the precision of estimated

dispersion parameters [i.e. $\alpha$ in equation (2)]. In evaluating the safety effects of treatments

(e.g., Persaud et al., 2001; Powers and Carson, 2004) and hotspot identification (e.g.,

Miranda-Moreno et al., 2005), the impact of the dispersion parameters on the EB estimates

calculated using the output of different functional forms merits further investigation. To

this end, the study addresses the following objectives:

1) Develop a series of crash prediction models which can estimate a site and time-specific

number of crashes using both traditional NB and GNB models; estimate fixed and

varying dispersion parameters across site i and time t; and, investigate the

characteristics of the dispersion in the data as a function of the selected covariates (i.e.,

traffic volumes).

2) Evaluate the statistical performance of crash prediction models using commonly used

goodness-of-fit (GOF) statistics as well as Cumulative Residual (CURE) plots.

3) Examine the relationship between the functional forms of crash prediction models and

the estimated dispersion parameters (i.e., both fixed and varying dispersion parameters).

4) Examine the differences in the EB estimates between the NB model with a fixed

dispersion parameter and GNB model with a varying dispersion parameter, and

compare both estimates in hotspot identification.

To accomplish the objectives of this study, several crash prediction models were

produced using a sample of three-legged rural intersections located in California. Crash,

traffic flow and geometric design data (to confirm the intersection geometry) were obtained

from the Highway Safety Information System (HSIS) managed by the University of North

Carolina in Chapel Hill, NC. For a given five-year (1997-2001) study period, a total of 5,752

three-legged rural intersections were included in the database. Intersections that contained

missing and questionable values were discarded from the dataset, resulting in a sample of

5,588 three-legged rural intersections for the same five-year period with a total of 5,996

reported crashes (all crash severities or the total number of crashes). Given the large sample

size, the inverse dispersion parameters in this study are assumed to be properly estimated (see

Lord, 2006 about this assumption). Table 1 contains a brief summary of the study dataset.

MODEL DEVELOPMENT

This section describes the characteristics of the traditional NB and GNB models.

TRADITIONAL AND GENERALIZED NEGATIVE BINOMIAL MODELS

Properties of the traditional NB model have been illustrated by Cameron and Trivedi (1998).

The probability density function (pdf) of the NB distribution can be defined as

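$$P(Y_{it}=y_{it}\mid\mu_{it},\alpha)=\frac{\Gamma\left(y_{it}+\frac{1}{\alpha}\right)}{y_{it}!\;\Gamma\left(\frac{1}{\alpha}\right)}\cdot\left(\frac{\alpha\mu_{it}}{1+\alpha\mu_{it}}\right)^{y_{it}}\cdot\left(\frac{1}{1+\alpha\mu_{it}}\right)^{1/\alpha} \qquad (3)$$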

In contrast to the Poisson distribution, the NB distribution allows for over-dispersion, and thus the mean (i.e. $E\{Y_{it}\} = \mu_{it} = \exp(\mathbf{X}_{it}\boldsymbol{\beta})$; $\mathbf{X}_{it}$ = a vector of covariates, and $\boldsymbol{\beta}$ = regression coefficients corresponding to the covariates) can be smaller than the variance (i.e. $\mathrm{Var}\{Y_{it}\} = \mu_{it} + \alpha\cdot\mu_{it}^{2}$) (Note: the data can also show signs of under-dispersion). When the dispersion parameter $\alpha = 1/\phi$ is equal to zero, the NB distribution reduces to the Poisson distribution. Larger values of $\alpha$ signify a greater amount of over-dispersion. An important characteristic of the traditional NB model is that the dispersion parameter $\alpha$ (or its inverse $\phi$) does not vary from site to site. This type of model has only a single fixed value over all observations without considering potential dependency on the covariates.
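The mean-variance relationship of the NB distribution can be checked numerically. The sketch below assumes the common Poisson-gamma parameterization with mean µ and dispersion parameter α, evaluates the probabilities on the log scale with `math.lgamma`, and compares the resulting mean and variance with µ and µ + α·µ². The function name and the values of µ and α are illustrative assumptions, not values from the study.

```python
import math


def nb_pmf(y: int, mu: float, alpha: float) -> float:
    """NB probability of observing y crashes, given mean mu > 0 and dispersion alpha > 0."""
    inv = 1.0 / alpha
    log_p = (
        math.lgamma(y + inv) - math.lgamma(y + 1.0) - math.lgamma(inv)
        + y * math.log(alpha * mu / (1.0 + alpha * mu))
        - inv * math.log(1.0 + alpha * mu)
    )
    return math.exp(log_p)


mu, alpha = 1.2, 0.5  # illustrative values only
# Truncating the support at 200 leaves a negligible tail for these values.
probs = [nb_pmf(y, mu, alpha) for y in range(200)]
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
print(sum(probs), mean, var, mu + alpha * mu ** 2)
```

For these inputs the variance exceeds the mean by α·µ², which is the over-dispersion that the Poisson model cannot represent.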

The GNB uses the same pdf shown in equation (3) and estimates the number of

crashes of each site, like the traditional NB model. However, instead of estimating a fixed

dispersion parameter, the model estimates varying dispersion parameters by using the

following expression (Hardin and Hilbe, 2001):

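$$\alpha_{it} = \exp(\mathbf{Z}_{it}\cdot\boldsymbol{\delta}_{t}) \qquad (4)$$

where,

$\mathbf{Z}_{it}$ = a vector of secondary covariates (not necessarily the same as the covariates

used in estimating the mean function $\hat{\mu}_{it}$),

$\boldsymbol{\delta}_{t}$ = regression coefficients corresponding to covariates $\mathbf{Z}_{it}$.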

With equation (4), the GNB model can be used for estimating a different over-

dispersion parameter according to the sites’ attributes (i.e., covariates). If there are no

significant secondary covariates for explaining the systematic dispersion structure, the

dispersion parameters will only contain a fixed value (i.e., constant term), resulting in a

traditional NB regression model.
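A minimal sketch of a log-linear dispersion function of this kind is given below. It assumes the dispersion parameter is the exponential of a linear combination of secondary covariates; the function name, covariate vectors, and coefficients are hypothetical.

```python
import math
from typing import Sequence


def gnb_dispersion(z: Sequence[float], delta: Sequence[float]) -> float:
    """Dispersion parameter as exp(z . delta) for one site and one period."""
    if len(z) != len(delta):
        raise ValueError("z and delta must have the same length")
    return math.exp(sum(zk * dk for zk, dk in zip(z, delta)))


# With only a constant term, every site receives the same value,
# which corresponds to the traditional NB model.
fixed = gnb_dispersion([1.0], [math.log(0.4)])

# With a hypothetical flow covariate, the value changes from site to site.
low_flow = gnb_dispersion([1.0, math.log(500.0)], [1.0, -0.2])
high_flow = gnb_dispersion([1.0, math.log(5000.0)], [1.0, -0.2])
print(fixed, low_flow, high_flow)
```

The exponential link keeps the dispersion parameter positive for any coefficient values, which is one practical reason for choosing this form.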

In this study, the STATA V.8.0 program (Stata, 2003) was used to estimate all the

coefficients of the traditional NB and GNB models, including the fixed and varying

dispersion parameters. In order to simplify the analysis, the serial correlation associated with

time-trend models was not included in this study. It should be pointed out that since the data

did not contain any missing values and because the model type is defined as a marginal model,

the coefficients of generalized linear models (GLM) are the same (or very similar) as the

values produced by the Generalized Estimating Equations (GEE), no matter which working

correlation matrix is used (or whether the correlation matrix is mis-specified). The only

difference is related to the standard errors of the coefficients. The standard errors are usually

underestimated when temporal effects are not included in the modeling process (see Lord and

Persaud, 2000 and Hardin and Hilbe, 2003 for additional information).

CRASH PREDICTION MODELS

Given the specific objectives of this study, instead of developing the models with the best

statistical fit considering every possible combination of covariates, only entering traffic

volumes (from major and minor intersecting roads) were used as covariates. Miaou and Lord

(2003) listed the most popular functional forms (referred to in the study as Models 1 to 5) from

previously published studies:

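Model 1: $\ln(\mu_{it}) = \beta_{0t} + \beta_{1}\ln(F_{1i} + F_{2i})$

Model 2: $\ln(\mu_{it}) = \beta_{0t} + \beta_{1}\ln F_{1i} + \beta_{2}\ln F_{2i}$

Model 3: $\ln(\mu_{it}) = \beta_{0t} + \beta_{1}\ln(F_{1i}\cdot F_{2i})$

Model 4: $\ln(\mu_{it}) = \beta_{0t} + \beta_{1}\ln(F_{1i} + F_{2i}) + \beta_{2}\ln(F_{2i}/F_{1i})$

Model 5: $\ln(\mu_{it}) = \beta_{0t} + \beta_{1}\ln F_{1i} + \beta_{2}\ln F_{2i} + \beta_{3}F_{2i}$

where,

$\mu_{it}$ = the expected number of crashes at intersection i in year t;

$F_{1i}$ = average AADT over a given 5 years entering from major road at

intersection i; and,

$F_{2i}$ = average AADT over a given 5 years entering from minor road at

intersection i.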
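The five functional forms can be written as a single function for comparison. This is a sketch that assumes the log-linear forms of Models 1 to 5 as listed by Miaou and Lord (2003); the function name and the coefficient values are hypothetical and are not estimates from this study.

```python
import math
from typing import Sequence


def expected_crashes(model: int, beta: Sequence[float], f1: float, f2: float) -> float:
    """Expected crashes for Models 1 to 5, given coefficients and entering flows."""
    b0, b1, *rest = beta
    if model == 1:
        eta = b0 + b1 * math.log(f1 + f2)
    elif model == 2:
        eta = b0 + b1 * math.log(f1) + rest[0] * math.log(f2)
    elif model == 3:
        eta = b0 + b1 * math.log(f1 * f2)
    elif model == 4:
        eta = b0 + b1 * math.log(f1 + f2) + rest[0] * math.log(f2 / f1)
    elif model == 5:
        eta = b0 + b1 * math.log(f1) + rest[0] * math.log(f2) + rest[1] * f2
    else:
        raise ValueError("model must be an integer from 1 to 5")
    return math.exp(eta)


# Hypothetical coefficients. Model 3 is Model 2 with equal flow exponents,
# and Model 5 reduces to Model 2 when the coefficient on F2 is zero.
m2 = expected_crashes(2, (-9.0, 0.6, 0.6), 3000.0, 300.0)
m3 = expected_crashes(3, (-9.0, 0.6), 3000.0, 300.0)
m5 = expected_crashes(5, (-9.0, 0.6, 0.6, 0.0), 3000.0, 300.0)
print(m2, m3, m5)
```

Writing the forms side by side makes the nesting explicit: the extra term in Model 5 is the only difference from Model 2, and it is linear rather than logarithmic in the minor road flow.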

As reported by Miaou and Lord (2003), the functional forms described above are not

the most adequate for describing the relationship between crashes and exposure since the

forms do not appropriately fit the data near the boundary conditions. Nonetheless, they are

still relevant for this study, as they are considered established functional forms in the highway

safety literature. In addition, the most adequate functional form proposed by Miaou and Lord

(2003), a model with two distinct mean functions, cannot be estimated via a generalized

linear modeling (GLM) framework, as it was done in this study.

In this analysis, the approach proposed by Lyon et al. (2005), in which only the

intercept term ($\beta_{0t}$) varies by year for both the traditional NB and GNB crash prediction

models, was adopted. Their approach also assumes the same exposure for the entire study

period. Furthermore, the mean and dispersion functions have the same covariates (see Lord et

al., 2005b). However, in the final model selection, only the coefficients that passed the

significance test at the 95% confidence level were selected as covariates for the dispersion

parameters.

Tables 2 and 3 summarize the modeling results for the traditional NB and GNB

models, respectively. Five different functional forms are used for each model, and six time-

specific (1997, 1998, 1999, 2000, 2001, and All Year) crash prediction models are developed

for each functional form. To illustrate the application of the models, suppose the expected

number of crashes over a given five-year period is estimated using the GNB Model 1. If we

assume, for example, that the average “Major AADT” and “Minor AADT” over the given

period are 3,000 vpd and 300 vpd, respectively, by employing the disaggregate time-specific

“GNB Model 1” in Table 3, one obtains the expected number of crashes per year as 0.095 [i.e.,

exp(-10.561+1.0136×ln(3000+300))], 0.075, 0.102, 0.104, and 0.109, respectively. By

summing up all these values, the expected number of crashes over the five-year period is

estimated to be 0.485. On the other hand, by employing the aggregate “All-Year Model” in

Table 3, the estimate of the crash frequency at the same intersection over the same 5-year

period equals 0.484 [i.e. exp(-8.9388+1.0136×ln(3000+300)) ≈ 0.484].

Using the same example intersection above, the time-specific inverse dispersion

parameters are estimated as 3.474 [i.e., exp(3.0961-0.2285×ln(3000+300))], 6.202, 2.722,

2.271, and 2.453, respectively, for each corresponding year. Unlike the expected number of crashes,

these values cannot be aggregated; note that the average value equals 2.852. Using the “All-Year Model”,

the overall inverse dispersion parameter over the five years equals 2.359.
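The arithmetic in the worked example can be retraced in a few lines. The sketch uses only the coefficients and yearly values quoted in the text for GNB Model 1; the variable names are hypothetical. Because the quoted coefficients are rounded, the recomputed values agree with the reported ones only to about two decimal places.

```python
import math

F1, F2 = 3000.0, 300.0           # example AADT values from the text
ln_flow = math.log(F1 + F2)

# Coefficients quoted in the text for GNB Model 1 (Table 3).
mu_year1 = math.exp(-10.561 + 1.0136 * ln_flow)     # first time-specific model
mu_all_year = math.exp(-8.9388 + 1.0136 * ln_flow)  # aggregate all-year model
phi_year1 = math.exp(3.0961 - 0.2285 * ln_flow)     # inverse dispersion, first year

# Yearly expected crashes as reported in the text; these can be summed.
yearly_mu = [0.095, 0.075, 0.102, 0.104, 0.109]
five_year_mu = sum(yearly_mu)

print(round(mu_year1, 3), round(five_year_mu, 3))
print(round(mu_all_year, 2), round(phi_year1, 2))
```

The yearly means add up to a five-year total that is close to the all-year prediction, whereas the yearly inverse dispersion parameters have no comparable additive interpretation.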

Four different GOF statistical tests were employed for comparing the series of crash

prediction models. The tests, described in Hardin and Hilbe (2001) and Washington et al.

(2003), are as follows:

1) Akaike’s Information Criterion (AIC) = $\dfrac{-2\ln L(M_k) + 2P}{N}$ (5)

where,

ln L(Mk) = log likelihood of model k;

P = the number of parameters; and,

N = the number of observations (in our exercise = 5,588).

2) Bayesian Information Criteria (BIC) = D(Mk) – d.f.·lnN (6)

where,

D(Mk) = Deviance of model k; and,

d.f.= degrees of freedom.

3) Sum of Model Deviances (G2) = $2\sum_{i} y_i \ln\left(\dfrac{y_i}{\hat{\mu}_i}\right)$ (7)

where,

$y_i$ = the observed number of crashes at site i;

$\hat{\mu}_i$ = the expected number of crashes at site i.

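4) R2 like measure of fit (MOF) based on Standardized Residuals (R2) = $1 - \left[\sum_{i=1}^{N}\left(\dfrac{y_i-\hat{\mu}_i}{\sqrt{\hat{\mu}_i}}\right)^{2} \Bigg/ \sum_{i=1}^{N}\left(\dfrac{y_i-\bar{y}}{\sqrt{\bar{y}}}\right)^{2}\right]$ (8)

where,

$\bar{y}$ = average number of observed crashes.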
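Three of the four statistics can be computed directly from the observed and fitted values. The sketch below assumes the per-observation AIC, (−2 ln L + 2P)/N, the deviance-type G2 statistic, 2·Σ y·ln(y/µ̂), and the R2 like measure based on standardized residuals; sites with zero observed crashes are taken to contribute zero to G2, which is a convention assumed here. The function names and the toy data are hypothetical.

```python
import math
from typing import Sequence


def aic_per_obs(log_lik: float, n_params: int, n_obs: int) -> float:
    """AIC scaled by the number of observations: (-2 lnL + 2P) / N."""
    return (-2.0 * log_lik + 2.0 * n_params) / n_obs


def g2(y: Sequence[float], mu: Sequence[float]) -> float:
    """G2 = 2 * sum(y * ln(y / mu)); terms with y = 0 are treated as zero."""
    return 2.0 * sum(yi * math.log(yi / mi) for yi, mi in zip(y, mu) if yi > 0)


def r2_like(y: Sequence[float], mu: Sequence[float]) -> float:
    """One minus the ratio of standardized residual sums (model vs. overall mean)."""
    y_bar = sum(y) / len(y)
    num = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    den = sum((yi - y_bar) ** 2 / y_bar for yi in y)
    return 1.0 - num / den


# Toy data for illustration only.
obs = [0, 1, 2, 1]
fit = [0.5, 1.0, 1.5, 1.0]
print(aic_per_obs(-100.0, 3, 50), g2(obs, fit), r2_like(obs, fit))
```

Lower AIC and G2 values and a higher R2 like value indicate a better fit, so the statistics need not agree on a single best model.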

The model with the lowest value in AIC, BIC, and G2 is considered the model with

the best statistical fit. On the other hand, the model with the largest R2 like MOF value

indicates a superior fitted model. As shown in Table 4, in general Model 4 provides the best

fit for both the traditional NB and GNB models according to the three different test statistics

(i.e. AIC, BIC, and R2 like MOF). Model 5 is selected as the best fitted model based on the

G2-statistics, but is selected as the second worst model based on the R2 like MOF. On the

other hand, all the test statistics selected Model 1 as the worst statistical fit model regardless

of the model type (i.e., NB or GNB).

As pointed out by Miranda-Moreno et al. (2005), in general, GNB models fit the

data better than the traditional NB models on the basis of the three different test statistics (i.e.

AIC, BIC, and G2-statistics), with the exception of the “Model 1” result in terms of the G2-statistics.

Karim and Sayed (2006) reported the same conclusion in their study. [Note: Using

the Deviance statistics, Miaou and Lord (2003) did not find a significant difference between

NB and GNB models in terms of GOF.] However, the R2 like MOF does not produce

consistent test results compared to the results based on the other test statistics. Examining the

analysis results of the four different test statistics, the following findings are worth

noting (refer to Table 4):

1) Model 4 can be considered as the best fitted model amongst the five alternate

Models regardless of the model type (i.e., NB and GNB model) based on the test

results of AIC, BIC, and R2 like MOF.

2) Overall, “GNB” models show a better fit than traditional “NB” models

regardless of the functional forms based on the three different test statistics (i.e.

AIC, BIC, and G2-statistics). However, since the R2 like MOF produced

inconsistent test results, this test statistic may not be suitable to determine the

best fitted model as well as the most suitable functional form at least for this

study dataset.

3) As a result, determining the best fitted model as well as the most suitable

functional form using a single (or a couple of) test statistics may potentially be

unreliable, and thus should be avoided.

GOODNESS OF FIT EVALUATION BASED ON RESIDUAL ANALYSIS

In the previous section, four different test statistics were utilized for measuring the GOF of

competitive functional forms or models. Another evaluation method initially proposed by

Hauer and Bamfo (1997) can also be used to evaluate the model fit adequacy. This method is

known as the CURE method and has since been used extensively by many transportation

safety analysts (e.g., Lord and Persaud, 2000; Washington et al., 2005; Wang and Abdel-Aty,

2007), to determine the most suitable model for their study data. The method requires

scrutinizing a graph (i.e., CURE plot), in which the cumulative residuals are plotted in

increasing order for each explanatory variable (i.e., major and minor road AADT in our case)

separately. The residuals ($e_{it}$) represent the difference between the observed ($y_{it}$) and the

expected number of crashes ($\hat{\mu}_{it}$) at a given study site i in a year t. The closer the curve

oscillates around the zero-residual line, the better the model fits the data. The curious reader is

referred to the references listed above for additional details about the CURE method.
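The data behind a CURE plot can be assembled as in the following sketch: sites are sorted by the explanatory variable, residuals are accumulated, and ±2σ limits are attached to each point. The limits use the approximation σ*(n) = σ(n)·sqrt(1 − σ²(n)/σ²(N)), where σ²(n) is the running sum of squared residuals; this form is recalled from the CURE literature and should be treated as an assumption, since the text does not restate it. The function name and the toy data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cure_points(
    x: Sequence[float], y: Sequence[float], mu: Sequence[float]
) -> List[Tuple[float, float, float]]:
    """Return (x, cumulative residual, 2-sigma limit) tuples sorted by x."""
    order = sorted(range(len(x)), key=lambda k: x[k])
    residuals = [y[k] - mu[k] for k in order]

    total_sq = 0.0
    for r in residuals:
        total_sq += r * r

    points = []
    cum_resid = 0.0
    cum_sq = 0.0
    for k, r in zip(order, residuals):
        cum_resid += r
        cum_sq += r * r
        # The limit shrinks to zero at the last site (assumed form, see lead-in).
        shrink = max(0.0, 1.0 - cum_sq / total_sq) if total_sq > 0.0 else 0.0
        points.append((x[k], cum_resid, 2.0 * math.sqrt(cum_sq * shrink)))
    return points


# Toy data: major road AADT, observed crashes, and fitted crashes for three sites.
pts = cure_points([300.0, 100.0, 200.0], [1.0, 0.0, 2.0], [0.5, 0.5, 1.0])
for flow, cum, limit in pts:
    print(flow, cum, round(limit, 3))
```

Plotting the cumulative residual against the sorted variable, together with the positive and negative limits, gives the curve whose oscillation around zero is inspected.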

Figures 1 through 5 show a total of ten different CURE plots using five different

functional forms with two different model types based upon the “all year” crash data. As

discussed by Hauer and Bamfo (1997), the CURE plot will reveal how well the functional

forms fit the data with respect to each individual explanatory variable (in our case, F1 and F2)

and show systematic deviations of the cumulative residuals from the zero-residual line. Two

different CURE plots (i.e. Figure (a) and (b)) are generated for each Model to compare the

model adequacy between the two model types (i.e., NB and GNB models). To shorten the

illustration, the major road AADT (F1) has been used as a representative explanatory variable

for this analysis. Looking at Figures 1 to 5, several characteristics involving CURE plots can

be noticed:

1) Model 1 underestimates the expected number of crashes (i.e., $y_{it} > \hat{\mu}_{it}$) in the

range between 0 and 65,000 major road AADT and slightly overestimates the

number (i.e., $y_{it} < \hat{\mu}_{it}$) where the major road AADT (F1) is higher than 65,000

(refer to Figure 1) regardless of the model type (NB or GNB). Moreover, in the

range of major road AADT (F1) between 5,000 and 12,000, Model 1 produces

significantly lower values in the expected number of crashes compared to the

observed number of crashes, resulting in cumulative residuals substantially greater

than the +2.0σ confidence limit. No practical difference

in the CURE plots between the two model types (i.e. NB and GNB model) is

found and the final cumulative residual line is reasonably close to 0.

2) NB Model 2 (refer to the Figure 2 (a)) and NB Model 4 (refer to the Figure 4 (a))

are similar in that both models underestimate the expected number of crashes

over the entire range of variable (i.e., F1) with greater than +2.0σ cumulative

residuals where the major road AADT is higher than about 50,000. In addition,

these two NB models lack a feature that a well-fitted model should

have (i.e., a final cumulative residual close to zero). In the case of NB Model 3, while

the expected number of crashes is overestimated where the major

road AADT lower than about 20,000, the other features are similar to the NB

Model 2 and 4 (i.e., cumulative residuals greater than the +2.0σ confidence limit,

no zero cumulative residual ending).

3) Among the different CURE plots from Figure 1 through 5, the Model 5 CURE

plots (i.e., Figure 5-(a), (b)) showed the most unexpected results, including a

catastrophic drop in the cumulative residuals at the major road AADT around

42,000, regardless of the model type. In the previous section, this model was

selected as the best fitted model according to the G2-statistics and chosen as the

second best fitted model based on the AIC as well as BIC. A closer look at the

CURE plots as well as the raw dataset revealed that this sudden drop is caused

by an unusually high minor road AADT (i.e., F2 = 23,111) at a

specific intersection. The value is at least twice as high as the other minor

roads’ AADT, and it produced a dramatically larger overestimate

(i.e., eit = -1165.6 in NB Model 5, eit = -1170.7 in GNB

Model 5). In fact, this sudden drop reveals that Model 5 is very sensitive

to high values of minor road AADT (F2) because of the third

variable in the functional form (i.e., $\beta_{3}F_{2i}$). It should be recognized that the third

variable in Model 5 made the actual difference between Models 2 and 5, and

Model 2 does not show the sudden change in estimates. In truth, this site should

be investigated further to determine whether this observation is in fact an outlier

(e.g., error in reported flows, etc.) or an influence point. Statistical tests (not used

here), such as R-Student, DFFITS and Cook’s D, can be used to identify

potential outliers and influence points (Myers, 2000).

4) GNB Model 2, 3, and 4 yielded improved CURE plots compared to the CURE

plots of the corresponding NB Models in that the amount of bias in the estimates

has been much reduced. The final cumulative residuals of these three GNB

models end close enough to zero value. Although there are just slight differences

in the CURE plots among these three GNB models, GNB Model 4 shows a better

fit (it was already selected as the best fitted model based on the test

statistics analysis in previous section) since the cumulative residual lines are

oscillating reasonably across the zero cumulative residual line over the entire

explanatory values.

Table 5 contains the summary of the cumulative residuals for a total of 60 different

models [6 years (including all year) × 5 functional forms × 2 model types] to show the

difference between the time-specific model (yearly based model) and all-year model as well

as the difference between the NB (fixed dispersion) and GNB (varying dispersion) model.

Notable characteristics are:

1) The sums of residuals of time-specific models show a great amount of up-and-

down fluctuation in each year regardless of the functional forms employed. For

Models 2 through 4, the absolute values of the total residuals from time-specific

models (i.e., Sum 97-01) show lower values than those of the aggregated all-year

models. It indicates that the time-specific models produce more accurate results

than the aggregated all year models regardless of the model type (i.e. NB or

GNB), especially for Models 2, 3 and 4. The result is even more interesting

considering that we used only constant entering traffic volumes (i.e. average

F1 and F2 over study period) as model inputs [Note: It is important to point out

that using the varying entering volumes for every year (i.e., time-varying F1t and

F2t) might show even better results than using a constant exposure (Mountain et

al., 1998; Lord and Persaud, 2000; Lord et al., 2005b).] On the other hand,

Model 1 and 5 do not show this characteristic.

2) In general, GNB model shows better fitting results with smaller residuals than

the traditional NB model, with the exception of Model 5. As mentioned

previously, Model 5 produced very different results compared to the results from

the other models because of the abnormally high minor road AADT

(F2) at the previously identified intersection. This could be a unique

characteristic of the Model 5 at least for this study dataset. Note that the model is

usually adapted to estimate the crash frequencies at intersections in large urban

areas (Lord and Persaud, 2004; Miaou and Lord, 2003; Lyon et al. 2005),

therefore Model 5 may not be suitable to estimate the crashes at intersections in

rural areas. It is of interest to note that Model 5 was originally selected as the

best functional form in terms of the G2-statistics and the second best functional

form in terms of the AIC and BIC (refer to Table 4). The fact that the statistical

tests in Equations (5) through (8) are frequently used to determine and justify the

best model performance by transportation safety analysts without further looking

into raw data set using a model diagnosis tool (e.g., CURE plot) is a cause for

concern.
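The cumulative residual (CURE) diagnostic discussed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: sites are sorted by a covariate, the residuals (observed minus predicted crashes) are accumulated, and the ±2 standard deviation bounds use the variance formula commonly attributed to Hauer and Bamfo (1997), which is an assumption here because the bounds are not defined in this section. The function name `cure_series` and the four-site dataset are hypothetical.

```python
import math


def cure_series(covariate, observed, predicted):
    """Return (sorted covariate, cumulative residuals, 2-sigma half-width)."""
    # Sort the sites by the covariate shown on the x-axis (e.g., major AADT).
    rows = sorted(zip(covariate, observed, predicted), key=lambda r: r[0])
    residuals = [y - mu for _, y, mu in rows]
    total_sq = sum(r * r for r in residuals)  # sum of squared residuals over all sites

    xs, cum, band = [], [], []
    running = 0.0      # cumulative residual up to the current site
    running_sq = 0.0   # cumulative squared residual up to the current site
    for (x, _, _), r in zip(rows, residuals):
        running += r
        running_sq += r * r
        # ASSUMPTION: Hauer-Bamfo style variance of the cumulative residual,
        # sigma*^2(n) = sigma^2(n) * (1 - sigma^2(n) / sigma^2(N)).
        if total_sq > 0.0:
            var_star = running_sq * (1.0 - running_sq / total_sq)
        else:
            var_star = 0.0
        xs.append(x)
        cum.append(running)
        band.append(2.0 * math.sqrt(max(var_star, 0.0)))
    return xs, cum, band


# Hypothetical data: major AADT, observed crashes, model estimates.
covariate = [3000, 1000, 2000, 4000]
observed = [2, 0, 1, 3]
predicted = [1.5, 0.5, 1.0, 2.0]

for x, c, b in zip(*cure_series(covariate, observed, predicted)):
    print(f"F1={x:>5}  cumulative residual={c:+.2f}  bounds=+/-{b:.2f}")
```

A well-fitting functional form would keep the cumulative residual inside the bounds and oscillating around zero over most of the covariate range; a long drift away from zero points to a range of the covariate where the mean function is biased. Under this formula the bounds close to zero at the last site, so the end point is best read as the total residual rather than as a bounds violation.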

DISPERSION PARAMETERS AMONG DIFFERENT CRASH PREDICTION MODELS

Table 6 and Figure 6 summarize the estimated dispersion parameters obtained from different

models. A few notable features include:

1) The estimated values from this study are quite different from the values reported in the previous study by Miaou and Lord (2003), since that study used a completely different dataset, from the City of Toronto. However, the same pattern has been observed: the estimated fixed dispersion parameters of the traditional NB Models 2, 3, 4, and 5 are fairly similar.

2) In general, traditional NB models underestimate the inverse dispersion parameters compared to GNB models, with the exception of Model 3. In terms of the varying dispersion parameters, comparing the minimum/maximum values as well as the average values, Model 5 again shows the biggest discrepancy among all the models tested (as seen in Table 6). This may be caused by the observation with an abnormally high minor road AADT, or it may simply be a peculiar characteristic of Model 5, since a similar discrepancy was experienced by Miaou and Lord (2003). Whatever the reason, since $Var\{Y_{it}\} = \mu_{it} + \alpha \cdot \mu_{it}^{2}$, Model 5 estimates will produce a larger variance than those of the other models, implying higher uncertainty associated with this model. As a result, compared to the other models, Model 5 puts more emphasis on the observed number of crashes than on the model estimates in obtaining EB estimates.

3) Inasmuch as a traditional NB model produces a fixed (i.e., constant) dispersion parameter for each model, the weight factor is inversely related to the model estimates (i.e., $w_{it} \propto 1/\hat{\mu}_{it}$; refer to Figure 7). The higher the model estimates, the smaller the weight factors, and vice versa.

4) GNB Models 2, 4, and 5 [refer to Figures 8-(b), 8-(d), and 8-(e), respectively] allow different weight factors for intersections with the same model estimate. The tendency is stronger for intersections with higher model estimates. On the other hand, GNB Models 1 and 3 [refer to Figures 8-(a) and 8-(c)] show the same patterns as the corresponding NB models and do not allow different dispersion parameters unless the intersections have different model estimates. For Models 1 and 3, the traffic volumes entering from the major and minor roads (i.e., F1 and F2, respectively) are not treated as distinct traffic volumes, but rather as a single traffic volume unit (i.e., F1+F2 or F1·F2). Even with exactly the same aggregate traffic volume (e.g., F1 + F2 = 3,000/day), intersections could have different entering traffic volumes from the major and minor roads (e.g., F1 = 1,500 and F2 = 1,500; F1 = 2,000 and F2 = 1,000; etc.). Since the same functional form is used to explain the dispersion and the mean values in all GNB Models, GNB Models 1 and 3 must have the same value for the weight factor if the model estimates are the same. In contrast, GNB Models 2, 4, and 5 can have different dispersion parameters and model estimates even for intersections with exactly the same aggregate traffic volume. Although the estimated dispersion parameters can differ between the traditional NB and GNB models, if only the aggregated traffic volumes (i.e., F1+F2 or F1·F2) are used to explain the varying dispersion parameters, there is no practical merit in using the GNB model over the traditional NB model. Figures 9-(a) and 9-(c) clearly show that GNB Models 1 and 3 produce virtually the same weight factors as those calculated from the traditional NB models. It should be noted that the GNB models produce slightly lower values for the weight factor (i.e., the coefficients in Figure 9 are less than 1.0) than the traditional NB models. Hence, the GNB models will give slightly more weight to the observed number of crashes than to the model estimates when calculating the EB estimates, compared to the traditional NB models with the same estimated value.
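The role of the dispersion parameter in the weight factor can be illustrated with a short Python sketch. It is an illustration under stated assumptions rather than the paper's procedure: the weight is taken in the standard form w = 1/(1 + α·μ) that follows from the variance function Var{Y} = μ + α·μ² (Hauer, 1997); Functional Form 2 is assumed to be μ = exp(β0)·F1^β1·F2^β2; and the GNB dispersion is assumed to follow the same form, ln(α) = γ0 + γ2·ln(F2), with γ1 omitted because it is reported as "n.a." for the all-year model. The coefficients are the all-year Model 2 values from Tables 2 and 3; the two sites and the observed count are hypothetical.

```python
import math

# All-year Model 2 coefficients copied from Table 2 (NB) and Table 3 (GNB).
NB2 = {"b0": -8.4111, "b1": 0.6730, "b2": 0.4880, "ln_alpha": 0.2610}
GNB2 = {"b0": -8.4040, "b1": 0.6578, "b2": 0.5134, "g0": 1.7312, "g2": -0.2640}


def mean_form2(coef, f1, f2):
    # ASSUMPTION: Functional Form 2 is mu = exp(b0) * F1^b1 * F2^b2.
    return math.exp(coef["b0"] + coef["b1"] * math.log(f1) + coef["b2"] * math.log(f2))


def alpha_nb(coef):
    # Fixed dispersion: one value for every site.
    return math.exp(coef["ln_alpha"])


def alpha_gnb(coef, f2):
    # ASSUMPTION: ln(alpha_i) = g0 + g2 * ln(F2_i); site-specific dispersion.
    return math.exp(coef["g0"] + coef["g2"] * math.log(f2))


def eb_weight(mu, alpha):
    # Standard EB weight implied by Var{Y} = mu + alpha * mu^2.
    return 1.0 / (1.0 + alpha * mu)


def eb_estimate(mu, alpha, y):
    # Weighted average of the model estimate and the observed count.
    w = eb_weight(mu, alpha)
    return w * mu + (1.0 - w) * y


# Two hypothetical sites with the same total entering volume (3,000/day).
sites = [(1500, 1500), (2000, 1000)]
observed = 3  # hypothetical five-year crash count at each site

for f1, f2 in sites:
    mu_nb, a_nb = mean_form2(NB2, f1, f2), alpha_nb(NB2)
    mu_g, a_g = mean_form2(GNB2, f1, f2), alpha_gnb(GNB2, f2)
    print(f"F1={f1}, F2={f2}")
    print(f"  NB : alpha={a_nb:.3f} w={eb_weight(mu_nb, a_nb):.3f} EB={eb_estimate(mu_nb, a_nb, observed):.3f}")
    print(f"  GNB: alpha={a_g:.3f} w={eb_weight(mu_g, a_g):.3f} EB={eb_estimate(mu_g, a_g, observed):.3f}")
```

Under the fixed-dispersion model the weight depends on the site only through the model estimate, whereas under the varying-dispersion model the two sites also receive different dispersion values, so the weight can differ even when the model estimates coincide.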

Figure 10 shows which covariates are more heavily associated with the degree of dispersion among the different GNB Models. In this illustration, GNB Models 1 and 3 were disregarded since these models contain only a single aggregated traffic volume. GNB Models 2, 4, and 5 show that the major road AADT contributes more significantly to the variation in the dispersion parameters than the minor road AADT. Clearly, for a given major road AADT, a number of different dispersion parameters can be estimated. Since the weight factors are a function of the dispersion parameters, the variation in the weight factor in GNB Models 2, 4, and 5 is mainly caused by the heterogeneity associated with the major road covariate. Since the AADT has been used as a surrogate measure for explaining the possible structure of un-modeled heterogeneities, as documented in Miaou and Lord (2003) and Mitra and Washington (2007), the inclusion of other covariates describing characteristics associated with the major approaches may help reduce the heterogeneity observed in the models.

The final objective of this study consisted of investigating the impact of varying

dispersions on identifying hotspots. Figure 11 illustrates the relationship between the hotspot

identification lists ranked by the traditional NB models and by the GNB models. Smaller

values in the ranking imply more hazardous intersections in terms of the EB estimates. The

association between the NB ranking and the GNB ranking is very strong and shows almost a

perfect correlation regardless of the functional forms used. However, a notable point is that the hotspot ranking for Models 1 and 3 is consistently lower for the traditional NB model than for the GNB model. The other functional forms do not show this pattern. The cause of this pattern is unclear at this time, and this characteristic should be investigated further.

To evaluate the association between the two rankings, the Spearman rank-order correlation coefficient ($\rho_s$), which is a non-parametric statistic, was applied to the data. Similar to the findings documented in El-Basyouny and Sayed (2006), the association in ranking between the traditional NB and GNB models was found to be very strong (i.e., $\rho_s$ > 0.99 in all cases). Therefore, it can be concluded that there is no substantial benefit in using the GNB model if the sole purpose of the study is to identify hotspots based on the EB estimates (again, if the same exposure level is used for both model types).
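The ranking comparison can be sketched as follows. This is a minimal Python illustration with hypothetical EB estimates, not the study data: sites are ranked by EB estimate so that rank 1 is the most hazardous site, tied values receive their average rank, and the Spearman coefficient is computed as the Pearson correlation of the two rank vectors. The function names are illustrative.

```python
import math


def average_ranks(values):
    """Rank values in descending order (rank 1 = largest); ties get the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman_rho(a, b):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    # Undefined (division by zero) if all values in a list are identical.
    return cov / math.sqrt(var_a * var_b)


# Hypothetical EB estimates for five sites under the two model types.
eb_nb = [4.2, 1.1, 0.3, 2.5, 0.9]
eb_gnb = [4.6, 1.0, 0.4, 2.4, 1.2]

print("NB ranks :", average_ranks(eb_nb))
print("GNB ranks:", average_ranks(eb_gnb))
print("Spearman rho:", round(spearman_rho(eb_nb, eb_gnb), 3))
```

Because the coefficient depends only on the ordering of the sites, two models can produce noticeably different EB estimates and still yield a coefficient close to 1 when they order the sites almost identically.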

CONCLUSIONS

In this paper, a number of important issues regarding the impact of the dispersion parameters

on EB estimates were presented. Several important conclusions can be reported:

1) Developing time-specific models is favored over developing aggregated models.

This supports the findings of Lord and Persaud (2000) who noted that models

with trend performed better in terms of GOF than models without trend (or

aggregated models). Along the same lines, GNB models provide an even better

performance than NB models with a fixed dispersion parameter both for time-

specific and aggregated modeling frameworks in most cases.

2) If separate years and specific sites are analyzed individually, time-specific models

with a varying dispersion parameter will have a significant impact on the EB

estimate, as documented in Table 6 and Figure 6. This result concurs with those

of Miaou and Lord (2003). The varying dispersion parameters have shown a

significant level of variability, which will affect the weight factor associated with

the EB estimate as well as the computation of confidence intervals associated with this

estimate.

3) Model 5 produced results that were very sensitive to the minor road AADT (F2) because of the last component ($e^{\beta_3 F_{2i}}$) in the functional form. Based on the empirical findings of this study, including the CURE plots, Model 5 (i.e., Functional Form 5) does not seem to be an appropriate functional form for estimating crashes at three-leg intersections in rural areas, at least with the database used in this study. A similar argument can be made about Model 1.

4) The importance of developing reliable crash prediction models for obtaining rigorous EB estimates cannot be overemphasized. Automatically adopting a functional form from previous studies should be avoided, and a rigorous analysis, not based solely on GOF statistics, must be carried out to determine the appropriate functional form for a given study dataset. The residual diagnosis analysis using CURE plots can provide much more detailed information than GOF statistics alone.

5) There is no substantial benefit in obtaining the EB estimates by applying varying dispersion parameters (rather than a fixed dispersion parameter) if the purpose of the study is to identify hotspots (e.g., in an aggregated-level analysis). However, there might be a considerable impact on the evaluation of treatment effects via EB before-after studies, and this impact is worthy of further investigation. As described above, time-varying exposure should also be investigated in this context.

In conclusion, as pointed out by Miaou (2005) and to some degree in Miaou and Lord

(2003), statistics is just one of many sciences. The "science part" of a statistical model is

related to the mean function. What the transportation safety analyst needs the most is to better

understand the functional structure of the mean function. While this study focused on “the

structure of the variance function,” especially the one related to the dispersion parameter (or

its inverse), transportation safety analysts should never lose sight of the most important part

of a statistical model (i.e., the structure of the mean function). In theory, any modifications to

the structure of the mean function (via the inclusion or exclusion of covariates in crash

prediction models) will affect the structure of the variance function.

Ideally, with any type of model, a good structure and the proper selection of the

covariates for the mean function would make the structure of the variance function vanish or

at least significantly minimize the magnitude of the variance (e.g., see Miaou and Song, 2005

and Mitra and Washington, 2007). However, since it is practically unachievable to obtain a

perfect mean function (see Xie et al., 2006), transportation safety analysts will continue to

work with the variance function for the following three reasons: 1) looking for clues to

improve the deficiency of the mean function, 2) reducing the bias in the mean function, and

3) hopefully providing more accurate statistical inferences for decision-making purposes.

These are important issues that need to be addressed in future research projects.

REFERENCES

Abbas, K.A. Traffic safety assessment and development of predictive models for accidents on

rural roads in Egypt. Accident Analysis & Prevention, Vol. 36, No. 2, 2004, pp. 136-143.

Cameron, A.C., and Trivedi, P.K. Regression analysis of count data, Econometric Society

Monograph No. 30, Cambridge University Press, 1998.

El-Basyouny, K., and Sayed, T. Comparison of Two Negative Binomial Regression

Techniques in Developing Accident Prediction Models, Presented at Transportation

Research Board 85th Annual Meeting, 2006.

Hardin, J.J., and Hilbe, J.M. Generalized Linear Models and Extensions. Stata Press, College

Station, Texas, 2001.

Hardin, J.J., and Hilbe J.M. Generalized Estimating Equations. Chapman & Hall/CRC, Boca

Raton, FL, 2003.

Harwood, D.W., Council, F.M., Hauer, E., Hughes, W.E., and Vogt, A. Prediction of the

expected safety performance of rural two-lane highways, Federal Highway

Administration, Final Report, FHWA-RD-99-207, 2000.

Hauer, E. Observational before-after studies in road safety. Pergamon Press, Elsevier Science

Ltd., Oxford, England, 1997.

Hauer, E. Identification of “Sites with Promise.” Transportation Research Record 1542, 1996,

pp. 54-60.

Hauer, E. Overdispersion in modelling accidents on road sections and in Empirical Bayes

estimation. Accident Analysis & Prevention, Vol. 33, No. 6, 2001. pp. 799-808.

Hauer, E. and Bamfo, J. Two tools for finding what function links the dependent variable to

the explanatory variables. In Proceedings of the ICTCT 1997 Conference, Lund, Sweden,

1997.

Heydecker, B.G., and J. Wu, Identification of sites for road accident remedial work by

Bayesian statistical methods: An example of uncertain inference. Advances in

Engineering Software, Vol. 32, 2001, pp. 859-869.

Karim, L.-B., and Sayed, T. Comparison of two negative binomial regression techniques in

developing accident prediction models. Transportation Research Record 1950, 2006, pp.

9-16.

Lord, D. Modeling motor vehicle crashes using Poisson-gamma models: Examining the

effects of low sample mean values and small sample size on the estimation of the fixed

dispersion parameter. Accident Analysis & Prevention, Vol. 38, No. 4, 2006, pp. 751-766.

Lord, D., and Bonneson, J.A. Development of Accident Modification Factors for Rural

Frontage Road Segments in Texas. Zachry Department of Civil Engineering, Texas A&M

University, College Station, TX, 2006.

Lord, D., Manar, A., and Vizioli, A. Modeling crash-flow-density and crash-flow-V/C ratio

relationships for rural and urban freeway segments. Accident Analysis & Prevention, Vol.

37, No. 1, 2005b, pp. 185-199.

Lord, D., and Persaud, B.N. Accident prediction models with and without trend: Application

of the Generalized Estimating Equations (GEE) procedure. Transportation Research

Record 1717, 2000, pp. 102-108.

Lord, D., and Persaud B.N. Estimating the safety performance of urban transportation

networks. Accident Analysis & Prevention. Vol. 36, No. 2, 2004, pp. 609-620.

Lord, D., Washington, S.P., and Ivan, J.N. Poisson, Poisson-Gamma and Zero Inflated

Regression Models of Motor Vehicle Crashes: Balancing Statistical Fit and Theory.

Accident Analysis & Prevention, Vol. 37, No. 1, 2005a, pp. 35-46.

Lyon, C., Haq, A., Persaud, B.N., and Kodama, S.T. Development of safety performance

functions for signalized intersections in a large urban area and application to evaluation of

left turn priority treatment, Presented at the 84th Annual Meeting of Transportation

Research Board, Washington, D.C., 2005.

Martin, J.-L. Relationship between crash rate and hourly traffic flow on interurban motorways.

Accident Analysis & Prevention, Vol. 34, 2002, pp. 619-629.

Miaou, S.P. The relationship between truck accidents and geometric design of road sections:

Poisson versus negative binomial regressions. Accident Analysis & Prevention, Vol. 26,

No. 4, 1994, pp. 471-482.

Miaou, S-P. Measuring the goodness-of-fit of accident prediction models. Federal Highway

Administration, Final Report, FHWA-RD-96-040, 1996.

Miaou, S-P., and Lord, D. Modeling traffic crash-flow relationships for intersections:

Dispersion parameter, functional form, and Bayes versus empirical Bayes methods.

Transportation Research Record 1840, 2003, pp 31-40.

Miaou, S.-P., and Song, J.J. Bayesian ranking of sites for engineering safety improvements:

Decision parameter, treatability concept, statistical criterion, and spatial dependence,

Accident Analysis & Prevention, Vol. 37, No. 4, 2005, pp. 699-720.

Miaou, S-P. Personal Communication by E-mail, April 22, 2005.

Miranda-Moreno, L.F., Fu, L., Saccomanno, F.F., and Labbe, A. Alternative risk models for

ranking locations for safety improvement. Transportation Research Record 1908, 2005,

pp 1-8.

Mitra, S., and Washington, S., On the nature of over-dispersion in motor vehicle crash

prediction models, Accident Analysis & Prevention (In Press).

Mountain, L., and Fawaz, B. Estimating accidents at junctions using routinely-available input

data. Traffic Engineering and Control, Vol. 11, 1996, pp. 624-628.

Mountain, L., Maher, M.J., and Fawaz, B. The influence of trend on estimates of accidents at

junctions, Accident Analysis & Prevention, Vol. 30, No. 5, 1998, pp. 641-649.

Myers, R.H. Classical and Modern Regression with Applications, 2nd ed. Duxbury Press,

Pacific Grove, CA, 2000.

Nicholson, A., and Turner, S. Estimating accidents in a road network. In Proceedings of

Roads 96 Conference, Part 5, New Zealand, 1996, pp. 53-66.

Park, E.S., and Lord, D. Multivariate Poisson-Lognormal Models for Jointly Modeling Crash

Frequency by Severity. Working Paper, Texas Transportation Institute, College Station,

TX, 2006.

Persaud, B.N. Statistical methods in highway safety analysis, A Synthesis of Highway

Practice, National Cooperative Highway Research Program

Synthesis 295, TRB, National Research Council National Academy Press, Washington,

D.C., 2001.

Persaud, B.N., Retting, R.A., Garder, P.E., and Lord, D. Safety effect of roundabout

conversions in the United States: Empirical Bayes observational before-after study, Journal

of the Transportation Research Record 1751, 2001, pp. 1-8.

Poch, M., and Mannering, F.L. Negative binomial analysis of intersection-accident

frequencies, Journal of Transportation Engineering, Vol. 122, No. 2, 1996, pp. 105-113.

Powers, M., and Carson, J. Before-after crash analysis: A primer for using the empirical

Bayes method Tutorial. U.S. Department of Transportation, Final Report, FHWA/MT-04-

002-8117-21, 2004.

Saccomanno, F.F., Grossi, R., Greco, D., and Mehmood, A. Identifying black spots along

Highway SS107 in southern Italy using two models. Journal of Transportation

Engineering, Vol. 127, No. 6, 2001, pp. 515-522.

Stata, Reference Manual, Release 8, Stata Press, 2003.

Turner, S., and Nicholson A. Intersection accident estimation: The role of intersection

location and non-collision flows. Accident Analysis & Prevention, Vol. 30, No. 4, 1998,

pp. 505-517.

Vogt, A. Crash models for rural intersections: Four-lane by two-lane stop-controlled and two-

lane by two-lane signalized, Federal Highway Administration, Final Report, FHWA-RD-

99-128, 1999.

Washington, S., Persaud, B., Lyon, C., and Oh, J., Validation of Accident Models for

Intersections, Federal Highway Administration, Final Report, FHWA-RD-03-037, 2005.

Wang, X., and Abdel-Aty, M.A. Investigation of Signalized Intersection Right-Angle Crash

Occurrence at Intersection, Roadway, and Approach Levels. Paper presented at the 84th

Annual Meeting of the TRB, Washington, D.C., 2007.

Xie, Y., Lord, D., and Zhang Y. Predicting Motor Vehicle Collisions using Bayesian Neural

Networks: An Empirical Analysis. Zachry Department of Civil Engineering, Texas A&M

University, College Station, TX, 2006.

Table 1. Summary Statistics for the Study Dataset

| Variable (per site) | Mean | Std. Dev. | Min. | Max. | No. of intersections |
|---|---|---|---|---|---|
| No. of crashes* / 5 years | 1.073 | 2.806 | 0 | 64 | 5,588 |
| No. of average crashes (KABCO) / year | 0.215 | 0.561 | 0 | 12.8 | 5,588 |
| Avg. AADT of major road over 5 years | 6,953 | 7,407 | 103 | 79,800 | 5,588 |
| Avg. AADT of minor road over 5 years | 268 | 683 | 1 | 23,111 | 5,588 |

* = All crash severities

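Table 2. Estimated Coefficients and Statistics for NB Crash Prediction Models

| NB Model | Year | β0,t | β1,t | β2,t | β3,t | Log(α) | Log Likelihood | No. of Parameters | Deviance | AIC | BIC | G2-Statistics | R2-like MOF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Functional Form 1 (NB Model 1) | 1997 | -10.5741 (0.0395) | | | | 0.9383 (0.0867) | -2717.7926 | 2 | 5435.585 | 0.973 | -42762.527 | 4241.982 | 0.275 |
| Functional Form 1 (NB Model 1) | 1998 | -10.8353 (0.0441) | | | | 1.1580 (0.0928) | -2252.1741 | 2 | 4504.348 | 0.807 | -43693.764 | 3788.404 | 0.375 |
| Functional Form 1 (NB Model 1) | 1999 | -10.5141 (0.0367) | | | | 0.6542 (0.0915) | -2846.5973 | 2 | 5693.195 | 1.020 | -42504.918 | 4271.889 | 0.275 |
| Functional Form 1 (NB Model 1) | 2000 | -10.4896 (0.0377) | | | | 0.8203 (0.0856) | -2885.1331 | 2 | 5770.266 | 1.033 | -42427.846 | 4408.752 | 0.276 |
| Functional Form 1 (NB Model 1) | 2001 | -10.4433 (0.0379) | | | | 0.8974 (0.0815) | -2962.3282 | 2 | 5924.656 | 1.061 | -42273.456 | 4603.335 | 0.283 |
| Functional Form 1 (NB Model 1) | All Year | -8.9541 (0.2609) | 1.0153 (0.0295) | - | - | 0.6871 (0.0399) | -6763.0979 | | 13526.196 | 2.422 | -34663.288 | 13034.299 | 0.475 |
| Functional Form 2 (NB Model 2) | 1997 | -10.0342 (0.0371) | | | | 0.3407 (0.1092) | -2556.6668 | 2 | 5113.334 | 0.916 | -43084.779 | 3710.145 | 0.382 |
| Functional Form 2 (NB Model 2) | 1998 | -10.2757 (0.0423) | | | | 0.7072 (0.1081) | -2148.9187 | 2 | 4297.837 | 0.770 | -43900.275 | 3391.635 | 0.434 |
| Functional Form 2 (NB Model 2) | 1999 | -9.9651 (0.0341) | | | | -0.0623 (0.1244) | -2663.3955 | 2 | 5326.791 | 0.954 | -42871.321 | 3702.554 | 0.382 |
| Functional Form 2 (NB Model 2) | 2000 | -9.9418 (0.0356) | | | | 0.2665 (0.1053) | -2725.6695 | 2 | 5451.339 | 0.976 | -42746.773 | 3900.767 | 0.376 |
| Functional Form 2 (NB Model 2) | 2001 | -9.8969 (0.0358) | | | | 0.3854 (0.0981) | -2811.3501 | 2 | 5622.700 | 1.007 | -42575.412 | 4089.771 | 0.363 |
| Functional Form 2 (NB Model 2) | All Year | -8.4111 (0.2325) | 0.6730 (0.0266) | 0.4880 (0.0162) | - | 0.2610 (0.0471) | -6420.9349 | | 12841.870 | 2.300 | -35338.986 | 10575.656 | 0.614 |
| Functional Form 3 (NB Model 3) | 1997 | -9.2226 (0.0370) | | | | 0.3190 (0.1108) | -2556.1857 | 2 | 5112.371 | 0.916 | -43085.741 | 3703.764 | 0.381 |
| Functional Form 3 (NB Model 3) | 1998 | -9.4585 (0.0425) | | | | 0.7331 (0.1076) | -2160.1023 | 2 | 4320.205 | 0.774 | -43877.908 | 3413.441 | 0.411 |
| Functional Form 3 (NB Model 3) | 1999 | -9.1513 (0.0341) | | | | -0.0660 (0.1250) | -2665.9620 | 2 | 5331.924 | 0.955 | -42866.188 | 3705.128 | 0.386 |
| Functional Form 3 (NB Model 3) | 2000 | -9.1288 (0.0355) | | | | 0.2582 (0.1060) | -2727.2530 | 2 | 5454.506 | 0.977 | -42743.606 | 3900.835 | 0.370 |
| Functional Form 3 (NB Model 3) | 2001 | -9.0851 (0.0357) | | | | 0.3707 (0.0989) | -2812.5312 | 2 | 5625.062 | 1.007 | -42573.050 | 4089.798 | 0.354 |
| Functional Form 3 (NB Model 3) | All Year | -7.5976 (0.1746) | 0.5465 (0.0124) | - | - | 0.2689 (0.0472) | -6435.9894 | 3 | 12871.979 | 2.305 | -35317.505 | 10590.943 | 0.605 |
| Functional Form 4 (NB Model 4) | 1997 | -10.3397 (0.0371) | | | | 0.3265 (0.1099) | -2553.4246 | 2 | 5106.849 | 0.915 | -43091.263 | 3690.981 | 0.388 |
| Functional Form 4 (NB Model 4) | 1998 | -10.5805 (0.0423) | | | | 0.6996 (0.1083) | -2146.2885 | 2 | 4292.577 | 0.769 | -43905.535 | 3376.622 | 0.439 |
| Functional Form 4 (NB Model 4) | 1999 | -10.2712 (0.0340) | | | | -0.0882 (0.1259) | -2658.3654 | 2 | 5316.731 | 0.952 | -42881.382 | 3679.758 | 0.382 |
| Functional Form 4 (NB Model 4) | 2000 | -10.2486 (0.0355) | | | | 0.2405 (0.1068) | -2720.8898 | 2 | 5441.780 | 0.975 | -42756.333 | 3875.938 | 0.383 |
| Functional Form 4 (NB Model 4) | 2001 | -10.2045 (0.0357) | | | | 0.3563 (0.0992) | -2803.4481 | 2 | 5606.896 | 1.004 | -42591.216 | 4060.822 | 0.375 |
| Functional Form 4 (NB Model 4) | All Year | -8.7168 (0.2379) | 1.1620 (0.0279) | 0.4279 (0.0158) | - | 0.2470 (0.0474) | -6409.6162 | | 12819.232 | 2.295 | -35361.623 | 10458.744 | 0.621 |
| Functional Form 5 (NB Model 5) | 1997 | -9.7416 (0.0374) | | | | 0.3552 (0.1067) | -2555.7990 | 2 | 5111.598 | 0.915 | -43086.514 | 3597.308 | 0.372 |
| Functional Form 5 (NB Model 5) | 1998 | -9.9824 (0.0426) | | | | 0.7168 (0.1064) | -2147.7370 | 2 | 4295.474 | 0.769 | -43902.638 | 3306.568 | 0.426 |
| Functional Form 5 (NB Model 5) | 1999 | -9.6755 (0.0345) | | | | -0.0294 (0.1182) | -2662.1116 | 2 | 5324.223 | 0.954 | -42873.889 | 3592.798 | 0.364 |
| Functional Form 5 (NB Model 5) | 2000 | -9.6482 (0.0359) | | | | 0.2873 (0.1021) | -2724.6199 | 2 | 5449.240 | 0.976 | -42748.873 | 3800.043 | 0.366 |
| Functional Form 5 (NB Model 5) | 2001 | -9.6017 (0.0361) | | | | 0.4041 (0.0955) | -2808.7386 | 2 | 5617.477 | 1.006 | -42580.635 | 3981.894 | 0.365 |
| Functional Form 5 (NB Model 5) | All Year | -8.1150 (0.2424) | 0.6706 (0.0264) | 0.4196 (0.0239) | 0.0002 (0.0000) | 0.2548 (0.0471) | -6413.0636 | | 12826.127 | 2.297 | -35346.100 | 10020.232 | 0.593 |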

Table 3. Estimated Coefficients and Statistics for GNB Crash Prediction Models

| GNB Model | Year | β0,t | β1,t | β2,t | β3,t | Log(α): γ0,t | Log(α): γ1,t | Log(α): γ2,t | Log(α): γ3,t | Log Likelihood | No. of Parameters | Deviance | AIC | BIC | G2-Statistics | R2-like MOF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Functional Form 1 | 1997 | -10.5610 (0.0397) | | | | 3.0961 (1.0610) | -0.2285 (0.1125) | - | - | -2715.8124 | 3 | 5431.625 | 0.973 | -42757.859 | 4248.340 | 0.274 |
| Functional Form 1 | 1998 | -10.8026 (0.0452) | | | | 5.6048 (1.1962) | -0.4666 (0.1261) | - | - | -2245.9615 | 3 | 4491.923 | 0.805 | -43697.561 | 3756.577 | 0.385 |
| Functional Form 1 | 1999 | -10.4971 (0.0368) | | | | 3.0128 (1.1197) | -0.2483 (0.1183) | - | - | -2844.5574 | 3 | 5689.115 | 1.019 | -42500.369 | 4268.989 | 0.277 |
| Functional Form 1 | 2000 | -10.4742 (0.0377) | | | | 0.8202 (0.0856) | n.a. | - | - | -2885.0900 | 2 | 5770.180 | 1.033 | -42427.932 | 4409.860 | 0.277 |
| Functional Form 1 | 2001 | -10.4280 (0.0378) | | | | 0.8972 (0.0815) | n.a. | - | - | -2962.2666 | 2 | 5924.533 | 1.061 | -42273.579 | 4604.552 | 0.284 |
| Functional Form 1 | All | -8.9388 (0.2612) | 1.0136 (0.0293) | - | - | 2.1562 (0.4667) | -0.1602 (0.0508) | - | - | -6758.3375 | | 13516.675 | 2.420 | -34664.181 | 13040.488 | 0.475 |
| Functional Form 2 | 1997 | -10.0277 (0.0374) | | | | 2.4832 (0.3928) | n.a. | -0.3441 (0.0644) | - | -2542.5005 | 3 | 5085.001 | 0.911 | -43104.483 | 3663.926 | 0.380 |
| Functional Form 2 | 1998 | -10.2571 (0.0435) | | | | 5.7276 (1.2646) | -0.3104 (0.1340) | -0.3385 (0.0641) | - | -2132.6150 | 4 | 4265.230 | 0.765 | -43915.626 | 3334.523 | 0.437 |
| Functional Form 2 | 1999 | -9.9598 (0.0343) | | | | 2.1456 (0.4425) | n.a. | -0.3507 (0.0729) | - | -2651.7545 | 3 | 5303.509 | 0.950 | -42885.975 | 3656.838 | 0.380 |
| Functional Form 2 | 2000 | -9.9414 (0.0359) | | | | 2.0073 (0.3947) | n.a. | -0.2788 (0.0641) | - | -2716.8689 | 3 | 5433.738 | 0.973 | -42755.746 | 3870.922 | 0.368 |
| Functional Form 2 | 2001 | -9.8967 (0.0361) | | | | 1.5581 (0.3784) | n.a. | -0.1879 (0.0603) | - | -2806.9713 | 3 | 5613.943 | 1.006 | -42575.541 | 4059.650 | 0.352 |
| Functional Form 2 | All | -8.4040 (0.2314) | 0.6578 (0.0269) | 0.5134 (0.0158) | - | 1.7312 (0.1669) | n.a. | -0.2640 (0.0295) | - | -6384.3005 | | 12768.601 | 2.287 | -35403.626 | 10347.664 | 0.613 |
| Functional Form 3 | 1997 | -9.3283 (0.0373) | | | | 4.8790 (0.8095) | -0.2924 (0.0531) | - | - | -2542.6011 | 3 | 5085.202 | 0.911 | -43104.282 | 3679.690 | 0.378 |
| Functional Form 3 | 1998 | -9.5443 (0.0435) | | | | 6.1169 (0.8307) | -0.3448 (0.0542) | - | - | -2141.2279 | 3 | 4282.456 | 0.767 | -43907.028 | 3352.992 | 0.422 |
| Functional Form 3 | 1999 | -9.2523 (0.0343) | | | | 4.7512 (0.9192) | -0.3067 (0.0604) | - | - | -2654.4034 | 3 | 5308.807 | 0.951 | -42880.677 | 3665.911 | 0.386 |
| Functional Form 3 | 2000 | -9.2387 (0.0358) | | | | 3.7656 (0.8101) | -0.2247 (0.0527) | - | - | -2719.4728 | 3 | 5438.946 | 0.974 | -42750.538 | 3887.016 | 0.364 |
| Functional Form 3 | 2001 | -9.1952 (0.0360) | | | | 3.0049 (0.7839) | -0.1691 (0.0508) | - | - | -2808.0660 | 3 | 5616.132 | 1.006 | -42573.352 | 4076.810 | 0.346 |
| Functional Form 3 | All | -7.6995 (0.1708) | 0.5541 (0.0118) | - | - | 3.5021 (0.3422) | -0.2203 (0.0234) | - | - | -6396.8367 | 4 | 12793.673 | 2.291 | -35387.182 | 10423.853 | 0.606 |
| Functional Form 4 | 1997 | -10.3783 (0.0375) | | | | 3.4820 (1.2291) | -0.4323 (0.1327) | -0.3011 (0.0640) | - | -2540.0314 | 4 | 5080.063 | 0.911 | -43100.793 | 3659.810 | 0.386 |
| Functional Form 4 | 1998 | -10.6065 (0.0434) | | | | 5.7966 (1.3023) | -0.6326 (0.1399) | -0.2878 (0.0628) | - | -2130.3818 | 4 | 4260.764 | 0.764 | -43920.092 | 3327.764 | 0.445 |
| Functional Form 4 | 1999 | -10.3085 (0.0343) | | | | 3.7436 (1.3887) | -0.5032 (0.1514) | -0.3103 (0.0710) | - | -2646.6847 | 4 | 5293.369 | 0.949 | -42887.486 | 3642.744 | 0.382 |
| Functional Form 4 | 2000 | -10.2915 (0.0359) | | | | 2.2491 (1.2112) | -0.2956 (0.1307) | -0.2555 (0.0636) | - | -2712.5502 | 4 | 5425.100 | 0.972 | -42755.755 | 3855.607 | 0.378 |
| Functional Form 4 | 2001 | -10.2480 (0.0361) | | | | 2.3097 (1.1175) | -0.2609 (0.1225) | -0.1668 (0.0576) | - | -2798.9706 | 4 | 5597.941 | 1.003 | -42582.914 | 4041.863 | 0.368 |
| Functional Form 4 | All | -8.7536 (0.2378) | 1.1724 (0.0271) | 0.4418 (0.0152) | - | 2.4856 (0.5114) | -0.3335 (0.0570) | -0.2334 (0.0287) | - | -6372.4107 | | 12744.821 | 2.283 | -35418.777 | 10283.859 | 0.621 |
| Functional Form 5 | 1997 | -9.7060 (0.0376) | | | | 2.7813 (0.4407) | n.a. | -0.4151 (0.0781) | 0.0001 (0.0000) | -2542.4047 | 4 | 5084.809 | 0.911 | -43096.046 | 3578.725 | 0.369 |
| Functional Form 5 | 1998 | -9.9335 (0.0436) | | | | 6.1316 (1.3185) | -0.3005 (0.1369) | -0.4513 (0.0780) | 0.0002 (0.0000) | -2130.2489 | 5 | 4260.498 | 0.764 | -43911.729 | 3269.585 | 0.429 |
| Functional Form 5 | 1999 | -9.6373 (0.0344) | | | | 2.7523 (0.4787) | n.a. | -0.4903 (0.0865) | 0.0002 (0.0001) | -2646.5215 | 4 | 5293.043 | 0.949 | -42887.813 | 3567.426 | 0.364 |
| Functional Form 5 | 2000 | -9.6153 (0.0360) | | | | 2.4632 (0.4373) | n.a. | -0.3823 (0.0780) | 0.0001 (0.0000) | -2713.7821 | 4 | 5427.564 | 0.973 | -42753.291 | 3789.491 | 0.362 |
| Functional Form 5 | 2001 | -9.5673 (0.0362) | | | | 1.8322 (1.8322) | n.a. | -0.2544 (0.0744) | 0.0001 (0.0000) | -2803.1294 | 4 | 5606.259 | 1.005 | -42574.597 | 3967.084 | 0.360 |
| Functional Form 5 | All | -8.0781 (0.2437) | 0.6568 (0.0267) | 0.4367 (0.0254) | 0.0002 (0.0000) | 1.9449 (0.1883) | n.a. | -0.3182 (0.0360) | 0.0001 (0.0000) | -6374.4074 | | 12748.815 | 2.284 | -35406.156 | 9920.566 | 0.592 |

Table 4. Summary of Statistical Tests

| Functional Form (All Year Model) | AIC: NB Model | AIC: GNB Model | BIC: NB Model | BIC: GNB Model | G2-Statistics: NB Model | G2-Statistics: GNB Model | R2-like MOF*: NB Model | R2-like MOF*: GNB Model |
|---|---|---|---|---|---|---|---|---|
| Model 1 | 2.422 | 2.420 | -34663.288 | -34664.181 | 13034.299 | 13040.488 | 0.475 | 0.475 |
| Model 2 | 2.300 | 2.287 | -35338.986 | -35403.626 | 10575.656 | 10347.664 | 0.614 | 0.613 |
| Model 3 | 2.305 | 2.291 | -35317.505 | -35387.182 | 10590.943 | 10423.853 | 0.605 | 0.606 |
| Model 4 | 2.295 | 2.283 | -35361.623 | -35418.777 | 10458.744 | 10283.859 | 0.621 | 0.621 |
| Model 5 | 2.297 | 2.284 | -35346.100 | -35406.156 | 10020.232 | 9920.566 | 0.593 | 0.592 |
| Best Model | NB Model 4 | GNB Model 4 | NB Model 4 | GNB Model 4 | NB Model 5 | GNB Model 5 | NB Model 4 | GNB Model 4 |
| Second Best Model | NB Model 5 | GNB Model 5 | NB Model 5 | GNB Model 5 | NB Model 4 | GNB Model 4 | NB Model 2 | GNB Model 2 |
| Worst Model | NB Model 1 | GNB Model 1 | NB Model 1 | GNB Model 1 | NB Model 1 | GNB Model 1 | NB Model 1 | GNB Model 1 |

* = Bold values represent a better model between the NB and GNB model within the same functional form

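Table 5. Cumulative Residuals between NB and GNB Models

Yearly Based Model Residuals*: $\sum (y_{it} - \hat{\mu}_{it})$

| Model Type | Functional Form | 1997 | 1998 | 1999 | 2000 | 2001 | Total 97-01 | All Year Model |
|---|---|---|---|---|---|---|---|---|
| NB Models | Model 1 | -17.88 | 42.62 | 6.58 | -22.82 | -33.18 | -24.68 | -16.58 |
| NB Models | Model 2 | 13.72 | 49.29 | 28.90 | 1.57 | -5.72 | 87.76 | 124.99 |
| NB Models | Model 3 | 15.71 | 45.65 | 28.29 | 1.98 | -3.80 | 87.85 | 123.52 |
| NB Models | Model 4 | 8.60 | 44.58 | 24.14 | -2.34 | -8.81 | 66.17 | 100.41 |
| NB Models | Model 5 | -268.74 | -173.43 | -269.18 | -309.85 | -334.16 | -1355.37 | -1332.83 |
| GNB Models | Model 1 | -14.58 | 27.11 | 5.11 | -22.16 | -32.42 | -36.94 | -13.38 |
| GNB Models | Model 2 | -6.16 | 22.37 | 9.09 | -12.40 | -20.09 | -7.19 | 20.62 |
| GNB Models | Model 3 | 4.19 | 18.03 | 10.01 | -5.26 | -11.00 | 15.97 | 43.22 |
| GNB Models | Model 4 | -5.79 | 21.50 | 7.05 | -12.65 | -18.68 | -8.58 | 16.32 |
| GNB Models | Model 5 | -278.88 | -196.69 | -284.04 | -316.64 | -343.74 | -1419.99 | -1393.78 |

* = yit represents the observed number of crashes per site per year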

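Table 6. Inverse Dispersion Parameters among Different Models

| Dispersion | Period | Model 1 Avg. | Model 1 Min. | Model 1 Max. | Model 2 Avg. | Model 2 Min. | Model 2 Max. | Model 3 Avg. | Model 3 Min. | Model 3 Max. | Model 4 Avg. | Model 4 Min. | Model 4 Max. | Model 5 Avg. | Model 5 Min. | Model 5 Max. | Models 1~5 Avg. | Models 1~5 Min. | Models 1~5 Max. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fixed Dispersion (NB Model) | E{Di}_97 | 2.556 | 2.556 | 2.556 | 1.406 | 1.406 | 1.406 | 1.376 | 1.376 | 1.376 | 1.386 | 1.386 | 1.386 | 1.426 | 1.426 | 1.426 | 1.630 | 1.376 | 2.556 |
| Fixed Dispersion (NB Model) | E{Di}_98 | 3.183 | 3.183 | 3.183 | 2.028 | 2.028 | 2.028 | 2.082 | 2.082 | 2.082 | 2.013 | 2.013 | 2.013 | 2.048 | 2.048 | 2.048 | 2.271 | 2.013 | 3.183 |
| Fixed Dispersion (NB Model) | E{Di}_99 | 1.924 | 1.924 | 1.924 | 0.940 | 0.940 | 0.940 | 0.936 | 0.936 | 0.936 | 0.916 | 0.916 | 0.916 | 0.971 | 0.971 | 0.971 | 1.137 | 0.916 | 1.924 |
| Fixed Dispersion (NB Model) | E{Di}_00 | 2.271 | 2.271 | 2.271 | 1.305 | 1.305 | 1.305 | 1.295 | 1.295 | 1.295 | 1.272 | 1.272 | 1.272 | 1.333 | 1.333 | 1.333 | 1.495 | 1.272 | 2.271 |
| Fixed Dispersion (NB Model) | E{Di}_01 | 2.453 | 2.453 | 2.453 | 1.470 | 1.470 | 1.470 | 1.449 | 1.449 | 1.449 | 1.428 | 1.428 | 1.428 | 1.498 | 1.498 | 1.498 | 1.660 | 1.428 | 2.453 |
| Fixed Dispersion (NB Model) | E{Di}_All | 1.988 | 1.988 | 1.988 | 1.298 | 1.298 | 1.298 | 1.309 | 1.309 | 1.309 | 1.280 | 1.280 | 1.280 | 1.290 | 1.290 | 1.290 | 1.433 | 1.280 | 1.988 |
| Varying Dispersion (GNB Model) | E{Di}_97 | 3.302 | 1.676 | 7.093 | 2.867 | 0.377 | 11.980 | 3.605 | 0.305 | 25.158 | 3.093 | 0.324 | 15.466 | 2.965 | 0.250 | 16.140 | 3.166 | 0.250 | 25.158 |
| Varying Dispersion (GNB Model) | E{Di}_98 | 5.933 | 1.401 | 26.653 | 6.156 | 0.370 | 53.100 | 12.431 | 1.051 | 86.754 | 5.914 | 0.350 | 46.729 | 6.681 | 0.199 | 84.114 | 7.423 | 0.199 | 86.754 |
| Varying Dispersion (GNB Model) | E{Di}_99 | 2.582 | 1.233 | 5.913 | 1.994 | 0.252 | 8.547 | 3.172 | 0.268 | 22.140 | 2.350 | 0.192 | 14.164 | 2.202 | 0.114 | 15.679 | 2.460 | 0.114 | 22.140 |
| Varying Dispersion (GNB Model) | E{Di}_00 | 2.271 | 2.271 | 2.271 | 2.294 | 0.452 | 7.443 | 1.184 | 0.100 | 8.263 | 2.276 | 0.418 | 7.546 | 2.435 | 0.252 | 11.742 | 2.092 | 0.100 | 11.742 |
| Varying Dispersion (GNB Model) | E{Di}_01 | 2.453 | 2.453 | 2.453 | 2.111 | 0.719 | 4.750 | 0.553 | 0.047 | 3.862 | 2.220 | 0.617 | 5.907 | 2.121 | 0.485 | 6.248 | 1.892 | 0.047 | 6.248 |
| Varying Dispersion (GNB Model) | E{Di}_All | 2.264 | 1.415 | 3.892 | 1.845 | 0.398 | 5.648 | 0.910 | 0.077 | 6.349 | 1.921 | 0.343 | 6.811 | 1.848 | 0.286 | 6.993 | 1.758 | 0.077 | 6.993 |
| Fixed - Varying Dispersion | E{Di}_97 | -0.746 | 0.879 | -4.537 | -1.461 | 1.028 | -10.574 | -2.229 | 1.071 | -23.783 | -1.707 | 1.063 | -14.080 | -1.539 | 1.177 | -14.713 | | | |
| Fixed - Varying Dispersion | E{Di}_98 | -2.749 | 1.783 | -23.469 | -4.128 | 1.658 | -51.072 | -10.349 | 1.030 | -84.673 | -3.901 | 1.663 | -44.716 | -4.634 | 1.849 | -82.067 | | | |
| Fixed - Varying Dispersion | E{Di}_99 | -0.659 | 0.691 | -3.989 | -1.055 | 0.688 | -7.607 | -2.236 | 0.668 | -21.204 | -1.434 | 0.723 | -13.248 | -1.231 | 0.857 | -14.707 | | | |
| Fixed - Varying Dispersion | E{Di}_00 | 0.000 | 0.000 | 0.000 | -0.988 | 0.853 | -6.138 | 0.111 | 1.195 | -6.968 | -1.004 | 0.853 | -6.275 | -1.102 | 1.080 | -10.409 | | | |
| Fixed - Varying Dispersion | E{Di}_01 | 0.000 | 0.000 | 0.000 | -0.641 | 0.751 | -3.280 | 0.896 | 1.402 | -2.413 | -0.792 | 0.811 | -4.478 | -0.624 | 1.013 | -4.750 | | | |
| Fixed - Varying Dispersion | E{Di}_All | -0.276 | 0.573 | -1.904 | -0.547 | 0.900 | -4.349 | 0.399 | 1.232 | -5.040 | -0.641 | 0.937 | -5.530 | -0.558 | 1.004 | -5.703 | | | |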
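The varying-dispersion entries in Table 6 can be cross-checked against the coefficients in Table 3. The Python sketch below does this for the all-year GNB Model 2 only, and it rests on two assumptions that this section does not state explicitly: that the tabulated quantity is exp(γ0 + γ2·ln F2) (γ1 is reported as "n.a."), and that the extreme values occur at the minimum and maximum minor road AADT listed in Table 1. The constant and function names are illustrative.

```python
import math

# Table 3, GNB Model 2 (All): intercept and minor-road coefficient of Log(alpha).
GAMMA0 = 1.7312
GAMMA2 = -0.2640

# Table 1: minimum and maximum average minor road AADT.
F2_MIN = 1
F2_MAX = 23111


def dispersion_model2(f2):
    # ASSUMPTION: ln(alpha_i) = gamma0 + gamma2 * ln(F2_i) for the all-year model.
    return math.exp(GAMMA0 + GAMMA2 * math.log(f2))


largest = dispersion_model2(F2_MIN)   # gamma2 < 0, so the smallest F2 gives the largest value
smallest = dispersion_model2(F2_MAX)

print(f"largest  = {largest:.3f}  (Table 6 reports Max. 5.648)")
print(f"smallest = {smallest:.3f}  (Table 6 reports Min. 0.398)")
```

The computed values agree with the tabulated minimum and maximum to within the rounding of the published coefficients, which supports reading the Table 6 entries as the exponentiated Log(α) function evaluated site by site.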

[Plot: CURE Plot for the NB Model 1 (All Year). x-axis: Major AADT (F1), 0 to 90,000; y-axis: Cumulative Residuals, -250 to 250; series: -2 Std. Dev., Cumulative Residual, +2 Std. Dev.]

(a)

[Plot: CURE Plot for the GNB Model 1 (All Year). x-axis: Major AADT (F1), 0 to 90,000; y-axis: Cumulative Residuals, -250 to 250; series: -2 Std. Dev., Cumulative Residual, +2 Std. Dev.]

(b)

Figure 1. Cumulative Residual Plots for Model 1

Figure 2. Cumulative Residual Plots for Model 2: (a) NB model, (b) GNB model (all years). [Plots not reproduced: cumulative residuals with ±2 standard deviation bounds versus major AADT (F1).]

Figure 3. Cumulative Residual Plots for Model 3: (a) NB model, (b) GNB model (all years). [Plots not reproduced: cumulative residuals with ±2 standard deviation bounds versus major AADT (F1).]

Figure 4. Cumulative Residual Plots for Model 4: (a) NB model, (b) GNB model (all years). [Plots not reproduced: cumulative residuals with ±2 standard deviation bounds versus major AADT (F1).]

Figure 5. Cumulative Residual Plots for Model 5: (a) NB model, (b) GNB model (all years). [Plots not reproduced: cumulative residuals with ±2 standard deviation bounds versus major AADT (F1).]

Figure 6. Dispersion Parameters based on All-Year Models. [Bar charts not reproduced: average, minimum, and maximum inverse dispersion for Models 1 to 5, fixed dispersion versus varying dispersion.]

Figure 7. Relationships between the Weight Factor and NB Model Estimates. [Plots not reproduced: panels (a) to (e) show E{Yi} versus the weight factor for Models 1 to 5 using fixed dispersion (all years).]

Figure 8. Relationships between the Weight Factor and the GNB Model Estimates. [Plots not reproduced: panels (a) to (e) show E{Yi} versus the weight factor for Models 1 to 5 (all years).]

Figure 9. Associations of Weight Factors between NB and GNB Models. [Plots not reproduced: weight factor by fixed dispersion against weight factor by varying dispersion, all years. Fitted lines: (a) Model 1, Y = 0.9256X, R² = 0.993; (b) Model 2, Y = 0.8885X, R² = 0.9309; (c) Model 3, Y = 0.8318X, R² = 0.9355; (d) Model 4, Y = 0.8678X, R² = 0.9362; (e) Model 5, Y = 0.8888X, R² = 0.818.]

Figure 10. Relationships between the Varying Dispersions and Traffic Volumes. [Plots not reproduced: dispersion parameter versus traffic volume, all years. (a) Model 2, major road average AADT; (b) Model 2, minor road average AADT; (c) Model 4, major road average AADT; (d) Model 4, minor road average AADT; (e) Model 5, major road AADT; (f) Model 5, minor road AADT.]

Figure 11. Relationships in Ranking by Varying Dispersions and Traffic Volumes. [Plots not reproduced: site ranking under fixed dispersion against site ranking under varying dispersion, sum of 1997 to 2001. (a) Model 1, Y = 0.9996X, R² = 0.9969, Spearman ρs = 0.998; (b) Model 2, Y = 0.9983X, R² = 0.986, ρs = 0.993; (c) Model 3, Y = 0.9978X, R² = 0.982, ρs = 0.991; (d) Model 4, Y = 0.9982X, R² = 0.9857, ρs = 0.993; (e) Model 5, Y = 0.9999X, R² = 0.9989, ρs = 0.999.]
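The Spearman coefficients reported in Figure 11 compare how sites are ranked under the two dispersion assumptions. A minimal sketch of such a comparison is given here, assuming hypothetical EB estimates for five sites; the values and the function names `average_ranks` and `spearman` are illustrative and do not come from the study. Rank 1 is assigned to the site with the largest estimate.

```python
from statistics import mean


def average_ranks(values):
    """Rank values (1 = largest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical EB estimates for the same five sites under each model.
eb_fixed = [4.2, 1.1, 7.8, 2.5, 3.3]
eb_varying = [4.0, 1.3, 8.1, 3.4, 3.0]

rho = spearman(eb_fixed, eb_varying)
print(round(rho, 3))  # 0.9: only the 3rd- and 4th-ranked sites swap places
```

A coefficient close to 1, as in Figure 11, indicates that the two models order the sites almost identically even when the individual estimates differ.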

Reviewer #1: The topic is an important one and the overall approach is reasonable, so I would like to see the paper improved and salvaged. I am focusing on major issues only in this review because the paper will, with the additional work, need another round of reviewing before it can be considered further. At the very least, if my concerns turn out to be unfounded, I would be interested to see the authors' response, before vetting the paper.

Response: We thank the reviewer for the thoughtful comments. We believe the comments significantly improved the paper.

Here are my concerns:

1. Testing various models on the data from which they were calibrated is not considered acceptable statistical practice, especially when assessing model performance by comparing the sums of observations and predictions.

Response: The comparison of the observed and predicted values is only one of several tools that we have used in this assessment. We have added a discussion of this issue and included CURE plots as a separate section of the paper. We have also added other goodness-of-fit statistics for evaluating the models.

2. The authors use only one data set for their investigation, so it is difficult to make such sweeping conclusions as have been made.

Response: Since we are not examining the predictive capabilities of statistical models (e.g., see Xie et al., AA&P, in press), using one dataset is adequate for this kind of exercise. This is no different from what other researchers have done when examining the performance of statistical models. Several researchers have used one dataset for this kind of analysis, including previous work done by Mitra and Washington (as noted by the reviewer), Hauer, Miaou, Mannering, Sayed, Abdel-Aty, Ivan, and many others. This decision is also based on costs and resources available.

3. Theoretically, if the APM multiplier or AADT varies non-linearly over time, then it does make sense to estimate time dependent models rather than use an average AADT and a single multiplier, regardless of how we treat the dispersion parameter. So the authors' conclusion on this issue, that time dependent models may be unnecessary, is a good example of the type of conclusion that should not be drawn from such a limited empirical investigation.

Response: Given the additional analyses performed in this work, some of the conclusions have been revised. Time-specific models are preferred over aggregated models, and a varying dispersion parameter is favored over a fixed one when it is warranted or when the number of covariates is small. Many conclusions support previous work conducted on these topics. We have added some references in the paper to support these conclusions.

4. The authors argue in favor of using simple models for their investigation. I wonder if estimating well-specified models will not resolve the issue of a varying dispersion parameter. Washington and colleagues in Arizona have addressed this question, I believe. If published, that work should be discussed in the context of the authors' investigation.

* Response to Reviewers

Response: This research was conducted before the publication of Mitra and Washington. Since many models available in the literature are traffic-only models (including baseline models proposed for the upcoming Highway Safety Manual), there is a need to examine how the varying dispersion parameter affects the performance of such models. This and other papers have shown that the dispersion parameter should not be assumed to be fixed when traffic-flow-only models are used or estimated. We have added the paper by Mitra and Washington and discussed its application to our work.

5. The authors' model performance measures may not be the best ones, especially in the light of the first comment. On this issue, I have two points.

a. The models are all fitted by taking logs to linearize them. Thus, it should not be surprising if the sum of the observations does not match the sum of the predictions; so assessing model performance by comparing observations to predictions seems inappropriate. To see why, consider the worst possible model that says that every site has the same expected number of accidents, equal to the mean of the population. Assessing this model by comparing the sum of observations to the sum of predictions would suggest that this model is better than any of those that the authors have estimated, since the sum of observations would exactly equal the sum of predictions.

Response: We agree that the model is fitted in its logarithmic form and the sum should theoretically be the same. However, the model will always be used in its non-logarithmic form. When a large discrepancy exists, the model will be biased and provide inaccurate estimates. This has been shown in previous work done by one of the authors. To confirm this point, we modified Table 5 and have added Cumulative Residual (CURE) plots showing the differences between the different model outputs. The CURE method has been used extensively by other transportation safety analysts to evaluate the fit of models. Again, this is only one of the tools that safety analysts should use in model development. Due to space constraints, we did not discuss the characteristics associated with the fit in the logarithmic form of the equation.
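The CURE check referred to in this response, together with the reviewer's constant-mean example, can be sketched as follows. This is an illustrative sketch only: the ±2 standard deviation limits use the form commonly attributed to Hauer and Bamfo, which is an assumption here since the text does not state the formula, and the AADT and crash values are hypothetical.

```python
import math
from statistics import mean


def cure_plot_data(covariate, observed, predicted):
    """Return (covariate, cumulative residual, lower, upper) rows for a CURE plot.

    Sites are sorted by the covariate. The limits are +/- 2 * sigma*, with
    sigma*(n)^2 = s2(n) * (1 - s2(n) / s2(N)), where s2(n) is the running sum
    of squared residuals (assumed form, not taken from the paper).
    """
    order = sorted(range(len(covariate)), key=lambda i: covariate[i])
    residuals = [observed[i] - predicted[i] for i in order]
    total_sq = sum(r * r for r in residuals)

    rows, cum_res, cum_sq = [], 0.0, 0.0
    for i, r in zip(order, residuals):
        cum_res += r
        cum_sq += r * r
        if total_sq > 0:
            # Guard against tiny negative values caused by rounding.
            sigma = math.sqrt(cum_sq * max(0.0, 1.0 - cum_sq / total_sq))
        else:
            sigma = 0.0
        rows.append((covariate[i], cum_res, -2.0 * sigma, 2.0 * sigma))
    return rows


# Hypothetical data: crash counts that increase with major-road AADT.
aadt = [5000, 12000, 20000, 35000, 60000]
observed = [1, 2, 4, 6, 12]

# The reviewer's "worst possible model": every site gets the population mean.
predicted = [mean(observed)] * len(observed)

rows = cure_plot_data(aadt, observed, predicted)
for flow, cum, lower, upper in rows:
    print(f"{flow:>6}  {cum:6.1f}  [{lower:6.2f}, {upper:6.2f}]")
```

In this toy case the sums of observations and predictions match exactly, so the final cumulative residual is zero, yet the cumulative residuals drift to -8 at mid-range flows. A comparison of totals alone cannot reveal that pattern.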

b. It should not be surprising that the sums of the EB estimates are similar for the models and quite close to the sums of the observations. I believe that this similarity should be theoretically so. (That one model does not perform well with this test suggests to me that there may be an error in the calculations.) Even if this were not theoretically so, the worst possible model will again do best with this test. Consider one that is so bad that the model prediction weight is zero for the EB estimate. Then the accident count has a weight of 1, which would guarantee that the sum of the EB estimates would equal the sum of the observations.

Response: We agree and have noted an error in the spreadsheet. We have removed this part of the analysis from the paper. Thank you for noting this problem.
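The reviewer's argument in comment 5(b) can be illustrated with a small sketch of the EB calculation. It assumes the usual Poisson-gamma form, with variance mu + mu^2/phi and weight w = 1/(1 + mu/phi), where phi is the inverse dispersion parameter. This form, the function names, and all numbers are assumptions for illustration rather than values from the paper. Under a varying dispersion model, phi simply differs from site to site.

```python
def eb_weight(mu, phi):
    """Weight given to the model prediction (assumed form: 1 / (1 + mu / phi))."""
    return 1.0 / (1.0 + mu / phi)


def eb_estimate(mu, y, phi):
    """EB estimate: weighted average of the prediction mu and the observed count y."""
    w = eb_weight(mu, phi)
    return w * mu + (1.0 - w) * y


# Hypothetical predictions and observed crash counts for three sites.
mu = [1.5, 3.0, 0.8]
y = [4, 1, 0]

# Fixed dispersion: one phi for all sites. Varying dispersion: one phi per site.
phi_fixed = [1.3, 1.3, 1.3]
phi_varying = [2.0, 0.9, 0.4]

eb_fixed = [eb_estimate(m, obs, p) for m, obs, p in zip(mu, y, phi_fixed)]
eb_varying = [eb_estimate(m, obs, p) for m, obs, p in zip(mu, y, phi_varying)]

# Degenerate case from comment 5(b): the prediction weight is (almost) zero,
# so each EB estimate collapses onto the observed count.
phi_tiny = [1e-12] * len(mu)
eb_degenerate = [eb_estimate(m, obs, p) for m, obs, p in zip(mu, y, phi_tiny)]

print("fixed:     ", [round(v, 3) for v in eb_fixed])
print("varying:   ", [round(v, 3) for v in eb_varying])
print("degenerate:", sum(eb_degenerate), "vs observed total", sum(y))
```

With the weight driven toward zero, the sum of the EB estimates matches the sum of the observations regardless of how poor the predictions are, which is why agreement of these sums says little about model quality.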

Given these concerns, I suggest that the authors consider more appropriate model performance measures and use data other than the calibration set.

Response: To summarize, using one dataset is adequate for the objective of this study. We have included additional measures to evaluate the performance of the models, including GOF measures proposed by Mitra and Washington. We believe that all the different criteria presented in the paper are helpful for determining the most suitable models, whether they are developed using a fixed or a varying dispersion parameter, and how they affect the EB estimates.

Comparative Evaluation of Crash Hotspot Identification Methods: Empirical Bayes vs. Potential for Safety Improvement Using Variants of Negative Binomial Models

Article

Full-text available

Feb 2024

The empirical Bayes (EB) method is widely acclaimed for crash hotspot identification (HSID), which integrates crash prediction model estimates and observed crash frequency to compute the expected crash frequency of a site. The traditional negative binomial (NB) models, often used to estimate crash predictive models, typically struggle with accounting for the unobserved heterogeneity in crash data. Complex extensions of the NB models are applied to overcome these shortcomings. These techniques also present new challenges, for instance, applying the EB procedures, especially for out-of-sample data. This study applies a random parameter negative binomial (RPNB) model within the EB framework for HSID using out-of-sample data, comparing its performance with a varying dispersion parameter NB model (VDPNB). The research also evaluates the potential for safety improvement (PSI) scores for both models and compares them with EB estimates using three generalised criteria: high crashes consistency test (HCCT), common sites consistency test (CSCT), and absolute rank differences test (ARDT). The results yield dual insights. Firstly, the study highlights associations between crash covariates and frequency, emphasising the significance of roadway geometric design characteristics (e.g., lane width, number of lanes, and parking type) and traffic volume. Some variables also influenced overdispersion parameters in the VDPNB model. In the RPNB model, annual average daily traffic (AADT) and lane width emerged as random parameters. Secondly, the HSID performance assessment revealed the superiority of the EB method over PSI. Notably, the RPNB model, compared to the VDPNB, demonstrates superior performance in EB estimates for HSID with out-of-sample data. This research recommends adopting the EB method with RPNB models for robust HSID.

Empirical Bayes application on low-volume roads: Oregon case study

Article

Full-text available

Dec 2021
J SAFETY RES

Introduction: This paper investigates the Empirical Bayes (EB) method and the Highway Safety Manual (HSM) predictive methodology for network screening on low-volume roads in Oregon. Method: A study sample of around 870 miles of rural two-lane roadways with extensive crash, traffic and roadway information was used in this investigation. To understand the effect of low traffic exposure in estimating the EB expected number of crashes, the contributions of both the observed and the HSM predicted number of crashes were analyzed. Results and conclusions: The study found that, on low-volume roads, the predicted number of crashes is the major contributor in estimating the EB expected number of crashes. The study also found a large discrepancy between the observed and the predicted number of crashes using the HSM procedures calibrated for the state of Oregon, which could partly be attributed to the unique attributes of low-volume roads that are different from the rest of the network. However, the expected number of crashes for the study sample using the HSM EB method was reasonably close to the observed number of crashes over the 10-year study period. Practical Applications: Based on the findings, it can still be very effective to use network screening methods that rely primarily on risk factors for low-volume road networks. This is especially applicable in situations where accurate and reliable crash data are not available.

Evaluation Of The Effectiveness Of Neural Network Models In The Modeling Of Intra-City Highway Accidents Journal of LandscapeEcology

Article

Full-text available

Jan 2023

This paper discusses research evaluating the efficiency of neural network models in modeling inner-city highway accidents. As a case study, accident data from urban highways in Mashhad and variables related to traffic flow and road geometry are used as input variables for neural network modeling. Neural network modeling involves three steps: determining the neural network architecture, determining the transfer functions, training and errors, and creating neural network models. In this research, two neural network models were presented to estimate the number of financial and fatal accidents on inner-city highways. To evaluate the efficiency and accuracy of the models, the number of accidents estimated by the models was compared with the observed number and the value of R was used. On this basis, the presented models are suitable for estimating the number of financial and fatal accidents.

Evaluating The Effectiveness Of Fuzzy Logic In Modeling Inner-City Highway Accidents Journal of LandscapeEcology

Article

Full-text available

Jan 2023

In this study, the effectiveness of fuzzy models in modelling inner-city highway accidents is evaluated using the Mashhad highway accident data as a case study. For modelling based on fuzzy logic, the variables related to traffic flow and road geometry are used as input variables. Fuzzy modelling involves four stages: Fuzzification of input and output variables, generation of rules, combination and collection of diagrams and de-fuzzification. To fuzzify the variables in the scatter plot, the concept of statistical quantiles is used to assign linguistic terms such as low, medium or high. Based on fuzzy logic, this paper presents two models for predicting the number of financial and fatal accidents on inner-city highways. By comparing the accident numbers estimated by the models with the observed numbers, the efficiency and accuracy of the models can be evaluated by the correlation coefficient R^2. In order to create rules for fuzzy modelling, the results of previous studies were used to identify the factors that influence the occurrence of financial and fatal accidents on inner-city motorways. By demonstrating the effectiveness of the fuzzy logic models created in predicting the number of accidents, the results of these studies can be confirmed.

Developing Safety Performance Functions for Commercial Motor Vehicle Crashes at Interchange Ramp Segments in Kentucky

Article

Mar 2023

Compared with roadway segments and intersections, the safety of interchange ramp segments has not been studied extensively, especially in the context of commercial motor vehicles (CMVs). The main objective of this study was to develop a safety performance function (SPF) tool for predicting CMV crashes occurring on interchange ramp segments. Four count models, including the negative binomial (NB), heterogeneous NB (HTNB), standard Conway–Maxwell–Poisson (CMP), and heterogeneous Conway–Maxwell–Poisson (HTCMP), were used and compared while fitting CMV crash-specific SPFs along interchange ramp segments in Kentucky. The HTCMP model, which is an extension of the standard CMP model, is a more flexible approach that handles both over-dispersed and under-dispersed crash data while exhibiting varying dispersion parameters. Five-year (2015 to 2019) CMV-related crashes along Kentucky’s ramp segments were used. The model comparison results showed that the HTCMP significantly outperformed the other three models in crash prediction accuracy and goodness-of-fit statistics (e.g., the Akaike information criterion, Bayesian information criterion, and McFadden’s Pseudo R-squared). The SPF model results using the HTCMP approach indicated that on-ramps (relative to off-ramps), ramp annual average daily traffic, ramp configuration, left shoulder width, ramp gore length, absence of left roadside barrier, and presence of other merging or diverging ramps within the ramp of interest were significantly associated with CMV crash frequency on ramp segments. Potential safety countermeasures were proposed, for example, increasing ramp gore length to be at least 730 ft (since this was associated with a reduction in CMV crashes on ramp segments).

Application of Bayesian Semi-Parametric and Hierarchical Models for Analyzing Dispersed Traffic Barriers Crash Data

Article

Full-text available

Sep 2022
Open Transport J

Introduction Despite the traffic barriers effectiveness in reduction of the severity of run-off road crashes, the severity of barrier crashes still accounts for a significant fraction of road fatalities. Although extensive research has already been conducted in studying traffic barrier crashes, those studies mostly either consider the severity or frequency of crashes. Here, the equivalent property damage only (EPDO) was used to account for both aspects of crashes. While modeling EPDO crashes, there are challenges associated with that type of dataset including its sparse distribution, and the presence of heterogeneity in the dataset due to aggregation of various crash types. Methods Ignoring the sparse nature of the data might result in biased or even erroneous results. Thus, in this study we identify factors to barriers EPDO crashes while considering the discussed challenges. Those consideration are especially important as in the next step we will employ the modeling results for conducting the cost-benefit analysis. Two main methods were considered in this study to address the discussed challenges including parametric and non-parametric Bayesian hierarchical models. A semiparametric Bayesian approach was used to relax the normality assumption by using a mixture of multivariate Dirichlet prior, defining a flexible nonparametric model for the random effects’ distribution, and using grouping to account for the heterogeneity due to the structure of the dataset. On the other hand, Bayesian hierarchical models with two distributions of Poisson and negative binomial with similar levels of hierarchy were considered. These models were chosen as closest models to the Bayesian semiparametric model. The incorporated models were compared in terms of deviance information criterion (DIC). 
Results and Discussion The results highlighted that although the semi-parametric method outperforms the Bayesian hierarchical model with Poisson distribution, the Bayesian hierarchical model with negative binomial (NB) distribution outperform the semi-parametric model. The findings might be related to the severe sparse nature of the EPDO, which cannot optimally be accounted by semiparametric approach, and the model needs more flexibility. Conclusion It was found that being unrestrained, driving in interstate system, driving in clear weather, light conditions, and driving in a higher traffic all increase the likelihood of EPDO crashes. Also, while some predictors were significant in less accommodative models of semi-parametric or Poisson models, they were not for Negative binomial model.

Spatio-temporal analysis of road traffic accidents in Tunisia

Conference Paper

May 2022

Comprehensive Investigation of Commercial Motor Vehicle Crashes along Roadway Segments in Kentucky

Article

Mar 2022

The negative binomial (NB) model, traditionally used for safety performance function (SPF) development, suffers from a fixed over-dispersion parameter and is only valid for over-dispersed data (i.e., data exhibiting greater variance than the mean). A more flexible approach that handles over-dispersed data, under-dispersed data, and excess zero counts, in addition to exhibiting varying dispersion parameter as a function of site-specific characteristics, is the zero-inflated heterogeneous Conway–Maxwell–Poisson (ZI-HTCMP) model, which is an extension of Conway–Maxwell–Poisson (CMP)-based models. This study develops fatal + injury (FI) commercial motor vehicle (CMV) crash-specific SPFs along four roadway segment facilities in Kentucky, U.S., (urban multilane, rural multilane, urban two-lane, and rural two-lane segments). The traditional NB and newly introduced CMP-based models—ZI-HTCMP, zero-inflated Conway–Maxwell–Poisson (ZI-CMP), heterogeneous Conway–Maxwell–Poisson (HTCMP), and CMP—were compared using 14,967 CMV-related crashes on Kentucky’s road segments (between 2015 and 2019) and various roadway variables, for example, shoulder width, annual average daily traffic (AADT), and heavy vehicle percentage (HVP). From the developed SPFs, AADT and HVP >10% significantly increased FI CMV-related crashes on all four segment facilities. Various goodness-of-fit (GOF) statistics, including Akaike information criterion (AIC), mean absolute deviance (MAD), and mean square prediction error (MSPE), were used for model assessment and selection. For all four roadway facilities, CMP-based models showed better model fitting and prediction performance than the NB models. Furthermore, ZI-HTCMP was the best-fit model for urban multilane segments, which had high representation of zero-crash sites. CMP-family models could be used for effectively predicting FI CMV-related crashes (with excess zeros) on road segments.

Developing Commercial Motor Vehicle Crash-Specific Safety Performance Functions at Interchange Ramp Terminals in Kentucky

Article

Aug 2022

Commercial motor vehicle (CMV) drivers often experience difficulties in turning and crossing at ramp terminals. CMV-involved crashes could cause queue spillback at ramp terminals and possibly nearby freeway mainline and crossroads. The safety and mobility of ramp terminals for accommodating CMVs is thus of significant importance. However, interchange ramp terminals have rarely been investigated in previous safety studies, especially with regard to CMV-involved crashes. This study aimed at examining CMV-related crashes that occurred at ramp terminals. Heterogeneous negative binomial (HTNB) and traditional negative binomial (NB) models were fitted and compared while developing a safety performance function (SPF) for predicting CMV crashes and identifying high-risk ramp terminals. Information on crash history and site-specific characteristics was collected at 285 ramp terminals in the state of Kentucky between 2015 and 2019. The results showed that the HTNB model outperformed the NB model in relation to various goodness-of-fit measures (e.g., likelihood ratio test “LRT”, Akaike information criterion “AIC” and McFadden Pseudo R-squared statistic). The predicted crash frequencies while applying the HTNB model were then used to identify and rank high-risk ramp terminals. Ramp terminals with signalized traffic control type, with greater average daily traffic on the exit ramp, with two or more traffic lanes on the exit ramp, and adjacent to commercial or industrial areas were found to experience more CMV crashes. Several safety countermeasures were proposed to alleviate CMV-involved crashes at ramp terminals. One example is ensuring the presence of physical medians on crossroads (or major roads) ahead of exits ramps.

Application of Emerging Data Sources for Pedestrian Safety Analysis in Charlotte, NC

Article

Jun 2022

Pedestrian safety is a growing concern for transportation planners and safety engineers at both local and state levels. Continued advancements in data availability, data integration abilities, and analysis methodologies offer new opportunities to identify factors influencing pedestrian safety and to quantify their effects to inform data-driven road safety management. The main objective of this study was to spatially integrate Highway Safety Information System data with multijurisdictional and emerging datasets to analyze two measures of pedestrian safety performance in Charlotte, NC: (1) the severity of a pedestrian crash that has occurred, and (2) the probability that a pedestrian crash will occur on a given roadway segment. To accomplish the objectives, the study explored several high-priority research topics in safety data and analysis, including pedestrian exposure analysis and probe data integration. The research team developed a pedestrian count model to predict pedestrian volumes at locations without pedestrian counts and integrated speed information from probe data to supplement other roadway and contextual transportation data available from several agencies. Pedestrian exposure at a given intersection was found to be significantly influenced by demographic and socioeconomic characteristics, employment, land use, sidewalk presence, transit access, and roadway and intersection characteristics. The project team identified numerous significant factors that influenced pedestrian crash severity and probability, including outputs from the pedestrian exposure model, observed vehicle speeds, traffic volumes, intersection proximity, and other crash-related factors. The results could be used to identify locations that are more susceptible to pedestrian safety issues.

Estimating Accidents in a Road Network

Conference Paper

Aug 1996

This paper reviews models relating accidents to traffic flows, with particular emphasis on the appropriateness of the model form and the statistical technique employed for parameter estimation. The development of generalised linear models for predicting accidents at intersections in New Zealand is then described. It is shown that the new models fit the empirical data better than existing, simpler models. The use of the models for predicting intersection accidents in three networks is described.

Generalized Linear Models and Extensions, 4th Edition

Book

May 2018

Generalized Linear Models and extensions

Safety Performance Functions for Signalized Intersections in Large Urban Areas: Development and Application to Evaluation of Left-Turn Priority Treatment

Article

Jan 2005

This paper describes the development of safety performance functions (SPFs) for 1,950 urban signalized intersections on the basis of 5 years of collision data in Toronto, Ontario, Canada. Because Toronto has one of the largest known, readily accessible, urban signalized intersection databases, it was possible to develop reliable, widely applicable SPFs for different intersection classifications, collision severities, and impact types. Such a comprehensive set of SPFs is not available for urban signalized intersections from data for a single jurisdiction, despite the considerable recent interest in use of these functions for analyses related to network screening, and the development, prioritization, and evaluation of treatments. The application of a straightforward recalibration process requiring relatively little data means that the SPFs calibrated can be used by researchers and practitioners for other jurisdictions for which these functions do not exist and are unlikely to exist for some time. The value of the functions is illustrated in an application to evaluate a topical safety measure—left-turn priority treatment for which existing knowledge is on a shaky foundation. The results of this empirical Bayes evaluation show that this treatment is quite effective for reducing collisions, particularly those involving left-turn side impacts.

Alternative Risk Models for Ranking Locations for Safety Improvement

Article

Jan 2005

Many types of statistical models have been proposed for estimating accident risk in transport networks, ranging from basic Poisson and negative binomial models to more complicated models, such as zero-inflated and hierarchical Bayesian models. However, little systematic effort has been devoted to comparing the performance and practical implications of these models and ranking criteria when they are used for identifying hazardous locations. This research investigates the relative performance of three alternative models: the traditional negative binomial model, the heterogeneous negative binomial model, and the Poisson lognormal model. In particular, this work focuses on the impact of the choice of two alternative prior distributions (i.e., gamma versus lognormal) and the effect of allowing variability in the dispersion parameter on the outcome of the analysis. From each model, two alternative accident estimators are computed by using the conditional mean under both marginal and posterior distributions. A sample of Canadian highway-railway intersections with an accident history of 5 years is used to calibrate and evaluate the three alternative models and the two ranking criteria. It is concluded that the choice of model assumptions and ranking criteria can lead to considerably different lists of hazardous locations.
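The two ranking criteria named in this abstract can be illustrated for the gamma-prior case only. The sketch below is not taken from the paper: it assumes a Poisson-gamma model with variance mu + alpha * mu^2, so that the posterior mean is the familiar weighted average of the model prediction and the observed count. The site values, the dispersion value `ALPHA`, and the function names are all hypothetical.

```python
def posterior_mean(y, mu, alpha):
    """Posterior mean for a Poisson count y with a gamma prior of mean mu.

    Assumes NB variance mu + alpha * mu**2 (alpha is the dispersion parameter).
    """
    w = 1.0 / (1.0 + alpha * mu)      # weight placed on the model prediction
    return w * mu + (1.0 - w) * y     # remainder goes to the observed count


def rank_sites(sites, key):
    """Return site ids sorted from highest to lowest value of `key`."""
    return [s["id"] for s in sorted(sites, key=key, reverse=True)]


# Hypothetical sites: mu = predicted mean, y = observed crashes.
sites = [
    {"id": "A", "mu": 2.0, "y": 0},
    {"id": "B", "mu": 1.0, "y": 5},
    {"id": "C", "mu": 1.5, "y": 2},
]
ALPHA = 0.5  # assumed fixed dispersion parameter

# Marginal criterion: rank by the model prediction alone.
rank_marginal = rank_sites(sites, key=lambda s: s["mu"])
# Posterior criterion: rank after updating with the observed count.
rank_posterior = rank_sites(
    sites, key=lambda s: posterior_mean(s["y"], s["mu"], ALPHA)
)

print("marginal :", rank_marginal)
print("posterior:", rank_posterior)
```

With these made-up numbers the two criteria reverse the order of sites A and B, which is the kind of disagreement between hazardous-location lists that the abstract describes.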

Comparison of Two Negative Binomial Regression Techniques in Developing Accident Prediction Models

Article

Jan 2006

There are several regression techniques to develop accident prediction models. Model development and subsequently the results are affected by the choice of regression technique. The objective of this paper is to compare two types of regression techniques: the traditional negative binomial (TNB) and the modified negative binomial (MNB). The TNB approach assumes that the shape parameter of the negative binomial distribution is fixed for all locations, while the MNB approach assumes that this shape parameter varies with the location's characteristics. The difference between the two approaches in terms of their goodness of fit and the identification and ranking of accident-prone locations is investigated. The study makes use of a sample of accident, volume, and geometric data corresponding to 392 arterial segments in British Columbia, Canada. Both models appear to fit the data well. However, the MNB approach provides a statistically significant improvement in model fit over the TNB approach. A total of 100 locations were identified as accident-prone by both approaches. A comparison between the ranks showed a close agreement in the general trend of ranking between the two models. While the MNB approach appears to fit the data better than the TNB approach, there was little difference in the results of the identification and ranking of accident-prone locations. This is likely due to the nature of the application and the data set used. The difference in results will depend on the extent to which deviant sites exist in the data set.
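The distinction between a fixed and a location-dependent shape parameter can be sketched numerically. The example below is an illustration under stated assumptions, not the specification used in the paper: it parameterizes the NB variance as mu + alpha * mu^2 and, for the varying case, assumes a log-linear dependence of alpha on segment length. The coefficients `g0` and `g1`, the fixed value 0.4, and the function names are hypothetical.

```python
import math


def nb_variance(mu, alpha):
    """NB variance under the assumed parameterization Var = mu + alpha * mu**2."""
    return mu + alpha * mu ** 2


def fixed_alpha(_length_km, alpha=0.4):
    """Traditional case: one dispersion value shared by every location."""
    return alpha


def varying_alpha(length_km, g0=-0.9, g1=-0.5):
    """Varying case (assumed form): ln(alpha) = g0 + g1 * ln(length)."""
    return math.exp(g0 + g1 * math.log(length_km))


MU = 3.0  # same predicted mean at every hypothetical segment
for length_km in (0.25, 1.0, 4.0):
    a_fixed = fixed_alpha(length_km)
    a_vary = varying_alpha(length_km)
    print(
        f"L={length_km:>4} km  "
        f"fixed: alpha={a_fixed:.3f} var={nb_variance(MU, a_fixed):.2f}  "
        f"varying: alpha={a_vary:.3f} var={nb_variance(MU, a_vary):.2f}"
    )
```

Under the varying form, two segments with the same predicted mean can carry different variances, so any site-level estimate that depends on the variance function would differ between the two approaches even when the mean function is identical.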

Regression Analysis of Count Data

Article

Nov 1999
TECHNOMETRICS

Estimating accidents at junctions using routinely available input data

Article

Jan 1996

Modeling Traffic Crash-Flow Relationships for Intersections: Dispersion Parameter, Functional Form, and Bayes Versus Empirical Bayes Methods

Article

Jan 2003

Statistical relationships between traffic crashes and traffic flows at roadway intersections have been extensively modeled and evaluated in recent years. The underlying assumptions adopted in the popular models for intersections are challenged. First, the assumption that the dispersion parameter is a fixed parameter across sites and time periods is challenged. Second, the mathematical limitations of some functional forms used in these models, particularly their properties at the boundaries, are examined. It is also demonstrated that, for a given data set, a large number of plausible functional forms with almost the same overall statistical goodness of fit (GOF) is possible, and an alternative class of logical formulations that may enable a richer interpretation of the data is introduced. A comparison of site estimates from the empirical Bayes and full Bayes methods is also presented. All discussions and comparisons are illustrated with a set of data collected for an urban four-legged signalized intersection in Toronto, Ontario, Canada, from 1990 to 1995. In discussing functional forms, the need for some goodness-of-logic measures, in addition to the GOF measure, is emphasized and demonstrated. Finally, analysts are advised to be mindful of the underlying assumptions adopted in the popular models, especially the assumption that the dispersion parameter is a fixed parameter, and the limitations of the functional forms used. Promising directions in which this study may be extended are also discussed.

Generalized Estimating Equations

Chapter

Sep 2008

Correlated datasets develop when multiple observations are collected from a sampling unit (e.g., repeated measures of a bank over time, or hormone levels in a breast cancer patient over time), or from clustered data where observations are grouped based on a shared characteristic (e.g., observations on different banks grouped by zip code, or on cancer patients from a specific clinic). The generalized linear model framework for independent data is extended to model correlated data via the introduction of second-order variance components directly into the independent data model's estimating equation. This generalization of the estimating equation from the independence model is thus referred to as a Generalized Estimating Equation (GEE). This article discusses the foundation of GEEs as well as how user-specified correlation structures are accommodated in the model-building process. It also discusses the relationship and similarity to the underlying generalized linear model framework and points out alternative approaches to GEEs for modeling correlated data, such as fixed-effects models and random-effects models.

Keywords: working correlation matrix; sandwich estimate of variance; generalized linear models; subject-specific models; population-averaged models

Classical and Modern Regression With Applications.

Article