Urban Science
The Statistical Modeling Process
Prepared By: Dr. James Wendelberger
Director of Statistical Analysis
Urban Science
14:48 MST 25 March 1996
Urban Science Statistical Report Number: 15
Abstract
The standard statistical modeling process is divided into three phases. The three phases are: the setup phase, the analysis phase and the reporting phase. Each phase is divided into steps. The steps are described in detail in this report.
Purpose
The purpose of this report is to disseminate information about the statistical modeling process.
1. Introduction
The purpose of this report is to disseminate information about the standard statistical modeling process. The statistical modeling process differs from the mathematical modeling process in that it introduces error due to random variables. For a discussion of the mathematical modeling process see Fowkes and Mahony (1994). The standard statistical modeling process includes the setup phase, the analysis phase and the reporting phase. Each of these three phases is described in the following three sections.
2. The Setup Phase
The setup phase of the standard statistical modeling process consists of answering the questions in the following steps.
Step 1: What is the business goal?
Step 2: What data are available?
Step 3: What does data acquisition cost?
Step 4: What industry specific information is required?
Step 5: Project feasible?
Step 6: What project?
2.1 Setup Phase: What is the business goal?
The business goal may be vague or concrete. A vague goal would be to make money in a project. A specific or concrete goal would be: for a fixed amount of money, increase the response rate by a certain percentage over what it would have been using the usual company process. It is important that the business goal be measurable. The more specific the business goal, the easier it is to ascertain whether or not the project has accomplished it.
2.2 Setup Phase: What data are available?
At this point it is important to identify all data that could help accomplish the business goal. Sometimes the data are already in hand. If the data are from an observational study then extreme caution is in order when making inferences about the relationships of the variables outside the sample. Ideally, the data are from a designed experiment, in which case we may be able to make inferences about the variable relationships in general. The data may be available only by collecting the information ourselves. Again, caution is in order so that the information collected is in fact useful and not rendered useless by faulty collection practices. Faulty collection practices include poor survey design and noncoverage problems. Noncoverage occurs when our sample is from one region of the variable space and we make inferences about another region of the space, for example, sampling only good dealers and then making inferences about poor dealers. The data may be available through a third party. In this case, it is imperative that one knows exactly what the data are, how they were collected and whether any changes have been made to the raw data. It is often the case that the data consist of a mixture of in hand, self collectible and third party sources.
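A noncoverage problem of the kind described above can be screened for mechanically. The following is a minimal sketch, assuming every variable is numeric and that coverage is judged only by the range each variable takes in the sample. The function name `uncovered_variables`, the variable names and the dealer figures are hypothetical and chosen purely for illustration.

```python
def uncovered_variables(sample_rows, target_row):
    """Return the variables whose target value lies outside the sampled range."""
    outside = []
    for name, value in target_row.items():
        observed = [row[name] for row in sample_rows]
        if not (min(observed) <= value <= max(observed)):
            outside.append(name)
    return outside


# Hypothetical sample drawn only from "good" (high-volume) dealers.
sample = [
    {"monthly_sales": 120, "staff": 14},
    {"monthly_sales": 150, "staff": 18},
    {"monthly_sales": 135, "staff": 16},
]

# Hypothetical "poor" dealer about which we would like to make inferences.
poor_dealer = {"monthly_sales": 30, "staff": 15}

print(uncovered_variables(sample, poor_dealer))
```

A per-variable range check is only a first screen. A point can fall inside every individual range and still lie in a region of the joint variable space that the sample never visited.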
2.3 Setup Phase: What does data acquisition cost?
Data costs vary considerably. In hand data is usually least expensive. Data collected specifically for a project is usually most expensive. Third party data costs usually fall somewhere in between. The quality of the data needs to be considered as a tradeoff with cost. This quality versus cost tradeoff should be done in step number six of the setup phase.
2.4 Setup Phase: What industry specific information is required?
In most industries there is information which is peculiar to the industry and important to know about during the model building process. An industry expert should be involved in the model building process to guard against creating a model which is deficient due to lack of industry specific information. In the automotive industry this may be as simple as knowledge about the existence of registration data. Imagine the difference between well constructed models which utilize this information and data and those models which do not utilize the registration data for market share prediction.
2.5 Setup Phase: Project feasible?
Given the information in the previous steps, can a project or projects be designed to accomplish the business goal? If a project can be designed then we proceed to the next step. If a project cannot be designed then we need to change something in the previous steps. The business goal may need to be modified, other data may need to be found, negotiations on data cost may be in order or an industry expert may help. After changes have been made we ask again if there is a feasible project. When at least one feasible project is found we move on to step number six.
2.6 Setup Phase: What Project?
Which of the projects, if any, in step 5 is to be undertaken? A cost benefit analysis is required here. This analysis may prove useful in selling the project externally. It will be hard to know the cost of poor data quality. For this reason this step should include a modeling expert to provide the expert's opinion on what level of quality is needed to maximize the probability of the project accomplishing the business goal. For a similar reason the industry expert should be involved in this step as well. It may be that the setup phase requires knowledge of the analysis or reporting phases. For example, if the specific software package to accomplish the model building is predetermined then that may limit the possible models. Or, for example, if the model reporting is to be performed on an underpowered machine in real time then that also may limit the possible models. Each of these examples will impact the number of possible projects available in the setup phase. Because of these interdependencies all of those involved in the project should be involved in this step of the process.
If the answer to step 6 provides a project then we can proceed to the next phase of the standard statistical modeling process, the analysis phase.
3. The Analysis Phase
The analysis phase assumes that a project has been determined in the setup phase.
The analysis phase of the standard statistical model building process consists of the following steps.
Step 1: Establish the Modeling Goal
Step 2: Acquire the Data
Step 3: Check the Data
Step 4: Decide on the Class of Models
Step 5: Estimate the Parameters of the Model
Step 6: Validate the Model Internally
Step 7: Validate the Model Externally
3.1 Analysis Phase: Establish the Modeling Goal
The first analysis step, establishing the modeling goal, draws on relevance to the business goal of the project determined in the setup phase, the intuition of the modeler, and consideration of constraints on data availability, money, time and other resources. The modeling goal must be reasonable. Previous modeling goals may provide insight as to a reasonable current modeling goal. After these considerations a modeling goal is selected.
3.2 Analysis Phase: Acquire the Data
The second analysis step, acquiring the data, requires project data specification and investigation of data sources, costs and availability. It may be necessary to collect the data yourself. The data must contain information which is relevant to achieving the modeling goal. The selection of data may be made in connection with the model feasibility considered in step number four of the model building process.
3.3 Analysis Phase: Check the Data
The analysis step number three is to check the data. This is one of the most important stages of the process as "garbage in garbage out" applies. The data must have internal logical consistency. The data must be consistent with external facts. The data should be checked to verify that it corresponds to specifications. Things to check here include: missing data codes, incorrectly specified dummy variables, outliers, influential data, typographical errors and transposition errors.
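Several of the checks listed above can be automated. The following is a minimal sketch, assuming records arrive as dictionaries, that missing values are flagged by sentinel codes, and that the data specification gives an allowed range for a numeric field. The field names, the sentinel codes in `MISSING_CODES` and the range limits are hypothetical; a range check of this kind catches only gross outliers and typographical errors, not influential data.

```python
# Assumed sentinel codes used by the (hypothetical) data supplier for missing values.
MISSING_CODES = {-999, 9999}


def check_records(records, low, high):
    """Return (row index, problem description) pairs for suspicious records."""
    problems = []
    for i, rec in enumerate(records):
        value = rec.get("sales")
        if value is None or value in MISSING_CODES:
            problems.append((i, "missing data code"))
        elif not (low <= value <= high):
            problems.append((i, "outside specified range"))
        # A dummy (indicator) variable should only ever take the values 0 or 1.
        if rec.get("is_urban") not in (0, 1):
            problems.append((i, "dummy variable not 0/1"))
    return problems


records = [
    {"sales": 120, "is_urban": 1},
    {"sales": -999, "is_urban": 0},    # missing data code
    {"sales": 15000, "is_urban": 0},   # possible typographical error or outlier
    {"sales": 95, "is_urban": 2},      # incorrectly specified dummy variable
]

issues = check_records(records, low=0, high=1000)
for row, reason in issues:
    print(row, reason)
```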
3.4 Analysis Phase: Decide on the Class of Models
The analysis step number four is to decide on the class of models. One method of doing this is to emulate the data generation process. Another is to use past experience of models in the application area. The feasibility of model implementation at a later stage, such as production, must be considered at this stage. The model should provide "maximal" attainment of the modeling goal. The data must yield the information required to achieve the modeling goal. Costs are also considered at this stage. These costs include: model selection, estimation of parameters and implementation costs.
The model entertained may be of many kinds. To convey how large the number of possible models is, a list of model types is provided. There may be some overlap between certain of these model types (they are not mutually exclusive of one another). Possible model types are: Basis Function Estimation, Bayesian Analysis, Classification and Regression Trees, Clustering, Descriptive or Summary Statistics, Differential Equations, Discriminant Analysis, Fourier Analysis, Fractal Techniques, Genetic Algorithms, Graphical Methods, Kalman Filtering, Linear Methods (GLIM), Link Function Regression Analysis, Multiple Linear Regression Analysis, Neural Networks, Non-Parametric Function Estimation, Parametric Estimation and Models, Pattern Recognition, Polynomial Fitting, Projection Pursuit, Regularization Methods, Segmentation, Spline Estimation, Stochastic Models, Time Series Analysis (ARIMA and Transfer Functions), Transformations of Data, Wavelets and Custom Combinations of the above.
An example of custom combinations is the use of Bayesian Analysis and Neural Networks to provide a posterior distribution which could be used to estimate the probability of response, see Weigend and Srivastava (1995).
The model needs to handle any unusual things uncovered in the check data step. This includes how to handle missing values and outliers.
3.5 Analysis Phase: Estimate the Parameters of the Model
Analysis step number five consists of the determination of model parameters. This is called parameter estimation or model calibration. Methods to accomplish this are: Ad Hoc (Heuristic or Made Up), Maximum Likelihood, Posterior Density Estimation, Entropy Maximization and the Method of Moments.
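Different estimation methods can give different answers for the same model and data. The following is a minimal sketch comparing two of the methods named above, the Method of Moments and Maximum Likelihood. The Uniform(0, theta) model is an assumption chosen only because both estimators have simple closed forms; it is not a model discussed in this report, and the data values are made up.

```python
def method_of_moments_theta(data):
    """Method of Moments: E[X] = theta / 2 for Uniform(0, theta), so theta = 2 * mean."""
    return 2 * sum(data) / len(data)


def maximum_likelihood_theta(data):
    """Maximum Likelihood: the likelihood theta**(-n) is defined for
    theta >= max(data) and decreases in theta, so it is maximized at max(data)."""
    return max(data)


# Made-up observations assumed to come from Uniform(0, theta).
data = [1.0, 2.0, 4.0, 5.0]

print("method of moments: ", method_of_moments_theta(data))
print("maximum likelihood:", maximum_likelihood_theta(data))
```

Here the two estimates differ (6.0 versus 5.0), which illustrates why the choice of estimation method is part of the modeling decision rather than a mechanical afterthought.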
3.6 Analysis Phase: Validate the Model Internally
Analysis step number six consists of validating the model internally. This step requires checking internal consistency and validity. Internal consistency includes standard statistical model diagnostic checks, for example, plotting the residuals against each independent variable. Note that the residuals should not be plotted against the dependent variable(s). Internal validity consists of the "Laugh Test" and comparison to a holdout set. The "Laugh Test" asks the question: does the output of the model make one laugh or grimace in awe or disbelief? If so, then the model fails the laugh test. If not, then the model passes the laugh test. The comparison to the holdout set involves comparing the model predictions to a set of data which is set aside for the purpose of this internal validity check. If the model fails these checks then this information should be recorded and the process should be redone, possibly starting at analysis step number one. Internal model verification is critical in this step.
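The residual and holdout checks described above can be sketched as follows. This is a minimal illustration, assuming a straight-line model with one independent variable fitted by ordinary least squares and root mean squared error as the comparison measure. The function names and all data values are hypothetical, and the report does not prescribe a particular error measure.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(intercept, slope, xs, ys):
    """Root mean squared prediction error on the given data."""
    errors = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return (sum(e * e for e in errors) / len(errors)) ** 0.5


# Made-up data: the holdout set is set aside and never used for fitting.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
hold_x = [7, 8]
hold_y = [14.1, 15.8]

intercept, slope = fit_line(train_x, train_y)

# Residuals paired with the independent variable, ready to be plotted against it.
residuals = [(x, y - (intercept + slope * x)) for x, y in zip(train_x, train_y)]

train_rmse = rmse(intercept, slope, train_x, train_y)
hold_rmse = rmse(intercept, slope, hold_x, hold_y)
print("training RMSE:", round(train_rmse, 3))
print("holdout RMSE: ", round(hold_rmse, 3))
```

A holdout error far larger than the training error would suggest the model fails the internal validity check. How large a gap is acceptable is a judgment for the analyst.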
3.7 Analysis Phase: Validate the Model Externally
Analysis step number seven consists of checking the model for external validity and consistency. This may include comparison to a holdout sample which has not been previously utilized or inspected by the analyst. It may also include testing against live or more current data if available. If the model passes this external validity test the model is "blessed" and ready for use. We have accomplished the modeling goal. If the model is invalid according to the external validity and consistency test then we will be unable to "bless" the model. In order to proceed to accomplish the modeling goal we require more data in order to do another external validity and consistency check. This data may be either live, more current or past data which was not previously available or utilized.
test, and other logical tests such as probabilities and market shares being non-negative and less than one.
4.5 Reporting Phase: External Validity
The fifth step of the reporting phase of the standard statistical model building process is the external model validity and consistency check. External verification of the model is available after the analysis system has made its predictions. Predictions are made before the results are known. These predictions may be at an individual prediction level or at an aggregate level. When results become known the predictions are compared to the actual values. This comparison can be made by a distributional comparison. Similarly, aggregate predictions can be compared to actual aggregate values (such as market share). As live data becomes available this step should be repeatedly performed. This is an opportunity to verify the entire system.
This external validity check is the ultimate test of the model. Passage of this test allows one to have confidence in use of the model.
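The aggregate comparison described above can be sketched as follows. This is a minimal illustration, assuming the model produces an individual probability of response for each case and that the actual outcomes later become known as 0 or 1. The function name `aggregate_comparison` and all values are hypothetical, and deciding how large a gap is acceptable is left to the analyst.

```python
def aggregate_comparison(predicted_probs, outcomes):
    """Compare the predicted aggregate rate with the observed aggregate rate."""
    predicted_rate = sum(predicted_probs) / len(predicted_probs)
    actual_rate = sum(outcomes) / len(outcomes)
    return predicted_rate, actual_rate, abs(predicted_rate - actual_rate)


# Made-up predictions, recorded before the results were known.
predicted = [0.2, 0.7, 0.4, 0.9, 0.3]

# Made-up outcomes observed afterwards (1 = responded, 0 = did not respond).
actual = [0, 1, 0, 1, 1]

predicted_rate, actual_rate, gap = aggregate_comparison(predicted, actual)
print("predicted rate:", round(predicted_rate, 3))
print("actual rate:   ", round(actual_rate, 3))
print("gap:           ", round(gap, 3))
```

Because the step is meant to be repeated as live data arrives, the same function can be rerun on each new batch of outcomes and the gaps tracked over time.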
5. Conclusions
The steps in this report are meant as a guide for those attempting to develop a statistical model. In practice, steps may be undertaken at different levels of involvement. However, it is important that all steps be considered. If any steps are omitted then the potential effect(s) of omission should be communicated to those responsible for the omission.
The statistical modeling process is an art as well as a science. There are many factors which may not have been accounted for by the procedures described in this report. A knowledgeable expert is needed to sort out the importance of considering these other factors.
For further information please contact Dr. James G. Wendelberger at Telephone Voice: (0101) 505 662 2838 or Telephone Facsimile: (0101) 505 662 2841.
6. Reference
1. Fowkes, N. D. and J. J. Mahony (1994), An Introduction to Mathematical Modelling, John Wiley and Sons, Inc., New York.