Urban Science 15 - The Statistical Modeling Process - image

March 1996

DOI:10.13140/RG.2.2.11060.17280

License
CC BY-NC-ND 4.0

Authors:

James Wendelberger

Los Alamos National Laboratory

The standard statistical modeling process is divided into three phases: the setup phase, the analysis phase and the reporting phase. Each phase is divided into steps, which are described in detail in this report. The purpose of this report is to disseminate information about the standard statistical modeling process. The statistical modeling process differs from the mathematical modeling process by the introduction of error due to random variables. For a discussion of the mathematical modeling process see Fowkes and Mahony (1994). Each of the three phases is described in the following three sections.

Urban Science

The Statistical Modeling Process

Prepared By: Dr. James Wendelberger

Director, Statistical Analysis

Urban Science

14:48 MST, March 1996

Urban Science Statistical Report Number:

Abstract

The standard statistical modeling process is divided into three phases. The three phases are: the setup phase, the analysis phase and the reporting phase. Each phase is divided into steps. The steps are described in detail in this report.

Purpose

The purpose of this report is to disseminate information about the statistical modeling process.

1. Introduction

The purpose of this report is to disseminate information about the standard statistical modeling process. The statistical modeling process is different from the mathematical modeling process by the introduction of error due to random variables. For a discussion of the mathematical modeling process see Fowkes and Mahony (1994). The standard statistical modeling process includes the setup phase, the analysis phase and the reporting phase. Each of these three phases is described in the following three sections.

2. The Setup Phase

The setup phase of the standard statistical modeling process consists of answering the questions in the following steps.

Step 1: What is the business goal?
Step 2: What data are available?
Step 3: What does data acquisition cost?
Step 4: What industry specific information is required?
Step 5: Project feasible?
Step 6: What project?

2.1 Setup Phase: What is the business goal?

The business goal may be vague or concrete. A vague goal would be to make money in a project. A specific or concrete goal would be: for a fixed amount of money increase the response rate by a certain percentage over what it would have been using the usual company process. It is important that the business goal be measurable. The more specific the business goal the easier it is to ascertain whether or not the project has accomplished the business goal.
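To make the idea of a measurable goal concrete, the following minimal sketch computes a relative lift in response rate and compares it with a target. The function names, the counts and the 20 percent target are illustrative assumptions, not figures from this report.

```python
def response_rate(responses, contacts):
    """Fraction of contacts that responded."""
    return responses / contacts


def goal_met(baseline_rate, project_rate, target_lift):
    """Return (met, lift), where lift is the relative increase over baseline."""
    lift = (project_rate - baseline_rate) / baseline_rate
    return lift >= target_lift, lift


# Hypothetical counts: usual company process vs. the project, same budget.
baseline = response_rate(200, 10_000)
project = response_rate(250, 10_000)
met, lift = goal_met(baseline, project, target_lift=0.20)
print(f"relative lift = {lift:.1%}, goal met: {met}")
```

A vague goal such as "make money" offers no comparable check, which is why the report asks for a goal that can be measured.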

2.2 Setup Phase: What data are available?

At this point it is important to find out all data that could help accomplish the business goal. Sometimes the data are already in hand. If the data are from an observational study then extreme caution is in order when making inferences about the relationships of the variables outside the sample. Ideally, the data are from a designed experiment in which case we may be able to make inferences about the variable relationships in general. The data may be available only by collection of the information ourselves. Again, caution is in order here so that the information collected is in fact useful and not useless due to faulty collection practices. Faulty collection practices would include poor survey design or noncoverage problems. An example of noncoverage occurs when our sample is from one region of the variable space and we will make inferences on another region of the space, such as sampling only good dealers and then making inferences on poor dealers. The data may be available through a third party. In this case, it is imperative that one knows exactly what the data are, how they were collected and if any changes have been made to the raw data. It is often the case that the data consist of a mixture of in hand, self collectible and third party sources.
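The noncoverage problem described above can be screened for with a simple range check. The sketch below is one possible check for a single variable, under the assumption that coverage is judged by the minimum and maximum of the sample; the dealer scores are invented for illustration.

```python
def coverage_gaps(sample, targets):
    """Return the target values that lie outside the range spanned by the sample."""
    lo, hi = min(sample), max(sample)
    return [t for t in targets if t < lo or t > hi]


# Hypothetical dealer performance scores: the sample holds only good dealers.
sampled_scores = [82, 88, 91, 95, 79]
# Dealers about which we would like to make inferences.
target_scores = [85, 40, 55, 90]

outside = coverage_gaps(sampled_scores, target_scores)
print("targets not covered by the sample:", outside)
```

A range check of this kind only looks at one variable at a time; a region of a multivariable space can be uncovered even when every single variable is within range.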

2.3 Setup Phase: What does data acquisition cost?

Data costs vary considerably. In hand data is usually least expensive. Data collected specifically for a project is usually most expensive. Third party data costs usually fall somewhere in between. The quality of the data needs to be considered as a tradeoff with cost. This quality versus cost tradeoff should be done in step number six of the setup phase.

2.4 Setup Phase: What industry specific information is required?

In most industries there is information which is peculiar to the industry and important to know about during the model building process. An industry expert should be involved in the model building process to guard against creating a model which is deficient due to lack of industry specific information. In the automotive industry this may be as simple as knowledge about the existence of registration data. Imagine the difference between well constructed models which utilize this information and data and those models which do not utilize the registration data for market share prediction.

2.5 Setup Phase: Project feasible?

Given the information in the previous steps can a project or projects be designed to accomplish the business goal? If a project can be designed then we proceed to the next step. If a project cannot be designed then we need to change something in the previous steps. The business goal may need to be modified, other data may need to be found, negotiations on data cost may be in order or an industry expert may help. After changes have been made we ask again if there is a feasible project. When at least one feasible project is found we move on to step number six.

2.6 Setup Phase: What Project?

Which of the projects, if any, in step 5 is to be undertaken? A cost benefit analysis is required here. This analysis may prove useful in selling the project externally. It will be hard to know the cost of poor data quality. For this reason this step should include a modeling expert to provide the expert's opinion on what level of quality is needed to maximize the probability of the project accomplishing the business goal. For similar reason the industry expert should be involved in this step as well. It may be that the setup phase requires knowledge of the analysis or reporting phases. For example, if the specific software package to accomplish the model building is predetermined then that may limit the possible models. Or, for example, if the model reporting is to be performed on an underpowered machine in real time then that also may limit the possible models. Each of these examples will impact the number of possible projects available in the setup phase. Because of these interdependencies all of those involved in the project should be involved in this step of the process.

If the answer to step 6 provides a project then we can proceed to the next phase of the standard statistical modeling process, the analysis phase. Assuming that we have determined a project we proceed to the analysis phase.
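The cost benefit analysis of step 6 can be sketched as an expected net benefit calculation. This is only one simple way to frame it: the `Project` fields, the probability of success (which in practice would come from the modeling and industry experts) and all figures below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    cost: float                 # data acquisition plus modeling cost
    benefit_if_success: float   # value of accomplishing the business goal
    p_success: float            # expert-judged probability of success


def expected_net_benefit(project):
    """Expected benefit minus cost."""
    return project.p_success * project.benefit_if_success - project.cost


def choose_project(projects):
    """Return the feasible project with the largest positive expected net
    benefit, or None if no project is worth undertaking."""
    best = max(projects, key=expected_net_benefit, default=None)
    if best is None or expected_net_benefit(best) <= 0:
        return None
    return best


candidates = [
    Project("in-hand data", cost=100.0, benefit_if_success=500.0, p_success=0.5),
    Project("collected data", cost=300.0, benefit_if_success=1000.0, p_success=0.4),
]
chosen = choose_project(candidates)
print(chosen.name if chosen else "no project")
```

Returning `None` corresponds to the "if any" in the question above: it is possible that none of the feasible projects justifies its cost.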

3. The Analysis Phase

The analysis phase assumes that a project has been determined in the setup phase.

The analysis phase of the standard statistical model building process consists of the following steps.

Step 1: Establish the Modeling Goal
Step 2: Acquire the Data
Step 3: Check the Data
Step 4: Decide on the Class of Models
Step 5: Estimate the Parameters of the Model
Step 6: Validate the Model Internally
Step 7: Validate the Model Externally

3.1 Analysis Phase: Establish the Modeling Goal

The first analysis step, to establish the modeling goal, consists of weighing relevance to the business goal of the project determined in the setup phase, the intuition of the modeler, and consideration of the constraints of data availability, money limitations, time and other resource limitations. The modeling goal must be reasonable. Other previous modeling goals may provide insight as to a reasonable current modeling goal. After these considerations a modeling goal is selected.

3.2 Analysis Phase: Acquire the Data

The second analysis step, to acquire the data, requires project data specification and investigation of data sources, costs and availability. It may be necessary to collect the data yourself. The data must contain information which is relevant to achieving the modeling goal. The selection of data may be made in connection with the model feasibility in model building step number four.

3.3 Analysis Phase: Check the Data

The analysis step number three is to check the data. This is one of the most important stages of the process as "garbage in garbage out" applies. The data must have internal logical consistency. The data must be consistent with external facts. The data should be checked to verify that it corresponds to specifications. Things to check here include: missing data codes, incorrectly specified dummy variables, outliers, influential data, typographical errors and transposition errors.
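Several of the checks listed above can be automated. The sketch below covers three of them for a single column: missing data codes, dummy variables with values other than 0 or 1, and outliers. The sentinel codes are hypothetical, and the outlier rule (a modified z-score based on the median absolute deviation, with the commonly used cutoff of 3.5) is one assumed choice among many, not a rule taken from this report.

```python
import statistics

MISSING_CODES = {-999, 9999}  # hypothetical sentinel values for "missing"


def check_column(values, is_dummy=False, z_cut=3.5):
    """Return row indices flagged as missing, as invalid dummy codes,
    or as outliers by the modified z-score."""
    report = {"missing": [], "bad_dummy": [], "outliers": []}
    clean = []
    for i, v in enumerate(values):
        if v is None or v in MISSING_CODES:
            report["missing"].append(i)
        elif is_dummy and v not in (0, 1):
            report["bad_dummy"].append(i)
        else:
            clean.append((i, v))
    if not is_dummy and len(clean) >= 3:
        med = statistics.median(v for _, v in clean)
        mad = statistics.median(abs(v - med) for _, v in clean)
        if mad > 0:
            for i, v in clean:
                if 0.6745 * abs(v - med) / mad > z_cut:
                    report["outliers"].append(i)
    return report


print(check_column([10, 12, 11, -999, 13, 250, 12]))
print(check_column([0, 1, 2, 1], is_dummy=True))
```

Flagged rows still need human review: an outlier may be a typographical or transposition error, or it may be a genuine and influential observation.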

3.4 Analysis Phase: Decide on the Class of Models

The analysis step number four is to decide on the model. One method of doing this is to emulate the data generation process. Another is to use past experience with models in the application area. The feasibility of model implementation at a later stage, such as production, must be considered at this stage. The model should provide "maximal" attainment of the modeling goal. The data must yield the information required to achieve the modeling goal. Costs are also considered at this stage. These costs include: model selection, estimation of parameters and implementation costs.

The model entertained may be of many kinds. In order to understand that there are a large number of possible models a list of model types is provided. There may be some overlap between certain of these model types (they are not mutually exclusive of one another). Possible model types are: Basis Function Estimation, Bayesian Analysis, Classification and Regression Trees, Clustering, Descriptive or Summary Statistics, Differential Equations, Discriminant Analysis, Fourier Analysis, Fractal Techniques, Genetic Algorithms, Graphical Methods, Kalman Filtering, Linear Methods (GLIM), Link Function Regression Analysis, Multiple Linear Regression Analysis, Neural Networks, Non-Parametric Function Estimation, Parametric Estimation and Models, Pattern Recognition, Polynomial Fitting, Projection Pursuit, Regularization Methods, Segmentation, Spline Estimation, Stochastic Models, Time Series Analysis (ARIMA and Transfer Functions), Transformations of Data, Wavelets and Custom Combinations of the above. An example of custom combinations is the use of Bayesian Analysis and Neural Networks to provide a posterior distribution which could be used to estimate the probability of response, see Weigend and Srivastava (1995).

The model needs to handle any unusual things uncovered in the check data phase. This includes how to handle missing values and outliers.

3.5 Analysis Phase: Estimate the Parameters of the Model

Analysis step number five consists of the determination of model parameters. This is called parameter estimation or model calibration. Methods to accomplish this are: Ad Hoc (Heuristic or Made Up), Maximum Likelihood, Posterior Density Estimation, Entropy Maximization and the Method of Moments.
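Two of the estimation methods named above have simple closed forms in textbook cases, sketched here. The choice of models is an assumption made for illustration: a Bernoulli response model for maximum likelihood, and a gamma model for the method of moments (matching the sample mean and variance to k*theta and k*theta^2).

```python
import statistics


def bernoulli_mle(responses):
    """Maximum likelihood estimate of a response probability from 0/1 data.
    For the Bernoulli model this is the sample proportion."""
    return sum(responses) / len(responses)


def gamma_method_of_moments(xs):
    """Method of moments estimates (shape k, scale theta) for a gamma model:
    solve mean = k * theta and variance = k * theta**2."""
    m = statistics.mean(xs)
    v = statistics.pvariance(xs, mu=m)
    return m * m / v, v / m


# Hypothetical data: eight contacts, three of whom responded.
p_hat = bernoulli_mle([1, 0, 0, 1, 0, 0, 0, 1])
k_hat, theta_hat = gamma_method_of_moments([1.0, 3.0])
print(p_hat, k_hat, theta_hat)
```

For most of the model classes listed in section 3.4 no closed form exists, and the estimates are found by numerical optimization instead.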

3.6 Analysis Phase: Validate the Model Internally

Analysis step number six consists of validating the model internally. This step requires checking internal consistency and validity. Internal consistency includes standard statistical model diagnostic checks. For example, plotting the residuals against each independent variable. Note that the residuals should not be plotted against the dependent variable(s). Internal Validity consists of the "Laugh Test" and comparison to a holdout set. The "Laugh Test" asks the question: does the output of the model make one laugh or grimace in awe or disbelief? If so, then the model fails the laugh test. If not, then the model passes the laugh test. The comparison to the holdout set involves comparing the model predictions to a set of data which is set aside for the purpose of this internal validity check. If the model fails these checks then this information should be recorded and the process should be redone possibly starting at analysis stage number one. Internal model verification is critical in this step.
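The two internal checks described above, residuals against an independent variable and comparison to a holdout set, can be sketched for the simplest case of a straight-line model fitted by least squares. The model choice, the data and the use of root mean squared error as the comparison measure are all assumptions made for illustration.

```python
import math


def fit_line(xs, ys):
    """Least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def residuals(model, xs, ys):
    """Observed minus predicted values."""
    a, b = model
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def rmse(res):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(r * r for r in res) / len(res))


# Hypothetical data, split into a fitting set and a holdout set.
fit_x, fit_y = [1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
hold_x, hold_y = [7, 8], [14.2, 15.9]

model = fit_line(fit_x, fit_y)
fit_res = residuals(model, fit_x, fit_y)

# Pairs (x, residual) are what one would plot for the diagnostic check.
diagnostic_pairs = list(zip(fit_x, fit_res))
print("fit RMSE:", rmse(fit_res))
print("holdout RMSE:", rmse(residuals(model, hold_x, hold_y)))
```

A holdout error far larger than the fitting error, or a visible pattern in the residual pairs, would be recorded as a failed check and would send the analyst back to an earlier step.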

3.7 Analysis Phase: Validate the Model Externally

Analysis step number seven consists of checking the model for external validity and consistency. This may include comparison to a holdout sample which has not been previously utilized or inspected by the analyst. It may also include testing against live or more current data if available. If the model passes this external validity test the model is "blessed" and ready for use. We have accomplished the modeling goal. If the model is invalid according to the external validity and consistency test then we will be unable to "bless" the model. In order to proceed to accomplish the modeling goal we require more data in order to do another external validity and consistency check. This data may be either live, more current or past data which was not previously available or utilized.

test, and other logical tests such as probabilities and market shares being non-negative and less than one.
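A logical test of the kind named above is easy to automate. The sketch below flags model outputs that cannot be probabilities or market shares. Treating a value of exactly one as admissible is an assumption, and the sample outputs are invented.

```python
def failed_logical_test(values):
    """Return (index, value) pairs for outputs outside the interval [0, 1]."""
    return [(i, v) for i, v in enumerate(values) if not 0.0 <= v <= 1.0]


# Hypothetical predicted market shares from a model.
predicted_shares = [0.12, 0.40, -0.03, 0.35, 1.20]
print(failed_logical_test(predicted_shares))
```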

4.5 Reporting Phase: External Validity

The fifth step of the reporting phase of the standard statistical model building process is the external model validity and consistency check. External verification of the model is available after the analysis system has made its predictions. Predictions are made before the results are known. These predictions may be at an individual prediction level or at an aggregate level. When results become known the predictions are compared to the actual values. This comparison can be made by a distributional comparison. Similarly, aggregate predictions can be compared to actual aggregate values (such as market share). As live data becomes available this step should be repeatedly performed. This is an opportunity to verify the entire system.

This external validity check is the ultimate test of the model. Passage of this test allows one to have confidence in use of the model.
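Both comparisons described in this step can be sketched in a few lines. The report does not say which distributional comparison is meant; the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical distribution functions) is used here as one assumed choice. The aggregate comparison sets a mean predicted response probability against the observed response rate. All data are hypothetical.

```python
import bisect


def ks_statistic(predicted, actual):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the two empirical distribution functions."""
    a, b = sorted(predicted), sorted(actual)
    points = sorted(set(a) | set(b))

    def ecdf(sorted_values, x):
        # Fraction of values less than or equal to x.
        return bisect.bisect_right(sorted_values, x) / len(sorted_values)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)


def aggregate_comparison(predicted_probs, outcomes):
    """Return (predicted rate, actual rate, difference) at the aggregate level."""
    predicted_rate = sum(predicted_probs) / len(predicted_probs)
    actual_rate = sum(outcomes) / len(outcomes)
    return predicted_rate, actual_rate, predicted_rate - actual_rate


# Hypothetical predictions made before results were known, then the results.
predicted_sales = [10.0, 12.5, 9.0, 14.0, 11.0]
actual_sales = [9.5, 13.0, 8.0, 15.5, 11.5]
print("KS statistic:", ks_statistic(predicted_sales, actual_sales))

probs = [0.2, 0.4, 0.6]
responded = [0, 1, 0]
print("aggregate:", aggregate_comparison(probs, responded))
```

Because the report asks for this step to be repeated as live data arrive, functions like these would be rerun on each new batch of results.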

5. Conclusions

The steps in this report are meant as a guide for those attempting to develop a statistical model. In practice, steps may be undertaken at different levels of involvement. However, it is important that all steps be considered. If any steps are omitted then the potential effect(s) of omission should be communicated to those responsible for the omission.

The statistical modeling process is an art as well as a science. There are many factors which may have not been accounted for by the procedures described in this report. A knowledgeable expert is needed to sort out the importance of considering these other factors.

For further information please contact Dr. James Wendelberger at Telephone Voice: (0101) 505 662 2838 or Telephone Facsimile: (0101) 505 662 2841.

6. Reference

Fowkes, N. D. and J. J. Mahony (1994), Introduction to Mathematical Modelling, John Wiley and Sons, Inc., New York.
