
Using dialog boxes to vary program parameters

STATA TECHNICAL BULLETIN
STB-36, March 1997
A publication to promote communication among Stata users
Editor: H. Joseph Newton, Department of Statistics, Texas A & M University, College Station, Texas 77843; 409-845-3142; FAX 409-845-3144; EMAIL stb@stata.com

Associate Editors: Francis X. Diebold, University of Pennsylvania; Joanne M. Garrett, University of North Carolina; Marcello Pagano, Harvard School of Public Health; James L. Powell, UC Berkeley and Princeton University; J. Patrick Royston, Royal Postgraduate Medical School
Subscriptions are available from Stata Corporation, email stata@stata.com, telephone 979-696-4600 or 800-STATAPC,
fax 979-696-4601. Current subscription prices are posted at www.stata.com/bookstore/stb.html.
Previous Issues are available individually from StataCorp. See www.stata.com/bookstore/stbj.html for details.
Submissions to the STB, including submissions to the supporting files (programs, datasets, and help files), are on a nonexclusive, free-use basis. In particular, the author grants to StataCorp the nonexclusive right to copyright and distribute the material in accordance with the Copyright Statement below. The author also grants to StataCorp the right to freely use the ideas, including communication of the ideas to other parties, even if the material is never published in the STB. Submissions should be addressed to the Editor. Submission guidelines can be obtained from either the editor or StataCorp.
Copyright Statement. The Stata Technical Bulletin (STB) and the contents of the supporting files (programs, datasets, and help files) are copyright © by StataCorp. The contents of the supporting files (programs, datasets, and help files) may be copied or reproduced by any means whatsoever, in whole or in part, as long as any copy or reproduction includes attribution to both (1) the author and (2) the STB.
The insertions appearing in the STB may be copied or reproduced as printed copies, in whole or in part, as long as any copy
or reproduction includes attribution to both (1) the author and (2) the STB. Written permission must be obtained from Stata
Corporation if you wish to make electronic copies of the insertions.
Users of any of the software, ideas, data, or other materials published in the STB or the supporting files understand that such use
is made without warranty of any kind, either by the STB, the author, or Stata Corporation. In particular, there is no warranty of
fitness of purpose or merchantability, nor for special, incidental, or consequential damages such as loss of profits. The purpose
of the STB is to promote free communication among Stata users.
The Stata Technical Bulletin (ISSN 1097-8879) is published six times per year by Stata Corporation. Stata is a registered trademark of Stata Corporation.
Contents of this issue
gr16.1. Convex hull plots
gr24. Easier bar charts
gr25. Spike plots for histograms, rootograms, and time-series plots
ip16. Using dialog boxes to vary program parameters
sbe13.2. Correction to age-specific reference intervals ("normal ranges")
sbe14. Odds ratios and confidence intervals for logistic regression models with effect modification
sg67. Univariate summaries with boxplots
sg68. Goodness-of-fit statistics for multinomial distributions
sg69. Immediate Mann–Whitney and binomial effect-size display
gr16.1 Convex hull plots
Nicholas J. Cox, University of Durham, UK, FAX (011) 44-91-374-2456, n.j.cox@durham.ac.uk
Syntax
The syntax for the command is
yvar xvar exp range #hullvar graph options
Note: this command requires the two convex hull programs of Gray and McGuire (1995), which must be installed from the gr16 directory of the STB-23 disk (January 1995).
Options
# specifies the convex hull required. The default is 1, the outermost hull.
hullvar specifies a variable to hold information about the hulls and is only needed if is a variable in the data set.
specifies that the final graph (for # > 1) is to be preceded by a movie of the hulls up to #. Thus, with # = 2, the final graph includes hull 2, and hulls 1 and 2 will be shown separately beforehand.
graph options are any options allowed with graph; see the help on graph. The default titles indicate the y and x variables as usual.
Explanation
The convex hull programs and of Gray and McGuire (1995) are a very useful adjunct to the graphics
facilities of Stata. The program here, , is a supplement to those programs designed to streamline two frequent needs,
namely to get either a quick or a presentable graph of a particular convex hull on a standard scatter plot. in fact calls
and , which do all the hard work, so all that is offered is a different interface that will ease some tasks.
The use of varies from the use of and in the following details:
1. and require the user to specify the variable before the variable, contrary to the convention of
that will be familiar to experienced Stata users. uses the same convention as ; that is, the
variable before the variable.
2. and leave behind extra variables and extra observations specifying the hulls and allowing graphical
closure of the hulls. These extras have to be dropped from the data set each time other convex hulls are drawn in the
same Stata session. handles these details by using temporary variables and leaving the data unchanged. A partial
exception is that uses a stub, by default , for a set of variables, e.g. , , . I do not know a way to
specify a temporary stub in Stata; one problem is that all temporary variables have names 8 characters long underneath
the names that the programmer employs. I have left the originals unchanged and merely used a different stub. A guess at user practices is that this varname is less likely in user data sets than the default. If it is in the data set when the command is invoked, an error message is issued and the option must be used to specify another new variable. However, the variable is not left behind, so it is in effect used as a temporary variable.
3. allows the user to specify and .
4. allows the user to specify the options for a presentable graph at the same time as invoking
and .
5. has a simpler syntax for the tasks described, especially for the casual user.
6. does not allow separate hulls to be drawn for each level of a categorical variable.
7. does not allow two or more hulls to be drawn on the same graph.
8. does not allow the user to examine the subsets of the data defined by each convex hull.
Items 1–5 are suggested to be advantages of , while items 6–8 are limitations.
Stata Technical Bulletin 3
Examples
We will work with data on 158 glacial cirques from the English Lake District (Evans and Cox 1995), found in the
accompanying file . Glacial cirques are hollows excavated by glaciers that are open downstream, bounded upstream by the crest of a steep slope, and arcuate in plan around a more gently sloping floor. More informally, they are sometimes
described as “armchair-shaped”. They are common in mountain areas that have or have had glaciers present.
Whether cirque shape changes with size is one question of interest to geomorphologists. Given data on cirque length and
width, the outermost convex hull is simply displayed by
[Graph: Length of median axis, m, versus Width across median axis, m, with the outermost convex hull drawn.]
Figure 1. Convex hull plot: natural scale.
The next step might be to use logarithmic scales and to add some sensible labels by
[Graph: Length of median axis, m, versus Width across median axis, m; both axes logarithmic, 200 to 2000.]
Figure 2. Convex hull plot: logarithmic scale.
Here we are exploiting the congenial fact that, for this example, taking the convex hull and logarithmic transformation of both
variables are commutative; the result is the same whichever you do first. Note, however, that this is necessarily true only for
affine transformations and must be checked otherwise.
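The commutativity claim is easy to check numerically. The sketch below is in Python rather than Stata, with hypothetical points; the hull routine is Andrew's monotone chain. An affine map always preserves the set of hull vertices, while a logarithmic map has to be checked case by case (here, as in the cirque example, it happens to commute).

```python
# Numerical check, in Python rather than Stata, of the claim above:
# the convex hull commutes with an affine map, but a logarithmic map
# must be checked case by case.  Hull routine: Andrew's monotone chain.
import math

def convex_hull(points):
    """Return indices of hull vertices in counterclockwise order."""
    idx = sorted(range(len(points)), key=lambda i: points[i])

    def cross(o, a, b):
        return ((points[a][0] - points[o][0]) * (points[b][1] - points[o][1])
                - (points[a][1] - points[o][1]) * (points[b][0] - points[o][0]))

    def half(seq):
        out = []
        for i in seq:
            while len(out) >= 2 and cross(out[-2], out[-1], i) <= 0:
                out.pop()
            out.append(i)
        return out[:-1]

    return half(idx) + half(reversed(idx))

pts = [(1.0, 1.0), (2.0, 5.0), (4.0, 2.0), (6.0, 7.0), (3.0, 3.0), (5.0, 1.5)]
affine = [(2 * x + 1, 3 * y - 2) for x, y in pts]       # affine: hull preserved
logged = [(math.log(x), math.log(y)) for x, y in pts]   # logs: must be checked

print(sorted(convex_hull(pts)) == sorted(convex_hull(affine)))  # True, always
print(sorted(convex_hull(pts)) == sorted(convex_hull(logged)))  # True here, not in general
```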
References

Evans, I. S. and N. J. Cox. 1995. The form of glacial cirques in the English Lake District, Cumbria. Zeitschrift für Geomorphologie 39: 175–202.
Gray, J. P. and T. McGuire. 1995. gr16: Convex hull programs. Stata Technical Bulletin 23: 11–15. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 59–66.
gr24 Easier bar charts
Nicholas J. Cox, University of Durham, UK, FAX (011) 44-91-374-2456, n.j.cox@durham.ac.uk
Syntax
The syntax for the command is
varlist weight exp range sortvar sortvar
labelvar byvar bar options
Options
sortvar means that bars are to be plotted in ascending order of sortvar, highest values on the right. If weights are used
and sortvar is numeric, then the order is calculated using the weights. may not be combined with or .
sortvar means that bars are to be plotted in descending order of sortvar, highest values on the left. If weights are used
and sortvar is numeric, then the order is calculated using the weights. may not be combined with or .
labelvar means that bars are to be plotted with labels from labelvar, normally but not necessarily a string variable
containing text. may not be combined with .
byvar means what it does elsewhere: bars are plotted in groups according to the values of byvar. Note that it may not be combined with the sorting and labeling options.
applies when varlist contains a single categorical variable and it is desired to show the count or frequency in each
category. If the number of categories is 6 or fewer, a set of temporary indicator variables will be generated, and the bars
will touch and have (by default) different colors and shadings. If the number of categories is 7 or more, the categories will
be plotted as separate bars for a single variable, with the same color and shading, so long as has not been invoked.
However, if separate bars are preferred when the number of categories is 6 or fewer, use the option in addition.
is explained just above and is restricted to overriding a default behavior with the option.
means that results are to be reported as percents. The default is percents of the grand total of all the variables in the varlist; see the group and variable totals below. The percent and proportion options are mutually exclusive.
means that results are to be reported as proportions or fractions between 0 and 1. The default is proportions of the grand total of all the variables in the varlist; see the group and variable totals below. The percent and proportion options are mutually exclusive.
means that percents or proportions (one must be specified) are of the total of all the values in each group, defined by one value of the grouping variable. The group and variable totals are mutually exclusive.
means that percents or proportions (one must be specified) are of the total of all the values for each variable in the varlist. The group and variable totals are mutually exclusive.
bar options are other options allowed with graph, bar. See [R] graph or the on-line help for details of those options.
Explanation
Simple bar charts can be surprisingly awkward in Stata. The basic reason for this is that has a built-in tendency
to add up whatever is fed to it. This is fine so long as what you want plotted are sums of values in their original units, or
means, which are one step away with the option. If you want something else, some preprocessing is required, which
makes some basic tasks rather complicated, especially for users new to Stata.
I have written as a way of automating the most common kinds of preprocessing. is intended to be just
with some added bells and whistles.
First, however, I will commend . If you find frustrating, check out , which gives a histogram for
categorical variables. may be the answer to your question.
There are three main problems that is designed to tackle. They sometimes arise in combination, especially the first
and the second.
Problem 1: Percents and proportions
Suppose you want results plotted in percents (sum 100) or proportions (sum 1). It is necessary to transform your variables
so that their values do indeed add up to the appropriate sum, 100 or 1. In STB-14, Felicia Knaul (1993) asked how to create
stacked bar charts of percents. The answer to her question was six command lines creating the percents before they are fed to
. With , the answer to her problem would be
where the options and call for percents and for those percents to be calculated for each group (p by g).
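The kind of preprocessing being automated here can be sketched as follows, in Python rather than Stata, with a hypothetical (group, category, count) layout: each group's counts are rescaled to sum to 100.

```python
# Sketch (Python, not Stata) of "percent within group" preprocessing:
# rescale each group's counts so they sum to 100.  The rows below are
# hypothetical (group, category, count) data, not Smith's survey.
rows = [("A", "yes", 20), ("A", "no", 30),
        ("B", "yes", 10), ("B", "no", 40)]

group_totals = {}
for g, c, n in rows:
    group_totals[g] = group_totals.get(g, 0) + n

percents = [(g, c, 100.0 * n / group_totals[g]) for g, c, n in rows]
print(percents)  # [('A', 'yes', 40.0), ('A', 'no', 60.0), ('B', 'yes', 20.0), ('B', 'no', 80.0)]
```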
Let us look at another similar example, especially for those without access to the insert by Knaul. Smith (1984) studied the
geography of crime in an area of Birmingham, England. She asked 531 residents in an ethnically mixed area whether they had
taken any action to avoid victimization. Her table can be read into Stata and a bar chart drawn using (see Figure 1).
[Graph: grouped bar chart of counts (0 to 200) for responses yes and no, by Asian, W.Indian, and white.]
Figure 1. vbar chart.
This could also have been done by
However, where excels is in the next step, standardizing so that we look at percents (see Figure 2):
[Graph: the same bar chart standardized to percents (0 to 80) within each group, for responses yes and no by Asian, W.Indian, and white.]
Figure 2. vbar chart with percents.
The stacking option could have been added in either case. Another option computes the percent or proportion for each variable. If percents or proportions are requested without a group or variable total, the default is that they are calculated with reference to the total of all the variables in the varlist, which is naturally equivalent to the per-variable total if there is just one variable.
Problem 2: Categorical variables
Suppose you have a categorical variable coded numerically with more or less arbitrary codes. would treat each
code literally, that is to say, numerically, so that if a variable were coded 1, 2, and so forth, all the 1’s would be added up, all
the 2’s added up, and so forth, so that each bar would represent the frequency of the value, multiplied by the arbitrary code. In
the automobile data distributed with Stata, rep78 (repair record) is such a categorical variable with codes 1 through 5.
is an answer to this problem and plots frequencies (as, in a cruder way, does ). Yet with , the bars touch, which you may not want, perhaps because you are sensitive about giving the impression of an underlying continuous
scale. Note also that is not similar to .
A solution which is rather Stata-ish is to use to generate a set of indicator variables (values 1 or 0)
that when added up by will give the desired frequencies. Experienced Stata users will do this as a conditioned
reflex, but to those learning Stata it can seem a trifle arcane. automates this indicator variable line of attack when called
with the option. The attack fails if the number of categories exceeds 6, because will not allow more
than 6 variables. There is another line of attack that is then tried, which has the consequence that it plots separate bars, so long
as has not been used as an option. Separate bars may or may not be what is preferred. If there are 6 or fewer categories,
invoking the option will override the default. In the automobile example, the and options produce
the following plot:
[Graph: frequency bars (0 to 30) for Repair Record 1978, categories 1 to 5.]
Figure 3. vbar with category option.
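The indicator-variable line of attack can be sketched in Python rather than Stata (the repair codes below are hypothetical, in the spirit of the automobile example): each category becomes a 0/1 column, and summing each column reproduces the category frequencies, which is exactly what a sum-based bar routine needs.

```python
# Indicator-variable (one-hot) trick sketched in Python: summing the
# 0/1 columns reproduces category frequencies for a sum-based bar
# routine.  The repair codes 1..5 here are hypothetical data.
data = [3, 3, 4, 1, 5, 3, 2, 4, 4, 3]
categories = sorted(set(data))

indicators = {c: [1 if v == c else 0 for v in data] for c in categories}
frequencies = {c: sum(col) for c, col in indicators.items()}
print(frequencies)  # {1: 1, 2: 1, 3: 4, 4: 3, 5: 1}
```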
With the Smith data looked at earlier, the assumption was that we were reading in the counts from a summary table. What
is also likely with such data is that we do not have the summary counts but the raw data in a categorical variable, say ,
taking values or . The bar chart would then be produced by
Note that only allows one variable in the varlist with the category option. So
is allowed, but not
The best reasons for this limitation are that users do not often seem to want the latter form, and that if they ask for it, they
really would be better served by
or
instead, which is fine. Note that in the last example the option specifies that the variable is categorical; the other is treated as one automatically. Under Stata 5.0, which this insert assumes, both can be string variables as well as numeric, because of a change to the underlying command.
Problem 3: Ordered, labeled bar charts
Consider the United States census data distributed with Stata. You might want a bar graph of divorces for each state,
labeled with state identifiers. The full names would lead to a very messy graph, but two-letter identifiers (in lower case) would
be just about tolerable. Let us assume that contains those identifiers (in lower case), such as tx (a little state somewhere
in the South). We have included with this insert a file called containing the variables and from
as well as the variable with the two-letter identifiers.
requires a preceding
and produces an alphabetically ordered bar chart, which may be what you want. On the other hand,
requires a preceding
and produces a numerically ordered bar chart. Incidentally, this works properly only because the 50 states have 50 unique values
for , with no ties. You would not always be so lucky. What if you want both the state labeling and the numerical ordering?
It can be done with preprocessing, and it is done on request by an option that separates the ordering from the labeling by another variable. The cost, however, is that the values of the string variable become value labels, and so cannot be more than 8 characters long. There may well be some trick to get round this limit.
would produce bar charts with the divorces increasing or decreasing from left to right, respectively (see Figure 4 for the result of
the first of these commands), and labeled by the state identifiers. Any ties in either the (values) or the (labels) variables
would have been handled properly, and separate bars shown for separate, but equal, values.
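Separating the sorting variable from the labeling variable amounts to sorting (label, value) pairs by value; a stable sort keeps tied values as separate, equal-height bars. A Python sketch with hypothetical state figures:

```python
# Sorting bars by one variable while labeling them with another,
# sketched in Python.  The values and labels below are hypothetical,
# not the census figures.
values = [61978, 6595, 18000, 18000, 177000]
labels = ["ga", "vt", "ks", "ct", "ca"]

order = sorted(range(len(values)), key=lambda i: values[i])   # ascending; stable on ties
bars = [(labels[i], values[i]) for i in order]
print(bars)  # [('vt', 6595), ('ks', 18000), ('ct', 18000), ('ga', 61978), ('ca', 177000)]
```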
[Graph: Number of divorces (0 to 150000) by state, bars in ascending order, labeled nd, de, vt, ..., fl, tx, ca.]
Figure 4. vbar with laby() option.
So far we have considered the case in which takes a string variable as argument. The option may also be used
with a numeric variable which is either labeled (its labels are used) or unlabeled (its values are used as labels). The same length
limit of 8 characters applies.
Note that the sorting in is alphabetic, or reverse alphabetic, with string variables, and takes account of any weights
specified with numeric variables. I can imagine cases in which the user wants the bar heights to reflect the weights, but not the
sorting variable. Note that they need not be the same variable. That cannot be handled by . It would make the syntax and
programming more complicated for what is, I guess, an unusual case that can be tackled by something like
In the above, the user wishes to apply weights to , which together determine the bar heights, but not to the sorting variable
.
It is not an error to call the or option without . In that case, is used to label the sorted bars.
References

Knaul, F. 1993. qs5: How to create stacked bar charts of percentages. Stata Technical Bulletin 14: 12–13. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 71–72.
Smith, S. J. 1984. Crime and the structure of social relations. Transactions, Institute of British Geographers 9: 427–442.
gr25 Spike plots for histograms, rootograms, and time-series plots
Nicholas J. Cox, University of Durham, UK, FAX (011) 44-91-374-2456, n.j.cox@durham.ac.uk
Anthony R. Brady, Public Health Laboratory Service Statistics Unit, UK, tbrady@phls.co.uk
Syntax
The syntax for the command is
varname weight exp range # # graph options
Options
# uses the round() function to round varname to the nearest multiple of #. See the help on functions or [U] 20.3.5 in the manual. In other words, it specifies the bin width or class interval.
specifies that the vertical scale is to show fractions. This convention is the opposite of that for .
specifies that the vertical scale is to show square roots of frequencies (J. W. Tukey’s “rootogram”).
# specifies a constant other than 0 as a level from which spikes are drawn vertically.
graph options are options allowed with graph. The by() option is allowed, but its total suboption is trapped with a warning, as the total graph produced with byvar would superimpose spikes, not add them vertically.
The defaults for these graph options are
of “Frequency” (or “Fraction”, if is used, or “Root of frequency”, if is used);
of the variable label of varname (or varname itself, if the label is not defined);
, so that each bin or value is shown by a vertical spike;
, so that invisible point symbols are used.
Explanation
A standard idiom with is
which connects and by vertical lines on a scatter (twoway) plot of and against , in this case without visible point
symbols. From that it is a small step to see that setting to a constant (e.g., 0 or some other reference level) produces some
useful graphs, and another small step to see that some prior calculation gives you a kind of histogram, with separate spikes, not
touching bars.
has been written to automate the production of such graphs, so that users can, as far as possible, get what they
want with a single command. The name “spike plot” was suggested partly by the “spike chart” of Berry (1996, 14).
Histograms
Some data analysts producing histograms seem to prefer spikes to bars, as a matter of logic or of taste, especially with
discrete variables. Examples are found in Plackett (1971) and Evans, Hastings, and Peacock (1993). Even if people want a
conventional histogram with touching bars, the limit in of 50 bins sometimes proves frustrating. In our
experience, this is often when there is some fine structure in the frequency distribution that is of interest, even if it is in some
way spurious or pathological. Examples come from demographic data on human ages; people prefer to state certain ages (such
as multiples of 10 or 5 years) as a matter of vanity, ignorance, biases in memory, and so forth. The number of possible ages
is clearly about 100 and the fine structure of the distribution can only be seen well if each is a separate bin. For such tasks,
offers an alternative to .
A more pervasive issue is the size of the data set. In broad terms, the details of a histogram should be less affected by
quirks of sampling as the number of values increases. Hence with larger , more bins can be justified, although it is difficult
to make this precise in a general manner. Suggested rules of thumb for the number of bins include those discussed by Emerson and Hoaglin (1983) and in [R] graph histogram. With a rule of 2 sqrt(n), more than 50 bins would be needed for n > 625; with a 10 log10 n rule, for n > 100,000; and with Sturges' rule 1 + log2 n, for n > 2^49. Another rule, suggested half-seriously by the current first author, has a threshold of 4,630. The compromise rule discussed in [R] graph histogram, min(sqrt(n), 10 log10 n), leads to a threshold of 100,000. Without taking these rules more literally than they deserve, we note simply that some, but not all, imply more than 50 bins for data set sizes that are frequently encountered.
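Those rules of thumb are easy to tabulate. The sketch below is in Python, and the exact functional forms are my reading of the rules named above (an assumption, not a quotation of the cited sources); each function is checked at its 50-bin threshold.

```python
# Bin-count rules of thumb, as I read them (an assumption, not a
# quotation of the cited sources), checked at their 50-bin thresholds.
import math

def sqrt2_rule(n):      # 2 * sqrt(n): 50 bins at n = 625
    return 2 * math.sqrt(n)

def log_rule(n):        # 10 * log10(n): 50 bins at n = 100,000
    return 10 * math.log10(n)

def sturges(n):         # 1 + log2(n): 50 bins at n = 2^49
    return 1 + math.log2(n)

def compromise(n):      # min of the sqrt and log rules, as in [R] graph histogram
    return min(math.sqrt(n), 10 * math.log10(n))

print(sqrt2_rule(625))        # 50.0
print(log_rule(100_000))      # 50.0
print(sturges(2 ** 49))       # 50.0
print(compromise(100_000))    # 50.0
```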
Note that in Stata 5.0, it is possible, at the cost of some programming, to create your own alternative histograms with .
These could have more than 50 bins and they could even have unequal bin widths.
Rootograms
allows the production of the “rootograms” suggested by J. W. Tukey circa 1965. The idea is to show not
frequencies, but their square roots. Frequencies, as counted variables, tend to have variability that is stabilized by a root
transformation, at least approximately. Note also that the square root of a normal or Gaussian density is a multiple of another
normal or Gaussian density. Hence if the normal is the reference distribution, we are looking for the same shape on a rootogram,
and experience in assessing histograms for approximate normality can be applied directly in assessing rootograms. However,
taking the root is only the first step in Tukey’s procedure, and we do not implement his hanging or suspended rootograms. See
Tukey (1965, 1972, 1977), Tukey and Wilk (1965), or Velleman and Hoaglin (1981).
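The root transformation itself is trivial to compute, and the claim that the square root of a normal density is a multiple of another normal density can be checked numerically. A Python sketch with hypothetical frequencies:

```python
# Rootogram heights (Python sketch, hypothetical counts): spike heights
# are square roots of the bin frequencies.  Also checked: sqrt of a
# N(0, s^2) density is a constant multiple of a N(0, 2*s^2) density.
import math

frequencies = [1, 4, 9, 25, 9, 4, 1]
roots = [math.sqrt(f) for f in frequencies]
print(roots)  # [1.0, 2.0, 3.0, 5.0, 3.0, 2.0, 1.0]

def npdf(x, s2):
    """Normal density with mean 0 and variance s2."""
    return math.exp(-x * x / (2 * s2)) / math.sqrt(2 * math.pi * s2)

# The ratio below is the same for every x: sqrt of the N(0,1) density
# is proportional to the N(0,2) density.
ratios = [math.sqrt(npdf(x, 1.0)) / npdf(x, 2.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.5)]
print(max(ratios) - min(ratios) < 1e-9)  # True
```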
Consider dotplots
In several ways, spike plots are comparable to the dotplots provided by Stata’s command, which can also be very
useful in showing the fine structure of data distributions. Apart from the obvious difference that dots represent individual values
and spikes represent bin frequencies, a further difference is that dotplots are tuned by indirectly changing the number of bins,
whereas spike plots are tuned by directly changing the bin width. For many problems, we prefer dotplots to spike plots, and
conversely.
Time-series plots
Some kinds of time series lend themselves quite well to spike plots. Daily rainfalls for a year are often plotted as a series
of vertical spikes. On such graphs, wet and dry spells come out well, even for British stations, where according to a U.S. myth
it is usually raining. For other time series, setting to some average level shows clearly periods above and below average.
Examples
In practice, frequency distribution problems for come in two forms. First, the data for a variable are to be
summarized as a frequency distribution. This requires basically
varname
so long as the bin width or class interval is the same as the resolution of the data, or
varname #
if rounding is required. The number specified in place of # is the bin width or class interval.
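The rounding step can be sketched in Python with hypothetical ages (note that Python's built-in round() resolves half-way cases to the nearest even value, whereas Stata's round() rounds them away from zero; the example avoids half-way cases):

```python
# Binning by rounding to the nearest multiple of a bin width, in the
# spirit of Stata's round(x, #), then tallying frequencies.  Python
# sketch with hypothetical ages; half-way cases are avoided because
# Python's round() is round-half-to-even, unlike Stata's.
def round_to(x, width):
    return width * round(x / width)

ages = [21, 22, 23, 27, 28, 30, 31, 39, 40, 41]
width = 5
bins = {}
for a in ages:
    b = round_to(a, width)
    bins[b] = bins.get(b, 0) + 1
print(dict(sorted(bins.items())))  # {20: 2, 25: 2, 30: 3, 40: 3}
```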
Second, the data for two variables show values and frequencies. This form requires the use of weights. For example,
Mosteller, Fienberg, and Rourke (1983, 83) give data on the age structure of the population of Ghana reported in the 1960 census, rounded down to the nearest thousand, and taken from the U.N. Demographic Yearbook for 1962 (p. 189). These data can be found in the file . For ages from 0 (under 1 year) to 90 years, the frequencies are the numbers at each age. Note that 15,000 people were 91 and over or of unknown age. Ninety-one bins are too many for a conventional histogram, which is why we are using the spike plot.
The command is then
[Graph: spike plot of Population in 000 (0 to 300) against Age in years (0 to 80).]
Figure 1. Spike plot of age.
The resulting graph shown in Figure 1 shows heaping of ages, which is a well-known phenomenon in demography. It
is easy to pick out preferences for ages that are multiples of 10 and of 5. In addition, ages ending with 2, 4, 6, and 8 are
generally more frequent and ages ending with 1, 3, 7, and 9 generally less frequent than would be expected if there were a
relatively smooth underlying distribution. Further comment would require some anthropological or sociological knowledge on
the significance of various numbers among this population: is 30, for example, a particularly important age to achieve for some
or all groups in Ghana? Another example of age heaping for the female population of Mexico is given by Mosteller and Tukey
(1977, 476– 477). Apart from its occurrence in demography, which appears to have been recognized since at least the 18th
century (Westergaard 1932, 77), preference for certain digits in reported numbers has been identified in several sciences (Cox
1991). Spike plots showing the frequencies of all possible results are a natural tool in recognition of such biases.
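A quick numerical check for such digit preference is to tally the terminal digits of the reported values. A Python sketch with hypothetical ages:

```python
# Tallying terminal digits of reported ages, a quick check for the
# digit preference (heaping) described above.  Python sketch with
# hypothetical ages, not the Ghana census data.
ages = [20, 25, 30, 30, 32, 35, 40, 40, 40, 45, 48, 50, 50, 55, 60, 33]

last = {}
for a in ages:
    d = a % 10
    last[d] = last.get(d, 0) + 1
print(dict(sorted(last.items())))  # {0: 9, 2: 1, 3: 1, 5: 4, 8: 1}

# Share of ages ending in 0 or 5: well above the 0.2 expected if all
# terminal digits were equally likely.
share = (last.get(0, 0) + last.get(5, 0)) / len(ages)
print(share)  # 0.8125
```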
Time-series plots that are spike plots also require the use of weights. The usual situation is that each time occurs just once
in the data. Reference levels other than 0 are often useful, such as a mean or median.
To illustrate using for time series, we consider the yearly temperature means for the world and each hemisphere
given by Parker and Jones (1991), expressed as deviations from the 1951–80 mean. The data are given in . A spike plot representation of the Southern hemisphere series is given in Figure 2 and is obtained by
[Graph: spike plot of temperature deviation from 1951–80 mean, deg C (-.4 to .4), against year (1850 to 2000).]
Figure 2. Spike plot of temperature showing deviations from 1951–80 mean.
The pattern of recent warming stands out quite well. Note that we must override the default of “Frequency”, which
makes no sense in this example.
Note that when printing the result of using , users may want to draw the spikes with pen thicknesses larger than
the default.
References

Berry, D. A. 1996. Statistics: A Bayesian Perspective. Belmont, CA: Duxbury Press.
Cox, N. J. 1991. Human factors. Nature 353: 597.
Emerson, J. D. and D. C. Hoaglin. 1983. Stem-and-leaf displays. In Understanding Robust and Exploratory Data Analysis, ed. D. C. Hoaglin, F. Mosteller, and J. W. Tukey, 7–32. New York: John Wiley.
Evans, M., N. Hastings, and B. Peacock. 1993. Statistical Distributions. New York: John Wiley.
Mosteller, F., S. E. Fienberg, and R. E. K. Rourke. 1983. Beginning Statistics with Data Analysis. Reading, MA: Addison–Wesley.
Mosteller, F. and J. W. Tukey. 1977. Data Analysis and Regression. Reading, MA: Addison–Wesley.
Parker, D. E. and P. D. Jones. 1991. Global warmth in 1990. Weather 46: 302–311.
Plackett, R. L. 1971. An Introduction to the Theory of Statistics. Edinburgh: Oliver & Boyd.
Tukey, J. W. 1965. The future of processes of data analysis. Reprinted in The Collected Works of John W. Tukey, Volume IV: Philosophy and Principles of Data Analysis: 1965–1986, ed. L. V. Jones, 517–547 (1986). Monterey, CA: Wadsworth & Brooks/Cole.
Tukey, J. W. 1972. Some graphic and semigraphic displays. In Statistical Papers in Honor of George W. Snedecor, ed. T. A. Bancroft and S. A. Brown, 293–316. Ames, IA: Iowa State University Press.
Tukey, J. W. 1977. Exploratory Data Analysis. Reading, MA: Addison–Wesley, Ch. 17.
Tukey, J. W. and M. B. Wilk. 1965. Data analysis and statistics: Principles and practice. Reprinted in The Collected Works of John W. Tukey, Volume V: Graphics: 1965–1985, ed. W. S. Cleveland, 23–29 (1988). Pacific Grove, CA: Wadsworth & Brooks/Cole.
United Nations. 1962. Demographic Yearbook 1962. New York: United Nations.
Velleman, P. F. and D. C. Hoaglin. 1981. Applications, Basics, and Computing of Exploratory Data Analysis. Boston, MA: Duxbury Press, Ch. 9.
Westergaard, H. 1932. Contributions to the History of Statistics. London: P. S. King.
ip16 Using dialog boxes to vary program parameters
H. Joseph Newton, Texas A&M University, FAX (409) 845-3144, jnewton@stat.tamu.edu
One of the most exciting new features of Stata 5.0 for Windows is the ability to use dialog boxes to provide a graphical
user interface to Stata commands and programs. While dialog boxes can be used in a variety of ways, in this insert we describe
how they can be used to rapidly vary the value of a parameter such as a smoothing parameter or a bandwidth in a kernel density
estimator and have for each value of the parameter a new graph appear in the graphics window. This use of dialog boxes makes
it possible to emulate using a “slider” in a language such as .
To illustrate what we mean, consider Figure 1 where we are using a program we have written called (“power
transform dialog box”) to try to find a suitable power transform for a time series of monthly sales data (discussed by Chatfield
and Prothero (1973) and Newton (1988, 233), and included with this insert as ). If we denote the data by x(t) (with n = 77 for the sales data), we seek a value of the exponent lambda giving us a transformed data set y(t) = x(t)^lambda with constant variance across time, taking y(t) = log x(t) when lambda = 0 by the usual convention.
Using such a power transform for stabilizing variance is a standard initial step in many methods of time series analysis; see Box
and Jenkins (1970), for example.
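The transform, together with the [0, 1] standardization used throughout this insert, can be sketched in Python; taking logs at lambda = 0 is my assumption of the usual convention, since the insert's displayed formula did not survive extraction.

```python
# Power transform plus [0, 1] standardization, sketched in Python.
# Taking log at lambda = 0 is my assumption of the usual convention;
# the series x is hypothetical, not the sales data.
import math

def power_transform(x, lam):
    return [math.log(v) if lam == 0 else v ** lam for v in x]

def standardize(y):
    """Map a series onto [0, 1] by subtracting the min, dividing by the range."""
    lo, hi = min(y), max(y)
    return [(v - lo) / (hi - lo) for v in y]

x = [10.0, 40.0, 90.0, 160.0]
for lam in (1.0, 0.5, 0.0):
    print(lam, [round(v, 3) for v in standardize(power_transform(x, lam))])
```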
In Figure 1, we see a dialog box labeled "Time series power transform" as well as a Stata Graph window with the plot of the original sales data (except that it has been standardized to lie in the interval [0, 1] by subtracting its minimum value and dividing by its range, a practice we follow throughout this insert so that series for different values of lambda are comparable).
We have resized the Stata window and the Graph window so that both the dialog box and Graph window are visible. We
have also made the foreground color black and the background color white in the graphics window so that Figure 1 would show
up well when printed.
Figure 1. Dialog box for time series power transform (lambda = 1).
The dialog box has list boxes for the user to choose the data and time variables (the user is expected to input the data prior
to using it). The user can then try a value of lambda by choosing a value in the list box labeled "lambda" (for simplicity the choices range from zero to one in steps of 0.05) and then clicking on the button labeled "lambda" (at startup the program uses the value one, so that the standardized original data are plotted).
Alternatively, one can rapidly increase or decrease the value of λ by clicking on the buttons labeled “lambda +” and
“lambda −”, respectively. Each time such a click is done, the graphics window displays the new standardized transformed data
set. This allows a user to almost animate the graphs of the successive power transforms (unfortunately one must actually click
to get each graph, thus slowing down the animation).
Note that the standardization is important here as it keeps the vertical axis from changing, which would disturb the visual
impression of smoothly changing the value of λ. Clicking on the button labeled “exit” ends the program, and control of Stata is
returned to the command line.
In Figure 2, we display the result of using λ = 0.25, that is, the fourth root transform recommended by Chatfield and
Prothero for these data. Notice that the value of λ is inserted into the caption above each graph, as is a number called “RMSE”.
This is the root mean square error in regressing the standardized transformed data on the sum of a linear trend and a cosine plus
sine of period 12.
For the sales data, the value of λ minimizing this variance is arguably the best value to stabilize variance across time.
Except for this number, can be used for any time series. In our discussion below of how to write a program such as
, we show how this regression can be easily removed from the program.
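The RMSE shown in the graph captions can be sketched as follows (a hedged Python/numpy illustration; the design matrix with an intercept, a linear trend, and one cosine–sine pair of period 12 follows the description above, but the details of the original Stata regression are assumptions):

```python
import numpy as np

def seasonal_trend_rmse(z):
    """Root mean square error from regressing z_t on an intercept,
    a linear trend, and a cosine plus sine of period 12."""
    z = np.asarray(z, dtype=float)
    t = np.arange(1, len(z) + 1, dtype=float)
    X = np.column_stack([np.ones_like(t), t,
                         np.cos(2 * np.pi * t / 12),
                         np.sin(2 * np.pi * t / 12)])
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)   # least-squares fit
    resid = z - X @ beta
    return float(np.sqrt(np.mean(resid ** 2)))
```

A series that is exactly a trend plus a period-12 cycle yields an RMSE of (numerically) zero; the power transform that minimizes this RMSE is the candidate variance-stabilizing value of λ.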
Figure 2. Dialog box for time series power transform (lambda = 0.25).
Writing programs using dialog boxes to vary parameters
In this section, we present an annotated version of the file containing the program. See [R]window
control for complete details on the elements of Stata dialog boxes.
First we define the program and get the current list of variables into the global variable :
Next we use Stata’s command to define the three list boxes in the dialog box:
Now place the four buttons used in the dialog box and define the global macros used to hold the actions to take for each button
(note that the actions for all but “exit” are to call the program defined below):
Finally, issue the command that actually forms the dialog box:
Another program does the actual calculation and graphing. It has one argument, which can be 1 for
“lambda +,” −1 for “lambda −,” and 0 for “lambda.” Since it is called only by the dialog-box program, we need not worry about checking
arguments or any of the usual programming problems.
Now get the data variable and time variable into tempvars and update the value of λ:
Next do the called-for transform and do the standardizing:
This section does the regression and can be easily removed:
Now do the graph, noting that to remove the regression, one need only change the argument of the option in :
References
Box, G. E. P. and G. M. Jenkins. 1970. Time Series Analysis, Forecasting, and Control. San Francisco: Holden-Day.
Chatfield, C. and D. L. Prothero. 1973. Box-Jenkins seasonal forecasting: Problems in a case study. Journal of the Royal Statistical Society, Series A 136: 295–336.
Newton, H. J. 1988. TIMESLAB: A Time Series Analysis Laboratory. Pacific Grove, CA: Wadsworth & Brooks/Cole.
sbe13.2 Correction to age-specific reference intervals (“normal ranges”)
Eileen Wright, Royal Postgraduate Medical School, UK, ewright@rpms.ac.uk
Patrick Royston, Royal Postgraduate Medical School, UK, proyston@rpms.ac.uk
We have discovered that our routine for calculating age-specific reference intervals (Wright and Royston, 1997) has
an error in the estimation of the standard errors of centiles when used in conjunction with modeling the coefficient of
variation. Confidence limits based on this routine are too narrow. An amended version is available with this edition
of the STB; the routine is correct in all other respects.
Reference
Wright, E. and P. Royston. 1997. sbe13: Age-specific reference intervals (“normal ranges”). Stata Technical Bulletin 34: 24–34.
sbe14 Odds ratios and confidence intervals for logistic regression models with effect modification
Joanne M. Garrett, University of North Carolina at Chapel Hill, FAX (919) 966-2274, garrettj@med.unc.edu
This insert describes an easy-to-use program for anyone who dislikes having to hand calculate odds ratios and
confidence intervals for logistic regression models with significant interaction terms, or, even worse, having to explain to students
how to do it. After years of teaching non-statistician medical researchers how to use logistic regression, and watching their eyes
glaze over when I got to the formula for calculating the variance estimate for a linear combination of betas, I decided to make
life easier on all of us and write a program that automates the process.
After writing it, I discovered the lincom command in Stata 5.0, which will also display odds ratios and confidence
intervals (see [R] lincom in the Stata Reference Manual). The two commands produce the same results, but they differ in their
syntax. lincom requires you to specify the appropriate linear combinations of the estimators; this program uses a syntax based on
descriptive terms familiar to anyone who has studied epidemiology.
Which program is better depends entirely on user preference. This program is geared toward a non-mathematically inclined
audience and has the advantage of being specified and displayed using epidemiological terminology, with a summary of the
variables and stratum-specific values used for the odds ratios. However, lincom can be explained in similar terms, and, if
specified appropriately, is simple to use. Following each example in this insert I will include the lincom statement needed to duplicate
the results. However, remember that lincom must follow the estimation of a model.
Background
The general form for the calculation of the odds ratio from the estimates of the logistic regression model is

OR = exp( Σ_i β_i (x_1i − x_0i) )

where x_1 and x_0 represent values for one group and a comparison group, respectively.
In many epidemiologic studies, the focus is on one main study factor (“exposure”) which frequently is coded as 1 for
“exposed” and 0 for “unexposed”. For instance, suppose the exposure variable is coded this way, and let x_1 = 1 (exposed)
and x_0 = 0 (unexposed). Then

OR = exp(β × (1 − 0)) = exp(β)
Our fairly complicated odds ratio formula reduces to a simple exponentiation of the beta coefficient for the “exposure”
variable (which Stata is kind enough to present for us on the output).
The formula for the 95% confidence interval (which Stata also calculates and prints for us) is

95% CI = exp(β ± 1.96 × SE(β))
Had the exposure variable been coded as something other than 1 and 0, we would need to multiply the beta coefficient
by the difference that we want to compare before exponentiating. For instance, suppose age in years was our “exposure.” If we
exponentiate beta (or use the odds ratio calculation we find on the printout), we are looking at the odds ratio for a one year
change in age, e.g., a 39 year-old versus a 38 year-old. It might be more informative to report a 10-year difference in age, say
comparing a 50 year-old to a 40 year-old. Our odds ratio and 95% confidence interval formulas then become

OR = exp(10β) and 95% CI = exp(10β ± 1.96 × 10 × SE(β))

Note that one multiplies the standard error in the confidence interval formula by the same multiple used for beta. A dead giveaway
that someone has forgotten to do so is a tiny confidence interval around a reasonable sized odds ratio.
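The scaling rule just described (multiply both the beta and its standard error by the same difference before exponentiating) can be sketched in Python; the function name and the 1.96 normal quantile for a 95% interval are illustrative choices, not the program's actual code:

```python
import math

def or_ci(beta, se, diff=1.0, z=1.96):
    """Odds ratio and confidence interval for a `diff`-unit change in
    the exposure: both beta and its standard error are scaled by `diff`."""
    point = math.exp(diff * beta)
    lo = math.exp(diff * beta - z * diff * se)
    hi = math.exp(diff * beta + z * diff * se)
    return point, (lo, hi)

# A hypothetical 10-year age difference with beta = 0.1, SE = 0.05:
odds_ratio, (lower, upper) = or_ci(0.1, 0.05, diff=10)
```

Forgetting to scale the standard error in `lo`/`hi` would reproduce exactly the "tiny confidence interval around a reasonable sized odds ratio" symptom described above.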
This is still fairly straightforward. However, things start getting messy when there is significant effect modification, which
is the epidemiologists’ term for interaction between the exposure and another variable— the effect modifier. What we are saying
is the odds ratio changes depending on the value of the effect modifier or effect modifiers. In essence, we are stratifying our odds
ratio by categories of the effect modifiers. The general formula for the odds ratio reduces, but any terms which include interactions
with the exposure variable remain. For example, suppose we are interested in the odds ratio of developing coronary heart disease
(yes = 1; no = 0) for people with hypertension (writing hyp = 1) versus people with normal blood pressure (hyp = 0), and we
find there is significant interaction between hypertension and age, as well as hypertension and sex. The logistic model (written
in the log odds form) might look like this:

log odds(chd) = β_0 + β_1 hyp + β_2 age + β_3 sex + β_4 (hyp × age) + β_5 (hyp × sex)

The formula for the odds ratio with the two interaction terms would be

OR = exp(β_1 + β_4 age + β_5 sex)
Next we would substitute values of age and sex and use the estimated betas to solve the equation to get odds ratios for the
comparison. For example, we might want the odds ratio for 50 year-old males, or for 60 year-old females.
Okay so far, but what about the confidence intervals for these odds ratios? Many journals are requiring confidence intervals
rather than -values when we report odds ratios, and now we need separate confidence intervals for each odds ratio (we may
have several odds ratios representing the categories of our effect modifiers). Not only do we need more confidence intervals,
the variance estimate for the formula no longer is the simple variance of a single beta. It’s now a more complicated formula
for a linear combination of betas. The general form for the 95% confidence interval (with interaction) is

exp( L ± 1.96 × sqrt(Var(L)) )

where L is the linear combination of betas and Var(L) is its variance, computed from the estimated variances and covariances of the betas.
The good news is the terms involving variables other than the exposure and the effect modifiers drop out of this equation.
Additionally, if the exposure is coded as 1 and 0, the equations for L and Var(L) become

L = β_e + Σ_j β_j m_j
Var(L) = Var(β_e) + Σ_j m_j² Var(β_j) + 2 Σ_j m_j Cov(β_e, β_j) + 2 Σ_{j<l} m_j m_l Cov(β_j, β_l)

where β_e = beta for the exposure variable, β_j = betas for the (up to k) interaction terms, and m_j = values for the effect
modifiers.
For the two-interaction-term example (hyp × age and hyp × sex), we would have

Var(L) = Var(β_1) + age² Var(β_4) + sex² Var(β_5) + 2 age Cov(β_1, β_4) + 2 sex Cov(β_1, β_5) + 2 age sex Cov(β_4, β_5)
Now we pick off the appropriate estimates from the variance–covariance matrix, substitute in a value for age and sex, and
solve the equation. All this just to get one of the confidence intervals for one odds ratio. We must repeat the calculation for
other combinations of age and sex. Of course, if we had started with only one interaction term (rather than two), the variance
estimate would reduce quite a bit (to 3 terms). Additionally, had the single effect modifier been a dichotomous 0–1 variable, the
equation would reduce further to the familiar single-beta form for the 0 category. Would you still rather not be bothered? Then the program will
help.
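Picking the estimates off the variance–covariance matrix, as described above, can be sketched with numpy (an illustration only; the coefficient positions and values are hypothetical, not output from the program):

```python
import numpy as np

def lincom_var(cov, coeffs, idx):
    """Variance of a linear combination sum_j c_j * beta_{idx[j]},
    read off the estimated variance-covariance matrix `cov`.
    For the two-interaction example, coeffs = (1, age, sex) applied to
    the betas of the exposure and its two interaction terms."""
    c = np.asarray(coeffs, dtype=float)
    V = np.asarray(cov, dtype=float)[np.ix_(idx, idx)]  # relevant submatrix
    return float(c @ V @ c)
```

With this variance in hand, the stratum-specific 95% confidence interval is exp(L ± 1.96 sqrt(Var(L))), repeated for each combination of effect-modifier values.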
Description
The program calculates user-specified stratum-specific odds ratios and confidence intervals for logistic regression models which
include interaction terms between an exposure and effect modifiers. If the model has been specified previously or is
repeated with new stratum values, the current model estimates are used; otherwise a new model is fit.
Syntax
dvar evar exp cov list interaction #
# #
dvar is the “disease” variable (dichotomous outcome) and should be coded as 1 and 0.
evar is the “exposure” variable and can be nominal, ordinal, interval, or continuous.
Options required
cov list is the list of confounders in the model.
interaction # is the list of interaction variables in the model plus a stratum value for each effect modifier; for more
than one interaction term, the form is interaction # interaction #.
For example, suppose there are two interactions represented by the variables (hypertension by sex, where sex=1
for males and sex=0 for females) and (hypertension by age). To calculate the odds ratio and 95% CI for 60 year-old
males, specify
and for 60 year-old females
and so on.
Options allowed
The difference option gives the difference of values for “exposed” versus “unexposed.” For example, to compare age 50 versus age 40,
the difference would be specified as 10. The default is 1, which is equivalent to exposed = 1 versus unexposed = 0.
displays the logistic regression table.
# specifies the confidence level in percent for the confidence intervals. The default is a 95% confidence interval.
Examples
The set of examples comes from a case–control study (Garrett et al. 1993) of the relationship between
the “exposure,” a rare form of a gene (hras, for those readers familiar with genetics), and the “disease,” incidence of breast
cancer. (I apologize in advance to the author of this study for taking liberties with the data to illustrate some points,
but, since the author is related to me, I hope he won't mind.) The data are stored in the accompanying dataset. Several of the variables with
their descriptions and coding are in the following table:
Variable Definition Coding
Breast cancer diagnosis 1 case
(“outcome”) 0 control
Type of gene 1 rare form of hras
(“exposure”) 0 common form
Race of patient 1 black women
0 white women
Body mass index, a continuous measure of obesity (higher numbers mean heavier)
Past history of breast 1 yes
biopsy 0 no
Has been through 1 post-menopause
menopause 0 pre-menopause
In addition, there are interaction terms between the exposure and race, the exposure and bmi, and bmi and
menopause status.
In the first example, the program calculates the odds ratio and 95% confidence interval for rare hras and breast cancer with one
interaction term between hras and race for black women (race = 1), controlling for the remaining covariates. It
requests that the logistic regression table be printed, and accepts the default difference of 1 since “exposed” (1)
minus “unexposed” (0) equals 1. The model is
With the commands shown, Stata will fit the model and produce the usual table of odds ratios and confidence intervals, and the follow-up
command will produce the stratum-specific results.
Using the program gets the results of both, using a syntax my non-statistician students like:
Following the model results, the program reports a summary table of information (a way to check to make sure the variables
were specified as expected, particularly if the model is not printed), including the linear combination of betas, its variance, and the final odds ratio
and confidence interval for the stratum requested. In this case, the stratum for black women is specified as race = 1. The
odds ratio tells us that black women with the rare form of the hras gene are 4.2 times as likely to develop breast cancer as black
women with the common form of the gene. The 95% confidence interval (1.4 to 12.5) does not include 1.0 (which would mean
no association), thus, the odds ratio is statistically significant.
Now, let’s repeat the example for white women (race = 0):
The corresponding command is
This time the odds ratio tells us that among white women there is no association between having the rare form of hras and
breast cancer—the odds ratio is close to 1.0 and the confidence interval includes 1.0, meaning it is nonsignificant. There was
no need to print the model again since the estimates did not change from the previous example.
For the next examples, let’s suppose that in addition to the exposure-by-race interaction, we think there is interaction
between the exposure and bmi. Again, race is dichotomous, specified as 0 (white women) or 1 (black women), but bmi is
continuous, so we must select some values of bmi for our strata. I have selected 23 (a normal value for body mass index
for women) and 28 (getting rather zaftig, which for non-New Yorkers means “plump”). Four examples with
two interaction terms follow:

race = 1 and bmi = 23
race = 1 and bmi = 28
race = 0 and bmi = 23
race = 0 and bmi = 28
For the first example, we have
We see from the model that both interactions are statistically significant, or nearly so. But are the individual stratum-specific
odds ratios significant? Black women with a normal value (23) for body mass index are 8.1 times as likely to develop breast
cancer if they have the rare form of hras. The confidence interval (2.18 to 30.37) does not include 1.0, thus the odds ratio is
significant.
For race = 1 and bmi = 28:
Black women with a high value (28) for body mass index are 5.1 times as likely to develop breast cancer if they have the rare
form of hras. The odds ratio is significant.
For race = 0 and bmi = 23:
White women with a normal value (23) for bmi are 1.4 times as likely to develop breast cancer if they have rare hras, but the
odds ratio is not significant.
For race = 0 and bmi = 28:
White women with a high value (28) for bmi are no more likely to develop breast cancer if they have the rare form of hras.
The equivalent commands for the previous four examples are
Finally, let’s look at some examples where the “exposure” is continuous, rather than a 0–1 dichotomous variable. Suppose
bmi is our exposure variable, i.e., we are interested in the relationship between body mass index and breast cancer. And, we
discover that there is effect modification between bmi and menopause status (post-menopause = 1 and pre-menopause =
0; don’t ask what happened to the group of women experiencing menopause: for sake of illustration, let’s assume
that menopause is an instantaneous event). We’ll keep the model simple, and look only at the outcome of breast cancer with
bmi, menopause status, and the bmi-by-menopause interaction. First, let’s compare odds ratios for a one-unit change in
bmi, stratified by post-menopausal women, and then stratified by pre-menopausal women.
The corresponding command is
Thus, among post-menopausal women, there is a significant increase in breast cancer with increasing bmi (the confidence interval
does not include 1.0). However, since we are looking at a one-unit change in bmi, at first glance we might conclude erroneously
that an odds ratio as small as 1.09 implies body mass index doesn’t have much to do with breast cancer incidence.
Next, we look at
with corresponding command
which shows that among pre-menopausal women, there is no evidence of an increase in breast cancer with increasing bmi (the
confidence interval includes 1.0).
To make our odds ratios look a little more substantive, we can use the difference option to calculate the odds ratio for a
larger change in bmi. For instance, the next two examples repeat the previous two, using a 10 point difference on the bmi scale.
The two commands are
Now for the first of these two examples:
(the output shows the difference in bmi)
This shows us that we can say that post-menopausal women are 2.4 times as likely to develop breast cancer with a 10 point
increase in bmi. Again, we conclude this odds ratio is significant. If the relationship is significant for a one-unit change in bmi,
it will be for a 10-unit change (or 1000-unit change, for that matter— I once had a journal reviewer tell me that since the
confidence interval was so close to 1.0 for a one-unit change of a continuous effect modifier, it was sure to cross 1.0 if I tried
to look at a 10-unit change).
For the second example we have
Although the odds ratio for a 10-unit change in bmi is slightly larger than in the example where we did not use the difference option,
as expected, the confidence interval remains nonsignificant. Among pre-menopausal women, there is no significant relationship
between body mass index and breast cancer.
References
Garrett, P. A., B. S. Hulka, Y. L. Kim, and R. A. Farber. 1993. HRAS protooncogene polymorphism and breast cancer. Cancer Epidemiology, Biomarkers & Prevention 2: 131–138.
Kleinbaum, D. G., L. L. Kupper, and H. Morgenstern. 1982. Epidemiologic Research: Principles and Quantitative Methods. New York: Van Nostrand Reinhold.
sg67 Univariate summaries with boxplots
John R. Gleason, Syracuse University, 73241.717@compuserve.com
Univariate summaries (means, standard deviations, etc.) are perhaps the quantities most often examined during data analysis.
This is partly because these summaries serve so many purposes, for example, to familiarize oneself with the data, to aid in
understanding other computations, or to act as canonical data descriptors in written reports.
Stata’s summarize command provides the components of a univariate summary, but its presentation is not always ideal for
the purpose at hand. For instance, summarize varlist displays the mean, standard deviation, minimum, and maximum of a set of
variables in a left-to-right fashion that allows many such sets of results to be viewed at once. But one might wish to describe each
variable by its five-number summary (minimum, 25th percentile, median, 75th percentile, and maximum). summarize varlist, detail
will compute the required values (along with many others), but presents them in a style that consumes about 15 lines of
output for each variable.
As another example, summarize typically shows the mean and standard deviation in a format that displays 7 significant
digits. This can be desirable, but not when the goal is to extract, visually, or by cut-and-paste, the means and standard deviations
of several variables for inclusion in a written report; then, one might prefer those values to be aligned on their decimal points,
followed by a small, fixed number of decimal places.
Of course, there are several other ways of displaying univariate summaries in Stata. For example, the command
(new in Stata 5.0) provides very general and flexible formatting of tables of summaries, though it is not especially convenient for
interactive data analysis. This insert presents a new command that offers a streamlined display of univariate summaries
including, optionally, text-mode boxplots.
Syntax
varlist weight exp range bylist #
As with summarize, fweights and aweights are allowed.
Before explaining the options, we first demonstrate the default behavior of the command using the data set supplied
by Clayton and Hills (1995):
By contrast, the default response of the new command is
Two differences are apparent: the new command shows the complete five-number summary in horizontal style, and prints results in fixed
rather than general format. The former choice displays the values aligned on their decimal points with a fixed number of decimal places,
the conventional style of presenting numbers in text. The latter choice permits a more compact and intuitive presentation of
five-number summaries than the detail option of summarize:
Options
The boxplot option draws a text-mode boxplot for each varlist variable. Stata can, of course, draw boxplots with its graphics commands,
but a somewhat coarser boxplot built of text characters can still be helpful, especially if displayed beside the numerical
values being portrayed.
bylist requests summaries at each unique set of values of the variables in bylist, which is analogous to attaching the
bylist prefix to the command.
#controls the number of decimal places used to display the summary values; for example, switches to
format. By default, uses format to display all values except the number of observations .
chooses between and output formats; for example, supplying the options and chooses the format
for the summary values, a style similar to the default output of .
requests listwise deletion of missing values (an observation is ignored if any of the varlist variables is missing); the
default is to use all available observations for each variable (variable-wise deletion).
requests that the standard error of the mean ( ) be printed in place of the sample standard deviation ( ).
Examples
To illustrate, we continue looking at the same data:
(output for four bygroups omitted)
The above command produces a summary for each of the six combinations of and , and draws a boxplot for each
five-number summary calculated. The boxplots map the range of each variable onto a fixed width in the output; the median is
drawn with the character “ ”, the remainder of the box with “ ”, and the whiskers with ”. also draws a glyph at the
top of each summary table to serve as a reminder of this representation.
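The idea of mapping a five-number summary onto a fixed-width line of text can be sketched as follows (a Python illustration; the glyph characters and width are assumptions, not the command's actual defaults):

```python
def text_boxplot(five_num, lo, hi, width=40, chars="|=:"):
    """Render (min, p25, median, p75, max) as a text boxplot:
    median `|`, rest of the box `=`, whiskers `:`; lo/hi give
    the plotting range the summary is mapped onto."""
    med_c, box_c, whisk_c = chars
    def col(v):
        return min(width - 1, int((v - lo) / (hi - lo) * (width - 1)))
    mn, q1, md, q3, mx = (col(v) for v in five_num)
    line = [" "] * width
    for i in range(mn, mx + 1):   # whiskers span min..max
        line[i] = whisk_c
    for i in range(q1, q3 + 1):   # box overwrites p25..p75
        line[i] = box_c
    line[md] = med_c              # median marker drawn last
    return "".join(line)
```

Drawing the whiskers first and overwriting with the box and median is what keeps each character position showing the most specific feature at that point.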
Adding two more options then gives this result:
(output for four bygroups omitted)
Remarks
1. The option differs from the prefix available to in that the data need not be sorted by the bylist
variables; sorts the data as necessary and then restores the original ordering before exiting. In addition, the
option may be combined with an clause, whereas the prefix may not.
2. However, uses the prefix to implement the option. Hence, the bylist can have at most ten variables,
which may be of either numeric or string type, and may include missing values or null strings.
3. runs on Stata Version 4.0, but Version 5.0 or newer is required to display value labels for bylist variables.
4. ’s output is 79 characters wide, the same width used by Stata’s files.
5. The characters “ ”, “ ”, and “ ” will produce reasonable boxplots in most fixed pitch fonts. However, these characters are
set by local macros at the top of the file , and are easily redefined, if desired. A similar comment applies to
the color used to draw the boxplots.
Acknowledgment
This project was supported by a grant R01-MH54929 from the National Institute on Mental Health to Michael P. Carey.
Reference
Clayton, D. and M. Hills. 1995. ssa7: Analysis of follow-up studies. Stata Technical Bulletin 27: 19–26. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 219–227.
sg68 Goodness-of-fit statistics for multinomial distributions
Jeroen Weesie, Utrecht University, Netherlands, weesie@weesie.fsw.ruu.nl
This insert describes a command that computes goodness-of-fit statistics for multinomially distributed observations (O)
and expected values (E). The expected values may be derived from an estimated model for the multinomial probability
distribution, e.g., a loglinear model, or be strictly theoretical. The gof-statistics are the members of the 1-parameter Cressie–Read
family (1984, 1988) of discrepancy measures,

CR(λ) = (2 / (λ(λ + 1))) Σ_i O_i [ (O_i / E_i)^λ − 1 ]

where the summation is over all “cells,” i.e., the “response categories” of the multinomial distribution.
The Cressie–Read family contains many well known goodness-of-fit statistics as special cases. For instance, Pearson’s X²,

X² = Σ_i (O_i − E_i)² / E_i

belongs to the Cressie–Read family with λ = 1. The deviance or likelihood-ratio statistic LR,

LR = 2 Σ_i O_i log(O_i / E_i)

is embedded in the Cressie–Read family as the (continuous) limiting value in λ → 0. Other measures such as Freeman–Tukey’s
statistic (λ = −1/2), the Kullback–Leibler information distance (entropy) (λ = −1), and Neyman’s modified X² (λ = −2; note
that in modern terminology we would refer to Neyman’s statistic as a Wald statistic) are similarly special cases of the general form.
Finally, Cressie and Read’s recommended statistic (λ = 2/3) is obviously a member of the family.
If the expected values (E) are true or at least efficiently estimated, all members of the family are asymptotically
(central) chi-squared distributed. Under standard regularity conditions, the degrees of freedom can be expressed as “the
number of cells − 1” for theoretical expected values and “the number of cells − 1 − the number of imposed restrictions” for
estimated expected values. Thus, all statistics are first-order efficient. Based on, among other things, higher-order asymptotic developments
and Monte Carlo experimentation, Cressie and Read (1988) recommend the application of a nonstandard statistic, λ = 2/3.
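The family and its special cases can be computed directly (a Python sketch under the definitions above, not the insert's Stata code; the λ = 0 branch uses the likelihood-ratio limit, and λ = −1 would similarly need its own limiting form, which this sketch omits):

```python
import math

def cressie_read(obs, expected, lam):
    """Cressie-Read discrepancy CR(lambda) = 2/(lam*(lam+1)) *
    sum O_i * ((O_i/E_i)**lam - 1).  lam = 1 gives Pearson's X2;
    lam -> 0 gives the deviance (likelihood-ratio statistic)."""
    if lam == 0:  # continuous limit: LR = 2 * sum O * log(O/E)
        return 2.0 * sum(o * math.log(o / e)
                         for o, e in zip(obs, expected) if o > 0)
    return (2.0 / (lam * (lam + 1.0))) * sum(
        o * ((o / e) ** lam - 1.0) for o, e in zip(obs, expected))
```

Evaluating this over a grid of λ values is exactly what the statistic-by-lambda plot described below displays.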
Syntax
The syntax of is
obs exp exp range numeric list
varlist # # filename
Note: The option requires the program which must be installed from the ip14 directory of the STB 35 disk
(January 1997).
Options to select lambda
specify the members of the Cressie–Read family to be displayed ( and are synonymous). More
than one of these options may be specified. Formally,
description                              lambda
Pearson’s X²                             1
C & R’s recommended statistic            2/3
log-likelihood ratio (deviance) or g2    0 (limit)
Freeman–Tukey’s statistic                −1/2
Kullback–Leibler information             −1
Neyman’s modified X²                     −2
numeric lists specifies a range of powers for the Cressie–Read statistics. See on-line help for (Weesie 1997)
for the definition of numeric lists. Note that must be installed separately for this option to work.
Specifying none of these options implies all of , or, stated differently, it is the same as specifying
.
Other options
varlist specifies a list of variables on which to aggregate and (join “cells” with the same values on varlist) before
computing goodness-of-fit statistics.
#specifies the degrees of freedom used in the computations of chi-squared-based approximate significance levels.
specifies that the table with statistic values is displayed. This option is effective only in combination with .
#specifies that the variables and are expressed as proportions with the total number of observations equal to #.
specifies that a statistic-by-lambda plot is displayed. If specified, horizontal lines at the 90%, 95%, and 99% critical
values of the (central) chi-squared distribution are shown.
specifies that the expected values may be scaled so that they sum to the number of observations. Otherwise and
should have equal sums (within a .001 multiplicative margin).
filename specifies the name of a file to save the statistic-by-lambda plot.
Examples
We have estimated a model with 8 parameters, including the constant, on the data in the accompanying file .
The data are assumed to follow a multinomial distribution. In this case, we estimated a loglinear model in GLM. The data and
estimated expected counts and a variable, to be used below, are
(output omitted)
The deviance (likelihood ratio statistic against the saturated model) can be obtained as
Note that the output of contains the proportions of cells with low observed and expected counts.
To obtain the default list of statistics, we run
Note that the values of the statistics vary. Using Neyman’s we would reject the model with expected values at any
significance level below 1%. With the other statistics, we would not reject the model. These conflicting conclusions are somewhat
disturbing. We are concerned that some of the conclusions that we want to draw are not very robust with respect to (1) auxiliary
assumptions, such as the link-function in a GLM model, or (2) arbitrary decisions, such as the selection of test-statistic in a class
with similar asymptotic properties with little known about small-sample properties. Of course, the program can be abused
to shop around for a statistic that “proves” whatever we want to do. It is clear that such an application of has nothing to
do with good statistics or good science.
It is possible to collapse cells on a variable before computing the goodness-of-fit statistics with the option. A relatively
high proportion of cells with low expected counts is often a reason to collapse cells. Note that you have to manually modify the
appropriate degrees of freedom.
Finally, it is often quite convenient to inspect the Cressie–Read family in a statistic-by-lambda plot. This plot, containing
the critical significance levels at 90%, 95% and 99%, is obtained via the option
[statistic-by-lambda plot omitted: gof on the vertical axis (40 to 60), lambda on the horizontal axis (−2 to 2)]
Figure 1. Cressie–Read goodness-of-fit statistics (40 cells; 32 df; horizontal lines at the .90, .95, and .99 critical values).
Acknowledgment
The command is an extensive update of an early version in the ETS Kit, a collection of Stata commands written for
Stata 2.1 by the late Albert Verbeek, Professor of Statistics at the Department of Social Sciences at Utrecht University, and
myself.
References
Cressie, N. A. C. and T. R. C. Read. 1984. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society, Series B 46: 440–464.
Read, T. R. C. and N. A. C. Cressie. 1988. Goodness-of-Fit Statistics for Discrete Multivariate Data. New York: Springer Verlag.
Weesie, J. 1997. ip14: Programming utility: Numeric lists. Stata Technical Bulletin 35: 14–16.
sg69 Immediate Mann–Whitney and binomial effect-size display
Richard Goldstein, Qualitas, Inc., Brighton, Mass., richgold@netcom.com
A perennial complaint of users of statistics is that the results are not “meaningful” in the real world. The two programs
presented here are the first of what will eventually be several inserts presenting translations from statistical tests to what may
be more meaningful indices.
The first index presented, the Mann– Whitney score, has previously been presented in (Goldstein 1994). The
presentation there was a simple reporting of part of what makes up the test. Here, we expand the use of the score to many other
tests, including pre-post studies, comparisons of proportions, matched-pairs tests, and independent groups tests. This statistic
is only presented as an immediate statistic, as it is meant to be used directly after some statistical test or procedure.
The interpretation of this score is that it shows the proportion of pairs in which cases of one type (e.g., one group such as
the experimental versus the control group, males versus females, etc.) have a higher value for the dependent variable than
do cases of the other type. Pairs are simply defined as the number of people in group 1 times the number of people in group
2; for example, if there are 15 people in the experimental group and 19 in the control group, there are 285 15 19 pairs.
Then, a value of 0.27 means that in 27% of those 285 pairs, the member of the experimental group has more of something (e.g.,
dollars, pain, days in hospital, etc.) than does the member of the control group. It also means, for a randomly chosen pair, the
probability is 0.27 that the person from the experimental group will have a higher value on the dependent variable than will the
person from the control group. This value is the same as the area under the ROC curve as shown in .
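The pair-counting interpretation described above can be sketched directly. A minimal Python illustration (the group values are invented for the example; the variable names are ours):

```python
# Proportion of (experimental, control) pairs in which the experimental
# member has the higher value -- the Mann-Whitney score described above.
# The data below are illustrative only.
experimental = [12, 15, 9, 20, 18]
control = [11, 14, 16, 10, 13]

pairs = [(e, c) for e in experimental for c in control]
higher = sum(e > c for e, c in pairs)   # experimental member wins
ties = sum(e == c for e, c in pairs)    # ties conventionally count as half

mw = (higher + 0.5 * ties) / len(pairs)
print(len(pairs))  # 25 pairs = 5 * 5
print(mw)
```

The same quantity is what a rank-based computation (or the area under an ROC curve) produces; the brute-force pair count is shown here only because it matches the verbal definition.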
The second measure presented, the binomial effect-size display, translates the results of any correlation, chi-squared test,
or t test to a simple effect size.
Immediate Mann–Whitney Statistic
gives the Mann–Whitney statistic (“probability that a randomly selected member of one group will have a better result than a
randomly selected member of the other group”) for the following cases.
Options
, pre-post studies, is the total number of subjects in the study while is the number of subjects who improved; thus, if
there were 30 subjects and 22 of them improved, the user would enter .
, comparisons of proportions, is proportion of “treatment” group who improve, while is proportion of
“control” group who improve; that is, enter, for each group, the proportion who had whatever event is of interest (whatever
the event is; e.g., promotion, termination, survival, remission, relapse, etc.); these proportions are easily available from
Stata’s tabulate command by asking for column or row percentages (whichever is appropriate in your setup); for example:
.
, matched pairs tests, is average difference, while is standard deviation of differences; this information is
provided in the standard output from Stata’s command.
, independent groups tests, is difference of means; is variance for one group, while is variance for
other group. Note that the above asks for variances, while the shows standard deviations (the variance is just the
square of the standard deviation).
Note that this statistic is given at the very end of for a nonparametric comparison (via the test).
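For the matched-pairs and independent-groups cases above, the normal-theory translation can be sketched directly: the probability that one normal variate exceeds another is the standard normal CDF evaluated at the standardized mean difference. A minimal Python sketch (the function names and illustrative numbers are ours; the insert's command may differ in details such as tie handling):

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def mw_independent(mean_diff, var1, var2):
    """Independent groups: P(X1 > X2) for two normal groups is
    Phi(mean_diff / sqrt(var1 + var2)) -- note variances, not SDs."""
    return normal_cdf(mean_diff / sqrt(var1 + var2))

def mw_paired(mean_diff, sd_diff):
    """Matched pairs: P(difference > 0) = Phi(mean_diff / sd_diff)."""
    return normal_cdf(mean_diff / sd_diff)

# Illustrative numbers only: mean difference 1.75, variances 6.25 and 9.0
print(mw_independent(1.75, 6.25, 9.0))  # about 0.67
```

With a zero mean difference both functions return 0.5, as they should: a randomly chosen member of either group is equally likely to be the higher one.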
Remarks
In addition to its use in helping to interpret statistical results, this statistic can also be used, with caution, to “adjust”
the results from studies that are of lower quality than desired. For example, Colditz et al. (1989) suggest that studies that use
sequential assignment, rather than random assignment, should have their Mann–Whitney statistics reduced by 0.15 and that
non-double-blind randomized studies should be decreased by 0.11. Another example (Colditz et al. 1988a) is that when there is
a standard therapy but the control group is a placebo group, the Mann–Whitney statistic should be decreased by 0.10!
In other words, studies using sequential assignment have been shown to be biased (as compared with randomized studies)
and the amount of bias is about 15%; that is, non-randomized studies tend to overestimate the proportion of pairs in which one
group does better by about 15%, compared with randomized studies. Similarly, use of a placebo arm in a study, when there is a
standard therapy, tends to result in a bias of about 10%. Note that these percentages are based on an examination of a number
of studies in certain medical areas from the 1980s; the results might differ in other areas (e.g., schizophrenia) or at other times.
The formulas used are from Colditz et al. (1988b); in that article they also present a formula for obtaining a combined
score across several measures, weighting each by the inverse of their standard deviations. This requires the sample size for each
group.
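The combining rule just described amounts to a weighted average of the per-measure scores. A minimal sketch (the function and argument names are ours, and this omits the sample-size bookkeeping the article's full formula requires):

```python
def combined_mw(scores, sds):
    """Combine several Mann-Whitney scores across measures, weighting
    each by the inverse of its standard deviation, as described above.
    Hypothetical helper; not the command's actual implementation."""
    weights = [1.0 / s for s in sds]
    return sum(w * x for w, x in zip(weights, scores)) / sum(weights)

# Illustrative only: three outcome measures with different precisions;
# the most precise measure (smallest SD) dominates the combined score.
print(combined_mw([0.70, 0.62, 0.55], [0.05, 0.10, 0.20]))
```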
Examples of the use of the Mann–Whitney test statistic
The first example is from the Stata Manual (and is the same as the second example used for BESD, below). The only
difference between the first two examples is the sign of the statistic, with the results summing to 1.0.
We would interpret this to mean that in only 1% of the with-and-without treatment pairs would the without treatment car have
greater mileage. Since there are only two groups, this implies that in about 99% of the pairs, the car with treatment would have
greater mileage; we can check on this by entering the same information as above, except that the negative sign on the statistic
is left off:
The next example is also from the Stata Manual.
This final example shows the same information, but now using the separate variances. The Mann–Whitney score drops
from about 99% to about 92%, showing some effect of our assumption of equal variances.
Binomial effect-size display
test value # #
If there is no option then the first argument should be the correlation and nothing else is needed.
Many people use statistics that appear to have obvious meanings, such as the correlation coefficient, or an effect size (e.g.,
the difference in means divided by the common standard deviation). However, it is not always obvious how “important” the
value of these is in the real world.
“The BESD displays the change in success rate (e.g., survival rate, improvement rate, etc.) attributable to a new treatment
procedure. For example, an r of .32 ... is said to account for “only 10% of the variance”; however, the BESD shows that this
proportion of variance accounted for is equivalent to increasing the success rate from 34% to 66%, which would mean, for
example, reducing an illness rate or a death rate from 66% to 34%.” (Rosenthal and Rubin 1982, 166)
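The display quoted above is computed directly from the correlation: the two “success rates” are 0.5 − r/2 and 0.5 + r/2 (Rosenthal and Rubin 1982). A minimal sketch in Python (the function name is ours, not the insert's command):

```python
def besd(r):
    """Binomial effect-size display for a correlation r: the implied
    success rates of the two groups are 0.5 - r/2 and 0.5 + r/2."""
    return 0.5 - r / 2, 0.5 + r / 2

# Rosenthal and Rubin's example: r = .32 ("only 10% of the variance")
control, treated = besd(0.32)
print(control, treated)  # roughly 0.34 and 0.66
```

Note that the two rates always sum to 1, so the display is best read as the difference in success rates, which equals r itself.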
Examples of the use of the BESD
The first example is from Rosenthal and Rubin (1982, 167):
Thus, a correlation of 0.32 is equivalent to increasing the success rate from 34% to 66%, certainly an important accomplishment.
The next example is from the Stata Manual’s entry on [R]ttest.
In this example, we see that a relatively small t statistic implies a large real-world difference in success rates, one that would
make most of us ecstatic.
References
Colditz, G. A., J. N. Miller, and F. Mosteller. 1988a. The effect of study design on gain in evaluations of new treatments in medicine and surgery. Drug Information Journal 22: 343–352.
Colditz, G. A., J. N. Miller, and F. Mosteller. 1988b. Measuring gain in the evaluation of medical technology. International Journal of Technology Assessment 4: 637–642.
Colditz, G. A., J. N. Miller, and F. Mosteller. 1989. How study design affects outcomes in comparisons of therapy, I: Medical. Statistics in Medicine 8: 441–454.
Goldstein, R. 1994. The overlapping coefficient and an “improved” rank-sum statistic. Stata Technical Bulletin 22: 12–15. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 132–136.
Rosenthal, R. and D. B. Rubin. 1982. A simple, general purpose display of the magnitude of experimental effect. Journal of Educational Psychology 74: 166–169.
STB categories and insert codes
Inserts in the STB are presently categorized as follows:

General Categories:
an   announcements
cc   communications & letters
dm   data management
dt   datasets
gr   graphics
in   instruction
ip   instruction on programming
os   operating system, hardware, & interprogram communication
qs   questions and suggestions
tt   teaching
zz   not elsewhere classified

Statistical Categories:
sbe  biostatistics & epidemiology
sed  exploratory data analysis
sg   general statistics
smv  multivariate analysis
snp  nonparametric methods
sqc  quality control
sqv  analysis of qualitative variables
srd  robust methods & statistical diagnostics
ssa  survival analysis
ssi  simulation & random numbers
sss  social science & psychometrics
sts  time-series, econometrics
svy  survey sampling
sxd  experimental design
szz  not elsewhere classified

In addition, we have granted one other prefix, stata, to the manufacturers of Stata for their exclusive use.
International Stata Distributors
International Stata users may also order subscriptions to the Stata Technical Bulletin from our International Stata Distributors.

Company: Applied Statistics & Systems Consultants
Address: P.O. Box 1169, Nazerath-Ellit 17100, Israel
Phone: +972 66554254
Fax: +972 66554254
Email: sasconsl@actcom.co.il
Countries served: Israel

Company: Dittrich & Partner Consulting
Address: Prinzenstrasse 2, D-42697 Solingen, Germany
Phone: +49 212-3390 99
Fax: +49 212-3390 90
Email: evhall@dpc.net
Countries served: Austria, Germany, Italy

Company: Metrika Consulting
Address: Roslagsgatan 15, 113 55 Stockholm, Sweden
Phone: +46-708-163128
Fax: +46-8-6122383
Email: hedstrom@metrika.se
Countries served: Baltic States, Denmark, Finland, Iceland, Norway, Sweden

Company: Ritme Informatique
Address: 34 boulevard Haussmann, 75009 Paris, France
Phone: +33 1 42 46 00 42
Fax: +33 1 42 46 00 33
Email: ritme.inf@applelink.apple.com
Countries served: Belgium, France, Luxembourg, Switzerland

Company: Smit Consult
Address: Doormanstraat 19, Postbox 220, 5150 AE Drunen, Netherlands
Phone: +31 416-378 125
Fax: +31 416-378 385
Email: j.a.c.m.smit@smitcon.nl
Countries served: Netherlands

Company: Timberlake Consultants
Address: 47 Hartfield Crescent, West Wickham, Kent BR4 9DW, U.K.
Phone: +44 181 462 0495
Fax: +44 181 462 0493
Email: timberlake@compuserve.com
Countries served: Ireland, U.K.

Company: Timberlake Consultants, Satellite Office
Address: Praceta do Comércio, N 13–9 Dto., Quinta Grande, 2720 Alfragide, Portugal
Phone: +351 (01) 4719337
Telemóvel: 0931 62 7255
Email: timberlake.co@mail.telepac.pt
Countries served: Portugal