
Using dialog boxes to vary program parameters

STATA TECHNICAL BULLETIN
STB-36, March 1997
A publication to promote communication among Stata users
Editor: H. Joseph Newton, Department of Statistics, Texas A & M University, College Station, Texas 77843; 409-845-3142; FAX 409-845-3144; EMAIL stb@stata.com

Associate Editors: Francis X. Diebold, University of Pennsylvania; Joanne M. Garrett, University of North Carolina; Marcello Pagano, Harvard School of Public Health; James L. Powell, UC Berkeley and Princeton University; J. Patrick Royston, Royal Postgraduate Medical School
Subscriptions are available from Stata Corporation, email stata@stata.com, telephone 979-696-4600 or 800-STATAPC,
fax 979-696-4601. Current subscription prices are posted at www.stata.com/bookstore/stb.html.
Previous Issues are available individually from StataCorp. See www.stata.com/bookstore/stbj.html for details.
Submissions to the STB, including submissions to the supporting files (programs, datasets, and help files), are on a nonexclusive, free-use basis. In particular, the author grants to StataCorp the nonexclusive right to copyright and distribute the material in accordance with the Copyright Statement below. The author also grants to StataCorp the right to freely use the ideas, including communication of the ideas to other parties, even if the material is never published in the STB. Submissions should be addressed to the Editor. Submission guidelines can be obtained from either the editor or StataCorp.
Copyright Statement. The Stata Technical Bulletin (STB) and the contents of the supporting files (programs, datasets, and help files) are copyright © by StataCorp. The contents of the supporting files (programs, datasets, and help files) may be copied or reproduced by any means whatsoever, in whole or in part, as long as any copy or reproduction includes attribution to both (1) the author and (2) the STB.
The insertions appearing in the STB may be copied or reproduced as printed copies, in whole or in part, as long as any copy
or reproduction includes attribution to both (1) the author and (2) the STB. Written permission must be obtained from Stata
Corporation if you wish to make electronic copies of the insertions.
Users of any of the software, ideas, data, or other materials published in the STB or the supporting files understand that such use
is made without warranty of any kind, either by the STB, the author, or Stata Corporation. In particular, there is no warranty of
fitness of purpose or merchantability, nor for special, incidental, or consequential damages such as loss of profits. The purpose
of the STB is to promote free communication among Stata users.
The Stata Technical Bulletin (ISSN 1097-8879) is published six times per year by Stata Corporation. Stata is a registered trademark of Stata Corporation.
Contents of this issue
gr16.1. Convex hull plots
gr24. Easier bar charts
gr25. Spike plots for histograms, rootograms, and time-series plots
ip16. Using dialog boxes to vary program parameters
sbe13.2. Correction to age-specific reference intervals ("normal ranges")
sbe14. Odds ratios and confidence intervals for logistic regression models with effect modification
sg67. Univariate summaries with boxplots
sg68. Goodness-of-fit statistics for multinomial distributions
sg69. Immediate Mann–Whitney and binomial effect-size display
gr16.1 Convex hull plots
Nicholas J. Cox, University of Durham, UK, FAX (011) 44-91-374-2456, n.j.cox@durham.ac.uk
Syntax
The syntax for the command is
yvar xvar exp range #hullvar graph options
Note: this command requires the two convex hull programs of Gray and McGuire (1995), which must be installed from the gr16 directory of the STB-23 disk (January 1995).
Options
# specifies the convex hull required. The default is 1, the outermost hull.
hullvar specifies a variable to hold information about the hulls and is only needed if is a variable in the data set.
specifies that the final graph (for # > 1) is to be preceded by a movie of the hulls up to #. Thus, with # = 2, the final graph includes hull 2, and hulls 1 and 2 will be shown separately beforehand.
graph options are any options allowed with graph; see the help on graph. The default titles indicate the y and x variables as usual.
Explanation
The convex hull programs and of Gray and McGuire (1995) are a very useful adjunct to the graphics
facilities of Stata. The program here, , is a supplement to those programs designed to streamline two frequent needs,
namely to get either a quick or a presentable graph of a particular convex hull on a standard scatter plot. in fact calls
and , which do all the hard work, so all that is offered is a different interface that will ease some tasks.
The use of varies from the use of and in the following details:
1. and require the user to specify the variable before the variable, contrary to the convention of
that will be familiar to experienced Stata users. uses the same convention as ; that is, the
variable before the variable.
2. and leave behind extra variables and extra observations specifying the hulls and allowing graphical
closure of the hulls. These extras have to be dropped from the data set each time other convex hulls are drawn in the
same Stata session. handles these details by using temporary variables and leaving the data unchanged. A partial
exception is that uses a stub, by default , for a set of variables, e.g. , , . I do not know a way to
specify a temporary stub in Stata; one problem is that all temporary variables have names 8 characters long underneath
the names that the programmer employs. I have left the originals unchanged and merely used a different stub. A guess at user practices is that this varname is less likely in user data sets than the default. If it is in the data set when the command is invoked, an error message is issued and the option must be used to specify another new variable. However, the variable is not left behind, so it is in effect used as a temporary variable.
3. allows the user to specify and .
4. allows the user to specify the options for a presentable graph at the same time as invoking
and .
5. has a simpler syntax for the tasks described, especially for the casual user.
6. does not allow separate hulls to be drawn for each level of a categorical variable.
7. does not allow two or more hulls to be drawn on the same graph.
8. does not allow the user to examine the subsets of the data defined by each convex hull.
Items 1–5 are suggested to be advantages of , while items 6–8 are limitations.
Stata Technical Bulletin 3
Examples
We will work with data on 158 glacial cirques from the English Lake District (Evans and Cox 1995), found in the
accompanying file . Glacial cirques are hollows excavated by glaciers that are open downstream, bounded upstream by the crest of a steep slope, and arcuate in plan around a more gently sloping floor. More informally, they are sometimes
described as “armchair-shaped”. They are common in mountain areas that have or have had glaciers present.
Whether cirque shape changes with size is one question of interest to geomorphologists. Given data on cirque length and
width, the outermost convex hull is simply displayed by
[Graph: Length of median axis, m, versus Width across median axis, m, with the outermost convex hull drawn.]
Figure 1. Convex hull plot: natural scale.
The next step might be to use logarithmic scales and to add some sensible labels by
[Graph: Length of median axis, m, versus Width across median axis, m; both axes logarithmic, 200 to 2000.]
Figure 2. Convex hull plot: logarithmic scale.
Here we are exploiting the congenial fact that, for this example, taking the convex hull and logarithmic transformation of both
variables are commutative; the result is the same whichever you do first. Note, however, that this is necessarily true only for
affine transformations and must be checked otherwise.
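The commutativity claim is easy to check numerically. The sketch below is in Python rather than Stata, with hypothetical points; the hull routine is Andrew's monotone chain. An affine map always preserves the set of hull vertices, while a logarithmic map has to be checked case by case (here, as in the cirque example, it happens to commute).

```python
# Numerical check, in Python rather than Stata, of the claim above:
# the convex hull commutes with an affine map, but a logarithmic map
# must be checked case by case.  Hull routine: Andrew's monotone chain.
import math

def convex_hull(points):
    """Return indices of hull vertices in counterclockwise order."""
    idx = sorted(range(len(points)), key=lambda i: points[i])

    def cross(o, a, b):
        return ((points[a][0] - points[o][0]) * (points[b][1] - points[o][1])
                - (points[a][1] - points[o][1]) * (points[b][0] - points[o][0]))

    def half(seq):
        out = []
        for i in seq:
            while len(out) >= 2 and cross(out[-2], out[-1], i) <= 0:
                out.pop()
            out.append(i)
        return out[:-1]

    return half(idx) + half(reversed(idx))

pts = [(1.0, 1.0), (2.0, 5.0), (4.0, 2.0), (6.0, 7.0), (3.0, 3.0), (5.0, 1.5)]
affine = [(2 * x + 1, 3 * y - 2) for x, y in pts]       # affine: hull preserved
logged = [(math.log(x), math.log(y)) for x, y in pts]   # logs: must be checked

print(sorted(convex_hull(pts)) == sorted(convex_hull(affine)))  # True, always
print(sorted(convex_hull(pts)) == sorted(convex_hull(logged)))  # True here, not in general
```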
References

Evans, I. S. and N. J. Cox. 1995. The form of glacial cirques in the English Lake District, Cumbria. Zeitschrift für Geomorphologie 39: 175–202.
Gray, J. P. and T. McGuire. 1995. gr16: Convex hull programs. Stata Technical Bulletin 23: 11–15. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 59–66.
gr24 Easier bar charts
Nicholas J. Cox, University of Durham, UK, FAX (011) 44-91-374-2456, n.j.cox@durham.ac.uk
Syntax
The syntax for the command is
varlist weight exp range sortvar sortvar
labelvar byvar bar options
Options
sortvar means that bars are to be plotted in ascending order of sortvar, highest values on the right. If weights are used
and sortvar is numeric, then the order is calculated using the weights. may not be combined with or .
sortvar means that bars are to be plotted in descending order of sortvar, highest values on the left. If weights are used
and sortvar is numeric, then the order is calculated using the weights. may not be combined with or .
labelvar means that bars are to be plotted with labels from labelvar, normally but not necessarily a string variable
containing text. may not be combined with .
byvar means what it does elsewhere: bars are plotted in groups according to the values of byvar. Note that it may not be combined with the sorting and labeling options.
applies when varlist contains a single categorical variable and it is desired to show the count or frequency in each
category. If the number of categories is 6 or fewer, a set of temporary indicator variables will be generated, and the bars
will touch and have (by default) different colors and shadings. If the number of categories is 7 or more, the categories will
be plotted as separate bars for a single variable, with the same color and shading, so long as has not been invoked.
However, if separate bars are preferred when the number of categories is 6 or fewer, use the option in addition.
is explained just above and is restricted to overriding a default behavior with the option.
means that results are to be reported as percents. The default is percents of the grand total of all the variables in the varlist; see the group and variable totals below. The percent and proportion options are mutually exclusive.
means that results are to be reported as proportions or fractions between 0 and 1. The default is proportions of the grand total of all the variables in the varlist; see the group and variable totals below. The percent and proportion options are mutually exclusive.
means that percents or proportions (one must be specified) are of the total of all the values in each group, defined by one value of the grouping variable. The group and variable totals are mutually exclusive.
means that percents or proportions (one must be specified) are of the total of all the values for each variable in the varlist. The group and variable totals are mutually exclusive.
bar options are other options allowed with graph, bar. See [R] graph or the on-line help for details of those options.
Explanation
Simple bar charts can be surprisingly awkward in Stata. The basic reason for this is that has a built-in tendency
to add up whatever is fed to it. This is fine so long as what you want plotted are sums of values in their original units, or
means, which are one step away with the option. If you want something else, some preprocessing is required, which
makes some basic tasks rather complicated, especially for users new to Stata.
I have written as a way of automating the most common kinds of preprocessing. is intended to be just
with some added bells and whistles.
First, however, I will commend . If you find frustrating, check out , which gives a histogram for
categorical variables. may be the answer to your question.
There are three main problems that is designed to tackle. They sometimes arise in combination, especially the first
and the second.
Problem 1: Percents and proportions
Suppose you want results plotted in percents (sum 100) or proportions (sum 1). It is necessary to transform your variables
so that their values do indeed add up to the appropriate sum, 100 or 1. In STB-14, Felicia Knaul (1993) asked how to create
stacked bar charts of percents. The answer to her question was six command lines creating the percents before they are fed to
. With , the answer to her problem would be
where the options and call for percents and for those percents to be calculated for each group (p by g).
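The kind of preprocessing being automated here can be sketched as follows, in Python rather than Stata, with a hypothetical (group, category, count) layout: each group's counts are rescaled to sum to 100.

```python
# Sketch (Python, not Stata) of "percent within group" preprocessing:
# rescale each group's counts so they sum to 100.  The rows below are
# hypothetical (group, category, count) data, not Smith's survey.
rows = [("A", "yes", 20), ("A", "no", 30),
        ("B", "yes", 10), ("B", "no", 40)]

group_totals = {}
for g, c, n in rows:
    group_totals[g] = group_totals.get(g, 0) + n

percents = [(g, c, 100.0 * n / group_totals[g]) for g, c, n in rows]
print(percents)  # [('A', 'yes', 40.0), ('A', 'no', 60.0), ('B', 'yes', 20.0), ('B', 'no', 80.0)]
```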
Let us look at another similar example, especially for those without access to the insert by Knaul. Smith (1984) studied the
geography of crime in an area of Birmingham, England. She asked 531 residents in an ethnically mixed area whether they had
taken any action to avoid victimization. Her table can be read into Stata and a bar chart drawn using (see Figure 1).
[Graph: grouped bar chart of counts (0 to 200) for responses yes and no, by Asian, W.Indian, and white.]
Figure 1. vbar chart.
This could also have been done by
However, where excels is in the next step, standardizing so that we look at percents (see Figure 2):
[Graph: the same bar chart standardized to percents (0 to 80) within each group, for responses yes and no by Asian, W.Indian, and white.]
Figure 2. vbar chart with percents.
The stacking option could have been added in either case. Another option computes the percent or proportion for each variable. If percents or proportions are requested without a group or variable total, the default is that they are calculated with reference to the total of all the variables in the varlist, which is naturally equivalent to the per-variable total if there is just one variable.
Problem 2: Categorical variables
Suppose you have a categorical variable coded numerically with more or less arbitrary codes. would treat each
code literally, that is to say, numerically, so that if a variable were coded 1, 2, and so forth, all the 1’s would be added up, all
the 2’s added up, and so forth, so that each bar would represent the frequency of the value, multiplied by the arbitrary code. In
the automobile data distributed with Stata, rep78 (repair record) is such a categorical variable with codes 1 through 5.
is an answer to this problem and plots frequencies (as, in a cruder way, does ). Yet with , the bars touch, which you may not want, perhaps because you are sensitive about giving the impression of an underlying continuous
scale. Note also that is not similar to .
A solution which is rather Stata-ish is to use to generate a set of indicator variables (values 1 or 0)
that when added up by will give the desired frequencies. Experienced Stata users will do this as a conditioned
reflex, but to those learning Stata it can seem a trifle arcane. automates this indicator variable line of attack when called
with the option. The attack fails if the number of categories exceeds 6, because will not allow more
than 6 variables. There is another line of attack that is then tried, which has the consequence that it plots separate bars, so long
as has not been used as an option. Separate bars may or may not be what is preferred. If there are 6 or fewer categories,
invoking the option will override the default. In the automobile example, the and options produce
the following plot:
[Graph: frequency bars (0 to 30) for Repair Record 1978, categories 1 to 5.]
Figure 3. vbar with category option.
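The indicator-variable line of attack can be sketched in Python rather than Stata (the repair codes below are hypothetical, in the spirit of the automobile example): each category becomes a 0/1 column, and summing each column reproduces the category frequencies, which is exactly what a sum-based bar routine needs.

```python
# Indicator-variable (one-hot) trick sketched in Python: summing the
# 0/1 columns reproduces category frequencies for a sum-based bar
# routine.  The repair codes 1..5 here are hypothetical data.
data = [3, 3, 4, 1, 5, 3, 2, 4, 4, 3]
categories = sorted(set(data))

indicators = {c: [1 if v == c else 0 for v in data] for c in categories}
frequencies = {c: sum(col) for c, col in indicators.items()}
print(frequencies)  # {1: 1, 2: 1, 3: 4, 4: 3, 5: 1}
```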
With the Smith data looked at earlier, the assumption was that we were reading in the counts from a summary table. What
is also likely with such data is that we do not have the summary counts but the raw data in a categorical variable, say ,
taking values or . The bar chart would then be produced by
Note that only allows one variable in the varlist with the category option. So
is allowed, but not
The best reasons for this limitation are that users do not often seem to want the latter form, and that if they ask for it, they
really would be better served by
or
instead, which is fine. Note that in the last example the option specifies that the variable is categorical; the other is treated as one automatically. Under Stata 5.0, which this insert assumes, both can be string variables as well as numeric, because of a change to the underlying command.
Problem 3: Ordered, labeled bar charts
Consider the United States census data distributed with Stata. You might want a bar graph of divorces for each state,
labeled with state identifiers. The full names would lead to a very messy graph, but two-letter identifiers (in lower case) would
be just about tolerable. Let us assume that contains those identifiers (in lower case), such as tx (a little state somewhere
in the South). We have included with this insert a file called containing the variables and from
as well as the variable with the two-letter identifiers.
requires a preceding
and produces an alphabetically ordered bar chart, which may be what you want. On the other hand,
requires a preceding
and produces a numerically ordered bar chart. Incidentally, this works properly only because the 50 states have 50 unique values
for , with no ties. You would not always be so lucky. What if you want both the state labeling and the numerical ordering?
It can be done with preprocessing, and it is done on request by an option that separates the ordering from the labeling by another variable. The cost, however, is that the values of the string variable become value labels, and so cannot be more than 8 characters long. There may well be some trick to get round this limit.
would produce bar charts with the divorces increasing or decreasing from left to right, respectively (see Figure 4 for the result of
the first of these commands), and labeled by the state identifiers. Any ties in either the (values) or the (labels) variables
would have been handled properly, and separate bars shown for separate, but equal, values.
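Separating the sorting variable from the labeling variable amounts to sorting (label, value) pairs by value; a stable sort keeps tied values as separate, equal-height bars. A Python sketch with hypothetical state figures:

```python
# Sorting bars by one variable while labeling them with another,
# sketched in Python.  The values and labels below are hypothetical,
# not the census figures.
values = [61978, 6595, 18000, 18000, 177000]
labels = ["ga", "vt", "ks", "ct", "ca"]

order = sorted(range(len(values)), key=lambda i: values[i])   # ascending; stable on ties
bars = [(labels[i], values[i]) for i in order]
print(bars)  # [('vt', 6595), ('ks', 18000), ('ct', 18000), ('ga', 61978), ('ca', 177000)]
```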
[Graph: Number of divorces (0 to 150000) by state, bars in ascending order, labeled nd, de, vt, ..., fl, tx, ca.]
Figure 4. vbar with laby() option.
So far we have considered the case in which takes a string variable as argument. The option may also be used
with a numeric variable which is either labeled (its labels are used) or unlabeled (its values are used as labels). The same length
limit of 8 characters applies.
Note that the sorting in is alphabetic, or reverse alphabetic, with string variables, and takes account of any weights
specified with numeric variables. I can imagine cases in which the user wants the bar heights to reflect the weights, but not the
sorting variable. Note that they need not be the same variable. That cannot be handled by . It would make the syntax and
programming more complicated for what is, I guess, an unusual case that can be tackled by something like
In the above, the user wishes to apply weights to , which together determine the bar heights, but not to the sorting variable
.
It is not an error to call the or option without . In that case, is used to label the sorted bars.
References

Knaul, F. 1993. qs5: How to create stacked bar charts of percentages. Stata Technical Bulletin 14: 12–13. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 71–72.
Smith, S. J. 1984. Crime and the structure of social relations. Transactions, Institute of British Geographers 9: 427–442.
gr25 Spike plots for histograms, rootograms, and time-series plots
Nicholas J. Cox, University of Durham, UK, FAX (011) 44-91-374-2456, n.j.cox@durham.ac.uk
Anthony R. Brady, Public Health Laboratory Service Statistics Unit, UK, tbrady@phls.co.uk
Syntax
The syntax for the command is
varname weight exp range # # graph options
Options
# uses the round() function to round varname to the nearest multiple of #. See the help on functions or [U] 20.3.5 in the manual. In other words, it specifies the bin width or class interval.
specifies that the vertical scale is to show fractions. This convention is the opposite of that for .
specifies that the vertical scale is to show square roots of frequencies (J. W. Tukey’s “rootogram”).
# specifies a constant other than 0 as a level from which spikes are drawn vertically.
graph options are options allowed with graph. The by() option is allowed, but its total suboption is trapped with a warning, as the total graph produced with byvar would superimpose spikes, not add them vertically.
The defaults for these graph options are
of “Frequency” (or “Fraction”, if is used, or “Root of frequency”, if is used);
of the variable label of varname (or varname itself, if the label is not defined);
, so that each bin or value is shown by a vertical spike;
, so that invisible point symbols are used.
Explanation
A standard idiom with is
which connects and by vertical lines on a scatter (twoway) plot of and against , in this case without visible point
symbols. From that it is a small step to see that setting to a constant (e.g., 0 or some other reference level) produces some
useful graphs, and another small step to see that some prior calculation gives you a kind of histogram, with separate spikes, not
touching bars.
has been written to automate the production of such graphs, so that users can, as far as possible, get what they
want with a single command. The name “spike plot” was suggested partly by the “spike chart” of Berry (1996, 14).
Histograms
Some data analysts producing histograms seem to prefer spikes to bars, as a matter of logic or of taste, especially with
discrete variables. Examples are found in Plackett (1971) and Evans, Hastings, and Peacock (1993). Even if people want a
conventional histogram with touching bars, the limit in of 50 bins sometimes proves frustrating. In our
experience, this is often when there is some fine structure in the frequency distribution that is of interest, even if it is in some
way spurious or pathological. Examples come from demographic data on human ages; people prefer to state certain ages (such
as multiples of 10 or 5 years) as a matter of vanity, ignorance, biases in memory, and so forth. The number of possible ages
is clearly about 100 and the fine structure of the distribution can only be seen well if each is a separate bin. For such tasks,
offers an alternative to .
A more pervasive issue is the size of the data set. In broad terms, the details of a histogram should be less affected by
quirks of sampling as the number of values increases. Hence with larger , more bins can be justified, although it is difficult
to make this precise in a general manner. Suggested rules of thumb for the number of bins include those discussed by Emerson and Hoaglin (1983) and in [R] graph histogram. With a rule of 2 sqrt(n), more than 50 bins would be needed for n > 625; with a 10 log10 n rule, for n > 100,000; and with Sturges' rule 1 + log2 n, for n > 2^49. Another rule, suggested half-seriously by the current first author, has a threshold of 4,630. The compromise rule discussed in [R] graph histogram, min(sqrt(n), 10 log10 n), leads to a threshold of 100,000. Without taking these rules more literally than they deserve, we note simply that some, but not all, imply more than 50 bins for data set sizes that are frequently encountered.
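Those rules of thumb are easy to tabulate. The sketch below is in Python, and the exact functional forms are my reading of the rules named above (an assumption, not a quotation of the cited sources); each function is checked at its 50-bin threshold.

```python
# Bin-count rules of thumb, as I read them (an assumption, not a
# quotation of the cited sources), checked at their 50-bin thresholds.
import math

def sqrt2_rule(n):      # 2 * sqrt(n): 50 bins at n = 625
    return 2 * math.sqrt(n)

def log_rule(n):        # 10 * log10(n): 50 bins at n = 100,000
    return 10 * math.log10(n)

def sturges(n):         # 1 + log2(n): 50 bins at n = 2^49
    return 1 + math.log2(n)

def compromise(n):      # min of the sqrt and log rules, as in [R] graph histogram
    return min(math.sqrt(n), 10 * math.log10(n))

print(sqrt2_rule(625))        # 50.0
print(log_rule(100_000))      # 50.0
print(sturges(2 ** 49))       # 50.0
print(compromise(100_000))    # 50.0
```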
Note that in Stata 5.0, it is possible, at the cost of some programming, to create your own alternative histograms with .
These could have more than 50 bins and they could even have unequal bin widths.
Rootograms
allows the production of the “rootograms” suggested by J. W. Tukey circa 1965. The idea is to show not
frequencies, but their square roots. Frequencies, as counted variables, tend to have variability that is stabilized by a root
transformation, at least approximately. Note also that the square root of a normal or Gaussian density is a multiple of another
normal or Gaussian density. Hence if the normal is the reference distribution, we are looking for the same shape on a rootogram,
and experience in assessing histograms for approximate normality can be applied directly in assessing rootograms. However,
taking the root is only the first step in Tukey’s procedure, and we do not implement his hanging or suspended rootograms. See
Tukey (1965, 1972, 1977), Tukey and Wilk (1965), or Velleman and Hoaglin (1981).
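The root transformation itself is trivial to compute, and the claim that the square root of a normal density is a multiple of another normal density can be checked numerically. A Python sketch with hypothetical frequencies:

```python
# Rootogram heights (Python sketch, hypothetical counts): spike heights
# are square roots of the bin frequencies.  Also checked: sqrt of a
# N(0, s^2) density is a constant multiple of a N(0, 2*s^2) density.
import math

frequencies = [1, 4, 9, 25, 9, 4, 1]
roots = [math.sqrt(f) for f in frequencies]
print(roots)  # [1.0, 2.0, 3.0, 5.0, 3.0, 2.0, 1.0]

def npdf(x, s2):
    """Normal density with mean 0 and variance s2."""
    return math.exp(-x * x / (2 * s2)) / math.sqrt(2 * math.pi * s2)

# The ratio below is the same for every x: sqrt of the N(0,1) density
# is proportional to the N(0,2) density.
ratios = [math.sqrt(npdf(x, 1.0)) / npdf(x, 2.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.5)]
print(max(ratios) - min(ratios) < 1e-9)  # True
```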
Consider dotplots
In several ways, spike plots are comparable to the dotplots provided by Stata’s command, which can also be very
useful in showing the fine structure of data distributions. Apart from the obvious difference that dots represent individual values
and spikes represent bin frequencies, a further difference is that dotplots are tuned by indirectly changing the number of bins,
whereas spike plots are tuned by directly changing the bin width. For many problems, we prefer dotplots to spike plots, and
conversely.
Time-series plots
Some kinds of time series lend themselves quite well to spike plots. Daily rainfalls for a year are often plotted as a series
of vertical spikes. On such graphs, wet and dry spells come out well, even for British stations, where according to a U.S. myth
it is usually raining. For other time series, setting to some average level shows clearly periods above and below average.
Examples
In practice, frequency distribution problems for come in two forms. First, the data for a variable are to be
summarized as a frequency distribution. This requires basically
varname
so long as the bin width or class interval is the same as the resolution of the data, or
varname #
if rounding is required. The number specified in place of # is the bin width or class interval.
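The rounding step can be sketched in Python with hypothetical ages (note that Python's built-in round() resolves half-way cases to the nearest even value, whereas Stata's round() rounds them away from zero; the example avoids half-way cases):

```python
# Binning by rounding to the nearest multiple of a bin width, in the
# spirit of Stata's round(x, #), then tallying frequencies.  Python
# sketch with hypothetical ages; half-way cases are avoided because
# Python's round() is round-half-to-even, unlike Stata's.
def round_to(x, width):
    return width * round(x / width)

ages = [21, 22, 23, 27, 28, 30, 31, 39, 40, 41]
width = 5
bins = {}
for a in ages:
    b = round_to(a, width)
    bins[b] = bins.get(b, 0) + 1
print(dict(sorted(bins.items())))  # {20: 2, 25: 2, 30: 3, 40: 3}
```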
Second, the data for two variables show values and frequencies. This form requires the use of weights. For example,
Mosteller, Fienberg, and Rourke (1983, 83) give data on the age structure of the population of Ghana reported in the 1960 census, rounded down to the nearest thousand, and taken from the U.N. Demographic Yearbook for 1962 (p. 189). These data can be found in the file . For ages from 0 (under 1 year) to 90 years, the frequencies are the numbers at each age. Note that 15,000 people were 91 and over or of unknown age. Ninety-one bins are too many for a conventional histogram, which is why we are using the spike plot.
The command is then
[Graph: spike plot of Population in 000 (0 to 300) against Age in years (0 to 80).]
Figure 1. Spike plot of age.
The resulting graph shown in Figure 1 shows heaping of ages, which is a well-known phenomenon in demography. It
is easy to pick out preferences for ages that are multiples of 10 and of 5. In addition, ages ending with 2, 4, 6, and 8 are
generally more frequent and ages ending with 1, 3, 7, and 9 generally less frequent than would be expected if there were a
relatively smooth underlying distribution. Further comment would require some anthropological or sociological knowledge on
the significance of various numbers among this population: is 30, for example, a particularly important age to achieve for some
or all groups in Ghana? Another example of age heaping for the female population of Mexico is given by Mosteller and Tukey
(1977, 476– 477). Apart from its occurrence in demography, which appears to have been recognized since at least the 18th
century (Westergaard 1932, 77), preference for certain digits in reported numbers has been identified in several sciences (Cox
1991). Spike plots showing the frequencies of all possible results are a natural tool in recognition of such biases.
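A quick numerical check for such digit preference is to tally the terminal digits of the reported values. A Python sketch with hypothetical ages:

```python
# Tallying terminal digits of reported ages, a quick check for the
# digit preference (heaping) described above.  Python sketch with
# hypothetical ages, not the Ghana census data.
ages = [20, 25, 30, 30, 32, 35, 40, 40, 40, 45, 48, 50, 50, 55, 60, 33]

last = {}
for a in ages:
    d = a % 10
    last[d] = last.get(d, 0) + 1
print(dict(sorted(last.items())))  # {0: 9, 2: 1, 3: 1, 5: 4, 8: 1}

# Share of ages ending in 0 or 5: well above the 0.2 expected if all
# terminal digits were equally likely.
share = (last.get(0, 0) + last.get(5, 0)) / len(ages)
print(share)  # 0.8125
```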
Time-series plots that are spike plots also require the use of weights. The usual situation is that each time occurs just once
in the data. Reference levels other than 0 are often useful, such as a mean or median.
To illustrate using for time series, we consider the yearly temperature means for the world and each hemisphere
given by Parker and Jones (1991), expressed as deviations from the 1951–80 mean. The data are given in . A spike plot representation of the Southern hemisphere series is given in Figure 2 and is obtained by
[Graph: spike plot of temperature deviation from 1951–80 mean, deg C (-.4 to .4), against year (1850 to 2000).]
Figure 2. Spike plot of temperature showing deviations from 1951–80 mean.
The pattern of recent warming stands out quite well. Note that we must override the default of “Frequency”, which
makes no sense in this example.
Note that when printing the result of using , users may want to draw the spikes with pen thicknesses larger than
the default.
References

Berry, D. A. 1996. Statistics: A Bayesian Perspective. Belmont, CA: Duxbury Press.
Cox, N. J. 1991. Human factors. Nature 353: 597.
Emerson, J. D. and D. C. Hoaglin. 1983. Stem-and-leaf displays. In Understanding Robust and Exploratory Data Analysis, ed. D. C. Hoaglin, F. Mosteller, and J. W. Tukey, 7–32. New York: John Wiley.
Evans, M., N. Hastings, and B. Peacock. 1993. Statistical Distributions. New York: John Wiley.
Mosteller, F., S. E. Fienberg, and R. E. K. Rourke. 1983. Beginning Statistics with Data Analysis. Reading, MA: Addison–Wesley.
Mosteller, F. and J. W. Tukey. 1977. Data Analysis and Regression. Reading, MA: Addison–Wesley.
Parker, D. E. and P. D. Jones. 1991. Global warmth in 1990. Weather 46: 302–311.
Plackett, R. L. 1971. An Introduction to the Theory of Statistics. Edinburgh: Oliver & Boyd.
Tukey, J. W. 1965. The future of processes of data analysis. Reprinted in The Collected Works of John W. Tukey, Volume IV: Philosophy and Principles of Data Analysis: 1965–1986, ed. L. V. Jones, 517–547 (1986). Monterey, CA: Wadsworth & Brooks/Cole.
Tukey, J. W. 1972. Some graphic and semigraphic displays. In Statistical Papers in Honor of George W. Snedecor, ed. T. A. Bancroft and S. A. Brown, 293–316. Ames, IA: Iowa State University Press.
Tukey, J. W. 1977. Exploratory Data Analysis. Reading, MA: Addison–Wesley, Ch. 17.
Tukey, J. W. and M. B. Wilk. 1965. Data analysis and statistics: Principles and practice. Reprinted in The Collected Works of John W. Tukey, Volume V: Graphics: 1965–1985, ed. W. S. Cleveland, 23–29 (1988). Pacific Grove, CA: Wadsworth & Brooks/Cole.
United Nations. 1962. Demographic Yearbook 1962. New York: United Nations.
Velleman, P. F. and D. C. Hoaglin. 1981. Applications, Basics, and Computing of Exploratory Data Analysis. Boston, MA: Duxbury Press, Ch. 9.
Westergaard, H. 1932. Contributions to the History of Statistics. London: P. S. King.
ip16 Using dialog boxes to vary program parameters
H. Joseph Newton, Texas A&M University, FAX (409) 845-3144, jnewton@stat.tamu.edu
One of the most exciting new features of Stata 5.0 for Windows is the ability to use dialog boxes to provide a graphical
user interface to Stata commands and programs. While dialog boxes can be used in a variety of ways, in this insert we describe
how they can be used to rapidly vary the value of a parameter such as a smoothing parameter or a bandwidth in a kernel density
estimator and have for each value of the parameter a new graph appear in the graphics window. This use of dialog boxes makes
it possible to emulate using a “slider” in a language such as .
To illustrate what we mean, consider Figure 1 where we are using a program we have written called (“power
transform dialog box”) to try to find a suitable power transform for a time series of monthly sales data (discussed by Chatfield
and Prothero (1973) and Newton (1988, 233), and included with this insert as ). If we denote the data by x(t) (with n = 77 for the sales data), we seek a value of the exponent lambda giving us a transformed data set y(t) = x(t)^lambda with constant variance across time, taking y(t) = log x(t) when lambda = 0 by the usual convention.
Using such a power transform for stabilizing variance is a standard initial step in many methods of time series analysis; see Box
and Jenkins (1970), for example.
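The transform, together with the [0, 1] standardization used throughout this insert, can be sketched in Python; taking logs at lambda = 0 is my assumption of the usual convention, since the insert's displayed formula did not survive extraction.

```python
# Power transform plus [0, 1] standardization, sketched in Python.
# Taking log at lambda = 0 is my assumption of the usual convention;
# the series x is hypothetical, not the sales data.
import math

def power_transform(x, lam):
    return [math.log(v) if lam == 0 else v ** lam for v in x]

def standardize(y):
    """Map a series onto [0, 1] by subtracting the min, dividing by the range."""
    lo, hi = min(y), max(y)
    return [(v - lo) / (hi - lo) for v in y]

x = [10.0, 40.0, 90.0, 160.0]
for lam in (1.0, 0.5, 0.0):
    print(lam, [round(v, 3) for v in standardize(power_transform(x, lam))])
```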
In Figure 1, we see a dialog box labeled "Time series power transform" as well as a Stata Graph window with the plot of the original sales data (except that it has been standardized to lie in the interval [0, 1] by subtracting its minimum value and dividing by its range, a practice we follow throughout this insert so that series for different values of lambda are comparable).
We have resized the Stata window and the Graph window so that both the dialog box and Graph window are visible. We
have also made the foreground color black and the background color white in the graphics window so that Figure 1 would show
up well when printed.
Figure 1. Dialog box for time series power transform (lambda = 1).
The dialog box has list boxes for the user to choose the data and time variables (the user is expected to input the data prior
to using it). The user can then try a value of lambda by choosing a value in the list box labeled "lambda" (for simplicity the choices range from zero to one in steps of 0.05) and then clicking on the button labeled "lambda" (at startup the program uses the value one, so that the standardized original data are plotted).
Alternatively, one can rapidly increase or decrease the value of λ by clicking on the buttons labeled “lambda +” and
“lambda −”, respectively. Each time such a click is done, the graphics window displays the new standardized transformed data
set. This allows a user to almost animate the graphs of the successive power transforms (unfortunately one must actually click
to get each graph, thus slowing down the animation).
Note that the standardization is important here as it keeps the vertical axis from changing, which would disturb the visual
impression of smoothly changing the value of λ. Clicking on the button labeled “exit” ends the program, and control of Stata is
returned to the command line.
In Figure 2, we display the result of using λ = 0.25, that is, the fourth root transform recommended by Chatfield and
Prothero for these data. Notice that the value of λ is inserted into the caption above each graph, as is a number called “RMSE”.
This is the root mean square error in regressing the standardized transformed data on the sum of a linear trend and a cosine plus
sine of period 12.
For the sales data, the value of λ minimizing this variance is arguably the best value to stabilize variance across time.
Except for this number, can be used for any time series. In our discussion below of how to write a program such as
, we show how this regression can be easily removed from the program.
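The RMSE shown in the graph captions can be sketched as follows (a hedged Python/numpy illustration; the design matrix with an intercept, a linear trend, and one cosine–sine pair of period 12 follows the description above, but the details of the original Stata regression are assumptions):

```python
import numpy as np

def seasonal_trend_rmse(z):
    """Root mean square error from regressing z_t on an intercept,
    a linear trend, and a cosine plus sine of period 12."""
    z = np.asarray(z, dtype=float)
    t = np.arange(1, len(z) + 1, dtype=float)
    X = np.column_stack([np.ones_like(t), t,
                         np.cos(2 * np.pi * t / 12),
                         np.sin(2 * np.pi * t / 12)])
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)   # least-squares fit
    resid = z - X @ beta
    return float(np.sqrt(np.mean(resid ** 2)))
```

A series that is exactly a trend plus a period-12 cycle yields an RMSE of (numerically) zero; the power transform that minimizes this RMSE is the candidate variance-stabilizing value of λ.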
Figure 2. Dialog box for time series power transform (lambda = 0.25).
Writing programs using dialog boxes to vary parameters
In this section, we present an annotated version of the file containing the program. See [R]window
control for complete details on the elements of Stata dialog boxes.
First we define the program and get the current list of variables into the global variable :
Next we use Stata’s command to define the three list boxes in the dialog box:
Now place the four buttons used in the dialog box and define the global macros used to hold the actions to take for each button
(note that the actions for all but “exit” are to call the program defined below):
Finally, issue the command that actually forms the dialog box:
Another program does the actual calculation and graphing. It has one argument, which can be 1 for
“lambda +,” −1 for “lambda −,” and 0 for “lambda.” Since it is called only by the dialog-box program, we need not worry about checking
arguments or any of the usual programming problems.
Now get the data variable and time variable into tempvars and update the value of λ:
Next do the called-for transform and do the standardizing:
This section does the regression and can be easily removed:
Now do the graph, noting that to remove the regression, one need only change the argument of the option in :
References
Box, G. E. P. and G. M. Jenkins. 1970. Time Series Analysis, Forecasting, and Control. San Francisco: Holden-Day.
Chatfield, C. and D. L. Prothero. 1973. Box-Jenkins seasonal forecasting: Problems in a case study. Journal of the Royal Statistical Society, Series A 136: 295–336.
Newton, H. J. 1988. TIMESLAB: A Time Series Analysis Laboratory. Pacific Grove, CA: Wadsworth & Brooks/Cole.
sbe13.2 Correction to age-specific reference intervals (“normal ranges”)
Eileen Wright, Royal Postgraduate Medical School, UK, ewright@rpms.ac.uk
Patrick Royston, Royal Postgraduate Medical School, UK, proyston@rpms.ac.uk
We have discovered that our routine for calculating age-specific reference intervals (Wright and Royston, 1997) has
an error in the estimation of the standard errors of centiles when used in conjunction with modeling the coefficient of
variation. Confidence limits based on this routine are too narrow. An amended version is available with this edition
of the STB; the routine is correct in all other respects.
Reference
Wright, E. and P. Royston. 1997. sbe13: Age-specific reference intervals (“normal ranges”). Stata Technical Bulletin 34: 24–34.
sbe14 Odds ratios and confidence intervals for logistic regression models with effect modification
Joanne M. Garrett, University of North Carolina at Chapel Hill, FAX (919) 966-2274, garrettj@med.unc.edu
This insert describes an easy-to-use program for anyone who dislikes having to hand calculate odds ratios and
confidence intervals for logistic regression models with significant interaction terms, or, even worse, having to explain to students
how to do it. After years of teaching non-statistician medical researchers how to use logistic regression, and watching their eyes
glaze over when I got to the formula for calculating the variance estimate for a linear combination of betas, I decided to make
life easier on all of us and write a program that automates the process.
After writing it, I discovered the lincom command in Stata 5.0, which will also display odds ratios and confidence
intervals (see [R] lincom in the Stata Reference Manual). The two commands produce the same results, but they differ in their
syntax. lincom requires you to specify the appropriate linear combinations of the estimators; this program uses a syntax based on
descriptive terms familiar to anyone who has studied epidemiology.
Which program is better depends entirely on user preference. This program is geared toward a non-mathematically inclined
audience and has the advantage of being specified and displayed using epidemiological terminology, with a summary of the
variables and stratum-specific values used for the odds ratios. However, lincom can be explained in similar terms, and, if
specified appropriately, is simple to use. Following each example in this insert I will include the lincom statement needed to duplicate
the results. However, remember that lincom must follow the estimation of a model.
Background
The general form for the calculation of the odds ratio from the estimates of the logistic regression model is

OR = exp( Σ_i β_i (x_1i − x_0i) )

where x_1 and x_0 represent values for one group and a comparison group, respectively.
In many epidemiologic studies, the focus is on one main study factor (“exposure”) which frequently is coded as 1 for
“exposed” and 0 for “unexposed”. For instance, suppose the exposure variable is coded this way, and let x_1 = 1 (exposed)
and x_0 = 0 (unexposed). Then

OR = exp(β × (1 − 0)) = exp(β)
Our fairly complicated odds ratio formula reduces to a simple exponentiation of the beta coefficient for the “exposure”
variable (which Stata is kind enough to present for us on the output).
The formula for the 95% confidence interval (which Stata also calculates and prints for us) is

95% CI = exp(β ± 1.96 × SE(β))
Had the exposure variable been coded as something other than 1 and 0, we would need to multiply the beta coefficient
by the difference that we want to compare before exponentiating. For instance, suppose age in years was our “exposure.” If we
exponentiate beta (or use the odds ratio calculation we find on the printout), we are looking at the odds ratio for a one year
change in age, e.g., a 39 year-old versus a 38 year-old. It might be more informative to report a 10-year difference in age, say
comparing a 50 year-old to a 40 year-old. Our odds ratio and 95% confidence interval formulas then become

OR = exp(10β) and 95% CI = exp(10β ± 1.96 × 10 × SE(β))

Note that one multiplies the standard error in the confidence interval formula by the same multiple used for beta. A dead giveaway
that someone has forgotten to do so is a tiny confidence interval around a reasonable sized odds ratio.
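The scaling rule just described (multiply both the beta and its standard error by the same difference before exponentiating) can be sketched in Python; the function name and the 1.96 normal quantile for a 95% interval are illustrative choices, not the program's actual code:

```python
import math

def or_ci(beta, se, diff=1.0, z=1.96):
    """Odds ratio and confidence interval for a `diff`-unit change in
    the exposure: both beta and its standard error are scaled by `diff`."""
    point = math.exp(diff * beta)
    lo = math.exp(diff * beta - z * diff * se)
    hi = math.exp(diff * beta + z * diff * se)
    return point, (lo, hi)

# A hypothetical 10-year age difference with beta = 0.1, SE = 0.05:
odds_ratio, (lower, upper) = or_ci(0.1, 0.05, diff=10)
```

Forgetting to scale the standard error in `lo`/`hi` would reproduce exactly the "tiny confidence interval around a reasonable sized odds ratio" symptom described above.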
This is still fairly straightforward. However, things start getting messy when there is significant effect modification, which
is the epidemiologists’ term for interaction between the exposure and another variable— the effect modifier. What we are saying
is the odds ratio changes depending on the value of the effect modifier or effect modifiers. In essence, we are stratifying our odds
ratio by categories of the effect modifiers. The general formula for the odds ratio reduces, but any terms which include interactions
with the exposure variable remain. For example, suppose we are interested in the odds ratio of developing coronary heart disease
(yes = 1; no = 0) for people with hypertension (writing hyp = 1) versus people with normal blood pressure (hyp = 0), and we
find there is significant interaction between hypertension and age, as well as hypertension and sex. The logistic model (written
in the log odds form) might look like this:

log odds(chd) = β_0 + β_1 hyp + β_2 age + β_3 sex + β_4 (hyp × age) + β_5 (hyp × sex)

The formula for the odds ratio with the two interaction terms would be

OR = exp(β_1 + β_4 age + β_5 sex)
Next we would substitute values of age and sex and use the estimated betas to solve the equation to get odds ratios for the
comparison. For example, we might want the odds ratio for 50 year-old males, or for 60 year-old females.
Okay so far, but what about the confidence intervals for these odds ratios? Many journals are requiring confidence intervals
rather than -values when we report odds ratios, and now we need separate confidence intervals for each odds ratio (we may
have several odds ratios representing the categories of our effect modifiers). Not only do we need more confidence intervals,
the variance estimate for the formula no longer is the simple variance of a single beta. It’s now a more complicated formula
for a linear combination of betas. The general form for the 95% confidence interval (with interaction) is

exp( L ± 1.96 × sqrt(Var(L)) )

where L is the linear combination of betas and Var(L) is its variance, computed from the estimated variances and covariances of the betas.
The good news is the terms involving variables other than the exposure and the effect modifiers drop out of this equation.
Additionally, if the exposure is coded as 1 and 0, the equations for L and Var(L) become

L = β_e + Σ_j β_j m_j
Var(L) = Var(β_e) + Σ_j m_j² Var(β_j) + 2 Σ_j m_j Cov(β_e, β_j) + 2 Σ_{j<l} m_j m_l Cov(β_j, β_l)

where β_e = beta for the exposure variable, β_j = betas for the (up to k) interaction terms, and m_j = values for the effect
modifiers.
For the two-interaction-term example (hyp × age and hyp × sex), we would have

Var(L) = Var(β_1) + age² Var(β_4) + sex² Var(β_5) + 2 age Cov(β_1, β_4) + 2 sex Cov(β_1, β_5) + 2 age sex Cov(β_4, β_5)
Now we pick off the appropriate estimates from the variance–covariance matrix, substitute in a value for age and sex, and
solve the equation. All this just to get one of the confidence intervals for one odds ratio. We must repeat the calculation for
other combinations of age and sex. Of course, if we had started with only one interaction term (rather than two), the variance
estimate would reduce quite a bit (to 3 terms). Additionally, had the single effect modifier been a dichotomous 0–1 variable, the
equation would reduce further to the familiar single-beta form for the 0 category. Would you still rather not be bothered? Then the program will
help.
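Picking the estimates off the variance–covariance matrix, as described above, can be sketched with numpy (an illustration only; the coefficient positions and values are hypothetical, not output from the program):

```python
import numpy as np

def lincom_var(cov, coeffs, idx):
    """Variance of a linear combination sum_j c_j * beta_{idx[j]},
    read off the estimated variance-covariance matrix `cov`.
    For the two-interaction example, coeffs = (1, age, sex) applied to
    the betas of the exposure and its two interaction terms."""
    c = np.asarray(coeffs, dtype=float)
    V = np.asarray(cov, dtype=float)[np.ix_(idx, idx)]  # relevant submatrix
    return float(c @ V @ c)
```

With this variance in hand, the stratum-specific 95% confidence interval is exp(L ± 1.96 sqrt(Var(L))), repeated for each combination of effect-modifier values.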
Description
The program calculates user-specified stratum-specific odds ratios and confidence intervals for logistic regression models which
include interaction terms between an exposure and effect modifiers. If the model has been specified previously or is
repeated with new stratum values, the current model estimates are used; otherwise a new model is fit.
Syntax
dvar evar exp cov list interaction #
# #
dvar is the “disease” variable (dichotomous outcome) and should be coded as 1 and 0.
evar is the “exposure” variable and can be nominal, ordinal, interval, or continuous.
Options required
cov list is the list of confounders in the model.
interaction # is the list of interaction variables in the model plus a stratum value for each effect modifier; for more
than one interaction term, the form is interaction # interaction #.
For example, suppose there are two interactions represented by the variables (hypertension by sex, where sex=1
for males and sex=0 for females) and (hypertension by age). To calculate the odds ratio and 95% CI for 60 year-old
males, specify
and for 60 year-old females
and so on.
Options allowed
The difference option gives the difference of values for “exposed” versus “unexposed.” For example, to compare age 50 versus age 40,
the difference would be specified as 10. The default is 1, which is equivalent to exposed = 1 versus unexposed = 0.
displays the logistic regression table.
# specifies the confidence level in percent for the confidence intervals. The default is a 95% confidence interval.
Examples
The set of examples comes from a case–control study (Garrett et al. 1993) of the relationship between
the “exposure,” a rare form of a gene (hras, for those readers familiar with genetics), and the “disease,” incidence of breast
cancer. (I apologize in advance to the author of this study for taking liberties with the data to illustrate some points,
but, since the author is related to me, I hope he won't mind.) The data are stored in the accompanying dataset. Several of the variables with
their descriptions and coding are in the following table:
Variable Definition Coding
Breast cancer diagnosis 1 case
(“outcome”) 0 control
Type of gene 1 rare form of hras
(“exposure”) 0 common form
Race of patient 1 black women
0 white women
Body mass index, a continuous measure of obesity (higher numbers mean heavier)
Past history of breast 1 yes
biopsy 0 no
Has been through 1 post-menopause
menopause 0 pre-menopause
In addition, there are interaction terms between the exposure and race, the exposure and bmi, and bmi and
menopause status.
In the first example, the program calculates the odds ratio and 95% confidence interval for rare hras and breast cancer with one
interaction term between hras and race for black women (race = 1), controlling for the remaining covariates. It
requests that the logistic regression table be printed, and accepts the default difference of 1 since “exposed” (1)
minus “unexposed” (0) equals 1. The model is
With the commands shown, Stata will fit the model and produce the usual table of odds ratios and confidence intervals, and the follow-up
command will produce the stratum-specific results.
Using the program gets the results of both, using a syntax my non-statistician students like:
Following the model results, the program reports a summary table of information (a way to check to make sure the variables
were specified as expected, particularly if the model is not printed), including the linear combination of betas, its variance, and the final odds ratio
and confidence interval for the stratum requested. In this case, the stratum for black women is specified as race = 1. The
odds ratio tells us that black women with the rare form of the hras gene are 4.2 times as likely to develop breast cancer as black
women with the common form of the gene. The 95% confidence interval (1.4 to 12.5) does not include 1.0 (which would mean
no association), thus, the odds ratio is statistically significant.
Now, let’s repeat the example for white women (race = 0):
The corresponding command is
This time the odds ratio tells us that among white women there is no association between having the rare form of hras and
breast cancer—the odds ratio is close to 1.0 and the confidence interval includes 1.0, meaning it is nonsignificant. There was
no need to print the model again since the estimates did not change from the previous example.
For the next examples, let’s suppose that in addition to the exposure-by-race interaction, we think there is interaction
between the exposure and bmi. Again, race is dichotomous, specified as 0 (white women) or 1 (black women), but bmi is
continuous, so we must select some values of bmi for our strata. I have selected 23 (a normal value for body mass index
for women) and 28 (getting rather zaftig, which for non-New Yorkers means “plump”). Four examples with
two interaction terms follow:

race = 1 and bmi = 23
race = 1 and bmi = 28
race = 0 and bmi = 23
race = 0 and bmi = 28
For the first example, we have
We see from the model that both interactions are statistically significant, or nearly so. But are the individual stratum-specific
odds ratios significant? Black women with a normal value (23) for body mass index are 8.1 times as likely to develop breast
cancer if they have the rare form of hras. The confidence interval (2.18 to 30.37) does not include 1.0, thus the odds ratio is
significant.
For race = 1 and bmi = 28:
Black women with a high value (28) for body mass index are 5.1 times as likely to develop breast cancer if they have the rare
form of hras. The odds ratio is significant.
For race = 0 and bmi = 23:
White women with a normal value (23) for bmi are 1.4 times as likely to develop breast cancer if they have rare hras, but the
odds ratio is not significant.
For race = 0 and bmi = 28:
White women with a high value (28) for bmi are no more likely to develop breast cancer if they have the rare form of hras.
The equivalent commands for the previous four examples are
Finally, let’s look at some examples where the “exposure” is continuous, rather than a 0–1 dichotomous variable. Suppose
bmi is our exposure variable, i.e., we are interested in the relationship between body mass index and breast cancer. And, we
discover that there is effect modification between bmi and menopause status (post-menopause = 1 and pre-menopause =
0; don’t ask what happened to the group of women experiencing menopause: for sake of illustration, let’s assume
that menopause is an instantaneous event). We’ll keep the model simple, and look only at the outcome of breast cancer with
bmi, menopause status, and the bmi-by-menopause interaction. First, let’s compare odds ratios for a one-unit change in
bmi, stratified by post-menopausal women, and then stratified by pre-menopausal women.
The corresponding command is
Thus, among post-menopausal women, there is a significant increase in breast cancer with increasing bmi (the confidence interval
does not include 1.0). However, since we are looking at a one-unit change in bmi, at first glance we might conclude erroneously
that an odds ratio as small as 1.09 implies body mass index doesn’t have much to do with breast cancer incidence.
Next, we look at
with corresponding command
which shows that among pre-menopausal women, there is no evidence of an increase in breast cancer with increasing bmi (the
confidence interval includes 1.0).
To make our odds ratios look a little more substantive, we can use the difference option to calculate the odds ratio for a
larger change in bmi. For instance, the next two examples repeat the previous two, using a 10 point difference on the bmi scale.
The two commands are
Now for the first of these two examples:
(the output shows the difference in bmi)
This shows us that we can say that post-menopausal women are 2.4 times as likely to develop breast cancer with a 10 point
increase in bmi. Again, we conclude this odds ratio is significant. If the relationship is significant for a one-unit change in bmi,
it will be for a 10-unit change (or 1000-unit change, for that matter— I once had a journal reviewer tell me that since the
confidence interval was so close to 1.0 for a one-unit change of a continuous effect modifier, it was sure to cross 1.0 if I tried
to look at a 10-unit change).
For the second example we have
Although the odds ratio for a 10-unit change in bmi is slightly larger than in the example where we did not use the difference option,
as expected, the confidence interval remains nonsignificant. Among pre-menopausal women, there is no significant relationship
between body mass index and breast cancer.
References
Garrett, P. A., B. S. Hulka, Y. L. Kim, and R. A. Farber. 1993. HRAS protooncogene polymorphism and breast cancer. Cancer Epidemiology, Biomarkers & Prevention 2: 131–138.
Kleinbaum, D. G., L. L. Kupper, and H. Morgenstern. 1982. Epidemiologic Research: Principles and Quantitative Methods. New York: Van Nostrand Reinhold.
sg67 Univariate summaries with boxplots
John R. Gleason, Syracuse University, 73241.717@compuserve.com
Univariate summaries (means, standard deviations, etc.) are perhaps the quantities most often examined during data analysis.
This is partly because these summaries serve so many purposes, for example, to familiarize oneself with the data, to aid in
understanding other computations, or to act as canonical data descriptors in written reports.
Stata’s summarize command provides the components of a univariate summary, but its presentation is not always ideal for
the purpose at hand. For instance, summarize varlist displays the mean, standard deviation, minimum, and maximum of a set of
variables in a left-to-right fashion that allows many such sets of results to be viewed at once. But one might wish to describe each
variable by its five-number summary (minimum, 25th percentile, median, 75th percentile, and maximum). summarize varlist, detail
will compute the required values (along with many others), but presents them in a style that consumes about 15 lines of
output for each variable.
As another example, summarize typically shows the mean and standard deviation in a format that displays 7 significant
digits. This can be desirable, but not when the goal is to extract, visually, or by cut-and-paste, the means and standard deviations
of several variables for inclusion in a written report; then, one might prefer those values to be aligned on their decimal points,
followed by a small, fixed number of decimal places.
Of course, there are several other ways of displaying univariate summaries in Stata. For example, the command
(new in Stata 5.0) provides very general and flexible formatting of tables of summaries, though it is not especially convenient for
interactive data analysis. This insert presents a new command that offers a streamlined display of univariate summaries
including, optionally, text-mode boxplots.
Syntax
varlist weight exp range bylist #
As with summarize, fweights and aweights are allowed.
Before explaining the options, we first demonstrate the default behavior of the command using the data set supplied
by Clayton and Hills (1995):
By contrast, the default response of the new command is
Two differences are apparent: the new command shows the complete five-number summary in horizontal style, and prints results in fixed
rather than general format. The former choice displays the values aligned on their decimal points with a fixed number of decimal places,
the conventional style of presenting numbers in text. The latter choice permits a more compact and intuitive presentation of
five-number summaries than the detail option of summarize:
Options
The boxplot option draws a text-mode boxplot for each varlist variable. Stata can, of course, draw boxplots with its graphics commands,
but a somewhat coarser boxplot built of text characters can still be helpful, especially if displayed beside the numerical
values being portrayed.
bylist requests summaries at each unique set of values of the variables in bylist, which is analogous to attaching the
bylist prefix to the command.
#controls the number of decimal places used to display the summary values; for example, switches to
format. By default, uses format to display all values except the number of observations .
chooses between and output formats; for example, supplying the options and chooses the format
for the summary values, a style similar to the default output of .
requests listwise deletion of missing values (an observation is ignored if any of the varlist variables is missing); the
default is to use all available observations for each variable (variable-wise deletion).
requests that the standard error of the mean ( ) be printed in place of the sample standard deviation ( ).
Examples
To illustrate, we continue looking at the same data:
(output for four bygroups omitted)
The above command produces a summary for each of the six combinations of and , and draws a boxplot for each
five-number summary calculated. The boxplots map the range of each variable onto a fixed width in the output; the median is
drawn with the character “ ”, the remainder of the box with “ ”, and the whiskers with ”. also draws a glyph at the
top of each summary table to serve as a reminder of this representation.
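The idea of mapping a five-number summary onto a fixed-width line of text can be sketched as follows (a Python illustration; the glyph characters and width are assumptions, not the command's actual defaults):

```python
def text_boxplot(five_num, lo, hi, width=40, chars="|=:"):
    """Render (min, p25, median, p75, max) as a text boxplot:
    median `|`, rest of the box `=`, whiskers `:`; lo/hi give
    the plotting range the summary is mapped onto."""
    med_c, box_c, whisk_c = chars
    def col(v):
        return min(width - 1, int((v - lo) / (hi - lo) * (width - 1)))
    mn, q1, md, q3, mx = (col(v) for v in five_num)
    line = [" "] * width
    for i in range(mn, mx + 1):   # whiskers span min..max
        line[i] = whisk_c
    for i in range(q1, q3 + 1):   # box overwrites p25..p75
        line[i] = box_c
    line[md] = med_c              # median marker drawn last
    return "".join(line)
```

Drawing the whiskers first and overwriting with the box and median is what keeps each character position showing the most specific feature at that point.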
Adding two more options then gives this result:
(output for four bygroups omitted)
Remarks
1. The option differs from the prefix available to in that the data need not be sorted by the bylist
variables; sorts the data as necessary and then restores the original ordering before exiting. In addition, the
option may be combined with an clause, whereas the prefix may not.
2. However, uses the prefix to implement the option. Hence, the bylist can have at most ten variables,
which may be of either numeric or string type, and may include missing values or null strings.
3. runs on Stata Version 4.0, but Version 5.0 or newer is required to display value labels for bylist variables.
4. ’s output is 79 characters wide, the same width used by Stata’s files.
5. The characters “ ”, “ ”, and “ ” will produce reasonable boxplots in most fixed pitch fonts. However, these characters are
set by local macros at the top of the file , and are easily redefined, if desired. A similar comment applies to
the color used to draw the boxplots.
Acknowledgment
This project was supported by a grant R01-MH54929 from the National Institute on Mental Health to Michael P. Carey.
Reference
Clayton, D. and M. Hills. 1995. ssa7: Analysis of follow-up studies. Stata Technical Bulletin 27: 19–26. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 219–227.
sg68 Goodness-of-fit statistics for multinomial distributions
Jeroen Weesie, Utrecht University, Netherlands, weesie@weesie.fsw.ruu.nl
This insert describes a command that computes goodness-of-fit statistics for multinomially distributed observations (O)
and expected values (E). The expected values may be derived from an estimated model for the multinomial probability
distribution, e.g., a loglinear model, or be strictly theoretical. The gof-statistics are the members of the 1-parameter Cressie–Read
family (1984, 1988) of discrepancy measures,

CR(λ) = (2 / (λ(λ + 1))) Σ_i O_i [ (O_i / E_i)^λ − 1 ]

where the summation is over all “cells,” i.e., the “response categories” of the multinomial distribution.
The Cressie–Read family contains many well known goodness-of-fit statistics as special cases. For instance, Pearson’s X²,

X² = Σ_i (O_i − E_i)² / E_i

belongs to the Cressie–Read family with λ = 1. The deviance or likelihood-ratio statistic LR,

LR = 2 Σ_i O_i log(O_i / E_i)

is embedded in the Cressie–Read family as the (continuous) limiting value in λ → 0. Other measures such as Freeman–Tukey’s
statistic (λ = −1/2), the Kullback–Leibler information distance (entropy) (λ = −1), and Neyman’s modified X² (λ = −2; note
that in modern terminology we would refer to Neyman’s statistic as a Wald statistic) are similarly special cases of the general form.
Finally, Cressie and Read’s recommended statistic (λ = 2/3) is obviously a member of the family.
If the expected values (E) are true or at least efficiently estimated, all members of the family are asymptotically
(central) chi-squared distributed. Under standard regularity conditions, the degrees of freedom can be expressed as “the
number of cells − 1” for theoretical expected values and “the number of cells − 1 − the number of imposed restrictions” for
estimated expected values. Thus, all statistics are first-order efficient. Based on, among other things, higher-order asymptotic developments
and Monte Carlo experimentation, Cressie and Read (1988) recommend the application of a nonstandard statistic, λ = 2/3.
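The family and its special cases can be computed directly (a Python sketch under the definitions above, not the insert's Stata code; the λ = 0 branch uses the likelihood-ratio limit, and λ = −1 would similarly need its own limiting form, which this sketch omits):

```python
import math

def cressie_read(obs, expected, lam):
    """Cressie-Read discrepancy CR(lambda) = 2/(lam*(lam+1)) *
    sum O_i * ((O_i/E_i)**lam - 1).  lam = 1 gives Pearson's X2;
    lam -> 0 gives the deviance (likelihood-ratio statistic)."""
    if lam == 0:  # continuous limit: LR = 2 * sum O * log(O/E)
        return 2.0 * sum(o * math.log(o / e)
                         for o, e in zip(obs, expected) if o > 0)
    return (2.0 / (lam * (lam + 1.0))) * sum(
        o * ((o / e) ** lam - 1.0) for o, e in zip(obs, expected))
```

Evaluating this over a grid of λ values is exactly what the statistic-by-lambda plot described below displays.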
Syntax
The syntax of is
obs exp exp range numeric list
varlist # # filename
Note: The option requires the program which must be installed from the ip14 directory of the STB 35 disk
(January 1997).
Options to select lambda
specify the members of the Cressie–Read family to be displayed ( and are synonymous). More
than one of these options may be specified. Formally,
description                              lambda
Pearson’s X²                             1
C & R’s recommended statistic            2/3
log-likelihood ratio (deviance) or g2    0 (limit)
Freeman–Tukey’s statistic                −1/2
Kullback–Leibler information             −1
Neyman’s modified X²                     −2
numeric lists specifies a range of powers for the Cressie–Read statistics. See on-line help for (Weesie 1997)
for the definition of numeric lists. Note that must be installed separately for this option to work.
Specifying none of these options implies all of , or, stated differently, it is the same as specifying
.
Other options
varlist specifies a list of variables on which to aggregate and (join “cells” with the same values on varlist) before
computing goodness-of-fit statistics.
#specifies the degrees of freedom used in the computations of chi-squared-based approximate significance levels.
specifies that the table with statistic values is displayed. This option is effective only in combination with .
#specifies that the variables and are expressed as proportions with the total number of observations equal to #.
specifies that a statistic-by-lambda plot is displayed. If specified, horizontal lines at the 90%, 95%, and 99% critical
values of the (central) chi-squared distribution are shown.
specifies that the expected values may be scaled so that they sum to the number of observations. Otherwise and
should have equal sums (within a .001 multiplicative margin).
filename specifies the name of a file to save the statistic-by-lambda plot.
Examples
We have estimated a model with 8 parameters, including the constant, on the data in the accompanying file .
The data are assumed to follow a multinomial distribution. In this case, we estimated a loglinear model in GLM. The data and
estimated expected counts and a variable, to be used below, are
(output omitted)
The deviance (likelihood ratio statistic against the saturated model) can be obtained as
Note that the output of contains the proportions of cells with low observed and expected counts.
To obtain the default list of statistics, we run
Note that the values of the statistics vary. Using Neyman’s we would reject the model with expected values at any
significance level below 1%. With the other statistics, we would not reject the model. These conflicting conclusions are somewhat
disturbing. We are concerned that some of the conclusions that we want to draw are not very robust with respect to (1) auxiliary
assumptions, such as the link-function in a GLM model, or (2) arbitrary decisions, such as the selection of test-statistic in a class
with similar asymptotic properties with little known about small-sample properties. Of course, the program can be abused
to shop around for a statistic that “proves” whatever we want to do. It is clear that such an application of has nothing to
do with good statistics or good science.
It is possible to collapse cells on a variable before computing the goodness-of-fit statistics with the option. A relatively
high proportion of cells with low expected counts is often a reason to collapse cells. Note that you have to manually modify the
appropriate degrees of freedom.
Finally, it is often quite convenient to inspect the Cressie–Read family in a statistic-by-lambda plot. This plot, containing
the critical significance levels at 90%, 95% and 99%, is obtained via the option
[statistic-by-lambda plot omitted: gof on the vertical axis (40 to 60), lambda on the horizontal axis (−2 to 2)]
Figure 1. Cressie–Read goodness-of-fit statistics (40 cells; 32 df; horizontal lines at the .90, .95, and .99 critical values).
Acknowledgment
The command is an extensive update of an early version in the ETS Kit, a collection of Stata commands written for
Stata 2.1 by the late Albert Verbeek, Professor of Statistics at the Department of Social Sciences at Utrecht University, and
myself.
References
Cressie, N. A. C. and T. R. C. Read. 1984. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society, Series B 46: 440–464.
Read, T. R. C. and N. A. C. Cressie. 1988. Goodness-of-Fit Statistics for Discrete Multivariate Data. New York: Springer Verlag.
Weesie, J. 1997. ip14: Programming utility: Numeric lists. Stata Technical Bulletin 35: 14–16.
sg69 Immediate Mann–Whitney and binomial effect-size display
Richard Goldstein, Qualitas, Inc., Brighton, Mass., richgold@netcom.com
A perennial complaint of users of statistics is that the results are not “meaningful” in the real world. The two programs
presented here are the first of what will eventually be several inserts presenting translations from statistical tests to what may
be more meaningful indices.
The first index presented, the Mann– Whitney score, has previously been presented in (Goldstein 1994). The
presentation there was a simple reporting of part of what makes up the test. Here, we expand the use of the score to many other
tests, including pre-post studies, comparisons of proportions, matched-pairs tests, and independent groups tests. This statistic
is only presented as an immediate statistic, as it is meant to be used directly after some statistical test or procedure.
The interpretation of this score is that it shows the proportion of pairs in which cases of one type (e.g., one group such as
the experimental versus the control group, males versus females, etc.) have a higher value for the dependent variable than
do cases of the other type. Pairs are simply defined as the number of people in group 1 times the number of people in group
2; for example, if there are 15 people in the experimental group and 19 in the control group, there are 285 15 19 pairs.
Then, a value of 0.27 means that in 27% of those 285 pairs, the member of the experimental group has more of something (e.g.,
dollars, pain, days in hospital, etc.) than does the member of the control group. It also means, for a randomly chosen pair, the
probability is 0.27 that the person from the experimental group will have a higher value on the dependent variable than will the
person from the control group. This value is the same as the area under the ROC curve as shown in .
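The pair-counting interpretation described above can be sketched directly. A minimal Python illustration (the group values are invented for the example; the variable names are ours):

```python
# Proportion of (experimental, control) pairs in which the experimental
# member has the higher value -- the Mann-Whitney score described above.
# The data below are illustrative only.
experimental = [12, 15, 9, 20, 18]
control = [11, 14, 16, 10, 13]

pairs = [(e, c) for e in experimental for c in control]
higher = sum(e > c for e, c in pairs)   # experimental member wins
ties = sum(e == c for e, c in pairs)    # ties conventionally count as half

mw = (higher + 0.5 * ties) / len(pairs)
print(len(pairs))  # 25 pairs = 5 * 5
print(mw)
```

The same quantity is what a rank-based computation (or the area under an ROC curve) produces; the brute-force pair count is shown here only because it matches the verbal definition.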
The second measure presented, the binomial effect-size display, translates the results of any correlation, chi-squared test,
or t test to a simple effect size.
Immediate Mann–Whitney Statistic
gives the Mann–Whitney statistic (“probability that a randomly selected member of one group will have a better result than a
randomly selected member of the other group”) for the following cases.
Options
, pre-post studies, is the total number of subjects in the study while is the number of subjects who improved; thus, if
there were 30 subjects and 22 of them improved, the user would enter .
, comparisons of proportions, is proportion of “treatment” group who improve, while is proportion of
“control” group who improve; that is, enter, for each group, the proportion who had whatever event is of interest (whatever
the event is; e.g., promotion, termination, survival, remission, relapse, etc.); these proportions are easily available from
Stata’s tabulate command by asking for column or row percentages (whichever is appropriate in your setup); for example:
.
, matched pairs tests, is average difference, while is standard deviation of differences; this information is
provided in the standard output from Stata’s command.
, independent groups tests, is difference of means; is variance for one group, while is variance for
other group. Note that the above asks for variances, while the shows standard deviations (the variance is just the
square of the standard deviation).
Note that this statistic is given at the very end of for a nonparametric comparison (via the test).
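For the matched-pairs and independent-groups cases above, the normal-theory translation can be sketched directly: the probability that one normal variate exceeds another is the standard normal CDF evaluated at the standardized mean difference. A minimal Python sketch (the function names and illustrative numbers are ours; the insert's command may differ in details such as tie handling):

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def mw_independent(mean_diff, var1, var2):
    """Independent groups: P(X1 > X2) for two normal groups is
    Phi(mean_diff / sqrt(var1 + var2)) -- note variances, not SDs."""
    return normal_cdf(mean_diff / sqrt(var1 + var2))

def mw_paired(mean_diff, sd_diff):
    """Matched pairs: P(difference > 0) = Phi(mean_diff / sd_diff)."""
    return normal_cdf(mean_diff / sd_diff)

# Illustrative numbers only: mean difference 1.75, variances 6.25 and 9.0
print(mw_independent(1.75, 6.25, 9.0))  # about 0.67
```

With a zero mean difference both functions return 0.5, as they should: a randomly chosen member of either group is equally likely to be the higher one.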
Remarks
In addition to its use in helping to interpret statistical results, this statistic can also be used, with caution, to “adjust”
the results from studies that are of lower quality than desired. For example, Colditz et al. (1989) suggest that studies that use
sequential assignment, rather than random assignment, should have their Mann–Whitney statistics reduced by 0.15 and that
non-double-blind randomized studies should be decreased by 0.11. Another example (Colditz et al. 1988a) is that when there is
a standard therapy but the control group is a placebo group, the Mann–Whitney statistic should be decreased by 0.10!
In other words, studies using sequential assignment have been shown to be biased (as compared with randomized studies)
and the amount of bias is about 15%; that is, non-randomized studies tend to overestimate the proportion of pairs in which one
group does better by about 15%, compared with randomized studies. Similarly, use of a placebo arm in a study, when there is a
standard therapy, tends to result in a bias of about 10%. Note that these percentages are based on an examination of a number
of studies in certain medical areas from the 1980s; the results might differ in other areas (e.g., schizophrenia) or at other times.
The formulas used are from Colditz et al. (1988b); in that article they also present a formula for obtaining a combined
score across several measures, weighting each by the inverse of their standard deviations. This requires the sample size for each
group.
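The combining rule just described amounts to a weighted average of the per-measure scores. A minimal sketch (the function and argument names are ours, and this omits the sample-size bookkeeping the article's full formula requires):

```python
def combined_mw(scores, sds):
    """Combine several Mann-Whitney scores across measures, weighting
    each by the inverse of its standard deviation, as described above.
    Hypothetical helper; not the command's actual implementation."""
    weights = [1.0 / s for s in sds]
    return sum(w * x for w, x in zip(weights, scores)) / sum(weights)

# Illustrative only: three outcome measures with different precisions;
# the most precise measure (smallest SD) dominates the combined score.
print(combined_mw([0.70, 0.62, 0.55], [0.05, 0.10, 0.20]))
```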
Examples of the use of the Mann–Whitney test statistic
The first example is from the Stata Manual (and is the same as the second example used for BESD, below). The only
difference between the first two examples is the sign of the statistic, with the results summing to 1.0.
We would interpret this to mean that in only 1% of the with-and-without treatment pairs would the without treatment car have
greater mileage. Since there are only two groups, this implies that in about 99% of the pairs, the car with treatment would have
greater mileage; we can check on this by entering the same information as above, except that the negative sign on the statistic
is left off:
The next example is also from the Stata Manual.
This final example shows the same information, but now using the separate variances. The Mann–Whitney score drops
from about 99% to about 92%, showing some effect of our assumption of equal variances.
Binomial effect-size display
test value # #
If there is no option then the first argument should be the correlation and nothing else is needed.
Many people use statistics that appear to have obvious meanings, such as the correlation coefficient, or an effect size (e.g.,
the difference in means divided by the common standard deviation). However, it is not always obvious how “important” the
value of these is in the real world.
“The BESD displays the change in success rate (e.g., survival rate, improvement rate, etc.) attributable to a new treatment
procedure. For example, an r of .32 ... is said to account for “only 10% of the variance”; however, the BESD shows that this
proportion of variance accounted for is equivalent to increasing the success rate from 34% to 66%, which would mean, for
example, reducing an illness rate or a death rate from 66% to 34%.” (Rosenthal and Rubin 1982, 166)
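The display quoted above is computed directly from the correlation: the two “success rates” are 0.5 − r/2 and 0.5 + r/2 (Rosenthal and Rubin 1982). A minimal sketch in Python (the function name is ours, not the insert's command):

```python
def besd(r):
    """Binomial effect-size display for a correlation r: the implied
    success rates of the two groups are 0.5 - r/2 and 0.5 + r/2."""
    return 0.5 - r / 2, 0.5 + r / 2

# Rosenthal and Rubin's example: r = .32 ("only 10% of the variance")
control, treated = besd(0.32)
print(control, treated)  # roughly 0.34 and 0.66
```

Note that the two rates always sum to 1, so the display is best read as the difference in success rates, which equals r itself.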
Examples of the use of the BESD
The first example is from Rosenthal and Rubin (1982, 167):
Thus, a correlation of 0.32 is equivalent to increasing the success rate from 34% to 66%, certainly an important accomplishment.
The next example is from the Stata Manual’s entry on [R]ttest.
In this example, we see that a relatively small t statistic implies a large real-world difference in success rates, one that would
make most of us ecstatic.
References
Colditz, G. A., J. N. Miller, and F. Mosteller. 1988a. The effect of study design on gain in evaluations of new treatments in medicine and surgery. Drug Information Journal 22: 343–352.
Colditz, G. A., J. N. Miller, and F. Mosteller. 1988b. Measuring gain in the evaluation of medical technology. International Journal of Technology Assessment 4: 637–642.
Colditz, G. A., J. N. Miller, and F. Mosteller. 1989. How study design affects outcomes in comparisons of therapy, I: Medical. Statistics in Medicine 8: 441–454.
Goldstein, R. 1994. The overlapping coefficient and an “improved” rank-sum statistic. Stata Technical Bulletin 22: 12–15. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 132–136.
Rosenthal, R. and D. B. Rubin. 1982. A simple, general purpose display of the magnitude of experimental effect. Journal of Educational Psychology 74: 166–169.
STB categories and insert codes
Inserts in the STB are presently categorized as follows:

General Categories:
an   announcements
cc   communications & letters
dm   data management
dt   datasets
gr   graphics
in   instruction
ip   instruction on programming
os   operating system, hardware, & interprogram communication
qs   questions and suggestions
tt   teaching
zz   not elsewhere classified

Statistical Categories:
sbe  biostatistics & epidemiology
sed  exploratory data analysis
sg   general statistics
smv  multivariate analysis
snp  nonparametric methods
sqc  quality control
sqv  analysis of qualitative variables
srd  robust methods & statistical diagnostics
ssa  survival analysis
ssi  simulation & random numbers
sss  social science & psychometrics
sts  time-series, econometrics
svy  survey sampling
sxd  experimental design
szz  not elsewhere classified

In addition, we have granted one other prefix, stata, to the manufacturers of Stata for their exclusive use.
International Stata Distributors
International Stata users may also order subscriptions to the Stata Technical Bulletin from our International Stata Distributors.

Company: Applied Statistics & Systems Consultants
Address: P.O. Box 1169, Nazerath-Ellit 17100, Israel
Phone: +972 66554254
Fax: +972 66554254
Email: sasconsl@actcom.co.il
Countries served: Israel

Company: Dittrich & Partner Consulting
Address: Prinzenstrasse 2, D-42697 Solingen, Germany
Phone: +49 212-3390 99
Fax: +49 212-3390 90
Email: evhall@dpc.net
Countries served: Austria, Germany, Italy

Company: Metrika Consulting
Address: Roslagsgatan 15, 113 55 Stockholm, Sweden
Phone: +46-708-163128
Fax: +46-8-6122383
Email: hedstrom@metrika.se
Countries served: Baltic States, Denmark, Finland, Iceland, Norway, Sweden

Company: Ritme Informatique
Address: 34 boulevard Haussmann, 75009 Paris, France
Phone: +33 1 42 46 00 42
Fax: +33 1 42 46 00 33
Email: ritme.inf@applelink.apple.com
Countries served: Belgium, France, Luxembourg, Switzerland

Company: Smit Consult
Address: Doormanstraat 19, Postbox 220, 5150 AE Drunen, Netherlands
Phone: +31 416-378 125
Fax: +31 416-378 385
Email: j.a.c.m.smit@smitcon.nl
Countries served: Netherlands

Company: Timberlake Consultants
Address: 47 Hartfield Crescent, West Wickham, Kent BR4 9DW, U.K.
Phone: +44 181 462 0495
Fax: +44 181 462 0493
Email: timberlake@compuserve.com
Countries served: Ireland, U.K.

Company: Timberlake Consultants, Satellite Office
Address: Praceta do Comércio, N 13–9 Dto., Quinta Grande, 2720 Alfragide, Portugal
Phone: +351 (01) 4719337
Telemóvel: 0931 62 7255
Email: timberlake.co@mail.telepac.pt
Countries served: Portugal