Chapter 22
Meta-Analysis
Frank L. Schmidt
Abstract
The small sample studies typical of psychological research produce seemingly
contradictory results, and reliance on statistical significance tests causes study
results to appear even more conflicting. Meta-analysis integrates the findings
across such studies to reveal the simpler patterns of relations that underlie
research literatures, thus providing a basis for theory development. Meta-
analysis can correct for the distorting effects of sampling error, measurement
error, and other artifacts that produce the illusion of conflicting findings. This
chapter discusses these artifacts and the procedures used to correct for them.
Different approaches to meta-analysis are discussed. Applications of meta-
analysis in I/O psychology and other areas are discussed and evidence is
presented that meta-analysis is transforming research in psychology. Meta-
analysis has become almost ubiquitous. One indication of this is that, as of
October 12, 2011, Google listed over 9 million entries for meta-analysis.
Keywords: Meta-Analysis, Research Synthesis, Data Analysis, Cumulative
Knowledge, Psychological Theory
Why We Need Meta-Analysis
The goal in any science is the production of cumulative knowledge.
Ultimately this means the development of theories that explain the phenomena that
are the focus of the scientific area. One example would be theories that explain
how personality traits develop in children and adults over time and how these traits
affect their lives. Another would be theories of what factors cause job and career
satisfaction and what effects job satisfaction in turn has on other aspects of one’s
life. But before theories can be developed, we need to be able to pin down the
relations between variables. For example, what is the relation between peer
socialization and level of extroversion? Or the relation between job satisfaction
and job performance?
Unless we can precisely calibrate such relations among variables, we do not
have the raw materials out of which to construct theories. There is nothing for a
theory to explain. For example, if the relationship between extroversion and
popularity of children varies capriciously across different studies from a strong
positive to strong negative correlation and everything in between, we cannot begin
to construct a theory of how extroversion might affect popularity. The same
applies to the relation between job satisfaction and job performance.
The unfortunate fact is that most research literatures do show conflicting
findings of this sort. Some research studies in psychology find statistically
significant relationships and some do not. In many research literatures, this split is
approximately 50-50 (Cohen, 1962, 1988; Schmidt, Hunter, & Urry, 1976;
Sedlmeier & Gigerenzer, 1989). This has been the traditional situation in most
areas of the behavioral and social sciences. Hence it has been very difficult to
develop understanding, theories, and cumulative knowledge.
The Myth of the Perfect Study
Before meta-analysis, the usual way in which scientists attempted to make
sense of research literatures was by use of the narrative subjective review. But in
many research literatures there were not only conflicting findings, there were also
large numbers of studies. This combination made the standard narrative-subjective
review a nearly impossible task—one far beyond human information processing
capabilities (Hunter & Schmidt, 1990b, pp. 468–469). How does one sit down and
mentally make sense of, say, 210 conflicting studies?
The answer as developed in many narrative reviews was what came to be
called the myth of the perfect study. Reviewers convinced themselves that most
—usually the vast majority—of the studies available were “methodologically
deficient” and should not even be considered in the review. These judgments of
methodological deficiency were often based on idiosyncratic ideas: One reviewer
might regard the Peabody Personality Inventory as “lacking in construct validity”
and throw out all studies that used that instrument. Another might regard use of
that same inventory as a prerequisite for methodological soundness and eliminate
all studies not using this inventory. Thus any given reviewer could eliminate from
consideration all but a few studies and perhaps narrow the number of studies from
210 to, say, seven. Conclusions would then be based on these seven studies.
It has long been the case that the most widely read literature reviews are
those appearing in textbooks. The function of textbooks, especially advanced level
textbooks, is to summarize what is known in a given field. But no textbook can
cite and discuss 210 studies on a single relationship. Often textbook authors would
pick out what they considered to be the one or two “best” studies and then base
textbook conclusions on just those studies, discarding the vast bulk of the
information in the research literature. Hence the myth of the perfect study.
But in fact there are no perfect studies. All studies contain measurement
error in all measures used, as discussed later. Independent of measurement error,
no study’s measures have perfect construct validity. And there are typically other
artifacts that distort study findings. Even if a hypothetical (and it would have to be
hypothetical) study suffered from none of these distortions, it would still contain
sampling error—typically a substantial amount of sampling error, because
sample sizes are rarely very large. Hence no single study or small selected
subgroup of studies can provide an optimal basis for scientific conclusions about
cumulative knowledge. As a result, reliance on “best studies” did not provide a
solution to the problem of conflicting research findings. This procedure did not
even successfully deceive researchers into believing it was a solution—because
different narrative reviewers arrived at different conclusions because they selected
a different subset of “best” studies. Hence the “conflicts in the literature” became
“conflicts in the reviews.”
Some Relevant History
By the mid-1970s the behavioral and social sciences were in serious
trouble. Large numbers of studies had accumulated on many questions that were
important to theory development and/or social policy decisions. Results of
different studies on the same question typically were conflicting. For example, are
workers more productive when they are satisfied with their jobs? The studies did
not agree. Do students learn more when class sizes are smaller? Research findings
were conflicting. Does participative decision making in management increase
productivity? Does job enlargement increase job satisfaction and output? Does
psychotherapy really help people? The studies were in conflict. As a consequence,
the public and government officials were becoming increasingly disillusioned with
the behavioral and social sciences, and it was becoming more and more difficult to
obtain funding for research. In an invited address to the American Psychological
Association in 1970, then Senator Walter Mondale expressed his frustration with
this situation:
What I have not learned is what we should do about these
problems. I had hoped to find research to support or to conclusively
oppose my belief that quality integrated education is the most
promising approach. But I have found very little conclusive
evidence. For every study, statistical or theoretical, that contains a
proposed solution or recommendation, there is always another,
equally well documented, challenging the assumptions or
conclusions of the first. No one seems to agree with anyone else’s
approach. But, more distressing, I must confess, I stand with my
colleagues confused and often disheartened.
Then in 1981, the Director of the Federal Office of Management and
Budget, David Stockman, proposed an 80% reduction in federal funding for
research in the behavioral and social sciences. (This proposal was politically
motivated in part, but the failure of behavioral and social science research to be
cumulative created the vulnerability to political attack.) This proposed cut was a
trial balloon sent up to see how much political opposition it would arouse. Even
when proposed cuts are much smaller than a draconian 80%, constituencies can
usually be counted on to come forward and protest the proposed cuts. This usually
happens, and many behavioral and social scientists expected it to happen. But it
did not. The behavioral and social sciences, it turned out, had no constituency
among the public; the public did not care (see “Cuts Raise New Social Science
Query,” 1981). Finally, out of desperation, the American Psychological
Association took the lead in forming the Consortium of Social Science
Associations to lobby against the proposed cuts. Although this super-association
had some success in getting these cuts reduced (and even, in some areas, getting
increases in research funding in subsequent years), these developments should
make us look carefully at how such a thing could happen.
The sequence of events that led to this state of affairs was much the same in
one research area after another. First, there was initial optimism about using social
science research to answer socially important questions. Do government-sponsored
job training programs work? We will do studies to find out. Does Head Start
really help disadvantaged kids? The studies will tell us. Does integration increase
the school achievement of Black children? Research will provide the answer.
Next, several studies on the question are conducted, but the results are conflicting.
There is some disappointment that the question has not been answered, but policy
makers—and people in general—are still optimistic. They, along with the
researchers, conclude that more research is needed to identify the supposed
interactions (moderators) that have caused the conflicting findings. For example,
perhaps whether job training works depends on the age and education of the
trainees. Maybe smaller classes in the schools are beneficial only for children with
lower levels of academic aptitude. It is hypothesized that psychotherapy works for
middle-class but not working-class patients. That is, the conclusion at this point is
that a search for moderator variables is needed.
In the third phase, a large number of research studies are funded and
conducted to test these moderator hypotheses. When they are completed, there is
now a large body of studies, but instead of being resolved, the number of conflicts
increases. The moderator hypotheses from the initial studies are not borne out, and
no one can make sense out of the conflicting findings. Researchers then conclude
that the question that was selected for study in this particular case has turned out to
be hopelessly complex. They then turn to the investigation of another question,
hoping that this time the question will turn out to be more tractable. Research
sponsors, government officials, and the public become disenchanted and cynical.
Research funding agencies cut money for research in this area and in related areas.
After this cycle has been repeated enough times, social and behavioral scientists
themselves become cynical about the value of their own work, and they publish
articles expressing doubts about whether behavioral and social science research is
capable in principle of developing cumulative knowledge and providing general
answers to socially important questions (e.g., Gergen, 1982; Meehl,
1978). In fact, Lee J. Cronbach, a renowned expert in research methods and
psychological measurement, stated explicitly that the conflicts in psychological
literatures indicated to him that cumulative knowledge was not possible in
psychology and the social sciences (Cronbach, 1975).
Clearly, at this point there is a critical need for some means of making sense
of the vast number of accumulated study findings. Starting in the late 1970s new
methods of combining findings across studies on the same subject were developed.
These methods were referred to collectively as meta-analysis, a term coined by
Glass (1976). Applications of meta-analysis to accumulated research literatures
showed that research findings are not nearly as conflicting as had been thought and
that useful and sound general conclusions can in fact be drawn from existing
research. This made it apparent that cumulative theoretical knowledge is possible
in the behavioral and social sciences and that socially important questions can be
answered in reasonably definitive ways. As a result, the gloom and cynicism that
had enveloped many in the behavioral and social sciences lifted.
Meta-Analysis Versus Significance Testing
A key point in understanding the effect that meta-analysis has had is that the
illusion of conflicting findings in research literatures resulted mostly from the
traditional reliance of researchers on statistical significance testing in analyzing
and interpreting data in their individual studies (Cohen, 1994). These statistical
significance tests typically had low power to detect existing relationships. Yet the
prevailing decision rule has been that if the finding was statistically significant,
then a relationship existed; and if it was not statistically significant, then there was
no relationship (Oakes, 1986; Schmidt, 1996). For example, suppose that the
population correlation between a certain familial condition and juvenile
delinquency is .30. That is, the relationship in the population of interest is ρ = .30.
Now suppose that 50 studies are conducted to look for this relationship, and each
has statistical power of .50 to detect this relationship if it exists. (This level of
statistical power is typical of many research literatures.) Then approximately 50%
of the studies (25 studies) would find a statistically significant relationship; the
other 25 studies would report no significant relationship, and this would be
interpreted as indicating that no relationship existed. That is, the researchers in
these 25 studies would most likely incorrectly state that because the observed
relationship did not reach statistical significance, it probably occurred merely by
chance. Thus half the studies report that the familial factor was related to
delinquency and half report that it had no relationship to delinquency—--a
condition of maximal apparent conflicting results in the literature. Of course, the
25 studies that report that there is no relationship are all incorrect. The relationship
exists and is always ρ = .30. Traditionally, however, researchers did not
understand that a statistical power problem such as this was even a possibility,
because they did not understand the concept of statistical power (Oakes, 1986;
Schmidt, 1996). In fact, they believed that their error rate was no more than 5%
because they used an alpha level (significance level) of .05. But the 5% is just the
Type I error rate (the alpha error rate)—the error rate that would exist if the null
hypothesis were true and in fact there was no relationship. They overlooked the
fact that if a relationship did exist, then the error rate would be 1.00 minus the
statistical power (which here is 1.00 - .50 = .50). This is the Type II error rate:
the probability of failing to detect the relationship that exists. If the relationship
does exist, then it is impossible to make a Type I error; that is, when there is a
relationship, it is impossible to falsely conclude that there is a relationship. Only
Type II errors can occur—and the significance test does not control Type II
errors.
Now suppose these 50 studies were analyzed using meta-analysis. Meta-
analysis would first compute the average r across the 50 studies; all rs would be
used in computing this average regardless of whether they were statistically
significant or not. This average should be very close to the correct value of .30,
because sampling errors on either side of .30 would average out. So meta-analysis
would lead to the correct conclusion that the relationship is on the average ρ = .30.
Meta-analysis can also estimate the real variability of the relationship
across studies. To do this, one first computes the variance of the 50 observed rs,
using the ordinary formula for the variance of a set of scores. One next computes
the amount of variance expected solely from sampling error variance, using the
formula for sampling error variance of the correlation coefficient. This sampling
variance is then subtracted from the observed variance of the rs; after this
subtraction, the remaining variance in our example should be approximately zero
if the population correlations are all .30. Thus the conclusion would be that all of
the observed variability of the rs across the 50 studies is due merely to sampling
error and does not reflect any real variability in the true relationship. Thus one
would conclude correctly that the real relationship is always .30—and not merely
.30 on the average.
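The logic of this example is easy to check with a short simulation. The sketch below is not from the chapter; the per-study sample size of 43 and the random seed are assumptions of mine, chosen so that per-study power is roughly .50. It draws 50 small-sample studies from a population in which the true correlation is .30, counts how many reach significance, and then simply averages the observed correlations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rho, n_studies, n = 0.30, 50, 43   # n chosen so per-study power is roughly .50 at alpha = .05

observed_rs, n_significant = [], 0
for _ in range(n_studies):
    # draw one small-sample "study" from a bivariate normal population with correlation rho
    x, y = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n).T
    r, p = stats.pearsonr(x, y)
    observed_rs.append(r)
    n_significant += p < 0.05

print(f"significant at .05: {n_significant} of {n_studies}")   # roughly half
print(f"mean observed r:    {np.mean(observed_rs):.3f}")       # close to .30
```

Because sampling errors above and below .30 cancel in the average, the mean recovers the population value even though any single study, judged by its significance test, appears to give a conflicting answer.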
This simple example illustrates two critical points. First, the traditional
reliance on statistical significance tests in interpreting studies leads to false
conclusions about what the study results mean; in fact, the traditional approach to
data analysis makes it virtually impossible to reach correct conclusions in most
research areas (Hunter, 1997; Hunter & Schmidt, 1990b, 2004; Schmidt & Hunter,
1997; Schmidt, 1996, 2010). Second, meta-analysis leads, by contrast, to the
correct conclusions about the real meaning of research literatures. These principles
are illustrated and explained in more detail in Hunter and Schmidt (1990a,
2004); for a shorter treatment, see Schmidt (1996).
The reader might reasonably ask what statistical methods researchers
should use in analyzing and interpreting the data in their individual studies. If
reliance on statistical significance testing leads to false conclusions, what methods
should researchers use? The answer is point estimates of effect sizes (correlations
and d-values) and confidence intervals. The many advantages of point estimates
and confidence intervals are discussed in Hunter and Schmidt (1990b, 2004),
Hunter (1997), Schmidt (1996, 2010), and Thompson (2002, 2007). Some steps
have been taken to address the problems created by overreliance on significance
testing. The 1999 American Psychological Association (APA) Task Force Report
on significance testing (Wilkinson and the APA Task Force on Statistical Inference,
1999) stated that researchers should report effect size estimates and confidence
intervals (CIs). And both the fifth and sixth editions of the APA Publication
Manual state that it is “almost always” necessary for studies to report effect size
estimates and CIs (American Psychological Association, 2001, 2009). The
reporting standards of the American Educational Research Association (2006) have
an even stronger requirement of this sort. Also, at least 24 research journals now
require the reporting of effect sizes as a matter of editorial policy (Thompson,
2007). However, it is apparent from perusing most psychology research journals
that this reform effort still has a long way to go (Thompson, 2002). In many cases,
little has yet changed.
Our example here has examined only the effects of sampling error variance
and low statistical power. There are other statistical and measurement artifacts that
cause artifactual variation in effect sizes and correlations across studies—for
example, differences between studies in amount of measurement error, range
restriction, and dichotomization of measures. Also, in meta-analysis, mean
correlations and d-values must be corrected for downward bias due to such
artifacts as measurement error and dichotomization of measures. There are also
artifacts such as coding or transcriptional errors in the original data that are
difficult or impossible to correct for. These artifacts and the complexities involved
in correcting for them are discussed later in this chapter and are covered in more
detail in Hunter and Schmidt (1990a, 1990b, 2004) and Schmidt and Hunter
(1996). This section is an overview of why traditional data analysis and
interpretation methods logically lead to erroneous conclusions and why meta-
analysis can solve this problem and provide correct conclusions.
A common reaction to the above critique of traditional reliance on
significance testing goes something like this: “Your explanation is clear but I don’t
understand how so many researchers (and even some methodologists) could have
been so wrong for so long on a matter as important as the correct way to analyze data?
How could psychologists and others have failed to see the pitfalls of significance
testing?” Over the years, a number of methodologists have addressed this question
(Carver, 1978; Cohen, 1994; Guttman, 1985; Meehl, 1967; Oakes, 1986;
Rozeboom, 1960). For one thing, in their statistics classes young researchers have
typically been taught a lot about Type I error and very little about Type II error and
statistical power. Thus they are unaware that the error rate is very large in the
typical study; they tend to believe that the error rate is the alpha level used
(typically .05 or .01). In addition, empirical research suggests that most
researchers believe that the use of significance tests provides them with many
nonexistent benefits in understanding their data. For example, most researchers
believe that a statistically significant finding is a “reliable” finding in the sense that
it will replicate if a new study is conducted (Carver, 1978; Oakes, 1986; Schmidt,
1996). For example, they believe that if a result is significant at the .05 level, then
the probability of replication in subsequent studies (if conducted) is 1.00 - .05
= .95. This belief is completely false. The probability of replication is the
statistical power of the study and is almost invariably much lower than .95
(typically about .50). Most researchers also believe that if a result is
nonsignificant, one can conclude that it is probably just due to chance, another
false belief, as illustrated in our delinquency research example. There are other
widespread but false beliefs about the usefulness of information provided by
significance tests (Carver, 1978; Oakes, 1986). A discussion of these beliefs can be
found in Schmidt (1996).
During the 1980s and accelerating up to the present, the use of meta-
analysis to make sense of research literatures has increased dramatically, as is
apparent from reading research journals. Lipsey and Wilson (1993) found more
than 350 meta-analyses of experimental studies of treatment effects alone; the
total number is many times larger, because most meta-analyses in psychology and
the social sciences are conducted on correlational data (as was our hypothetical
example above). The overarching meta-conclusion from all these efforts is that
cumulative, generalizable knowledge in the behavioral and social sciences is not
only possible but is increasingly a reality. In fact, meta-analysis has even produced
evidence that cumulativeness of research findings in the behavioral sciences is
probably as great as in the physical sciences. Psychologists have long assumed that
their research studies are less replicable than those in the physical sciences.
Hedges (1987) used meta-analysis methods to examine variability of findings
across studies in 13 research areas in particle physics and 13 research areas in
psychology. Contrary to common belief, his findings showed that there was as
much variability across studies in physics as in psychology. Furthermore, he found
that the physical sciences used methods to combine findings across studies that
were “essentially identical” to meta-analysis. The research literature in both
areas—psychology and physics—yielded cumulative knowledge when meta-analysis
was properly applied. Hedges’s major finding is that the frequency of conflicting
research findings is probably no greater in the behavioral and social sciences than
in the physical sciences. The fact that this finding has been so surprising to many
psychologists points up two facts. First, psychologists’ reliance on significance
tests has caused our research literatures to appear much more inconsistent than
they are. Second, we have long overestimated the consistency of research findings
in the physical sciences. In the physical sciences also, no research question can be
answered by a single study, and physical scientists must use meta-analysis to make
sense of their research literature, just as psychologists do.
Another fact is relevant at this point: The physical sciences, such as physics
and chemistry, do not use statistical significance testing in interpreting their data
(Cohen, 1990). It is no accident, then, that these sciences have not experienced the
debilitating problems described earlier that are inevitable when researchers rely on
significance tests. Given that the physical sciences regard reliance on significance
testing as unscientific, it is ironic that so many psychologists defend the use of
significance tests on grounds that such tests are the objective and scientifically
correct approach to data analysis and interpretation. In fact, it has been our
experience that psychologists and other behavioral scientists who attempt to
defend significance testing usually equate null hypothesis statistical significance
testing with scientific hypothesis testing in general. They argue that hypothesis
testing is central to science and that the abandonment of significance testing would
amount to an attempt to have a science without hypothesis testing. They falsely
believe that null hypothesis significance testing and hypothesis testing in science
in general are one and the same thing. This belief is tantamount to stating that
physics, chemistry, and the other physical sciences are not legitimate sciences
because they are not built on hypothesis testing. Another logical implication of this
belief is that prior to the introduction of null hypothesis significance testing by
Fisher (1932) in the 1930s, no legitimate scientific research was possible. The fact
is, of course, that there are many ways to test scientific hypotheses—and that
significance testing is one of the least effective methods of doing this (Schmidt &
Hunter, 1997).
Is Statistical Power the Solution?
Some researchers believe that the only problem with significance testing is low
power and that if this problem could be solved there would be no problems with reliance
on significance testing. These individuals see the solution as larger sample sizes. They
believe that the problem would be solved if every researcher before conducting each
study would calculate the number of subjects needed for “adequate” power (usually taken
as power of .80) and then use that sample size. What this position overlooks is that this
requirement would make it impossible for most studies ever to be conducted. At the start
of research in a given area, the questions are often of the form, “Does Treatment A have
an effect?” (e.g., Does interpersonal skills training have an effect? Or: Does this predictor
have any validity?). If Treatment A indeed has a substantial effect, the sample size needed
for adequate power may not be prohibitively large. But as research develops, subsequent
questions tend to take the form, “Is the effect of Treatment A larger than the effect of
Treatment B?” (e.g., Is the effect of the new method of training larger than that of the old
method? Or: Is predictor A more valid than predictor B?). The effect size then becomes
the difference between the two effects. Such effect sizes will often be small, and the
required sample sizes are therefore often quite large—often 1,000 or 2,000 or more
(Schmidt & Hunter, 1978). And this is just to attain power of .80, which still allows a
20% Type II error rate when the null hypothesis is false—an error rate most would
consider high. Many researchers cannot obtain that many subjects, no matter how hard
they try; either it is beyond their resources or the subjects are just unavailable at any cost.
Thus the upshot of this position would be that many—perhaps most—studies would not
be conducted at all.1
People advocating the power position say this would not be a loss. They
argue that a study with inadequate power contributes nothing and therefore should
not be conducted. But in fact such studies contain valuable information when
combined with others like them in a meta-analysis. In fact, very precise meta-
analysis results can be obtained based on studies that all have inadequate statistical
power individually. The information in these studies is lost if these studies are
never conducted.
The belief that such studies are worthless is based on two false
assumptions: (a) the assumption that every individual study must be able to justify
a conclusion on its own, without reference to other studies, and (b) the assumption
that every study should be analyzed using significance tests. One of the
contributions of meta-analysis has been to show that no single study is adequate by
itself to answer a scientific question. Therefore each study should be considered as
1. Something like this has apparently occurred in industrial-organizational psychology in
the area of validation studies of personnel selection methods. After the appearance of the
Schmidt, Hunter, and Urry (1976) article showing that statistical power in criterion related
validity studies probably averaged less than .50, researchers began paying more attention
to statistical power in designing studies. Average sample sizes increased from around 70 to
more than 200, with corresponding increases in statistical power. However, the number of
studies conducted declined dramatically, with the result that the total amount of information
created per year or per decade for entry into meta-analyses (validity generalization) studies
probably decreased. That is, the total amount of information generated in the earlier period
from large numbers of small sample studies may have been greater than that generated in
the later period from a small number of larger sample studies.
a data point to be contributed to a later meta-analysis. And individual studies
should be analyzed using not significance tests but point estimates of effect sizes
and confidence intervals.
How, then, can we solve the problem of statistical power in individual
studies? Actually, this problem is a pseudoproblem. It can be “solved” by
discontinuing the significance test. As Oakes (1986, p. 68) notes, statistical power
is a legitimate concept only within the context of statistical significance testing. If
significance testing is not used, then the concept of statistical power has no place
and is not meaningful. In particular, there need be no concern with statistical
power when point estimates and confidence intervals are used to analyze data in
individual studies and meta-analysis is used to integrate findings across studies.
Our critique of the traditional practice of reliance on significance testing in
analyzing data in individual studies and in interpreting research literatures might
suggest a false conclusion: the conclusion that if significance tests had never been
used, the research findings would have been consistent across different studies
examining a given relationship. Consider the correlation between job satisfaction
and job performance. Would these studies have all had the same findings if
researchers had not relied on significance tests? Absolutely not: the correlations
would have varied widely (as indeed they did). The major reason for this
variability in correlations is simple sampling error—caused by the fact that the
small samples used in individual research studies are randomly unrepresentative of
the populations from which they are drawn. Most researchers severely
underestimate the amount of variability in findings that is caused by sampling
error.
The law of large numbers correctly states that large random samples are
representative of their populations and yield parameter estimates that are close to
the real (population) values. Many researchers seem to believe that the same law
applies to small samples. As a result they erroneously expect statistics computed
on small samples (e.g., 50 to 300) to be close approximations to the real
(population) values. In one study we conducted (Schmidt, Ocasio, Hillery, &
Hunter, 1985), we drew random samples (small studies) of N = 30 from a much
larger data set and computed results on each N = 30 sample. These results varied
dramatically from “study” to “study”—and all this variability was due solely to
sampling error (Schmidt et al., 1985). Yet when we
showed these data to researchers they found it hard to believe that each study
was a random draw from the same larger data set. They did not believe simple sampling
error could produce that much variation. They were shocked because they did not
realize how much variation simple sampling error produces in research studies.
A major advantage of meta-analysis is that it controls for sampling error.
Sampling error is random and nonsystematic—overestimation and underestimation of
population values are equally likely. Hence averaging correlations or d-values
(standardized mean differences) across studies causes sampling error to be
averaged out, producing an accurate estimate of the underlying population
correlation or mean population correlation. As noted earlier, we can also subtract
sampling error variance from the between-study variance of the observed
correlations (or d-values) to get a more accurate estimate of real variability across
studies. Taken together, these two procedures constitute what we call bare bones
meta-analysis—the simplest form of meta-analysis. Bare bones meta-analysis is
discussed in more detail in a later section.
Most other artifacts that distort study findings are systematic rather than
random. They usually create a downward bias on the obtained study r or d value.
For example, all variables in a study must be measured and all measures of
variables contain measurement error. (There are no exceptions to this rule.) The
effect of measurement error is to downwardly bias every correlation or d-value.
However, measurement error can also contribute to differences between studies: If
the measures used in one study have more measurement error than those used in
another study, the observed rs or ds will be smaller in the first study. Thus meta-
analysis must correct for both the downward bias and the artifactually created
differences between different studies. Corrections of this sort are discussed under
the heading “More Advanced Methods of Meta-Analysis.”
Organization of Remainder of This
Chapter
Different methodologists have developed somewhat different approaches to
meta-analysis (Glass, McGaw, & Smith, 1981; Hedges & Olkin, 1985; Hunter &
Schmidt, 1990b, 2004; Hunter, Schmidt, & Jackson, 1982; Rosenthal, 1991).
We first examine the Hunter-Schmidt methods,
followed by an examination of the other approaches. Finally, we look at the impact
of meta-analysis over the past 20 years on the research enterprise in psychology
and other disciplines.
Bare Bones Meta-Analysis
Bare bones meta-analysis corrects only for the distorting effects of
sampling error. It ignores all other statistical and measurement artifacts that distort
study findings. For this reason we do not recommend bare bones meta-analysis
for use in final integration of research literatures. Its primary value is that it allows
illustration of some of the key features of more complete methods of meta-analysis.
We illustrate bare bones meta-analysis using the data shown in Table
22.1. Table 22.1 shows 21 observed correlations, each based on a sample of 68
U.S. Postal Service letter sorters. Each study presents the estimated correlation
between the same aptitude test and the same measure of accuracy in sorting letters
by zip code. Values range from .02 to .39 and only eight of the 21 (38%) are
statistically significant. Both these facts suggest a great deal of disagreement
among the studies.
Table 22.1 21 Validity Studies (N = 68 Each)
Study Observed validity Study Observed validity
1 .04 12 .11
2 .14 13 .21
3 .31* 14 .37*
4 .12 15 .14
5 .38* 16 .29*
6 .27* 17 .26*
7 .15 18 .17
8 .36* 19 .39*
9 .20 20 .22
10 .02 21 .21
11 .23
* p < .05, two-tailed.
We first compute the average correlation using the following formula:

r̄ = Σ N_i r_i / Σ N_i = .22   (1)

where r̄ (the average observed correlation) estimates ρ̄_xy, the population mean
correlation. Note that this formula weights each correlation by its sample
size—because studies with larger Ns contain more information. (However, in this
case all N = 68, so all studies are weighted equally.) The mean value of .22 is the
meta-analysis estimate of the mean population correlation.
We next compute the variance of the observed correlations using the
following formula:

S²_r = Σ N_i (r_i − r̄)² / Σ N_i = .0120   (2)

This formula also weights by sample size. The next step is to compute the
amount of variance in the observed correlations expected across these studies due
solely to sampling error variance:

S²_e = (1 − r̄²)² / (N̄ − 1) = .0135   (3)

where N̄ is the average sample size across studies.
Finally, we estimate the amount of between-study variance that is left after we
subtract out expected sampling error variance:

S²_ρxy = S²_r − S²_e   (4)

where S²_ρxy estimates σ²_ρxy, the population value.

S²_ρxy = .0120 − .0135 = −.0015

In this case there is slightly less variance in the observed rs than is predicted
from sampling error. (This deviation from zero is called second order sampling
error. Negative variance estimates also occur in ANOVA and other statistical
models in which estimates are produced by subtraction; see Hunter & Schmidt,
2004, chapter 9.) Hence we conclude that ρ̄_xy = .22 and SD_ρxy = 0. That is, we
conclude that sampling error accounts for all the observed differences between the
studies. We conclude that the population ρ_xy value underlying every one of the
studies is .22.
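The arithmetic can be verified with a few lines of code. The sketch below applies Equations 1 through 4 to the 21 correlations in Table 22.1; the results may differ from the rounded values quoted above by a few points in the third decimal place because the table entries are themselves rounded.

```python
import numpy as np

rs = np.array([.04, .14, .31, .12, .38, .27, .15, .36, .20, .02, .23,
               .11, .21, .37, .14, .29, .26, .17, .39, .22, .21])
Ns = np.full(len(rs), 68.0)

r_bar  = np.sum(Ns * rs) / np.sum(Ns)                 # Eq. 1: weighted mean, about .22
S2_r   = np.sum(Ns * (rs - r_bar) ** 2) / np.sum(Ns)  # Eq. 2: observed variance of the rs
S2_e   = (1 - r_bar ** 2) ** 2 / (Ns.mean() - 1)      # Eq. 3: expected sampling error variance
S2_rho = S2_r - S2_e                                  # Eq. 4: residual variance

print(round(r_bar, 3), round(S2_r, 4), round(S2_e, 4), round(S2_rho, 4))
# The residual is at or slightly below zero, so SD_rho is estimated as 0: sampling
# error accounts for essentially all of the between-study variation.
```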
This example illustrates how meta-analysis sometimes reveals that all of the
apparent variability across studies is illusory. Frequently, however, there is
considerable variability remaining after correcting for sampling error. Often this
remaining variability will be due to other variance-producing artifacts that have
not been corrected for. But sometimes some of it might be “real.” Suppose the
researcher hypothesizes (say, based on evolutionary psychology theory) that the
results are different for males and females. He or she can then check this
hypothesis by subgrouping the studies into those conducted on males and those
conducted on females. If sex is indeed a real moderator, then the mean correlations
will be different for males and females. The average within-group SD_ρxy will also
be smaller than the overall SD_ρxy. Later we will discuss other methods of checking
for moderator variables.
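The subgrouping check just described can be sketched as follows. The correlations and sample sizes are hypothetical, invented only to show the mechanics: a bare bones analysis is run separately within each subgroup and the subgroup means and residual variances are compared.

```python
import numpy as np

def bare_bones(rs, Ns):
    """Return the weighted mean r and the residual variance after removing sampling error."""
    rs, Ns = np.asarray(rs, float), np.asarray(Ns, float)
    r_bar = np.sum(Ns * rs) / np.sum(Ns)
    S2_r = np.sum(Ns * (rs - r_bar) ** 2) / np.sum(Ns)
    S2_e = (1 - r_bar ** 2) ** 2 / (Ns.mean() - 1)
    return r_bar, max(S2_r - S2_e, 0.0)

# hypothetical studies coded by sex of sample
male_rs,   male_Ns   = [.35, .28, .41, .30], [60, 75, 50, 80]
female_rs, female_Ns = [.12, .18, .09, .15], [70, 65, 90, 55]

print("males:  ", bare_bones(male_rs, male_Ns))
print("females:", bare_bones(female_rs, female_Ns))
# A real moderator shows up as different subgroup means, with little residual
# variance left within each subgroup.
```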
Other Artifacts and Their Effects
Bare bones meta-analysis is deficient and should not be used without
further refinement in integrating research literatures. It is deficient because there is
no research literature in which the only source of distortion in study findings is
sampling error. Since there are no scales that are free of measurement error, the
findings of every study are distorted by measurement error—in both the
independent variable measure and the dependent variable measure. In addition,
independent of measurement error, no measure has perfect construct validity; all
measures, even good ones, have at least some construct deficiency (something left
out) and some construct contamination (something included that should not be).
The findings of most studies are also distorted by other artifacts.
Table 22.2 lists 10 of these additional artifacts. (For notational
simplicity, we consider each of these as population values.) Measurement error in
the independent and dependent variable measures biases obtained correlations or
d-values downward, with the amount of downward bias depending on the size of
the reliabilities of the measures. For example, if both measures have reliability of
.70, the downward bias from this artifact alone will be 30%. In addition,
differences in reliability between studies will cause differences in findings between
studies.
Table 22.2 Study Artifacts Beyond Sampling Error That Alter the Value of
Outcome Measures, With Examples From Personnel Selection Research
1. Error of measurement in the dependent variable. Example:
Study validity will be systematically lower than true validity to the extent that job
performance is measured with random error.
2. Error of measurement in the independent variable. Example:
Study validity for a test will systematically understate the validity of the ability
measured since the test is not perfectly reliable.
3. Dichotomization of a continuous dependent variable. Example:
Turnover—the length of time that a worker stays with the organization—is often
dichotomized into “more than . . .” or “less than . . . ,” where the cutoff point is some
arbitrarily chosen interval such as one year or six months.
4. Dichotomization of a continuous independent variable. Example:
Interviewers are often told to dichotomize their perceptions into “acceptable” versus
“reject.”
5. Range variation in the independent variable. Example:
Study validity will be systematically lower than true validity to the extent that hiring
policy causes incumbents to have a lower variation in the predictor than is true of
applicants.
6. Attrition artifacts: Range variation in the dependent variable. Example:
Study validity will be systematically lower than true validity to the extent that there is
systematic attrition in workers on performance, as when good workers are promoted
out of the population or when poor workers are fired for poor performance, or both.
7. Deviation from perfect construct validity in the independent variable. Example:
Study validity will vary if the factor structure of the test differs from the usual structure
of tests for the same trait.
8. Deviation from perfect construct validity in the dependent variable. Example:
Study validity will differ from true validity if the criterion is deficient or contaminated.
9. Reporting or transcriptional error. Example:
Reported study validities may differ from actual study validities due to a variety of
reporting problems: inaccuracy in coding data, computational errors, errors in reading
computer output, typographical errors by secretaries or by printers. These errors can
be very large in magnitude.
10. Variance due to extraneous factors. Example:
Study validity will be systematically lower than true validity if incumbents differ in job
experience at the time their performance is measured.
Either or both of the continuous independent and dependent variable
measures may be dichotomized (typically at the median). If both measures are
dichotomized at the median or mean, the underlying correlation will be reduced by
the factor .80 × .80, or .64 (Hunter & Schmidt, 1990a).
Either or both of the variables may be affected by range variation. For
example, only the top 50% of test scorers might be hired, producing a downward
bias of around 30% due to direct range restriction on the independent variable. In
addition, those with poor job performance might be fired, producing range
restriction on the dependent variable, resulting in a further downward bias. Unlike
data errors and sampling errors, range restriction is a systematic artifact. Range
restriction reduces the mean correlation or d-value. Also, variation in range
restriction across studies increases the between-study variability of study
outcomes. Differences across studies in variability of measures can be produced by
direct or indirect range restriction (DRR and IRR). DRR is produced by direct
truncation on the independent variable and on only that variable. For example,
range restriction would be direct if college admissions were based only on one test
score, with every applicant above the cut score being admitted and everyone else
rejected. This is quite rare, because multiple items of information are almost
always used in such decision making. Most range restriction is indirect. For
example, self-selection into psychology lab studies can result in IRR on study
variables. Range restriction is correctable in a meta-analysis, and a new procedure
has recently been developed for correcting for IRR that can be applied when older
procedures cannot be (Hunter, Schmidt, & Le, 2006). This procedure has been
demonstrated via Monte Carlo simulation studies to be quite accurate (Le &
Schmidt, 2006). Application of this new correction method has shown that general
intelligence is considerably more important in job performance than previously
believed. The correlation (validity) for the most common job group in the economy
is about .65. Previous estimates of this correlation, based on corrections for DRR
when in fact IRR existed in the data, have been about .50 (Hunter et al., 2006;
Schmidt, Oh, & Le, 2006), so the new, more accurate estimate is about 30% larger
than the older one. Application of this correction for IRR has also shown that the
relative importance of general mental ability in job performance compared to that
of personality is greater than previously thought (Schmidt, Shaffer, & Oh, 2008). A
recent application of this IRR correction has also shown that the Graduate
Management Admission Test (GMAT) is more valid than previously thought (Oh,
Schmidt, Shaffer, & Le, 2008). These are examples of how meta-analysis methods
are continuing to grow and develop, even 35 years after their introduction in the
mid-1970s (Schmidt, 2008).
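For the simpler case of direct range restriction, the classical correction can be sketched in a few lines (this is the standard Thorndike Case II formula, not the newer indirect range restriction procedure of Hunter, Schmidt, and Le, 2006; the function name and the example numbers are assumptions of mine).

```python
def correct_direct_range_restriction(r_restricted: float, u: float) -> float:
    """Correct a correlation observed in a range-restricted sample.

    u is the ratio of the restricted to the unrestricted standard deviation
    of the independent variable (u < 1 under restriction).
    """
    U = 1.0 / u  # ratio of unrestricted to restricted SD
    return (U * r_restricted) / (1 + r_restricted ** 2 * (U ** 2 - 1)) ** 0.5

# Example: an observed validity of .30 in a sample where hiring reduced the
# predictor SD to 60% of its applicant-pool value.
print(round(correct_direct_range_restriction(0.30, 0.60), 3))   # about .46
```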
Deviation from perfect construct validity in the two measures produces an
additional independent downward bias. Construct validity is defined as the
correlation between the actual construct one is attempting to measure and true
scores on the scale one uses to measure that construct. Although this correlation
cannot be directly observed, there is much empirical evidence to indicate it is
rarely perfect.
Errors in the data are not systematic in their effect on the mean correlation
or mean d-value. The distortion produced in the correlation can be in either
direction. Across studies such errors do have a systematic effect: they increase the
amount of artifactual between-study variation. Sometimes data errors can be
detected and corrected (e.g., observed correlations larger than 1.00 can be spotted
as errors) but usually this is not possible in meta-analysis.
An example. Suppose the construct level correlation
between two personality traits A and B is .60 (ρ_AB = .60). This is the correlation
that we as researchers are interested in, the correlation between the two constructs
themselves. Measure x is used to measure Trait A and measure y is used to
measure Trait B. Now suppose we have the following situation:
a1 = .90 = the square root of the reliability of x; rxx = .81;
a2 = .90 = the square root of the reliability of y; ryy = .81;
a3 = .90 = the construct validity of x;
a4 = .90 = the construct validity of y;
a5 = .80 = the attenuation factor for splitting x at the median; and
a6 = .80 = the attenuation factor for splitting y at the median.
This is not an extreme example. Both measures have acceptable reliability
(.81 in both cases). Both measures have high construct validity; for each measure,
its true scores correlate .90 with the actual construct. Both measures have been
dichotomized into low and high groups, but the dichotomization is at the median,
which produces less downward bias than any other split.
The total impact of the six study imperfections is the total
attenuation factor A:
A = (.90)(.90)(.90)(.90)(.80)(.80) = .42 (5)
Hence the attenuated study correlation—the expected observed study
correlation—is:

ρ_xy = .42 ρ_AB = .42(.60) = .25   (6)
That is, the study correlation is reduced to less than half the value of the
actual correlation between the two personality traits.
This realistic example illustrates the power of artifacts other than sampling
error to severely distort study results. These artifacts produce serious distortions
and must be taken seriously. This example contains six artifacts; the first four of
these are always present in every study. Dichotomization does not occur in every
study, but in many studies in which it does not, other artifacts such as range
restriction do occur, and the overall attenuation factor, A, is often smaller than
our .42 here.
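The arithmetic of Equations 5 and 6 can be written out directly; the sketch below uses only the six attenuation factors from the example above.

```python
# the six artifact attenuation factors from the example
factors = [0.90, 0.90, 0.90, 0.90, 0.80, 0.80]

A = 1.0
for a in factors:
    A *= a                    # compound attenuation factor, about .42

rho_AB = 0.60                 # construct level correlation
rho_xy = A * rho_AB           # expected observed correlation, about .25
print(round(A, 2), round(rho_xy, 2))
```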
This example illustrates a single study. The different studies in a research
literature will have different levels of artifacts and hence different levels of
downward bias. Hence these artifacts not only depress the overall mean observed
correlation, they also create additional variability in correlations across studies
beyond that created by sampling error.
More Advanced Methods of Meta-
Analysis
More advanced forms of meta-analysis correct for these artifacts. First, they
correct for the overall downward bias produced by the artifacts. Second, they
correct for the artifactual differences between studies that these artifacts create.
These more advanced meta-analysis methods take two forms: methods in which
each observed study correlation (or d-value) is corrected individually, and methods
in which distributions of artifact values are used to correct the entire distribution
of observed correlations (or d-values) at one time. As discussed later, both of these
advanced meta-analysis methods are referred to as psychometric meta-analysis
methods. These methods are discussed in the following two sections.
Methods That Correct Each r or d Value Independently
We will describe this form of meta-analysis for correlations but the same
principles apply when the statistic being used is the d-value. The method that
corrects each statistic individually is the most direct form of the more complete
methods of meta-analysis. In this method, each individual observed correlation is
corrected for each of the artifacts that have biased it. This is most easily illustrated
using our example from the last section. In that example, the underlying construct
level correlation is .60 (ρ_AB = .60). But the total downward bias created by the six
artifacts operating on it reduced it to .25:

ρ_xy = .42 ρ_AB = .42(.60) = .25   (7)
The total attenuating or biasing factor is .42. Now in a real study if we can compute
this total biasing factor, we can correct our observed value of .25 by dividing it by this
factor:
ρ_AB = .25 / .42 = .60   (8)
A correction of this sort is applied to each of the observed correlations included in
the meta-analysis. This correction reverses the downwardly biasing process and restores the
correlation to its actual construct level value. In the population (that is, when N is infinite),
this correction is always accurate, because there is no sampling error. In real studies,
however, sample sizes are not infinite, so there is sampling error. The effect of this sampling
error is that corrections of this kind are accurate only on the average. Because of sampling
error, any single corrected value may be randomly too large or too small, but the average of
such corrected values is accurate. It is the average of these corrected values across all the
studies in the meta-analysis that is the focus of the meta-analysis. So our estimate of
ρ̄_AB is accurate in expectation. There will be no downward or upward bias in ρ̄_AB.
It should be noted that in cases in which reliabilities and range restriction
values are missing for a minority of the effect sizes, it has become common
practice to use average reliability values and average range restriction values to
correct these correlations or d-values. A good example of this practice is the Judge,
Thoresen, Bono, and Patton (2001) meta-analysis of the relationship between
job satisfaction and job performance. This practice appears to introduce little
inaccuracy into the results.
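A minimal sketch of this individual-correction approach is given below. The observed correlations, reliabilities, and sample sizes are hypothetical, and only measurement error in the two measures is corrected; a full analysis would also handle range restriction, dichotomization, and the other artifacts in Table 22.2.

```python
import numpy as np

obs_r = np.array([.25, .18, .31, .22])      # observed study correlations (hypothetical)
rxx   = np.array([.81, .70, .85, .75])      # reliabilities of the independent variable measure
ryy   = np.array([.81, .78, .80, .72])      # reliabilities of the dependent variable measure
N     = np.array([120, 85, 200, 150])       # study sample sizes

A = np.sqrt(rxx * ryy)                      # attenuation factor for each study
rho_hat = obs_r / A                         # corrected (construct level) estimates
mean_rho = np.sum(N * rho_hat) / np.sum(N)  # sample-size-weighted mean of corrected values
print(rho_hat.round(2), round(mean_rho, 3))
```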
Meta-analysis also has a second focus: on the variability of these corrected
correlations. The variance of these corrected correlations is inflated by sampling
error variance. In fact, the corrections actually increase the amount of sampling
error variance. This sampling error variance is subtracted from the variance of the
corrected rs to estimate the real variance of the construct level correlations:
S²_ρAB = S²_ρ̂AB − S²_e   (9)

In this equation, S²_ρ̂AB is the computed variance of the corrected
correlations. This variance contains sampling error, and the amount of that
sampling error is S²_e. Hence the difference between these two figures, S²_ρAB,
estimates the real (i.e., population) variance of ρ_AB. The square root of S²_ρAB is
the estimate of SD_ρAB. Hence we have ρ̄_AB and SD_ρAB as the product of the
meta-analysis. That is, we have estimated the mean and the SD of the underlying
construct level correlations. This is a major improvement over bare bones meta-analysis,
which estimates the mean and SD of the downwardly biased correlations
(ρ̄_xy and SD_ρxy) and hence does not tell us anything about the correlation
between actual constructs or traits.
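Continuing the hypothetical values from the previous sketch, the lines below illustrate Equation 9 under one common approximation: the sampling error variance of a corrected correlation is taken to be the sampling error variance of the observed correlation divided by the square of its attenuation factor.

```python
import numpy as np

obs_r = np.array([.25, .18, .31, .22])
rxx   = np.array([.81, .70, .85, .75])
ryy   = np.array([.81, .78, .80, .72])
N     = np.array([120, 85, 200, 150])

A = np.sqrt(rxx * ryy)                        # per-study attenuation factors
rho_hat = obs_r / A                           # corrected correlations
mean_rho = np.sum(N * rho_hat) / np.sum(N)

S2_rho_hat = np.sum(N * (rho_hat - mean_rho) ** 2) / np.sum(N)  # variance of corrected rs
r_bar = np.sum(N * obs_r) / np.sum(N)
se2 = (1 - r_bar ** 2) ** 2 / (N - 1)         # per-study sampling variance of the observed r
S2_e = np.sum(N * se2 / A ** 2) / np.sum(N)   # sampling error variance after correction
S2_true = S2_rho_hat - S2_e                   # Equation 9
SD_rho = max(S2_true, 0.0) ** 0.5
print(round(float(S2_true), 4), round(float(SD_rho), 3))
```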
If SD_ρAB is zero or very small, this indicates that there are no moderators
(interactions) producing different values of ρ_AB in different studies. Hence there
is no need to test moderator hypotheses. If SD_ρAB is larger, this variation may be
due to other artifacts—such as data errors—that you have not been able to
correct for. However, some of the remaining variation may be due to one or more
moderator variables. If there is theoretical evidence to suggest this, these
hypotheses can be tested by subgrouping the studies and performing a separate
meta-analysis on each subgroup. It may turn out that ρ_AB really is different for
males and females, or for higher versus lower management levels. If so, the
moderator hypothesis has been confirmed. Another approach to moderator analysis
is correlational: the values of ρ̂_AB can be correlated with study characteristics
(hypothesized moderators). Multiple regression can also be used. Values of ρ̂_AB
can be regressed on multiple study characteristics. In all forms of moderator
analysis, there are statistical problems that the researcher should be aware of.
We discuss these problems later.
What we have presented here is merely an overview of the main ideas in
this approach to meta-analysis. A detailed discussion can be found in Hunter and
Schmidt (2004). In that book, chapter 3 discusses application of this method to
correlations and chapter 7 to d-values. The actual formulas are considerably
more complicated than in the case of bare bones meta-analysis and are beyond the
scope and length limitations of this chapter. Windows-based software is available
for applying psychometric meta-analysis (Schmidt & Le, 2004). These programs
are described in some detail in the Appendix to Hunter and Schmidt (2004).
Meta-Analysis Using Artifact
Distributions
Many meta-analyses do not use the method described above. These meta-
analyses do not correct each r or d statistic individually for the artifactual biases
that have affected it. The reason for this is that most studies in those literatures do
not present all of the information on artifacts that is necessary to make these
corrections. For example, many studies do not present information on the
reliability of the scales used. Many studies do not present information on the
degree of range restriction present in the data. The same is true for the other
artifacts.
However, artifact information is usually presented sporadically in the
studies included in the meta-analysis. Some studies present reliability information
on the independent variable measures and some on the dependent variable
measures. Some present range restriction information but not reliability
information. In addition, information on artifact levels typical of the research
literature being analyzed is often available from other sources. For example, test or
inventory manuals often present information on scale reliability. Information on
typical levels of range restriction can be found in the personnel selection literature.
Using all such sources of artifact information, it is often possible to compile a
distribution of artifacts that is representative of that research literature; for
example, a distribution of inter-rater reliabilities of supervisory ratings of job
performance; a distribution of reliabilities typical of spatial ability tests; or a
distribution of reliabilities of job satisfaction measures.
Artifact distribution meta-analysis is a set of quantitative methods for
correcting artifact-produced biases using such distributions of artifacts.
Correlations are not corrected individually. Instead, a bare bones meta-analysis is first
performed, and then the mean (r̄_xy) and the SD (SD_rxy) produced by the bare bones analysis
are corrected for artifacts other than sampling error. The formulas
for this form of meta-analysis are even more complex than those used when each
correlation is corrected individually. These methods are presented in detail in
Hunter and Schmidt (2004), in Chapter 4 for correlations and in Chapter 7 for
the d-value statistic. A large percentage of advanced level meta-analyses use
artifact distribution meta-analysis methods.
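The core idea can be sketched in a few lines of Python, under heavy simplification: the only artifact beyond sampling error is assumed to be measurement error, the artifact distributions are small hypothetical sets of reliabilities, and artifact values are assumed to be independent of the correlations. The bare bones mean is divided by the mean compound attenuation factor, and, as a rough shortcut, the residual SD is divided by the same factor; the published formulas also remove the variance contributed by variation in the artifacts themselves, which this sketch ignores.

import math

# Bare bones results (hypothetical), computed as described earlier in the chapter.
mean_r_xy = 0.25        # mean observed correlation
sd_res    = 0.06        # residual SD after removing sampling error variance

# Hypothetical artifact distributions compiled from the literature and test manuals.
rxx_dist = [0.80, 0.85, 0.75, 0.90]   # reliabilities of the independent variable
ryy_dist = [0.70, 0.72, 0.65, 0.75]   # reliabilities of the dependent variable

# Mean attenuation factor for each artifact, then the compound factor.
a_bar = sum(math.sqrt(v) for v in rxx_dist) / len(rxx_dist)
b_bar = sum(math.sqrt(v) for v in ryy_dist) / len(ryy_dist)
c_bar = a_bar * b_bar

# Corrected mean and (roughly) corrected SD of the construct-level correlations.
rho_bar = mean_r_xy / c_bar
sd_rho  = sd_res / c_bar   # simplification: ignores variance due to artifact variability

print(f"mean rho = {rho_bar:.3f}, SD_rho (approx.) = {sd_rho:.3f}")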
In addition to the methods developed by Hunter and Schmidt, artifact
distribution based meta-analysis methods have been developed by Callender and
Osburn (1980) and Raju and Burke (1983). Computer simulation studies have
shown that all of these methods are quite accurate (e.g., Law, Schmidt, & Hunter,
1994a,; 1994b). In addition, in data sets in which artifact information is available
for each correlation, it is possible to apply both methods of advanced level meta-
analysis to the same set of studies. That is, each correlation can be corrected
individually and artifact distribution based meta-analysis can also be applied in a
separate analysis. In such cases, the meta-analysis results have been essentially
identical (Hunter & Schmidt, 2004), as would be expected.
Moderator hypotheses may also be examined when using artifact
distribution meta-analysis. With this method of meta-analysis, subgrouping of
studies is the preferred method of moderator analysis. Regression of study
correlations onto study characteristics (potential moderators) works less well
because the study correlations in this case have not been (and cannot be)
individually corrected for artifacts and hence the correlations are (differentially)
biased as indices of actual study findings. Hence they lack construct validity as
measures of true study correlations or effect sizes. The set of Windows-based
programs for psychometric meta-analysis mentioned earlier (Schmidt & Le, 2004)
includes programs for application of artifact distribution meta-analysis, as well as
meta-analysis based on correction of individual effect sizes. In both cases, the
programs allow for correction for indirect as well as direct range restriction. (See
descriptions in the Appendix to Hunter and Schmidt, 2004).
Classification of Meta-Analysis
Methods
Meta-analysis methods can be divided into three categories: (a) methods
that are purely descriptive (and do not address sampling error); (b) methods that
address sampling error but not other artifacts; and (c) methods that address both
sampling error and other artifacts that distort findings in individual studies. Figure
22.1 illustrates this classification system and references publications that explicate
each type of method.
Figure 22.1 Schematic illustrating methods of meta-analysis.
Descriptive Meta-Analysis Methods
Glass (1976) advanced the first meta-analysis procedures and coined the
term meta-analysis to designate the analysis of analyses (studies). For Glass, the
purpose of meta-analysis is descriptive; the goal is to paint a very general, broad,
and inclusive picture of a particular research literature (Glass, 1977; Glass,
McGaw, & Smith, 1981). The questions to be answered are very general; for
example, does psychotherapy, regardless of type, have an impact on the
kinds of outcomes that therapy researchers consider important enough to measure,
regardless of the nature of these outcomes (e.g., self-reported anxiety, count of
emotional outbursts, etc.)? Thus Glassian meta-analysis often combines studies
with somewhat different independent variables (e.g., different kinds of therapy)
and different dependent variables. As a result, some critics have criticized these
methods as combining apples and oranges. However, Glassian meta-analysis does
allow for separate meta-analyses for different independent variables (e.g., different
types of psychotherapy). But this is rarely done for different dependent variables.
Glassian meta-analysis has three primary properties:
1. A strong emphasis on effect sizes rather than significance levels. Glass
believed the purpose of research integration is more descriptive than
inferential, and he felt that the most important descriptive statistics are
those that indicate most clearly the magnitude of effects. Glassian meta-
analysis typically employs estimates of the Pearson r or estimates of d.
The initial product of a Glassian meta-analysis is the mean and standard
deviation of observed effect sizes or correlations across studies.
2. Acceptance of the variance of effect sizes at face value. Glassian meta-
analysis implicitly assumes that the observed variability in effect sizes is
real and should have some substantive explanation. There is no attention
to sampling error variance in the effect sizes. The substantive explanations
are sought in the varying characteristics of the studies (e.g., sex or mean
age of subjects, length of treatment, and more). Study characteristics that
correlate with study effect are examined for their explanatory power. The
general finding in applications of Glassian meta-analysis has been that few
study characteristics correlate significantly with study outcomes. Problems
of capitalization on chance and low statistical power associated with this
step in meta-analysis are discussed in Hunter and Schmidt (2004, Chapter 2).
3. A strongly empirical approach to determining which aspects of studies
should be coded and tested for possible association with study
outcomes. Glass (1976, 1977) felt that all such questions are empirical
questions, and he de-emphasized the role of theory in determining
which variables should be tested as potential moderators of study
outcome (see also Glass, 1972).
One variation of Glass’s methods has been labeled study effects meta-
analysis by Bangert-Drowns (1986). It differs from Glass’s procedures in several
ways. First, only one effect size from each study is included in the meta-analysis,
thus ensuring statistical independence within the meta-analysis. If a study has
multiple dependent measures, those that assess the same construct are combined
(usually averaged), and those that assess different constructs are assigned to
different meta-analyses. Second, study effects meta-analysis calls for the meta-
analyst to make some judgments about study methodological quality and to
exclude studies with deficiencies judged serious enough to distort study outcomes.
In reviewing experimental studies, for example, the experimental treatment must
be at least similar to those judged by experts in the research area to be appropriate,
or the study will be excluded. This procedure seeks to calibrate relationships
between specific variables rather than to paint a broad Glassian picture of a
research area. In this sense it is quite different from Glassian methods and is more
focused on the kinds of questions that researchers desire answers to. However, this
approach is like the Glass method in that it does not acknowledge that much of the
variability in study findings is due to sampling error variance. That is, it takes
observed correlations and d-values at face value. Some of those instrumental in
developing and using this procedure are Mansfield and Busse (1977), Kulik and
his associates (Bangert-Drowns, Kulik, & Kulik, 1983; Kulik & Bangert-Drowns,
1983–1984), Landman and Dawes (1982), and Wortman and Bryant (1985). In
recent years, fewer published meta-analyses have used Glassian methods or study
effects meta-analyses.
Meta-Analysis Methods That Focus on Sampling
Error
As noted earlier, numerous artifacts produce the deceptive appearance of
variability in results across studies. The artifact that typically produces more false
variability than any other is sampling error variance. Glassian meta-analysis and
study effects meta-analysis implicitly accept variability produced by sampling error
variance as real variability. There are two types of meta-analyses that move
beyond Glassian methods in that they attempt to control for sampling error
variance.
Homogeneity Test-Based Meta-Analysis
The first of these methods is homogeneity test-based meta-analysis. This
approach has been advocated independently by Hedges (1982b; Hedges & Olkin,
1985) and by Rosenthal and Rubin (1982). Hedges (1982a) and Rosenthal and
Rubin (1982) proposed that chi-square statistical tests be used to decide whether
study outcomes are more variable than would be expected from sampling error
alone. If these chi-square tests of homogeneity are not statistically significant, then
the population correlation or effect size is accepted as constant across studies and
there is no search for moderators.2 Use of chi-square tests of homogeneity to
estimate whether findings in a set of studies differ more than would be expected
from sampling error variance was originally proposed by Snedecor (1946).
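To show what such a test looks like in practice, the following Python sketch computes one common version of the chi-square homogeneity statistic on Fisher-z transformed correlations, using hypothetical data; particular authors implement the test somewhat differently, so this is an illustration rather than a reproduction of any specific published procedure.

import math
from scipy.stats import chi2

# Hypothetical observed correlations and sample sizes.
studies = [(0.30, 80), (0.18, 60), (0.42, 150), (0.25, 100), (0.10, 45)]

# Fisher z transform; the sampling variance of z is approximately 1/(n - 3).
zs = [(math.atanh(r), n - 3) for r, n in studies]          # (z, weight)
z_bar = sum(w * z for z, w in zs) / sum(w for _, w in zs)  # weighted mean z

# Chi-square homogeneity statistic Q with k - 1 degrees of freedom.
Q = sum(w * (z - z_bar) ** 2 for z, w in zs)
df = len(studies) - 1
p = chi2.sf(Q, df)

print(f"Q = {Q:.2f}, df = {df}, p = {p:.3f}")
# A nonsignificant Q is taken, in this approach, to mean the studies are homogeneous;
# as the text notes, the test often has low power and a Type I bias from other artifacts.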
The chi-square test of homogeneity typically has low power to detect
variation beyond sampling error (National Research Council, 1992).3 Hence the
meta-analyst will often conclude that the studies being examined are homogenous
when they are not; that is, the meta-analyst will conclude that the value of ρ_xy or δ_xy is the
same in all the studies included in the meta-analysis when, in fact, these parameters actually
vary across studies. A major problem here is that when this occurs, the fixed effects model of
meta-analysis is then used in almost all cases. Unlike random effects meta-analysis models,
fixed effects models assume zero between-study variability in ρ_xy or δ_xy in computing the
standard error of the mean r or mean d, resulting in underestimates of the relevant standard
errors of the mean. This in turn results in confidence intervals around the mean r or mean d
that are erroneously narrow, sometimes by large amounts (Schmidt, Oh, & Hayes, 2009). This
creates an erroneous impression that the meta-analysis findings are much more precise than
they really are. This problem also results in Type I biases in all significance tests conducted
on the mean r or mean d, and these biases are often quite large
(Hunter & Schmidt, 1999). As a result of this problem, the National Research
Council (1992) report on data integration recommended that fixed effects models
be replaced by random effects models which do not suffer from this problem. We
have also made that recommendation (Hunter & Schmidt, 1999, 2004; Schmidt et
al., 2009). However, the majority of published meta-analyses using the Rosenthal-
Rubin methods and the Hedges-Olkin methods have used their fixed effects
models. For example, most of the meta-analyses that have appeared in
Psychological Bulletin are fixed effects meta-analyses. Most of these analyses
employ the Hedges and Olkin (1985) fixed effects meta-analysis model.
2. Hedges and Olkin (1985) recommend that if theory suggests the existence of
moderators, a moderator analysis should be conducted even if the homogeneity test is not
significant. However, those using their methods typically ignore this recommendation.
3. Other methods of “detecting” the existence of variance beyond that caused by artifacts,
including the 75% rule described in Hunter and Schmidt (1990), also often have low
power. This is why we are wary of tests of any kind for detecting real variation across
studies in population parameters. Instead we recommend that such variation be estimated
quantitatively, as described earlier in this chapter. One implication of this is that one should
always assume and employ a random effects model, the more general model, which
subsumes the fixed effects model as a special case. The issue here is again use of
significance tests (with their false allure of dichotomous yes-no decisions) versus
quantitative point estimates of magnitude and confidence intervals.
Both Rosenthal and Rubin and Hedges and Olkin have presented random
effects meta-analysis models as well as fixed effects methods, but meta-analysts
have rarely employed their random effects methods. The Hunter-Schmidt methods,
described earlier in this chapter, are all random effects methods.
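The practical consequence is easy to see in a small sketch. The Python snippet below (hypothetical data, simplified Hunter-Schmidt-style computations) contrasts a fixed effects standard error of the mean correlation, which reflects only sampling error, with a random effects standard error that also reflects the observed between-study variation; the random effects confidence interval is wider, and appropriately so when real variation exists.

import math

# Hypothetical observed correlations and sample sizes from k studies.
rs = [0.35, 0.10, 0.28, 0.02, 0.45, 0.18]
ns = [60, 75, 50, 90, 40, 120]
k = len(rs)

total_n = sum(ns)
r_bar = sum(n * r for r, n in zip(rs, ns)) / total_n

# Observed (weighted) variance of the correlations.
var_obs = sum(n * (r - r_bar) ** 2 for r, n in zip(rs, ns)) / total_n

# Average sampling error variance implied by the sample sizes.
var_e = sum(n * ((1 - r_bar ** 2) ** 2 / (n - 1)) for n in ns) / total_n

# Fixed effects SE of the mean assumes all between-study variance is sampling error.
se_fixed = math.sqrt(var_e / k)

# A random effects SE of the mean also reflects real between-study variation
# (here approximated by the observed SD of the correlations).
se_random = math.sqrt(var_obs / k)

print(f"mean r = {r_bar:.3f}")
print(f"fixed effects 95% CI half-width:  {1.96 * se_fixed:.3f}")
print(f"random effects 95% CI half-width: {1.96 * se_random:.3f}")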
Hedges (1982b) and Hedges and Olkin (1985) extended the concept of
homogeneity tests to develop a more general procedure for moderator analysis
based on significance testing. It calls for breaking the overall chi-square statistic
down into the sum of within- and between-group chi-squares. The original set of
effect sizes in the meta-analysis is divided into successively smaller subgroups
until the chi-square statistics within the subgroups are non-significant, which is
then interpreted as indicating that sampling error can explain all the variation
within the last set of subgroups.
Homogeneity test-based meta-analysis represents an ironic return to the
practice that originally led to the great difficulties in making sense out of research
literatures: reliance on statistical significance tests. As noted above, the chi-square
test typically has low power. Another problem is that the chi-square test has a Type
I bias. Under the null hypothesis, the chi-square test assumes that all between-
study variance in study outcomes (e.g., rs or ds) is sampling error variance; but
there are other purely artifactual sources of variance between studies in effect
sizes. As discussed earlier, these include computational, transcriptional, and other
data errors, differences between studies in reliability of measurement and in levels
of range restriction, and other artifacts. Thus, even when true study
effect sizes are actually the same across studies, these sources of artifactual
variance will create variance beyond sampling error, sometimes causing the chi-
square test to be significant and hence to falsely indicate heterogeneity of effect
sizes. This is especially likely when the number of studies is large, increasing
statistical power to detect small amounts of such artifactual variance. Another
problem is that even when the variance beyond sampling error is not artifactual, it
often will be small in magnitude and of little or no theoretical or practical
significance. Hedges and Olkin (1985) recognized this fact and cautioned that
researchers should not merely look at significance levels but should evaluate the
actual size of the variance; unfortunately, however, once researchers are caught up
in significance tests, the usual practice is to assume that if it is statistically
significant it is important (and if it is not, it is zero). Once the major focus is on
the results of significance tests, effect sizes are usually ignored.
Bare Bones Meta-Analysis
The second approach to meta-analysis that attempts to control only for the
artifact of sampling error is what we referred to earlier as bare bones meta-
analysis (Hunter, Schmidt, & Jackson, 1982; Hunter & Schmidt, 1990b, 2004; Pearlman,
Schmidt, & Hunter, 1980). This approach can be applied to correlations, d-values, or any other
effect size statistic for which the standard error is known. For example, if the statistic is the
correlation, the mean r across studies is first computed. Then the variance of the set of
correlations is computed. Next the amount of sampling error variance is computed and
subtracted from this observed variance. If the result is zero, then sampling error accounts for
all the observed variance, and the mean r value accurately summarizes all the studies in the
meta-analysis. If not, then the square root of the remaining variance is the index of variability
remaining around the mean r after sampling error variance is removed. Earlier in this chapter
we presented examples of bare bones meta-analysis (the delinquency research example and the
postal letter sorter example).
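A minimal Python sketch of these steps follows, using hypothetical correlations and sample sizes and the approximation that a correlation's sampling error variance is about (1 - r_bar^2)^2 / (N - 1), where r_bar is the weighted mean observed correlation. It illustrates the logic only; the published formulas and software should be used in real applications.

import math

# Hypothetical observed correlations and sample sizes.
studies = [(0.21, 68), (0.34, 120), (0.05, 45), (0.28, 200), (0.17, 90)]

total_n = sum(n for _, n in studies)

# Step 1: sample-size-weighted mean observed correlation.
r_bar = sum(n * r for r, n in studies) / total_n

# Step 2: weighted observed variance of the correlations.
var_obs = sum(n * (r - r_bar) ** 2 for r, n in studies) / total_n

# Step 3: expected sampling error variance, averaged across studies.
var_e = sum(n * ((1 - r_bar ** 2) ** 2 / (n - 1)) for _, n in studies) / total_n

# Step 4: residual variance and SD after removing sampling error.
var_res = max(0.0, var_obs - var_e)
sd_res = math.sqrt(var_res)

print(f"r_bar = {r_bar:.3f}, SD_observed = {math.sqrt(var_obs):.3f}, "
      f"SD_residual = {sd_res:.3f}")
# If SD_residual is essentially zero, sampling error accounts for all of the
# observed variation across these studies.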
Because there are always other artifacts (such as measurement error) that
should be corrected for, we have consistently stated in our writings that the bare
bones meta-analysis method is incomplete and unsatisfactory. It is useful primarily
as the first step in explaining and teaching meta-analysis to novices. However,
studies using bare bones methods have been published; the authors of these studies
have invariably claimed that the information needed to correct for artifacts beyond
sampling error was unavailable to them. In our experience, this is in fact rarely the
case. Estimates of artifact values (e.g., reliabilities of scales) are usually available
from the literature, from test manuals, or from other sources, as indicated earlier.
These values can be used to create distributions of artifacts for use in artifact
distribution-based meta-analysis (described earlier in this chapter). Even partial
corrections for these biasing artifacts produce results that are less inaccurate than
standard bare bones meta-analysis.
Psychometric Meta-Analysis
The third type of meta-analysis is psychometric meta-analysis. These
methods correct not only for sampling error (an unsystematic artifact) but for
other, systematic artifacts, such as measurement error, range restriction or
enhancement, dichotomization of measures, and so forth. These other artifacts are
said to be systematic because, in addition to creating artifactual variation across
studies, they also create systematic downward biases in the results of all studies.
For example, measurement error systematically biases all correlations and d-values
downward. Psychometric meta-analysis corrects not only for the artifactual
variation across studies, but also for the downward biases. Psychometric meta-
analysis is the only meta-analysis method that takes into account both statistical
and measurement artifacts. Two variations of these procedures were described
earlier in this chapter in the section “More Advanced Methods of Meta-Analysis.”
A detailed presentation of these procedures can be found in Hunter and Schmidt
(1990b or 2004) or Hunter, Schmidt, and Jackson (1982). Callender and
Osburn (1980) and Raju and Burke (1983) also developed methods for
psychometric meta-analysis. These methods differ slightly in computational details
but have been shown to produce virtually identical results (Law et al., 1994a,
1994b).
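As a concrete example of the systematic corrections involved, the sketch below (Python, hypothetical values) applies the classic correction for direct range restriction and the correction for attenuation due to measurement error to a single observed correlation. The order of corrections and the choice of restricted versus unrestricted reliabilities matter in practice and are treated carefully in Hunter and Schmidt (2004); this sketch glosses over those details.

import math

r_obs = 0.25   # observed correlation in the restricted, unreliably measured sample
rxx = 0.80     # reliability of the predictor measure (hypothetical)
ryy = 0.60     # reliability of the criterion measure (hypothetical)
u = 0.70       # ratio of restricted to unrestricted predictor SD (range restriction)

# Correct for direct range restriction (Thorndike Case II), then for unreliability.
U = 1.0 / u
r_rr = (r_obs * U) / math.sqrt(1 + r_obs ** 2 * (U ** 2 - 1))
r_construct = r_rr / math.sqrt(rxx * ryy)

print(f"after range restriction correction:  {r_rr:.3f}")
print(f"after measurement error correction:  {r_construct:.3f}")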
Unresolved Problems in Meta-
Analysis
In all forms of meta-analysis, there are unresolvable problems in the search
for moderators. First, when effect size estimates are regressed on multiple study
characteristics, capitalization on chance operates to increase the apparent number
of significant associations for those study characteristics that have no actual
associations with study outcomes. Because the sample size is the number of
studies and many study properties may be coded, this problem is often severe
(Hunter & Schmidt, 2004, Chapter 2). There is no purely statistical solution to
this problem. The problem can be mitigated, however, by basing choice of study
characteristics and final conclusions not only on the statistics at hand, but also on
other theoretically relevant empirical findings (which may be the result of other
meta-analyses) and on theoretical considerations. Results should be examined
closely for substantive and theoretical meaning. Capitalization on chance is a
threat whenever the (unknown) correlation or regression weight is actually zero or
near zero. When there is in fact a relationship, there is another problem: Power to
detect the relation is often low (Hunter & Schmidt, 2004, Chapter 2). Again, this
is because sample size (i.e., the number of studies in the meta-analysis) is small.
Thus, true moderators of study outcomes (to the extent that such exist) may have
only a low probability of showing up as statistically significant. In short, this step
in meta-analysis is often plagued with all the problems of small-sample studies.
Other things being equal, conducting separate meta-analyses on subsets of studies
to identify a moderator does not avoid these problems and may lead to additional
problems of confounding of moderator effects (Hunter & Schmidt, 2004, Chapter 13).
Although there are often serious problems in detecting moderator variables
in meta-analysis, there is no approach to moderator detection that is superior to
meta-analysis. In fact, alternative methods (e.g., quantitative analyses within
individual studies; narrative reviews of literatures) have even more serious
problems and hence are inferior to meta-analysis. Moderator detection is difficult
because a large amount of information is required for clear identification of
moderators. Even sets of 100 to 200 studies often do not contain the required
amounts of information (Schmidt & Hunter, 1978).
Another issue in meta-analysis that is widely regarded as unresolved is the
issue of judgments about which studies to include in a meta-analysis. There is
widespread agreement that studies should not be included if they do not measure
the constructs that are the focus of the meta-analysis. For example, if the focus is
on the correlation between the personality trait of Conscientiousness and job
performance, correlations based on other personality traits should be excluded
from that meta-analysis. Correlations between Conscientiousness and measures
of other dependent variables—such as tenure—should also be excluded. In
addition, it should be explicitly decided in advance exactly what kinds of measures
qualify as measures of job performance. For many purposes, only measures of
overall job performance will qualify; partial measures, such as citizenship
behaviors on the job, are deficient in construct validity as measures of overall job
performance. Hence there is general agreement that meta-analysis requires careful
attention to construct validity issues in determining which studies should be
included.
Most meta-analysis studies published today contain multiple meta-analyses.
To continue our example from the previous paragraph, the meta-analysis of the
relation between Conscientiousness and job performance would probably be only
one of several reported in the article. Others would include the relationship with
job performance for the other four of the Big Five personality traits. In addition,
other meta-analyses would probably be reported separately for other dependent
variables: citizenship behaviors, tenure, absenteeism, and so on. That is, one
meta-analysis is devoted to each combination of constructs. Again, there appears to
be little disagreement that this should be the case. Hence the total number of meta-
analyses is much larger than the total number of meta-analysis publications.
The disagreement concerns whether studies that do meet construct validity
requirements should be excluded on the basis of other alleged methodological
defects. One position is that, in many literatures, most studies should be excluded
a priori on such grounds and that the meta-analysis should be performed on only the
remaining, often small, set of studies. This position reflects the “myth of the
perfect study” discussed at the beginning of this chapter. The alternative that we
advocate is to include all studies that meet basic construct validity requirements
and to treat the remaining judgments about methodological quality as hypotheses
to be tested empirically. This is done by conducting separate meta-analyses on
subgroups of studies that do and do not have the methodological feature in
question and comparing the findings. If the results are essentially identical, then
the hypothesis that that methodological feature affects study outcomes is
disconfirmed and all conclusions should be based on the combined meta-analysis. If
the results are different, then the hypothesis that that methodological feature is
important is confirmed. This position takes it as axiomatic that any methodological
feature that has no effect on study findings is not important and can be
disregarded. In our experience, most methodological hypotheses of this sort are
disconfirmed. In any case, this empirical approach helps to settle disputes about
what methodological features of studies are important. That is, this approach leads
to advances in methodological knowledge.
Because multiple studies are needed to solve the problem of sampling error,
it is critical to ensure the availability of all studies on each topic. A major
unresolved problem is that many good replication articles are rejected by our
primary research journals. Journals currently put excessive weight on innovation
and creativity when evaluating studies and often fail to consider the effects of
either sampling error or other technical problems such as measurement error.
Many journals will not even consider “mere replication studies” or “mere
measurement studies.” Many persistent authors eventually publish such studies in
journals with lower prestige, but they must endure many letters of rejection and
publication is delayed for a long period.
This situation indicates the need for a new type of journal, whether hard-copy
based or electronic, that systematically archives all studies that will be needed
for later meta-analyses. The American Psychological Association’s Experimental
Publication System in the early 1970s was an attempt in this direction. However, at
that time the need subsequently created by meta-analysis did not yet exist; the
system apparently met no real need at that time and hence was discontinued.
Today, the need is so great that failure to have such a system in place is retarding
our efforts to reach our full potential in creating cumulative knowledge in
psychology and the social sciences. The Board of Scientific Affairs of the
American Psychological Association is currently studying the feasibility of such a
system.
This issue is closely related to the potential problem of availability bias or
publication bias. That is, it is possible that the studies that are not available to be
included in the meta-analysis differ in results and outcomes from those that are
available. For example, there is substantial research evidence in areas other than
psychology (e.g., medical research) that studies obtaining results that are not
statistically significant are less likely to be submitted for publication and, if they
are submitted, are less likely to be published. Of course, this problem existed
before meta-analysis came into being or use and is a potential problem for all
research literatures. Nevertheless, it has received a lot of attention in recent years
because of its potential to distort meta-analytic findings. Various methods have
been proposed for detecting publication or availability bias in research literatures
and some methods have been proposed for correcting for the effect of such bias if
it is detected. However, there is no consensus on how best to address this potential
problem. This issue and these procedures are discussed in Hunter and Schmidt
(2004, Chapter 13) and in Rothstein (2008). An edited book has also been
devoted to these methods (Rothstein, Sutton, & Borenstein, 2005).
The Role of Meta-Analysis in Theory
Development
As noted at the beginning of this chapter, the major task in the behavioral
and social sciences, as in other sciences, is the development of theory. A good
theory is a good explanation of the processes that actually take place in a
phenomenon. For example, what actually happens when employees develop a high
level of organizational commitment? Does job satisfaction develop first and then
cause the development of commitment? If so, what causes job satisfaction to
develop and how does it have an effect on commitment? How do higher levels of
mental ability cause higher levels of job performance? Only by increasing job
knowledge? Or also by directly improving problem solving on the job? The
researcher is essentially a detective; his or her job is to find out why and how
things happen the way they do. To construct theories, however, researchers must
first know some of the basic facts, such as the empirical relations among variables.
These relations are the building blocks of theory. For example, if researchers know
there is a high and consistent population correlation between job satisfaction and
organization commitment, this will send them in particular directions in
developing their theories. If the correlation between these variables is very low and
consistent, theory development will branch in different directions. If the relation is
highly variable across organizations and settings, researchers will be encouraged to
advance interactive or moderator-based theories. Meta-analysis provides these
empirical building blocks for theory. Meta-analytic findings tell us what it is that
needs to be explained by the theory. Meta-analysis has been criticized because it
does not directly generate or develop theory (Guzzo, Jackson, & Katzell, 1986).
This is like criticizing typewriters or word processors because they do not generate
novels on their own. The results of meta-analysis are indispensable for theory
construction; but theory construction itself is a creative process distinct from meta-
analysis.
As implied in the language used here, theories are causal explanations. The
goal in every science is explanation, and explanation is always causal. In the
behavioral and social sciences, the methods of path analysis (e.g., see Hunter &
Gerbing, 1982) can be used to test causal theories when the data meet the
assumptions of the method. The relationships revealed by meta-analysis (the
empirical building blocks for theory) can be used in path analysis or structural
equation modeling to test causal theories even when all the delineated
relationships are observational rather than experimental. Experimentally
determined relationships can also be entered into path analyses along with
observationally based relations by transforming d values to correlations. Path
analysis can be a very powerful tool for reducing the number of theories that could
possibly be consistent with the data, sometimes to a very small number, and
sometimes to only one theory (Becker, 2009; Hunter, 1988). For examples, see
Hunter (1983) and Schmidt (1992). Every such reduction in the number of
possible theories is an advance in understanding.
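For example, the transformation of d values to correlations mentioned above can be done with the standard conversion r = d / sqrt(d^2 + 4) when the two groups are of approximately equal size (unequal group sizes require a more general formula). The short Python sketch below, with hypothetical numbers, shows the conversion in both directions.

import math

def d_to_r(d: float) -> float:
    """Convert a standardized mean difference (d) to a correlation,
    assuming approximately equal group sizes."""
    return d / math.sqrt(d ** 2 + 4)

def r_to_d(r: float) -> float:
    """Inverse transformation, under the same equal-group-size assumption."""
    return 2 * r / math.sqrt(1 - r ** 2)

d = 0.50                      # hypothetical meta-analytic mean d for a treatment effect
r = d_to_r(d)                 # about .24; can now enter a path model with other r's
print(f"d = {d:.2f} -> r = {r:.3f} -> back to d = {r_to_d(r):.2f}")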
Meta-Analysis in Industrial-
Organizational Psychology and Other
Applied Areas
There have been numerous applications of meta-analysis in industrial-
organizational (I/O) psychology. The most extensive and detailed application of
meta-analysis in I/O psychology has been the study of the generalizability of the
validities of employment selection procedures (Schmidt, 1988; Schmidt & Hunter,
1981). The findings have resulted in major changes in the field of personnel
selection. Validity generalization research is described in more detail below.
The meta-analysis methods presented in this chapter have been applied in
other areas of I/O psychology and organizational behavior. Meta-analyses on
numerous topic areas routinely appear in top-tier applied psychology journals such
as the Journal of Applied Psychology and Personnel Psychology. Recent meta-
analyses have addressed topics such as Hofstede’s (1980) cultural value
dimensions (Taras, Kirkman, & Steel, 2010), unit-level job satisfaction and
performance (Whitman, Van Rooy, & Viswesvaran, 2010), personality traits
and job turnover (Zimmerman, 2008), team work processes and team effectiveness
(LePine, Piccolo, Jackson, Mathieu, & Saul, 2008), ethnic and gender
subgroup differences in assessment center ratings (Dean, Roth, & Bobko, 2008),
the Productivity Measurement and Enhancement System (ProMES) (Pritchard,
Harrell, DiazGranados, & Guzman, 2008), and the causal relationship
between job attitudes and job performance (Riketta, 2008). These are recent
examples. Older examples include the correlates of role conflict and role
ambiguity (Fisher & Gitelson, 1983), realistic job previews (Premack & Wanous,
1984), Fiedler's contingency theory of leadership (Peters, Harthe, & Pohlman,
1984), the accuracy of self-ratings of ability and skill (Mabe & West, 1982),
the relation of LSAT scores to performance in law schools (Linn & Hastings,
1983), the relation of job satisfaction to absenteeism (Terborg & Lee, 1982),
and the ability of financial analysts to predict stock growth (Coggin & Hunter,
1983). Tett, Meyer, and Roese (1994) list additional areas where meta-analysis
has been applied. The applications have been to both correlational and
experimental literatures. For a detailed discussion of the impact that meta-analysis
has had on I/O psychology, see DeGeest and Schmidt (2011).
The examples cited here applied meta-analysis to research programs. The
results of such programs can sometimes be used as a foundation for policy
recommendations. But meta-analysis can be applied more directly in the public
policy arena. Consider one example. The Federation of Behavioral, Psychological
and Cognitive Sciences sponsors regular Science and Public Policy Seminars
for members of Congress and their staffs. In one seminar, the speaker was Eleanor
Chelimsky, for years the director of the General Accounting Office’s (GAO)
Division of Program Evaluation and Methodology. In that position she pioneered
the use of meta-analysis as a tool for providing program evaluation and other
legislatively significant advice to Congress. Chelimsky (1994) stated that meta-
analysis has proven to be an excellent way to provide Congress with the widest
variety of research results that can hold up under close scrutiny under the time
pressures imposed by Congress. She stated that the General Accounting Office has
found that meta-analysis reveals both what is known and what is not known in a
given topic area, and distinguishes between fact and opinion “without being
confrontational.” One application she cited as an example was a meta-analysis of
studies on the merits of producing binary chemical weapons (nerve gas in which
the two key ingredients are kept separate for safety until the gas is to be used). The
meta-analysis did not support the production of such weapons. This was not what
officials in the Department of Defense wanted to hear, and the Department of
Defense disputed the methodology and the results. But the methodology held up
under close scrutiny, and in the end Congress eliminated funds for binary
weapons. By law it is the responsibility of the General Accounting Office to
provide policy-relevant research information to Congress. So the adoption of meta-
analysis by the General Accounting Office (now called the Government Accountability
Office) provides a clear and even dramatic example of the impact that meta-
analysis can have on public policy.
As noted above, one major application of meta-analysis to date has been the
examination of the validity of tests and other methods used in personnel selection.
Meta-analysis has been used to test the hypothesis of situation specific validity. In
personnel selection it had long been believed that validity was specific to
situations; that is, it was believed that the validity of the same test for what
appeared to be the same job varied from employer to employer, region to region,
across time periods, and so forth. In fact, it was believed that the same test could
have high validity (i.e., a high correlation with job performance) in one location or
organization and be completely invalid (i.e., have zero validity) in another. This
belief was based on the observation that observed validity coefficients for similar
tests and jobs varied substantially across different studies. In some such studies
there was a statistically significant relationship, and in others there was no
significant relationship, which, as noted earlier, was falsely taken to indicate no
relationship at all. This puzzling variability of findings was explained by
postulating that jobs that appeared to be the same actually differed in important but
subtle (and undetectable) ways in what was required to perform them. This belief
led to a requirement for local or situational validity studies. It was held that
validity had to be estimated separately for each situation by a study conducted in
that setting; that is, validity findings could not be generalized across settings,
situations, employers, and the like (Schmidt & Hunter, 1981). In the late 1970s,
meta-analysis of validity coefficients began to be conducted to test whether
validity might in fact be generalizable (Schmidt & Hunter, 1977; Schmidt, Hunter,
Pearlman, & Shane, 1979); these meta-analyses were therefore called validity
generalization studies. If all or most of the study-to-study variability in observed
validities was due to sampling error and other artifacts, then the traditional belief
in situational specificity of validity would be seen to be erroneous, and the
conclusion would be that validity did generalize.
Meta-analysis has now been applied to more than 800 research
literatures in employment selection from the United States, Canada, European
countries, and East Asian countries, with each meta-analysis representing a
predictor-job performance combination. These predictors have included nontest
procedures, such as evaluations of education and experience, employment
interviews, and biographical data scales, as well as ability and aptitude tests. As an
example, consider the relation between quantitative ability and overall job
performance in clerical jobs (Hunter & Schmidt, 1996). This substudy was based
on 223 correlations computed on a total of 18,919 people. All of the variance of
the observed validities was traceable to artifacts. The mean validity was .50. Thus,
integration of these data leads to the general (and generalizable) principle that
the correlation between quantitative ability and clerical performance is .50, with
no true variation around this value. Like other similar findings, this finding shows
that the old belief that validities are situationally specific is false.
Today many organizations use validity generalization findings as the basis
of their selection-testing programs. Validity generalization has been included in
standard texts (e.g., Anastasi, 1988) and in the Standards for Educational and
Psychological Testing (1999). A report by the National Academy of Sciences
(Hartigan & Wigdor, 1989) devoted a chapter (Chapter 6) to validity generalization
and endorsed its methods and assumptions.
Wider Impact of Meta-Analysis on
Psychology
Some have viewed meta-analysis as merely a set of improved methods for
doing literature reviews. Meta-analysis is actually more than that. By
quantitatively comparing findings across diverse studies, meta-analysis can
discover new knowledge not inferable from any individual study and can
sometimes answer questions that were never addressed in any of the individual
studies contained in the meta-analysis. For example, no individual study may have
compared the effectiveness of a training program for people of higher and lower
mental ability; but by comparing mean d-value statistics across different groups of
studies, meta-analysis can reveal this difference. That is, moderator variables
(interactions) never studied in any individual study can be revealed by meta-
analysis. But even though it is much more than that, meta-analysis is indeed an
improved method for synthesizing or integrating research literatures. The premier
review journal in psychology is Psychological Bulletin. In viewing that journal’s
volumes from 1980 to 2010, the impact of meta-analysis is apparent. Over this
time period, a steadily increasing percentage of the reviews published in this
journal are meta-analyses and a steadily decreasing percentage are traditional
narrative subjective reviews. Most of the remaining narrative reviews published
today in Psychological Bulletin focus on research literatures that are not well
enough developed to be amenable to quantitative treatment. Several editors have
told me that it is not uncommon for narrative review manuscripts to be returned by
editors to the authors with the request that meta-analysis be applied to the studies
reviewed.
As noted above, most of the meta-analyses appearing in Psychological
Bulletin have employed fixed effects methods, resulting in many cases in
overstatement of the precision of the meta-analysis findings (Hunter & Schmidt,
1999; Schmidt, Oh, & Hayes, 2009). Despite this fact, these meta-analyses
produce findings and conclusions that are far superior to the confusion produced
by the traditional narrative subjective method. Many other journals have shown the
same increase over time in the number of meta-analyses published. Many of these
journals, such as Journal of Applied Psychology, had traditionally published only
individual empirical studies and had rarely published reviews up until the advent
of meta-analysis in the late 1970s. These journals began publishing meta-analyses
because meta-analyses came to be viewed not as “mere reviews” but as a form of
empirical research in themselves. As a result of this change, the quality and
accuracy of conclusions from research literatures improved in a wide variety of
journals and in a corresponding variety of research areas in psychology. This
improvement in the quality of conclusions from research literatures has expedited
theory development in a wide variety of areas in psychology (DeGeest & Schmidt,
2011).
The impact of meta-analysis on psychology textbooks has been positive and
dramatic. Textbooks are important because their function is to summarize the state
of cumulative knowledge in a given field. Most people, students and others,
acquire most of their knowledge about psychological theory and findings from
their reading of textbooks. Prior to meta-analysis, textbook authors faced with
hundreds of conflicting studies on a single question subjectively and arbitrarily
selected a small number of their preferred studies from such a literature and based
the textbook conclusions on only those few studies. Today most textbook authors
base their conclusions on meta-analysis findings, making their conclusions and
their textbooks much more accurate. It is hard to overemphasize the importance of
this development in advancing cumulative knowledge in psychology.
The realities revealed about data and research findings by the principles of
meta-analysis have produced changes in our views of the individual empirical
study, the nature of cumulative research knowledge, and the reward structure in the
research enterprise. Meta-analysis has explicated the role of sampling error,
measurement error, and other artifacts in determining the observed findings and
statistical power of individual studies. In doing this, it revealed how little
information there is in any single study. It has shown that, contrary to previous
belief, no single primary study can provide more than tentative evidence on any
issue. Multiple studies are required to draw solid conclusions. The first study done
in an area may be revered for its creativity, but sampling error and other artifacts in
that study will often produce a fully or partially erroneous answer to the study
question. The quantitative estimate of effect size will almost always be erroneous
to some degree. The shift from tentative to solid conclusions requires the
accumulation of studies and the application of meta-analysis to those study results.
Furthermore, adequate handling of other study imperfections, such as
measurement error (and especially imperfect construct validity), may also
require separate studies and more advanced meta-analysis. Because of the effects
of artifacts such as sampling error and measurement error, the data in studies come
to us encrypted, and to understand their meaning we must first break the code.
Doing this requires meta-analysis. Therefore any individual study must be
considered only a single data point to be contributed to a future meta-analysis.
Thus the scientific status and value of the individual study is necessarily reduced,
while at the same time the value of individual studies in the aggregate is increased.
Impact of Meta-Analysis Outside
Psychology
The impact of meta-analysis may be even greater in medical research than
in the behavioral and social sciences (Hunt, 1997, Chapter 4). Hundreds of meta-
analyses have now been published in leading medical research journals such as
the New England Journal of Medicine and the Journal of the American Medical
Association. In medical research, the preferred study is the randomized controlled
trial (RCT), in which participants are assigned randomly to receive either the
treatment or a placebo, with the researchers being blind as to which treatment the
participants are receiving. Despite the strengths of this research design, it is
usually the case that different RCTs on the same treatment obtain conflicting
results (Ioannidis, 2010). This is partly because the effect sizes are often small
and partly because (contrary perhaps to our perceptions) RCTs are often based on
small sample sizes. In addition, the problem of information overload is even
greater in medicine than in the social sciences; over a million medical research
studies are published every year. No practitioner can possibly keep up with the
medical literature in his or her area.
The leader in introducing meta-analysis to medical research was Thomas
Chalmers. In addition to being a researcher, Chalmers was also a practicing
internist who became frustrated with the inability of the vast, scattered, and
unfocused medical research literature to provide guidance to practitioners. Starting
in the mid-1970s, Chalmers developed his initial meta-analysis methods
independently of those developed in the social and behavioral sciences. Despite
being well conducted, his initial meta-analyses were not well accepted by medical
researchers, who were critical of the concept of meta-analysis. In response he and
his associates developed “sequential meta-analysis”—a technique that reveals the
date by which enough information had become available to show conclusively that
a treatment was effective. Suppose, for example, that the first RCT for a particular
drug had been conducted in 1975 but had a wide confidence interval, one that
spans zero effect. Now suppose three more studies had been conducted in 1976,
providing a total of four studies to be meta-analyzed—and the confidence interval
for the meta-analytic mean of these studies is narrower but still wide enough that it
includes zero. Now suppose five more RCTs had been conducted in 1977, now
providing nine studies for a meta-analysis up to this date. If that meta-analysis now
yields a confidence interval that excludes zero, then we conclude that, given the
use of meta-analysis, enough information was already available in 1977 to begin
using this drug. Chalmers and his associates then computed, based on the meta-
analysis finding and statistics on the disease, how many lives would have been
saved to date had use of the drug begun in 1977. It turned out that, considered
across different treatments, diseases, and areas of medical practice, a very large
number of lives would have been saved had medical research historically relied on
meta-analysis. The resulting article (Antman, Lau, Kupelnick, Mosteller, &
Chalmers, 1992) is widely considered the most important and influential meta-
analysis ever published in medicine. It was even reported and discussed widely in
the popular press (e.g., the New York Times science section). It assured a major
role for meta-analysis in medical research from that point on (Hunt, 1997, Chapter
4).
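The logic of such a sequential (cumulative) meta-analysis can be sketched in a few lines of Python. The example below uses hypothetical d values and sampling variances, adds studies year by year, and reports whether the simple inverse-variance 95% confidence interval around the pooled mean excludes zero; Chalmers's actual procedures, and modern practice, involve further refinements such as random effects models.

import math

# Hypothetical RCTs: (year, d value, sampling variance of d).
studies = [
    (1975, 0.40, 0.09),
    (1976, 0.15, 0.10), (1976, 0.05, 0.09), (1976, 0.30, 0.12),
    (1977, 0.25, 0.08), (1977, 0.18, 0.09), (1977, 0.35, 0.10),
    (1977, 0.10, 0.08), (1977, 0.28, 0.09),
]

pooled = []
for year in sorted({y for y, _, _ in studies}):
    pooled += [(d, v) for y, d, v in studies if y == year]
    weights = [1 / v for _, v in pooled]                 # inverse-variance weights
    mean_d = sum(w * d for (d, _), w in zip(pooled, weights)) / sum(weights)
    se = math.sqrt(1 / sum(weights))                     # SE of the pooled mean
    lo, hi = mean_d - 1.96 * se, mean_d + 1.96 * se
    verdict = "CI excludes zero" if lo > 0 or hi < 0 else "inconclusive"
    print(f"{year}: k = {len(pooled)}, mean d = {mean_d:.2f}, "
          f"95% CI [{lo:.2f}, {hi:.2f}] -> {verdict}")

With these illustrative numbers the cumulative confidence interval first excludes zero in 1977, mirroring the example in the text.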
Chalmers was also one of the driving forces behind the establishment of the
Cochrane Collaboration, an organization that applies sequential meta-analysis in
medical research in real time. This group conducts meta-analyses in a wide variety
of medical research areas—and then updates each meta-analysis as new RCTs
become available. That is, when a new RCT becomes available, the meta-analysis
is re-run with the new RTC being included. Hence each meta-analysis is always
current. The results of these updated meta-analyses are available on the internet to
researchers and medical practitioners around the world. It is likely that this effort
has saved hundreds of thousands of lives by improving medical decision-making.
The Cochrane Collaboration website is: www.update-software.com/ccweb/cochrane/general.htm.
Meta-analysis has also become important in research in finance, marketing,
sociology, and even wildlife management. In fact, it would probably be difficult to
find a research area in which meta-analysis is unknown today. In the broad areas
of education and social policy, the Campbell Collaboration is attempting to do for
the social sciences what the Cochrane Collaboration (on which it is modeled) has
done for medical practice (Rothstein, 2003; Rothstein, McDaniel, & Borenstein,
2002). Among the social sciences, perhaps the last to assign an important role to
meta-analysis has been economics. However, meta-analysis has recently become
important in economics, too (e.g., see Stanley, 1998, 2001; Stanley & Jarrell,
1989, 1998). There is even a doctoral program in meta-analysis in economics
(http://www.feweb.vu.nl/re/Master-Point/). Meta-analysis is also
beginning to be used in political science (e.g., see Pinello, 1999).
Conclusions
Until recently, psychological research literatures appeared conflicting and
contradictory. As the number of studies on each particular question became larger
and larger, this situation became increasingly frustrating and intolerable. This
situation stemmed from reliance on defective procedures for achieving cumulative
knowledge: the statistical significance test in individual primary studies in
combination with the narrative subjective review of research literatures. Meta-
analysis principles have now correctly diagnosed this problem, and, more
important, have provided the solution. In area after area, meta-analytic findings
have shown that there is much less conflict between different studies than had
been believed, that coherent, useful, and generalizable conclusions can be drawn
from research literatures, and that cumulative knowledge is possible in psychology
and the social sciences. These methods have also been adopted in other areas such
as medical research. Prominent medical researcher Thomas Chalmers (as cited in
Mann, 1990) has stated, “[Meta-analysis] is going to revolutionize how the
sciences, especially medicine, handle data. And it is going to be the way many
arguments will be ended.” (p. 478). In concluding his oft-cited review of meta-
analysis methods, Bangert-Drowns (1986, p. 398) stated:
Meta-analysis is not a fad. It is rooted in the fundamental values of the
scientific enterprise: replicability, quantification, causal and correlational analysis.
Valuable information is needlessly scattered in individual studies. The ability of
social scientists to deliver generalizable answers to basic questions of policy is too
serious a concern to allow us to treat research integration lightly. The potential
benefits of meta-analysis seem enormous.
References
American Educational Research Association, American Psychological Association, and National
Council on Measurement in Education (1999). Standards for educational and
psychological testing. Washington, DC: American Psychological Association.
American Educational Research Association. (2006). Standards for reporting on empirical social
science research in AERA publications. Educational Researcher, 35, 33–40.
American Psychological Association. (2001). Publication manual of the American Psychological
Association (5th ed.). Washington, DC: American Psychological Association.
American Psychological Association. (2009). Publication manual of the American Psychological
Association. Washington, DC: American Psychological Association.
Anastasi, A. (1988). Psychological testing (7th ed.). New York, NY: Macmillan.
Antman, E. M., Lau, J., Kupelnick, B., Mosteller, F., & Chalmers, T. C. (1992). A comparison of
results of meta-analyses of randomized control trials and recommendations of clinical
experts. Journal of the American Medical Association, 268, 240–248.
Bangert-Drowns, R. L. (1986). Review of developments in meta-analytic method. Psychological
Bulletin, 99, 388–399.
Bangert-Drowns, R. L., Kulik, J. A., & Kulik, C. -L. C. (1983). Effects of coaching programs on
achievement test performance. Review of Educational Research, 53, 571–585.
Baum, M. L., Anish, D. S., Chalmers, T. C., Sacks, H. S., Smith, H., & Fagerstrom, R. M.
(1981). A survey of clinical trials of antibiotic prophylaxis in colon surgery: Evidence
against further use of no-treatment controls. New England Journal of Medicine, 305,
795–799.
Becker, B. J. (2009). Model-based meta-analysis. In H. Cooper, L. V. Hedges, & J. C. Valentine
(Eds.), The handbook of research synthesis and meta-analysis (2nd ed., pp. 377–398).
New York, NY: Russell Sage Foundation.
Callender, J. C., & Osburn, H. G. (1980). Development and test of a new model for validity
generalization. Journal of Applied Psychology, 65, 543–558.
Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational
Review, 48, 378–399.
Chelimsky, E. (1994, October 14). Use of meta-analysis in the General Accounting Office. Paper
presented at the Science and Public Policy Seminars, Federation of Behavioral,
Psychological and Cognitive Sciences. Washington, DC.
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review.
Journal of Abnormal and Social Psychology, 65, 145–153.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. (2nd ed.). Hillsdale, NJ:
Erlbaum.
Cohen, J. (1990). Things I learned (so far). American Psychologist, 45, 1304–1312.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Cronbach, L. J. (1975). Beyond the two disciplines of scientific psychology. American
Psychologist, 30, 116–127.
Cuts raise new social science query: Does anyone appreciate social science? (1981, March 27).
Wall Street Journal, p. 54.
Dean, M. A., Roth, P. L., & Bobko, P. (2008). Ethnic and gender subgroup differences in
assessment center ratings: A meta-analysis. Journal of Applied Psychology, 93, 685–691.
DeGeest, D., & Schmidt, F. L. (2011). The impact of research synthesis methods on industrial-
organizational psychology: The road from pessimism to optimism about cumulative
knowledge. Research Synthesis Methods, 1, 185–197.
Fisher, C. D., & Gitelson, R. (1983). A meta-analysis of the correlates of role conflict and
ambiguity. Journal of Applied Psychology, 68, 320–333.
Fisher, R. A. (1932). Statistical methods for research workers (4th ed.). Edinburgh, UK:
Oliver and Boyd.
Fisher, R. A. (1935). The design of experiments. London, UK: Oliver and Boyd.
Fisher, R. A. (1973). Statistical methods and scientific inference (3rd ed.). Edinburgh, UK:
Oliver and Boyd.
Gaugler, B. B., Rosenthal, D. B., Thornton, G. C., & Bentson, C. (1987). Meta-analysis of
assessment center validity. Journal of Applied Psychology, 72, 493–511.
Gergen, K. J. (1982). Toward transformation in social knowledge. New York, NY: Springer-
Verlag.
Glass, G. V. (1972). The wisdom of scientific inquiry on education. Journal of Research in
Science Teaching, 9, 3–18.
Glass, G. V. (1976). Primary, secondary and meta-analysis of research. Educational Researcher, 5,
3–8.
Glass, G. V. (1977). Integrating findings: The meta-analysis of research. Review of Research in
Education, 5, 351–379.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills,
CA: Sage.
Guttman, L. (1985). The illogic of statistical inference for cumulative science. Applied
Stochastic Models and Data Analysis, 1, 3–10.
Guzzo, R. A., Jackson, S. E., & Katzell, R. A. (1986). Meta-analysis analysis. In L. L. Cummings
& B. M. Staw (Eds.), Research in organizational behavior (Vol. 9, pp. 407–442).
Greenwich, CT: JAI Press.
Guzzo, R. A., Jette, R. D., & Katzell, R. A. (1985). The effects of psychologically based
intervention programs on worker productivity: A meta-analysis. Personnel Psychology,
38, 275–292.
Hackett, R. D., & Guion, R. M. (1985). A re-evaluation of the absenteeism-job satisfaction
relationship. Organizational Behavior and Human Decision Processes, 35, 340–381.
Halvorsen, K. T. (1986). Combining results from independent investigations: Meta-analysis in
medical research. In J. C. Bailar & F. Mosteller (Eds.), Medical uses of statistics.
Waltham, MA: New England Journal of Medicine Books.
Hartigan, J. A., & Wigdor, A. K. (1989). Fairness in employment testing: Validity generalization, minority issues, and the General Aptitude Test Battery. Washington, DC: National
Academy Press.
Hedges, L. V. (1982a). Estimation of effect size from a series of independent experiments.
Psychological Bulletin, 92, 490–499.
Hedges, L. V. (1982b). Fitting categorical models to effect sizes from a series of experiments.
Journal of Educational Statistics, 7, 119–137.
Hedges, L. V. (1987). How hard is hard science, how soft is soft science: The empirical
cumulativeness of research. American Psychologist, 42, 443–455.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic
Press.
Hofstede, G. (1980). Culture’s consequences: International differences in work-related values.
Beverly Hills, CA: Sage.
Hunt, M. (1997). How science takes stock. New York, NY: Russell Sage Foundation.
Hunter, J. E. (1979, September). Cumulating results across studies: A critique of factor analysis,
canonical correlation, MANOVA, and statistical significance testing. Invited address
presented at the 86th Annual Convention of the American Psychological Association. New
York, New York.
Hunter, J. E. (1983). A causal analysis of cognitive ability, job knowledge, job performance, and
supervisory ratings. In F. Landy, S. Zedeck, & J. Cleveland (Eds.), Performance
measurement and theory (pp. 257–266). Hillsdale, NJ: Erlbaum.
Hunter, J. E. (1988). A path analytic approach to analysis of covariance. Unpublished
manuscript, Department of Psychology, Michigan State University, East Lansing.
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3–7.
Hunter, J. E., & Gerbing, D. W. (1982). Unidimensional measurement, second order factor
analysis and causal models. In B. M. Staw & L. L. Cummings (Eds.), Research in
organizational behavior (Vol. 4, pp. 267–320), Greenwich, CT: JAI Press.
Hunter, J. E., & Hirsh, H. R. (1987). Applications of meta-analysis. In C. L. Cooper & I. T.
Robertson (Eds.), International review of industrial and organizational psychology 1987
(pp. 321–357). London, UK: Wiley.
Hunter, J. E., & Schmidt, F. L. (1990a). Dichotomization of continuous variables: The
implications for meta-analysis. Journal of Applied Psychology, 75, 334–349.
Hunter, J. E., & Schmidt, F. L. (1990b). Methods of meta-analysis: Correcting error and bias in
research findings. Newbury Park, CA: Sage.
Hunter, J. E., & Schmidt, F. L. (1994). The estimation of sampling error variance in meta-analysis
of correlations: The homogenous case. Journal of Applied Psychology, 79, 171–177.
Hunter, J. E., & Schmidt, F. L. (1996). Cumulative research knowledge and social policy
formulation: The critical role of meta-analysis. Psychology, Public Policy, and Law, 2,
324–347.
Hunter, J. E., & Schmidt, F. L. (2000). Fixed effects vs. random effects meta-analysis models:
Implications for cumulative knowledge in psychology. International Journal of Selection
and Assessment, 8, 275–292.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in
research findings. Thousand Oaks, CA: Sage.
Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). Meta-analysis: Cumulating research
findings across studies. Beverly Hills, CA: Sage.
Hunter, J. E., Schmidt, F. L., & Le, H. (2006). Implications of direct and indirect range restriction
for meta-analysis methods and findings. Journal of Applied Psychology, 91, 594–612.
Iaffaldano, M. T., & Muchinsky, P. M. (1985). Job satisfaction and job performance: A meta-
analysis. Psychological Bulletin, 97, 251–273.
Ioannidis, J. P. A. (2010). Contradicted and initially stronger effects in highly cited clinical
research. Journal of the American Medical Association, 294, 218–228.
Jackson, S. E., & Schuler, R. S. (1985). A meta-analysis and conceptual critique of research on
role ambiguity and role conflict in work settings. Organizational Behavior and Human
Decision Processes, 36, 16–78.
Jöreskog, K. G., & Sörbom, D. (1979). Advances in factor analysis and structural equation
models. Cambridge, MA: Abt Books.
Judge, T. A., Thoresen, C. J., Bono, J. E., & Patton, G. K. (2001). The job satisfaction–job performance relationship: A qualitative and quantitative review. Psychological Bulletin, 127, 376–401.
Kulik, J. A., & Bangert-Drowns, R. L. (1983–1984). Effectiveness of technology in precollege
mathematics and science teaching. Journal of Educational Technology Systems, 12, 137–
158.
Landman, J. T., & Dawes, R. M. (1982). Psychotherapy outcome: Smith and Glass’ conclusions
stand up under scrutiny. American Psychologist, 37, 504–516.
Law, K. S., Schmidt, F. L., & Hunter, J. E. (1994a). Nonlinearity of range corrections in meta-
analysis: A test of an improved procedure. Journal of Applied Psychology, 79, 425–438.
Law, K. S., Schmidt, F. L., & Hunter, J. E. (1994b). A test of two refinements in meta-analysis
procedures. Journal of Applied Psychology, 79, 978–986.
Le, H., & Schmidt, F. L. (2006). Correcting for indirect range restriction in meta-analysis: Testing
a new meta-analytic procedure. Psychological Methods, 11, 416–438.
LePine, J. A., Piccolo, R. F., Jackson, C. L., Mathieu, J. E., & Saul, J. R. (2008). A meta-analysis
of teamwork processes: Tests of a multidimensional model and relationships with team
effectiveness criteria. Personnel Psychology, 61, 273–307.
Lipsey, M. W., & Wilson, D. B. (1993). The efficacy of psychological, educational, and behavioral
treatment: Confirmation from meta-analysis. American Psychologist, 48, 1181–1209.
Mabe, P. A., III, & West, S. G. (1982). Validity of self evaluations of ability: A review and meta-
analysis. Journal of Applied Psychology, 67, 280–296.
Maloley et al. v. Department of National Revenue (1986, February). Canadian Civil Service
Appeals Board, Ottawa, Ontario, Canada.
Mann, C. (1990, August 3). Meta-analysis in the breech. Science, 249, 476–480.
Mansfield, R. S., & Busse, T. V. (1977). Meta-analysis of research: A rejoinder to Glass.
Educational Researcher, 6, 3.
McDaniel, M. A., Schmidt, F. L., & Hunter, J. E. (1988). A meta-analysis of training and
experience ratings in personnel selection. Personnel Psychology, 41, 283–314.
McDaniel, M. A., Whetzel, D. L., Schmidt, F. L. & Maurer, S. D. (1994). The validity of
employment interviews: A comprehensive review and meta-analysis. Journal of Applied
Psychology, 79, 599–616.
McEvoy, G. M., & Cascio, W. F. (1985). Strategies for reducing employee turnover: A meta-
analysis. Journal of Applied Psychology, 70, 342–353.
McEvoy, G. M., & Cascio, W. F. (1987). Do poor performers leave? A meta-analysis of the
relation between performance and turnover. Academy of Management Journal, 30, 744–
762.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow
process of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806–834.
National Research Council. (1992). Combining information: Statistical issues and opportunities
for research. Washington, DC: National Academy of Science Press.
Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences.
New York, NY: Wiley.
Pearlman, K., Schmidt, F. L., & Hunter, J. E. (1980). Validity generalization results for tests used
to predict job proficiency and training success in clerical occupations. Journal of Applied
Psychology, 65, 373–406.
Peters, L. H., Hartke, D., & Pohlmann, J. (1985). Fiedler’s contingency theory of leadership: An
application of the meta-analysis procedures of Schmidt and Hunter. Psychological
Bulletin, 97, 274–285.
Petty, M. M., McGee, G. W., & Cavender, J. W. (1984). A meta-analysis of the relationship
between individual job satisfaction and individual performance. Academy of Management
Review, 9, 712–721.
Pinello, D. R. (1999). Linking party to judicial ideology in American courts: A meta-analysis. The
Justice System Journal, 20, 219–254.
Premack, S., & Wanous, J. P. (1985). Meta-analysis of realistic job preview experiments. Journal
of Applied Psychology, 70, 706–719.
Pritchard, R. D., Harrell, M. M., DiazGranados, D., & Guzman, M. J. (2008). The productivity measurement and enhancement system: A meta-analysis. Journal of Applied Psychology, 93, 540–567.
Raju, N. S., & Burke, M. J. (1983). Two procedures for studying validity generalization. Journal
of Applied Psychology, 68, 382–395.
Ramamurti, A. S. (1989). A systematic approach to generating excess returns using a multiple
variable model. In F. J. Fabozzi (Ed.), Institutional investor focus on investment
management. Cambridge, MA: Ballinger.
Riketta, M. (2008). The causal relations between job attitudes and job performance: A meta-
analysis of panel studies. Journal of Applied Psychology, 93, 472–481.
Rosenthal, R. (1991). Meta-analytic procedures for social research (2nd ed.). Newbury Park,
CA: Sage.
Rosenthal, R., & Rubin, D. B. (1982). Comparing effect sizes of independent studies.
Psychological Bulletin, 92, 500–504.
Rothstein, H. R. (2003). Progress is our most important product: Contributions of validity
generalization and meta-analysis to the development and communication of knowledge in
I/O psychology. In K. R. Murphy (Ed.), Validity generalization: A critical review (pp. 115–154). Mahwah, NJ: Lawrence Erlbaum Associates.
Rothstein, H. R. (2007). Publication bias as a threat to the validity of meta-analytic results.
Journal of Experimental Criminology, 4, 61–81.
Rothstein, H. R., McDaniel, M. A., & Borenstein, M. (2002). Meta-analysis: A review of
quantitative cumulation methods. In N. Schmitt & F. Drasgow (Eds.), Advances in
measurement and data analysis. San Francisco, CA: Jossey-Bass.
Rothstein, H. R., Sutton, A. J., & Borenstein, M. (Eds.). (2005). Publication bias in meta-
analysis: Prevention, assessment, and adjustment. Chichester, UK: Wiley.
Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. Psychological
Bulletin, 57, 416–428.
Sackett, P. R. (2003). The status of validity generalization research: Key issues in drawing inferences from cumulative research findings. In K. R. Murphy (Ed.), Validity generalization: A critical review (pp. 91–114). Mahwah, NJ: Lawrence Erlbaum Associates.
Sacks, H. S., Berrier, J., Reitman, D., Ancona-Berk, V. A., & Chalmers, T. C. (1987). Meta-
analysis of randomized controlled trials. New England Journal of Medicine, 316, 450–
455.
Schmidt, F. L. (1988). Validity generalization and the future of criterion-related validity. In H.
Wainer & H. I. Braun (Eds.), Test validity (pp. 173–292). Hillsdale, NJ: Erlbaum.
Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and
cumulative knowledge in psychology. American Psychologist, 47, 1173–1181.
Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology:
Implications for the training of researchers. Psychological Methods, 1, 115–129.
Schmidt, F. L. (2008). Meta-analysis: A constantly evolving research integration tool.
Organizational Research Methods, 11, 96–113.
Schmidt, F. L. (2010). How to detect and correct the lies that data tell. Perspectives on
Psychological Science, 5, 233–242.
Schmidt, F. L., Gast-Rosenberg, I., & Hunter, J. E. (1980). Validity generalization results for
computer programmers. Journal of Applied Psychology, 65, 643–661.
Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of
validity generalization. Journal of Applied Psychology, 62, 529–540.
Schmidt, F. L., & Hunter, J. E. (1978). Moderator research and the law of small numbers.
Personnel Psychology, 31, 215–232.
Schmidt, F. L., & Hunter, J. E. (1981). Employment testing: Old theories and new research
findings. American Psychologist, 36, 1128–1137.
Schmidt, F. L., & Hunter, J. E. (1995). The impact of data analysis method on cumulative
knowledge: Statistical significance testing, confidence intervals, and meta-analysis.
Evaluation and the Health Professions, 18, 408–427.
Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons
from 26 research scenarios. Psychological Methods, 1, 199–223.
Schmidt, F. L., & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 37–64). Mahwah, NJ: Erlbaum.
Schmidt, F. L., & Hunter, J. E. (2003). History, development, evolution, and impact of validity
generalization and meta-analysis methods. In K. R. Murphy (Ed.), Validity
generalization: A critical review. Mahwah, NJ: Lawrence Erlbaum Associates.
Schmidt, F. L., Hunter, J. E., Pearlman, K., & Shane, G. S. (1979). Further tests of the Schmidt-
Hunter Bayesian validity generalization procedure. Personnel Psychology, 32, 257–281.
Schmidt, F. L., Hunter, J. E., & Urry, V. E. (1976). Statistical power in criterion-related validation
studies. Journal of Applied Psychology, 61, 473–485.
Schmidt, F. L., Law, K. S., Hunter, J. E., Rothstein, H. R., Pearlman, K., & McDaniel, M. (1993).
Refinements in validity generalization methods: Implications for the situational specificity
hypothesis. Journal of Applied Psychology, 78, 3–13.
Schmidt, F. L., & Le, H. (2004). Software for the Hunter-Schmidt meta-analysis methods.
University of Iowa, Department of Management and Organizations, Iowa City, IA 52242.
Schmidt, F. L., Le, H., & Oh, I.-S. (2009). Correcting for the distorting effects of study artifacts in meta-analysis. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis (2nd ed., pp. 317–333). New York, NY: Russell Sage Foundation.
Schmidt, F. L., Ocasio, B. P., Hillery, J. M., & Hunter, J. E. (1985). Further within-setting
empirical tests of the situational specificity hypothesis in personnel selection. Personnel
Psychology, 38, 509–524.
Schmidt, F. L., & Oh, I-S. (2010). Second order meta-analysis. Paper under review.
Schmidt, F. L., Oh, I.-S., & Hayes, T. L. (2009). Fixed versus random models in meta-analysis: Model properties and comparison of differences in results. British Journal of Mathematical and Statistical Psychology, 62, 97–128.
Schmidt, F. L., Oh, I.-S., & Le, H. (2006). Increasing the accuracy of corrections for range restriction: Implications for selection procedure validities and other research results. Personnel Psychology, 59, 281–305.
Schmidt, F. L., Shaffer, J. A., & Oh, I.-S. (2008). Increased accuracy for range restriction corrections: Implications for the role of personality and general mental ability in job and training performance. Personnel Psychology, 61, 827–868.
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the
power of the studies? Psychological Bulletin, 105, 309–316.
Snedecor, G. W. (1946). Statistical methods (4th ed.). Ames, IA: Iowa State College Press.
Stanley, T. D. (1998). New wine in old bottles: A meta-analysis of Ricardian equivalence.
Southern Economic Journal, 64, 713–727.
Stanley, T. D. (2001). Wheat from chaff: Meta-analysis as quantitative literature review. Journal
of Economic Perspectives, 15, 131–150.
Stanley, T. D., & Jarrell, S. B. (1989). Meta-regression analysis: A quantitative method of literature surveys. Journal of Economic Surveys, 3, 161–169.
Taras, V., Kirkman, B. L., & Steel, P. (2010). Examining the impact of Culture’s Consequences: A three decade, multilevel, meta-analytic review of Hofstede’s cultural value dimensions. Journal of Applied Psychology, 95, 405–439.
Terborg, J. R., & Lee, T. W. (1982). Extension of the Schmidt-Hunter validity generalization procedure to the prediction of absenteeism behavior from knowledge of job satisfaction and organizational commitment. Journal of Applied Psychology, 67, 280–296.
Tett, R. P., Meyer, J. P., & Roese, N. J. (1994). Applications of meta-analysis: 1987–1992. In International review of industrial and organizational psychology (Vol. 9, pp. 71–112). London, UK: Wiley.
Thompson, B. (2002). What future quantitative social science research could look like:
Confidence intervals for effect sizes. Educational Researcher, 31(3), 25–32.
Thompson, B. (2007). Effect sizes and confidence intervals for effect sizes. Psychology in the
Schools, 44, 423–432.
Whitman, D. S., Van Rooy, D. L., & Viswesvaran, C. (2010). Satisfaction, citizenship behaviors,
and performance in work units: A meta-analysis of collective construct relations.
Personnel Psychology, 63, 41–81.
Wilkinson, L., & the Task Force on Statistical Inference (1999). Statistical methods in psychology
journals: Guidelines and explanations. American Psychologist, 54, 594–604.
Wortman, P. M., & Bryant, F. B. (1985). School desegration and black achievement: An
integrative review. Sociological Methods and Research, 13, 289–324.
Yusuf, S., Simon, R., & Ellenberg, S. S. (1986). Preface to proceedings of the workshop on
methodological issues in overviews of randomized clinical trials, May 1986. Statistics in
Medicine, 6, 217–218.
Zimmerman, R. D. (2008). Understanding the impact of personality traits on individuals’ turnover
decisions: A meta-analytic path model. Personnel Psychology, 61, 309–348.