Tay, L., & Jebb, A. (2017). Scale development. In S. Rogelberg (Ed.), The SAGE encyclopedia
of industrial and organizational psychology (2nd ed.). Thousand Oaks, CA: Sage.
Scale Development
Scale creation is a process of developing a reliable and valid measure of a construct in
order to assess an attribute of interest. Industrial-Organizational research involves the
measurement of organizational and psychological constructs, which present unique challenges
because they are generally unobservable (e.g., work attitudes, perceptions, personality traits). As
opposed to observable characteristics (e.g., height, precipitation, velocity), unobservable
constructs cannot be measured directly and must be assessed through indirect means, such as
self-report. Relatedly, these constructs are often very abstract (e.g., core self-evaluations),
making it difficult to determine which items adequately represent them, and which ones do so
reliably. Finally, these constructs are often complex and may be composed of several different
components rather than being a single, solitary concept. As a result of these complexities,
developing a measurement instrument can be a challenging task, and validation is especially
important to the process of scale construction. This article focuses on the principles and best
practices of scale creation with regard to self-report scales.
Approaches to scale creation. There are two distinct approaches to scale creation. A
deductive approach uses theory and an already-formed conceptualization of the construct to
generate items within its domain. This approach is useful when the definition of the
construct is known and substantial enough to generate an initial pool of items. By contrast, an
inductive approach is useful when there is uncertainty in the definition or dimensionality of the
construct. In this case, organizational incumbents are asked to provide descriptions of the
concept, and a conceptualization is then derived that forms the basis for generating items.
Construct Definition. Regardless of the approach, creating a scale requires a clear
conceptualization of the construct. This entails delineating and defining
the construct (i.e., stating what it is, and thus what it is not) either through a thorough literature
review or through an inductive uncovering of the phenomenon. It is also important to define the
level of conceptual breadth of the target construct. A construct that is very broad (e.g., attitudes
about working in general) requires different types of items than one that is more specific (e.g.,
attitudes about performing administrative duties). Another important theoretical step is to specify
the likely number of components, or dimensions, that make up the construct. The dimensionality
of a construct can be understood as whether the construct is best conceived as being made up of a
single variable (unidimensional) or the combination of a number of distinct subcomponents
(multidimensional). For instance, job satisfaction has been conceptualized, and thus measured, as
both a unidimensional and a multidimensional construct. Hackman and Oldham's three-item scale is
a single measure of global job satisfaction, whereas Smith et al.'s Job Descriptive Index is
composed of five subscales: satisfaction with pay, the work itself, promotions, coworkers, and
supervisors. Although this example shows that the same construct can be validly conceived as
both unidimensional and multidimensional, properly specifying the dimensions of a construct is
essential, as a distinct scale must be constructed for each one.
A key idea in construct definition is to outline the nomological network: i.e., how the
focal construct (and its specific dimensions) is related to other constructs. Once the construct is
defined, one can begin to specify this nomological network, which entails stating what the
construct should be positively related to, negatively related to, and relatively independent of
based on theory. The nomological network will be essential to the validation process, as a scale
that empirically relates to other established measures in the way predicted by theory displays
important types of validity evidence (convergent and divergent validity).
Purposes of created scale. Before discussing the specific principles of item writing, it is
necessary to specify the purpose of the scale. Will the scale be used for research, selection,
development, or another purpose? Is the scale intended for the general population, the population
of adult workers, or another specific population? Outlining the scale’s purpose and use in future
contexts will allow one to identify the unique practical concerns related to the scale. This guides
item creation in a number of ways, such as (1) determining an appropriate reading level for the
target population; (2) identifying whether the items should refer to general or specific contexts
and situations (e.g., work contexts); (3) considering differences in how respondents interpret the
items (e.g., the different meanings of the term "stress" in different national contexts); (4)
deciding the type of scale response format and behavioral anchors, which can potentially affect
scale responses; and (5) determining the applicability of reverse scoring, which may not be appropriate
for positive constructs such as virtues.
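As an illustration of point (5), reverse-keyed items are typically recoded before any scoring or psychometric analysis. The short sketch below, in Python, shows the standard recoding arithmetic; the item names and the 1-5 response format are assumptions for illustration:

```python
import pandas as pd

# Hypothetical responses on a 1-5 Likert scale; q2 and q4 are assumed to be
# negatively worded items that must be recoded before computing a total score.
df = pd.DataFrame({
    "q1": [4, 5, 3], "q2": [2, 1, 3],
    "q3": [5, 4, 4], "q4": [1, 2, 2],
})

SCALE_MIN, SCALE_MAX = 1, 5
for item in ["q2", "q4"]:
    # Reverse score: 1 becomes 5, 2 becomes 4, and so on.
    df[item] = (SCALE_MAX + SCALE_MIN) - df[item]

df["total"] = df.sum(axis=1)  # simple sum score after recoding
print(df)
```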
Principles of item writing. When writing items, one aims to create an initial item pool
that contains many more items than the final scale (e.g., three to four times as many). This gives
the researcher more freedom to retain only those items that meet a high psychometric standard.
Redundancy and over-inclusivity in the initial item pool are also desirable because they can serve
to uncover sub-dimensions or closely related but distinct
constructs. As for the actual writing of items, recommendations from a wide range of sources
agree on the following principles: items should be simple and straightforward; one should avoid
slang, jargon, double negatives, ambiguous words, and overly abstract words, favoring instead
specific and concrete words; no double-barreled items (i.e., two different ideas included in a
single question); no leading questions or statements (e.g., "Most supervisors are toxic. Please
respond to how aggressive your supervisor has been to you"); and items should not be identical
re-statements but should seek to state the same idea in different ways. Finally, it is often helpful
to provide the construct definition, relevant adjectives, and example scale items to item writers
when generating items.
Scale validation research design. As noted in the introduction, validation is supremely
important in the development of a self-report scale; in the measurement of unobservable
variables, one cannot simply assume that a scale measures what it is intended to measure. Such
assumptions can lead to false scientific conclusions. Cronbach and Meehl (1955) suggested many
ways in which scale validation can be conducted. Primary approaches include comparing group
differences, assessing correlations with other measures, or examining the change in scale scores
over repeated occasions. As mentioned earlier, the specification of the nomological network will
help a researcher determine the types of designs and measures to include. Group differences are
appropriate when there is an expectation that measures should discriminate between groups (e.g.,
experts vs. non-experts). On the other hand, establishing correlations with related
constructs/criteria is important for assessing convergent-divergent validity and predictive
validity. Further, changes over time can help determine the reliability and stability of the
operationalized construct. Where possible, the use of multitrait-multimethod designs is more
informative than a single-method or single-trait approach to scale validation.
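To make the multitrait-multimethod logic concrete, the sketch below simulates two hypothetical traits, each measured by two methods, and inspects the resulting correlation matrix; all names and data here are illustrative assumptions, not part of the original entry. Convergent validity is indicated by high same-trait, different-method correlations, and discriminant validity by low different-trait correlations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300  # respondents

# Two hypothetical latent traits (e.g., job satisfaction, conscientiousness).
t1, t2 = rng.normal(size=n), rng.normal(size=n)

# Each trait measured by two methods (e.g., self-report and peer rating),
# with method-specific measurement error added to each score.
mtmm = pd.DataFrame({
    "t1_self": t1 + rng.normal(scale=0.5, size=n),
    "t1_peer": t1 + rng.normal(scale=0.7, size=n),
    "t2_self": t2 + rng.normal(scale=0.5, size=n),
    "t2_peer": t2 + rng.normal(scale=0.7, size=n),
})

# Same-trait, different-method correlations (e.g., t1_self with t1_peer)
# should be high; different-trait correlations should be low.
print(mtmm.corr().round(2))
```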
Regarding sampling, a preliminary sample of 100-200 respondents has been recommended
for examining the psychometric properties of items, with a minimum of 300 for a later
confirmatory sample. However, this may depend on group differences and the type of analysis one
seeks to conduct. Based on its theoretical and practical context, one should also seek to match the
validation sample to the scale's intended application. For instance, if a scale is meant for use
with entrepreneurs, it will be important to obtain a validation sample from that same
subpopulation of interest. Notably, using a broader sample than the target subpopulation can
artificially raise the reliability of the scale. A recommended best practice is to cross-validate the scale across
independent samples to show that scale properties are stable and generalizable.
Scale psychometric properties. After data collection, one needs to establish the reliability
and validity of the scale items. As a first step, it is critical to identify a good set of items with
reasonable psychometric properties. This is usually done by examining the mean, standard
deviation, score range, endorsement proportions across all the options, and the item-total
correlation for each item. One should select items that have reasonable item-total correlations
(around .20 or higher), appropriate score ranges (i.e., no ceiling or floor effects), and utilization
of the different response options.
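A minimal sketch of these item-level checks, assuming the responses sit in a pandas DataFrame with one column per item; the .20 cutoff follows the rule of thumb above, and the corrected item-total correlation excludes the focal item from the total so an item is not correlated with itself:

```python
import pandas as pd

def item_statistics(df: pd.DataFrame) -> pd.DataFrame:
    """Basic screening statistics for a respondents-by-items frame."""
    total = df.sum(axis=1)
    stats = pd.DataFrame({
        "mean": df.mean(),
        "sd": df.std(),
        "min": df.min(),   # inspect for floor effects
        "max": df.max(),   # inspect for ceiling effects
        # Corrected item-total correlation: correlate each item with the
        # total score computed from the remaining items.
        "item_total_r": {c: df[c].corr(total - df[c]) for c in df.columns},
    })
    stats["flag"] = stats["item_total_r"] < 0.20  # candidates for deletion
    return stats

# Endorsement proportions for a single item can be inspected with, e.g.:
# df["q1"].value_counts(normalize=True)
```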
Based on the selected items, there are different approaches for calculating reliability, but
calculating internal consistency is the most common. In general, the rule of thumb for internal
consistency reliability is a minimum of .70, although .90 or higher is recommended for high-
stakes decisions (e.g., selection). One should also calculate the reliability of each sub-dimension of
the construct.
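The most common internal consistency estimate is coefficient alpha, computed as α = k/(k - 1) × (1 - Σσ_i² / σ_total²), where k is the number of items, σ_i² is each item's variance, and σ_total² is the variance of the sum score. A minimal numpy implementation of this formula (a sketch, not a specific package's API):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for a respondents-by-items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of the sum score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Rules of thumb from the text: alpha of at least .70 in general,
# and .90 or higher for high-stakes uses such as selection.
```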
It is important to distinguish reliability from dimensionality, as high reliability does not
necessarily indicate unidimensionality. The number of dimensions should have been specified by
theory and be confirmed by exploratory factor analysis (EFA). The number of latent
factors/dimensions should equal the number of scales being developed. One may also seek to
replicate the factor structures across different subpopulations to ensure the generalizability of the
factor structure.
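As one illustration, an EFA of this kind can be run with the third-party Python package factor_analyzer; the package choice, the oblique rotation, and the function name below are assumptions for the sketch, not prescriptions from this entry:

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer  # third-party package (assumed installed)

def run_efa(responses: pd.DataFrame, n_factors: int) -> pd.DataFrame:
    """Extract the theoretically specified number of factors; return loadings."""
    # An oblique (oblimin) rotation allows dimensions to correlate,
    # which is common for psychological constructs.
    fa = FactorAnalyzer(n_factors=n_factors, rotation="oblimin")
    fa.fit(responses)
    return pd.DataFrame(
        fa.loadings_,
        index=responses.columns,
        columns=[f"F{i + 1}" for i in range(n_factors)],
    )
```

The resulting loading matrix can then be screened against the benchmarks discussed next.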
EFA loadings of items on their specified dimensions should be moderate (around .4) to high
(closer to 1.0), and one may choose to delete items that load inappropriately on other dimensions
or have low loadings. After the theoretically based dimensionality is borne out in EFA, confirmatory
factor analysis (CFA) should be conducted with a new sample, and the model should be evaluated
using a number of fit indices. Although there are many fit indices that can be used, some of the
most popular and useful are the comparative fit index (CFI), Tucker-Lewis index (TLI), root
mean square error of approximation (RMSEA), and standardized root mean square residual
(SRMR). Conventional standards hold that adequate fit for these indices requires
CFI ≥ .90, TLI ≥ .90, RMSEA ≤ .08, and SRMR ≤ .08.
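A CFA evaluated against those cutoffs might look like the following sketch using the third-party semopy package; the package, its lavaan-style model syntax, and the item and factor names are all assumptions for illustration:

```python
import pandas as pd
import semopy  # third-party SEM package (assumed installed)

# Hypothetical two-dimensional measurement model in lavaan-style syntax:
# each latent factor is indicated by its own set of items.
MODEL = """
PaySat  =~ pay1 + pay2 + pay3
WorkSat =~ work1 + work2 + work3
"""

def run_cfa(data: pd.DataFrame) -> pd.DataFrame:
    model = semopy.Model(MODEL)
    model.fit(data)  # data: DataFrame containing the item columns
    stats = semopy.calc_stats(model)  # fit statistics, incl. CFI, TLI, RMSEA
    # Compare against the conventional cutoffs above:
    # CFI >= .90, TLI >= .90, RMSEA <= .08.
    return stats[["CFI", "TLI", "RMSEA"]]
```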
After establishing reliability and factorial validity, a researcher would continue gathering
validity evidence according to the scale validation design. This may include examining group
differences on scale scores, or divergent and convergent validity based on correlations with other
related measures. This involves examining how the new construct empirically relates to other
constructs in its nomological network, and this overall process is a test of both the scale and the
underlying theory driving the test.
Scale revision. It is common to conduct several rounds of scale revision to improve on
the initial items. There are several reasons for this, including poor reliability, divergence between
theoretical and empirical structure, and inadequate construct representation. Revising a scale
requires analyzing items with poor item-total correlations or low loadings to discern possible
sources of poor item functioning. Where needed, one would also revise items or write new ones to
tap specific dimensions that were not adequately measured.
Recommended Reading
Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale
development. Psychological Assessment, 7, 309-319.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications.
Journal of Applied Psychology, 78, 98-104.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological
Bulletin, 52, 281-302.
DeVellis, R. F. (2012). Scale development: Theory and applications. Newbury Park, CA: Sage.
Drasgow, F., Nye, C. D., & Tay, L. (2010). Indicators of quality assessment. In J. C. Scott & D.
H. Reynolds (Eds.), Handbook of workplace assessment: Evidence-based practices for
selecting and developing organizational talent (pp. 27-60). San Francisco, CA: John
Wiley & Sons.
Hinkin, T. R. (1998). A brief tutorial on the development of measures for use in survey
questionnaires. Organizational Research Methods, 1, 104-121.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons'
responses and performances as scientific inquiry into score meaning. American
Psychologist, 50, 741-749.
Peterson, C., & Park, N. (2004). Classification and measurement of character strengths:
Implications for practice. In P. A. Linley & S. Joseph (Eds.), Positive psychology in
practice (pp. 433-446). Hoboken, NJ: Wiley and Sons Inc.
Reise, S. P., Waller, N. G., & Comrey, A. L. (2000). Factor analysis and scale revision.
Psychological Assessment, 12, 287-297.
Schwarz, N. (1999). Self-reports: How the questions shape the answer. American Psychologist,
54, 93-105.
Schwarz, N., Knauper, B., Hippler, H.-J., Noelle-Neumann, E., & Clark, L. (1991). Rating
scales: Numeric values may change the meaning of scale labels. Public Opinion
Quarterly, 55, 570-582.
Smith, G. T., & McCarthy, D. M. (1995). Methodological considerations in the refinement of
clinical assessment instruments. Psychological Assessment, 7, 300-308.