Tay, L., & Jebb, A. (2017). Scale development. In S. Rogelberg (Ed.), The SAGE encyclopedia
of industrial and organizational psychology (2nd ed.). Thousand Oaks, CA: Sage.
Scale Development
Scale creation is a process of developing a reliable and valid measure of a construct in
order to assess an attribute of interest. Industrial-Organizational research involves the
measurement of organizational and psychological constructs, which present unique challenges
because they are generally unobservable (e.g., work attitudes, perceptions, personality traits). As
opposed to observable characteristics (e.g., height, precipitation, velocity), unobservable
constructs cannot be measured directly and must be assessed through indirect means, such as
self-report. Relatedly, these constructs are often very abstract (e.g., core self-evaluations),
making it difficult to determine which items adequately represent them, and which ones do so
reliably. Finally, these constructs are often complex and may be composed of several different
components rather than being a single, solitary concept. As a result of these complexities,
developing a measurement instrument can be a challenging task, and validation is especially
important to the process of scale construction. This article focuses on the principles and best
practices of scale creation with regard to self-report scales.
Approaches to scale creation. There are two distinct approaches to scale creation. A
deductive approach uses theory and an already-formed conceptualization of the construct to
generate items within its domain. This approach is useful when the definition of the
construct is known and substantial enough to generate an initial pool of items. By contrast, an
inductive approach is useful when there is uncertainty in the definition or dimensionality of the
construct. In this case, organizational incumbents are asked to provide descriptions of the
concept, and a conceptualization is then derived that forms the basis for generating items.
Construct Definition. Regardless of the approach, creating a scale requires a clear
conceptualization of the construct. This entails delineating and defining
the construct (i.e., stating what it is, and thus what it is not) either through a thorough literature
review or through an inductive uncovering of the phenomenon. It is also important to define the
level of conceptual breadth of the target construct. A construct that is very broad (e.g., attitudes
about working in general) requires different types of items than one that is more specific (e.g.,
attitudes about performing administrative duties). Another important theoretical step is to specify
the likely number of components, or dimensions, that make up the construct. The dimensionality
of a construct can be understood as whether the construct is best conceived as being made up of a
single variable (unidimensional) or the combination of a number of distinct subcomponents
(multidimensional). For instance, job satisfaction has been conceptualized, and thus measured, as
both a unidimensional and a multidimensional construct. Hackman and Oldham's three-item scale is
a single measure of global job satisfaction, whereas Smith et al.'s Job Descriptive Index is
composed of five subscales: satisfaction with pay, the work itself, promotions, coworkers, and
supervisors. Although this example shows that the same construct can be validly conceived as
both unidimensional and multidimensional, properly specifying the dimensions of a construct is
essential, as a distinct scale must be constructed for each one.
A key idea in construct definition is to outline the nomological network: i.e., how the
focal construct (and its specific dimensions) is related to other constructs. Once the construct is
defined, one can begin to specify this nomological network, which entails stating what the
construct should be positively related to, negatively related to, and relatively independent of
based on theory. The nomological network will be essential to the validation process, as a scale
that empirically relates to other established measures in the way predicted by theory displays
important types of validity evidence (convergent and divergent validity).
Purposes of created scale. Before discussing the specific principles of item writing, it is
necessary to specify the purpose of the scale. Will the scale be used for research, selection,
development, or another purpose? Is the scale intended for the general population, the population
of adult workers, or another specific population? Outlining the scale’s purpose and use in future
contexts will allow one to identify the unique practical concerns related to the scale. This guides
item creation in a number of ways, such as (1) determining an appropriate reading level for the
target population; (2) identifying whether the items should refer to general or specific contexts
and situations (e.g., work contexts); (3) considering differences in how respondents interpret the
items (e.g., the different meanings of the term "stress" in different national contexts); (4)
deciding the type of scale response format and behavioral anchors, which can potentially affect
scale responses; and (5) determining the applicability of reverse scoring, which may not be appropriate
for positive constructs such as virtues.
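As an illustration of point (5), reverse-keyed items are typically recoded before any scoring or psychometric analysis. The short sketch below, in Python, shows the standard recoding arithmetic; the item names and the 1-5 response format are assumptions for illustration:

```python
import pandas as pd

# Hypothetical responses on a 1-5 Likert scale; q2 and q4 are assumed to be
# negatively worded items that must be recoded before computing a total score.
df = pd.DataFrame({
    "q1": [4, 5, 3], "q2": [2, 1, 3],
    "q3": [5, 4, 4], "q4": [1, 2, 2],
})

SCALE_MIN, SCALE_MAX = 1, 5
for item in ["q2", "q4"]:
    # Reverse score: 1 becomes 5, 2 becomes 4, and so on.
    df[item] = (SCALE_MAX + SCALE_MIN) - df[item]

df["total"] = df.sum(axis=1)  # simple sum score after recoding
print(df)
```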
Principles of item writing. When writing items, one aims to create an initial item pool
that contains many more items than the final scale (e.g., three to four times as many). This gives
the researcher more freedom to retain only those items that meet a high psychometric standard.
Redundancy and over-inclusivity in the initial item pool are also desirable because they can serve
to uncover sub-dimensions or closely related but distinct
constructs. As for the actual writing of items, recommendations from a wide range of sources
agree on the following principles: items should be simple and straightforward; one should avoid
slang, jargon, double negatives, ambiguous words, and overly abstract words, favoring instead
specific and concrete words; no double-barreled items (i.e., two different ideas included in a
single question); no leading questions or statements (e.g., "Most supervisors are toxic. Please
respond to how aggressive your supervisor has been to you"); and items should not be identical
re-statements but should seek to state the same idea in different ways. Finally, it is often helpful
to provide the construct definition, relevant adjectives, and example scale items to item writers
when generating items.
Scale validation research design. As noted in the introduction, validation is supremely
important in the development of a self-report scale; in the measurement of unobservable
variables, one cannot simply assume that a scale measures what it is intended to measure. Such
assumptions can lead to false scientific conclusions. Cronbach and Meehl (1955) suggested many
ways in which scale validation can be conducted. Primary approaches include comparing group
differences, assessing correlations with other measures, or examining the change in scale scores
over repeated occasions. As mentioned earlier, the specification of the nomological network will
help a researcher determine the types of designs and measures to include. Group differences are
appropriate when there is an expectation that measures should discriminate between groups (e.g.,
experts vs. non-experts). On the other hand, establishing correlations with related
constructs/criteria is important for assessing convergent-divergent validity and predictive
validity. Further, changes over time can help determine the reliability and stability of the
operationalized construct. Where possible, the use of multitrait-multimethod designs is more
informative than a single-method or single-trait approach to scale validation.
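To make the multitrait-multimethod logic concrete, the sketch below simulates two hypothetical traits, each measured by two methods, and inspects the resulting correlation matrix; all names and data here are illustrative assumptions, not part of the original entry. Convergent validity is indicated by high same-trait, different-method correlations, and discriminant validity by low different-trait correlations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300  # respondents

# Two hypothetical latent traits (e.g., job satisfaction, conscientiousness).
t1, t2 = rng.normal(size=n), rng.normal(size=n)

# Each trait measured by two methods (e.g., self-report and peer rating),
# with method-specific measurement error added to each score.
mtmm = pd.DataFrame({
    "t1_self": t1 + rng.normal(scale=0.5, size=n),
    "t1_peer": t1 + rng.normal(scale=0.7, size=n),
    "t2_self": t2 + rng.normal(scale=0.5, size=n),
    "t2_peer": t2 + rng.normal(scale=0.7, size=n),
})

# Same-trait, different-method correlations (e.g., t1_self with t1_peer)
# should be high; different-trait correlations should be low.
print(mtmm.corr().round(2))
```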
Regarding sampling, a preliminary sample of 100-200 respondents has been recommended
for examining the psychometric properties of items, with a minimum of 300 for a later
confirmatory sample. However, this may depend on group differences and the type of analysis one
seeks to conduct. Based on its theoretical and practical context, one should also seek to match the
validation sample to the scale's intended application. For instance, if a scale is meant for use
with entrepreneurs, it will be important to obtain a validation sample from that same
subpopulation of interest. Notably, using a broader sample than the target subpopulation can
artificially raise the reliability of the scale. A recommended best practice is to cross-validate the scale across
independent samples to show that scale properties are stable and generalizable.
Scale psychometric properties. After data collection, one needs to establish the reliability
and validity of the scale items. As a first step, it is critical to identify a good set of items with
reasonable psychometric properties. This is usually done by examining the mean, standard
deviation, score range, endorsement proportions across all the options, and the item-total
correlation for each item. One should select items that have reasonable item-total correlations
(around .20 or higher), appropriate score ranges (i.e., no ceiling or floor effects), and utilization
of the different response options.
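A minimal sketch of these item-level checks, assuming the responses sit in a pandas DataFrame with one column per item; the .20 cutoff follows the rule of thumb above, and the corrected item-total correlation excludes the focal item from the total so an item is not correlated with itself:

```python
import pandas as pd

def item_statistics(df: pd.DataFrame) -> pd.DataFrame:
    """Basic screening statistics for a respondents-by-items frame."""
    total = df.sum(axis=1)
    stats = pd.DataFrame({
        "mean": df.mean(),
        "sd": df.std(),
        "min": df.min(),   # inspect for floor effects
        "max": df.max(),   # inspect for ceiling effects
        # Corrected item-total correlation: correlate each item with the
        # total score computed from the remaining items.
        "item_total_r": {c: df[c].corr(total - df[c]) for c in df.columns},
    })
    stats["flag"] = stats["item_total_r"] < 0.20  # candidates for deletion
    return stats

# Endorsement proportions for a single item can be inspected with, e.g.:
# df["q1"].value_counts(normalize=True)
```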
Based on the selected items, there are different approaches for calculating reliability, but
calculating internal consistency is the most common. In general, the rule of thumb for internal
consistency reliability is a minimum of .70, although .90 or higher is recommended for high-
stakes decisions (e.g., selection). One should also calculate the reliability of each sub-dimension of
the construct.
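The most common internal consistency estimate is coefficient alpha, computed as α = k/(k - 1) × (1 - Σσ_i² / σ_total²), where k is the number of items, σ_i² is each item's variance, and σ_total² is the variance of the sum score. A minimal numpy implementation of this formula (a sketch, not a specific package's API):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for a respondents-by-items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of the sum score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Rules of thumb from the text: alpha of at least .70 in general,
# and .90 or higher for high-stakes uses such as selection.
```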
It is important to distinguish reliability from dimensionality, as high reliability does not
necessarily indicate unidimensionality. The number of dimensions should have been specified by
theory and be confirmed by exploratory factor analysis (EFA). The number of latent
factors/dimensions should equal the number of scales being developed. One may also seek to
replicate the factor structures across different subpopulations to ensure the generalizability of the
factor structure.
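As one illustration, an EFA of this kind can be run with the third-party Python package factor_analyzer; the package choice, the oblique rotation, and the function name below are assumptions for the sketch, not prescriptions from this entry:

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer  # third-party package (assumed installed)

def run_efa(responses: pd.DataFrame, n_factors: int) -> pd.DataFrame:
    """Extract the theoretically specified number of factors; return loadings."""
    # An oblique (oblimin) rotation allows dimensions to correlate,
    # which is common for psychological constructs.
    fa = FactorAnalyzer(n_factors=n_factors, rotation="oblimin")
    fa.fit(responses)
    return pd.DataFrame(
        fa.loadings_,
        index=responses.columns,
        columns=[f"F{i + 1}" for i in range(n_factors)],
    )
```

The resulting loading matrix can then be screened against the benchmarks discussed next.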
EFA loadings of items on their specified dimensions should be moderate (around .4) to high
(closer to 1.0), and one may choose to delete items that load inappropriately on other dimensions
or have low loadings. After the theoretically based dimensionality is borne out in EFA, confirmatory
factor analysis (CFA) should be conducted with a new sample, and the model should be evaluated
using a number of fit indices. Although there are many fit indices that can be used, some of the
most popular and useful are the comparative fit index (CFI), Tucker-Lewis index (TLI), root
mean square error of approximation (RMSEA), and standardized root mean square residual
(SRMR). Conventional standards hold that adequate fit for these indices requires
CFI ≥ .90, TLI ≥ .90, RMSEA ≤ .08, and SRMR ≤ .08.
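A CFA evaluated against those cutoffs might look like the following sketch using the third-party semopy package; the package, its lavaan-style model syntax, and the item and factor names are all assumptions for illustration:

```python
import pandas as pd
import semopy  # third-party SEM package (assumed installed)

# Hypothetical two-dimensional measurement model in lavaan-style syntax:
# each latent factor is indicated by its own set of items.
MODEL = """
PaySat  =~ pay1 + pay2 + pay3
WorkSat =~ work1 + work2 + work3
"""

def run_cfa(data: pd.DataFrame) -> pd.DataFrame:
    model = semopy.Model(MODEL)
    model.fit(data)  # data: DataFrame containing the item columns
    stats = semopy.calc_stats(model)  # fit statistics, incl. CFI, TLI, RMSEA
    # Compare against the conventional cutoffs above:
    # CFI >= .90, TLI >= .90, RMSEA <= .08.
    return stats[["CFI", "TLI", "RMSEA"]]
```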
After establishing reliability and factorial validity, a researcher would continue gathering
validity evidence according to the scale validation design. This may include examining group
differences on scale scores, or divergent and convergent validity based on correlations with other
related measures. This involves examining how the new construct empirically relates to other
constructs in its nomological network, and this overall process is a test of both the scale and the
underlying theory driving the test.
Scale revision. It is common to conduct several rounds of scale revision to improve on
the initial items. There are several reasons for this, including poor reliability, divergence between
theoretical and empirical structure, and inadequate construct representation. Revising a scale
requires analyzing items with poor item-total correlations or low loadings to discern possible
sources of poor item functioning. Where needed, one would also revise items or write new ones to
tap specific dimensions that were not adequately measured.
Recommended Reading
Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale
development. Psychological Assessment, 7, 309-319.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications.
Journal of Applied Psychology, 78, 98-104.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological
Bulletin, 52, 281-302.
DeVellis, R. F. (2012). Scale development: Theory and applications. Newbury Park, CA: Sage.
Drasgow, F., Nye, C. D., & Tay, L. (2010). Indicators of quality assessment. In J. C. Scott & D.
H. Reynolds (Eds.), Handbook of workplace assessment: Evidence-based practices for
selecting and developing organizational talent (pp. 27-60). San Francisco, CA: John
Wiley & Sons.
Hinkin, T. R. (1998). A brief tutorial on the development of measures for use in survey
questionnaires. Organizational Research Methods, 1, 104-121.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons'
responses and performances as scientific inquiry into score meaning. American
Psychologist, 50, 741-749.
Peterson, C., & Park, N. (2004). Classification and measurement of character strengths:
Implications for practice. In P. A. Linley & S. Joseph (Eds.), Positive psychology in
practice (pp. 433-446). Hoboken, NJ: Wiley and Sons Inc.
Reise, S. P., Waller, N. G., & Comrey, A. L. (2000). Factor analysis and scale revision.
Psychological Assessment, 12, 287-297.
Schwarz, N. (1999). Self-reports: How the questions shape the answer. American Psychologist,
54, 93-105.
Schwarz, N., Knauper, B., Hippler, H.-J., Noelle-Neumann, E., & Clark, L. (1991). Rating
scales: Numeric values may change the meaning of scale labels. Public Opinion
Quarterly, 55, 570-582.
Smith, G. T., & McCarthy, D. M. (1995). Methodological considerations in the refinement of
clinical assessment instruments. Psychological Assessment, 7, 300-308.