Sample Size Calculation and Optimal Design for Regression-Based Norming of Tests and Questionnaires
Francesco Innocenti¹, Frans E. S. Tan¹, Math J. J. M. Candel¹, and Gerard J. P. van Breukelen¹,²
¹ Department of Methodology and Statistics, Care and Public Health Research Institute (CAPHRI), Maastricht University
² Department of Methodology and Statistics, Graduate School of Psychology and Neuroscience, Maastricht University
Abstract
To prevent mistakes in psychological assessment, the precision of test norms is important. This can be achieved by drawing a large normative sample and using regression-based norming. Based on that norming method, a procedure for sample size planning to make inference on Z-scores and percentile rank scores is proposed. Sampling variance formulas for these norm statistics are derived and used to obtain the optimal design, that is, the optimal predictor distribution, for the normative sample, thereby maximizing precision of estimation. This is done under five regression models with a quantitative and a categorical predictor, differing in whether they allow for interaction and nonlinearity. Efficient robust designs are given in case of uncertainty about the regression model. Furthermore, formulas are provided to compute the normative sample size such that individuals' positions relative to the derived norms can be assessed with prespecified power and precision.
Translational Abstract
Normative studies are needed to derive reference values (or norms) for tests and questionnaires, so that psychologists can use them to assess individuals. Specifically, norms allow psychologists to interpret individuals' score on a test by comparing it with the scores of their peers (e.g., individuals with the same sex, age, and educational level) in the reference population. Because norms are also used to make decisions on individuals, such as the assignment to clinical treatment or remedial teaching, it is important that norms are precise (i.e., not strongly affected by sampling error in the sample on which the norms are based). This article shows how this goal can be attained in three steps. First, norms are derived using the regression-based approach, which is more efficient than the traditional approach of splitting the sample into subgroups based on demographic factors and deriving norms per subgroup. Specifically, the regression-based approach allows researchers to identify the predictors (e.g., demographic factors) that affect the test score of interest, and to use the whole sample to derive norms. Second, the design of the normative study (e.g., which age groups to include) is chosen such that the precision of the norms is maximized for a given total sample size for norming. Third, this total sample size is computed such that a prespecified power and precision are obtained.
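As a rough illustration only (not part of the article), the regression-based norming idea behind these three steps can be sketched in Python under a toy linear model with a single quantitative predictor; the article's models additionally include a categorical predictor and possible interaction and nonlinearity, and all data below are simulated:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

# Toy normative sample: the test score declines linearly with age (assumption)
n = 500
age = rng.uniform(20, 80, n)
score = 30.0 - 0.15 * age + rng.normal(0.0, 3.0, n)

# Step 1: regress the test score on the predictor(s), using the whole sample
X = np.column_stack([np.ones(n), age])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
resid = score - X @ beta
sigma = sqrt(resid @ resid / (n - X.shape[1]))  # residual standard deviation

# Step 2: norm statistics for a new test taker
def z_score(observed, age_i):
    """Observed score minus regression-predicted score, in residual SD units."""
    expected = beta[0] + beta[1] * age_i
    return (observed - expected) / sigma

def percentile_rank(observed, age_i):
    """Percentile rank, assuming normally distributed residuals."""
    return 100.0 * 0.5 * (1.0 + erf(z_score(observed, age_i) / sqrt(2.0)))
```

A score exactly at the regression prediction yields Z = 0 and a percentile rank of 50; a score one residual SD above it yields a percentile rank of about 84.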
Keywords: normative data, optimal design, percentile rank score, sample size calculation, Z-score
Supplemental materials: https://doi.org/10.1037/met0000394.supp and https://github.com/FrancescoInnocenti-Stat/SampSize-OD-Norming
Normative studies provide reference values, also known as norms, that psychologists can use to compare individuals with the reference population, for instance, to make decisions about clinical treatments, school admission or remedial teaching, or selection of candidates for job vacancies. Examples of normative studies are Goretti et al. (2014) and Parmenter et al. (2010), who have derived reference values for two batteries of neuropsychological tests to assess cognitive function in patients with multiple sclerosis, and Van der Elst et al. (2006), who have normed the Dutch version of three verbal fluency tests. Normative studies are of practical importance because they allow psychologists to interpret scores on the outcome variable of interest by comparing an individual's test score with the scores of his or her peers (e.g., individuals of the same age, sex, and educational level) in the reference population. For instance, knowing that a highly educated 75-year-old woman scored 11.5 on the profession naming verbal fluency test is in itself not informative on whether this score is within the normal range or exceptional. According to the normative data provided by Van der Elst et al. (2006, Table A.2), only 10% of her peers (i.e., women of the same age and educational level) have a test score equal
Francesco Innocenti https://orcid.org/0000-0001-6113-8992
Math J. J. M. Candel https://orcid.org/0000-0002-2229-1131
Gerard J. P. van Breukelen https://orcid.org/0000-0003-0949-0272
We have no conflict of interest to disclose. A summary of this study was
presented at the 41st Annual Conference of the International Society for
Clinical Biostatistics.
Correspondence concerning this article should be addressed to Francesco
Innocenti, Department of Methodology and Statistics, Care and Public
Health Research Institute (CAPHRI), Maastricht University, P.O. Box 616,
6200 MD, Maastricht, the Netherlands. Email: francesco.innocenti@
maastrichtuniversity.nl
Psychological Methods
©2021 American Psychological Association
ISSN: 1082-989X https://doi.org/10.1037/met0000394
2023, Vol. 28, No. 1, 89–106
This article was published Online First August 12, 2021.
... Now, normative studies often provide reference values for several tests and questionnaires. In a literature review of 65 regression-based normative studies, Innocenti et al. (2023) found that 54 studies (83%) derived norms for at least two tests or subscales of the same questionnaire and that norms were derived from separate univariate analyses based on the same sample. However, fitting a regression model for each test or subscale has (at least) three weaknesses (Van der Elst et al., 2017). ...
... To the best of our knowledge, there are no guidelines on how to determine the required sample size for multivariate norming. Indeed, the literature on sample size calculation for normative studies has focused only on univariate norming (Innocenti et al., 2023; Oosterhuis et al., 2016, 2017). Specifically, Oosterhuis et al. (2016) obtained sample size guidelines for percentile estimation under the traditional and the regression-based approach, assuming a quantitative and a qualitative predictor in their simulations. ...
... This latter approach leads to the sample size per subgroup (e.g., per age group per sex), as the formulas do not allow for covariates. Innocenti et al. (2023) have derived the optimal design of the normative study for five univariate linear regression models with a qualitative and a quantitative predictor and proposed a sample size calculation procedure, such that individuals' positions relative to the derived norms (i.e., univariate Z-scores and percentile rank scores) can be assessed with prespecified power and precision. In Innocenti et al. (2023), the optimal design was defined as the joint distribution of the predictors included in the norming model (i.e., the design of the normative study) that maximizes the precision of estimation of the desired norm statistics. ...
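The notion of an optimal design as a predictor distribution that maximizes precision can be made concrete with a toy numeric example (a sketch with assumed numbers, not the article's derivation): in simple linear regression, the sampling variance of the slope is sigma^2 / sum((x - xbar)^2), so for a fixed sample size, placing observations at the extremes of the predictor range minimizes that variance, provided the linear model is correct, which is exactly why robust designs are also needed.

```python
import numpy as np

def var_slope(ages, sigma2):
    # Sampling variance of the slope estimate in simple linear regression:
    # Var(beta1_hat) = sigma^2 / sum_i (x_i - xbar)^2
    x = np.asarray(ages, dtype=float)
    return sigma2 / np.sum((x - x.mean()) ** 2)

sigma2, n = 9.0, 100                       # assumed residual variance and sample size
uniform = np.linspace(20, 80, n)           # design 1: ages spread evenly over the range
extreme = np.repeat([20.0, 80.0], n // 2)  # design 2: half the sample at each end

# The two-point design maximizes sum((x - xbar)^2) and so minimizes the variance,
# but it cannot detect nonlinearity in age, hence the need for robust designs.
ratio = var_slope(uniform, sigma2) / var_slope(extreme, sigma2)
```

Under these assumed numbers the uniform design needs roughly three times the variance (equivalently, about three times the sample size) of the two-point design for equally precise slope estimation.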
Article
Full-text available
Normative studies are needed to obtain norms for comparing individuals with the reference population on relevant clinical or educational measures. Norms can be obtained in an efficient way by regressing the test score on relevant predictors, such as age and sex. When several measures are normed with the same sample, a multivariate regression-based approach must be adopted for at least two reasons: (1) to take into account the correlations between the measures of the same subject, in order to test certain scientific hypotheses and to reduce misclassification of subjects in clinical practice, and (2) to reduce the number of significance tests involved in selecting predictors for the purpose of norming, thus preventing the inflation of the type I error rate. A new multivariate regression-based approach is proposed that combines all measures for an individual through the Mahalanobis distance, thus providing an indicator of the individual’s overall performance. Furthermore, optimal designs for the normative study are derived under five multivariate polynomial regression models, assuming multivariate normality and homoscedasticity of the residuals, and efficient robust designs are presented in case of uncertainty about the correct model for the analysis of the normative sample. Sample size calculation formulas are provided for the new Mahalanobis distance-based approach. The results are illustrated with data from the Maastricht Aging Study (MAAS).
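The Mahalanobis distance indicator described in this abstract can be sketched as follows (a minimal illustration with a hypothetical residual covariance matrix; in practice the covariance is estimated from the normative sample's multivariate regression):

```python
import numpy as np

# Hypothetical residual covariance matrix for three correlated test scores
Sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 2.5, 0.6],
                  [0.8, 0.6, 3.0]])

def mahalanobis_sq(resid, Sigma):
    """Squared Mahalanobis distance of an individual's residual vector
    (observed minus regression-predicted score on each measure)."""
    r = np.asarray(resid, dtype=float)
    return float(r @ np.linalg.solve(Sigma, r))

# A test taker scoring exactly as predicted on every measure has D^2 = 0.
# Under multivariate normality, D^2 is chi-square distributed with p = 3 df,
# so values above the .95 quantile (about 7.81) flag unusual overall performance.
d2 = mahalanobis_sq([3.0, -2.0, 1.5], Sigma)
```

The appeal of this indicator is that it accounts for the correlations between the measures: a residual pattern that is implausible jointly can be flagged even when each residual is unremarkable on its own.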
... ND are reference values that help clinicians interpret test scores of individual patients compared to their peers (normative group) using sociodemographic variables (Innocenti et al., 2021). In other words, reference values provide an empirical context for test scores and a representation of the performance of a group (Mitrushina et al., 2005). ...
... If a patient has below-expected cognitive performance, ND can serve as a guide for setting specific therapeutic goals and measuring the patient's progress throughout treatment (i.e., Schneider & McGrew, 2018). Because of this, norms' accuracy and representativeness of the population are important (Innocenti et al., 2021). Selecting appropriate ND will affect neuropsychological assessment result interpretation accuracy, reducing the probability of false diagnoses of cognitive impairment (Inesta et al., 2021). ...
... These two studies are examples of the traditional approach to generate norms. However, according to Innocenti et al. (2021), two approaches exist: the traditional approach and the regression-based approach. The first approach divides the sample into subgroups based on some relevant demographic factors, such as age, education, and sex. ...
Article
Full-text available
Objective To quantify the evolution, impact, and importance of normative data (ND) calculation by identifying trends in the research literature and what approaches need improvement. Methods A PRISMA-guideline systematic review was performed on literature from 2000 to 2022 in PubMed, Pub-Psych, and Web of Science. Inclusion criteria included scientific articles about ND in neuropsychological tests with clear data analysis, published in any country, and written in English or Spanish. Cross-sectional and longitudinal studies were included. Bibliometric analysis was used to examine the growth, productivity, journal dispersion, and impact of the topic. VOSViewer compared keyword co-occurrence networks between 1952–1999 and 2000–2022. Results Four hundred twelve articles met inclusion and exclusion criteria. The most studied predictors were age, education, and sex. There were a greater number of studies/projects focusing on adults than children. The Verbal Fluency Test (12.7%) was the most studied test, and the most frequently used variable selection strategy was linear regression (49.5%). Regression-based approaches were widely used, whereas the traditional approach was still used. ND were presented mostly in percentiles (44.2%). Bibliometrics showed exponential growth in publications. Three journals (2.41%) were in the Core Zone. VOSViewer results showed small nodes, long distances, and four ND-related topics from 1952 to 1999, and there were larger nodes with short connections from 2000 to 2022, indicating topic spread. Conclusions Future studies should be conducted on children’s ND, and alternative statistical methods should be used over the widely used regression approaches to address limitations and support growth of the field.
... Three studies examined sample size requirements, all for inferential norming (Innocenti et al., 2021; Oosterhuis et al., 2016; Zhu & Chen, 2011). Inferential norms showed a higher precision and, consequently, needed a smaller sample size than conventional norms for an equally precise estimation. ...
... When comparing the different continuous norming methods, inferential norming may be less complex. However, inferential norms can lead to biased norms if data assumptions are neither tested nor respected (e.g., Innocenti et al., 2021; Oosterhuis et al., 2016). Data assumptions can also be relevant for parametric norming. ...
Preprint
Norming of psychological tests and scales is decisive for the interpretation of test scores. However, conventional norming methods based on subgroups result either in biases or require very large samples to gather precise norms. Continuous norming methods, namely inferential, semi-parametric, and parametric norming, propose to solve those issues. This paper provides a systematic review of international research on continuous norming and summarizes and describes currently applied continuous norming practices. The review includes 121 publications with overall 189 studies. Most of these studies used inferential norming to compute continuous norms for a specific test and emerged in recent years. Summarizing the literature, we identified open questions such as when to prefer which continuous norming method over another. To address these open questions, we conducted a real data example. We used the Need for Cognition-KIDS scale, a personality questionnaire for elementary school children. Comparing the precision of conventional, semi-parametric, and parametric norms revealed a clear hierarchy in favor of parametric norms. Moreover, bias comparison of conventional and parametric norms showed less bias in parametric norms. Estimating the discrepancies between continuous and conventional norm scores revealed tremendous differences for some individuals.
... Lastly, most of the studies followed the traditional approach (use of means and SD) to generate normative data, which presents two main problems: it establishes a single norm for the whole population, ignoring possible demographic factors effects, or it allows to establish different norms but per subsample (according to demographic variables), splitting up the sample and reducing sample size and norms precision (Innocenti et al., 2021). However, the regression-based approach taken in this study allows to overcome these problems (Innocenti et al., 2021). ...
Article
Background: Verbal fluency tests (VFT) are highly sensitive to cognitive deficits. Usually, the score on VFT is based on the number of correct words produced, yet it alone gives little information regarding underlying test performance. The implementation of different strategies (cluster and switching) to perform efficiently during the tasks provide more valuable information. However, normative data for clustering and switching strategies are scarce. Moreover, scoring criteria adapted to Colombian Spanish are missing. Aims: (1) To describe the Colombian adaptation of the scoring system guidelines for clustering and switching strategies in VFT; (2) to determine its reliability; and (3) to provide normative data for Colombian children and adolescents aged 6-17 years. Methods & procedures: A total of 691 children and adolescents from Colombia completed phonological (/f/, /a/, /s/, /m/, /r/ and /p/) and semantic (animals and fruits) VFT, and five scores were calculated: total score (TS), number of clusters (NC), cluster size (CS), mean cluster size (MCS) and number of switches (NS). The intraclass correlation coefficient was used for interrater reliability. Hierarchical multiple regressions were conducted to investigate which strategies were associated with VFT TS. Multiple regressions were conducted for each strategy, including as predictors age, age², sex, mean parents' education (MPE), MPE² and type of school, to generate normative data. Outcomes & results: Reliability indexes were excellent. Age was associated with VFT TS, but weakly compared with strategies. For both VFT TS, NS was the strongest variable, followed by CS and NC. Regarding norms, age was the strongest predictor for all measures, while age² was relevant for NC (/f/ phoneme) and NS (/m/ phoneme). Participants with higher MPE obtained more NC, and NS, and larger CS in several phonemes and categories. Children and adolescents from private school generated more NC, NS and larger CS in /s/ phoneme.
Conclusions & implications: This study provides new scoring guidelines and normative data for clustering and switching strategies for Colombian children and adolescents between 6 and 17 years old. Clinical neuropsychologists should include these measures as part of their everyday practice. What this paper adds: What is already known on the subject VFT are widely used within the paediatric population due to its sensitivity to brain injury. Its score is based on the number of correct words produced; however, TS alone gives little information regarding underlying test performance. Several normative data for VFT TS in the paediatric population exist, but normative data for clustering and switching strategies are scarce. What this paper adds to existing knowledge The present study is the first to describe the Colombian adaptation of the scoring guidelines for clustering and switching strategies, and provided normative data for these strategies for children and adolescents between 6 and 17 years old. What are the potential or actual clinical implications of this work? Knowing VFT's performance, including strategy development and use in healthy children and adolescents, may be useful for clinical settings. We encourage clinicians to include not only TS, but also a careful analysis of strategies that may be more informative of the underlying cognitive processes failure than TS.
Article
Full-text available
Test publishers usually provide confidence intervals (CIs) for normed test scores that reflect the uncertainty due to the unreliability of the tests. The uncertainty due to sampling variability in the norming phase is ignored. To express uncertainty due to norming, we propose a flexible method that is applicable in continuous norming and allows for a variety of score distributions, using Generalized Additive Models for Location, Scale, and Shape (GAMLSS; Rigby & Stasinopoulos, 2005). We assessed the performance of this method in a simulation study, by examining the quality of the resulting CIs. We varied the population model, procedure of estimating the CI, confidence level, sample size, value of the predictor, extremity of the test score, and type of variance-covariance matrix. The results showed that good quality of the CIs could be achieved in most conditions. The method is illustrated using normative data of the SON-R 6-40 test. We recommend test developers to use this approach to arrive at CIs, and thus properly express the uncertainty due to norm sampling fluctuations, in the context of continuous norming. Adopting this approach will help (e.g., clinical) practitioners to obtain a fair picture of the person assessed. Electronic supplementary material The online version of this article (10.3758/s13428-018-1122-8) contains supplementary material, which is available to authorized users.
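The paper's method is GAMLSS-based and implemented in R; purely as an illustration of the underlying idea (uncertainty in norm statistics caused by sampling of the normative group), here is a simplified percentile-bootstrap sketch under a plain linear norming model with assumed toy data, not the GAMLSS procedure itself:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy normative sample under a simple linear model (the paper instead fits
# flexible GAMLSS distributions for location, scale, and shape)
n = 300
age = rng.uniform(20, 80, n)
score = 25.0 - 0.1 * age + rng.normal(0.0, 2.0, n)

def z_from_sample(age_s, score_s, observed, age_i):
    """Refit the norming regression on a (re)sample and return the Z-score."""
    X = np.column_stack([np.ones(len(age_s)), age_s])
    beta, *_ = np.linalg.lstsq(X, score_s, rcond=None)
    resid = score_s - X @ beta
    sigma = np.sqrt(resid @ resid / (len(age_s) - 2))
    return (observed - (beta[0] + beta[1] * age_i)) / sigma

# Percentile-bootstrap CI for the Z-score of one individual (score 18 at age 60),
# reflecting only norm-sampling uncertainty, not test unreliability
z_boot = []
for _ in range(1000):
    idx = rng.integers(0, n, n)  # resample the normative sample with replacement
    z_boot.append(z_from_sample(age[idx], score[idx], 18.0, 60.0))
lo, hi = np.percentile(z_boot, [2.5, 97.5])
```

Reporting such an interval alongside the Z-score gives practitioners a fairer picture than treating the norms as if they were estimated without error.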
Article
Full-text available
Cluster randomized trials evaluate the effect of a treatment on persons nested within clusters, where treatment is randomly assigned to clusters. Current equations for the optimal sample size at the cluster and person level assume that the outcome variances and/or the study costs are known and homogeneous between treatment arms. This paper presents efficient yet robust designs for cluster randomized trials with treatment-dependent costs and treatment-dependent unknown variances, and compares these with 2 practical designs. First, the maximin design (MMD) is derived, which maximizes the minimum efficiency (minimizes the maximum sampling variance) of the treatment effect estimator over a range of treatment-to-control variance ratios. The MMD is then compared with the optimal design for homogeneous variances and costs (balanced design), and with that for homogeneous variances and treatment-dependent costs (cost-considered design). The results show that the balanced design is the MMD if the treatment-to-control cost ratio is the same at both design levels (cluster, person) and within the range for the treatment-to-control variance ratio. It still is highly efficient and better than the cost-considered design if the cost ratio is within the range for the squared variance ratio. Outside that range, the cost-considered design is better and highly efficient, but it is not the MMD. An example shows sample size calculation for the MMD, and the computer code (SPSS and R) is provided as supplementary material. The MMD is recommended for trial planning if the study costs are treatment-dependent and homogeneity of variances cannot be assumed.
Article
Full-text available
Objective: Multi-trial memory tests are widely used in research and clinical practice because they allow for assessing different aspects of memory and learning in a single comprehensive test procedure. However, the use of multi-trial memory tests also raises some key data analysis issues. Indeed, the different trial scores are typically all correlated, and this correlation has to be properly accounted for in the statistical analyses. In the present paper, the focus is on the setting where normative data have to be established for multi-trial memory tests. At present, normative data for such tests are typically based on a series of univariate analyses, i.e. a statistical model is fitted for each of the test scores separately. This approach is suboptimal because (1) the correlated nature of the data is not accounted for, (2) multiple testing issues may arise, and (3) the analysis is not parsimonious. Method and results: Here, a normative approach that is not hampered by these issues is proposed (the so-called multivariate regression-based approach). The methodology is exemplified in a sample of N = 221 Dutch-speaking children (aged between 5.82 and 15.49 years) who were administered Rey's Auditory Verbal Learning Test. An online Appendix that details how the analyses can be conducted in practice (using the R software) is also provided. Conclusion: The multivariate normative regression-based approach has some substantial methodological advantages over univariate regression-based methods. In addition, the method allows for testing substantive hypotheses that cannot be addressed in a univariate framework (e.g. trial by covariate interactions can be modeled).
Article
To compute norms from reference group test scores, continuous norming is preferred over traditional norming. A suitable continuous norming approach for continuous data is the use of the Box-Cox Power Exponential model, which is found in the generalized additive models for location, scale, and shape. Applying the Box-Cox Power Exponential model for test norming requires model selection, but it is unknown how well this can be done with an automatic selection procedure. In a simulation study, we compared the performance of two stepwise model selection procedures combined with four model-fit criteria (Akaike information criterion, Bayesian information criterion, generalized Akaike information criterion (3), cross-validation), varying data complexity, sampling design, and sample size in a fully crossed design. The new procedure combined with one of the generalized Akaike information criterion was the most efficient model selection procedure (i.e., required the smallest sample size). The advocated model selection procedure is illustrated with norming data of an intelligence test.
Article
Norm statistics allow for the interpretation of scores on psychological and educational tests, by relating the test score of an individual test taker to the test scores of individuals belonging to the same gender, age, or education groups, et cetera. Given the uncertainty due to sampling error, one would expect researchers to report standard errors for norm statistics. In practice, standard errors are seldom reported; they are either unavailable or derived under strong distributional assumptions that may not be realistic for test scores. We derived standard errors for four norm statistics (standard deviation, percentile ranks, stanine boundaries and Z-scores) under the mild assumption that the test scores are multinomially distributed. A simulation study showed that the standard errors were unbiased and that corresponding Wald-based confidence intervals had good coverage. Finally, we discuss the possibilities for applying the standard errors in practical test use in education and psychology. The procedure is provided via the R function check.norms, which is available in the mokken package.
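A simplified sketch of the idea behind these standard errors follows (the paper derives delta-method SEs for several norm statistics under a multinomial model, implemented in the R function check.norms; only the percentile-rank case is shown here, where the SE reduces to the familiar binomial-proportion formula):

```python
from math import sqrt

def percentile_rank_se(scores, cutoff):
    """Empirical percentile rank of `cutoff` and its standard error, treating
    observed score frequencies as multinomial counts; the proportion below a
    fixed cutoff is then binomial, giving SE(p_hat) = sqrt(p(1-p)/n).
    (Percentile-rank definitions vary; this sketch uses 'strictly below'.)"""
    n = len(scores)
    p = sum(s < cutoff for s in scores) / n
    return 100.0 * p, 100.0 * sqrt(p * (1.0 - p) / n)

pr, se = percentile_rank_se([4, 7, 9, 12], cutoff=9)  # p = 2/4, so pr = 50.0
```

With only four normative scores the SE is huge (25 percentile points), which is precisely the kind of sampling uncertainty the paper argues should be reported.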
Book
"This is an engaging and informative book on the modern practice of experimental design. The authors' writing style is entertaining, the consulting dialogs are extremely enjoyable, and the technical material is presented brilliantly but not overwhelmingly. The book is a joy to read. Everyone who practices or teaches DOE should read this book." -Douglas C. Montgomery, Regents Professor, Department of Industrial Engineering, Arizona State University "It's been said: 'Design for the experiment, don't experiment for the design.' This book ably demonstrates this notion by showing how tailor-made, optimal designs can be effectively employed to meet a client's actual needs. It should be required reading for anyone interested in using the design of experiments in industrial settings." -Christopher J. Nachtsheim, Frank A Donaldson Chair in Operations Management, Carlson School of Management, University of Minnesota This book demonstrates the utility of the computer-aided optimal design approach using real industrial examples. These examples address questions such as the following: How can I do screening inexpensively if I have dozens of factors to investigate? What can I do if I have day-to-day variability and I can only perform 3 runs a day? How can I do RSM cost effectively if I have categorical factors? How can I design and analyze experiments when there is a factor that can only be changed a few times over the study? How can I include both ingredients in a mixture and processing factors in the same study? How can I design an experiment if there are many factor combinations that are impossible to run? How can I make sure that a time trend due to warming up of equipment does not affect the conclusions from a study? How can I take into account batch information when designing experiments involving multiple batches? How can I add runs to a botched experiment to resolve ambiguities? While answering these questions the book also shows how to evaluate and compare designs.
This allows researchers to make sensible trade-offs between the cost of experimentation and the amount of information they obtain.
Article
Test norms enable determining the position of an individual test taker in the group. The most frequently used approach to obtain test norms is traditional norming. Regression-based norming may be more efficient than traditional norming and is rapidly growing in popularity, but little is known about its technical properties. A simulation study was conducted to compare the sample size requirements for traditional and regression-based norming by examining the 95% interpercentile ranges for percentile estimates as a function of sample size, norming method, size of covariate effects on the test score, test length, and number of answer categories in an item. Provided the assumptions of the linear regression model hold in the data, for a subdivision of the total group into eight equal-size subgroups, we found that regression-based norming requires samples 2.5 to 5.5 times smaller than traditional norming. Sample size requirements are presented for each norming method, test length, and number of answer categories. We emphasize that additional research is needed to establish sample size requirements when the assumptions of the linear regression model are violated. © The Author(s) 2015.
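The efficiency gain of regression-based over traditional norming described here can be illustrated with a toy simulation (an assumed linear data-generating model with one quantitative predictor only; the study's actual simulations also varied test length, number of answer categories, and size of covariate effects):

```python
import numpy as np

rng = np.random.default_rng(2)

def norm_estimates(n):
    """One replication: estimate the expected test score at age 70 by
    (a) regression on the whole sample, and (b) the mean of the oldest
    subgroup (the traditional, sample-splitting approach)."""
    age = rng.uniform(20, 80, n)
    score = 30.0 - 0.2 * age + rng.normal(0.0, 3.0, n)
    X = np.column_stack([np.ones(n), age])
    beta, *_ = np.linalg.lstsq(X, score, rcond=None)
    regression_based = beta[0] + beta[1] * 70.0
    # Traditional norm: only ~1/4 of the sample (ages 65-80) is used, and the
    # subgroup mean actually targets its own mean age (~72.5), not age 70
    traditional = score[age >= 65].mean()
    return regression_based, traditional

est = np.array([norm_estimates(400) for _ in range(500)])
sd_reg, sd_trad = est.std(axis=0)  # spread of each estimator over replications
```

Because the regression-based norm borrows strength from the whole sample, its estimate varies less over replications, which is the mechanism behind the 2.5- to 5.5-fold smaller sample size requirements reported in the abstract (under a correctly specified linear model).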