
Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data

January 2013


DOI:10.4135/9781452269948

Publisher: SAGE Publishing
ISBN: 978-1500594343

Authors:

Jason W Osborne

Miami University

Many researchers jump straight from data collection to data analysis without realizing how analyses and hypothesis tests can go profoundly wrong without clean data. This book provides a clear, step-by-step process for examining and cleaning data in order to decrease error rates and increase both the power and replicability of results. Jason W. Osborne, author of Best Practices in Quantitative Methods (SAGE, 2008), provides easily implemented suggestions that are research-based and will motivate change in practice by empirically demonstrating, for each topic, the benefits of following best practices and the potential consequences of not following these guidelines. If your goal is to do the best research you can do, draw conclusions that are most likely to be accurate representations of the population(s) you wish to speak about, and report results that are most likely to be replicated by other researchers, then this basic guidebook is indispensable.


Data Cleaning Basics: Best Practices in Dealing with Extreme Scores

Jason W. Osborne, PhD

In quantitative research, it is critical to perform data cleaning to ensure that the conclusions drawn from the

data are as generalizable as possible, yet few researchers report doing so (Osborne JW. Educ Psychol.

2008;28:1-10). Extreme scores are a significant threat to the validity and generalizability of the results. In this

article, I argue that researchers need to examine extreme scores to determine which of many possible causes

contributed to the extreme score. From this, researchers can take appropriate action, which has many

laudatory effects, from reducing error variance and improving the accuracy of parameter estimates to reducing

the probability of errors of inference.

Keywords: Data cleaning; Extreme scores; Outliers; Parameter estimates

Most authors of peer-reviewed journal articles go to great lengths to describe their study, the research methods, the sample, the statistical analyses used, results, and conclusions based on those results. However, few seem to mention data cleaning (which can include screening for extreme scores, missing data, normality, etc). To be sure, some researchers do check their data for these things (and may neglect to report having done so), but when Osborne examined 2 years' worth of empirical articles in top-tier educational psychology journals, none explicitly discussed any data cleaning. There is no reason to believe that the situation is different in other disciplines.

The goal of this article is to discuss the issue of extreme scores, which can dramatically increase the risk of errors of inference, problems with generalizability (biased estimates), and suboptimal power. Some “robust” procedures and nonparametric tests are incorrectly considered to be immune from these sorts of issues; however, even robust and nonparametric tests benefit from clean data [2,3]. The article highlights why it is critical to screen data for extreme scores and offers specific suggestions for how to deal with them.

What Are Extreme Scores and Why Do We Care About Them?

An extreme score, or data point far outside the normal distribution for a variable or population [4-6], is also described as an observation that “deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.” Arguably, if an extreme score has origins in a different mechanism or population, it does not belong in your analysis. Outliers have also been defined as values that are “dubious in the eyes of the researcher” and as contaminants, all of which lead to the same conclusion.

So Why Do We Care About Extreme Values?

Extreme values can cause serious problems for statistical analyses. First, they generally serve to increase error variance and reduce the power of statistical tests. Second, if nonrandomly distributed, they can substantially alter the odds of making both type I and type II errors. Third, they can seriously bias or influence estimates that may be of substantive interest because they may not be generated by the population of interest [2,5,10].

What Is an Extreme Score?

There is as much controversy over what constitutes an extreme score as over whether to remove such scores. It is always a good idea to visually inspect data before any other analysis. Simple rules of thumb (eg, data points 3 or more SDs from the mean) are good starting points, unless the sample is particularly small [11,12]. I recommend examining scores at or beyond 3 SDs from the mean because, in a normally distributed population, the probability of an individual being more than 3 SDs from the mean by random chance alone is about 0.27%. Because of this, we have a strong basis for suspecting that data points beyond ±3 SD from the mean are not generated by the population of interest and as such should be dealt with in some fashion.
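The 3-SD rule of thumb can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the choice of the sample SD, and the data (99 unremarkable scores plus one planted extreme value) are hypothetical and do not come from the article.

```python
import math
import statistics

def tail_probability(z):
    """Two-tailed probability that a standard normal value lies beyond |z|."""
    return math.erfc(abs(z) / math.sqrt(2))

def flag_extreme(scores, cutoff=3.0):
    """Return (index, score, z) for scores at or beyond `cutoff` SDs from the mean."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)  # sample SD; an assumption of this sketch
    return [(i, x, (x - mean) / sd)
            for i, x in enumerate(scores)
            if abs(x - mean) / sd >= cutoff]

# Hypothetical data: 99 scores between 45 and 55, plus one planted extreme value.
scores = [50 + (i % 11) - 5 for i in range(99)] + [120]

# Percentage of a normally distributed population lying beyond +/-3 SD.
print(round(tail_probability(3) * 100, 2))

# Cases that deserve a closer look (not automatic deletion).
for index, score, z in flag_extreme(scores):
    print(index, score, round(z, 2))
```

A flagged case is a prompt to investigate its likely cause, not a decision to discard it.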

From North Carolina State University.

Address correspondence to Jason W. Osborne, PhD, North Carolina State University, Curriculum and Instruction and Counselor Education, Poe 602c, Campus Box 7801, NCSU, Raleigh, NC 27695-7801. E-mail: jason_osborne@ncsu.edu.

1527-3369/09/1001-0343$36.00/0

doi:10.1053/j.nainr.2009.12.009

Bivariate and multivariate outliers are typically measured using an index of influence, leverage, or distance. Popular indices include Mahalanobis' distance and Cook's D, both of which are frequently used to calculate the leverage (influence) that specific cases may exert on the predicted value of the regression line. Standardized or studentized residuals in regression and analysis of variance (ANOVA)-type analyses can also help identify within-group outliers. The z = ±3 rule works well for standardized residuals as well.
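As a rough illustration of a distance index, the sketch below computes squared Mahalanobis distances for the two-variable case in plain Python. The function name and the data are hypothetical; the planted case (x = 3, y = 18) is within the observed range on each variable separately but unusual in combination, which is the kind of case univariate screening can miss. Cook's D is not shown.

```python
import statistics

def mahalanobis_sq(xs, ys):
    """Squared Mahalanobis distance of each (x, y) pair from the bivariate centroid."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    n = len(xs)
    sxx = statistics.variance(xs)
    syy = statistics.variance(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 sample covariance matrix
    distances = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # Quadratic form [dx dy] * inverse(covariance) * [dx dy]^T, written out for 2x2.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

# Hypothetical data: 20 strongly correlated cases plus one jointly unusual case.
xs = list(range(1, 21)) + [3]
ys = [x + (1 if x % 2 else -1) for x in range(1, 21)] + [18]

d2 = mahalanobis_sq(xs, ys)
most_distant = d2.index(max(d2))
print(most_distant, round(d2[most_distant], 2))
```

How large a distance must be before a case is treated as an outlier is left to the analyst; this sketch only ranks the cases.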

What Causes Extreme Scores and What Should We Do About Them?

Extreme scores can arise for (at least) six possible reasons. First, note that not all extreme scores are illegitimate contaminants, and not all illegitimate scores show up as extreme scores. It is therefore important to consider the range of causes that may be responsible for extreme scores. The inferred cause can then inform what action a researcher should take with a given extreme score.

Extreme Scores From Data Errors

Extreme scores are often caused by errors in data collection, recording, or entry. Data from an interview or survey can be recorded incorrectly or mis-keyed upon data entry (eg, a survey respondent reporting yearly wage rather than hourly wage). Errors of this nature can often be corrected by returning to the original documents, recalculating or inferring the correct response, or recontacting the original participant. This can save important data and eliminate a problematic extreme score.

Extreme Scores From Intentional or Motivated Misreporting

Motivated misreporting by research participants is a long-discussed source of bias in data. A participant may make a conscious effort to sabotage the research, or may be acting from social desirability or self-presentation motives. Identifying and reducing this issue is difficult unless researchers take care to triangulate or validate data in some manner. Osborne and Blanchard summarize several approaches to identifying response sets such as this. If you suspect motivated misresponding in your data, you should probably remove that participant, because the data are being influenced by more than the phenomena you wish to examine.

Extreme Scores From Sampling Error or Bias

No sampling framework is perfect, and sampling error or bias can produce extreme scores by erroneously including individuals from populations not intended to be sampled. For example, some colleagues and I randomly sampled registered nurses from licensure rolls for a survey on organizational commitment. As part of this survey, we asked nurses to report their salary. Upon examining some very extreme scores, we discovered we had inadvertently surveyed some registered nurses who had moved into hospital administration (with a much higher salary) but who had also maintained their nursing license. These cases, being extreme and not of the population of interest (floor nurses), were removed.

Extreme Scores From Standardization Failure

Unexpectedly, extreme scores can be caused by research methodology, particularly if something anomalous happened during a particular subject's experience. Unusual phenomena such as construction noise outside a research laboratory or an experimenter feeling particularly grouchy, or even events outside the context of the research laboratory, such as a student protest, a rape or murder on campus, or observations in a classroom the day before a big holiday recess, can produce outliers. Faulty or noncalibrated equipment is another common cause of extreme scores.

Let us consider two possible cases in relation to this source of

outliers. In the first case, we might have a piece of equipment in

our laboratory that was miscalibrated, yielding measurements

that were extremely different from other days' measurements. If

the miscalibration results in a fixed change to the score that is

consistent or predictable across all measurements (eg, all

measurements are off by 100), then adjustment of the scores is

appropriate. If there is no clear way to defensibly adjust the

measurements, they must be discarded.

Extreme Scores From Faulty Distributional Assumptions

Incorrect assumptions about the distribution of the data can also lead to the presence of suspected outliers. Blood sugar levels, disciplinary referrals, scores on classroom tests where students are well-prepared, and self-reports of low-frequency behaviors (eg, number of times a student has been suspended or held back a grade) may give rise to highly nonnormal distributions. These distributions may look like they have a substantial number of extreme scores, but after transformation(s) to improve normality, it might be the case that few, if any, of the data points are subsequently identified as outliers.

Fig 1. Performance on class unit examination, Undergraduate Education Psychology Course.

VOLUME 10, NUMBER 1, www.nainr.com

The data presented in Fig 1 on 180 students taking an examination in an undergraduate psychology class show a highly skewed distribution with a mean of 87.50 and an SD of 8.78. Although one could argue that the lowest scores on this test are outliers because they are more than 3 SDs below the mean, a better interpretation is that the data are not normally distributed. In this case, either a transformation should be used to normalize the data before any analysis of extreme scores occurs, or analyses appropriate for nonnormal distributions should be used.
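To make the idea of transforming before screening concrete, here is a minimal sketch for a negatively skewed variable such as a test on which most students score near the ceiling. The reflect-then-log transformation is one common option and is an assumption of this example, as are the function names and the invented scores; the article does not prescribe a specific transformation at this point.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness (0 for a perfectly symmetric distribution)."""
    m = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def reflect_log(values):
    """Reflect a negatively skewed variable, then take the natural log."""
    top = max(values) + 1  # keeps every reflected value at 1 or more, so log is defined
    return [math.log(top - v) for v in values]

# Hypothetical exam scores piled up near the maximum with a long lower tail.
scores = [99, 99, 99, 98, 98, 98, 97, 97, 96, 95, 94, 92, 90, 86, 80, 70]

print(round(skewness(scores), 2))               # clearly negative
print(round(skewness(reflect_log(scores)), 2))  # much closer to 0
```

Note that reflection reverses the ordering of cases (high scorers receive low transformed values), which has to be kept in mind when interpreting later results.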

Extreme Scores as Legitimate Cases Sampled From the Correct Population

Finally, it is possible that an outlier can come from the population being sampled legitimately through random chance. It is important to note that sample size plays a role in the probability of outlying values. Within a normally distributed population, it is more probable that a given data point will be drawn from the most densely concentrated area of the distribution than from one of the tails [20,21]. As a researcher casts a wider net and the data set becomes larger, the sample more closely resembles the population from which it was drawn, and thus the likelihood of legitimate individual outlying values becomes greater, although as a percentage of the sample they become less significant overall.
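The role of sample size can be illustrated with a short calculation. Assuming independent draws from a normal population (an idealization), the chance that a sample of size n contains at least one legitimate case beyond ±3 SD is 1 − (1 − p)^n, where p is the two-tailed probability of such a case. The sample sizes below are the ones used in the simulations reported later in this article; the function name is hypothetical.

```python
import math

# Two-tailed probability of a value beyond +/-3 SD in a normal population.
P_BEYOND_3SD = math.erfc(3 / math.sqrt(2))

def p_at_least_one(n, p=P_BEYOND_3SD):
    """Probability that a random sample of size n contains one or more such cases."""
    return 1 - (1 - p) ** n

for n in (52, 104, 416):
    expected_count = n * P_BEYOND_3SD
    print(n, round(p_at_least_one(n), 2), round(expected_count, 2))
```

The probability of seeing at least one such case climbs with n, while the expected share of the sample made up of such cases stays fixed at p.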

When extreme scores occur as a function of the inherent variability of the data, opinions differ widely on what to do. When legitimate extreme scores are in a data set, they can have deleterious effects on power, accuracy, and type I/II error rates. One way to deal with them is to use truncation, in which you specify an upper reasonable limit to your data and recode higher scores to that number. For example, in a study of adolescents, one respondent indicated he had 99 close friends, which by our definition would be impossible; thus, we recoded all responses above 15 (the highest reasonable number of close friends) to 15. This keeps all data in the sample while at the same time reducing the influence of these scores. Data transformations (eg, square root and log) also reduce the effect of extreme scores when used appropriately.
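Truncation and simple transformations are easy to express in code. The sketch below uses the ceiling of 15 close friends from the example above; the response values themselves and the function name are hypothetical, and adding 1 before taking the log is a convention assumed here so that a response of 0 remains defined.

```python
import math

def truncate_upper(values, ceiling):
    """Recode every value above `ceiling` to `ceiling`, keeping all cases."""
    return [min(v, ceiling) for v in values]

# Hypothetical responses to "How many close friends do you have?"
friends = [2, 4, 3, 6, 15, 99, 5, 20]

truncated = truncate_upper(friends, 15)
print(truncated)  # [2, 4, 3, 6, 15, 15, 5, 15]

# Transformations shrink the distance between extreme and typical scores.
sqrt_friends = [math.sqrt(v) for v in friends]
log_friends = [math.log(v + 1) for v in friends]
print(round(max(sqrt_friends), 2), round(max(log_friends), 2))
```

Either approach keeps the respondent in the sample, in contrast to deletion.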

Alternatively, extreme scores can present an opportunity for inquiry. For example, researchers in Africa discovered some women who had been repeatedly exposed to human immunodeficiency virus over several years but remained uninfected; these women represent potential for an important advance in understanding. Thus, before discarding outliers, researchers need to consider whether those data contain valuable information that may not necessarily relate to the intended study but have importance in a more global sense.

To be clear on this point, no matter the inferred cause of the extreme score, it must be dealt with in some fashion, and that decision should be reported and defended in any research reports that involve the data. Extreme scores should be corrected, removed, truncated, reduced in importance through data transformation, or separated from the rest of the sample for separate study. This affords the most replicable, honest estimate of the population parameters possible [23,24]. Not only are basic parameter estimates closer to population values when illegitimate extreme values are removed, but inferential statistics (correlations, t tests, etc) have substantially lower error rates.

Advanced Techniques for Dealing With Extreme Scores: Robust Methods

Instead of transformations or truncation, researchers sometimes use various “robust” procedures to protect their data from being distorted by the presence of outliers. Certain parameter estimates, especially the mean and least squares estimations, are particularly vulnerable to outliers, or have “low breakdown” values. For this reason, researchers turn to robust or “high breakdown” methods to provide alternative estimates for these important aspects of the data.

Fig 2. Distribution of SES.

Fig 3. Distribution of SES with 4% outliers.

NEWBORN & INFANT NURSING REVIEWS, MARCH 2010

A common robust estimation method for univariate distributions involves the use of a trimmed mean, which is calculated by temporarily eliminating extreme observations at both ends of the sample. Alternatively, researchers may choose to compute a Winsorized mean, for which the highest and lowest observations are temporarily censored and replaced with adjacent values from the remaining data.

Assuming that the distribution of prediction errors is close to normal, several common robust regression techniques can help reduce the influence of outlying data points. The least trimmed squares and the least median of squares estimators are conceptually similar to the trimmed mean, helping to minimize the scatter of the prediction errors by eliminating a specific percentage of the largest positive and negative outliers, whereas Winsorized regression smooths the Y-data by replacing extreme residuals with the next closest value in the dataset.

Many options exist for analysis of nonideal variables. In addition to the options mentioned above, analysts can choose from nonparametric analyses, because these types of analyses have few if any distributional assumptions, although research by Zimmerman [3,28] does point out that even nonparametric analyses suffer from outlier cases.

The Effects of Extreme Scores and Their Removal on Individual Variables

Extreme scores have several specific effects on variables that are otherwise normally distributed. To illustrate this, we will use socioeconomic status (SES)⁎, which represents a composite of family income and social status based on parent occupation. In this data set, the scores were transformed to z scores. This variable shows strong normality, with a skew of −0.001 (0.00 is perfectly symmetrical; as depicted in Fig 2).

Samples from this distribution should also share these distributional traits, especially large samples. However, in a relatively large sample of n = 416 that included 4% extreme scores on one side of the distribution (high-poverty students), the distribution properties changed substantially (as depicted in Fig 3). The skew is now −2.18. Substantial error has been added to the variable (SD is increased 56%), and it is clear that those 16 students at the very bottom of the distribution do not belong to the normal population of interest. Removal of these outliers returned the distribution to a mean of −0.02, SD = 0.78, skew = 0.01, not significantly different from the original population of over 24 000.

Osborne and Overbay performed similar simulations of the effects of small numbers of outliers on repeated samples from a known population in the context of correlation and ANOVA-type analyses. The effects were striking.

⁎ From the National Center for Education Statistics NELS:88 data set.

Table 1. The effects of outliers on correlations

Population r | N   | Average initial r | Average cleaned r | t      | % More accurate | % Errors before cleaning | % Errors after cleaning | t
−0.06        | 52  | 0.01              | −0.08             | 2.5*   | 95              | 78                       | 8                       | 13.40†
             | 104 | −0.54             | −0.06             | 75.44† | 100             | 100                      | 6                       | 39.38†
             | 416 | 0                 | −0.06             | 16.09† | 70              | 0                        | 21                      | 5.13†
0.46         | 52  | 0.27              | 0.52              | 8.1†   | 89              | 53                       | 0                       | 10.57†
             | 104 | 0.15              | 0.50              | 26.78† | 90              | 73                       | 0                       | 16.36†
             | 416 | 0.30              | 0.50              | 54.77† | 95              | 0                        | 0                       | –

One hundred samples were randomly drawn for each row. Outliers were actual members of the population who scored at least z = ±3 on the relevant variable. With n = 52, a correlation of 0.274 is significant at P < .05. With n = 104, a correlation of 0.196 is significant at P < .05. With n = 416, a correlation of 0.098 is significant at P < .05, two tailed.

* P < .01.
† P < .001.

Fig 4. Correlation of SES and achievement, with 4% outliers.


Table 2. The effects of outliers on t tests

Outliers                                    | n   | Initial mean difference | Cleaned mean difference | t     | % More accurate mean difference | Average initial t | Average cleaned t | t       | % Type I or II errors before cleaning | % Type I or II errors after cleaning | t
Equal group means, outliers in one cell     | 52  | 0.34                    | 0.18                    | 3.70‡ | 66.0                            | −0.20             | −0.12             | 1.02    | 2.0                                   | 1.0                                  | <1
                                            | 104 | 0.22                    | 0.14                    | 5.36‡ | 67.0                            | 0.05              | −0.08             | 1.27    | 3.0                                   | 3.0                                  | <1
                                            | 416 | 0.09                    | 0.06                    | 4.15‡ | 61.0                            | 0.14              | 0.05              | 0.98    | 2.0                                   | 3.0                                  | <1
Equal group means, outliers in both cells   | 52  | 0.27                    | 0.19                    | 3.21‡ | 53.0                            | 0.08              | −0.02             | 1.15    | 2.0                                   | 4.0                                  | <1
                                            | 104 | 0.20                    | 0.14                    | 3.98‡ | 54.0                            | 0.02              | −0.07             | 0.93    | 3.0                                   | 3.0                                  | <1
                                            | 416 | 0.15                    | 0.11                    | 2.28* | 68.0                            | 0.26              | 0.09              | 2.14*   | 3.0                                   | 2.0                                  | <1
Unequal group means, outliers in one cell   | 52  | 4.72                    | 4.25                    | 1.64  | 52.0                            | 0.99              | 1.44              | −4.70‡  | 82.0                                  | 72.0                                 | 2.41†
                                            | 104 | 4.11                    | 4.03                    | 0.42  | 57.0                            | 1.61              | 2.06              | −2.78†  | 68.0                                  | 45.0                                 | 4.70‡
                                            | 416 | 4.11                    | 4.21                    | −0.30 | 62.0                            | 2.98              | 3.91              | −12.97‡ | 16.0                                  | 0.0                                  | 4.34‡
Unequal group means, outliers in both cells | 52  | 4.51                    | 4.09                    | 1.67  | 56.0                            | 1.01              | 1.36              | −4.57‡  | 81.0                                  | 75.0                                 | 1.37
                                            | 104 | 4.15                    | 4.08                    | 0.36  | 51.0                            | 1.43              | 2.01              | −7.44‡  | 71.0                                  | 47.0                                 | 5.06‡
                                            | 416 | 4.17                    | 4.07                    | 1.16  | 61.0                            | 3.06              | 4.12              | −17.55‡ | 10.0                                  | 0.0                                  | 3.13‡

One hundred samples were drawn for each row. Outliers were actual members of the population who scored at least z = ±3 on the relevant variable.

* P < .05.
† P < .01.
‡ P < .001.


The Effect of Extreme Scores on Correlations and Regression

As Table 1 demonstrates, outliers had adverse effects on correlations. Removal of the outliers produced more accurate (ie, closer to the known “population” correlation) estimates of the population correlation 70% to 100% of the time. Furthermore, in most cases errors of inference were significantly less common with cleaned than with uncleaned data (between 89.7% and 100% of errors of inference were eliminated for all but the largest data sets, which had few errors of inference).

As Fig 4 shows, a few randomly chosen outliers in a sample

of 100 can cause substantial mis-estimation of the population

correlation. In the sample of almost 24 000 students, these two

variables were correlated very strongly, r = 0.46. In this

particular example, the correlation with 4% outliers in the

analysis was r = 0.16 and was not significant, whereas after

removal of the extreme scores, the correlation closely estimated

the expected magnitude (r = 0.48).
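The attenuating effect of a few extreme scores on a correlation can be reproduced on a toy data set. The sketch below is not the NELS:88 analysis: the data are invented (48 well-behaved cases plus two planted extreme values on y, that is, 4% of the sample), and the function names and the per-variable 3-SD screening rule are assumptions of this example.

```python
import statistics

def pearson_r(xs, ys):
    """Pearson product-moment correlation."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def drop_extreme_pairs(xs, ys, cutoff=3.0):
    """Keep only pairs in which both values lie within `cutoff` SDs of their means."""
    mx, sx = statistics.fmean(xs), statistics.stdev(xs)
    my, sy = statistics.fmean(ys), statistics.stdev(ys)
    kept = [(x, y) for x, y in zip(xs, ys)
            if abs(x - mx) / sx < cutoff and abs(y - my) / sy < cutoff]
    return [x for x, _ in kept], [y for _, y in kept]

# Hypothetical data: 48 strongly related cases plus 2 extreme y values.
xs = [i % 8 for i in range(48)] + [0, 1]
ys = [(i % 8) + ((i // 8) % 3 - 1) for i in range(48)] + [30, 28]

r_all = pearson_r(xs, ys)
clean_xs, clean_ys = drop_extreme_pairs(xs, ys)
r_clean = pearson_r(clean_xs, clean_ys)
print(len(xs), round(r_all, 2))
print(len(clean_xs), round(r_clean, 2))
```

In this toy case the two planted values pull a strong relationship down to a weak one, and screening restores it; with real data the flagged cases would first be examined for their likely cause.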

The Effect of Outliers on t Tests and ANOVAs

The second example deals with analyses that look at group

mean differences, such as t tests and ANOVA. For the purpose

of simplicity, these analyses are simple t tests, but these results

easily generalize to more complex analyses such as ANOVA.

For these analyses, two different conditions were examined: when there were no significant differences between the groups in the population (sex differences in SES produced a mean group difference of 0.0007 with an SD of 0.80 and with 24 501 df produced a t of 0.29) and when there were significant group differences in the population (sex differences in mathematics achievement test scores produced a mean difference of 4.06 and an SD of 9.75 and 24 501 df produced a t of 10.69, P < .0001).

The results in Table 2 again illustrate the expected effects of outliers on t test analyses. Removal of outliers had beneficial effects, in that the results tended to become more like the population: for both groups, differences and t statistics became more accurate in most of the samples.

Missing Data as a Special Case of Extreme Score

Missing data can be thought of as another potential type of extreme score. As such, much of the previous discussion applies. There are multiple reasons why data might be missing, and it is important to attempt to ascertain the reason for missingness, just as with extremeness. Cole gives a much more thorough treatment of how to analyze and deal with missing data, for those interested.

However, one underutilized technique is analyzing differences between those with missing data and those with complete data. For example, researchers can code a variable that represents which category each subject falls into and then can analyze other data as a function of missingness to determine if missingness is associated with particular subgroups or other variables. This can shed important light on whether missingness may be causing significant bias.
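Coding a missingness indicator and comparing groups on another variable might look like the sketch below. The variable names (income, test score) and all values are hypothetical, and the comparison shown is a simple difference in means; a formal test of that difference would be the natural next step.

```python
import statistics

# Hypothetical records: (income, test_score); income is None when missing.
records = [
    (42, 71), (None, 58), (55, 80), (None, 61),
    (38, 69), (61, 84), (None, 55), (47, 75),
]

# 1 = income missing, 0 = income present.
missing_flag = [1 if income is None else 0 for income, _ in records]

# Analyze another variable as a function of missingness.
scores_missing = [score for (_, score), flag in zip(records, missing_flag) if flag == 1]
scores_complete = [score for (_, score), flag in zip(records, missing_flag) if flag == 0]

difference = statistics.fmean(scores_complete) - statistics.fmean(scores_missing)
print(missing_flag)
print(round(difference, 1))  # 17.8 in this invented example
```

A sizable difference of this kind would suggest that missingness is related to the outcome, so that simply dropping incomplete cases could bias the results.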

Summary

In sum, the best, most sophisticated analyses must be considered flawed if quantitative researchers do not take the time to thoroughly understand and examine their data to ensure the best possible outcome (ie, the most accurate, generalizable representation of the population). Although over a century of writings on quantitative methods has yielded a very diverse set of opinions about this topic, analyses and principles summarized herein should convince the readers that it is in their best interest to thoroughly clean their data before analysis.

References

1. Osborne JW. Sweating the small stuff in educational psychology: how effect size and power reporting failed to change from 1969 to 1999, and what that means for the future of changing practices. Educ Psychol. 2008;28:1-10.

2. Zimmerman DW. A note on the influence of outliers on parametric and nonparametric tests. J Gen Psychol. 1994;121:391-401.

3. Zimmerman DW. Increasing the power of nonparametric tests by detecting and downweighting outliers. J Exp Educ. 1995;64:71-78.

4. Jarrell MG. A comparison of two procedures, the Mahalanobis Distance and the Andrews-Pregibon Statistic, for identifying multivariate outliers. Res Sch. 1994;1:49-58.

5. Rasmussen JL. Evaluating outlier identification tests: Mahalanobis D Squared and Comrey D. Multivariate Behav Res. 1988;23:189-202.

6. Stevens JP. Outliers and influential data points in regression analysis. Psychol Bull. 1984;95:334-344.

7. Hawkins DM. Identification of Outliers. New York: Chapman and Hall; 1980.

8. Dixon WJ. Analysis of extreme values. Ann Math Stat. 1950;21:488-506.

9. Wainer H. Robust statistics: a survey and some prescriptions. J Educ Stat. 1976;1:285-312.

10. Schwager SJ, Margolin BH. Detection of multivariate outliers. Ann Stat. 1982;10:943-954.

11. Miller J. Reaction time analysis with outlier exclusion: bias varies with sample size. Q J Exp Psychol. 1991;43:907-912.

12. Van Selst M, Jolicoeur P. A solution to the effect of sample size on outlier elimination. Q J Exp Psychol. 1994;47:631-650.

13. Newton RR, Rudestam KE. Your Statistical Consultant: Answers to Your Data Analysis Questions. Thousand Oaks, CA: Sage; 1999.


14. Barnett V, Lewis T. Outliers in Statistical Data. New York: Wiley; 1994.

15. Huck SW, Sutton CO. Some comments concerning the use of monotonic transformations to remove the interaction in two-factor ANOVA's. Educ Psychol Meas. 1975;35:789-791.

16. Osborne JW, Blanchard MR. Random responding from students is a threat to the validity of educational research results. Educ Psychol. In press.

17. Brewer CS, Nauenberg E, Osborne JW. Differences among hospital and non-hospital RNs participation, satisfaction, and organizational commitment in western New York. Paper presented at: National meeting of the Association for Health Service Research; June 1998; Washington, DC.

18. Iglewicz B, Hoaglin DC. How to Detect and Handle Outliers. Milwaukee, WI: ASQC Quality Press; 1993.

19. Osborne JW. Notes on the use of data transformations. Practical Assessment, Research, and Evaluation; 2002. p. 8. Available online at http://ericae.net/pare/getvn.asp?v=8&n=6.

20. Evans VP. Strategies for detecting outliers in regression analysis: an introductory primer. In: Thompson B, editor. Advances in Social Science Methodology, Vol. 5. Stamford, CT: JAI Press; 1999. p. 213-233.

21. Sachs L. Applied Statistics: A Handbook of Techniques. 2nd ed. New York: Springer-Verlag; 1982.

22. Rowland-Jones S, Sutton J, Ariyoshi K, et al. HIV-specific cytotoxic T-cells in HIV-exposed but uninfected Gambian women. Nat Med. 1995;1:59-64.

23. Judd CM, McClelland GH. Data Analysis: A Model Comparison Approach. San Diego, CA: Harcourt Brace Jovanovich; 1989.

24. Osborne JW, Overbay A. The power of outliers (and why researchers should ALWAYS check for them). Practical Assessment, Research, and Evaluation; 2004. p. 9.

25. Anscombe FJ. Rejection of outliers. Technometrics. 1960;2:123-147.

26. Rousseeuw P, Leroy A. Robust Regression and Outlier Detection. New York: Wiley; 1987.

27. Lane K. What Is Robust Regression and How Do You Do It? Annual meeting of the Southwest Educational Research Association. Austin, TX; 2002.

28. Zimmerman DW. Invalidation of parametric and nonparametric statistical tests by concurrent violation of two assumptions. J Exp Educ. 1998;67:55-68.

29. Cole JC. How to deal with missing data. In: Osborne JW, editor. Best Practices in Quantitative Methods. Thousand Oaks, CA: Sage Publishing; 2008.


Multivariate Analysis for Characterization of Air Pollution Sources: Part 1 Prior Data Screening and Underlying Assumptions

Article

Full-text available

Apr 2024
POL J ENVIRON STUD

Mohammed O.A. Mohammed

Analysis of the university experience of undergraduate students of Education degrees

Article

Full-text available

Jun 2024

The Pedagogical Use of Didactic Classes for Teaching Cognitive Psychology

Article

Full-text available

May 2024

The didactic class is a pedagogical tool meant to increase classroom interactivity by encouraging student discussion of real-life cases in connection with theory. This paper evaluates the pedagogical impact of using a one-off didactic class where an external expert is brought in to discuss how to relate a cognitive psychology course’s content to real-life problems. Using a mixed-methods approach, we measure the undergraduate students’ sense of conceptual understanding, their perspective on applying cognitive sciences, their sense of belonging to the department, and their motivation to work. Students’ sense of understanding and their perspective in applying cognitive sciences to real-world problems significantly increased after this class. However, we found no significant differences in their sense of belonging to the department or their motivation to study. This suggests didactic classes may further course-specific content but do not change broader aspects of motivation or belonging. The qualitative interviews support the quantitative results. Students reported that didactic class made them think laterally about content from other modules and how they could apply theoretical insights to real-world problems, which boosted confidence. Students reported great satisfaction with the didactic class. Of course, the speaker must be relevant to the course content, and students should feel empowered and able to speak in class. However, these are practical concerns that should not discourage lecturers from exploring didactic classes as a fun and instructive tool that has significant pedagogical benefits.

Identification of Causal Structure in the Presence of Missing Data with Additive Noise Model

Article

Mar 2024

Missing data are an unavoidable complication frequently encountered in many causal discovery tasks. When a missing process depends on the missing values themselves (known as self-masking missingness), the recovery of the joint distribution becomes unattainable, and detecting the presence of such self-masking missingness remains a perplexing challenge. Consequently, due to the inability to reconstruct the original distribution and to discern the underlying missingness mechanism, simply applying existing causal discovery methods would lead to wrong conclusions. In this work, we found that the recent advances additive noise model has the potential for learning causal structure under the existence of the self-masking missingness. With this observation, we aim to investigate the identification problem of learning causal structure from missing data under an additive noise model with different missingness mechanisms, where the `no self-masking missingness' assumption can be eliminated appropriately. Specifically, we first elegantly extend the scope of identifiability of causal skeleton to the case with weak self-masking missingness (i.e., no other variable could be the cause of self-masking indicators except itself). We further provide the sufficient and necessary identification conditions of the causal direction under additive noise model and show that the causal structure can be identified up to an IN-equivalent pattern. We finally propose a practical algorithm based on the above theoretical results on learning the causal skeleton and causal direction. Extensive experiments on synthetic and real data demonstrate the efficiency and effectiveness of the proposed algorithms.

Factors influencing technopreneurial intention: A confirmatory factor analysis

Conference Paper

Jan 2024

Use and Usefulness of Assessments to Inform Instruction: Developing a K-12 Classroom Teacher Assessment Practice Measure

Article

Feb 2024

Understanding First-Generation College Students’ Help-Seeking Attitudes Using Structural Equation Modeling

Article

Apr 2024

The significance of the cartographic sign: influences of symbol shape on intuitive judgments

Article

Apr 2024

Silvia Klettner

Tradigital Humanities: Experiences in a context of change.

Article

Feb 2024

Jordi PÉREZ GONZÁLEZ

Taking up a proverb of Seneca (while we teach, we learn), we must understand the historian's craft as a profession in continual renewal, drawing on new elements that improve the methodology and techniques of the discipline and discarding those that become obsolete. Over the past three decades, methodologies contributed by other disciplines, such as computer science, and Humanities studies have been converging in the field known as Digital Humanities. Although these computational techniques have introduced new methods for identifying patterns in data and promise to speed up the analysis of the growing mass of data, they diverge from traditional narrative and its methods. In this regard, recent experiences make it possible to discuss the benefits and limits of the link between traditional methods and techniques, on the one hand, and computational ones, on the other, when preparing scientific results. Thus, we believe that, without losing sight of the historian's basic task of continually immersing oneself in the reading and analysis of sources, the paradigm shifts in science can be taken on by adopting this new technique. Now, with perspective, we look at the limits that this convergence itself presents, and here we present our most recent experience in order to improve its performance.

Measuring how responsible we are – The development and validation of the personal social responsibility scale (PSRS-Q19)

Article

Mar 2024

The purpose of the article is to introduce the Personal Social Responsibility Scale, a tool used to measure the intensity and multidimensionality of Personal Social Responsibility, and to describe the process of its creation. The authors conceptualized the scale and conducted research on a sample of 3019 people. Based on this research, a 19-question scale was built, referring to 6 dimensions of social responsibility: Self-Responsibility, Care for Natural Resources, Care for Animals, Care for Family and Friends, Care for the Future of the World, and Activism.

Robust Regression and Outlier Detection.

Article

Jan 1989

Data Analysis: A Model-Comparison Approach

Article

Feb 1992

Applied Statistics: A Handbook of Techniques

Article

May 1998

Outliers in Statistical Data.

Article

Mar 1995

Strategies for detecting outliers in regression analysis: an introductory primer

Article

Jan 1999

V.P. Evans

Reaction-Time Analysis with Outlier Exclusion: Bias Varies with Sample Size

Article

Dec 1991

Jeff Miller

To remove the influence of spuriously long response times, many investigators compute "restricted means," obtained by throwing out any response time more than 2.0, 2.5, or 3.0 standard deviations from the overall sample average. Because reaction time distributions are skewed, however, the computation of restricted means introduces a bias: the restricted mean underestimates the true average of the population of response times. This problem may be very serious when investigators compare restricted means across conditions with different numbers of observations, because the bias increases with sample size. Simulations show that there is substantial differential bias when comparing conditions with fewer than 10 observations against conditions with more than 20. With strongly skewed distributions and a cutoff of 3.0 standard deviations, differential bias can influence comparisons of conditions with even more observations.
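The restricted-mean procedure described in this abstract can be illustrated with a small simulation. The sketch below is not Miller's code or simulation design: the function names, the exponential distribution (true mean 1.0) standing in for a skewed reaction-time distribution, the cutoff of 2.5 standard deviations, and the sample sizes are all assumptions chosen for illustration.

```python
import random
import statistics


def restricted_mean(sample, cutoff=2.5):
    """Mean after dropping values more than `cutoff` sample SDs from the sample mean.

    Assumes cutoff >= 1, which guarantees at least one value is kept.
    """
    if len(sample) < 2:
        return statistics.fmean(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)
    kept = [x for x in sample if abs(x - mean) <= cutoff * sd]
    return statistics.fmean(kept)


def average_restricted_mean(n, cutoff=2.5, reps=2000, seed=0):
    """Average restricted mean over `reps` simulated samples of size `n`.

    Samples are drawn from an exponential distribution with true mean 1.0,
    an assumed stand-in for a positively skewed reaction-time distribution.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        total += restricted_mean(sample, cutoff)
    return total / reps


if __name__ == "__main__":
    # Compare how far the restricted mean falls below the true mean of 1.0
    # for a small and a larger number of observations per condition.
    for n in (5, 50):
        print(n, round(average_restricted_mean(n), 3))
```

Under these assumptions, the average restricted mean for the larger sample size should fall further below the true mean than for the smaller one, which is the direction of differential bias the abstract reports. With only 5 observations, no value can lie more than about 1.8 sample standard deviations from the sample mean, so a 2.5 SD cutoff never excludes anything at that size.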

Identification of Outliers.

Article

Dec 1981

Increasing the Power of Nonparametric Tests by Detecting and Downweighting Outliers

Article

Oct 1995

Donald W. Zimmerman

In this study, methods are examined that can be described, somewhat paradoxically, as robust nonparametric statistics. Although nonparametric tests effectively control the probability of Type I errors through rank randomization, they do not always control the probability of Type II errors and power, which can be grossly inflated or deflated by the shape of distributions. The power of the Student t test and the Wilcoxon-Mann-Whitney test declines substantially when samples are obtained from outlier-prone densities, including mixed-normal, Cauchy, lognormal, and mixed-uniform densities. However, the nonparametric test acquires an advantage, because outliers influence the t test to a relatively greater extent. Under these conditions, an outlier detection and downweighting (ODD) procedure, usually associated with parametric significance tests, augments the power of both the t test and the Wilcoxon-Mann-Whitney test.

A Note on the Influence of Outliers on Parametric and Nonparametric Tests

Article

Oct 1994

Donald W. Zimmerman

Extremely deviant scores, or outliers, reduce the probability of Type I errors of the Student t test and, at the same time, substantially increase the probability of Type II errors, so that power declines. The magnitude of the change depends jointly on the probability of occurrence of an outlier and its extremity, or its distance from the mean. Although outliers do not modify the probability of Type I errors of the Mann-Whitney-Wilcoxon test, they nevertheless increase the probability of Type II errors and reduce power. The effect on this nonparametric test depends largely on the probability of occurrence and not the extremity. Because deviant scores influence the t test to a relatively greater extent, the nonparametric method acquires an advantage for outlier-prone densities despite its loss of power.

Rejection of Outliers

Article

May 1960

F. J. Anscombe

If one reading is a long way from the rest in a series of replicate determinations, or if in a least-squares analysis one reading is found to have a much greater residual than the others, there is temptation to reject it as spurious. Numerous criteria for the rejection of outliers have been proposed and discussed during the past 100 years. They seem always to have been regarded as something like significance tests, and attention has been focussed on rejection rates. It is suggested that rejection rules are not significance tests but insurance policies, and attention would be better focussed on error variance. A detailed study is made of the effect of routine application of rejection criteria to replicate determinations of a single value. Determinations in triplicate and quadruplicate are especially considered. Complex patterns of observations are also considered, especially factorial arrangements with high symmetry, and there is a study of the correlations between residuals. Attention is focussed mainly on rejection rules appropriate when the population variance is known, but some consideration is also given to Studentized rules.
