Question

Asked 28th Jun, 2014

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

When is it justifiable to exclude 'outlier' data points from statistical analyses?

Data analyzers inspecting tables or figures might decide to exclude from statistical analyses unusual data points sometimes called 'outlier' data points. Statistical patterns and conclusions might differ between analyses including versus excluding outliers.

The exact underlying mechanisms that create outlier data points are often unknown. People might always find arguments to exclude or keep data in analyses. How important is familiarity with model species or model systems in the justification of data point selection, or the definition of statistical rules in general?

Statistical Data Analysis

Philosophy Of Science

University of Hull

I agree with James.

1 Recommendation

Behrouz Ahmadi-Nedushan

Yazd University

Dear Marcel,

An outlier is an observation that appears to deviate markedly from other observations in the sample. An outlier may indicate bad data. For example, the data may have been coded incorrectly or an experiment may not have been run correctly.

If it can be determined that an outlying point is in fact erroneous, then the outlying value should be deleted from the analysis (or corrected if possible).

In some cases, it may not be possible to determine if an outlying point is bad data. Outliers may be due to random variation or may indicate something scientifically interesting. In any event, we should not simply delete the outlying observation before a thorough investigation. When running experiments, we may repeat the experiment. If the data contain significant outliers, we may need to consider the use of robust statistical techniques.

An excellent reference on the subject:

Rousseeuw, P. J., & Hubert, M. (2011). Robust statistics for outlier detection. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 73-79.

http://onlinelibrary.wiley.com/doi/10.1002/widm.2/abstract

47 Recommendations

University of Lincoln

Albert Manfredi

The Boeing Company

Orlando M Lourenço

University of Lisbon

New Bulgarian University

In statistics, an outlier is an observation point that is distant from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set.

For more information please see:

http://en.wikipedia.org/wiki/Outlier

5 Recommendations

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

I agree that outlier data are not always 'errors' (e.g. resulting from experimental artifacts or typing errors in data files), but just the result of an unusual event/factor that was missed during the study.

3 Recommendations

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

Is it scientifically wise to define outlier data when the data analyzer only has access to the data distribution pattern? This will happen when data analyzers are not familiar with the study system or model species involved.

You could also say: How important is it to reveal what people call 'statistically significant patterns'? Each individual data point in the cloud of points will probably be influenced by a unique cocktail of underlying mechanisms.

3 Recommendations

Behrouz Ahmadi-Nedushan

Yazd University

Dear Marcel,

When we have bad data, it is easy: we just delete the outlier(s).

You are right that in some cases the outliers may result from an unknown or unusual factor. These cases are hard to deal with, as keeping or deleting the outliers can lead to very different conclusions. As I wrote in my previous post, I think that one possible way is to apply robust statistical methods. I am going to close the RG session for now. Good day!

7 Recommendations

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

Perhaps people just increase the sample sizes until statistically significant patterns are obtained, and one reason might be to reduce the statistical impacts of so-called outlier data points without excluding them from the data set.....

The question then is: Should the study or sampling period stop once statistically significant patterns are obtained, for instance to avoid the risk of the reappearance of new so-called outlier data points?

4 Recommendations

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

Another point for discussion:

Can the same data point be considered as an 'outlier' in one study domain, but accepted as a 'normal' data point in another study domain? Any examples?

2 Recommendations

Applied Science Private University

An outlier is an observation that is distant from the mean of observations. In SPSS, we may delete outliers if they affect the results.

4 Recommendations

Applied Science Private University

Dear Marcel,

I believe that the same outlying data point may be unacceptable in one study domain while being acceptable in another. The sensitivity and risk of data collected for medical studies differ from those of data collected for marketing studies.

3 Recommendations

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

Dear Mahfuz,

Concerning your point related to SPSS, you assume that only the pattern is important, not the underlying mechanism. A data point caused by an error or an unusual event is therefore considered of equal importance.

Concerning your second point, let's take the following two examples for more discussion:

1) A study in human sciences interested in the underlying causes of human errors may assume that individual errors are just a part of human nature. Unusual data points in a data cloud where body length is plotted against body mass could in this framework be considered as 'normal' data points, even if the points result from measurement errors or typing errors. However, the same unusual data points (e.g. an unusual length) might not be accepted in a business study interested in marketing and clothing manufacturing. Producing clothes for people with unusual dimensions would perhaps be considered too expensive.

2) When body length is plotted against body mass there may be a nice positive relationship with scatter when different human age classes are combined, or when data from men and women are combined, or when data from a single age and gender class are considered, etc. But we also know that each individual data point representing body mass and body length from a single individual may be influenced by a unique cocktail of underlying biology-based mechanisms related to the rearing environment, genetic background, culture-based diet, etc. Perhaps in these conditions a data point A representing an individual A that is situated in the middle of a data cloud plotting body length against body mass might have been caused by an unusual, unidentified biology-based mechanism.

The definition of 'unusual' might be scale-dependent.... either 'pattern-based' (e.g. SPSS) or mechanism-based (research domain dependent) or .....

8 Recommendations

State University of New York at New Paltz

If you believe, for a good reason, that the result contains measurement error.

If you know that the result is associated with a one-off event that is unlikely to ever happen again.

4 Recommendations

Applied Science Private University

Yes Marcel. Concerning my point related to SPSS, I assume that only the pattern is important, not the underlying mechanism. In Humanities, the most important thing is the pattern of data. As for the second point, you have mentioned a good example.

2 Recommendations

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

Very interesting discussion!

2 Recommendations

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

So one option might be to systematically exclude before the statistical analyses the upper (5%) and lower (5%) extremes from a data set, accepting that there is always a high probability to make errors just by chance alone....

1 Recommendation

Linas Balciauskas

Nature Research Centre

@Marcel

I think that in zoology the upper and lower 5% are the most interesting, or at least could be the most interesting.

I exclude outliers only if the mistake/error is obvious. My last exclusion, followed by fully recalculated statistics, was an error in the recorded body length of a trapped mouse. It was 67 mm, while the body weight was over 40 g. Having been in the field for more than 30 years, I know that this species simply cannot be so small in size (or so obese). So I excluded both measurements from the data, as the error cannot be corrected.

On the other hand, if the 5% of mice with the highest body condition are all from the control zone, and those with the worst body condition are from the most heavily polluted zone, this is a nice finding, don't you agree?

2 Recommendations

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

Dear Linas,

I fully agree. We use similar procedures of judgement before the data are entered in the long-term data base of small passerine birds. When the values are extreme based on >30 years of observation, they are not included in the data base. For instance, a great tit has a wing length of 70-78 mm (no values below 70 mm). If in the note books there is written '66 mm' this must have been a writing error and therefore not considered. On the other hand, there may indeed be border cases and there may be exchange between populations differing in phenotype, like wing length... .

Thus, how many 'outlier data points' that are found in field note books will not end up in electronic data bases, and how do data managers varying in background information about model species handle these special cases when the files are constructed?

4 Recommendations

Linas Balciauskas

Nature Research Centre

@Marcel, geographic differences in small mammals may be exciting; we just get new species, where diagnostic characters are out of the range compared to southern populations. These were not outliers, just smaller values. The paper is accepted; soon I will present it on RG.

3 Recommendations

Hashem Adnan Kilani

University of Jordan

An outlier is not always left out of the analysis, unless the results would otherwise be out of context or not meaningful. Sometimes we should be cautious in considering an outlier as an error. In biomechanics research, a filter can be used to separate the range of real values from spurious ones due to error.

2 Recommendations

University of South Florida

I do not like to exclude data just because it is an outlier, especially if it is based on enough data. However, I deal with consumer review data, and often some of the data points are based on one review. In this circumstance I do two analyses of the data, one with these points and one without, and explain the difference. That way the user of the data can see both sets while being aware of the lack of data supporting some of the data points.

19 Recommendations

Wirtschaftsuniversität Wien

So we end up with the question of what we consider an outlier, and how we define one. In descriptive statistics, exclusion of a few extreme observations within a large mass of data can be very helpful: the difference in the results with and without the extreme observations might be the issue of interest. More generally, comparing results from robust methods with standard methods follows the same idea.

In applied statistics, learning from evidence often is the focus of the exercise. In such a situation, the extreme observations are possibly those observations from which we learn most.

2 Recommendations

Texas Children's Hospital

Extreme outliers will affect the mean a lot, but will have little or no effect on the median. So you can include outliers (if there is no other compelling reason to remove them) if you are computing a median or a mode.
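The difference in sensitivity between the mean and the median can be checked directly. The sketch below uses made-up body-mass values (an assumption for illustration, not data from this thread) with one implausibly large entry appended; `summarize` is a hypothetical helper name.

```python
from statistics import mean, median

# Hypothetical body-mass values in grams (illustrative only).
clean = [17.5, 18.0, 18.2, 18.4, 18.9, 19.1, 19.3]
# Same sample plus one implausibly large value, e.g. a misplaced decimal point.
with_outlier = clean + [190.0]

def summarize(values):
    """Return the mean and median of a list of numbers."""
    return {"mean": mean(values), "median": median(values)}

clean_summary = summarize(clean)
outlier_summary = summarize(with_outlier)

# How far each statistic moves when the extreme value is added.
mean_shift = outlier_summary["mean"] - clean_summary["mean"]
median_shift = outlier_summary["median"] - clean_summary["median"]

print(f"mean shift:   {mean_shift:.2f}")
print(f"median shift: {median_shift:.2f}")
```

In this toy sample the mean moves by more than 20 g while the median moves by a fraction of a gram. The median still shifts slightly, because adding a point changes which values sit in the middle.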

As others have said, if an outlier is too extreme to be believable, such as being likely due to measurement error, then it is best to exclude it. If the outlier is plausible, it may be best to analyze the data both with and without the outliers.

In logistic regression, it can be useful to show the risk factors that predict them. But including outliers in the data may also mask the effect of predictors on less-extreme data that are not outliers. In linear regression, outliers can greatly affect the regression (the slope, r-value, and r-squared). It may be best to remove them from linear regression, and then explain and describe them separately in some other way.
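The advice to analyze the data both with and without a suspect point can be sketched for simple linear regression. The data below are invented for illustration (roughly y = 2x + 1, with a deliberately discordant last point), and `fit_line` is a hypothetical helper implementing ordinary least squares by hand.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical data; the last point (x=9, y=2.0) is the suspect observation.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 16.9, 2.0]
suspect = 8  # index of the suspect point

with_point = fit_line(xs, ys)
without_point = fit_line(
    xs[:suspect] + xs[suspect + 1:],
    ys[:suspect] + ys[suspect + 1:],
)

for label, (slope, intercept, r2) in (("with", with_point), ("without", without_point)):
    print(f"{label:8s} slope={slope:.3f} intercept={intercept:.3f} r2={r2:.3f}")
```

Reporting both fits side by side, together with the reason the point is suspect, lets the reader judge how much the conclusion depends on that single observation.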

2 Recommendations

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

Dear Jerry,

thanks for your advice.

The impact of outliers will depend on the proportion of outliers in a data set (thus sample size dependent) and the values of the outliers in relation to the values frequently observed (median). Perhaps one outlier is enough to create a biased (statistical) pattern when the value is really extreme. Extreme values can be found out just by looking at the values in a data set, also based on past experience with a model species/system.

Potentially there are several types of data filtering at different levels:

From observations to field notes (did I really see that phenomenon, probably not? What observations will be noted down?)

From the field notes to the computer file (did I really note this down in the field? it must have been an error)

From the computer file data set to the data set used for statistical analysis....

Perhaps some extreme values just result from typing/copy errors.

1 Recommendation

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

I prefer to share data sets/statistical outputs among potential contributors so that each person has the possibility to have a look at the note books, the computer data set, the statistical analysis,.... If different persons come up with the same findings/conclusions, and if results are repeatable in time, I think the analysis is OK

1 Recommendation

I think there is always some amount of subjectivity involved in deciding which event (I mean which data point) is to be considered an outlier and which one is not. The important point is that this decision comes from experience, and there is no absolutely objective criterion.

But the very important thing here is (as pointed out by others) that sometimes an outlier data point (We call it an event) may belong to an entirely different class of events and very careful study of even one, two or a few such events leads us to new physics.

To give an example, we may carry out an analysis of the spectral hardness of the Classical Gamma Ray Bursts (GRBs). Here, spectral hardness means the ratio of the number of detected photons in a higher energy band to the number of detected photons in a lower energy band.

For the Classical GRBs the data points will lie in a certain band and naturally there will be some events significantly off from the mean. But here the point is, "SIGNIFICANT" means how much quantitatively ? I think there is no objective criterion for this.

Now, the interesting thing that happened is that when we try to do this for many GRBs, only a few events are found that are significantly off from the mean (along with many such outliers), but these four or five belong to an entirely different class of objects, viz. the SGRs (Soft Gamma Repeaters).

Therefore, it is not at all good to throw away outliers. Instead, very careful and in-depth analysis of these events is required. It will be nice to look for their other characteristics.

2 Recommendations

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

Dear Srikanta,

very good remark. There were some examples mentioned above that when two populations significantly differ in phenotype (e.g. L=Large versus Small), and there is some exchange between the two populations where one individual L ends up in a population of S, L might be defined as an outlier because of biological reasons, not methodological reasons.

So what is attributable to real physics/biology and what is attributable to an experimental/methodological artifact?

1 Recommendation

Mohamed Benmerikhi Ph.D

EDHEC Business School Lille

Intuitively, I guess you can use a simple test. Include the outlier and see what you get, then exclude it and see what you get. If it affects the mean significantly, then it must be eliminated from the sample.

2 Recommendations

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

Dear Mohamed,

Imagine two populations A and B that you would like to compare. Interesting is that your decision to keep or exclude an outlier value in population A will depend on the mean value and SD in population B. If there is a large difference in mean between population A and B you will decide not to exclude the outlier from population A because the conclusion that population B is larger than population A will not be altered when analyses include versus exclude the outlier from population A. However, when there is a small difference between the two populations, analyses with versus without the outlier from population A may change conclusions.

Thus, in this case, it is not the population that contains the outlier that will decide whether you keep or exclude it from the analysis....

1 Recommendation

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

Personality profiles and selection/exclusion of outliers?

Data analyzers inspecting tables or figures may decide to exclude from data sets unusual data points named ‘outliers’. The outcome of statistical analyses will probably differ in analyses with versus without outlier data points. The identification of outliers may depend on statistical rules taking observed variation into account (e.g. points exceeding or not exceeding standard deviation values, which may be sample-dependent) or familiarity with model species or model systems (e.g. data points considered to be biologically impossible, perhaps caused by copy errors). However, underlying mechanisms creating outlier data points are usually unknown. People might always find published or unpublished arguments to exclude or keep ‘unusual’ data points in analyses. Perhaps selection of data points will depend on baseline knowledge of data analyzers and, why not, personality profiles, like more or less critical people independent from education background.

3 Recommendations

Regina Wikinski

Universidad de Buenos Aires

If you find data that are more than 3 SD above or below the usual data points, a second review of the original data in your laboratory notes is useful, to see whether you can find an explanation for the appearance of this unique point, which may originate from a methodological or copy error. In this case you can regard this unique point as an outlier. Of course, data obtained in duplicate or triplicate are useful to find logical outliers. Comparison of the A group with the possible outlier against the B group without it is a simple way to detect outliers, but a set of "outlier" points may be the result of another unknown biological mechanism and a source of doubt about the first working hypothesis. After writing my answer, I see that I agree with other RG members.

2 Recommendations

Texas Children's Hospital

In reference to Regina's comment, simply plotting the data in a graph can show you how extreme any potential outliers really are. If you have a lot of data points, there will always be some that are beyond +/- 3 standard deviations from the mean. So you cannot automatically consider anything beyond three standard deviations as an outlier. (If you remove such points, you will then have a new mean and standard deviation, with new points that are beyond 3 standard deviations!) Something may look like an outlier in a small sample, but as the sample gets larger it becomes less like an outlier and simply fits somewhere on the bell-shaped curve, in a normally distributed sample.
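The claim that large samples always contain points beyond 3 standard deviations can be quantified for normally distributed data. The sketch below computes the expected number of such points for a given sample size and compares it with a simulated sample; the function names, the seed and the sample size are arbitrary choices made for illustration.

```python
import random
from statistics import NormalDist, mean, stdev

def expected_beyond(n, k=3.0):
    """Expected number of points farther than k SD from the mean
    in a normally distributed sample of size n."""
    tail = 2.0 * (1.0 - NormalDist().cdf(k))  # two-sided tail probability
    return n * tail

def count_beyond(values, k=3.0):
    """Count points farther than k sample SDs from the sample mean."""
    m = mean(values)
    s = stdev(values)
    return sum(1 for x in values if abs(x - m) > k * s)

# Simulated standard-normal sample; the seed only makes the run repeatable.
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

print(f"expected beyond 3 SD, n=20:     {expected_beyond(20):.3f}")
print(f"expected beyond 3 SD, n=10000:  {expected_beyond(10_000):.1f}")
print(f"observed beyond 3 SD, n=10000:  {count_beyond(sample)}")
```

With about 0.27% of a normal distribution lying beyond 3 SD, a sample of 10,000 is expected to contain roughly 27 such points even when nothing is wrong with the data, whereas in a sample of 20 such a point would be genuinely unusual.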

4 Recommendations

Aliakbar Haghighi

Prairie View A&M University

For some basic measures such as the median, I do not ignore any. Other than that, trimmed data with a 5% cut from above and below won't hurt, unless someone can show me a mathematical contradiction.
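A 5% cut from each end corresponds to what is usually called a trimmed mean. A minimal sketch follows, assuming the cut is rounded down to a whole number of observations; `trimmed_mean` and the data are hypothetical.

```python
from statistics import mean

def trimmed_mean(values, percent=5):
    """Mean after dropping `percent` % of the observations from each end.

    The number of dropped points per end is rounded down, so small
    samples may lose no points at all.
    """
    ordered = sorted(values)
    k = (len(ordered) * percent) // 100  # points to drop at each end
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return mean(kept)

# Hypothetical sample of 20 values with one extreme entry.
values = list(range(1, 20)) + [1000]

print(f"plain mean:   {mean(values):.2f}")
print(f"trimmed mean: {trimmed_mean(values):.2f}")
```

Note that the cut removes the lowest value as well as the extreme high one, whether or not it is erroneous. That is the trade-off raised elsewhere in this thread: the extremes may be exactly the observations of interest.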

2 Recommendations

Illinois Institute of Technology

OK to exclude outliers if a) you have good reason (see above) AND b) you clearly state what you are excluding and why (to avoid loss of data, allow for alternate interpretations of data, and otherwise protect readers from being misled).

5 Recommendations

Alberto Enrique García

University of Havana

Outliers should be treated with great care, depending on the nature of the data and on what one knows about the process by which the data were obtained. On some occasions the outlier is simply erroneous data, but on others it represents an important deviation from the average behavior of the sample.

2 Recommendations

Cliff RICHARD Kikawa

Kabale University

Exclusion would not be the first step to think about. Remember that, much as it may be an outlier, it may have a vast influence on the overall outcome. I would suggest that you first ascertain the influence of that particular outlier before you think of exclusion. Obtaining the influence will not be a one-step task; you may have to work out several procedures, since that outlier may as well be influential when combined with some other explanatory variables. In other words, whether to exclude will largely be determined by the level and importance of the model being built.

1 Recommendation

Aliakbar Haghighi

Prairie View A&M University

I think noting exceptions to the general rules will take care of all the concerns mentioned.

1 Recommendation

José Fausto de Morais

Universidade Federal de Uberlândia (UFU)

I exclude the outlier if it is produced by a measurement error; otherwise I prefer to use methods that are less susceptible to outliers.
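As one example of a method that is less susceptible to outliers, the sketch below implements the Theil-Sen estimator, which takes the median of all pairwise slopes instead of minimizing squared errors. This is one robust technique among many, not necessarily the one the poster had in mind; the data are invented, with a discordant last point.

```python
from itertools import combinations
from statistics import median

def theil_sen(xs, ys):
    """Theil-Sen line: median of pairwise slopes, then median intercept."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1  # skip vertical pairs
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept

# Hypothetical data, roughly y = 2x + 1, with one discordant point at x=9.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 16.9, 2.0]

slope, intercept = theil_sen(xs, ys)
print(f"Theil-Sen slope={slope:.3f} intercept={intercept:.3f}")
```

Because only 8 of the 36 pairwise slopes involve the discordant point, the median slope stays close to 2 without any observation being deleted.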

2 Recommendations

Aliakbar Haghighi

Prairie View A&M University

How do you know if it were due to error, Jose?

2 Recommendations

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

Dear Stefan,

I agree with your comments. Everything we call a 'replicated' phenomenon deserves more study, even when it is defined as an outlier.

1 Recommendation

Durham University

In the physical sciences, Chauvenet's Criterion is often used. It is a test based on the Gaussian distribution with the aim of assessing whether one data point which lies many error bars from the mean (an outlier) should be regarded as spurious and hence discarded. This procedure should be applied with care, and one should always ask if there is a possible reason why the outlier occurred.
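A minimal sketch of Chauvenet's criterion as it is commonly stated: a point is flagged when the expected number of observations at least that far from the mean, under a Gaussian with the sample mean and standard deviation, is below 0.5. The use of the sample standard deviation, the function name and the measurements are assumptions made for illustration.

```python
from statistics import NormalDist, mean, stdev

def chauvenet_flags(values):
    """Return a list of booleans, True where Chauvenet's criterion
    would flag the value as a candidate for rejection."""
    n = len(values)
    m = mean(values)
    s = stdev(values)  # sample standard deviation (assumed convention)
    flags = []
    for x in values:
        z = abs(x - m) / s
        # Two-sided probability of a deviation at least this large.
        tail = 2.0 * (1.0 - NormalDist().cdf(z))
        flags.append(n * tail < 0.5)
    return flags

# Hypothetical repeated measurements of the same quantity.
measurements = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.9, 10.1, 10.0, 13.0]

for value, flagged in zip(measurements, chauvenet_flags(measurements)):
    print(value, "flagged" if flagged else "ok")
```

A flag from this test is a prompt to go back to the lab notes, not a licence to delete. The test assumes Gaussian errors and is meant for a single suspect point rather than repeated application.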

1 Recommendation

Sheikh Mohammed Shariful Islam

Deakin University

I would look into the outlier data case by case and try to verify whether the outlier value corresponds with other variables in the same case. For example, if an income is too high, it is reasonable to check whether the person owns land, and so on; that is, whether the data make sense. Often outlier data might not be meaningful to interpret, and the easy techniques mentioned above might be more helpful.

3 Recommendations

Mesay Sata Shanka

Université Internationale de Rabat

In addition to being a problem to be fixed, an outlier can reveal a unique phenomenon that can lead to new theoretical insights.

1 Recommendation

Felix Aguboshim

Walden University

Hi Marcel,

Outlier values can come from poor recording of data during data collection in the field, or during data coding and input in the office, among others. There are cases where it is difficult to identify outliers, and there are cases where they can be easily identified, for example when the possible range of the data is known, as with human age measured in years. Maybe an age of 25 yrs was erroneously recorded as 250 yrs. A human height value of 4.2 meters must be due to error and will definitely be discarded. In this case it is statistically justifiable not to include such data points in the statistical analysis. The researcher must know what is reasonable and justifiable data for his analysis. Plotting a boxplot helps to identify where the outliers are.
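The points a boxplot draws individually are usually those beyond Tukey's fences, i.e. more than 1.5 times the interquartile range below the first or above the third quartile. The sketch below applies that rule to invented ages that include the 250-year entry from the example above; quartile conventions differ between software packages, so the exact fence values are an assumption of this implementation.

```python
from statistics import quantiles

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical ages in years; 250 stands in for a mistyped 25.
ages = [21, 23, 25, 25, 27, 30, 31, 34, 36, 250]

low, high = tukey_fences(ages)
flagged = [a for a in ages if a < low or a > high]

print(f"fences: ({low}, {high})")
print(f"flagged: {flagged}")
```

The rule only says where the unusual values are. Whether 250 is a typing error for 25 still has to be checked against the original records.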

2 Recommendations

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

In other words, you have to know somehow the model system which helps to decide what is 'normal' versus 'not normal', ideally before the data analyses start?

Thanks!

1 Recommendation

Felix Aguboshim

Walden University

Yes, but not necessarily before the analyses start. It could be during the analyses or preliminary analyses. For instance, the boxplot, which is part of the analyses, can suggest to the researcher where outliers might exist in the data. This could be handled before the final analyses begin.

Thanks

1 Recommendation

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

But then you do not know the model system very well, but start to discover it during the course of the analyses?

Wirtschaftsuniversität Wien

Dear Marcel,

You pose an interesting question. It is interesting because you assume to know that a certain observation is an outlier. How do you know that?

In my understanding, an outlier is an observation whose generating process is different from the one of the rest of the data. The first issue: How can we identify an outlier? Various techniques have been designed for this task; ask, e.g., Google for outlier detection. The second issue is your question: How can/should this observation be used in the analysis of the data? Options are: to exclude or not to exclude; or you can use robust methods, which are designed such that the influence of outliers, e.g., on parameter estimates, is limited.

However, the crucial point in this situation is that we should understand the relevance of the outlier: Is it bad data, e.g., erroneously recorded data, is it due to a local change in the data generating process, is it an indication that my understanding of the data generating process is incorrect, etc. The outlier can teach us more about the data generating process than the rest of the data. In any case, the first thing to do when an outlier is indicated: Try to explain and understand how this observation has been generated.

2 Recommendations

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

What is an outlier for individual A is not an outlier for individual B. This makes analyses of the same data set individual-specific?

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

What is an outlier for individual A is not an outlier for individual B. This makes the details of the analyses of the same data set individual-specific and unreplicable across research teams?

Wirtschaftsuniversität Wien

The analysis of data is in any case a subjective task. Considering an observation as an outlier or not, choosing a distribution for the disturbance term of a model, using a parametric or a non-parametric approach to analyse the data: the analyst should give enough metainformation (why is observation A an outlier for me, etc.) to allow the consumer/reader of the results to understand the relevance of the analysis.

1 Recommendation

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

Are all these details allowed to be presented in methods sections of journals, and even when they are presented, do you think that readers are interested in exploring/replicating them?

If you give the same data set to 10 researchers and ask them to analyse that data set to test hypothesis A, do you think that the details of the 10 analyses applied will be exactly the same? And I do not even mention the fact that the 10 researchers might not have access to the same statistical tool/program (e.g. GLIM versus SAS versus R versus Statgraphics versus.......)

It is interesting that statistical tools apparently follow fashions. More than 15 years ago it was GLIM, right? Is GLIM still used today?

https://en.wikipedia.org/wiki/GLIM_(software)

What will be the future of R in 20 years from now?

The University of Arizona

There is no standard and universal definition of "outlier" so you have to first ask what criterion is used to label a value as an outlier. Note that in the case of spatial data, i.e. where each value is associated with a location, being an outlier also relates to where it occurs (in relationship) to values at other locations.

Some statistical computations are moderately insensitive to "outliers" whereas others are very sensitive. Also note that in general the validity of the results of a statistical analysis is always dependent on whether various statistical assumptions are satisfied. For some problems it is possible to design the experiment to ensure that the assumptions are satisfied, but for other data sets it is not.

In some cases the appearance of "outliers" may indicate the sample is from more than one population.

Statistical analyses are often used to make a decision or to take an action. A key question about whether to include or exclude "outliers" is whether that choice would change the decision, and what the consequences of an erroneous choice would be.

It is important and useful to ask about "outliers" but don't expect simple answers that are appropriate in all situations.

4 Recommendations

Indian Institute of Management, Lucknow

Context is critical to the formulation of the research problem. That context often decides the fate of outliers. If, even without outliers apparent in the graphical data analysis, one feels that the boundary data at the minimum and maximum are not valid and follow a different relationship, one may winsorise the data or consider a tobit regression. If we have a paucity of data, we may find outliers showing up significantly and changing the meaning of the relationship between the dependent variable (dv) and the independent variable (iv). Here one has to decide on the meaningfulness of the relationship, including its direction, the correlations and eventually causation. I would strongly advise the use of subject matter expertise, and a caveat stating that outliers are present or that your answer changes in their presence.

Thus one is likely to find the problem of outliers more tractable if sufficient data points are available, in which case the outliers, seen best in graphical analyses including box plots and Q-Q plots, must be removed from the analysis.
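Winsorising, mentioned above, keeps every observation but pulls the most extreme values in to the nearest retained value. A minimal sketch follows, assuming a symmetric 5% limit at each end with the count rounded down; `winsorize` here is a hypothetical hand-written helper and the data are invented.

```python
from statistics import mean

def winsorize(values, percent=5):
    """Clamp the lowest and highest `percent` % of observations to the
    nearest value that is kept. Sample size and order are preserved."""
    ordered = sorted(values)
    n = len(ordered)
    k = (n * percent) // 100  # number of points clamped at each end
    low, high = ordered[k], ordered[n - 1 - k]
    return [min(max(v, low), high) for v in values]

# Hypothetical sample of 20 values with one extreme entry.
values = [12, 15, 15, 16, 17, 18, 18, 19, 20, 21,
          21, 22, 23, 24, 25, 26, 27, 28, 30, 95]

clamped = winsorize(values)

print(f"mean before: {mean(values):.2f}")
print(f"mean after:  {mean(clamped):.2f}")
print(f"range after: {min(clamped)} to {max(clamped)}")
```

Unlike trimming or deletion, the sample size stays at 20, which matters when data are scarce. The cost is that the clamped values are no longer real observations, so the procedure should be reported alongside the results.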

2 Recommendations

Behrouz Ahmadi-Nedushan

Yazd University

Dear Marcel,

Outlier detection is not an easy subject. In regression context, robust regression methods are recommended if there are outliers in the data. There are many references in statistical and data mining journals which deal with outlier detection. I have attached a few:

Hodge, V. J., & Austin, J. (2004). A survey of outlier detection methodologies. Artificial intelligence review, 22(2), 85-126.

Rousseeuw, Peter J., and Annick M. Leroy. Robust regression and outlier detection. Vol. 589. John Wiley & Sons, 2005.

Aggarwal, C. C., & Yu, P. S. (2001, May). Outlier detection for high dimensional data. In ACM Sigmod Record (Vol. 30, No. 2, pp. 37-46). ACM.

Zhang, K., Hutter, M., & Jin, H. (2009, April). A new local distance-based outlier detection approach for scattered real-world data. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 813-822). Springer Berlin Heidelberg.

Aggarwal, C. C. (2015). Outlier analysis. In Data Mining (pp. 237-263). Springer International Publishing.

Akoglu, L., Tong, H., & Koutra, D. (2015). Graph based anomaly detection and description: a survey. Data Mining and Knowledge Discovery, 29(3), 626-688.

Wang, Y., Wang, X., & Wang, X. L. (2016). A Spectral Clustering Based Outlier Detection Technique. In Machine Learning and Data Mining in Pattern Recognition (pp. 15-27). Springer International Publishing.

von Brünken, J., Houle, M. E., & Zimek, A. (2015). Intrinsic Dimensional Outlier Detection in High-Dimensional Data. Technical Report 2015-003E, NII.

Kim, Y. G., & Lee, K. M. (2015). Association-based outlier detection for mixed data. Indian Journal of Science and Technology, 8(25).

Jiang, F., & Chen, Y. M. (2015). Outlier detection based on granular computing and rough set theory. Applied Intelligence, 42(2), 303-322.

http://www.jstor.org/stable/3054624?seq=1#page_scan_tab_contents

http://arxiv.org/pdf/1404.4679

Conference Paper A Comparative Study of Outlier Detection for Large-scale Tra...
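To make the idea of robust regression concrete, the sketch below compares ordinary least squares with the Theil-Sen estimator (the median of all pairwise slopes). Theil-Sen is chosen here only because it is short to write; it is not necessarily the method advocated in the references above. The data and the function names are invented for illustration.

```python
import statistics
from itertools import combinations

def ols_fit(x, y):
    """Ordinary least-squares slope and intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def theil_sen_fit(x, y):
    """Theil-Sen estimator: median of pairwise slopes, then median residual intercept."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    slope = statistics.median(slopes)
    intercept = statistics.median([yi - slope * xi for xi, yi in zip(x, y)])
    return slope, intercept

# Made-up data on the line y = 2x + 1, with one gross outlier at the end.
x = [float(i) for i in range(10)]
y = [2.0 * xi + 1.0 for xi in x]
y[9] = 60.0  # would be 19.0 without the disturbance

ols_slope, ols_intercept = ols_fit(x, y)
ts_slope, ts_intercept = theil_sen_fit(x, y)

print("OLS slope:      ", round(ols_slope, 3))
print("Theil-Sen slope:", round(ts_slope, 3))
```

With this single disturbed point the least-squares slope is pulled to about 4.24, whereas the Theil-Sen slope stays at 2.0. No point was deleted in either fit.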

4 Recommendations

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

All these statistical methods assume somehow that outliers can be treated in a similar way in quite different model systems, whatever the underlying mechanisms of the outliers?

I would be worried if the conclusions of statistical analyses changed when a single data point was removed from the data set.

2 Recommendations

Behrouz Ahmadi-Nedushan

Yazd University

Wikipedia has a good page on outliers:

Outliers can have many anomalous causes. A physical apparatus for taking measurements may have suffered a transient malfunction. There may have been an error in data transmission or transcription. Outliers arise due to changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. A sample may have been contaminated with elements from outside the population being examined. Alternatively, an outlier could be the result of a flaw in the assumed theory, calling for further investigation by the researcher. Additionally, the pathological appearance of outliers of a certain form appears in a variety of datasets, indicating that the causative mechanism for the data might differ at the extreme end (King effect).

Retention

Even when a normal distribution model is appropriate to the data being analyzed, outliers are expected for large sample sizes and should not automatically be discarded if that is the case. The application should use a classification algorithm that is robust to outliers to model data with naturally occurring outlier points.

Exclusion

Deletion of outlier data is a controversial practice frowned upon by many scientists and science instructors; while mathematical criteria provide an objective and quantitative method for data rejection, they do not make the practice more scientifically or methodologically sound, especially in small sets or where a normal distribution cannot be assumed. Rejection of outliers is more acceptable in areas of practice where the underlying model of the process being measured and the usual distribution of measurement error are confidently known. An outlier resulting from an instrument reading error may be excluded but it is desirable that the reading is at least verified.

https://en.m.wikipedia.org/wiki/Outlier

There is another recent question (a more harmonious discussion and without DOWNVOTEs!)

https://www.researchgate.net/post/when_an_outlier_must_be_relevant_to_an_investigation_and_when_should_be_discarded

5 Recommendations

Sarwan Kumar Dubey

Emeritus Scientist ICAR-Central Soil Salinity Research Institute Karnal National Vice President Bhartiya Agro Economic Research Center New Delhi EX Head ICAR-Indian Institute of Soil and Water Conservation (Earlier known as CSWCRTI)

In my view, the researcher must try to find the reasons why a point is outlying instead of excluding it. Thanks.

2 Recommendations

Behrouz Ahmadi-Nedushan

Yazd University

A good article:

Wiggins, B. C. (2000). Detecting and Dealing with Outliers in Univariate and Multivariate Contexts.

http://eric.ed.gov/?id=ED448189

4 Recommendations

George Kapitsopoulos

University of South Wales

I am currently doing a project using MANOVA, and in my data set I have an outlier that skews the data by a rather large amount. However, it doesn't really affect the outcome: my test is still not significant. If I exclude the value, p is much closer to .05, but still non-significant. Any ideas?

2 Recommendations

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

Present the two analyses and conclude it is P>0.05?

1 Recommendation

University of Southern California

To expand a bit on some of the answers you got: if there are values that are outliers among the dependent variable, but valid, removing them and applying standard methods to the remaining data results in using the wrong standard error. There is a vast literature on how to do this in a technically sound manner, and it is easily demonstrated that this can make a substantial difference in the conclusions reached. For details see

Wilcox, R. R. (2017). Introduction to Robust Estimation and Hypothesis Testing, 4th Edition. San Diego, CA: Academic Press.
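As a rough illustration of the standard-error point made above, the Python sketch below computes a trimmed mean together with the Tukey-McLaughlin standard error, which is based on the Winsorized standard deviation, and compares it with the naive standard error obtained by discarding the extremes and then applying the usual formula to what is left. The data, the function names and the 20% trimming proportion are assumptions made for this example.

```python
import math
import statistics

def trimmed_mean_and_se(values, gamma=0.2):
    """gamma-trimmed mean with its Tukey-McLaughlin standard error.

    SE = s_w / ((1 - 2*gamma) * sqrt(n)), where s_w is the Winsorized
    sample standard deviation and n is the ORIGINAL sample size.
    """
    x = sorted(values)
    n = len(x)
    g = math.floor(gamma * n)      # number of values trimmed from each end
    kept = x[g:n - g]
    tmean = statistics.fmean(kept)
    # Winsorized sample: trimmed values are replaced, not discarded.
    winsorized = [x[g]] * g + kept + [x[n - g - 1]] * g
    s_w = statistics.stdev(winsorized)
    se = s_w / ((1 - 2 * gamma) * math.sqrt(n))
    return tmean, se

def naive_se_after_deletion(values, gamma=0.2):
    """Usual SE formula applied to the retained values as if they were the full sample."""
    x = sorted(values)
    n = len(x)
    g = math.floor(gamma * n)
    kept = x[g:n - g]
    return statistics.stdev(kept) / math.sqrt(len(kept))

# Made-up measurements with one extreme value.
data = [2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 3.4, 19.0]
tmean, proper_se = trimmed_mean_and_se(data, gamma=0.2)
naive_se = naive_se_after_deletion(data, gamma=0.2)

print("20% trimmed mean:   ", round(tmean, 3))
print("Tukey-McLaughlin SE:", round(proper_se, 3))
print("naive SE:           ", round(naive_se, 3))
```

For these values the trimmed mean is 2.9; the Tukey-McLaughlin standard error is about 0.18, while the naive calculation gives about 0.12. The naive figure is too optimistic here because it ignores that the retained values were selected from a larger, ordered sample.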

3 Recommendations

Wirtschaftsuniversität Wien

Congratulations to Rand on the appearance of the 4th edition of this successful book. Otherwise, I wonder whether anything new can be expected after 1.5 years of discussion of the outlier issue. Does somebody know of any mechanism to close the "question", ideally with a summary of the most relevant answers?

2 Recommendations

George Kapitsopoulos

University of South Wales

What about unequal sample sizes? I am trying to run a MANOVA with two levels on the IV. I have n=14 on one, and n=36 on the other...

1 Recommendation

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

Take all at n=14 or include only those with n=36 and see if the results change. Here again, present the two analyses

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

To correct for multiple tests on the same data set? How often do people analyse their own data sets in private (e.g. trying out several analyses to see what happens) before exposing often only one analysis to the public? How to handle situations where individual A analyses data set A in year 1 and individual B analyses data set A in year 2, perhaps also using different methods of analysis in different years, as is often the case in long-term studies focusing on a single study population?

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

I don't know how to handle this at the between-publication level. For instance, analyses of data set X from long-term study X have been presented in 10 publications, each adding one more study year or one more factor. In addition, the data from factor X have been used 10 times, whereas the data from factor Y only once... You also 'peek' at the data just by reading the former publications dealing with the same data set...

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

If you sample data, calculate a pvalue, sample data again, then calculate a pvalue again, you need to correct.

This occurs across publications dealing with the same long-term data set, right? You will not correct your p value in publication 3 based on the p value reported in a former publication 2, will you?

Example

In a long-term data set, you 'sample' every year in the same population/study plot.

In publication 1, you analyse the sample from years 1995-2000 and calculate a p value. In publication 10, you analyse the sample from years 2014-2015 and calculate a p value. Will you correct the p value in publication 10 because there is the published p value from publication 1?
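For readers who have not seen what "correcting" looks like in practice, here is a small Python sketch of two standard adjustments for several p values obtained on the same data: Bonferroni and Holm. The post does not say which correction is meant, so both are shown as assumptions, and the three p values are invented. The sketch does not settle whether p values from separate publications should be pooled this way.

```python
def bonferroni_adjust(p_values):
    """Multiply each p value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]

def holm_adjust(p_values):
    """Holm step-down adjustment; returns adjusted p values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Invented p values, e.g. from three analyses of the same long-term data set.
p_values = [0.04, 0.01, 0.03]
bonf = bonferroni_adjust(p_values)
holm = holm_adjust(p_values)

print("Bonferroni:", [round(p, 3) for p in bonf])
print("Holm:      ", [round(p, 3) for p in holm])
```

With these invented values, only the smallest p value remains below 0.05 after either adjustment (0.03), while the other two rise to 0.06 under Holm and to 0.09 and 0.12 under Bonferroni.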

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

If it is all about 'philosophy', statistical specialists that do not know the model system can get it all wrong? Some start to use one-tailed tests simply based on the results and hypotheses of former publications?

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

Perhaps have a p value of p<0.0001 to be safe with the conclusions?

Colquhoun2014.pdf (759.01 KB)
lambrechts-pub2016juillet2016.docx (462.26 KB)

Mehmet Guven Gunver

Istanbul University

Indeed, NEVER. Try our new paper:

Article TO DETERMINE SKEWNESS, MEAN AND DEVIATION WITH A NEW APPROAC...

1 Recommendation

Sergiy Prykhodko

Admiral Makarov National University of Shipbuilding

It is justifiable to exclude 'outlier' data points from statistical analysis at a significance level of 0.005 or less, according to R.A. Johnson and D.W. Wichern (2007), Applied Multivariate Statistical Analysis. However, choosing the significance level for outlier detection is one of the problems. The second problem is that well-known statistical methods for detecting outliers in a data set assume that the data are generated by a Gaussian distribution, but this assumption is valid only in particular cases. The second problem can be solved by applying normalizing transformations. For example, follow the links below.

Conference Paper Statistical anomaly detection techniques based on normalizin...

Conference Paper Detecting bivariate outliers on the basis of normalizing tra...

Conference Paper Multivariate Outlier Detection Technique Based on Normalizin...

2 Recommendations

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

What to do in the framework of studies that wish to promote exact replication (see Kelly 2006)?

Exact replication implies the use of the same study system, the same model species, the same methods in different studies. What to do if one study presents data distribution A (with or without an outlier) and the other study aimed to be an exact replicate presents data distribution B (with or without an outlier)? Should statistical methods also be replicated in exactly the same way in different studies aimed to promote exact replication, whatever the data distributions of the different samples?

PS: 'Exact replication' becomes impossible given the fast evolution of techniques and methods. Given the fast technical advancements during the last 20 years (e.g. to identify outliers), will the methods and analyses conducted today be outdated tomorrow, and if so, why do the analyses today if we know they will be outdated tomorrow?

Kelly2006e.pdf (138.93 KB)

Qazvin Islamic Azad University

An interesting abstract topic! I wished I could follow it twice!

1 Recommendation

The University of Arizona

The original question really pertains to a very specific data set which none of the responders have access to. The various responses are conditioned on specific but different data set(s) that the responder has encountered or abstract circumstances that are not applicable to most data sets. Thus none of the responses, including my own, really provide any reliable guide to the query. It is like saying we have a variety of statistical tools at hand, hunt around among them maybe one of them would be useful but there is no way to tell (perhaps it should be one we don't know about).

Linas Balciauskas

Nature Research Centre

I re-read the whole thread today... why?

In my data, one outlier was found, and I was advised to exclude it.

I recalculated the statistics: the conclusions are the same, with differences near p=0.05 and p<0.05.

However, I have no worries about data error or bad protocol. The outlier is from the species with the widest trophic niche.

I will not exclude the outlier. I think that, with more data, it may turn out to be not an outlier but a normal value.

University of KwaZulu-Natal

On a different note, in qualitative studies, outliers are very important for us. They provide new insights and therefore we do not exclude them.

1 Recommendation

Osama Rahil Shaltami

University of Benghazi

I recommend Mehmet's answer

Best Regards

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

If the results of your statistical analyses will depend on the presence/absence of some unusual data points, how reliable/robust are the interpretations of these results? Does this imply that your data set is not big enough?

PS: People spend a lot of time tracking one 'error' in their data set, but at the same time ignore the errors that have been committed during data sampling and data interpretation.

1 Recommendation

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

Accepting that your statistical results will depend on the statistical tool used (e.g. SAS vs R), e.g. to discover subtle effects, what will be the probability to find the same results in a new study?

Linas Balciauskas

Nature Research Centre

I think, that results should not depend on the tool used

2 Recommendations

António Manuel Abreu Freire Diogo

University of Coimbra

“Data analyzers inspecting tables or figures might decide to exclude from statistical analyses unusual data points sometimes called 'outlier' data points.”

Any scientific data must represent real phenomena. What is a data analyzer that inspects tables and figures? Should such an expert not rather investigate the phenomena under consideration?

1 Recommendation

The University of Arizona

Linas

You say you found one "outlier", what criteria did you use to determine it was an outlier?

One that comes to mind (but which would almost never be usable) is that you have non-statistical evidence that the value is incorrect, e.g. a typo, a mis-reading of an instrument, an impossible value (violating a physical condition).

Certainly for a given data set, an investigator may know enough about possible outcomes to recognize a data value as being suspicious but that is not the same as an outlier.

1 Recommendation

Linas Balciauskas

Nature Research Centre

statistical criteria, value out of range.

If there is non-statistical evidence, I'd call it a "mistake", not an outlier, and such values should definitely be excluded.

1 Recommendation

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

Why not accept that a statistical outlier might be caused by an identified unusual natural event, e.g. an unusual weather event, an unusual mutation event, an unusual interspecific interaction, etc.?

Linas Balciauskas

Nature Research Centre

It has stable isotope value. Of course, deviations are understandable and acceptable

1 Recommendation

Texas Children's Hospital

You should familiarize yourself with the statistical models that you will apply to your data. Sometimes outliers can cause the statistical assumptions of a procedure to be violated. The most obvious is a violation of normality (i.e., having a non-normal distribution): an outlier will skew your data, can cause residuals (from regression) to be non-normally distributed, and of course can cause a mean to be misleading. But there are statistics for describing non-normality too: medians, tests of normality and tests of skewness, to name a few. It is often best to do your statistics both with and without the outlier(s) and see if the results are appreciably different. Sometimes you can do a procedure anyway and simply note the presence of outliers and how they could affect the statistical validity or interpretation of the results. Most readers of research study results will want some kind of familiar statistical analysis to be presented, even if the data violate the required assumptions to some degree. One way to use outliers constructively is to incorporate them into the analysis of a trend. Outliers may ruin linear regression results but might be perfect for an exponential curve or a test of trend.
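The advice to run the analysis both with and without the outlier(s) can be sketched in a few lines of Python. The example below recomputes a Pearson correlation after leaving out each observation in turn and reports which single point changes the result most. The data and the function names are made up for illustration, and a correlation is used only as a stand-in for whatever statistic the study actually reports.

```python
import math

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def leave_one_out(x, y):
    """Correlation recomputed with each observation removed in turn."""
    return [
        pearson_r(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        for i in range(len(x))
    ]

# Made-up data: seven unremarkable points and one far from the rest.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0]
y = [2.0, 1.0, 3.0, 2.0, 3.0, 1.0, 2.0, 12.0]

r_full = pearson_r(x, y)
r_loo = leave_one_out(x, y)
# Index of the observation whose removal changes r the most.
most_influential = max(range(len(x)), key=lambda i: abs(r_full - r_loo[i]))

print("r with all points:       ", round(r_full, 3))
print("r without point", most_influential, ":     ", round(r_loo[most_influential], 3))
```

In this made-up case the correlation is about 0.92 with all eight points and essentially zero once the last point is left out, so the whole conclusion rests on one observation. Reporting both numbers, as suggested above, lets the reader see that dependence.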

1 Recommendation

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

But outliers are also part of 'nature'....

1 Recommendation

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

Let's discuss when to decide to exclude outliers.

Example:

Outliers decrease the probability of finding statistically significant results. Suppose you analyse a large data set that includes outliers and obtain statistically highly significant results: will you decide to exclude the outliers? And suppose you analyse a small data set that includes outliers and obtain statistically non-significant results: will you decide not to exclude them?

Marcel M. Lambrechts

Centre d'Ecologie Fonctionnelle et Evolutive

INTERLUDIUM

Daniel Meurois et Anne Givaudan. Récits d'un voyageur de l'astral. J'ai Lu, Aventure secrète

Page 222: Science is ... neutral…. Only the one who wields it is not.

Science is neutral. Only those who practise it are not.

1 Recommendation

Parakrama Sunil Dharmapala

Lone Star College System

If it is not a human or machine error, and by excluding it from the data set the solution does not deviate drastically, I would keep it, as there is no perfect criterion to define an Outlier. One must weigh the Risk of losing information (by excluding) against Deviation from estimated solution.

1 Recommendation

Linas Balciauskas

Nature Research Centre

@ Marcel, good point!!! I definitely keep outliers, when possible.

2 Recommendations

Kafri Nihul Ltd.

There are two possible reasons for an outlying measurement. The first is an erroneous measurement and the second is a correct but low-probability measurement. The decision is statistical. First, I would calculate the probability of the "strange outcome" according to the distribution of the other points. For example, if the distribution is Gaussian and the probability of the outlier is 10^-80, the point should be considered an error. However, if its probability is 10^-3 and I measured 10 points, it should not be deleted. In other distributions, such as long-tailed distributions, outliers have a high probability. Therefore I would not delete, e.g., the bank balance of Bill Gates as an error.
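The calculation described above can be sketched in Python as follows. This is only an illustration under strong assumptions: the other points are taken to be Gaussian, their sample mean and standard deviation are treated as if they were the true values, and the data and function names are invented.

```python
import math
from statistics import fmean, stdev

def two_sided_tail_probability(value, reference):
    """P(|X - mu| >= |value - mu|) under a Gaussian fitted to the reference points."""
    mu = fmean(reference)
    sigma = stdev(reference)
    z = abs(value - mu) / sigma
    # erfc keeps precision far out in the tail, where 1 - cdf(z) would round to 0.
    return math.erfc(z / math.sqrt(2))

def chance_of_at_least_one(p, n):
    """Probability that at least one of n independent points is that extreme."""
    # Equivalent to 1 - (1 - p)**n, written to stay accurate for tiny p.
    return -math.expm1(n * math.log1p(-p))

# Made-up reference measurements and two candidate "strange outcomes".
reference = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.0]
p_mild = two_sided_tail_probability(10.6, reference)
p_extreme = two_sided_tail_probability(14.0, reference)

print("mild candidate:    p =", p_mild)
print("  chance among 10 points:", chance_of_at_least_one(p_mild, 10))
print("extreme candidate: p =", p_extreme)
```

With these made-up numbers the mild candidate has a tail probability on the order of 10^-3, and a value that extreme would turn up at least once in roughly 1% of samples of 10 points. The extreme candidate has a probability far below 10^-50, which under the Gaussian assumption points to an error. As the post notes, the same calculation would be misleading for a long-tailed distribution.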

4 Recommendations

James E. McLean

University of Alabama

The simple answer is that you should not exclude outliers unless you determine a specific reason that they are not valid. Any other action would just introduce bias into the process.

6 Recommendations

University of Hull

I agree with James.

1 Recommendation
