BIOSTATISTICS
Year : 2018 | Volume : 4 | Issue : 1 | Page : 60--63
Descriptive statistics
Parampreet Kaur1, Jill Stoltzfus2, Vikas Yellapu1
1 Department of Research and Innovation, The Research Institute, St. Luke's University Health Network, Bethlehem, PA 18015, USA
2 Department of Research and Innovation, The Research Institute, St. Luke's University Health Network; Temple University School of Medicine,
Bethlehem, PA 18015, USA
Correspondence Address:
Dr. Parampreet Kaur
St. Luke's University Health Network, 801 Ostrum Street, Bethlehem, PA 18015
USA
Abstract
Descriptive statistics are used to summarize data in an organized manner by describing the relationship between variables in a
sample or population. Calculating descriptive statistics represents a vital first step when conducting research and should always
occur before making inferential statistical comparisons. Descriptive statistics include types of variables (nominal, ordinal, interval, and
ratio) as well as measures of frequency, central tendency, dispersion/variation, and position. Since descriptive statistics condense
data into a simpler summary, they enable health-care decision-makers to assess specific populations in a more manageable form.
The following core competencies are addressed in this article: Practice-based learning and improvement, Medical knowledge.
How to cite this article:
Kaur P, Stoltzfus J, Yellapu V. Descriptive statistics. Int J Acad Med 2018;4:60-63
How to cite this URL:
Kaur P, Stoltzfus J, Yellapu V. Descriptive statistics. Int J Acad Med [serial online] 2018 [cited 2019 Sep 5];4:60-63
Available from: http://www.ijam-web.org/text.asp?2018/4/1/60/230853
Full Text
Introduction
Quantitative research provides important statistical information to health-care decision-makers that enables them to accomplish tasks
such as budget justification, departmental and network needs assessments, and allocation of medical resources. In addition, health-
care statistics are critical to both quality improvement and product development. Various hospitals measure their performance
outcomes using the results of statistical analysis, as well as implement quality improvement programs to improve their efficiency.
Health-care statistics are also helpful for pharmaceutical and technology companies in developing new products and conducting
market research analysis of their products.[1]
Within the health-care context, as in other sectors, there are two main approaches to statistical methodology: (1) descriptive analysis,
which summarizes raw data from a sample or population and (2) inferential analysis, which draws causative, associative, or other
conclusions from the data. Descriptive analysis is a prerequisite for, and provides the foundation of, inferential statistics.[2]
Variable Type
Before analyzing any dataset, one should be familiar with different types of variables.
Categorical variables (also known as qualitative or discrete) may be further classified as nominal, ordinal, or dichotomous. Nominal
variables, which are the simplest in nature, include two or more categories that lack intrinsic order (e.g., types of wounds: abrasion,
laceration, puncture, or avulsion). Dichotomous nominal variables have only two categories (e.g., male or female). Ordinal variables
have two or more categories that can be ranked or ordered, but there is no objective value to the rankings (e.g., a patient satisfaction
scale with “strongly disagree,” “disagree,” “unsure,” “agree,” and “strongly agree”).
Continuous variables (also known as quantitative or numerical) are further categorized as either interval or ratio. Interval variables
can be measured along a continuum and have a numeric value, but no true zero point (e.g., temperature measured in Celsius or
Fahrenheit). Ratio variables have all the properties of interval variables as well as a true zero point (e.g., height, weight, fasting
glucose).
In addition to variable type, descriptive statistics include measures of frequency, central tendency, dispersion/variation, and position
[Table 1].{Table 1}
Measures of Frequency
Absolute frequency is the number of times a particular value occurs in the data. In contrast, relative frequency is the number of times
a particular value occurs in the data (absolute frequency) relative to the total number of values for that variable. The relative
frequency may be expressed in ratios, rates, proportions, and percentages.
Ratios compare the frequency of one value for a variable with another value for the same variable. For example, in thirty participants,
the ratio of an experimental drug's adverse effects to no adverse effects is 2:28; conversely, the ratio of no adverse effects to adverse
effects is 28:2.
Rate is the frequency of one value for a variable in relation to the entire sample of values within a given period. For example, in a
total of thirty participants, there are 2 who show adverse effects after taking an experimental drug; therefore, the rate of adverse
effects is 2 per 30 participants.
Proportion is the fraction of a total sample that has some value. For example, in a total of thirty participants, with two participants
having adverse drug effects, the proportion of adverse effects is 2/30 ≈ 0.067.
Percentage is another way of expressing a proportion, as a fraction of 100. The percentages across all categories of a variable should always add
up to 100%. For example, in a total of thirty participants, where 2 experience adverse drug effects, 2/30 × 100 ≈ 6.7% of
participants experience adverse effects.
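The four measures of frequency described above can be reproduced with a few lines of Python. This is only a minimal sketch: the list `outcomes` and its labels `"adverse"` and `"none"` are a hypothetical coding of the thirty-participant example, not data from the article.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical coding of the example: 30 participants, 2 with adverse effects.
outcomes = ["adverse"] * 2 + ["none"] * 28

absolute = Counter(outcomes)        # absolute frequency of each value
total = sum(absolute.values())      # total number of observations

# Relative frequency expressed as a proportion and as a percentage.
proportion = {value: count / total for value, count in absolute.items()}
percentage = {value: 100 * p for value, p in proportion.items()}

# Ratio of adverse effects to no adverse effects (2:28), reduced to lowest terms.
ratio = Fraction(absolute["adverse"], absolute["none"])

print(f"ratio as counted: {absolute['adverse']}:{absolute['none']}")
print(f"ratio in lowest terms: {ratio.numerator}:{ratio.denominator}")
print(f"proportion: {proportion['adverse']:.3f}")
print(f"percentage: {percentage['adverse']:.1f}%")
```

Rounded to three decimal places the proportion is 0.067, and the percentages for the two categories sum to 100%.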
The above measures of frequency are often expressed visually in the form of tables, histograms (for quantitative variables), or bar
graphs (for qualitative variables) to make the information more easily interpretable.
Measures of Central Tendency
Central tendency is the value that describes the entire set of data as a single measurement. The three primary measures of central
tendency are the mean, median, and mode.
The following example will be used to demonstrate these three measures.
Sample A (age in years) - 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
Sample B (age in years) - 52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
Mean is the arithmetic average, that is, the sum of values in a dataset divided by the total number of observations. Using the above
example, the sum for Sample A is 54 + 54 + 54 + 55 + 56 + 57 + 57 + 58 + 58 + 60 + 60 = 623; dividing by 11 (total number of
observations) gives a mean of 56.6 years.
The mean should only be reported with interval and ratio data that are normally distributed (i.e., look like a “bell-shaped” curve) since
this measure of central tendency is strongly affected by outliers and skewed distributions.
Median is the middle value in a distribution when the data are ranked in order from highest to lowest (or vice versa). If there is an odd
number of values, the median is the exact middle value; however, if there is an even number of values, the median is the average of
the two middle values. In the example above, the median for Sample A is 57 and for Sample B is (56 + 57)/2 = 56.5.
Since the median is less affected by outliers and skewed distributions, it is the appropriate measure to report when data do not follow
a “bell-shaped” curve. The median should also be reported with ordinal data.
Mode is the most common value in a dataset. In the above examples, the mode for Samples A and B is 54.
Although the mode may be used for both qualitative and quantitative variables, it may not accurately represent the center of the
distribution. Using the above example, the Sample A mode is 54, but the median (the middle of the distribution) is 57 years.
Sometimes, there may be no mode, if all values are different, or no single mode, if the sample is bimodal or multimodal (signifying peaks at two
or more places in the data distribution). In such cases, one may report the mean or median as appropriate.
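As a quick check of the worked values for Samples A and B, the three measures can be computed with Python's standard `statistics` module. This is a sketch, not part of the original analysis; the list `bimodal_example` is a hypothetical dataset added only to show what a sample without a single mode looks like.

```python
import statistics

sample_a = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
sample_b = [52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

for name, ages in (("A", sample_a), ("B", sample_b)):
    mean = statistics.mean(ages)          # sum of values / number of values
    median = statistics.median(ages)      # middle value, or average of the two middle values
    modes = statistics.multimode(ages)    # list of all most-common values
    print(f"Sample {name}: mean={mean:.2f}, median={median}, mode(s)={modes}")

# Hypothetical bimodal data: two values tie for most common, so no single mode exists.
bimodal_example = [54, 54, 57, 57, 60]
print(statistics.multimode(bimodal_example))  # [54, 57]
```

`statistics.multimode` returns every most-common value, which makes ties visible rather than silently reporting only one of them.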
As illustrated previously, the shape of the data distribution may influence the measures of central tendency. When the distribution is
symmetrical (i.e., “bell-shaped”), the mean, median, and mode are all in the middle [Figure 1, center]. When most values cluster at
the low end and a long tail extends toward high values (positive skew), the mode remains the most common value, and the median remains the
middle value, but the mean is pulled toward the right tail of the distribution [Figure 1, right]. When most values cluster at
the high end and a long tail extends toward low values (negative skew), the mean is pulled toward the left tail of the distribution [Figure 1, left].[3]{Figure 1}
Outliers, which are extreme or unusual values, may also influence the measures of central tendency. The mean is more sensitive to
outliers than the median and mode. However, even in the presence of outliers, the mean is still appropriate to report for interval or
ratio data as long as the overall distribution is normal/“bell-shaped.”
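The different sensitivity of the mean and median can be seen by adding one extreme value to Sample A. In this sketch, the age of 95 is a hypothetical outlier chosen purely for illustration; it does not appear in the article's data.

```python
import statistics

sample_a = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
with_outlier = sample_a + [95]   # hypothetical extreme value, for illustration only

# How far each measure moves once the outlier is included.
mean_shift = statistics.mean(with_outlier) - statistics.mean(sample_a)
median_shift = statistics.median(with_outlier) - statistics.median(sample_a)

print(f"mean:   {statistics.mean(sample_a):.2f} -> {statistics.mean(with_outlier):.2f}")
print(f"median: {statistics.median(sample_a)} -> {statistics.median(with_outlier)}")
```

With this particular outlier the mean rises by about 3.2 years, while the median stays at 57.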
Measures of Dispersion/Variation
Although measures of central tendency provide important information when describing one's data, they fail to capture variability
within a dataset.[4] Measures of dispersion/variation describe the degree to which a variable's values are similar or diverse. This type
of measure only applies to ordinal, interval, and ratio data that can be ranked and includes the range, variance, and standard
deviation.
The range is the difference between the lowest and the highest values in a dataset. For example, the range of Sample A above is 6
(60 − 54 = 6), while the range of Sample B is 8 (60 − 52 = 8).
The variance and standard deviation are measures of spread that reveal how close each observed value is to the mean of the entire
dataset; the standard deviation is the square root of the variance and is therefore expressed in the same units as the data. In datasets
with small spread, all values are close to the mean, yielding smaller variance and standard deviation. In contrast,
datasets with greater spread of values away from the mean have larger variance and standard deviation. At the extreme, if all values of a
dataset are the same, the variance and standard deviation will be zero.
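The range, variance, and standard deviation of the two samples can be sketched as follows. The article does not state which variance formula it has in mind, so this example assumes the sample convention (dividing by n − 1), which is what `statistics.variance` and `statistics.stdev` implement; `describe_spread` and `constant` are hypothetical names introduced for illustration.

```python
import statistics

sample_a = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
sample_b = [52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

def describe_spread(values):
    """Return (range, sample variance, sample standard deviation)."""
    value_range = max(values) - min(values)
    variance = statistics.variance(values)  # assumed convention: divides by n - 1
    std_dev = statistics.stdev(values)      # square root of the sample variance
    return value_range, variance, std_dev

for name, ages in (("A", sample_a), ("B", sample_b)):
    value_range, variance, std_dev = describe_spread(ages)
    print(f"Sample {name}: range={value_range}, variance={variance:.2f}, SD={std_dev:.2f}")

# Hypothetical dataset with identical values: there is no spread at all.
constant = [57, 57, 57, 57]
print(describe_spread(constant))
```

For a whole population rather than a sample, `statistics.pvariance` and `statistics.pstdev` (dividing by n) would be the corresponding calls.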
In a normally distributed dataset, approximately 68% of the values are within one standard deviation on either side of the mean, approximately 95% of values are
within two standard deviations, and approximately 99.7% of values are within three standard deviations.[4]
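The proportions quoted above are rounded figures for an ideal normal distribution. A short sketch using `statistics.NormalDist` from the Python standard library shows where they come from; the dictionary name `coverage` is an illustrative choice.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Probability mass lying within k standard deviations on either side of the mean.
coverage = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} SD: {share:.1%}")
```

The same proportions hold for any normal distribution, whatever its mean and standard deviation, because they depend only on the distance from the mean measured in standard deviations.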
Measures of Position
Determining the position of values in a dataset may be accomplished in three main ways.
Percentiles divide the dataset into 100 equal sections, deciles divide it into ten equal parts, and quartiles divide an ordered dataset
into four equal parts. The differences between percentiles and quartiles are minor and often disappear with a large number of values
in a dataset. One may clearly see how they are associated as follows:
The lower quartile, Q1 (25th percentile), is the point between the lowest 25% and highest 75% of values. The second quartile, Q2
(50th percentile), is the median (middle of the dataset). The upper quartile, Q3 (75th percentile), is the point between the lowest 75%
and highest 25% of values. If the quartile falls between two values, the average of those values represents the quartile value. Using
the previous example, in Sample B, Q1 is 54 ([54 + 54]/2 = 54); Q2 is 56.5 ([56 + 57]/2 = 56.5); and Q3 is 58 ([58 + 58]/2 = 58).
The interquartile range is the difference between the upper and lower quartiles and describes the middle 50% of values when
ordered from lowest to highest. It is considered a better measure of dispersion than the range, as it is far less affected by outliers. For
example, in Sample B, Q3 − Q1 is 4 (58 − 54 = 4).
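The quartile values for Sample B can be reproduced with a small function. Several conventions exist for computing quartiles; this sketch assumes the "median of each half" approach, which appears to match the worked example, and the function name `quartiles` is hypothetical.

```python
import statistics

sample_b = [52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

def quartiles(values):
    """Return (Q1, Q2, Q3) as the medians of the lower half, the whole dataset, and the upper half.

    With an odd number of values, the middle value is left out of both halves
    (an assumed convention; other conventions exist).
    """
    if len(values) < 2:
        raise ValueError("need at least two values to compute quartiles")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[-half:]
    return statistics.median(lower), statistics.median(ordered), statistics.median(upper)

q1, q2, q3 = quartiles(sample_b)
iqr = q3 - q1  # interquartile range: spread of the middle 50% of values
print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
```

Library routines such as `statistics.quantiles` interpolate between values and may give slightly different quartiles for other datasets, so it is worth stating which method was used when reporting them.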
Box plots are often useful for interpreting descriptive data in graphical form.[4] As seen in [Figure 2], box plots are constructed using
the 25th percentile (lower quartile), the median (50th percentile), the 75th percentile (upper quartile), the minimum data value, and
the maximum data value. Box plots also show outlier values.{Figure 2}
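The five values that define a box plot can be computed directly. The article does not say how outliers are identified, so the sketch below assumes the common convention of flagging values more than 1.5 × IQR beyond the quartiles; that rule, the function name `box_plot_summary`, and the added age of 95 are all assumptions made for illustration. Quartiles here come from `statistics.quantiles` with its default method.

```python
import statistics

sample_b = [52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

def box_plot_summary(values, fence_factor=1.5):
    """Return ((min, Q1, median, Q3, max), outliers) for a list of numbers."""
    ordered = sorted(values)
    q1, q2, q3 = statistics.quantiles(ordered, n=4)
    iqr = q3 - q1
    # Assumed outlier rule: anything beyond fence_factor * IQR from the box.
    lower_fence = q1 - fence_factor * iqr
    upper_fence = q3 + fence_factor * iqr
    outliers = [v for v in ordered if v < lower_fence or v > upper_fence]
    return (ordered[0], q1, q2, q3, ordered[-1]), outliers

print(box_plot_summary(sample_b))          # no value falls outside the fences
print(box_plot_summary(sample_b + [95]))   # hypothetical extreme age is flagged
```

For Sample B as given, the quartiles agree with the worked values (54, 56.5, and 58) and no value is flagged; once the hypothetical age of 95 is added, it is the only value reported as an outlier.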
Conclusion
Descriptive statistics are a critical part of initial data analysis and provide the foundation for comparing variables with inferential
statistical tests.[5] Therefore, as part of good research practice, it is essential that one report the most appropriate descriptive
statistics using a systematic approach to reduce the likelihood of presenting misleading results.[6] Since the results of statistical
analysis are fundamental in influencing the future of public health and health sciences, the appropriate use of descriptive statistics
allows health-care administrators and providers to more effectively weigh the impact of health policies and programs.[7]
Financial support and sponsorship
Nil.
Conflicts of interest
There are no conflicts of interest.
References
1 Rae C. Why are Statistics Important in the HealthCare Field? Livestrong.com; 2017.
2 Spriestersbach A, Röhrig B, du Prel JB, Gerhold-Ay A, Blettner M. Descriptive statistics: The specification of statistical
measures and their presentation in tables and graphs. Part 7 of a series on evaluation of scientific publications. Dtsch Arztebl
Int 2009;106:578-83.
3 Ali Z, Bhaskar SB. Basic statistical tools in research and data analysis. Indian J Anaesth 2016;60:662-9.
4 Sonnad SS. Describing data: Statistical and graphical methods. Radiology 2002;225:622-8.
5 Laerd Statistics. Types of variables; 2018. Available from: https://statistics.laerd.com/statistical-guides/types-of-variable.php.
[Last accessed on 2018 Apr 04].
6 Huebner M, Vach W, le Cessie S. A systematic approach to initial data analysis is good research practice. J Thorac
Cardiovasc Surg 2016;151:25-7.
7 Peace K, Hsu JP. The Importance of Statistics in Medical Science; 2018. Available from:
https://www.researchgate.net/publication/237518872_The_Importance_of_Statistics_in_Medical_Science. [Last accessed on
2018 Apr 04].