METHODS SERIES
Secondary data analysis to answer questions in human biology
Asher Y. Rosinger (1,2) | Gillian Ice (3,4)
1 Department of Biobehavioral Health, Pennsylvania State University, State College, Pennsylvania
2 Department of Anthropology, Pennsylvania State University, State College, Pennsylvania
3 Department of Social Medicine, Ohio University, Heritage College of Osteopathic Medicine, Athens, Ohio
4 Global Health Initiative, College of Health Sciences and Professions, Athens, Ohio
Correspondence: Asher Rosinger, 219 Biobehavioral Health Building, University Park, PA 16802. Email: arosinger@psu.edu
Funding information: College of Health and Human Development, Pennsylvania State University
Abstract
Despite a growing number of publicly available datasets, the use of these datasets for secondary analyses in human biology is less common compared with other fields. Secondary analysis of existing data offers an opportunity for human biologists to ask unique questions through an evolutionary and biocultural lens, allowing for an analysis of cultural and structural nuances that affect health. Leveraging publicly available datasets for human biology research is a way for students and established researchers to complement their data collection, use existing data for master's and doctoral theses, pilot test questions, and use existing data to answer interesting new questions or explore questions at the population level. Here we describe where publicly available data are stored, highlighting some data repositories and how to access them. We then discuss how to decide which dataset is right, depending on the research question. Next, we describe steps to construct datasets, analytical considerations and methodological challenges, best practices, and limitations depending on the structure of the study. We close by highlighting a number of publicly available datasets that have been used by human biologists and other datasets that may be of interest to the community, including research that has been conducted on some example datasets.
1 | INTRODUCTION
In the last 10 years, the use of large population-based datasets has increased in human biology research; however, secondary analysis of existing data remains less common than in other fields. We define secondary data analysis as the analysis of existing datasets that the researcher conducting the analysis had no hand in designing or collecting, and without collaboration with the original study team. Secondary analysis of existing datasets has become common practice in most academic fields, including public health, education, economics, genetics, social work, nursing, nutrition, primatology, and paleontology (Doolan & Froelicher, 2009; Greene, Garmire, Gilbert, Ritchie, & Hunter, 2017; Steyn et al., 2005; Vartanian, 2010). Typically, the data are used to ask questions not originally posed by the primary investigators. Additionally, researchers in many academic fields use secondary datasets to retest hypotheses and replicate findings from previous studies, which helps the scientific community, as well as to test new hypotheses using the data (Greene et al., 2017).
The use of secondary datasets in human biology and biological anthropology has become increasingly common in recent years, notably in reanalyses of Boas' database of skull sizes to retest his hypothesis that the environment affected head size (Boas, 1912; Gravlee, Bernard, & Leonard, 2003; Sparks & Jantz, 2002). Another notable existing data source used by anthropologists is the Human Relations Area Files (HRAF). This cultural anthropology resource, which has data dating to the early 20th century, has been used to assess evolutionary questions related to nutrition, diet, and obesity (Brown & Konner, 1987) and, more recently, stress and resource sharing (Ember, Skoggard, Ringen, & Farrer, 2018).
Secondary analysis of existing data offers a number of opportunities to answer scientific questions in disparate fields. The primary advantages of using existing data are that they are inexpensive, require less time, and offer large and often representative samples (Cheng & Phillips, 2014; Doolan & Froelicher, 2009; Grady, Cummings, & Hulley, 2013; Okafor, Chiejina, de Pretis, & Talwalkar, 2016). Moreover, as the field of human biology has matured, questions of interest have been asked with different levels of precision, in some cases yielding multiple years of data at different time points within the same site. These data can be used by human biologists to ask new questions or test developing conceptual models. However, it is important to note that using big data is incredibly time-intensive, as it is critical to learn the nuances of the dataset (e.g., manuals, questionnaires, skip logic used, data transformations, coding, and analytic considerations), and is not necessarily easier than primary data analysis. Secondary data analysis can be conducted on any existing data, including clinical records, administrative data, program evaluation data, and so forth; however, in this toolkit we focus on large, publicly available, research-based datasets with biomarker data.
Received: 26 September 2018 Revised: 3 January 2019 Accepted: 18 February 2019
DOI: 10.1002/ajhb.23232
American Journal of Human Biology
Am J Hum Biol. 2019;e23232. wileyonlinelibrary.com/journal/ajhb © 2019 Wiley Periodicals, Inc. 1of19
https://doi.org/10.1002/ajhb.23232
The inclusion of bioindicators or biomarkers in household surveys, which are often drawn on in secondary data analysis, has provided a wealth of data relevant to human biological research. Although collecting biomarkers in survey research is common practice today, discussion of the cost-benefit, ethical, and logistical considerations of how to incorporate these biomarkers alongside household surveys, and of what types of questions could be answered from them, gained steam around 2000 (NRC, 2001). These biodemography surveys incorporated methodological advancements from the human biology toolkit, including anthropometrics, dried blood spots, salivary sampling, urine samples, handgrip strength, and environmental sampling, drawing on extensive experience working in nonclinical settings (NRC, 2001). These bioindicators are the bread and butter of human biology research, and it is often these types of data that researchers examine in secondary data analysis.
Existing datasets are helpful because they allow questions to be examined using previously collected data, which reduces the time from question generation to findings, while also serving as a useful way to examine new exploratory questions (Gettler, Sarma, Gengo, Oka, & McKenna, 2017; Rivara & Miller, 2017). Analyzing existing data in new ways with human biological perspectives has the potential to add critical scientific knowledge. For example, existing data can be used to test how rapidly changing environments affect risk of obesity in transitioning populations (Thompson et al., 2014), or to ask interesting questions surrounding insect eating in a global context using data from the World List of Edible Insects in a novel way (Lesnik, 2017). These questions often require data that far exceed the capability of a single researcher to collect.
When datasets across years and sites are combined, they provide a powerful resource to compare populations and ask critical environmental questions related to human adaptation, body size, and shape (Hruschka, Hadley, & Brewis, 2014). In particular, questions related to inter-generational patterns of health and epigenetics can only be addressed after data on multiple generations within a cohort have been collected (e.g., the Cebu Longitudinal Health and Nutrition Study). Nationally representative datasets, such as the Demographic Health Surveys (DHS) or surveys sponsored by the US federal government, the World Health Organization, or individual countries or regions, not only offer an opportunity to ask country-specific and cross-national questions but can also be used as comparative tools against which researchers can contrast smaller-scale studies (Raichlen et al., 2017). For example, many human biologists work in remote settings where obtaining large sample sizes is a challenge due to small population sizes and expensive biomarker work or time-intensive data collection protocols (Pontzer et al., 2015; Raichlen et al., 2017). Existing datasets often have much larger sample sizes that provide the statistical power to detect even modest effects in hypothesized relationships.
Secondary analysis of existing data may be particularly useful for students who are on a shorter timeline or have limited funding. For example, honors theses can reanalyze faculty data or publicly available databases. Master's students rarely have sufficient time to collect original data, and thus secondary analysis offers the opportunity to produce a publishable article in the shortened time frame. While human biology/anthropology programs typically have the underlying assumption that research should come from original data sources and field research, secondary analysis can be used to bolster grant applications with preliminary data or allow for additional publications to strengthen a curriculum vitae. Learning the limitations of big data is an important reason to encourage students to cultivate their own field site. But it is important to consider to what degree all research has to come from data collected in their own site. Coupling primary research with data collected by larger surveys with more systematic training and seasoned expertise can provide additional perspective or scale to research questions. Secondary analysis offers similar benefits for faculty who are between grants or looking to strengthen grant applications. In addition, various sections in the National Institutes of Health provide grant funding opportunities for secondary analyses of existing data within the section's lens, for example, obesity research for the National Institute of Diabetes and Digestive and Kidney Diseases.
The objective of this methods toolkit article is to provide a basic guide to secondary analysis of existing datasets by describing tools, techniques, and best practices for using these data, though many of the tips are relevant for primary data analysis as well. This article aims to provide an overall picture of the process of secondary dataset use with first-time users in mind, though the insights and resources will also be valuable to established researchers. This article does not go into great depth on analytical practices, as previous toolkit methods articles have examined this for human growth research (Johnson, 2015) and specific analyses will differ by research question. Instead, we present a primer on where to find datasets, highlighting data repositories, and discuss how to properly align the research question with the dataset, special analytical considerations, examples of secondary analysis of existing data in human biology, and limitations of secondary analysis.
2 | WHERE TO FIND DATASETS? DATA REPOSITORIES
An increasing number of countries collect nutrition, demographic, economic, and health surveys that track the growth, development, aging, weight status, infection risk, sanitation, morbidity and mortality, and health practices of their populations. These key indicators can provide important insights about human biology and can be used to answer many classic and emerging questions surrounding demographic, epidemiologic, and nutritional transitions, as well as to provide snapshots of health and disease. These datasets are often stored in data repositories. Data repositories are online sites that house, or link researchers to, the documentation, descriptions, and datasets of previous or ongoing archived studies (Table 1).
Not all country-level surveys are housed in data repositories. Some nationally representative samples on specific aspects of health may also be collected by governments but require an in-country collaborator and a specific data use agreement to access. Some topical surveys may be shared or adapted across countries, with similar or identical instruments within the survey. For example, the WHO Study on global AGEing and adult health (SAGE) (six countries), the English Longitudinal Study of Aging (ELSA), the Survey of Health, Ageing, and Retirement in Europe (SHARE) (20+ European countries and Israel), the Longitudinal Aging Study of India (LASI), the China Health and Retirement Longitudinal Survey (CHARLS), and The Irish Longitudinal Study on Ageing (TILDA) all share instruments similar to those of the U.S. Health and Retirement Survey (HRS). The Gateway to Global Aging Data (https://g2aging.org/) provides links to the HRS family of studies and descriptions of harmonized datasets to allow for efficient cross-country comparisons. Each study has a different protocol for accessing data.
Federal funders, such as the National Science Foundation and the National Institutes of Health, require that data collected by researchers whose projects they funded in part or in full be made available to other researchers in a way that does not compromise the confidentiality of human subjects, either through an online repository or at a host institution, where other researchers and the public may access or request access to the data (https://grants.nih.gov/grants/policy/data_sharing/; https://www.nsf.gov/pubs/policydocs/pappg17_1/pappg_11.jsp#XID4). De-identification may prove difficult for some studies conducted in small-scale populations, in which case exceptions can be made about data release.
In addition, it is increasingly common for peer-reviewed journals to require that the data and code used in a publication be deposited in a repository to allow reviewers and readers to access the data (e.g., PLOS One, Proceedings of the Royal Society B: Biological Sciences, Scientific Data). These journals often have recommended data repositories and may cover the cost of a deposit of up to 20 GB to their preferred repository (as is the case for Royal Society science journals with Dryad: https://royalsociety.org/journals/ethics-policies/data-sharing-mining/).
Data repositories can be sector-related (i.e., government-funded data), field- or topic-specific, such as aging-related (e.g., Gateway to Global Aging Data, https://g2aging.org/), or institution-specific (e.g., Dataverse). For example, data.gov is the home of all open data collected by U.S. federal agencies and offices. The site provides access to 253 410 datasets (as of June 1, 2018). Researchers can enter search terms based on research interests, such as nutrition, physical activity, or mental health, and all datasets with applicable questions will be listed. The federal government also has a health-specialized open repository, HealthData.gov, which currently houses 2774 datasets and operates in the same way, with researchers entering search terms to find relevant datasets.
Another large institutional data repository is the Inter-university Consortium for Political and Social Research (ICPSR), which includes a variety of social science datasets (https://www.icpsr.umich.edu/icpsrweb/deposit/index.jsp). ICPSR also offers data curation and management services. Its search function links directly to many databases with data documentation, related publications, and data, often available in different formats. The site also contains a number of resources for students and for instruction.
The National Institute on Aging published a document highlighting many of the datasets it has funded, Publicly Available Databases for Aging-Related Secondary Analyses in the Behavioral and Social Sciences (https://www.nia.nih.gov/research/dbsr/publicly-available-databases-aging-related-secondary-analyses-behavioral-and-social).
Data repositories are particularly useful for researchers who have not pre-identified a dataset. They are often searchable by topic as well as by variable. Access and documentation vary by study, so it takes time to sift through and select the appropriate dataset. In the summer of 2018, Google launched a dataset search engine, which allows researchers from disparate fields to search for any dataset (https://toolbox.google.com/datasetsearch). The search provides a basic description of the dataset, links to the site where the data are stored, and some available documentation.
3 | ASSESSING WHICH DATASET IS RIGHT
A plethora of data are available, but several factors should be considered prior to selecting a dataset. First, a literature review will help determine the ideal study design, measures, population, and sample size, and will narrow the selection of datasets. Ultimately, datasets are assessed for strengths and weaknesses relative to the research question (Doolan & Froelicher, 2009). Some questions you might consider are: Does the dataset have the appropriate variables with suitable measurement? Does it have the appropriate sample? Depending on the question of interest, the recency of data collection may be important. Datasets should also be evaluated for levels of missingness and for adequate sample size for any sub-group analysis (e.g., by ethnic group, sex, or age). Some questions may require the combination of more than one dataset.
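As a minimal sketch of the screening step described above, the following Python (pandas) snippet tabulates the share of missing values for candidate variables and the number of complete cases in each sub-group. The toy data frame and its column names (sex, age_group, bmi, crp) are hypothetical and do not come from any of the surveys discussed here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy extract; the column names are illustrative only.
df = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "F", "M"],
    "age_group": ["20-39", "40-59", "20-39", "40-59", "20-39", "20-39"],
    "bmi": [22.1, np.nan, 27.5, 31.0, np.nan, 24.3],
    "crp": [0.8, 1.2, np.nan, 3.4, 0.5, 2.1],
})

# Share of missing values for each candidate analysis variable.
missing_share = df[["bmi", "crp"]].isna().mean()

# Complete cases left in each sub-group the analysis would stratify by.
complete = df.dropna(subset=["bmi", "crp"])
subgroup_n = complete.groupby(["sex", "age_group"]).size()

print(missing_share)
print(subgroup_n)
```

Running the same two checks on each candidate dataset gives a quick, comparable view of whether the variables and sub-groups of interest are adequately covered before more time is invested in a dataset.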
3.1 | The question
The research question derived from theory or conceptual models drives dataset selection and analysis (Magee, Lee, Giuliano, & Munro, 2006; Osborne, 2008). In human biology, a broad array of research questions can be tested using previously collected data. Datasets are evaluated based on the operationalization of the constructs, the availability of appropriate variables, and the level of missingness of important variables. A conceptual match between the research question and the primary dataset is important to minimize possible errors and limitations in study design (Magee et al., 2006).
For example, one research question may be to test how water intake differs by weight status in affecting hydration status among adults (such as in Rosinger, Lawman, Akinbami, & Ogden, 2016). In this instance, theory posits that variation in body composition may dysregulate homeostasis of body water (Stookey, Barclay, Arieff, & Popkin, 2007). Cross-sectional data (i.e., a snapshot of a population or sample at a single time point) from one population may be sufficient to test the association but cannot establish causality. A dataset with variables that measure total water intake with dietary recall, weight status via measured anthropometry, and hydration status via urinary biomarkers is needed to adequately address the question. In addition, the question necessitates variation between individuals in water intake, weight status (i.e., normal weight, overweight, and with obesity), and measured hydration status. Therefore, a dataset that measured a population with a small percentage of overweight participants would not have sufficient power to tease apart a relationship with hydration status.
TABLE 1 Select data repositories, which house publicly available datasets
Name/organization | Purpose | Site | Way to access data
ICPSR: hosted by University of Michigan | ICPSR maintains a data archive of more than 250 000 files of research in the social and behavioral sciences. | https://www.icpsr.umich.edu/icpsrweb/ICPSR/ | University is a member; create a login account
U.S. government open data portal | Provide access to public to all US gov't funded research | https://Data.gov | Web download
U.S. government health data portal | Provide access to public to all US gov't funded health-related research | HealthData.gov | Web download
The Dataverse Project | Dataverse is an open source web application to share, preserve, cite, explore, and analyze research data. It facilitates making data available to others, and replicate others' work more easily. | https://dataverse.org/ | Submit name and institution for immediate download. Many institutions have their own (eg, UNC Dataverse, Harvard Dataverse)
Demographic health surveys (DHS) data portal | DHS collects, analyzes, and disseminates accurate and representative data on population, health, HIV, and nutrition through more than 300 surveys in over 90 countries. | https://www.dhsprogram.com/Data/ | Register and download
Human relations area files (HRAF) | The human relations area files' mission is to facilitate worldwide comparative studies of human behavior, society, and culture. | http://hraf.yale.edu/; http://ehrafworldcultures.yale.edu/ehrafe/; http://hraf.yale.edu/cross-cultural-research/basic-guide-to-cross-cultural-research/ | University is a member
NIH data sharing repositories | NIH-supported data repositories that make data accessible for reuse, specified by NIH Institute or Center and may be searched using keywords so to find repositories most relevant to data. | https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html | Varies by center, some require requests, some can be downloaded without agreement forms
WHO multi-country studies archive | The focus of the unit is on adult health and well-being in lower and middle income countries, engaging in methodological development, primary data collection, secondary data analysis and data dissemination. Recent and current nationally representative surveys include the 2000-01 Multi-Country Survey Study, 2002-04 World Health Surveys, and 2007-10 Study on global Aging and adult health (SAGE), with ongoing contributions to the World Mental Health Surveys. | http://apps.who.int/healthinfo/systems/surveydata/index.php/catalog | Request data via email
Dryad | Curated resource that makes the data underlying scientific publications discoverable, freely reusable, and citable. | https://datadryad.org/discover?query=&submit=Search#advanced | Download on web
Note: Weblinks to data repositories may change over time. If links are broken, search online by data repository name.
Another research question may examine how exposure to infections and pathogens in childhood affects markers of inflammation later in life (such as McDade, Rutherford, Adair, & Kuzawa, 2010). For this question, the conceptual model posits that events in early life, that is, exposures, may affect the development of inflammation and the immune system. Therefore, longitudinal data are needed, that is, data on the same individuals collected at two or more points in time. For this specific question, variables are needed that measure environmental exposures, sanitation, diarrheal episodes during infancy, birth weight, and birth season, as well as variables on the same individuals that measure inflammation through C-reactive protein, and socioeconomic factors. This way, hypotheses regarding how early life exposures drive inflammation profiles can be tested while controlling for individual fixed effects.
3.2 | The sample
Different questions necessitate different sample populations (e.g., geographic location, age groups, sex breakdown) and, therefore, different datasets. Datasets need to have adequate subpopulation sizes to enable testing certain questions. A question about secular changes in age of menarche may examine all women aged 15 years and older in a population, stratified by birth cohort, to capture all females in a population who have already experienced menarche (McDowell, Brody, & Hughes, 2007). On the other hand, a question about how exposure to endocrine-disrupting chemicals affects age of menarche requires an adequate sub-sample of females who recently went through menarche and are living in the same area in order to capture the potential relationship; therefore, a sample of females aged 12 to 16 would be more appropriate (Buttke, Sircar, & Martin, 2012).
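The two sample definitions in the menarche example can be sketched in Python (pandas) as follows. The toy data, column names, and birth-cohort bins are hypothetical choices made for illustration and are not taken from the cited studies.

```python
import pandas as pd

# Hypothetical respondent file; values and column names are illustrative.
people = pd.DataFrame({
    "sex": ["F", "F", "M", "F", "F"],
    "age": [13, 45, 14, 16, 62],
    "birth_year": [2005, 1973, 2004, 2002, 1956],
})

females = people[people["sex"] == "F"]

# Question 1 (secular change): all females aged 15 years and older,
# stratified by birth cohort. The 20-year bins are an assumption.
adults = females[females["age"] >= 15].copy()
adults["birth_cohort"] = pd.cut(
    adults["birth_year"],
    bins=[1939, 1959, 1979, 1999, 2019],
    labels=["1940-59", "1960-79", "1980-99", "2000-19"],
)

# Question 2 (recent exposure): females aged 12 to 16 only (inclusive).
adolescents = females[females["age"].between(12, 16)]

print(adults["birth_cohort"].value_counts().sort_index())
print(len(adolescents))
```

Tabulating the resulting cohort or age-band counts early on shows whether a dataset has enough observations in each stratum for the question at hand.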
When the research question relates to differences across sites or countries in more global analyses, it is important to decide which countries are relevant to the question. For example, with data from the DHS, a researcher must ask whether there is an analytical justification (i.e., geographic region, population size, GDP, prevalence of certain health behaviors) for comparing the selected countries. Other times, some questions necessitate excluding certain countries because demographic factors, like household wealth, differ substantially from other countries in the sample, thereby affecting the comparability of the results (Hruschka et al., 2014).
3.3 | Variables needed
Datasets are evaluated based on the specific variables needed to address a research question. In addition to the main variables of interest, important confounding variables are necessary. The match of variables to the question will also be influenced by the methods of measurement. The measures should be appropriate to the research question and collected with a sufficient level of accuracy and detail (Doolan & Froelicher, 2009).
For example, if the question relates to how adult obesity in the US has changed over time, then multiple datasets are potentially useful. The Behavioral Risk Factor Surveillance System (BRFSS) is a large nationally representative dataset that contains self-reported anthropometric variables, such as height and weight, whereas the National Health and Nutrition Examination Survey (NHANES) contains physically measured anthropometrics. Systematic bias is well noted in self-report of these variables, with height usually overestimated and weight underestimated. Self-report is also subject to social desirability bias, which may vary by weight status (Gorber, Tremblay, Moher, & Gorber, 2007). Therefore, while both datasets could be used to address the question, NHANES would likely provide less biased results.
There may not be a perfect fit between the variables in a dataset and those required to answer a question. The question may need to be modified depending on the available data. Required constructs may not have been part of the original analysis but can be measured by combining appropriate variables in a scale and creating a derived variable. For example, original analyses of the Botswana AIDS Impact Survey (BAIS) include assessment of multiple risky sexual behaviors that were reported separately; in secondary analysis, these can be combined to create a scale of risky sexual behavior. Like any scale, these should be examined for reliability and validity.
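One common reliability check for a derived scale is Cronbach's alpha. The sketch below, in Python with NumPy, sums three hypothetical binary items into a scale score and computes alpha from the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score). The item data are invented for illustration and are not BAIS data; alpha addresses internal consistency only, so validity still has to be assessed separately.

```python
import numpy as np

def cronbach_alpha(items) -> float:
    """Cronbach's alpha for a (respondents x items) array of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items in the scale
    item_variances = items.var(axis=0, ddof=1)  # sample variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the summed score
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

# Hypothetical 0/1 indicators for three separately reported behaviors.
responses = np.array([
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
])

scale_score = responses.sum(axis=1)  # derived variable: count of behaviors reported
alpha = cronbach_alpha(responses)
print(scale_score, round(alpha, 3))
```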
The level of analysis, or the unit for which data are collected, will also affect the ability to properly address a research question. For example, the National Health Interview Survey (NHIS) has household-level and individual-level variables; NHANES has individual-level variables but also food- and medication-level variables, where each participant has multiple observations. Asking a question about how individual diets have changed will require individual-level rather than household-level data. However, it may be useful to link household-level data with individual characteristics to gain information on household socioeconomic status or demographics. For this to occur, researchers will need to merge the datasets (detailed in Section 4).
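Linking household-level information to individual records is a many-to-one merge: many individuals share one household row. A minimal pandas sketch, with hypothetical file contents and column names, might look like this.

```python
import pandas as pd

# Hypothetical individual-level and household-level files.
individuals = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "household_id": [10, 10, 20, 30],
    "energy_kcal": [2100, 1850, 2400, 1990],
})
households = pd.DataFrame({
    "household_id": [10, 20, 30],
    "household_income": [35000, 52000, 18000],
})

# validate="many_to_one" makes pandas raise an error if household_id
# is duplicated in the household file, which would silently inflate rows.
linked = individuals.merge(
    households, on="household_id", how="left", validate="many_to_one"
)
print(linked)
```

The individual remains the unit of analysis after the merge; the household variables are simply repeated for every member of the same household.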
3.4 | The timeframe of the data
Depending on the question, a single survey cycle of a cross-sectional study may be adequate to fulfill the needs of the researcher. However, many questions in human biology examine trends or changes in demographic and health characteristics such as mortality, height, weight, age of menarche, lactation duration, and season of birth (Gurven, Kaplan, & Supa, 2007; Malina et al., 2004; Rosinger & Godoy, 2016), and for these types of questions observations from multiple years are necessary. These kinds of data can be combined with population-level characteristics or environmental data to ask unique questions that situate trends in historical, economic, or environmental context. However, many existing data sources are limited by what has been released to the public. For example, a question examining the effects of a policy, such as a tax on sugar-sweetened beverages, on intake would only be answerable with pre- and post-implementation data for the region where the policy went into effect. Finally, the age of the data can be a limitation, as some medical and public health journals (e.g., JAMA, American Journal of Public Health, American Journal of Preventive Medicine) want to publish articles based on the most recently released wave of survey data.
4 | DATASET CONSTRUCTION, STRUCTURE OF DATA, AND ANALYTICAL CONSIDERATIONS
4.1 | Dataset construction and management
Once the data source and dataset(s) have been decided on,
one must construct the dataset that will eventually be used
for analysis in a way that all appropriate variables and years
that are available are combined. Dataset construction can be
a long and onerous process. Dataset construction includes
merging datasets, selecting important variables to maintain,
and possibly variable transformation.
For secondary data analysis, it is important to think through conceptually how key variables that are potentially in different datasets will be combined, that is, merged and/or appended into one larger dataset (Figure 1). Datasets are merged using a unique identifier (or index variable), that is, an identification code unique to each unit (e.g., individual, household, or geographic location) in the dataset. Appending occurs when the same variable(s) present in the main dataset, but for additional subjects (for example, for an additional year or an additional site), are added to the end of the dataset. While merging results in a wider dataset (i.e., one with more variables), appending results in a longer dataset (i.e., one with more observations) (Figure 1).
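The difference between merging and appending can be illustrated with a short pandas sketch. The file names, identifiers, and values are hypothetical; `merge` and `concat` are the pandas counterparts of the merge and append operations described above.

```python
import pandas as pd

# Two hypothetical files from the same survey year, sharing the identifier "id".
demo_2015 = pd.DataFrame({"id": [1, 2, 3], "age": [34, 51, 27]})
body_2015 = pd.DataFrame({"id": [1, 2, 3], "weight_kg": [70.2, 82.5, 64.0]})

# Merging on the unique identifier adds variables: the dataset gets wider.
wave_2015 = demo_2015.merge(body_2015, on="id", validate="one_to_one")
wave_2015["year"] = 2015

# A second, already merged year with the same variables but new participants.
wave_2016 = pd.DataFrame({
    "id": [4, 5],
    "age": [45, 39],
    "weight_kg": [77.1, 58.9],
    "year": 2016,
})

# Appending stacks observations: the dataset gets longer.
combined = pd.concat([wave_2015, wave_2016], ignore_index=True)

print(wave_2015.shape)  # rows unchanged, more columns than either input file
print(combined.shape)   # same columns, more rows
```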
For example, an individual, household, or geographic location in one dataset has an identifier that links the same individual, household, or geographic location to another dataset. Depending on the software, it may be necessary to sort both datasets by the unique identifier before running the merge (Vartanian, 2010). During the merging process, you may also choose to include only specific variables from each dataset in the merged file, and this can be indicated in the merge code, albeit by different methods depending on the software. The simplest and most common merge is a one-to-one merge, but other options (one-to-many, many-to-one, and many-to-many merges) are available for other situations, that is, when there is one record per identifier in one dataset and multiple records per identifier in another dataset (e.g., ecological momentary data, repeated food records for individuals) (Hamilton, 2012).
FIGURE 1 Simplified visual guide to merging and appending datasets
It is useful to merge all datasets for a single survey cycle
or a single site and confirm that the merge was successful,
that is, that the units in one dataset exactly match the same
units in the other dataset. It is important to save copies of the
original datasets to make sure the data were not altered or
observations dropped during the merge. Additional ways to
confirm that the merge was successful are to run correlations
between the same variables and generate scatterplots. Once
this code is written, it is possible to reuse the same code for
additional years and sites by simply changing the file name.
After all merges have taken place, additional years or sites
can be appended, or added, to the main dataset.
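The merge, merge check, and append steps described above can be sketched in Python with pandas. This is a minimal illustration rather than the authors' workflow: the data frames, the column names (`id`, `age`, `bmi`, `cycle`), and the values are all hypothetical.

```python
import pandas as pd

# Hypothetical files from one survey cycle, keyed by the participant identifier "id".
demo = pd.DataFrame({"id": [1, 2, 3], "age": [34, 51, 27]})
exam = pd.DataFrame({"id": [1, 2, 3], "bmi": [22.4, 30.1, 25.7]})

# One-to-one merge (wider dataset). validate raises an error if "id" is
# duplicated in either file; indicator adds a "_merge" column recording
# whether each row was found in the left file, the right file, or both.
cycle1 = demo.merge(exam, on="id", how="outer",
                    validate="one_to_one", indicator=True)

# Merge check: every unit should appear in both source datasets.
assert (cycle1["_merge"] == "both").all()
cycle1 = cycle1.drop(columns="_merge").assign(cycle="A")

# A second (hypothetical) cycle with the same variables, different participants.
cycle2 = pd.DataFrame({"id": [4, 5], "age": [45, 62],
                       "bmi": [28.3, 24.0], "cycle": "B"})

# Append (longer dataset): stack the cycles on top of each other.
combined = pd.concat([cycle1, cycle2], ignore_index=True)
print(combined)
```

Using an outer merge together with the indicator column means unmatched units are kept and flagged rather than silently dropped, which makes the merge check explicit.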
Dataset and variable transformations are sometimes neces-
sary to reduce data into easily analyzed units, for example, if
multiple observations of a variable (e.g., dietary intake) exist
per participant, yet all other variables (e.g., demographic, body
measurements) are measured once. Collapsing a dataset can
provide statistics, like the mean, sum, and so forth, per individ-
ual across a variable that is measured at multiple time points
(Hamilton, 2012). Additionally, it is sometimes necessary to
create an index variable from multiple variables present in the
dataset to assess a latent variable like allostatic load
(Geronimus, Hicken, Keene, & Bound, 2006).
Using an example from NHANES with sugar-sweetened beverage (SSB) intake (Rosinger, Herrick, Gahche, & Park, 2017), we briefly illustrate the steps necessary to collapse multiple observations per participant (Figure 2). The individual foods dietary dataset for NHANES lists all the dietary items (foods and beverages) consumed in a 24-hour period, from midnight to midnight, on the day preceding the physical exam. To assess how many calories or grams of sugar Americans consume from SSBs or any other food or beverage, these variables need to be transformed. First, the food codes for the food or beverage of interest must be identified by combing through the support file, an external database (Dietary Interview Technical Support File: Food Codes) that lists all food and beverage codes reported by any participant in the dietary recall interviews. Once the food codes are decided upon, a variable must be created that flags the food codes that qualify. Next, another variable must be created that sums the calories of the qualifying food codes. The dataset is then collapsed, keeping the maximum value of this summed variable, to generate one observation per participant. This dataset is saved and is then ready to merge with other datasets using the unique identifier of the participant.
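The collapsing steps can be sketched as follows in Python with pandas. The food codes, calorie values, and column names are hypothetical and do not come from the NHANES support file. The sketch also sums within participant directly when collapsing, which replaces the two-step "sum, then keep the maximum" sequence described above with a single grouped sum.

```python
import pandas as pd

# Hypothetical individual-foods file: one row per item reported by a participant.
foods = pd.DataFrame({
    "id":        [1, 1, 1, 2, 2, 3],
    "food_code": [101, 202, 101, 303, 202, 303],
    "kcal":      [150.0, 90.0, 140.0, 250.0, 110.0, 300.0],
})

# Hypothetical set of food codes judged to qualify as SSBs
# (in practice chosen by reviewing the food-code support file).
ssb_codes = {101, 202}

# Step 1: flag rows whose food code qualifies.
foods["is_ssb"] = foods["food_code"].isin(ssb_codes)

# Step 2: calories from qualifying items only (0 for everything else).
foods["ssb_kcal"] = foods["kcal"].where(foods["is_ssb"], 0.0)

# Step 3: collapse to one row per participant by summing within id.
ssb = foods.groupby("id", as_index=False)["ssb_kcal"].sum()
print(ssb)
```

Setting non-qualifying items to zero, rather than filtering them out, keeps participants who reported no SSBs in the collapsed file with an intake of 0, so they are not lost when the file is merged back by identifier.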
4.2 | Analytical considerations
Before data analysis begins, it is important to take note of several considerations. For instance, it is critical to take into account the sampling design of the dataset. Was it a random sample, a cluster, multi-stage, complex design, or a longitudinal design? The analytic plan and the way the analysis is set up depend on the sampling design, and the specific code varies by statistical software; details can be found in the respective reference manuals.
Prior to analyses, researchers may have to use specific commands or code to let the statistical software program know the type of data being used, for instance, by declaring that the dataset is part of a survey or longitudinal data. With survey data, the command indicates that it is a survey dataset, specifying the sample weights (this accounts for the unequal probability of sampling and nonresponse within the sample), the primary-sampling units (this accounts for the design effects of clustering), and the strata (this accounts for design effects of stratification), all of which are critical in the estimation (sample weights are discussed later).
FIGURE 2 A simplified visual guide to collapsing data for analysis
One key difference between the analysis of complex survey data with sample weights and the analysis of non-survey data is how subpopulations are handled when certain individuals are to be excluded (Korn & Graubard, 1999). For example, when estimating a non-weighted regression for adults aged 20 and over who are not pregnant, it is possible to use a qualifier that drops all other observations from the analysis. With complex survey data using sample weights, however, a subpopulation command must be used with the regression command instead of a qualifier. The subpopulation command keeps every observation in the survey design while restricting estimation to the group of interest. Failing to do this will produce incorrect results, typically incorrect standard errors, because the survey design and the weights are not accounted for appropriately.
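The difference between a subpopulation analysis and dropping observations can be illustrated with a small hand-rolled sketch in Python. It assumes a stratified design with primary sampling units (PSUs) sampled with replacement and uses a Taylor-linearized variance for a weighted mean; the function `domain_mean_se`, the toy data, and all column names are hypothetical. Real analyses should rely on the survey procedures of an established statistical package rather than on this illustration.

```python
import numpy as np
import pandas as pd

def domain_mean_se(df, y, w, strata, psu, domain):
    """Weighted mean of y within a domain, with a linearized standard error.

    Observations outside the domain stay in the data with a zero contribution,
    so every stratum and PSU present in `df` is counted in the variance.
    Assumes at least two PSUs per stratum.
    """
    in_domain = df[domain].astype(float)
    wd = df[w] * in_domain
    mean = (wd * df[y]).sum() / wd.sum()
    # Linearized score for a ratio estimator; zero outside the domain.
    score = wd * (df[y] - mean) / wd.sum()
    psu_totals = score.groupby([df[strata], df[psu]]).sum()
    variance = 0.0
    for _, totals in psu_totals.groupby(level=0):
        n_h = len(totals)  # number of PSUs in this stratum
        variance += n_h / (n_h - 1) * ((totals - totals.mean()) ** 2).sum()
    return float(mean), float(np.sqrt(variance))

# Hypothetical design: 2 strata, 5 PSUs; PSU 3 contains no one in the domain.
survey = pd.DataFrame({
    "stratum": ["A"] * 6 + ["B"] * 4,
    "psu":     [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "weight":  [1.0] * 10,
    "y":       [1, 3, 2, 6, 5, 5, 4, 4, 2, 4],
    "adult":   [1, 1, 1, 1, 0, 0, 1, 1, 1, 1],
})

# Subpopulation approach: the full design is retained.
mean_sub, se_sub = domain_mean_se(survey, "y", "weight", "stratum", "psu", "adult")

# Dropping non-members first: PSU 3 disappears from the design.
dropped = survey[survey["adult"] == 1]
mean_drop, se_drop = domain_mean_se(dropped, "y", "weight", "stratum", "psu", "adult")

print(mean_sub, mean_drop)  # identical point estimates
print(se_sub, se_drop)      # different standard errors
```

In this toy example the point estimate is unchanged, but the standard error differs because dropping observations removes a PSU from the design. The direction and size of the difference depend on the data.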
With longitudinal data, code must be used to indicate to the software that there are multiple observations per unit. This declaration tells the software which unique identifier has repeated observations and which variable records time. The commands used to run an analysis after the dataset has been declared are only slightly altered; for example, the code may now carry a prefix indicating that the data are panel data. Specifying a subpopulation in this code is similar to the non-weighted code described above.
Setting up the dataset correctly in the statistical software of choice is an important step, which should be carefully researched prior to data analysis.
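For longitudinal data, a rough pandas analogue of declaring the panel structure is to index the data by unit and time. This is only a sketch under assumed names (`id`, `wave`, `bmi`) with made-up values; dedicated panel commands in statistical packages do considerably more.

```python
import pandas as pd

# Hypothetical long-format panel: "id" is the unit, "wave" is the time variable.
panel = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 3],
    "wave": [1, 2, 3, 1, 2, 1],
    "bmi":  [22.0, 22.5, 23.1, 30.2, 29.8, 25.0],
})

# Each (id, wave) pair should occur once; duplicates signal a merge or append problem.
assert not panel.duplicated(["id", "wave"]).any()

# "Declare" the panel structure by indexing on unit and time, sorted.
panel = panel.set_index(["id", "wave"]).sort_index()

# Within-unit operations now follow the panel structure,
# e.g., change in BMI since the previous wave for the same person.
panel["bmi_change"] = panel.groupby(level="id")["bmi"].diff()

# Number of observations per unit.
obs_per_unit = panel.groupby(level="id").size()
print(obs_per_unit)
```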
4.2.1 | Sample weights
Sample weights are used to account for the design of the survey that is being used (Korn & Graubard, 1999). Weighting adjusts for many factors associated with the complex survey design, including oversampling of certain subpopulations, nonresponse, noncoverage, and day of the week. If a survey or dataset is not intended to be representative of a specific city or country (for example, a cohort study that follows a sampled set of individuals or households), then sample weights are not necessary. Sample weights allow the estimates generated from the analysis to be representative of the overall population from which the sample is drawn.
Using sample weights is necessary when the underlying
sample is unrepresentative of the target population. If there
are known systematic differences in the makeup of the popu-
lation to the sample, then inverse-probability selection weights
can be used to estimate population statistics and create
representative statistics for the overall population (Korn &
Graubard, 1999). Not using the sample weights can result in
heavily biased estimates since the unweighted results do not
take into account the overall population size and composition.
Korn and Graubard (1999) lay out a number of examples
illustrating how unweighted results lead to biased estimates,
and we present one example in the next section. On the other
hand, using sampling weights can lead to inefficiencies in
analytical estimation like regression analysis, meaning that a
larger sample size is needed to assess a relationship than when
not using sample weights, leading to larger confidence inter-
vals. A way to deal with these inefficiencies is to add vari-
ables to the model which were used in the survey design, like
sex, age, and race/ethnicity, and are exogenous to the relation-
ship under study (Korn & Graubard, 1999).
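A small numerical sketch in Python shows why unweighted estimates can be biased when a group is oversampled. All numbers are hypothetical: a population in which group B makes up 10% of people but half of the sample, with a higher prevalence of the outcome in group B.

```python
import numpy as np

# Hypothetical population: group A has 90 000 people, group B has 10 000.
# The survey oversamples B, so each group supplies 500 respondents.
n_a, n_b = 500, 500
pop_a, pop_b = 90_000, 10_000

# Hypothetical outcome: 10% prevalence in A, 30% in B (exact counts, no randomness).
y = np.concatenate([np.repeat([1, 0], [50, 450]),     # group A respondents
                    np.repeat([1, 0], [150, 350])])   # group B respondents

# Inverse-probability-of-selection weights: people represented by each respondent.
w = np.concatenate([np.full(n_a, pop_a / n_a), np.full(n_b, pop_b / n_b)])

unweighted = y.mean()
weighted = np.average(y, weights=w)
print(f"unweighted = {unweighted:.3f}, weighted = {weighted:.3f}")
```

Under these assumptions the true population prevalence is 0.9 × 0.10 + 0.1 × 0.30 = 0.12. The weighted estimate recovers it, while the unweighted estimate is pulled toward the oversampled group.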
Sometimes datasets have multiple sample weights to
choose from. For example, using NHANES data, sample
weights depend on the dataset being used: (a) interview
weights assigned to the in-home interview, (b) examination
weights for those who have also taken part in the mobile
examination center (MEC) exam, (c) dietary weights for those
who have taken part in the 24-hour multiple pass dietary recall
interview which takes place after the anthropometrics are
measured in the MEC, and (d) subsample weights for select
individuals who were told to fast (NHANES documentation).
This often leads to confusion for researchers attempting to
decide which sample weights to use. Review the question of
interest and the variables used in the analytic model. The
weights relevant to the smallest sample subpopulation in the
analysis are the correct selection. For example, if assessing how sleep (measured in the in-home interview; interview weights) is associated with urine specific gravity (measured in the mobile examination center; MEC weights) while adjusting for water intake (measured in the 24-hour dietary recall; day 1 dietary weights), the dietary weights are the appropriate weights to use because it is the smallest subpopulation for which all data are available (Rosinger et al., 2019). These weights appropriately adjust for survey nonresponse.
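The rule of choosing the weights for the smallest subsample can be written as a small helper in Python. The function `choose_weight` and the component labels are hypothetical. The weight variable names are those we understand to be used in NHANES 2-year public-use files and are an assumption here, so they should be confirmed against the documentation for the cycle being analyzed.

```python
# Components ordered from the largest to the smallest responding subsample,
# following the order given in the text (interview, MEC exam, dietary, fasting).
# Weight variable names are assumed from NHANES 2-year files; verify per cycle.
WEIGHT_BY_COMPONENT = [
    ("interview", "WTINT2YR"),
    ("mec", "WTMEC2YR"),
    ("dietary_day1", "WTDRD1"),
    ("fasting", "WTSAF2YR"),
]

def choose_weight(components_used):
    """Return the weight variable for the smallest subsample among those used."""
    order = [name for name, _ in WEIGHT_BY_COMPONENT]
    unknown = set(components_used) - set(order)
    if unknown:
        raise ValueError(f"unknown component(s): {sorted(unknown)}")
    # The component latest in the ordering has the smallest subsample.
    smallest = max(components_used, key=order.index)
    return dict(WEIGHT_BY_COMPONENT)[smallest]

# Sleep (interview), urine specific gravity (MEC), water intake (dietary day 1).
weight = choose_weight({"interview", "mec", "dietary_day1"})
print(weight)
```

For the sleep and hydration example above, the helper returns the day 1 dietary weight, matching the reasoning in the text.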
4.2.2 | Example from NHANES of unweighted vs weighted estimates
In Table 2, we present an example from the 2011-2014 NHANES of unweighted and weighted estimates using MEC weights to illustrate how the estimates change when sampling weights and adjustments for covariates used in the sampling design are applied. We test the research question of how lipid-lowering medications are associated with the prevalence of high total cholesterol (≥240 mg/dL) among U.S. adults aged 20 and over (see methodology in Rosinger, Carroll, Lacher, & Ogden, 2017).
Here, the unweighted analyses underestimate the associa-
tion between lipid-lowering medications and high total cho-
lesterol compared with the weighted analyses, that is, the odds
ratio and 95% confidence intervals are closer to one (Table 2).
When adjusted for factors associated with the sampling frame,
that is, age categories (20-39, 40-59, and 60+), sex, and
race/ethnicity, the unweighted results still underestimate the
association.
When examining the predicted probabilities, it is clear why the unweighted association is underestimated: the predicted probability of high total cholesterol among non-users of lipid-lowering medications is 12.6% in the unweighted analysis vs 13.1% in the weighted analysis, while the probability among lipid-lowering medication users is overestimated at 8.1% vs 7.8% in the weighted analysis. For the predicted probabilities, adjusting for the items in the sampling design without using sampling weights actually provides an estimate (13.2, 95% CI: 12.4-13.9) close to the unadjusted weighted probability for the majority group, non-users of lipid-lowering medications (13.1, 95% CI: 12.0-14.2), but it underestimates the probability for the smaller subgroup of lipid-lowering medication users (6.8, 95% CI: 5.8-7.9 vs 7.8, 95% CI: 6.3-9.2). Nevertheless, the unweighted adjusted estimates do not accurately assess the true relationship between the two groups. This is highlighted further in the comparison of the two adjusted models, especially when the outcome is associated with a specific group that was oversampled due to the sampling design.
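As a consistency check, the odds ratios in Table 2 can be approximately recovered from the reported predicted probabilities, since an odds ratio is the odds p/(1 - p) among medication users divided by the odds among non-users. The short Python sketch below uses only the rounded values printed in Table 2. For the adjusted models, an odds ratio computed from marginal predicted probabilities need not equal the model-based odds ratio in general, so the agreement there should be read as approximate.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

# Predicted probabilities from Table 2 as proportions: (non-users, users).
table2 = {
    "Unweighted":            (0.126, 0.081),
    "MEC weights":           (0.131, 0.078),
    "Unweighted adjusted":   (0.132, 0.068),
    "MEC weights adjusted":  (0.138, 0.063),
}

# Odds ratio for users relative to non-users, rounded as in Table 2.
odds_ratios = {
    model: round(odds(users) / odds(non_users), 2)
    for model, (non_users, users) in table2.items()
}
for model, value in odds_ratios.items():
    print(f"{model}: OR = {value}")
```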
This example of unweighted vs weighted estimates demonstrates that failing to use the sample weights, even if not trying to make claims about national representativeness of results, can substantially affect the magnitude of the estimates between groups. With data analysis intended for policy development, it is particularly important to use proper analytical technique to assure appropriate application of the results as the estimates of the affected populations can affect decisions about potential interventions (Osborne, 2008).
4.2.3 | Missing data
Prior to analysis, it is important to assess and decide how to deal with missing data of variables in the secondary analysis of existing datasets. The first step is to examine how missing data are documented in the dataset, for example, (.), 9, 999, or 777, and make sure that the software is treating those values as missing. Additionally, it is critical to examine the questionnaires across years to assess whether the value assigned to "Don't know," "Missing," or "Respondent gave non-numeric answer" is consistent.
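Recoding documented missing-value codes can be sketched in Python with pandas as follows. The variables and the codes assigned to them are hypothetical and would need to be taken from the codebook of the dataset and year being used.

```python
import numpy as np
import pandas as pd

# Hypothetical extract; the codes below are illustrative, not from a codebook.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "sleep_hours": [7, 99, 6, 77],   # assume 77 = refused, 99 = don't know
    "income_cat": [3, 9, 7, 2],      # assume 7 = refused, 9 = don't know
})

# Missing-value codes differ by variable, so map them per column rather than
# replacing 7 or 9 everywhere (7 hours of sleep is a valid answer).
missing_codes = {"sleep_hours": [77, 99], "income_cat": [7, 9]}

for col, codes in missing_codes.items():
    # Keep values that are not missing codes; set the rest to NaN.
    df[col] = df[col].where(~df[col].isin(codes), np.nan)

# Share missing per variable, to report before analysis.
print(df[["sleep_hours", "income_cat"]].isna().mean())
```

Keeping the code lists per variable, and per survey year where codes change, makes it easier to document the decision in annotated syntax.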
Whereas the sample weights adjust for individual nonre-
sponse, that is, sampling procedures and people who choose
not to respond to the survey, item nonresponse in the form of
refusing to answer certain questions leads to missing data for
an individual who has been sampled. In this case, there are
different ways in which a researcher can handle the missing-
ness problem. One common way is to exclude individuals
who have missing data on any of the variables of interest so
that all the tables have the same sample size. Using the appro-
priate subpopulation command in the analysis will help adjust
for the nonresponse. One critique of this approach is that it can lead to more biased estimates if data are not missing at random (for instance, if low-income individuals are less likely to report income) (Korn & Graubard, 1999). However, if excluding those with missing data, it is still important to model missingness (and the response "Don't know") explicitly to assess whether certain demographic characteristics are associated with missingness for key variables, in order to understand potential bias in the results.
When numerous individuals are excluded and survey weights are used, one approach is to reweight the data based on the sample characteristics (the same ones used in the sampling frame, such as sex, age, and race/ethnicity) and compare results. Sometimes, this leads to similar results and the original weights are used since they are publicly available to allow for easier replication (Rosinger et al., 2016).
TABLE 2 Example of unweighted and weighted estimates using NHANES 2011-2014 data: odds ratios, predicted probabilities, and 95% confidence intervals of the relationship of lipid-lowering medication status on high total cholesterol^a
| | n | Unweighted | MEC weights | Unweighted adjusted^b | MEC weights adjusted^b |
|---|---|---|---|---|---|
| Odds ratio | | OR (95% CI) | OR (95% CI) | OR (95% CI) | OR (95% CI) |
| No lipid-lowering medications | 8228 | Ref | Ref | Ref | Ref |
| Lipid-lowering medications | 1993 | 0.61 (0.52-0.73) | 0.56 (0.45-0.68) | 0.48 (0.40-0.58) | 0.42 (0.34-0.51) |
| Predicted probability | | % (95% CI) | % (95% CI) | % (95% CI) | % (95% CI) |
| No lipid-lowering medications | 8228 | 12.6 (11.9-13.3) | 13.1 (12.0-14.2) | 13.2 (12.4-13.9) | 13.8 (12.7-15.0) |
| Lipid-lowering medications | 1993 | 8.1 (6.9-9.3) | 7.8 (6.3-9.2) | 6.8 (5.8-7.9) | 6.3 (5.2-7.5) |
^a Refers to the High total cholesterol cutoff.
^b Refers to which variables were adjusted for (Adjusted for age category, race, etc).
A second approach is imputing a value for the missing item. Two primary imputation techniques are mean imputation and hot-deck imputation (Korn & Graubard, 1999). Mean imputation replaces each missing value with the mean (or median) of the variable among individuals with observed values. One key problem with this technique is that, in the case of categorical variables, a nonrealistic value may result (e.g., if two equals high school and three equals college for education, an imputed 2.5 is not a real-world value). As a result, hot-deck imputation is preferred over mean imputation (Korn & Graubard, 1999). In hot-deck imputation, missing values are filled in with values from respondents who share certain characteristics, such as individuals of the same age and sex. One hot-deck imputation method that has gained traction uses nearest-neighbor matching to choose the missing values (Chen & Shao, 2000). This method can reduce bias, particularly compared to excluding data.
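The contrast between mean imputation and a simple hot-deck can be sketched in Python with pandas. This is a random within-cell hot-deck, not the nearest-neighbor method of Chen and Shao (2000); the helper `hot_deck`, the imputation cells (sex by age group), and the data are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: education coded 1-3, with two missing values.
df = pd.DataFrame({
    "sex":       ["F", "F", "F", "M", "M", "M"],
    "age_group": ["20-39"] * 6,
    "education": [2, 3, np.nan, 1, np.nan, 1],
})

def hot_deck(values, rng):
    """Fill missing entries with values drawn from observed donors in the same cell."""
    donors = values.dropna().to_numpy()
    filled = values.copy()
    missing = filled.isna()
    if len(donors) > 0 and missing.any():
        filled[missing] = rng.choice(donors, size=missing.sum())
    return filled

# Hot-deck within cells defined by sex and age group.
df["education_hd"] = (
    df.groupby(["sex", "age_group"])["education"]
      .transform(lambda s: hot_deck(s, rng))
)

# Mean imputation for comparison: can produce a value that is not a real category.
df["education_mean"] = df["education"].fillna(df["education"].mean())
print(df)
```

Here mean imputation assigns 1.75, which is not a valid education category, whereas the hot-deck always assigns a value that was actually observed among donors of the same sex and age group.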
4.2.4 | Goodness of fit
In noncomplex survey analyses, researchers use goodness-of-fit (GoF) measures like R-squared, the Akaike information criterion (AIC), and the Bayesian information criterion (BIC) to assess whether the regression models are appropriately suited to the data. In complex surveys, the GoF measures are not as clear cut. There is debate in the field over whether these GoF measures mean the same thing for complex survey data, as they have to take into account the survey weights (Korn & Graubard, 1999). In recent years, progress in this sphere has been made. For example, with logistic regression, researchers can use the Archer-Lemeshow test, which indicates whether the model fits the data adequately (Archer, Lemeshow, & Hosmer, 2007). R-squared is used by some for linear regressions, but it is not advocated by all since it gives information about model fit but not about actual relationships between variables in the model (Korn & Graubard, 1999). For log-binomial regressions, used when assessing prevalence ratios or incidence ratios, there has not been a definitive GoF measure, though a couple of methods have been proposed (Jann, 2008).
Key Points for Data Management
• Download and retain original datasets
• Create spreadsheet of original variables of interest from the datasets
• Keep unique identifier/index variables (for linking cases and observations across datasets)
• Save syntax for merging and appending datasets
• Run merge checks
• Save syntax for deriving variables on built dataset
• Assess missingness of variables
• Save syntax for analytic decisions and annotate code
5 | EXAMPLE DATASETS
While many uses exist for the secondary analysis of existing datasets in human biology, a few notable ones include: (a) the replication or reexamination of data for the same research question as has previously been published, but potentially with different methods, which can also be a valuable teaching and learning experience as class projects; (b) the analysis
of a standalone, single country or site to assess a snapshot of
a certain question or provide baseline data about morbidity or
mortality or to examine trends over time (Rivara & Miller,
2017); (c) examining multiple sites over a single time period
to have a snapshot comparison (e.g., Gildner, Liebert, Kowal,
Chatterji, & Josh Snodgrass, 2014; Hruschka & Hagaman,
2015); (d) analysis of multiple sites across years to provide
trend data over time and across space; (e) using existing data
from a country, like the United States with NHANES data,
as a comparative or reference population for a specific
population(s) that human biologists currently work with (e.g.,
Blackwell et al., 2011); and (f) testing of theoretical models.
Here we highlight 14 publicly available datasets that
may be of interest to human biologists (Table 3). We
describe in slightly more detail four of these datasets: DHS,
NHANES, the Cebu longitudinal health and nutrition study
(CLHNS), and the WHO SAGE. We discuss some topics
and questions that have been tested with these data. Many
other datasets exist and can be found through the data repos-
itories described previously. The datasets we highlight pro-
vide a mix of US and international datasets, the majority of
which contain anthropometric and biomarker data in con-
junction with interview data on demographic, economic, and
health information. Some studies provide access to training
materials to help with data analysis.
5.1 | Demographic and Health Surveys
One of the most widely known and largest available datasets used by human biologists is from the DHS due to its presence in many countries where anthropologists work (https://dhsprogram.com/What-We-Do/Survey-Types/DHS.cfm). DHS, supported by USAID, comprises a number of nationally-representative surveys from 93 low- and middle-income countries, which have been designed and implemented based on a systematic and centralized protocol. The standard DHS household cross-sectional survey collects information on 5000-30 000 households every 5 years to allow for comparisons over time and across countries, while smaller, more specific interim surveys are conducted between the standard surveys. In sum, DHS currently collects information in 44 sub-Saharan African, 12 North African/West Asian/European, 5 Central Asian, 15 South and Southeast Asian, 2 Oceanian, and 15 Latin American and Caribbean countries. Many low- and middle-income
countries rely on the DHS data repository infrastructure to
provide access to their national datasets, for example,
India's National Family Health Survey. It should be noted
that not all countries have regular surveys. The website pro-
vides a listing of all available surveys by country and pro-
vides summary reports and data documentation. This
collection of surveys includes many health modules and
items of interest to human biologists ranging from anemia
prevalence to environmental health assessments to dietary
assessments to health behaviors, like female genital cutting,
while also collecting information on socioeconomic and
demographic (education, wealth, and household data) variables which are valuable predictors in regression analyses (Kaiser, Hruschka, & Hadley, 2017). The emphasis of the basic survey is on maternal and child health; however, the household survey allows for studies about household environment and health. The website contains a number of training resources as well, both in text and video format.
TABLE 3 Highlighted study datasets with publicly available data of interest to human biologists
| Study name | Primary purpose | Ex. of data available (not exhaustive) | Study design | Years available | Sample/cohort size per survey cycle | Link to study site and info | Data access |
|---|---|---|---|---|---|---|---|
| Adolescent brain cognitive development (ABCD) | The largest long-term study of brain development and child health in the United States. Researchers will track their biological and behavioral development through adolescence into young adulthood. | Physical health (anthropometry), brain imaging (MRI and fMRI), biospecimens (hair, saliva, blood), mental health screeners, neurocognitive tasks, substance use, environment survey | Longitudinal | First wave released 2018 | 10 000 9-10 year olds in 21 sites | https://abcdstudy.org/ https://data-archive.nimh.nih.gov/abcd | Create account with NIMH and request data |
| Behavioral risk factor surveillance system (BRFSS) | U.S. nationally-representative and state-representative survey to assess health related risk behaviors, chronic health conditions, and use of preventive services. | Health-related perceptions, conditions, and behaviors (eg, health status, health care access, alcohol consumption, tobacco use, fruits and vegetable consumptions, HIV/AIDS risks), demographic questions | Cross-sectional; Phone based | 1984-ongoing (2016 most recent) | 400 000/year | https://www.cdc.gov/brfss/index.html | Download from website |
| Cebu longitudinal health and nutrition survey (CLHNS) | Investigate how infant feeding decisions by the household interact with various social, economic, and environmental factors to affect health, nutritional, demographic, and economic outcomes. Research is now focused on long-term effects of prenatal and early childhood nutrition and health on later adult outcomes, including education and work outcomes and development of chronic disease risk factors. | Anthropometry, biomarkers (added in 2005), dietary intake, economic, environmental exposures, demographic | Longitudinal | 1983-2014 (2009 most recent publicly available) | 3327 women-infant pairs in initial cohort; 7 subsequent follow-ups with ~2300 of the women and index children | https://www.cpc.unc.edu/projects/cebu; https://dataverse.unc.edu/; http://data.cpc.unc.edu/projects/6/view | Download from website; request biomarker data by email |
| China health and nutrition survey (CHNS) | Designed to examine the effects of the health, nutrition, and family planning policies and programs implemented by national and local governments and to see how the social and economic transformation of Chinese society is affecting the health and nutritional status of its population. | Biomarkers, survey questionnaires, anthropometrics, nutrition, childcare, community-level, income, survey weights | Cohort | 1989-ongoing (2015 latest publicly available data) | 35 000 individuals, 8600 households, 12 provinces | https://www.cpc.unc.edu/projects/china https://data.cpc.unc.edu/projects/7/view | Register, then download from website |
| Demographic and Health Surveys (DHS) | Nationally representative data on fertility, family planning, maternal and child health, gender, HIV/AIDS, malaria, and nutrition across middle- and low-income countries. | Anthropometry, demographic, alcohol consumption, anemia, domestic violence, HIV behavior, malaria parasitemia, vitamin A, tobacco use | Cross-sectional | 1962-ongoing (2015 most recent) | Varies by country | http://dhsprogram.com/Who-We-Are/About-Us.cfm#sthash.YX0BUWo3.dpuf; http://dhsprogram.com/data/Using-DataSets-for-Analysis.cfm; https://blog.dhsprogram.com/sampling-weighting-at-dhs/ | Register, then download from website |
| National Family Health Survey, India | To provide essential data on health and family welfare in India and emerging health and family welfare issues. | Anthropometry, HIV status, fertility, sexual behaviors, biomarkers, blood pressure, hemoglobin, environmental assessments | Cross-sectional | 1992-2016 | ~100 000 nationally-representative households until 05-06; 2015-2016 has ~600 000 households | Available through DHS: http://rchiips.org/nfhs/data1.shtml | Register, then download from DHS website |
| National Health and Nutrition Examination Survey (NHANES) | To monitor health and nutritional status of US children and adults and determine the prevalence of major diseases and risk factors for diseases. NHANES findings are also the basis for national standards for such measurements as height, weight, and blood pressure. | Anthropometry, demographic, dietary, questionnaire, dental, laboratory, environmental exposures | Primarily cross-sectional, linkage to medicare records, mortality records | 1959-ongoing (2015-16 most recent) | 10 000 children and adults per 2 year cycle | https://www.cdc.gov/nchs/nhanes/index.htm; Tutorial: https://www.cdc.gov/nchs/tutorials/Nhanes/index_continuous.htm | Download from website |
| National Health Interview Survey (NHIS) | To monitor the health of the United States population through the collection and analysis of data on a broad range of health topics. A major strength of this survey lies in the ability to display these health characteristics by many demographic and socioeconomic characteristics. | Physical and mental health status; chronic conditions, including asthma and diabetes; access to and use of health care services; health insurance coverage and type of coverage; health-related behaviors, including smoking, alcohol use, and physical activity; measures of functioning and activity limitations; immunizations | Cross-sectional; Phone based | 1957-ongoing (2017 most recent) | ~100 000/year | https://www.cdc.gov/nchs/nhis/index.htm; https://www.cdc.gov/nchs/nhis/data-questionnaires-documentation.htm | Download from website |
| National social life, health, and aging project (NSHAP) | A longitudinal, population-based study of health and social factors, aiming to understand the well-being of older, community-dwelling Americans by examining the interactions among physical health and illness, medication use, cognitive function, emotional health, sensory function, health behaviors, social connectedness, sexuality, and relationship quality. | In-person questionnaire including demographics, physical and mental health, economic survey and biomarkers including sleep and physical activity | Longitudinal | 2005-2016 | 3005 in Wave 1, 3377 in Wave 2 (2261 from Wave 1), 4777 in Wave 3 (from wave 1 and wave 2 cohort and partners) | http://www.norc.org/Research/Projects/Pages/national-social-life-health-and-aging-project.aspx | Request NSHAP Restricted Data Use Agreement form. Completed forms with signature(s) emailed: icpsr-nacda@umich.edu. |
| National Survey on Drug Use and Health (NSDUH) | Provides up-to-date information on tobacco, alcohol, and drug use, mental health and other health-related issues in the United States. | Tobacco, alcohol, illicit drug, and opioid use, mental health, suicidal thoughts and behavior, and treatment information | Cross-sectional | 1971-ongoing (2016 most recent) | Target of 67 500 individuals per wave | https://datafiles.samhsa.gov/ | Download from website |
| The Russia Longitudinal Monitoring Survey | Nationally representative surveys designed to monitor the effects of Russian reforms on the health and economic welfare of households and individuals in the Russian Federation | SES, labor & occupation, education, physical health, nutrition, biological function and development, physical activity, anthropometry, risk behavior, demographics, reproductive health | Longitudinal | 1994-2014 | 43 244 adults, 11 633 children, 30 712 households with longitudinal data | http://data.cpc.unc.edu/projects/3/view http://www.cpc.unc.edu/projects/rlms-hse | Download from UNC dataverse; sensitive data must be requested via email |
| Tsimane Amazonian Panel Study (TAPS) | The impact of lifestyle change on the well-being of an indigenous Bolivian group. | Anthropometry, demographic, expenditure data, economic data, health behaviors | Longitudinal | 2002-2010 | 633 adults, 820 children | http://www.sciencedirect.com/science/article/pii/S1570677X15000544; http://heller.brandeis.edu/sustainable-international-development/tsimane/ | Request permission |
| UNICEF Multiple Indicator Cluster Surveys (MICS) | Surveys from 100+ countries examining well-being of children and women. Major source of data to monitor Millennium and Sustainable Development Goals progress. | Household (demographics, education, assets, assistance, energy use, bednets, water quality testing/sanitation); Men (media exposure, fertility, domestic violence, victimization, function, sexual/reproductive health, circumcision, alcohol/tobacco, life satisfaction); Women (media exposure, victimization, fertility, pre/post-natal care, FGM, domestic violence function, sexual/reproductive health, alcohol/tobacco, life satisfaction); Children (development, discipline, breastfeeding/diet, immunization, illness, anthropometrics, child labor, parental involvement). | Serial cross-sectional probability-stratified cluster | 1993-current, depending on country | Variable 6000-37 000 households, 3000-34 000 individuals | http://mics.unicef.org/ | Create account with MICS and describe research objectives |
| WHO Study on global Aging and adult health (SAGE) | A longitudinal study of health, wellbeing and the aging process in nationally representative samples of older adult populations in China, Ghana, India, Mexico, Russian Federation and South Africa, with smaller samples of younger adults for comparison purposes. | Household data (assets, malaria prevention, healthcare, maternal health); Individual data (anthropometry, biomarkers, health behaviors, well-being, social cohesion, economics), mortality data | Longitudinal | 2002-Ongoing (2017 most recent) | Wave 1 was conducted during 2007-2010 and included a total of 34 124 respondents aged 50+ and 8340 aged 18-49 | http://www.who.int/healthinfo/sage/en/ | Request permission: sagesurvey@who.int |
Note: Weblinks to study sites may change over time. If links are broken, search online by study name.
Since cross-cultural analyses can be confounded by differ-
ences in wealth, dietary traditions, physical activity, access to
healthcare, and other factors, it is possible to use DHS data to
restrict analyses to countries that share similarities on critical
confounders. Previous human biology research using DHS
data has examined topics such as body size among adults
(Hruschka et al., 2014), height, BMI, and economic factors in
children (Murasko, 2017), how economic conditions can
affect tradeoffs in reproduction costs (Hruschka & Hagaman,
2015), how different kinds of material wealth are related to
growth, well-being, and disease risk (Hadley, Maxfield, &
Hruschka, 2019; Hruschka, Hadley, & Hackman, 2017), inter-
generational fertility (Murphy, 2012), immune activation,
growth, and food insecurity (Hadley & DeCaro, 2014), and
parental investment theory within a single DHS country
(Sparks, 2011). DHS can be used to do comparative studies
across countries or assessments of a single country.
5.2 | National Health and Nutrition Examination Survey
Another well-known dataset used by human biologists is
NHANES. This US-based nationally-representative survey
is a cross-sectional assessment used to monitor the health
and nutritional status of the non-institutionalized, household,
civilian population. Since 1999, NHANES has been con-
ducted continuously, visiting 30 sites within the US every
2 years and data are released to the public in 2-year cycles
(n = ~10 000 adults and children). NHANES is unique in
that it combines interview modules as diverse as 24-hour
dietary intake and cognitive function with a wide array of
anthropometric and biomarker data, including hormone data.
Many other countries, like South Korea (KNHANES) (Brewis,
Han, & SturtzSreetharan, 2017; Kweon et al., 2014) and
Kazakhstan (KHANES) (Facchini et al., 2007), have modeled
nationally-representative surveys after NHANES. The National
Center for Health Statistics, which administers NHANES,
developed growth curves, including BMI z-scores, that
researchers can use to plot how children in their populations
fall along the curve (Kuczmarski et al., 2002).
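As an illustration of how a child's measurement is placed on such a growth curve, the CDC charts are distributed with LMS parameters for each age and sex (L, a Box-Cox power; M, the median; S, a coefficient of variation), and a z-score follows from z = ((X/M)^L - 1) / (L * S). The sketch below assumes this LMS form. The function name and the L, M, and S values are placeholders chosen for illustration and are not entries from the published tables.

```python
import math


def lms_z_score(x, l, m, s):
    """Return the z-score of measurement x given LMS parameters.

    Uses z = ((x / m) ** l - 1) / (l * s), with the logarithmic
    limit z = ln(x / m) / s when l is exactly zero.
    """
    if x <= 0 or m <= 0 or s <= 0:
        raise ValueError("x, m, and s must be positive")
    if l == 0:
        return math.log(x / m) / s
    return ((x / m) ** l - 1.0) / (l * s)


# Placeholder parameters for illustration only (NOT from the CDC tables).
EXAMPLE_L, EXAMPLE_M, EXAMPLE_S = -2.0, 16.0, 0.10

# A BMI equal to the reference median has a z-score of zero.
z_at_median = lms_z_score(16.0, EXAMPLE_L, EXAMPLE_M, EXAMPLE_S)

# A BMI above the median yields a positive z-score.
z_above = lms_z_score(18.0, EXAMPLE_L, EXAMPLE_M, EXAMPLE_S)
```

In practice, the L, M, and S values would be looked up by age in months and sex from the reference table before calling the function.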
The use of NHANES has expanded in recent years among
human biologists exploring questions surrounding allostatic
load (Geronimus et al., 2006), disparities in how people meet
their water needs (Rosinger, Herrick, Wutich, Yoder, & Ogden,
2018), sleep and dehydration risk (Rosinger et al., 2019),
hydration, obesity and water intake (Rosinger et al., 2016),
trends in nutritional indicators like the relationship between
milk consumption and growth (Wiley, 2005), and more specific
questions like how male partnering relationships are associated
with testosterone and depression levels (Gettler & Oka, 2016).
Another unique aspect of NHANES is that mortality records can
be linked to participants from earlier NHANES survey cycles to
create a prospective cohort study and assess the relationships
between behaviors and health states and risk of mortality (Kant
& Graubard, 2016; Shattuck & Sparks, 2017). NHANES can be
used on its own to test hypotheses in the US among the general
population or among subpopulations, given the large sample
size, as well as to compare to other populations.
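The mortality linkage described above amounts to joining a follow-up file onto the survey file by participant identifier. The following is a minimal sketch under stated assumptions: both files share an identifier (called SEQN here, the respondent sequence number in NHANES public-use files), while the `deceased`, `followup_months`, and `low_water_intake` field names and all values are hypothetical. The crude rates ignore survey weights and are shown only to make the cohort structure concrete.

```python
def link_mortality(survey_rows, mortality_rows, id_key="SEQN"):
    """Left-join mortality follow-up onto survey participants by ID."""
    mortality_by_id = {row[id_key]: row for row in mortality_rows}
    linked = []
    for row in survey_rows:
        match = mortality_by_id.get(row[id_key])
        merged = dict(row)
        # Participants without a linked record keep None for both fields.
        merged["deceased"] = match["deceased"] if match else None
        merged["followup_months"] = match["followup_months"] if match else None
        linked.append(merged)
    return linked


def deaths_per_1000_person_years(rows, group_key):
    """Crude (unweighted) death rate per 1000 person-years by group."""
    totals = {}
    for row in rows:
        if row["deceased"] is None:
            continue  # no linked follow-up for this participant
        deaths, years = totals.get(row[group_key], (0, 0.0))
        totals[row[group_key]] = (
            deaths + row["deceased"],
            years + row["followup_months"] / 12,
        )
    return {group: 1000 * d / y for group, (d, y) in totals.items()}


# Hypothetical survey and follow-up records.
survey = [
    {"SEQN": 1, "low_water_intake": True},
    {"SEQN": 2, "low_water_intake": True},
    {"SEQN": 3, "low_water_intake": False},
    {"SEQN": 4, "low_water_intake": False},
    {"SEQN": 5, "low_water_intake": False},
]
mortality = [
    {"SEQN": 1, "deceased": 1, "followup_months": 60},
    {"SEQN": 2, "deceased": 0, "followup_months": 120},
    {"SEQN": 3, "deceased": 0, "followup_months": 120},
    {"SEQN": 4, "deceased": 1, "followup_months": 120},
]

cohort = link_mortality(survey, mortality)
rates = deaths_per_1000_person_years(cohort, "low_water_intake")
```

A left join is used so that participants without a linked record stay visible in the cohort file and can be counted and reported rather than silently dropped.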
5.3 | Cebu Longitudinal Health and Nutrition Survey
The CLHNS study is a longitudinal study now aimed at
investigating the long-term effects of early life conditions,
diet, and stress on adult health outcomes and chronic disease
risk. CLHNS began in 1983, collecting data on 3327 women-
infant pairs in Cebu, Philippines, and has had seven subse-
quent follow-ups through 2014 with ~2300 of the women and
index children. CLHNS is a unique contribution to this list as
it allows for testing of questions related to intergenerational
effects on health and human biology as it is now following
three generations of participants. An in-depth cohort profile
and study description is open-access (Adair et al., 2010). The
majority of the data, questionnaires, and codebooks are pub-
licly available on the University of North Carolina at Chapel
Hill's dataverse (https://dataverse.unc.edu/), including ques-
tionnaire data related to diet, activities, environment, and
anthropometric data. However, some data, including bio-
marker and genetic data are restricted and must be requested
by contacting the CLHNS Data Use Committee.
A wide variety of critical human biology questions have
been examined with CLHNS including: HPA-axis related
questions, such as how stress profiles are associated with
birthweight (Kuzawa, Tallman, Adair, Lee, & McDade, 2012;
Lee, Fried, Thayer, & Kuzawa, 2014; Thayer, Agustin
Bechayda, & Kuzawa, 2018); dietary questions, such as how
dietary intake of fish is associated with breastmilk composi-
tion (Quinn & Kuzawa, 2012) and how diet is associated with
telomere length and adiposity (Bethancourt et al., 2017);
questions related to environmental exposures and immune
system programming, including how early life exposures
affect inflammation (McDade et al., 2010; McDade, Hoke,
Borja, Adair, & Kuzawa, 2013); questions surrounding
genetic modifications, such as intergenerational effects on
telomere length (Eisenberg, Borja, Hayes, & Kuzawa, 2017);
and questions related to social relationships and their cascad-
ing effects on hormones, including how parental relationships
affect testosterone levels (Gettler, McDade, Bragg, Feranil, &
Kuzawa, 2015; Gettler, McDade, Feranil, & Kuzawa, 2011).
5.4 | The World Health Organization Study on Global AGEing and Adult Health
The WHO SAGE study is a newer longitudinal study of health,
well-being, and the aging process in nationally-representative
samples of older adult populations in China, Ghana, India,
Mexico, the Russian Federation, and South Africa, with smaller
samples of younger adults for comparison purposes, which
began in 2007-2010. An in-depth description of the data of the
WHO SAGE study is openly available (Kowal et al., 2012).
SAGE is designed to provide data on aging and adult health
that are comparable to those generated in high-income
countries by the US HRS, ELSA, and the Collaborative
Research on Ageing in Europe (SHARE) Project.
Research questions related to multimorbidity, physical
and mental health (Arokiasamy et al., 2015), prevalence and
risk factors for noncommunicable diseases like hypertension
and obesity (Salinas et al., 2015; Wu et al., 2015), risk fac-
tors for frailty and disability (Biritwum et al., 2016), physi-
cal function and activity (Barrett et al., 2016), food security,
social disadvantage and body composition (Schrock et al.,
2017), and the relationship of sleep duration and sleep quality to risk of obesity
(Gildner et al., 2014) have been examined with WHO
SAGE. This dataset has also been useful in examining ques-
tions surrounding research methods, like bias in self-reported
vs measured weight and implications for BMI and obesity
prevalence in low and middle-income countries (Gildner,
Barrett, Liebert, Kowal, & Snodgrass, 2015). As the follow-
up waves of this study are released, additional questions of
relevance to human biologists can be explored, both for indi-
vidual countries and as cross-country comparisons.
6 | LIMITATIONS AND BEST PRACTICES
A good resource for analyzing survey data is Korn and
Graubard's (1999) Analysis of Health Surveys. The book
covers the main issues in using survey data for health
research, including appropriate analytic techniques,
sampling, weighting of data, regression analysis, treatment
of missing data, and cross-sectional and longitudinal analyses.
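To make the role of weighting concrete, the sketch below contrasts an unweighted prevalence with a weighted one. It is a minimal illustration, assuming each respondent carries a sampling weight equal to the number of people he or she represents; the variable names and values are hypothetical. It produces only a point estimate, since correct standard errors additionally require the strata and cluster information of the survey design.

```python
def weighted_mean(values, weights):
    """Return the weighted mean of values using sampling weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical respondents: 1 = has the condition, 0 = does not.
has_condition = [1, 0, 0, 1, 0]
# Hypothetical sampling weights; the two cases come from oversampled groups.
sample_weights = [500.0, 2000.0, 1500.0, 250.0, 750.0]

unweighted_prevalence = sum(has_condition) / len(has_condition)
weighted_prevalence = weighted_mean(has_condition, sample_weights)
```

Here the unweighted estimate overstates prevalence because the two cases come from respondents who each represent relatively few people in the population.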
6.1 | Limitations of secondary data use
Often, an assumption exists that large nationally-representative
datasets are devoid of the problems common to smaller
datasets, such as missing data or low response rates; however,
they are not (Korn & Graubard, 1999). One issue that
emerges when analyzing secondary data is that the researcher
did not have a hand in collecting the data. While this may
help in blinding the researcher to particular values or biases
regarding individual participants, it also means that the
researcher does not know the participants or the issues that
arose during data collection and cannot re-examine field
notes from the interview in question to understand why
certain values seem off or are missing. This distance could
also be a positive, in that researchers avoid explaining away
missing data or introducing bias.
While larger sample sizes increase power and help over-
come nonsystematic bias, they can also magnify any potential
biases associated with sampling or study design (Kaplan,
Chambers, & Glasgow, 2014). Therefore, it is better to remain
cautious and systematic and to document all decisions related
to dataset construction, for example, dropping any outliers or
observations that seem unrealistic. Setting up rules before
using the data will assist in decision-making and remove the
temptation to check whether dropping observations affects
the results.
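One way to follow this advice is to write the exclusion rules down as code before looking at any results and to have the code record how many observations each rule removes. The sketch below is a hypothetical example; the rule labels, thresholds, and field names are assumptions for illustration and are not taken from any survey's documentation.

```python
# Rules are fixed before analysis; each pairs a label with a drop test.
# Order matters: missing values are removed before range checks run.
EXCLUSION_RULES = [
    ("missing height", lambda r: r["height_cm"] is None),
    ("implausible adult height (<100 or >250 cm)",
     lambda r: not 100 <= r["height_cm"] <= 250),
]


def apply_exclusions(records, rules):
    """Apply rules in order and return (kept_records, decision_log)."""
    kept = list(records)
    log = [("starting sample", len(kept))]
    for label, should_drop in rules:
        before = len(kept)
        kept = [r for r in kept if not should_drop(r)]
        log.append((label, before - len(kept)))
    log.append(("analytic sample", len(kept)))
    return kept, log


# Hypothetical records, including a missing value and two likely errors.
records = [
    {"id": 1, "height_cm": 162.0},
    {"id": 2, "height_cm": None},
    {"id": 3, "height_cm": 17.5},
    {"id": 4, "height_cm": 180.3},
    {"id": 5, "height_cm": 251.0},
]

analytic_sample, decision_log = apply_exclusions(records, EXCLUSION_RULES)
```

The resulting log can be reported directly as a sample-flow description, and because the rules live in one place, any later change to them is visible.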
With the proliferation of large datasets, statistical power
increases substantially, which may increase the risk of false
positives. Reliance on P-values is problematic in any analysis,
but in secondary data analysis, if hypotheses and protocols
are not adhered to, the risk of p-hacking (changing the
covariates until a predictor variable is significant) increases
(Head, Holman, Lanfear, Kahn, & Jennions, 2015). In response
to the reproducibility crisis that has befallen many fields, a
call has been made to change the threshold of statistical
significance for claims of novel discoveries from P < 0.05 to
P < 0.005 (Benjamin et al., 2018). We suggest that models,
with all covariates of interest, be set up a priori to limit
this issue and that corrections for multiple testing (e.g.,
Bonferroni corrections) be used when applicable.
Additionally, researchers may want to preregister their
study hypotheses at an open science preregistration site.
Preregistering protocols and hypotheses has been shown to
increase the reporting of null findings (Allen & Mehler, 2018).
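For readers unfamiliar with the Bonferroni correction mentioned above, it multiplies each P-value by the number of tests performed (capping the result at 1) and compares the adjusted value with the chosen threshold. The sketch below is a minimal illustration; the function name and P-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (raw_p, adjusted_p, is_significant) for each test."""
    n_tests = len(p_values)
    results = []
    for p in p_values:
        adjusted = min(1.0, p * n_tests)  # adjusted P-values cannot exceed 1
        results.append((p, adjusted, adjusted <= alpha))
    return results


# Hypothetical P-values from four prespecified models.
raw_p_values = [0.001, 0.012, 0.03, 0.5]
corrected = bonferroni(raw_p_values)
significant_flags = [flag for _, _, flag in corrected]
```

In this example, the third test would pass an uncorrected 0.05 threshold but does not survive the correction, which is the behavior the correction is meant to enforce.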
The use of secondary datasets, especially within medicine,
has not been without controversy. In 2016, editors of
the New England Journal of Medicine (NEJM), Longo and
Drazen (2016), introduced the concept of "research parasites"
and defined them as "people who had nothing to do with the
design and execution of the study but use another group's
data for their own ends, possibly stealing from the research
productivity planned by the data gatherers, or even use the
data to try to disprove what the original investigators had
posited" (p. 276). The article stimulated a heated social
media response and a number of editorials in other journals
and on scientific association blogs (Schneider, 2016). In
these responses, researchers argued that the editorial was
counter to the scientific process (Karczewski et al., 2017;
Ornstein, 2016; Shaywitz, 2016). Many have pointed out
that responsible secondary data analysis is important for
both the accuracy and efficiency of science because it maximizes
the use of data that are expensive and time-consuming to
collect (Greene et al., 2017). Researchers who retest hypoth-
eses can add value to the scientific process by validating and
correcting results and maintaining transparency (Choudhury,
Fishman, McGowan, & Juengst, 2014; Greene et al., 2017;
Karczewski et al., 2017).
Within days of the NEJM editorial publication, the Inter-
national Committee of Medical Journal Editors (ICMJE)
issued a policy proposal that all articles on clinical trials pub-
lished in associated journals share data within 6 months of
publication (Taichman et al., 2016). Following a period for
feedback, the ICMJE acknowledged concerns and logistical
challenges of the short time frame and the policy was revised
to require registration of clinical trials and a data statement in
articles published in their associated journals (Taichman et al.,
2017). In response to the controversy, the Pacific Symposium
on Biocomputing began giving annual Research Parasite
Awards to acknowledge the value of scientists who conduct
secondary data analysis (Greene et al., 2017).
6.2 | Best practices
In the face of a number of high profile scandals in academia
where data were fabricated (McNutt, 2015; Reardon, 2015),
open sharing of data and code with publication has become
increasingly important. Conducting quantitative analysis
rigorously has been called a moral and ethical responsibility of researchers
(Osborne, 2008). In addition to the suggestions highlighted
to this point, we discuss a few additional best practices in
the secondary analysis of data.
1. Double replication of data analysis. This practice is
used at the National Center for Health Statistics NHANES
analysis branch. There, a co-author of the project, usually
the second author, independently builds the dataset based on
the research question, and the variables, sample sizes, and
estimates are confirmed against the first author's. This may
be challenging because double replication takes extra time
and resources, which many researchers do not have. However,
it is advisable, especially when students are taking the lead
on an analysis. A good first step in the analysis of complex
surveys is to replicate the estimates in published reports,
for example, NCHS data briefs (Ogden et al., 2018). In
addition, DHS has model datasets that researchers can use to
practice data analysis and replicate findings from
country-level reports.
2. Review study documentation for sampling and vari-
able changes. Study documentation should be reviewed to
identify nuances in the wording of survey questions, skip
logic used between questions, and methodology. Do not
assume that you know which question generated a variable
present in the dataset. When analyzing changes in a specific
variable over time, examine the study documentation for all
the years being used for changes in methodology (e.g., wording
changes or sampling strategies) to assess the comparability of
questions over time. A change in the way a question is worded
can strongly affect the results and make surveys incomparable.
3. Document the process. During the construction of the
dataset, defining of the variables of interest, and analyses,
researchers should take care to create a clear, documented
log of all the steps. A number of statistical packages allow
an opportunity to document data modifications and proce-
dures. This allows the user and any future person to recreate
the dataset and reestimate the analyses as a replication
check. Steps to document include: (a) dataset construction
(merging and appending datasets to create a larger dataset),
(b) data cleaning/imputation, with notes on the specific unique
identifiers or variables that were cleaned or changed to
missing values, (c) variable construction, that is, how
variables are defined, and (d) the analysis for the article,
with notes for each table and figure.
4. Sound statistical practice. While not unique to secondary
data analysis, researchers should review and test the
underlying assumptions behind statistical procedures prior to
using them (e.g., normality, homoskedasticity, independence
of observations). To help readers interpret the results,
point estimates should always be accompanied by measures of
uncertainty (standard errors and 95% confidence intervals),
which provide more information than the P-value, a statistic
that some fields wish to move away from (Wasserstein & Lazar,
2016). Another important aspect for human biologists is to
assess the practical and biological significance of the
results (e.g., is a 0.3% increase in body fat meaningful even
if the P-value is 0.001?) instead of only considering the
statistical significance. Additionally, we would encourage
researchers to consider depositing the data and code for
their papers in data repositories, dataverses, or personal
sites. Dataverse and Dryad both have a number of suggestions
for best practices for sharing data in nonproprietary formats.
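Two of the practices above, independent replication of the dataset build and reporting estimates together with their uncertainty, can be sketched in a few lines. This is a minimal illustration under simplifying assumptions: the data and function names are hypothetical, and the standard error treats the data as a simple random sample, so it ignores the weights, strata, and clusters that a complex survey analysis would require.

```python
import math


def summarize(values, z=1.96):
    """Return n, mean, standard error, and a normal-approximation 95% CI."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    se = math.sqrt(variance / n)  # simple-random-sample standard error
    return {"n": n, "mean": mean, "se": se,
            "ci95": (mean - z * se, mean + z * se)}


def builds_match(summary_a, summary_b, tol=1e-9):
    """Check that two independent dataset builds agree on n and the mean."""
    same_n = summary_a["n"] == summary_b["n"]
    same_mean = abs(summary_a["mean"] - summary_b["mean"]) <= tol
    return same_n and same_mean


# Hypothetical BMI values as built separately by two analysts
# (same participants, different row order).
analyst_a_bmi = [24.0, 26.0, 28.0, 30.0, 32.0]
analyst_b_bmi = [32.0, 24.0, 30.0, 26.0, 28.0]

summary_a = summarize(analyst_a_bmi)
summary_b = summarize(analyst_b_bmi)
replicated = builds_match(summary_a, summary_b)
```

If `replicated` is false, the two analysts compare their construction logs step by step until the source of the discrepancy is found, before any estimate is reported.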
6.3 | Conclusion
In this article, we described how to find and use publicly-
available datasets to ask questions of relevance to the field
of human biology. Human biological and anthropological
examination of these datasets may allow for an analysis of
cultural and structural nuances that impact health as well as
providing an evolutionary context. The call to increase the
use of secondary analysis of existing datasets has been ech-
oed in many fields adjacent to human biology, including
recently in evolutionary anthropology (Mattison & Sear,
2016; Stulp, Sear, Schaffnit, Mills, & Barrett, 2016). The
use of nationally-representative data is particularly powerful
for statistical inferences. Oftentimes, results can draw sub-
stantial public and media attention as their generalizability
can apply to a broad audience. Moreover, analyzing and
publishing findings from publicly available datasets is net
beneficial for both the researcher and the studies whose data
are being used, as these additional publications can be
pointed to as evidence of data sharing and the importance of
their study for future funding.
Remember, with great (statistical) power, comes great
responsibility. It is critical to avoid pitfalls of p-hacking and
of using incorrect testing procedures. Setting up hypotheses a
priori, preregistering protocols, and creating table shells
without results all help researchers stay honest to the data.
Adding secondary data analysis to your toolkit is a powerful
skill to have
for research productivity, mentoring, and teaching. It's a won-
derful, data-filled world out there, let's go exploring!
USEFUL RESOURCES
Korn, E. L., & Graubard, B. I. (1999). Analysis of health
surveys. John Wiley & Sons.
Hamilton, L. C. (2012). Statistics with Stata: Version 12.
Cengage Learning.
Mehmetoglu, M., & Jakobsen, T. G. (2016). Applied sta-
tistics using Stata: a guide for the social sciences. Sage.
Osborne, J. W. (Ed.). (2008). Best practices in quantita-
tive methods. Sage.
Wooldridge, J. M. (2015). Introductory econometrics: A
modern approach. Nelson Education.
Richard McElreath on good data practices: Statistical
Rethinking, a book and series of open lectures using a
Bayesian approach with R statistical software examples
(https://www.youtube.com/channel/UCNJK6_DZvcMqNSzQdEkzvzA/videos).
ACKNOWLEDGMENTS
We thank the panelists and attendees of the Breakout session
around this topic at the Human Biology Association 42nd
Annual Meeting. Thanks to Kelly Ochs Rosinger and Dan
T.A. Eisenberg for helpful conversations about this article.
AR was supported by the College of Health and Human
Development at Pennsylvania State University.
AUTHOR CONTRIBUTIONS
AR drafted and critically revised the manuscript. GI criti-
cally revised the manuscript.
ORCID
Asher Y. Rosinger https://orcid.org/0000-0001-9587-1447
REFERENCES
Adair, L. S., Popkin, B. M., Akin, J. S., Guilkey, D. K., Gultiano, S., Borja, J., ... Hindin, M. J. (2010). Cohort profile: The Cebu longitudinal health and nutrition survey. International Journal of Epidemiology, 40(3), 619-625.
Allen, C. P. G., & Mehler, D. M. A. (2018). Open Science challenges, benefits and tips in early career and beyond. https://doi.org/10.31234/osf.io/3czyt
Archer, K. J., Lemeshow, S., & Hosmer, D. W. (2007). Goodness-of-fit tests for logistic regression models when data are collected using a complex sampling design. Computational Statistics & Data Analysis, 51(9), 4450-4464.
Arokiasamy, P., Uttamacharya, U., Jain, K., Biritwum, R. B., Yawson, A. E., Wu, F., ... Afshar, S. (2015). The impact of multimorbidity on adult physical and mental health in low- and middle-income countries: What does the study on global ageing and adult health (SAGE) reveal? BMC Medicine, 13(1), 178.
Barrett, T. M., Liebert, M. A., Schrock, J. M., Cepon-Robins, T. J., Mathur, A., Agarwal, H., ... Snodgrass, J. J. (2016). Physical function and activity among older adults in Jodhpur, India. Annals of Human Biology, 43(5), 488-491.
Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., ... Cesarini, D. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6-10.
Bethancourt, H. J., Kratz, M., Beresford, S. A., Hayes, M. G., Kuzawa, C. W., Duazo, P. L., ... Eisenberg, D. T. (2017). No association between blood telomere length and longitudinally assessed diet or adiposity in a young adult Filipino population. European Journal of Nutrition, 56(1), 295-308.
Biritwum, R. B., Minicuci, N., Yawson, A. E., Theou, O., Mensah, G. P., Naidoo, N., ... WHO SAGE Collaboration. (2016). Prevalence of and factors associated with frailty and disability in older adults from China, Ghana, India, Mexico, Russia and South Africa. Maturitas, 91, 8-18. https://doi.org/10.1016/j.maturitas.2016.05.012
Blackwell, A. D., Gurven, M. D., Sugiyama, L. S., Madimenos, F. C., Liebert, M. A., Martin, M. A., ... Snodgrass, J. J. (2011). Evidence for a peak shift in a humoral response to helminths: Age profiles of IgE in the Shuar of Ecuador, the Tsimane of Bolivia, and the US NHANES. PLoS Neglected Tropical Diseases, 5(6), e1218.
Boas, F. (1912). Changes in the bodily form of descendants of immigrants. American Anthropologist, 14, 530-562.
Brewis, A. A., Han, S. Y., & SturtzSreetharan, C. L. (2017). Weight, gender, and depressive symptoms in South Korea. American Journal of Human Biology, 29(4), e22972.
Brown, P. J., & Konner, M. (1987). An anthropological perspective on obesity. Annals of the New York Academy of Sciences, 499(1), 29-46.
Buttke, D. E., Sircar, K., & Martin, C. (2012). Exposures to endocrine-disrupting chemicals and age of menarche in adolescent girls in NHANES (2003-2008). Environmental Health Perspectives, 120(11), 1613-1618.
Chen, J., & Shao, J. (2000). Nearest neighbor imputation for survey data. Journal of Official Statistics, 16(2), 113.
Cheng, H. G., & Phillips, M. R. (2014). Secondary analysis of existing data: Opportunities and implementation. Shanghai Archives of Psychiatry, 26(6), 371-375.
Choudhury, S., Fishman, J., McGowan, M., & Juengst, E. (2014). Big data, open science and the brain: Lessons learned from genomics. Frontiers in Human Neuroscience, 8(May), 239. https://doi.org/10.3389/fnhum.2014.00239
Doolan, D. M., & Froelicher, E. S. (2009). Using an existing data set to answer new research questions: A methodological review. Research and Theory for Nursing Practice, 23(3), 203-215.
Eisenberg, D. T., Borja, J. B., Hayes, M. G., & Kuzawa, C. W. (2017). Early life infection, but not breastfeeding, predicts adult blood telomere lengths in the Philippines. American Journal of Human Biology, 29(4), e22962.
Ember, C. R., Skoggard, I., Ringen, E. J., & Farrer, M. (2018). Our better nature: Does resource stress predict beyond-household sharing? Evolution and Human Behavior, 39, 380-391.
Facchini, F., Fiori, G., Bedogni, G., Galletti, L., Belcastro, M. G., Ismagulov, O., ... Goldoni, M. (2007). Prevalence of overweight and cardiovascular risk factors in rural and urban children from Central Asia: The Kazakhstan health and nutrition examination survey. American Journal of Human Biology, 19(6), 809-820.
Geronimus, A. T., Hicken, M., Keene, D., & Bound, J. (2006). "Weathering" and age patterns of allostatic load scores among blacks and whites in the United States. American Journal of Public Health, 96(5), 826-833.
Gettler, L. T., McDade, T. W., Bragg, J. M., Feranil, A. B., & Kuzawa, C. W. (2015). Developmental energetics, sibling death, and parental instability as predictors of maturational tempo and life history scheduling in males from Cebu, Philippines. American Journal of Physical Anthropology, 158(2), 175-184.
Gettler, L. T., McDade, T. W., Feranil, A. B., & Kuzawa, C. W. (2011). Longitudinal evidence that fatherhood decreases testosterone in human males. Proceedings of the National Academy of Sciences, 108(39), 16194-16199.
Gettler, L. T., & Oka, R. C. (2016). Are testosterone levels and depression risk linked based on partnering and parenting? Evidence from a large population-representative study of US men and women. Social Science & Medicine, 163, 157-167.
Gettler, L. T., Sarma, M. S., Gengo, R. G., Oka, R. C., & McKenna, J. J. (2017). Adiposity, CVD risk factors and testosterone: Variation by partnering status and residence with children in US men. Evolution, Medicine, and Public Health, 2017(1), 67-80.
Gildner, T. E., Barrett, T. M., Liebert, M. A., Kowal, P., & Snodgrass, J. J. (2015). Does BMI generated by self-reported height and weight measure up in older adults from middle-income countries? Results from the study on global AGEing and adult health (SAGE). BMC Obesity, 2(1), 44.
Gildner, T. E., Liebert, M. A., Kowal, P., Chatterji, S., & Snodgrass, J. J. (2014). Sleep duration, sleep quality, and obesity risk among older adults from six middle-income countries: Findings from the study on global ageing and adult health (SAGE). American Journal of Human Biology, 26(6), 803-812.
Gorber, S. C., Tremblay, M., Moher, D., & Gorber, B. (2007). A comparison of direct vs. self-report measures for assessing height, weight and body mass index: A systematic review. Obesity Reviews, 8(4), 307-326.
Grady, D. G., Cummings, S. R., & Hulley, S. B. (2013). Research using existing data. In Designing clinical research (4th ed., pp. 192-204). Philadelphia: Lippincott Williams & Wilkins.
Gravlee, C. C., Bernard, H. R., & Leonard, W. R. (2003). Heredity, environment, and cranial form: A reanalysis of Boas's immigrant data. American Anthropologist, 105(1), 125-138.
Greene, C. S., Garmire, L. X., Gilbert, J. A., Ritchie, M. D., & Hunter, L. E. (2017). Celebrating parasites. Nature Genetics, 49(4), 483-484.
Gurven, M., Kaplan, H., & Supa, A. Z. (2007). Mortality experience of Tsimane Amerindians of Bolivia: Regional variation and temporal trends. American Journal of Human Biology, 19(3), 376-398.
Hadley, C., & DeCaro, J. A. (2014). Testing hypothesized predictors of immune activation in Tanzanian infants and children: Community, household, caretaker, and child effects. American Journal of Human Biology, 26(4), 523-529.
Hadley, C., Maxfield, A., & Hruschka, D. (2019). Different forms of household wealth are associated with opposing risks for HIV infection in East Africa. World Development, 113, 344-351.
Hamilton, L. C. (2012). Statistics with Stata: Version 12. Cengage Learning (8th ed., pp. 1-469).
Head, M. L., Holman, L., Lanfear, R., Kahn, A. T., & Jennions, M. D. (2015). The extent and consequences of p-hacking in science. PLoS Biology, 13(3), e100210.
Hruschka, D. J., Hadley, C., & Brewis, A. (2014). Disentangling basal and accumulated body mass for cross-population comparisons. American Journal of Physical Anthropology, 153(4), 542-550.
Hruschka, D. J., Hadley, C., & Hackman, J. (2017). Material wealth in 3D: Mapping multiple paths to prosperity in low- and middle-income countries. PLoS ONE, 12(9), e0184616.
Hruschka, D. J., & Hagaman, A. (2015). The physiological cost of reproduction for rich and poor across 65 countries. American Journal of Human Biology, 27(5), 654-659.
Jann, B. (2008). Multinomial goodness-of-fit: Large-sample tests with survey design correction and exact tests for small samples. Stata Journal, 8(2), 147-169.
Johnson, W. (2015). Analytical strategies in human growth research. American Journal of Human Biology, 27(1), 69-83.
Kaiser, B. N., Hruschka, D., & Hadley, C. (2017). Measuring material wealth in low-income settings: A conceptual and how-to guide. American Journal of Human Biology, 29(4), e22987.
Kant, A. K., & Graubard, B. I. (2016). A prospective study of water intake and subsequent risk of all-cause mortality in a national cohort. The American Journal of Clinical Nutrition, 105(1), 212-220.
Kaplan, R. M., Chambers, D. A., & Glasgow, R. E. (2014). Big data and large sample size: A cautionary note on the potential for bias. Clinical and Translational Science, 7(4), 342-346.
Karczewski, K., Tatonetti, N., Manrai, A., Patel, C., Brown, C. T., & Ioannidis, J. (2017). Methods to ensure the reproducibility of biomedical research. Pacific Symposium on Biocomputing, 22, 117-119. https://doi.org/10.1142/9789813207813_0012
Korn, E. L., & Graubard, B. I. (1999). Analysis of health surveys. New York: John Wiley & Sons.
Kowal, P., Chatterji, S., Naidoo, N., Biritwum, R., Fan, W., Lopez Ridaura, R., ... the SAGE Collaborators. (2012). Data resource profile: The World Health Organization Study on global ageing and adult health (SAGE). International Journal of Epidemiology, 41(6), 1639-1649.
Kuczmarski, R. J., Ogden, C. L., Guo, S. S., Grummer-Strawn, L. M., Flegal, K. M., Mei, Z., ... Johnson, C. L. (2002). 2000 CDC Growth Charts for the United States: Methods and development. Vital and Health Statistics. Series 11, Data from the national health survey, 246, 1-190.
Kuzawa, C. W., Tallman, P. S., Adair, L. S., Lee, N., & McDade, T. W. (2012). Inflammatory profiles in the non-pregnant state predict offspring birth weight at Cebu: Evidence for inter-generational effects of low grade inflammation. Annals of Human Biology, 39(4), 267-274.
Kweon, S., Kim, Y., Jang, M. J., Kim, Y., Kim, K., Choi, S., ... Oh, K. (2014). Data resource profile: The Korea national health and nutrition examination survey (KNHANES). International Journal of Epidemiology, 43(1), 69-77.
Lee, J., Fried, R., Thayer, Z., & Kuzawa, C. W. (2014). Preterm delivery as a predictor of diurnal cortisol profiles in adulthood: Evidence from Cebu, Philippines. American Journal of Human Biology, 26(5), 598-602.
Lesnik, J. J. (2017). Not just a fallback food: Global patterns of insect consumption related to geography, not agriculture. American Journal of Human Biology, 29(4), e22976.
Longo, D. L., & Drazen, J. M. (2016). Data sharing. The New England Journal of Medicine, 374(3), 276-277. https://doi.org/10.1056/NEJMe1516564
Magee, T., Lee, S. M., Giuliano, K. K., & Munro, B. (2006). Generating new knowledge from existing data: The use of large data sets for nursing research. Nursing Research, 55(2), S50-S56.
Malina, R. M., Pena Reyes, M. E., Tan, S. K., Buschang, P. H., Little, B. B., & Koziel, S. (2004). Secular change in height, sitting height and leg length in rural Oaxaca, southern Mexico: 1968-2000. Annals of Human Biology, 31(6), 615-633.
Mattison, S. M., & Sear, R. (2016). Modernizing evolutionary anthropology. Human Nature, 27(4), 335-350.
McDade, T. W., Hoke, M., Borja, J. B., Adair, L. S., & Kuzawa, C. (2013). Do environments in infancy moderate the association between stress and inflammation in adulthood? Initial evidence from a birth cohort in the Philippines. Brain, Behavior, and Immunity, 31, 23-30.
McDade, T. W., Rutherford, J., Adair, L., & Kuzawa, C. W. (2010). Early origins of inflammation: Microbial exposures in infancy predict lower levels of C-reactive protein in adulthood. Proceedings of the Royal Society of London B: Biological Sciences, 277(1684), 1129-1137.
McDowell, M. A., Brody, D. J., & Hughes, J. P. (2007). Has age at menarche changed? Results from the national health and nutrition examination survey (NHANES) 1999-2004. Journal of Adolescent Health, 40(3), 227-231.
McNutt, M. (2015). Editorial retraction of LaCour, M. J., & Green, D. P. (2014). When contact changes minds: An experiment on transmission of support for gay equality. Science, 346(6215), 1366-1369.
Murasko, J. E. (2017). Height, BMI, and relative economic standing in children from developing countries. American Journal of Human Biology, 29(3), e22958.
Murphy, M. (2012). Intergenerational fertility correlations in contemporary developing counties. American Journal of Human Biology, 24(5), 696-704.
National Research Council. (2001). Cells and surveys: Should biological measures be included in social science research? Washington, DC: The National Academies Press. https://doi.org/10.17226/9995
Ogden, C. L., Fryar, C. D., Hales, C. M., Carroll, M. D., Aoki, Y., & Freedman, D. S. (2018). Differences in obesity prevalence by demographics and urbanization in US children and adolescents, 2013-2016. JAMA, 319(23), 2410-2418.
Okafor, P. N., Chiejina, M., de Pretis, N., & Talwalkar, J. A. (2016). Secondary analysis of large databases for hepatology research. Journal of Hepatology, 64(4), 946-956.
Ornstein, C. (April 5, 2016). Amid public feuds, a venerated medical journal finds
itself under attack. ProPublica, 2016 https://www.propublica.org/article/amid-
public-feuds-a-venerated-medical-journal-finds-itself-under-attack.
Osborne, J. W. (Ed.). (2008). Best practices in quantitative methods. Los
Angeles: Sage.
Pontzer, H., Raichlen, D. A., Wood, B. M., Emery Thompson, M., Racette, S. B.,
Mabulla, A. Z., & Marlowe, F. W. (2015). Energy expenditure and
activity among Hadza hunter-gatherers. American Journal of Human Biology,
27(5), 628–637.
Quinn, E. A., & Kuzawa, C. W. (2012). A dose–response relationship between
fish consumption and human milk DHA content among Filipino women in
Cebu City, Philippines. Acta Paediatrica, 101(10), e439–e445.
Raichlen, D. A., Pontzer, H., Harris, J. A., Mabulla, A. Z., Marlowe, F. W., Josh
Snodgrass, J., … Wood, B. M. (2017). Physical activity patterns and
biomarkers of cardiovascular disease risk in hunter-gatherers. American Journal
of Human Biology, 29(2), e22919.
Reardon, S. (2015). US vaccine researcher sentenced to prison for fraud. Nature,
523(7559), 138–139.
Rivara, A. C., & Miller, E. M. (2017). Pregnancy and immune stimulation:
Re-imagining the fetus as parasite to understand age-related immune system
changes in US women. American Journal of Human Biology, 29(6), e23043.
Rosinger, A., Carroll, M., Lacher, D., & Ogden, C. (2017). Decreasing trends in
total cholesterol, triglycerides, and low-density lipoprotein in U.S. adults,
1999–2014. JAMA Cardiology, 2(3), 339–341.
Rosinger, A., & Godoy, R. (2016). Adult weight and height of native populations.
The Oxford Handbook of Economics and Human Biology, 192–209.
Rosinger, A., Herrick, K., Gahche, J., & Park, S. (2017). Sugar-sweetened
beverage (SSB) consumption among U.S. youth, 2011–2014. NCHS Data Brief,
271, 1–8.
Rosinger, A. Y., Chang, A. M., Buxton, O. M., Li, J., Wu, S., & Gao, X. (2019).
Short sleep duration is associated with inadequate hydration: Cross-cultural
evidence from US and Chinese adults. Sleep, 42(2), zsy210.
https://doi.org/10.1093/sleep/zsy210
Rosinger, A. Y., Herrick, K. A., Wutich, A. Y., Yoder, J. S., & Ogden, C. L.
(2018). Disparities in plain, tap and bottled water consumption among US
adults: National Health and Nutrition Examination Survey (NHANES)
2007–2014. Public Health Nutrition, 21(8), 1455–1464.
Rosinger, A. Y., Lawman, H. G., Akinbami, L. J., & Ogden, C. L. (2016). The
role of obesity in the relation between total water intake and urine osmolality
in US adults, 2009–2012. The American Journal of Clinical Nutrition,
104(6), 1554–1561.
Salinas, A., Rivas-Marino, G., Negin, J., Salinas-Rodriguez, A.,
Manrique-Espinoza, B., Sterner, K. N., … Kowal, P. (2015). Prevalence of overweight
and obesity in older Mexican adults and its association with physical activity
and related factors: An analysis of the study on global ageing and adult
health. American Journal of Human Biology, 27(3), 326–333.
Schneider, L. (January 22, 2016). Research parasitism and authorship rights.
For Better Science.
https://forbetterscience.com/2016/01/22/research-parasitism-and-authorship-rights/
Schrock, J. M., McClure, H. H., Snodgrass, J. J., Liebert, M. A., Charlton, K. E.,
Arokiasamy, P., … Kowal, P. (2017). Food insecurity partially mediates
associations between social disadvantage and body composition among older
adults in India: Results from the study on global AGEing and adult health
(SAGE). American Journal of Human Biology: The Official Journal of the
Human Biology Council, 29(6), 29. https://doi.org/10.1002/ajhb.23033
Shattuck, E. C., & Sparks, C. (2017). Sleep duration and quality are associated
with immune markers, self-rated health, and mortality in a large cohort
(NHANES 2005–2014). Brain, Behavior, and Immunity, 66, e2.
Shaywitz, D. (January 21, 2016). Data Scientists = Research Parasites? Forbes.
https://www.forbes.com/sites/davidshaywitz/2016/01/21/data-scientists-research-parasites/#626e6d5d66a6
Sparks, C. S. (2011). Parental investment and socioeconomic status influences
on children's height in Honduras: An analysis of national data. American
Journal of Human Biology, 23(1), 80–88.
Sparks, C. S., & Jantz, R. L. (2002). A reassessment of human cranial plasticity:
Boas revisited. Proceedings of the National Academy of Sciences, 99(23),
14636–14639.
Steyn, N. P., Labadarios, D., Maunder, E., Nel, J., Lombard, C., & Directors of
the National Food Consumption Survey. (2005). Secondary anthropometric
data analysis of the National Food Consumption Survey in South Africa: The
double burden. Nutrition, 21(1), 4–13.
Stookey, J. D., Barclay, D., Arieff, A., & Popkin, B. M. (2007). The altered fluid
distribution in obesity may reflect plasma hypertonicity. European Journal
of Clinical Nutrition, 61(2), 190–199.
Stulp, G., Sear, R., Schaffnit, S. B., Mills, M. C., & Barrett, L. (2016). The
reproductive ecology of industrial societies, part II. Human Nature, 27(4),
445–470.
Taichman, D. B., Backus, J., Baethge, C., Bauchner, H., de Leeuw, P. W.,
Drazen, J., et al. (2016). Sharing clinical trial data: A proposal from the
International Committee of Medical Journal Editors. PLoS Medicine, 13(1),
e1001950. https://doi.org/10.1371/journal.pmed.1001950
Taichman, D. B., Sahni, P., Pinborg, A., Peiperl, L., Laine, C., James, A.,
… Backus, J. (2017). Data sharing statements for clinical trials: A requirement
of the International Committee of Medical Journal Editors. JAMA, 317(24),
2491–2492. https://doi.org/10.1001/jama.2017.6514
Thayer, Z. M., Agustin Bechayda, S., & Kuzawa, C. W. (2018). Circadian
cortisol dynamics across reproductive stages and in relation to breastfeeding in
the Philippines. American Journal of Human Biology, 30, e23115.
Thompson, A. L., Houck, K. M., Adair, L., Gordon-Larsen, P., Du, S.,
Zhang, B., & Popkin, B. (2014). Pathogenic and obesogenic factors
associated with inflammation in Chinese children, adolescents and adults.
American Journal of Human Biology, 26(1), 18–28.
Vartanian, T. P. (2010). Secondary data analysis. Oxford: Oxford University
Press.
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA's statement on p-values:
Context, process, and purpose. The American Statistician, 70(2),
129–133.
Wiley, A. S. (2005). Does milk make children grow? Relationships between milk
consumption and height in NHANES 1999–2002. American Journal of
Human Biology, 17, 425–441.
Wu, F., Guo, Y., Chatterji, S., Zheng, Y., Naidoo, N., Jiang, Y., …
Manrique-Espinoza, B. (2015). Common risk factors for chronic non-communicable
diseases among older adults in China, Ghana, Mexico, India, Russia and
South Africa: The study on global AGEing and adult health (SAGE) wave 1.
BMC Public Health, 15(1), 88.
How to cite this article: Rosinger AY, Ice G. Secondary
data analysis to answer questions in human
biology. Am J Hum Biol. 2019;e23232.
https://doi.org/10.1002/ajhb.23232
... After classification, I generated three variables: 1) total kilocalories (kcals) consumed from SSBs, which summed energy intake from SSBs by participant; 2) the percent of kcals from SSBs, which was the total kcal from SSBs divided by total caloric intake of the participant to estimate the contribution of SSBs to daily total energy; and 3) a dichotomous variable of whether that percentage exceeded 10% or not, to estimate whether a participant exceeded the USDA's recommendation (29). I then collapsed this dataset to the participant level and merged it with the demographic and total nutrient intakes datasets (45). ...
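The three variables described in this excerpt can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the cited author's code: the function name `summarize_ssb_intake`, the `(kcal, is_ssb)` record layout, and the output keys are all hypothetical.

```python
from typing import List, Tuple


def summarize_ssb_intake(items: List[Tuple[float, bool]], total_kcal: float) -> dict:
    """Collapse one participant's dietary recall items into SSB summary variables.

    items: hypothetical (kcal, is_ssb) pairs, one per reported food or drink.
    total_kcal: the participant's total energy intake for the same day.
    """
    # 1) Total kilocalories consumed from sugar-sweetened beverages.
    ssb_kcal = sum(kcal for kcal, is_ssb in items if is_ssb)

    # 2) Percent of total kilocalories from SSBs (left undefined if total is 0).
    pct_from_ssb = 100.0 * ssb_kcal / total_kcal if total_kcal > 0 else None

    # 3) Dichotomous indicator: does that percentage exceed 10%?
    exceeds_10pct = None if pct_from_ssb is None else pct_from_ssb > 10.0

    return {
        "ssb_kcal": ssb_kcal,
        "pct_kcal_from_ssb": pct_from_ssb,
        "exceeds_10pct": exceeds_10pct,
    }


# Hypothetical participant: two SSB items and one other food item.
recall = [(150.0, True), (100.0, True), (500.0, False)]
print(summarize_ssb_intake(recall, total_kcal=2000.0))
```

Returning one record per participant mirrors the "collapse to the participant level" step, after which the record could be merged with demographic data on a participant identifier.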
Background: In the US, problems with the provision of safe, affordable water have resulted in an increasing number of adults who avoid their tap water, which may indicate underlying water insecurity. Dietary recalls provide critical nutritional surveillance data yet have been underexplored as a water insecurity monitoring tool. Objectives: This article aims to demonstrate how water intake variables from dietary recall data relate to and predict a key water insecurity proxy, i.e., tap water avoidance. Methods: Using 2005-2018 National Health and Nutrition Examination Survey data among 32,329 adults, I examine distributions and trends of mean intakes of total, plain (sum of tap and bottled water), tap, and bottled water, and % consuming no tap and exclusive bottled water. Second, I use multiple linear and logistic regressions to test how tap water avoidance relates to plain water intake and sugar-sweetened beverage consumption. Next, I use receiver operating characteristics (ROC) curves to test the predictive accuracy of no plain water, no tap, and exclusive bottled water intake, and varying percentages of plain water consumed from tap water compared to tap water avoidance. Results: Trends indicate increasing plain water intake between 2005-2018, driven by increasing bottled water intake. In 2017-18, 51.4% of adults did not drink tap water on a given day, while 35.8% exclusively consumed bottled water. Adults who avoided their tap water consumed less tap and plain water, and significantly more bottled water and SSBs on a given day. No tap intake and categories of tap water intake produced 77% and 78% areas under the ROC curve in predicting tap water avoidance. Conclusion: This study demonstrates that water intake variables from dietary recalls can be used to accurately predict tap water avoidance and provide a window into water insecurity. Growing reliance on bottled water may indicate increasing concerns about tap water.
... In contrast, when two or more datasets have been combined, and, for example, only country-level data is available, researchers typically rely on correlation and regression analyses (e.g., Basabe & Valencia, 2007;Inman et al., 2017). Recommendations for performing secondary data analyses exist, for example, for social studies (Fitchett & Heafner, 2017), medical sciences (Cheng & Phillips, 2014), human biology (Rosinger & Ice, 2019), and qualitative research (Sherif, 2018). ...
The Covid-19 pandemic has far-reaching implications for researchers. For example, many researchers cannot access their labs anymore and are hit by budget-cuts from their institutions. Luckily, there are a range of ways how high-quality research can be conducted without funding and face-to-face interactions. In the present paper, I discuss nine such possibilities, including meta-analyses, secondary data analyses, web-scraping, scientometrics, or sharing one’s expert knowledge (e.g., writing tutorials). Most of these possibilities can be done from home, as they require only access to a computer, the internet, and time; but no state-of-the art equipment or funding to pay for participants. Thus, they are particularly relevant for researchers with limited financial resources beyond pandemics and quarantines.
... We followed all NCHS guidelines for the analysis of NHANES data 62. As the survey weights relevant to the smallest sample subpopulation for which all data are available should be used, we used mobile examination center (MEC) weights to adjust for complex survey design, oversampling, non-coverage, day of the week, and survey nonresponse to compute nationally representative estimates 63,64. Per NHANES analytical guidelines for combining data across cycles, 12-year MEC weights were calculated using the NHANES-provided variables WTMEC4YR and WTMEC2YR as follows: WTMEC12YR = 1/3 × WTMEC4YR for the 1999–2000 and 2001–2002 cycles and WTMEC12YR = 1/6 × WTMEC2YR for all subsequent cycles. ...
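The weight rescaling quoted in this excerpt can be written as a small helper. The sketch below assumes six pooled two-year cycles (1999–2010), so the four-year weight covers two of six cycles (factor 1/3) and each two-year weight covers one of six (factor 1/6). The function name `combined_mec_weight` and the cycle labels are hypothetical; only WTMEC4YR and WTMEC2YR come from the excerpt.

```python
from typing import Optional

# Cycles for which the excerpt uses the four-year weight (WTMEC4YR).
FOUR_YEAR_WEIGHT_CYCLES = {"1999-2000", "2001-2002"}


def combined_mec_weight(
    cycle: str, wtmec4yr: Optional[float], wtmec2yr: Optional[float]
) -> float:
    """Return a 12-year MEC weight (WTMEC12YR) for one participant.

    cycle: hypothetical label for the participant's survey cycle.
    wtmec4yr / wtmec2yr: the NHANES-provided weights; either may be missing
    for cycles in which it is not used.
    """
    if cycle in FOUR_YEAR_WEIGHT_CYCLES:
        if wtmec4yr is None:
            raise ValueError("WTMEC4YR is required for the 1999-2002 cycles")
        return wtmec4yr / 3  # two of the six pooled cycles
    if wtmec2yr is None:
        raise ValueError("WTMEC2YR is required for cycles after 2001-2002")
    return wtmec2yr / 6  # one of the six pooled cycles


# Hypothetical weights for two participants from different cycles.
print(combined_mec_weight("1999-2000", 30000.0, None))
print(combined_mec_weight("2007-2008", None, 12000.0))
```

The rescaled weight would then be supplied to survey-aware estimation routines together with the design's strata and cluster variables.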
Understanding factors contributing to variation in ‘biological age’ is essential to understanding variation in susceptibility to disease and functional decline. One factor that could accelerate biological aging in women is reproduction. Pregnancy is characterized by extensive, energetically-costly changes across numerous physiological systems. These ‘costs of reproduction’ may accumulate with each pregnancy, accelerating biological aging. Despite evidence for costs of reproduction using molecular and demographic measures, it is unknown whether parity is linked to commonly-used clinical measures of biological aging. We use data collected between 1999 and 2010 from the National Health and Nutrition Examination Survey (n = 4418) to test whether parity (number of live births) predicted four previously-validated composite measures of biological age and system integrity: Levine Method, homeostatic dysregulation, Klemera–Doubal method biological age, and allostatic load. Parity exhibited a U-shaped relationship with accelerated biological aging when controlling for chronological age, lifestyle, health-related, and demographic factors in post-menopausal, but not pre-menopausal, women, with biological age acceleration being lowest among post-menopausal women reporting between three and four live births. Our findings suggest a link between reproductive function and physiological dysregulation, and allude to possible compensatory mechanisms that buffer the effects of reproductive function on physiological dysregulation during a woman’s reproductive lifespan. Future work should continue to investigate links between parity, menopausal status, and biological age using targeted physiological measures and longitudinal studies.
... The investment of time needed to develop the skills for data management, analysis and interpretation should not be under-estimated. Secondary datasets may be very large and complex, requiring substantial preparation before they are suitable for analysis; sophisticated statistical analysis to understand associations between variables; and rigorous documentation during preparation and analysis (see for tips on the use of secondary data: Rosinger & Ice, 2019;Second. data Anal. ...
... Statistical analyses accounted for the complex survey design of NHANES. We used the dayone dietary sample weights, which adjusted for oversampling, non-response, non-coverage, and day of week, since that is the smallest subpopulation for which all data were available and the point at which tap water avoidance and bottled water data were collected (Korn and Graubard 2011;Rosinger and Ice 2019). All estimates and 95% confidence intervals presented, except for sample sizes, are weighted and generated using survey commands following Korn and Graubard (Korn and Graubard 2011). ...
Despite evidence that tap water is often safer and cheaper than alternative sources, tap water is avoided when perceived to be unsafe. Therefore, we conducted the first nationally representative U.S. trends analysis of in‐home tap water avoidance between 2007 and 2016. We tested whether changes occurred during/after the Flint water crisis, and whether not drinking tap from one's main water source differed by age, race/ethnicity, and socioeconomic status across time. Finally, we tested whether tap water avoidance was associated with higher prevalence of bottled water consumption among children. We used data on 12,915 children and 23,139 adults from the National Health and Nutrition Examination Survey. Significant covariate‐adjusted quadratic time trends were found in the prevalence of avoiding tap water with an inflection at 2013–2014 for children, but not adults. Piecewise log‐binomial regressions estimated that between 2007 and 2014 each survey cycle was associated with 14% lower prevalence of not drinking tap water (prevalence ratio [PR] 0.86, 95% CI: 0.80–0.93), but in 2014–2016 a 53% (95% CI: 1.12–2.09) higher prevalence was found for children corresponding to the water crisis. Younger children, Hispanic, non‐Hispanic black, and those from low socioeconomic status backgrounds had consistently higher probability of avoiding tap water over time. Children who avoided tap water had 92% higher prevalence of drinking bottled water. In 2015–2016, 78% of non‐Hispanic black children who avoided tap water drank bottled water on a given day. Avoiding tap water may indicate underlying water insecurity in the United States. Efforts to address tap water distrust have critical health and economic implications.
Biological anthropology in 2018 encapsulated what past scholars envisioned for its future: a multidisciplinary approach to understanding human and nonhuman primate evolution and diversity using the most innovative techniques and rigorous standards available. This year also built on a tradition of introspection about what biological anthropology encompasses and by whom and how it is conducted. This review highlights research and movements in the field that reflect both of these pursuits. Studies drew on evolutionary theory to generate novel insights into human and nonhuman primate biology, behavior, and organization. Studies on hominin evolution and human biology have upended previous understandings by revealing more dynamic and context-dependent processes in our ancestry and phenotypic expressions. Across subdisciplines, biological anthropologists have advanced the use of new technologies and analytical techniques and begun to promote open, transparent, and reproducible science among a more diverse community of researchers. [year in review, evolutionary anthropology, context and variation, emerging technologies, transparent methods, researcher diversity].
Key findings: Data from the National Health and Nutrition Examination Survey
• Almost two-thirds of boys and girls consumed at least one sugar-sweetened beverage on a given day.
• Boys consumed an average 164 kilocalories (kcal) from sugar-sweetened beverages, which contributed 7.3% of total daily caloric intake. Girls consumed an average 121 kcal from sugar-sweetened beverages, which contributed 7.2% of total daily caloric intake.
• Among both boys and girls, older youth had the highest mean intake and percentage of daily calories from sugar-sweetened beverages relative to younger children.
• Non-Hispanic Asian boys and girls consumed the least calories and the lowest percentage of total calories from sugar-sweetened beverages compared with non-Hispanic white, non-Hispanic black, and Hispanic boys and girls.
Sugar-sweetened beverages contribute calories and added sugars to the diets of U.S. children (1). Studies have suggested a link between the consumption of sugar-sweetened beverages and dental caries, weight gain, type 2 diabetes, dyslipidemia, and nonalcoholic fatty liver disease in children (2-6). The 2015-2020 Dietary Guidelines for Americans recommend reducing added sugars consumption to less than 10% of calories per day and, specifically, to choose beverages with no added sugars (1). This report presents results for consumption of sugar-sweetened beverages among U.S. youth aged 2-19 years for 2011-2014 by sex, age, and race and Hispanic origin.
The relationship between material wealth and HIV infection in sub-Saharan Africa has been the subject of considerable debate in part because many studies show that wealth is positively associated with infection. Others have critiqued such results, suggesting that the widely used indicators of wealth underlying these results fail to capture the diversity of livelihood portfolios in East Africa. Using population representative data from 35,799 households in Kenya, Ethiopia, and Tanzania, we estimate household wealth along two different dimensions, associated respectively with success in wage economies and agricultural economies. Regression models for men and women show consistent and opposing associations between type of wealth and HIV infection. Controlling for age, education, and urban dwelling, increasing achievement along the wage economy dimension is positively (often significantly) associated with HIV infection. In contrast, increasing achievement along the agricultural economy dimension is often negatively associated with HIV infection, and is never associated with increased HIV risk. Interestingly, variables to assess risky sexual behaviors do not mediate the relationship between either type of wealth and HIV infection. Our results suggest that future studies on the relationship between HIV and wealth need to take into account the different dimensions of household wealth found in East African countries. Our results also generate new, important questions about why and how different forms of wealth drive HIV infection.
Study Objectives Short and long sleep duration are linked to reduced kidney function, but little research has examined how sleep is associated with hydration status. Our aim was to assess the relationship between sleep duration and urinary hydration biomarkers among adults in a cross-cultural context. Methods Three samples of adults aged 20y were analyzed: 2007–2008 National Health and Nutrition Examination Survey (NHANES; n=4,680), 2009–2012 NHANES (n=9,559), and 2012 cross-sectional wave of the Chinese Kailuan Study (n=11,903), excluding pregnant women and adults with failing kidneys. We estimated multiple linear regression models between self-reported usual night-time sleep duration (<6, 6, 7, 8 (reference), 9 hrs/day) and urine specific gravity (Usg) and urine osmolality (Uosm) as continuous variables and logistic regression models dichotomized as inadequate hydration (>1.020 g/ml; >831 mOsm/kg). In primary analyses, we estimated models excluding diabetes and diuretic medications for healthier sub-populations (NHANES n=11,353; Kailuan n=8,766). Results In the healthier NHANES subset, 6 hours was associated with significantly higher Usg and odds of inadequate hydration (adjusted OR: 1.59, 95% CI: 1.25, 2.03) compared to 8 hours. Regression results were mixed using Uosm, but in the same direction as Usg. Among Chinese adults, short sleep duration (<6 and 6 hours) was associated with Usg and higher likelihood of inadequate hydration (6 hours adjusted OR: 1.42, 95% CI: 1.26, 1.60). No consistent association was found with sleeping ≥9 hours. Conclusions Short sleep duration was associated with higher odds of inadequate hydration in US and Chinese adults relative to sleeping 8 hours.
The movement towards open science is an unavoidable consequence of seemingly pervasive failures to replicate previous research. This transition comes with great benefits but also significant challenges that are likely to afflict those who carry out the research, usually Early Career Researchers (ECRs). Here, we describe key benefits including reputational gains, increased chances of publication and a broader increase in the reliability of research. These are balanced by challenges that we have encountered, and which involve increased costs in terms of flexibility, time and issues with the current incentive structure, all of which seem to affect ECRs acutely. Although there are major obstacles to the early adoption of open science, overall open science practices should benefit both the ECR and improve the quality and plausibility of research. We review three benefits, three challenges and provide suggestions from the perspective of ECRs for moving towards open science practices.
Importance Differences in childhood obesity by demographics and urbanization have been reported. Objective To present data on obesity and severe obesity among US youth by demographics and urbanization and to investigate trends by urbanization. Design, Setting, and Participants Measured weight and height among youth aged 2 to 19 years in the 2001-2016 National Health and Nutrition Examination Surveys, which are serial, cross-sectional, nationally representative surveys of the civilian, noninstitutionalized population. Exposures Sex, age, race and Hispanic origin, education of household head, and urbanization, as assessed by metropolitan statistical areas (MSAs; large: ≥ 1 million population). Main Outcomes and Measures Prevalence of obesity (body mass index [BMI] ≥95th percentile of US Centers for Disease Control and Prevention [CDC] growth charts) and severe obesity (BMI ≥120% of 95th percentile) by subgroups in 2013-2016 and trends by urbanization between 2001-2004 and 2013-2016. Results Complete data on weight, height, and urbanization were available for 6863 children and adolescents (mean age, 11 years; female, 49%). In 2013-2016, the prevalence among youth aged 2 to 19 years was 17.8% (95% CI, 16.1%-19.6%) for obesity and 5.8% (95% CI, 4.8%-6.9%) for severe obesity. Prevalence of obesity in large MSAs (17.1% [95% CI, 14.9%-19.5%]), medium or small MSAs (17.2% [95% CI, 14.5%-20.2%]) and non-MSAs (21.7% [95% CI, 16.1%-28.1%]) were not significantly different from each other (range of pairwise comparisons P = .09-.96). Severe obesity was significantly higher in non-MSAs (9.4% [95% CI, 5.7%-14.4%]) compared with large MSAs (5.1% [95% CI, 4.1%-6.2%]; P = .02). In adjusted analyses, obesity and severe obesity significantly increased with greater age and lower education of household head, and severe obesity increased with lower level of urbanization. 
Compared with non-Hispanic white youth, obesity and severe obesity prevalence were significantly higher among non-Hispanic black and Hispanic youth. Severe obesity, but not obesity, was significantly lower among non-Hispanic Asian youth than among non-Hispanic white youth. There were no significant linear or quadratic trends in obesity or severe obesity prevalence from 2001-2004 to 2013-2016 for any urbanization category (P range = .07-.83). Conclusions and Relevance In 2013-2016, there were differences in the prevalence of obesity and severe obesity by age, race and Hispanic origin, and household education, and severe obesity was inversely associated with urbanization. Demographics were not related to the urbanization findings.
Food sharing and (to a lesser extent) labor sharing play central roles in the evolution of cooperation literature. One popular explanation for sharing beyond the family is that it reduces the likelihood of shortages by pooling risk across households. However, the frequency and scope of sharing have never been systematically documented across nonindustrial societies, and the literature is driven by theoretical models, experimental games, and case studies among a few extensively-studied populations. Here we explore the cross-cultural context, frequency, and scope of food and labor sharing customs in relation to resource stress. Using ethnographic data from a worldwide sample of 98 societies in the Standard Cross-Cultural Sample (SCCS), we test the following hypotheses: 1) customary sharing of food and labor beyond the household are cultural universals, 2) societies subject to more resource stress (unpredictable food-destroying natural hazards) will share more frequently, and 3) the more frequent the resource stress, the broader the geographic and social scope of sharing customs. Hypotheses 1 and 2 are generally supported and are consistent with the theory that extensive beyond-household sharing is adaptive in societies that are subject to more resource stress. Hypothesis 3 was not supported and, contrary to our predictions, there is suggestive evidence that sharing beyond-relatives may be attenuated when resource stress is high. In light of these findings, we consider how resource stress may constitute an important selection pressure for maintaining extensive cooperation and help to explain the ubiquity of beyond-household sharing.
Objective: An increase in cortisol during human pregnancy helps coordinate the onset of parturition, and can have long-term effects on offspring biology. Maternal cortisol can also be transferred to offspring via breast milk during lactation. However, little is known about how diurnal cortisol profiles vary by trimester of pregnancy or during the postpartum period. Here, we describe diurnal cortisol profiles among a large cross-sectional sample of healthy Filipino young adult women varying in reproductive status and, during the postpartum period, in whether or not they are breastfeeding. Methods: Salivary cortisol, anthropometric, and questionnaire data were obtained from participants in a birth cohort in metropolitan Cebu, Philippines (N = 741; age 20.8-22.4 years). Cortisol was assessed at waking, thirty minutes after waking (cortisol awakening response, CAR), and before bed. Results: Compared with nulliparous women, morning cortisol was roughly 50% higher among women in late gestation, while evening cortisol was roughly 4-fold higher and the CAR was lower. Postpartum waking and evening cortisol were lower among currently breastfeeding women compared to nulliparous women, but were comparable in the absence of breastfeeding. The CAR was significantly lower among postpartum women compared to nulliparous women irrespective of breastfeeding status. Conclusions: These findings are consistent with known alterations in hypothalamic-pituitary-adrenal axis function during reproduction, and in particular point to marked and progressive elevation in maternal cortisol during the course of gestation. Cortisol appears to return to nulliparous levels after parturition, with levels suppressed below nulliparous levels during lactation.
Objective Differences in bottled v. tap water intake may provide insights into health disparities, like risk of dental caries and inadequate hydration. We examined differences in plain, tap and bottled water consumption among US adults by sociodemographic characteristics. Design Cross-sectional analysis. We used 24 h dietary recall data to test differences in percentage consuming the water sources and mean intake between groups using Wald tests and multiple logistic and linear regression models. Setting National Health and Nutrition Examination Survey (NHANES), 2007–2014. Subjects A nationally representative sample of 20 676 adults aged ≥20 years. Results In 2011–2014, 81·4 ( se 0·6) % of adults drank plain water (sum of tap and bottled), 55·2 ( se 1·4) % drank tap water and 33·4 ( se 1·4) % drank bottled water on a given day. Adjusting for covariates, non-Hispanic (NH) Black and Hispanic adults had 0·44 (95 % CI 0·37, 0·53) and 0·55 (95 % CI 0·45, 0·66) times the odds of consuming tap water, and consumed B =−330 ( se 45) ml and B =−180 ( se 45) ml less tap water than NH White adults, respectively. NH Black, Hispanic and adults born outside the fifty US states or Washington, DC had 2·20 (95 % CI 1·79, 2·69), 2·37 (95 % CI 1·91, 2·94) and 1·46 (95 % CI 1·19, 1·79) times the odds of consuming bottled water than their NH White and US-born counterparts. In 2007–2010, water filtration was associated with higher odds of drinking plain and tap water. Conclusions While most US adults consumed plain water, the source (i.e. tap or bottled) and amount differed by race/Hispanic origin, nativity status and education. Water filters may increase tap water consumption.