Citation: Clin Transl Sci (2019) 12, 329–333; doi:10.1111/cts.12638
COMMENTARY
Clinical Data: Sources and Types, Regulatory Constraints,
Applications
Stanley C. Ahalt1,†, Christopher G. Chute2, Karamarie Fecho1,*, Gustavo Glusman3, Jennifer Hadlock3, Casey Overby Taylor2,
Emily R. Pfaff4, Peter N. Robinson5, Harold Solbrig2, Casey Ta6, Nicholas Tatonetti6 and Chunhua Weng6
The Biomedical Data Translator Consortium
Access to clinical data is critical for the advance-
ment of translational research. However, the nu-
merous regulations and policies that surround
the use of clinical data, although critical to ensure
patient privacy and protect against misuse, often
present challenges to data access and sharing. In
this article, we provide an overview of clinical
data types and associated regulatory constraints
and inferential limitations. We highlight several
novel approaches that our team has developed
for openly exposing clinical data.
BACKGROUND
Recognizing the need to respect and protect patient pri-
vacy, numerous regulations have been established to gov-
ern the use of clinical data by researchers, including the
federal Health Insurance Portability and Accountability
Act of 1996 (HIPAA) and the European Union General Data
Protection Regulation. Institution-specific guidelines and
governing bodies such as institutional review boards (IRBs)
also address research involving patient data and other sen-
sitive data available in electronic medical records (e.g., ad-
ministrative data), in part as a result of concerns regarding
the liability of healthcare providers and institutions.1,2
The Biomedical Data Translator (Translator) program,
funded by the National Center for Advancing Translational
Sciences, aims to facilitate the transformation of basic sci-
ence discoveries into clinically actionable knowledge and
leverage clinical expertise to drive research innovations.3,4
Access to clinical data is central to the vision of the program.
Yet, the program’s dedication to open science adds com-
plexity to the regulatory, technical, and cultural challenges
that already surround access to clinical data.
We review here the types of clinical data sets that can
be derived from paper or electronic medical records, their
applications and limitations, and their associated regulatory
constraints, focusing primarily on compliance requirements
mandated in the United States under HIPAA (Table 1). We
briefly describe several clinical data types that are commonly
employed in clinical and translational research, including
fully identified clinical data, HIPAA-limited clinical
data, deidentified clinical data, and synthetic data. We high-
light several novel approaches for openly exposing clini-
cal data that we have developed as part of the Translator
program, namely, HIPAA Safe Harbor Plus (HuSH+) clinical
data, clinical profiles, Columbia Open Health Data (COHD),
and the Integrated Clinical and Environmental Exposures
Service (ICEES).
TYPES OF CLINICAL DATA SETS
Fully identified clinical data sets
Fully identified clinical data sets comprise observational
patient data, including direct patient identifiers (i.e., pro-
tected health information (PHI)), as defined in the privacy
rule issued under HIPAA. Access requires a specific re-
search hypothesis, study approval by an IRB, a full or partial
waiver of HIPAA-informed consent, and typically a secure
workspace. For investigators not affiliated with a specific
institution, additional regulations and approvals may apply,
including a data use agreement (DUA) with the provider in-
stitution. Fully identified clinical data sets may be used for
clinical interpretation and scientific inference and discovery.
However, as with all data sets but especially observational
administrative data sets, issues of data quality and integrity
must be taken into account when drawing conclusions.1
HIPAA-limited clinical data sets
HIPAA-limited clinical data sets comprise observational
patient data with limited PHI: dates such as admission,
discharge, and service dates and dates of birth and death;
city, state, and zip codes of five or more digits; and ages
in years, months, days, or hours. HIPAA-limited clinical
data sets may be used or disclosed for purposes of research,
public health, or healthcare operations without obtaining
patient authorization or a waiver of HIPAA-informed consent
but with IRB approval and (in some cases) a fully executed
DUA. HIPAA-limited clinical data sets may be used for
clinical interpretation and scientific inference and
discovery but with the
1Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA; 2Johns Hopkins University, Baltimore, Maryland, USA;
3Institute for Systems Biology, Seattle, Washington, USA; 4North Carolina Translational and Clinical Sciences Institute, University of North Carolina at Chapel Hill, Chapel
Hill, North Carolina, USA; 5The Jackson Laboratory, Farmington, Connecticut, USA; 6Columbia University, New York, New York, USA. *Correspondence: Karamarie
Fecho (kfecho@copperlineprofessionalsolutions.com)
Received: January 24, 2019; accepted: March 27, 2019. doi:10.1111/cts.12638
†Authors are listed alphabetically.
Clinical and Translational Science
Overview and Application of Clinical Data Types
Ahalt et al.
understanding that certain data elements have been re-
moved from the data and/or transformed (e.g., age vs. birth
date).
Deidentified clinical data sets
Deidentified clinical data sets comprise observational
patient data from which all PHI elements have been re-
moved. Access to deidentified clinical data sets does
not require IRB approval, although an IRB Request for
Determination of Human Subjects Research is advised.
In addition, a fully executed DUA is sometimes required.
Deidentified clinical data sets may be used for clinical
interpretation and scientific inference and discovery
but to a lesser extent than HIPAA-limited clinical data
sets because key variables or covariates
may have been removed from the data. For instance,
dates are required to make inferences regarding sea-
sonal patterns in clinical outcomes and correlations with
Table 1 Clinical data types, regulatory access restrictions, and applications

| Clinical data type | Brief description | Regulatory access restrictions | Applications |
|---|---|---|---|
| Fully identified clinical data sets | Observational patient data derived from paper-based or electronic medical records | IRB approval is required; an executed data use agreement is possibly required [a] | Clinical interpretation and scientific inference and discovery |
| HIPAA-limited clinical data sets | Observational patient data containing only a limited set of HIPAA-defined PHI | IRB approval is required; an executed data use agreement is possibly required [a] | Clinical interpretation and scientific inference and discovery, but with the understanding that certain data elements have been removed from the data and/or transformed |
| Deidentified clinical data sets | Observational patient data, but with all HIPAA-defined PHI elements removed | IRB approval is not required [b]; IRB "Request for Determination of Human Subjects Research" is typically recommended; an executed data use agreement is possibly required | Clinical interpretation and scientific inference and discovery, but with the understanding that inferences regarding time and potentially other factors cannot be made |
| HuSH+ clinical data sets | Observational patient data, fully compliant with HIPAA Safe Harbor, but unlike deidentified clinical data sets, HuSH+ clinical data sets have been altered such that (i) real patient identifiers (including geocodes) have been replaced with random patient identifiers and (ii) dates (including birth dates) have been shifted by a random number of days (maximum of ± 50 days), with all dates for a given patient shifted by the same number of days. Data are derived from UNC Health Care System | An executed data use agreement is required [c] | Clinical interpretation and scientific inference and discovery, but with the understanding that any inferences based on date/time and location (geocode) cannot be made with precision, and all other inferences must consider date/time and location as potentially hidden covariates |
| Clinical profiles | Statistical profiles of disease and associated phenotypic presentation derived from observational patient data. Data are derived from Johns Hopkins Medicine | IRB approval is required to generate clinical profiles; no other restrictions apply | Clinical interpretation and scientific inference, but with the understanding that the data represent statistical profiles |
| Synthetic clinical data sets | Realistic, but not real, observational patient data generated statistically using population distributions of observational patient data | None | Feasibility assessments and algorithm validation; generation of clinical profiles |
| COHD | Counts of observational clinical co-occurrences (e.g., co-occurrences of specific diagnoses and prescribed medications), as well as their relative frequency and observed–expected frequency ratio. Data are derived from Columbia University Irving Medical Center | None | Clinical interpretation and scientific inference, but with the understanding that the data are restricted to co-occurrences |
| ICEES | Patient-level or visit-level counts of observational patient data integrated at the patient and visit level with a variety of environmental exposures derived from multiple public data sources. Data are derived from UNC Health Care System and a variety of public data sources on environmental exposures | IRB approval is required to generate ICEES integrated feature tables; no other restrictions apply | Clinical interpretation and scientific inference, but with the understanding that the raw data have been transformed (e.g., binned or categorized) |

COHD, Columbia Open Health Data; HIPAA, Health Insurance Portability and Accountability Act; HuSH+, HIPAA Safe Harbor Plus; ICEES, Integrated Clinical
and Environmental Exposures Service; IRB, institutional review board; PHI, protected health information; UNC, University of North Carolina.
[a] Individual institutions may require a secure workspace for data access and use. [b] While HIPAA and IRB regulations do not apply, institutional approvals may
be required. [c] HuSH+ clinical data sets were conceptualized and created by UNC as part of the National Center for Advancing Translational Sciences–funded
Biomedical Data Translator program. The institution requires a fully executed data use agreement for access to the data.
natural disasters, system-related issues such as protocol
changes, and regulatory issues such as new black-box
warnings.
HuSH+ clinical data sets
HuSH+ clinical data sets were created by Translator team
members as a hybrid deidentification approach that is com-
pletely compliant with HIPAA and provides restricted access
to observational patient data from the UNC Health Care
System. HuSH+ clinical data sets differ from deidentified clin-
ical data sets in that (i) real patient identifiers (including geo-
codes) have been replaced with random patient identifiers and
(ii) dates (including birth dates) have been shifted by a random
number of days (maximum of ± 50 days), with all dates for a
given patient shifted by the same number of days. Access to
HuSH+ clinical data does not require IRB approval but does
require a fully executed DUA per institutional mandate. HuSH+
clinical data sets may be used in a limited fashion for clini-
cal interpretation and scientific inference and discovery. The
main considerations are that any inferences based on date/time
and location (geocode) cannot be made with precision
or correlated with seasonal trends or specific events,
and all other inferences must consider date/time and location
as potentially hidden covariates.
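The two HuSH+ alterations described above can be illustrated with a minimal sketch. It is not the UNC implementation: the `Event` record, its `code` field, the identifier format, and the use of a seeded pseudo-random generator are all assumptions made for illustration, and the handling of geocodes is omitted. The sketch replaces each real patient identifier with a random one and shifts every date for a given patient by the same random offset of at most 50 days in either direction.

```python
import random
from dataclasses import dataclass
from datetime import date, timedelta

MAX_SHIFT_DAYS = 50  # maximum shift stated in the text


@dataclass(frozen=True)
class Event:
    patient_id: str   # real identifier on input, random identifier on output
    event_date: date  # e.g., birth date, admission date, service date
    code: str         # hypothetical clinical code field


def hush_plus_transform(events, seed=None):
    """Replace patient identifiers and shift each patient's dates by one offset."""
    rng = random.Random(seed)  # assumption: a production system would use a secure source
    new_ids = {}   # real patient id -> random patient id
    offsets = {}   # real patient id -> per-patient shift
    used_ids = set()
    shifted = []
    for ev in events:
        if ev.patient_id not in new_ids:
            candidate = f"P{rng.getrandbits(64):016x}"
            while candidate in used_ids:  # guard against an unlikely collision
                candidate = f"P{rng.getrandbits(64):016x}"
            used_ids.add(candidate)
            new_ids[ev.patient_id] = candidate
            days = rng.randint(-MAX_SHIFT_DAYS, MAX_SHIFT_DAYS)
            offsets[ev.patient_id] = timedelta(days=days)
        shifted.append(
            Event(
                patient_id=new_ids[ev.patient_id],
                event_date=ev.event_date + offsets[ev.patient_id],
                code=ev.code,
            )
        )
    return shifted
```

Because one offset is applied to all of a patient's dates, the intervals between that patient's events are preserved, whereas calendar position (and therefore seasonality) is not.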
Clinical profiles
Clinical profiles have been developed as part of the Translator
program and represent statistical profiles of disease and associated
phenotypic presentations derived from observational
patient data from Johns Hopkins Medicine using the Health
Level Seven International Fast Healthcare Interoperability
Resources common data model. At present, clinical profiles
include data on demographics, diagnoses, disease comorbidities,
symptoms, medications, procedures, and laboratory
measures. IRB approval is required to generate clinical
profiles, but once generated, clinical profiles can be openly
shared, although institutional restrictions may apply. Clinical
profiles can be used for clinical interpretation and scientific
inference and discovery but with the understanding that they
represent statistical summaries of patient populations and
only indirectly represent patient-level observations. Multiple
computational tools and example output files are openly avail-
able for creating and using clinical profiles (see Supplemental
Information on Clinical Profiles in Further Reading).
Synthetic clinical data sets
Synthetic clinical data sets comprise realistic (but not real)
data generated statistically by applying simulation techniques
to population distributions of observational patient data.
Synthetic clinical data sets can be openly shared. A publicly
available example, the SyntheticMass data set, was generated
using the Synthea method5 to simulate patient-level and
population-level data on patients who reside in the state of
Massachusetts. A similar open effort is Simulacrum, which
is based on observational patient data held by Public Health
England’s National Cancer Registration and Analysis Service.
The data include realistic patient histories with clinically rel-
evant patient encounters; as such, the data can be used for
feasibility assessments and algorithm validation but not for
clinical interpretation or scientific inference and discovery.
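The general idea of generating records from population distributions can be sketched as follows. This is a deliberately simplified illustration, not the Synthea or Simulacrum method, both of which model far richer patient histories. All distributions, field names, and identifier formats below are hypothetical values chosen only to make the example run.

```python
import random

# Hypothetical population distributions (made-up values for illustration only).
SEX_DIST = {"female": 0.51, "male": 0.49}
AGE_BAND_DIST = {"0-17": 0.20, "18-44": 0.36, "45-64": 0.27, "65+": 0.17}
ASTHMA_PREVALENCE_BY_AGE = {"0-17": 0.09, "18-44": 0.08, "45-64": 0.08, "65+": 0.07}


def sample_category(rng, dist):
    """Draw one category with probability proportional to its weight."""
    categories = list(dist)
    weights = [dist[c] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]


def generate_synthetic_patients(n, seed=None):
    """Generate n realistic-looking (but not real) patient records."""
    rng = random.Random(seed)
    patients = []
    for i in range(n):
        age_band = sample_category(rng, AGE_BAND_DIST)
        patients.append(
            {
                "patient_id": f"SYN{i:06d}",
                "sex": sample_category(rng, SEX_DIST),
                "age_band": age_band,
                # Diagnosis is drawn conditionally on the sampled age band.
                "asthma": rng.random() < ASTHMA_PREVALENCE_BY_AGE[age_band],
            }
        )
    return patients
```

Records produced this way reflect only the distributions supplied to the generator, which is why such data suit feasibility assessments and algorithm validation rather than inference.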
COHD
Translator team members have pioneered the use of clinical
co-occurrence tables as part of the COHD initiative.6
COHD provides open access to observational patient data
from Columbia University Irving Medical Center in the form
of co-occurrence counts of pairs of concepts or clinical
feature variables (e.g., medications and diagnoses), as well
as their relative frequency and observed–expected fre-
quency ratio. The data are publicly accessible via an open
web interface or Application Programming Interface. Risks
to patient privacy are mitigated by excluding rare features
(counts ≤ 10) and perturbing the counts according to the
Poisson distribution. The data can be used to derive in-
sights into questions of clinical relevance and importance
for translational research. For instance, an individual user
may wish to know the frequency of asthma among African
American patients (Figure 1a). A search of the COHD ser-
vice reveals that there are 11,716 African American pa-
tients with a diagnosis of asthma among 208,438 African
American patients (5.62%). For comparison, a second
search reveals that there are 29,913 white patients with a
diagnosis of asthma among 601,167 white patients (4.98%).
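The privacy measures and summary statistics described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the COHD implementation: the function names are hypothetical, the suppression threshold is applied to the true count before perturbation, the perturbed count is drawn from a Poisson distribution whose mean is the true count, and the observed–expected ratio is taken as the observed pair count divided by the count expected if the two concepts were independent. The Poisson sampler is a standard multiplication method applied in chunks so that it stays within floating-point range for large counts.

```python
import math
import random

MIN_REPORTABLE_COUNT = 10  # counts <= 10 are excluded, per the text


def poisson_sample(lam, rng, chunk=500.0):
    """Draw from Poisson(lam) by summing independent draws over chunks of the mean."""
    total = 0
    remaining = float(lam)
    while remaining > 0:
        part = min(chunk, remaining)
        threshold = math.exp(-part)
        k, p = 0, rng.random()
        while p > threshold:  # multiply uniforms until the product drops below e^-part
            k += 1
            p *= rng.random()
        total += k
        remaining -= part
    return total


def release_count(true_count, rng):
    """Return a perturbed count, or None if the feature is too rare to report."""
    if true_count <= MIN_REPORTABLE_COUNT:
        return None
    return poisson_sample(true_count, rng)


def prevalence(count, population):
    """Relative frequency of a concept within a population."""
    return count / population


def observed_expected_ratio(pair_count, count_a, count_b, population):
    """Observed co-occurrence count over the count expected under independence."""
    expected = count_a * count_b / population
    return pair_count / expected


# Figures quoted in the text for the two example queries.
print(f"{prevalence(11716, 208438):.2%}")  # 5.62%
print(f"{prevalence(29913, 601167):.2%}")  # 4.98%
```

The two printed percentages reproduce the asthma frequencies reported for the example queries.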
ICEES
ICEES was designed by Translator team members as a
novel extension of COHD.7 Specifically, ICEES permits
open access to observational patient data from the UNC
Health Care System that have been integrated at the pa-
tient and visit level with environmental exposures data (e.g.,
airborne and roadway pollutants, socioeconomic factors)
derived from multiple public sources. A complex data ex-
traction and integration software pipeline has been devel-
oped to create ICEES integrated feature tables.8 The tables
are generated using PHI (geocodes and dates), but the data
are then binned or recoded and stripped of PHI. Thus, the
ICEES pipeline must be executed under an approved IRB
protocol, but subsequent steps are not subject to IRB reg-
ulation, and ICEES is publicly accessible via an Application
Programming Interface. ICEES provides a number of func-
tionalities for clinical interpretation and scientific inference
and discovery. For example, Figure 1b demonstrates that
for COHORT:60 (African Americans with asthma-like conditions
in calendar year 2010), the percentage of patients with
two or more annual emergency department or inpatient vis-
its for respiratory issues is higher among patients with high
average daily exposure to particulate matter ≤ 2.5 μm in
diameter than among patients with low average daily expo-
sure to particulate matter ≤ 2.5 μm in diameter (21.10% vs.
8.90%, P < 0.0001, N = 6,379), thus replicating published
literature on the association between airborne pollutant ex-
posures and asthma exacerbations.9 The data additionally
suggest that African Americans with asthma-like conditions
have relatively high exposure to particulate matter, with
~ 95% of the cohort exposed to ≥ 9.63 μg/m3 average daily
particulate matter ≤ 2.5 μm in diameter.
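The percentages quoted above follow directly from the 2 × 2 table of counts in Figure 1b. The sketch below recomputes the row percentages and Pearson's chi-squared statistic (without continuity correction) from those counts. It does not call the ICEES service; the dictionary layout and function names are assumptions made for illustration, and the p-value is not recomputed.

```python
# Counts from Figure 1b: rows are PM2.5 exposure bins, columns are visit bins.
COUNTS = {
    "AvgDailyPM2.5Exposure < 3": {
        "TotalEDInpatientVisits < 2": 297,
        "TotalEDInpatientVisits >= 2": 29,
    },
    "AvgDailyPM2.5Exposure >= 3": {
        "TotalEDInpatientVisits < 2": 4776,
        "TotalEDInpatientVisits >= 2": 1277,
    },
}


def row_percentages(table):
    """Express each cell as a percentage of its row total."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        result[row] = {col: 100.0 * n / row_total for col, n in cells.items()}
    return result


def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table of counts."""
    rows = list(table)
    cols = list(next(iter(table.values())))
    row_totals = {r: sum(table[r].values()) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    grand_total = sum(row_totals.values())
    statistic = 0.0
    for r in rows:
        for c in cols:
            expected = row_totals[r] * col_totals[c] / grand_total
            statistic += (table[r][c] - expected) ** 2 / expected
    return statistic


percentages = row_percentages(COUNTS)
high = percentages["AvgDailyPM2.5Exposure >= 3"]["TotalEDInpatientVisits >= 2"]
low = percentages["AvgDailyPM2.5Exposure < 3"]["TotalEDInpatientVisits >= 2"]
print(f"{high:.2f}% vs. {low:.2f}%")       # 21.10% vs. 8.90%
print(f"chi_squared = {chi_squared(COUNTS):.4f}")  # chi_squared = 28.2841
```

The recomputed statistic matches the chi_squared value shown in the figure output.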
Clinical fingerprints
Although not a new clinical data type per se, Translator
teams have been working to develop privacy-preserving
analytic approaches to visualize and compare patient data,
Figure 1 Example queries, including input parameters and output, for Columbia Open Health Data (COHD) (a) and the Integrated
Clinical and Environmental Exposures Service (ICEES) (b). AvgDailyPM2.5Exposure = average daily patient exposure to PM2.5 (μg/m3)
over a 1-year study period; TotalEDInpatientVisits = total number of emergency department or inpatient visits for respiratory issues
during a 1-year study period. The study period shown here is for calendar year 2010. AvgDailyPM2.5Exposure <3 range: 1.58, 9.63 μg/m3;
AvgDailyPM2.5Exposure ≥3 range: 9.63, 17.33 μg/m3. ID, identifier; PM2.5, airborne particulate matter ≤2.5 μm in diameter.

(a) COHD example queries
Input: Asthma (ID #317009) and Black or African American (ID #8516)
Output: [image not preserved in the extracted text]
Input: Asthma (ID #317009) and White (ID #8527)
Output: [image not preserved in the extracted text]

(b) ICEES example query
Input:
Feature variables: AvgDailyPM2.5Exposures < 3, TotalEDInpatientVisits < 2
Version of data: 1.0.0
Table: patient
Year: 2010
Cohort ID: COHORT:60
Output:
+----------------------------+------------------------------+-------------------------------+---------+
| feature                    | TotalEDInpatientVisits < 2   | TotalEDInpatientVisits >= 2   |         |
+============================+==============================+===============================+=========+
| AvgDailyPM2.5Exposure < 3  | 297 91.10%                   | 29 8.90%                      | 326     |
|                            | 5.85% 4.66%                  | 2.22% 0.45%                   | 5.11%   |
+----------------------------+------------------------------+-------------------------------+---------+
| AvgDailyPM2.5Exposure >= 3 | 4776 78.90%                  | 1277 21.10%                   | 6053    |
|                            | 94.15% 74.87%                | 97.78% 20.02%                 | 94.89%  |
+----------------------------+------------------------------+-------------------------------+---------+
|                            | 5073                         | 1306                          | 6379    |
|                            | 79.53%                       | 20.47%                        | 100.00% |
+----------------------------+------------------------------+-------------------------------+---------+
+-------------+---------------+
| p_value     | chi_squared   |
+=============+===============+
| 3.16593e-06 | 28.2841       |
+-------------+---------------+
including genomic data and clinical records in semistruc-
tured JavaScript Object Notation or eXtensible Markup
Language formats. Genomic data typically consist of lists
of variants relative to a reference allele sorted by position.
Genome fingerprints capture the unique patterns generated
by pairs of consecutive single-nucleotide variants
as patient-level matrices or fingerprints.10 The correlation
between two fingerprints reflects the degree of related-
ness between two genomes. Clinical fingerprints simi-
larly transform clinical records from the Fast Healthcare
Interoperability Resources format into numerical vectors
that greatly simplify their comparison. Translator team
members are working to adapt this methodology for ap-
plication to the ICEES integration pipeline and incorpora-
tion into the ICEES integrated feature tables.
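The idea of reducing a semistructured record to a fixed-length numerical vector that can be compared by correlation can be illustrated with generic feature hashing. This sketch is a stand-in for illustration only and is not the published genome or clinical fingerprint algorithm; the vector length, the path-based tokenization, and the function names are all assumptions.

```python
import hashlib
import math

VECTOR_LENGTH = 64  # hypothetical fingerprint length


def flatten(record, prefix=""):
    """Yield 'path=value' tokens for every leaf of a nested JSON-like structure."""
    if isinstance(record, dict):
        for key in sorted(record):
            yield from flatten(record[key], f"{prefix}/{key}")
    elif isinstance(record, list):
        for item in record:
            yield from flatten(item, f"{prefix}[]")
    else:
        yield f"{prefix}={record}"


def fingerprint(record, length=VECTOR_LENGTH):
    """Hash each token into one of `length` bins and count the hits per bin."""
    vector = [0.0] * length
    for token in flatten(record):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % length
        vector[index] += 1.0
    return vector


def correlation(a, b):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)
```

Two records with many identical path and value pairs share most of their bins and therefore correlate more strongly than unrelated records, while the vectors themselves do not expose the original field values directly.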
CONCLUSION
In this article, we described various types of clinical data
sets and associated inferential limitations and regulatory
constraints, focusing primarily on compliance requirements
mandated in the United States under HIPAA. We highlighted
several novel approaches that we have developed as part
of the Translator program to openly expose observational
patient data, while respecting and protecting patient pri-
vacy. We recognize that each of these approaches retains
a residual risk of patient reidentification; thus, we continue
to work with experts in regulatory protections and com-
puter security to ensure that those risks remain minimal.
Although the Translator approaches are designed to be
disease- agnostic and generalizable, they were developed
to comply with HIPAA and institutional guidelines; as such,
our approaches may need to be modified prior to adoption
elsewhere. Nonetheless, through these open services, we
hope to accelerate clinical and translational science and
foster biomedical discovery.
Supporting Information. Supplementary information accompanies
this paper on the Clinical and Translational Science website
(www.cts-journal.com). The Further Reading includes supplementary
information on Clinical Profiles, Synthetic Clinical Datasets, COHD, and
ICEES, as well as relevant regulatory information and information on
related large-scale patient de-identification and data-sharing efforts.
Clinical Data: Sources and Types, Regulatory Constraints, Applications.
Acknowledgments. The authors acknowledge and appreciate the
contributions provided by the following individuals: Chris Bizon, Steve
Cox, Ashok Krishnamurthy, Lisa Stillwell, and Hao Xu of the University
of North Carolina Renaissance Computing Institute; James Champion
of the North Carolina Translational and Clinical Sciences Institute;
David B. Peden of the University of North Carolina School of Medicine;
Sarav Arunachalam of the University of North Carolina Institute for the
Environment; Max Robinson of the Institute for Systems Biology; and
Stefano Rensi of Stanford University.
Funding. Support for this project was provided by the National
Center for Advancing Translational Sciences, National Institutes of Health
through the Biomedical Data Translator program (awards 1OT3TR002019,
1OT3TR002020, 1OT3TR002025, 1OT3TR002026, 1OT3TR002027,
1OT2TR002514, 1OT2TR002515, 1OT2TR002517, 1OT2TR002520,
1OT2TR002584) and the Clinical and Translational Sciences Award pro-
gram (award UL1TR002489).
Conflict of Interest. All authors declared no competing interests
for this work.
1. Harman, L.B., Flite, C.A. & Bond, K. Electronic health records: privacy, confidentiality,
and security. Virtual Mentor 14, 712–719 (2012).
2. Na, L., Yang, C., Lo, C.C., Zhao, F., Fukuoka, Y. & Aswani, A. Feasibility of reidentifying
individuals in large national physical activity data sets from which protected
health information has been removed with use of machine learning. JAMA Network
Open 1, e186040 (2018).
3. The Biomedical Data Translator Consortium. The Biomedical Data Translator program:
conception, culture, and community. Clin. Transl. Sci. 12, 92–94 (2019).
https://doi.org/10.1111/cts.12592.
4. The Biomedical Data Translator Consortium. Toward a universal biomedical data
translator. Clin. Transl. Sci. 12, 86–90 (2019). https://doi.org/10.1111/cts.12591.
5. Walonoski, J. et al. Synthea: an approach, method, and software mechanism for
generating synthetic patients and the synthetic electronic health care record. J.
Am. Med. Inform. Assoc. 25, 230–238 (2018).
6. Ta, C., Dumontier, M., Hripcsak, G., Tatonetti, N. & Weng, C. Columbia Open Health
Data, clinical concept prevalence and co-occurrence from electronic health records.
Sci. Data 5, 180273 (2018).
7. Fecho, K. et al. A novel approach for exposing and sharing clinical data: the
Translator Integrated Clinical and Environmental Exposures Service. J. Am. Med.
Inform. Assoc. (in press). https://doi.org/10.1093/jamia/ocz042
8. Pfaff, E.R. et al. All roads lead to FHIR: an extensible clinical data conversion
pipeline. American Medical Informatics Association 2019 Informatics Summit,
San Francisco, CA, March 25–28, 2019. Abstract.
9. Mirabelli, M.C., Vaidyanathan, A., Flanders, W.D., Qin, X. & Garbe, P. Outdoor PM2.5,
ambient air temperature, and asthma symptoms in the past 14 days among adults
with active asthma. Environ. Health Perspect. 124, 1882–1890 (2016).
10. Glusman, G., Mauldin, D.E., Hood, L.E. & Robinson, M. Ultrafast comparison of personal
genomes via precomputed genome fingerprints. Front. Genet. 8, 136 (2017).
© 2019 The Authors. Clinical and Translational Science
published by Wiley Periodicals, Inc. on behalf of the
American Society for Clinical Pharmacology and
Therapeutics. This is an open access article under
the terms of the Creative Commons Attribution-
NonCommercial License, which permits use, distribution
and reproduction in any medium, provided the original
work is properly cited and is not used for commercial
purposes.