Citation: Clin Transl Sci (2019) 12, 329–333; doi:10.1111/cts.12638
COMMENTARY
Clinical Data: Sources and Types, Regulatory Constraints, Applications
Stanley C. Ahalt1,†, Christopher G. Chute2, Karamarie Fecho1,*, Gustavo Glusman3, Jennifer Hadlock3, Casey Overby Taylor2, Emily R. Pfaff4, Peter N. Robinson5, Harold Solbrig2, Casey Ta6, Nicholas Tatonetti6 and Chunhua Weng6
The Biomedical Data Translator Consortium
Access to clinical data is critical for the advancement of translational research. However, the numerous regulations and policies that surround the use of clinical data, although critical to ensure patient privacy and protect against misuse, often present challenges to data access and sharing. In this article, we provide an overview of clinical data types and associated regulatory constraints and inferential limitations. We highlight several novel approaches that our team has developed for openly exposing clinical data.
BACKGROUND
Recognizing the need to respect and protect patient privacy, numerous regulations have been established to govern the use of clinical data by researchers, including the federal Health Insurance Portability and Accountability Act of 1996 (HIPAA) and the European Union General Data Protection Regulation. Institution-specific guidelines and governing bodies such as institutional review boards (IRBs) also address research involving patient data and other sensitive data available in electronic medical records (e.g., administrative data), in part as a result of concerns regarding the liability of healthcare providers and institutions.1,2
The Biomedical Data Translator (Translator) program, funded by the National Center for Advancing Translational Sciences, aims to facilitate the transformation of basic science discoveries into clinically actionable knowledge and leverage clinical expertise to drive research innovations.3,4 Access to clinical data is central to the vision of the program. Yet, the program’s dedication to open science adds complexity to the regulatory, technical, and cultural challenges that already surround access to clinical data.
We review here the types of clinical data sets that can be derived from paper or electronic medical records, their applications and limitations, and their associated regulatory constraints, focusing primarily on compliance requirements mandated in the United States under HIPAA (Table 1). We briefly describe several clinical data types that are commonly employed in clinical and translational research, including fully identified clinical data, HIPAA-limited clinical data, deidentified clinical data, and synthetic data. We highlight several novel approaches for openly exposing clinical data that we have developed as part of the Translator program, namely, HIPAA Safe Harbor Plus (HuSH+) clinical data, clinical profiles, Columbia Open Health Data (COHD), and the Integrated Clinical and Environmental Exposures Service (ICEES).
TYPES OF CLINICAL DATA SETS
Fully identified clinical data sets
Fully identified clinical data sets comprise observational patient data, including direct patient identifiers (i.e., protected health information (PHI)), as defined in the privacy rule issued under HIPAA. Access requires a specific research hypothesis, study approval by an IRB, a full or partial waiver of HIPAA-informed consent, and typically a secure workspace. For investigators not affiliated with a specific institution, additional regulations and approvals may apply, including a data use agreement (DUA) with the provider institution. Fully identified clinical data sets may be used for clinical interpretation and scientific inference and discovery. However, as with all data sets but especially observational administrative data sets, issues of data quality and integrity must be taken into account when drawing conclusions.1
HIPAA-limited clinical data sets
HIPAA-limited clinical data sets comprise observational patient data with limited PHI: dates such as admission, discharge, service, and dates of birth and death; city, state, and zip codes of five or more digits; and ages in years, months, days, or hours. HIPAA-limited clinical data sets may be used or disclosed for purposes of research, public health, or healthcare operations without obtaining patient authorization or a waiver of HIPAA-informed consent but with IRB approval and (in some cases) a fully executed DUA. HIPAA-limited clinical data sets may be used for clinical interpretation and scientific inference and discovery but with the understanding that certain data elements have been removed from the data and/or transformed (e.g., age vs. birth date).
1Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA; 2Johns Hopkins University, Baltimore, Maryland, USA; 3Institute for Systems Biology, Seattle, Washington, USA; 4North Carolina Translational and Clinical Sciences Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA; 5The Jackson Laboratory, Farmington, Connecticut, USA; 6Columbia University, New York, New York, USA. *Correspondence: Karamarie Fecho (kfecho@copperlineprofessionalsolutions.com)
Received: January 24, 2019; accepted: March 27, 2019. doi:10.1111/cts.12638
Authors are listed alphabetically.
Clinical and Translational Science
Overview and Application of Clinical Data Types
Ahalt et al.
Deidentified clinical data sets
Deidentified clinical data sets comprise observational patient data from which all PHI elements have been removed. Access to deidentified clinical data sets does not require IRB approval, although an IRB Request for Determination of Human Subjects Research is advised. In addition, a fully executed DUA is sometimes required. Deidentified clinical data sets may be used for clinical interpretation and scientific inference and discovery but to a lesser extent than HIPAA-limited clinical data sets because key variables or covariates may have been removed from the data. For instance, dates are required to make inferences regarding seasonal patterns in clinical outcomes and correlations with natural disasters, system-related issues such as protocol changes, and regulatory issues such as new black-box warnings.
Table 1 Clinical data types, regulatory access restrictions, and applications
Clinical data type | Brief description | Regulatory access restrictions | Applications
Fully identified clinical data sets | Observational patient data derived from paper-based or electronic medical records | IRB approval is required; an executed data use agreement is possibly required (a) | Clinical interpretation and scientific inference and discovery
HIPAA-limited clinical data sets | Observational patient data containing only a limited set of HIPAA-defined PHI | IRB approval is required; an executed data use agreement is possibly required (a) | Clinical interpretation and scientific inference and discovery, but with the understanding that certain data elements have been removed from the data and/or transformed
Deidentified clinical data sets | Observational patient data, but with all HIPAA-defined PHI elements removed | IRB approval is not required (b); IRB “Request for Determination of Human Subjects Research” is typically recommended; an executed data use agreement is possibly required | Clinical interpretation and scientific inference and discovery, but with the understanding that inferences regarding time and potentially other factors cannot be made
HuSH+ clinical data sets | Observational patient data, fully compliant with HIPAA Safe Harbor, but unlike deidentified clinical data sets, HuSH+ clinical data sets have been altered such that (i) real patient identifiers (including geocodes) have been replaced with random patient identifiers and (ii) dates (including birth dates) have been shifted by a random number of days (maximum of ± 50 days), with all dates for a given patient shifted by the same number of days. Data are derived from UNC Health Care System | An executed data use agreement is required (c) | Clinical interpretation and scientific inference and discovery, but with the understanding that any inferences based on date/time and location (geocode) cannot be made with precision, and all other inferences must consider date/time and location as potentially hidden covariates
Clinical profiles | Statistical profiles of disease and associated phenotypic presentation derived from observational patient data. Data are derived from Johns Hopkins Medicine | IRB approval is required to generate clinical profiles; no other restrictions apply | Clinical interpretation and scientific inference, but with the understanding that the data represent statistical profiles
Synthetic clinical data sets | Realistic, but not real, observational patient data generated statistically using population distributions of observational patient data | None | Feasibility assessments and algorithm validation; generation of clinical profiles
COHD | Counts of observational clinical co-occurrences (e.g., co-occurrences of specific diagnoses and prescribed medications), as well as their relative frequency and observed–expected frequency ratio. Data are derived from Columbia University Irving Medical Center | None | Clinical interpretation and scientific inference, but with the understanding that the data are restricted to co-occurrences
ICEES | Patient-level or visit-level counts of observational patient data integrated at the patient and visit level with a variety of environmental exposures derived from multiple public data sources. Data are derived from UNC Health Care System and a variety of public data sources on environmental exposures | IRB approval is required to generate ICEES integrated feature tables; no other restrictions apply | Clinical interpretation and scientific inference, but with the understanding that the raw data have been transformed (e.g., binned or categorized)
COHD, Columbia Open Health Data; HIPAA, Health Insurance Portability and Accountability Act; HuSH+, HIPAA Safe Harbor Plus; ICEES, Integrated Clinical and Environmental Exposures Service; IRB, institutional review board; PHI, protected health information; UNC, University of North Carolina.
(a) Individual institutions may require a secure workspace for data access and use. (b) While HIPAA and IRB regulations do not apply, institutional approvals may be required. (c) HuSH+ clinical data sets were conceptualized and created by UNC as part of the National Center for Advancing Translational Sciences–funded Biomedical Data Translator program. The institution requires a fully executed data use agreement for access to the data.
HuSH+ clinical data sets
HuSH+ clinical data sets were created by Translator team members as a hybrid deidentification approach that is completely compliant with HIPAA and provides restricted access to observational patient data from the UNC Health Care System. HuSH+ clinical data sets differ from deidentified clinical data sets in that (i) real patient identifiers (including geocodes) have been replaced with random patient identifiers and (ii) dates (including birth dates) have been shifted by a random number of days (maximum of ± 50 days), with all dates for a given patient shifted by the same number of days. Access to HuSH+ clinical data does not require IRB approval but does require a fully executed DUA per institutional mandate. HuSH+ clinical data sets may be used in a limited fashion for clinical interpretation and scientific inference and discovery. The main considerations are that any inferences based on date/time and location (geocode) cannot be made with precise accuracy or correlated with seasonal trends or specific events, and all other inferences must consider date/time and location as potentially hidden covariates.
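
The two HuSH+ alterations described above can be illustrated with a minimal sketch. It is not the UNC implementation: the record layout (`patient_id`, `dates`), the function name, and the 64-bit random identifier format are assumptions made for illustration, and the text does not say whether a zero-day shift is permitted (it is allowed here).

```python
import random
from datetime import date, timedelta

MAX_SHIFT_DAYS = 50  # from the text: dates are shifted by at most +/- 50 days


def hush_plus_transform(records, rng=None):
    """Swap real patient IDs for random IDs and shift dates per patient.

    `records` is a list of dicts with hypothetical keys "patient_id" and
    "dates" (a dict mapping a field name to a datetime.date).
    """
    rng = rng or random.Random()
    new_ids = {}  # real patient ID -> random replacement ID
    offsets = {}  # real patient ID -> that patient's single shift, in days
    shifted = []
    for rec in records:
        pid = rec["patient_id"]
        if pid not in new_ids:
            new_ids[pid] = f"{rng.getrandbits(64):016x}"
            offsets[pid] = rng.randint(-MAX_SHIFT_DAYS, MAX_SHIFT_DAYS)
        shift = timedelta(days=offsets[pid])
        shifted.append({
            "patient_id": new_ids[pid],
            # Every date for a given patient moves by the same offset.
            "dates": {name: d + shift for name, d in rec["dates"].items()},
        })
    return shifted
```

Because every date for a patient moves by the same offset, intervals within a patient (e.g., age at a visit) are preserved, while calendar position is blurred by up to 50 days in either direction, which is why seasonal inferences lose precision.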
Clinical profiles
Clinical profiles have been developed as part of the Translator program and represent statistical profiles of disease and associated phenotypic presentations derived from observational patient data from Johns Hopkins Medicine using the Health Level Seven International Fast Healthcare Interoperability Resources common data model. At present, clinical profiles include data on demographics, diagnoses, disease comorbidities, symptoms, medications, procedures, and laboratory measures. IRB approval is required to generate clinical profiles but once generated, clinical profiles can be openly shared. Institutional restrictions may apply, however. Clinical profiles can be used for clinical interpretation and scientific inference and discovery but with the understanding that they represent statistical summaries of patient populations and only indirectly represent patient-level observations. Multiple computational tools and example output files are openly available for creating and using clinical profiles (see Supplemental Information on Clinical Profiles in Further Reading).
Synthetic clinical data sets
Synthetic clinical data sets comprise realistic (but not real) data generated statistically by applying simulation techniques to population distributions of observational patient data. Synthetic clinical data sets can be openly shared. A publicly available example, the Synthetic Mass data set, was generated using the Synthea method5 to simulate patient-level and population-level data on patients who reside in the state of Massachusetts. A similar open effort is Simulacrum, which is based on observational patient data held by Public Health England’s National Cancer Registration and Analysis Service. The data include realistic patient histories with clinically relevant patient encounters; as such, the data can be used for feasibility assessments and algorithm validation but not for clinical interpretation or scientific inference and discovery.
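
As a rough illustration of generating records from population distributions, the sketch below samples a few patient attributes from fixed probabilities. It is far simpler than Synthea or Simulacrum and does not reproduce either method; the distributions, field names, and values are hypothetical placeholders, not real statistics.

```python
import random

# Hypothetical population distributions (illustrative values only).
SEX_DIST = {"female": 0.52, "male": 0.48}
ASTHMA_PREVALENCE_BY_SEX = {"female": 0.10, "male": 0.07}
AGE_RANGE = (0, 90)


def synthesize_patients(n, seed=None):
    """Draw n synthetic patient records from the distributions above."""
    rng = random.Random(seed)
    patients = []
    for i in range(n):
        sex = rng.choices(list(SEX_DIST), weights=list(SEX_DIST.values()))[0]
        patients.append({
            "id": f"synthetic-{i}",
            "sex": sex,
            "age": rng.randint(*AGE_RANGE),
            # The diagnosis flag is conditioned on sex, mimicking a population rate.
            "asthma": rng.random() < ASTHMA_PREVALENCE_BY_SEX[sex],
        })
    return patients
```

Records produced this way match the input distributions only in aggregate and correspond to no real patient, which is consistent with restricting their use to feasibility assessments and algorithm validation.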
COHD
Translator team members have pioneered the use of clinical co-occurrence tables as part of the COHD initiative.6 COHD provides open access to observational patient data from Columbia University Irving Medical Center in the form of co-occurrence counts of pairs of concepts or clinical feature variables (e.g., medications and diagnoses), as well as their relative frequency and observed–expected frequency ratio. The data are publicly accessible via an open web interface or Application Programming Interface. Risks to patient privacy are mitigated by excluding rare features (counts ≤ 10) and perturbing the counts according to the Poisson distribution. The data can be used to derive insights into questions of clinical relevance and importance for translational research. For instance, an individual user may wish to know the frequency of asthma among African American patients (Figure 1a). A search of the COHD service reveals that there are 11,716 African American patients with a diagnosis of asthma among 208,438 African American patients (5.62%). For comparison, a second search reveals that there are 29,913 white patients with a diagnosis of asthma among 601,167 white patients (4.98%).
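
The privacy steps and the frequency calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the COHD implementation: the function names are hypothetical, the suppression threshold of 10 is taken from the text, the Poisson sampler uses a normal approximation for large counts, and the expected count in the ratio assumes independence of the two concepts.

```python
import math
import random

MIN_COUNT = 10  # counts at or below this value are treated as rare and withheld


def poisson_sample(lam, rng):
    """Draw from Poisson(lam): Knuth's method, or a normal approximation if lam > 50."""
    if lam > 50:
        return max(0, round(rng.gauss(lam, math.sqrt(lam))))
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def release_count(true_count, rng):
    """Return None for rare features, otherwise a Poisson-perturbed count."""
    if true_count <= MIN_COUNT:
        return None
    return poisson_sample(true_count, rng)


def relative_frequency(count, population):
    return count / population


def observed_expected_ratio(pair_count, count_a, count_b, population):
    """Observed co-occurrences over the number expected if A and B were independent."""
    expected = count_a * count_b / population
    return pair_count / expected


# Counts quoted in the text for asthma by race.
asthma_black_pct = 100 * relative_frequency(11716, 208438)
asthma_white_pct = 100 * relative_frequency(29913, 601167)
```

Rounded to two decimals, the two percentages agree with the 5.62% and 4.98% reported in the text.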
ICEES
ICEES was designed by Translator team members as a novel extension of COHD.7 Specifically, ICEES permits open access to observational patient data from the UNC Health Care System that have been integrated at the patient and visit level with environmental exposures data (e.g., airborne and roadway pollutants, socioeconomic factors) derived from multiple public sources. A complex data extraction and integration software pipeline has been developed to create ICEES integrated feature tables.8 The tables are generated using PHI (geocodes and dates), but the data are then binned or recoded and stripped of PHI. Thus, the ICEES pipeline must be executed under an approved IRB protocol, but subsequent steps are not subject to IRB regulation, and ICEES is publicly accessible via an Application Programming Interface. ICEES provides a number of functionalities for clinical interpretation and scientific inference and discovery. For example, Figure 1b demonstrates that for COHORT:60 (African Americans with asthma-like conditions in calendar year 2010), the percentage of patients with two or more annual emergency department or inpatient visits for respiratory issues is higher among patients with high average daily exposure to particulate matter ≤ 2.5 μm in diameter than among patients with low average daily exposure to particulate matter ≤ 2.5 μm in diameter (21.10% vs. 8.90%, P < 0.0001, N = 6,379), thus replicating published literature on the association between airborne pollutant exposures and asthma exacerbations.9 The data additionally suggest that African Americans with asthma-like conditions have relatively high exposure to particulate matter, with ~ 95% of the cohort exposed to ≥ 9.63 μg/m3 average daily particulate matter ≤ 2.5 μm in diameter.
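
The bivariate comparison can be checked by hand from the four cell counts shown in Figure 1b. The sketch below computes a Pearson chi-square statistic without continuity correction for that 2x2 table; the function name is hypothetical, and the p-value assumes a test with 1 degree of freedom, which is an assumption about the test and need not match how the service computes the p_value it reports.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rows: low / high PM2.5 exposure bin; columns: fewer than 2 / 2 or more visits.
chi2, p = chi_square_2x2(297, 29, 4776, 1277)
low_exposure_pct = 100 * 29 / (297 + 29)
high_exposure_pct = 100 * 1277 / (4776 + 1277)
```

Under these assumptions the statistic agrees with the chi_squared value shown in Figure 1b to two decimal places, and the two row percentages round to the 8.90% and 21.10% quoted in the text.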
Figure 1 Example queries, including input parameters and output, for Columbia Open Health Data (COHD) (a) and the Integrated Clinical and Environmental Exposures Service (ICEES) (b). AvgDailyPM2.5Exposure = average daily patient exposure to PM2.5 (μg/m3) over a 1-year study period; TotalEDInpatientVisits = total number of emergency department or inpatient visits for respiratory issues during a 1-year study period. The study period shown here is for calendar year 2010. AvgDailyPM2.5Exposure <3 range: 1.58, 9.63 μg/m3; AvgDailyPM2.5Exposure ≥3 range: 9.63, 17.33 μg/m3. ID, identifier; PM2.5, airborne particulate matter ≤2.5 μm in diameter.
(a) COHD example queries
Input: Asthma (ID #317009) and Black or African American (ID #8516)
Output: [output panel not recoverable from the extracted text]
Input: Asthma (ID #317009) and White (ID #8527)
Output: [output panel not recoverable from the extracted text]
(b) ICEES example query
Input:
Feature variables: AvgDailyPM2.5Exposures < 3, TotalEDInpatientVisits < 2
Version of data: 1.0.0
Table: patient
Year: 2010
Cohort ID: COHORT:60
Output:
+----------------------------+------------------------------+-------------------------------+---------+
| feature                    | TotalEDInpatientVisits < 2   | TotalEDInpatientVisits >= 2   |         |
+============================+==============================+===============================+=========+
| AvgDailyPM2.5Exposure < 3  | 297 91.10%                   | 29 8.90%                      | 326     |
|                            | 5.85% 4.66%                  | 2.22% 0.45%                   | 5.11%   |
+----------------------------+------------------------------+-------------------------------+---------+
| AvgDailyPM2.5Exposure >= 3 | 4776 78.90%                  | 1277 21.10%                   | 6053    |
|                            | 94.15% 74.87%                | 97.78% 20.02%                 | 94.89%  |
+----------------------------+------------------------------+-------------------------------+---------+
|                            | 5073                         | 1306                          | 6379    |
|                            | 79.53%                       | 20.47%                        | 100.00% |
+----------------------------+------------------------------+-------------------------------+---------+
+-------------+---------------+
| p_value     | chi_squared   |
+=============+===============+
| 3.16593e-06 | 28.2841       |
+-------------+---------------+
Clinical fingerprints
Although not a new clinical data type per se, Translator teams have been working to develop privacy-preserving analytic approaches to visualize and compare patient data, including genomic data and clinical records in semistructured JavaScript Object Notation or eXtensible Markup Language formats. Genomic data typically consist of lists of variants relative to a reference allele sorted by position. Genome fingerprints capture the unique patterns generated by pairs of consecutive single-nucleotide variants as patient-level matrices or fingerprints.10 The correlation between two fingerprints reflects the degree of relatedness between two genomes. Clinical fingerprints similarly transform clinical records from the Fast Healthcare Interoperability Resources format into numerical vectors that greatly simplify their comparison. Translator team members are working to adapt this methodology for application to the ICEES integration pipeline and incorporation into the ICEES integrated feature tables.
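
The idea of turning consecutive variant pairs into a comparable numerical vector can be sketched in a simplified form. This is not the published genome fingerprint algorithm, which includes further steps; the 12 x 12 substitution-pair encoding, the `(position, ref, alt)` tuple layout, and the function names are assumptions made for illustration.

```python
import math
from itertools import product

BASES = "ACGT"
# The 12 possible single-nucleotide substitutions, written as "ref>alt".
SNV_TYPES = [f"{r}>{a}" for r, a in product(BASES, BASES) if r != a]
INDEX = {snv: i for i, snv in enumerate(SNV_TYPES)}


def fingerprint(variants):
    """Count pairs of consecutive SNVs in a 12 x 12 matrix, flattened to a vector.

    `variants` is a list of (position, ref, alt) tuples for one genome.
    """
    ordered = sorted(variants)  # sort by position
    counts = [0] * (len(SNV_TYPES) ** 2)
    for (_, r1, a1), (_, r2, a2) in zip(ordered, ordered[1:]):
        row, col = INDEX[f"{r1}>{a1}"], INDEX[f"{r2}>{a2}"]
        counts[row * len(SNV_TYPES) + col] += 1
    return counts


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (undefined if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Two genomes are then compared through `pearson(fingerprint(g1), fingerprint(g2))`, so only the fixed-length vectors, not the variant lists themselves, need to be exchanged.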
CONCLUSION
In this article, we described various types of clinical data sets and associated inferential limitations and regulatory constraints, focusing primarily on compliance requirements mandated in the United States under HIPAA. We highlighted several novel approaches that we have developed as part of the Translator program to openly expose observational patient data, while respecting and protecting patient privacy. We recognize that each of these approaches retains a residual risk of patient reidentification; thus, we continue to work with experts in regulatory protections and computer security to ensure that those risks remain minimal. Although the Translator approaches are designed to be disease-agnostic and generalizable, they were developed to comply with HIPAA and institutional guidelines; as such, our approaches may need to be modified prior to adoption elsewhere. Nonetheless, through these open services, we hope to accelerate clinical and translational science and foster biomedical discovery.
Supporting Information. Supplementary information accompanies this paper on the Clinical and Translational Science website (www.cts-journal.com). The Further Reading includes supplementary information on Clinical Profiles, Synthetic Clinical Datasets, COHD, and ICEES, as well as relevant regulatory information and information on related large-scale patient de-identification and data-sharing efforts.
Clinical Data: Sources and Types, Regulatory Constraints, Applications.
Acknowledgments. The authors acknowledge and appreciate the contributions provided by the following individuals: Chris Bizon, Steve Cox, Ashok Krishnamurthy, Lisa Stillwell, and Hao Xu of the University of North Carolina Renaissance Computing Institute; James Champion of the North Carolina Translational and Clinical Sciences Institute; David B. Peden of the University of North Carolina School of Medicine; Sarav Arunachalam of the University of North Carolina Institute for the Environment; Max Robinson of the Institute for Systems Biology; and Stefano Rensi of Stanford University.
Funding. Support for this project was provided by the National Center for Advancing Translational Sciences, National Institutes of Health through the Biomedical Data Translator program (awards 1OT3TR002019, 1OT3TR002020, 1OT3TR002025, 1OT3TR002026, 1OT3TR002027, 1OT2TR002514, 1OT2TR002515, 1OT2TR002517, 1OT2TR002520, 1OT2TR002584) and the Clinical and Translational Sciences Award program (award UL1TR002489).
Conflict of Interest. All authors declared no competing interests for this work.
1. Harman, L.B., Flite, C.A. & Bond, K. Electronic health records: privacy, confidentiality, and security. Virtual Mentor 14, 712–719 (2012).
2. Na, L., Yang, C., Lo, C.C., Zhao, F., Fukuoka, Y. & Aswani, A. Feasibility of reidentifying individuals in large national physical activity data sets from which protected health information has been removed with use of machine learning. JAMA Network Open 1, e186040 (2018).
3. The Biomedical Data Translator Consortium. The Biomedical Data Translator program: conception, culture, and community. Clin. Transl. Sci. 12, 92–94 (2019). https://doi.org/10.1111/cts.12592
4. The Biomedical Data Translator Consortium. Toward a universal biomedical data translator. Clin. Transl. Sci. 12, 86–90 (2019). https://doi.org/10.1111/cts.12591
5. Walonoski, J. et al. Synthea: an approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. J. Am. Med. Inform. Assoc. 25, 230–238 (2018).
6. Ta, C., Dumontier, M., Hripcsak, G., Tatonetti, N. & Weng, C. Columbia Open Health Data, clinical concept prevalence and co-occurrence from electronic health records. Sci. Data 5, 180273 (2018).
7. Fecho, K. et al. A novel approach for exposing and sharing clinical data: the Translator Integrated Clinical and Environmental Exposures Service. J. Am. Med. Inform. Assoc. (in press). https://doi.org/10.1093/jamia/ocz042
8. Pfaff, E.R. et al. All roads lead to FHIR: an extensible clinical data conversion pipeline. American Medical Informatics Association 2019 Informatics Summit, San Francisco, CA, March 25–28, 2019. Abstract.
9. Mirabelli, M.C., Vaidyanathan, A., Flanders, W.D., Qin, X. & Garbe, P. Outdoor PM2.5, ambient air temperature, and asthma symptoms in the past 14 days among adults with active asthma. Environ. Health Perspect. 124, 1882–1890 (2016).
10. Glusman, G., Mauldin, D.E., Hood, L.E. & Robinson, M. Ultrafast comparison of personal genomes via precomputed genome fingerprints. Front. Genet. 8, 136 (2017).
© 2019 The Authors. Clinical and Translational Science published by Wiley Periodicals, Inc. on behalf of the American Society for Clinical Pharmacology and Therapeutics. This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.
... Specifically, ICEES provides a disease-agnostic, regulatory-compliant approach for openly exposing and analyzing clinical data (e.g., electronic health records, survey data) that have been integrated at the patient level with environmental exposures data. ICEES has been validated in the context of a driving use case on asthma, in which we demonstrated associations between asthma exacerbations and race, sex, obesity, and airborne pollutant exposure [11][12][13][14][15]. The service itself is disease-agnostic. ...
... We then applied three statistical methods-CRF, CTree, GLM-to evaluate the impact of select independent variables on the dependent variable of annual ED or inpatient visits for respiratory issues among patients with asthma and related conditions. Among seven potential predictor variables (sex, race, prescriptions for prednisone, diagnosis of obesity, exposure to airborne particulates, residential proximity to a major roadway or highway, and residential density) that were selected a priori on the basis of our prior results and those of other groups [1][2][3][4][5][6][11][12][13]15], we found five to be of significant importance using both the CRF and CTree analytic methods, namely prednisone, race, exposure to airborne particulates, obesity, and sex. Moreover, both machine learning methods ranked the predictor variables in the same order, with prednisone use as the most important predictor variable. ...
... However, there were exceptions. For instance, the final GLM did not identify sex as a significant predictor, which our group [12] and others [5,6] have found to be significant, with asthma exacerbations more common among females than males. However, the CRF and CTree methods did identify sex as a significant predictor, with asthma exacerbations more common among females than males. ...
Article
Full-text available
ICEES (Integrated Clinical and Environmental Exposures Service) provides a disease-agnostic, regulatory-compliant approach for openly exposing and analyzing clinical data that have been integrated at the patient level with environmental exposures data. ICEES is equipped with basic features to support exploratory analysis using statistical approaches, such as bivariate chi-square tests. We recently developed a method for using ICEES to generate multivariate tables for subsequent application of machine learning and statistical models. The objective of the present study was to use this approach to identify predictors of asthma exacerbations through the application of three multivariate methods: conditional random forest, conditional tree, and generalized linear model. Among seven potential predictor variables, we found five to be of significant importance using both conditional random forest and conditional tree: prednisone, race, airborne particulate exposure, obesity, and sex. The conditional tree method additionally identified several significant two-way and three-way interactions among the same variables. When we applied a generalized linear model, we identified four significant predictor variables, namely prednisone, race, airborne particulate exposure, and obesity. When ranked in order by effect size, the results were in agreement with the results from the conditional random forest and conditional tree methods as well as the published literature. Our results suggest that the open multivariate analytic capabilities provided by ICEES are valid in the context of an asthma use case and likely will have broad value in advancing open research in environmental and public health.
... Various types of clinical data are frequently utilized in clinical and translational research. These encompass fully identified clinical data, HIPAA-limited clinical data, de-identified clinical data, and synthetic data [14] . HIPAA-limited clinical datasets consist of observational patient data that include restricted personally identifiable information (PHI), such as specific dates like admission, discharge, and service dates, as well as limited demographic details like city, state, zip codes, and age expressed in years, months, days, or hours. ...
Preprint
Full-text available
Health datasets have immense potential to drive research advancements and improve healthcare outcomes. However, realizing this potential requires careful consideration of governance and ownership frameworks. This article explores the importance of nurturing governance and ownership models that facilitate responsible and ethical use of health datasets for research purposes. We highlight the importance of adopting governance and ownership models that enable responsible and ethical utilization of health datasets and clinical data registries for research purposes. The article addresses the important local and international regulations related to the utilization of health data/medical records in research, and emphasizes the urgent need for developing clear institutional and national guidelines on data access, sharing, and utilization, ensuring transparency, privacy, and data protection. By establishing robust governance structures and fostering ownership among stakeholders, collaboration, innovation, and equitable access to health data can be promoted, ultimately unlocking its full power for transformative research and improving global health outcomes.
... These encompass fully identified clinical data, HIPAA-limited clinical data, deidentified clinical data, and synthetic data. [14] HIPAA-limited clinical datasets consist of observational patient data that include restricted personally identifiable information (PHI), such as specific dates such as admission, discharge, and service dates, as well as limited demographic details such as city, state, zip codes, and age expressed in years, months, days, or hours. On the other hand, deidentified clinical datasets consist of observational patient data where all personally identifiable elements have been removed. ...
Article
Health datasets have immense potential to drive research advancements and improve healthcare outcomes. However, realizing this potential requires careful consideration of governance and ownership frameworks. This article explores the importance of nurturing governance and ownership models that facilitate responsible and ethical use of health datasets for research purposes. We highlight the importance of adopting governance and ownership models that enable responsible and ethical utilization of health datasets and clinical data registries for research purposes. The article addresses the important local and international regulations related to the utilization of health data/medical records in research, and emphasizes the urgent need for developing clear institutional and national guidelines on data access, sharing, and utilization, ensuring transparency, privacy, and data protection. By establishing robust governance structures and fostering ownership among stakeholders, collaboration, innovation, and equitable access to health data can be promoted, ultimately unlocking its full power for transformative research and improving global health outcomes.
... Of importance, the Translator clinical KPs do not expose raw clinical data, but rather aggregated or semi-aggregated data and statistical associations or machine learning predictions derived from clinical data, in full compliance with all federal and institutional regulations. 14 The Translator Consortium has adopted several tools and approaches to support standardization, harmonization, and interoperability across the diverse Translator system. First, all Translator services are accessible via APIs. ...
Article
Clinical, biomedical, and translational science has reached an inflection point in the breadth and diversity of available data and the potential impact of such data to improve human health and well‐being. However, the data are often siloed, disorganized, and not broadly accessible due to discipline‐specific differences in terminology and representation. To address these challenges, the Biomedical Data Translator Consortium has developed and tested a pilot knowledge graph–based ‘Translator’ system capable of integrating existing biomedical data sets and ‘translating’ those data into insights intended to augment human reasoning and accelerate translational science. Having demonstrated feasibility of the Translator system, the Translator program has since moved into development, and the Consortium has made significant progress in the research, design, and implementation of an operational system. Herein, we describe the current system’s architecture, performance, and quality of results. We apply Translator to several real‐world use cases developed in collaboration with subject‐matter experts. Finally, we discuss the scientific and technical features of Translator and compare those features to other state‐of‐the‐art biomedical graph‐based question‐answering systems.
... As part of the Biomedical Data Translator program (Translator) [3,4], supported by the National Center for Advancing Translational Sciences, we have developed a disease-agnostic, regulatory-compliant framework and approach for openly exposing and exploring patient data: the Integrated Clinical and Environmental Exposures Service (ICEES) [5]. ICEES was designed to overcome the regulatory, cultural, and technical challenges that hinder efforts to openly share and explore patient data [6,7]. ICEES is unique from similar efforts toward open patient data in that the service provides access to clinical data that have been integrated at the patient level with environmental exposures data derived from a variety of public sources. ...
Article
Background The Integrated Clinical and Environmental Exposures Service (ICEES) serves as an open-source, disease-agnostic, regulatory-compliant framework and approach for openly exposing and exploring clinical data that have been integrated at the patient level with a variety of environmental exposures data. ICEES is equipped with tools to support basic statistical exploration of the integrated data in a completely open manner. Objective This study aims to further develop and apply ICEES as a novel tool for openly exposing and exploring integrated clinical and environmental data. We focus on an asthma use case. Methods We queried the ICEES open application programming interface (OpenAPI) using a functionality that supports chi-square tests between feature variables and a primary outcome measure, with a Bonferroni correction for multiple comparisons (α=.001). We focused on 2 primary outcomes that are indicative of asthma exacerbations: annual emergency department (ED) or inpatient visits for respiratory issues; and annual prescriptions for prednisone. Results Of the 157,410 patients within the asthma cohort, 26,332 (16.73%) had 1 or more annual ED or inpatient visits for respiratory issues, and 17,056 (10.84%) had 1 or more annual prescriptions for prednisone. We found that close proximity to a major roadway or highway, exposure to high levels of particulate matter ≤2.5 μm (PM2.5) or ozone, female sex, Caucasian race, low residential density, lack of health insurance, and low household income were significantly associated with asthma exacerbations (P
... According to the General Data Protection Regulation set in effect in the European Union, organizations are responsible for the misuse of information that is processed on their systems [90]. Thus it is not just the individual person that is interested in the security of their data [6,140,148,176], but many commercial enterprises who process these data are motivated to ensure that they are not subject to unintended disclosure through neglect or otherwise. ...
Article
With the dramatic improvements in both the capability to collect personal data and the capability to analyze large amounts of data, increasingly sophisticated and personal insights are being drawn. These insights are valuable for clinical applications but also open up possibilities for identification and abuse of personal information. In this article, we survey recent research on classical methods of privacy-preserving data mining. Looking at dominant techniques and recent innovations to them, we examine the applicability of these methods to the privacy-preserving analysis of clinical data. We also discuss promising directions for future research in this area.
... Some data constraints, limitations, or complexities may not be discovered until deep into a project. All told, the use of empirical techniques is often encumbered with high resource demands and legal, institutional, and technical hurdles [42]. Semantic data, in the form of digitally encoded vocabularies, groupers, code set repositories, etc., may be encumbered by licensing constraints and complexities (e.g., around version synchronization), but they are not burdened with the same kind of privacy issues that are always present with patient data. ...
Preprint
Objective: Code sets play a central role in analytic work with clinical data warehouses, as components of phenotype, cohort, or analytic variable algorithms representing specific clinical phenomena. Code set quality has received critical attention and repositories for sharing and reusing code sets have been seen as a way to improve quality and reduce redundant effort. Nonetheless, concerns regarding code set quality persist. In order to better understand ongoing challenges in code set quality and reuse, and address them with software and infrastructure recommendations, we determined it was necessary to learn how code sets are constructed and validated in real-world settings. Methods: Survey and field study using semi-structured interviews of a purposive sample of code set practitioners. Open coding and thematic analysis on interview transcripts, interview notes, and answers to open-ended survey questions. Results: Thirty-six respondents completed the survey, of whom 15 participated in follow-up interviews. We found great variability in the methods, degree of formality, tools, expertise, and data used in code set construction and validation. We found universal agreement that crafting high-quality code sets is difficult, but very different ideas about how this can be achieved and validated. A primary divide exists between those who rely on empirical techniques using patient-level data and those who only rely on expertise and semantic data. We formulated a method- and process-based model able to account for observed variability in formality, thoroughness, resources, and techniques. Conclusion: Our model provides a structure for organizing a set of recommendations to facilitate reuse based on metadata capture during the code set development process. 
It classifies validation methods by the data they depend on — semantic, empirical, and derived — as they are applied over a sequence of phases: (1) code collection; (2) code evaluation; (3) code set evaluation; (4) code set acceptance; and, optionally, (5) reporting of methods used and validation results. This schematization of real-world practices informs our analysis of and response to persistent challenges in code set development. Potential re-users of existing code sets can find little evidence to support trust in their quality and fitness for use, particularly when reusing a code set in a new study or database context. Rather than allowing code set sharing and reuse to remain separate activities, occurring before and after the main action of code set development, sharing and reuse must permeate every step of the process in order to produce reliable evidence of quality and fitness for use.
Article
The need for sufficient clinical evidence and the collection of real-world evidence (RWE) is at the forefront of medical device and drug regulations; however, the collection of clinical data can be a time-consuming and costly process. The advancement of Digital Health Technologies (DHTs) is transforming the way health data can be collected, analysed, and shared, presenting an opportunity for the implementation of DHTs in clinical research to aid with obtaining clinical evidence, particularly RWE. DHTs can provide a more efficient and timely way of collecting numerous types of clinical data (e.g., physiological and behavioural data) and can be beneficial with regard to participant recruitment, data management, and cost reduction. Recent guidelines and regulations on the use of RWE within regulatory decision-making processes open the door for the wider implementation of DHTs. However, challenges and concerns remain regarding the use of DHTs (such as data security and privacy). Nevertheless, the implementation of DHTs in clinical research presents a promising opportunity for providing meaningful and patient-centred data to aid with regulatory decisions.
Article
The Integrated Clinical and Environmental Exposures Service (ICEES) provides regulatory-compliant open access to sensitive patient data that have been integrated with public exposures data. ICEES was designed initially to support dynamic cohort creation and bivariate contingency tests. The objective of the present study was to develop an open approach to support multivariate analyses using existing ICEES functionalities and abiding by all regulatory constraints. We first developed an open approach for generating a multivariate table that maintains contingencies between clinical and environmental variables using programmatic calls to the open ICEES application programming interface. We then applied the approach to data on a large cohort (N = 22,365) of patients with asthma or related conditions and generated an eight-feature table. Due to regulatory constraints, data loss was incurred with the incorporation of each successive feature variable, from a starting sample size of N = 22,365 to a final sample size of N = 4,556 (20.4%), but data loss was < 10% until the addition of the final two feature variables. We then applied a generalized linear model to the subsequent dataset and focused on the impact of seven select feature variables on asthma exacerbations, defined as annual emergency department or inpatient visits for respiratory issues. We identified five feature variables—sex, race, obesity, prednisone, and airborne particulate exposure—as significant predictors of asthma exacerbations. We discuss the advantages and disadvantages of ICEES open multivariate analysis and conclude that, despite limitations, ICEES can provide a valuable resource for open multivariate analysis and can serve as an exemplar for regulatory-compliant informatic solutions to open patient data, with capabilities to explore the impact of environmental exposures on health outcomes.
Article
Importance Despite data aggregation and removal of protected health information, there is concern that deidentified physical activity (PA) data collected from wearable devices can be reidentified. Organizations collecting or distributing such data suggest that the aforementioned measures are sufficient to ensure privacy. However, no studies, to our knowledge, have been published that demonstrate the possibility or impossibility of reidentifying such activity data. Objective To evaluate the feasibility of reidentifying accelerometer-measured PA data, which have had geographic and protected health information removed, using support vector machines (SVMs) and random forest methods from machine learning. Design, Setting, and Participants In this cross-sectional study, the National Health and Nutrition Examination Survey (NHANES) 2003-2004 and 2005-2006 data sets were analyzed in 2018. The accelerometer-measured PA data were collected in a free-living setting for 7 continuous days. NHANES uses a multistage probability sampling design to select a sample that is representative of the civilian noninstitutionalized household (both adult and children) population of the United States. Exposures The NHANES data sets contain objectively measured movement intensity as recorded by accelerometers worn during all walking for 1 week. Main Outcomes and Measures The primary outcome was the ability of the random forest and linear SVM algorithms to match demographic and 20-minute aggregated PA data to individual-specific record numbers, and the percentage of correct matches by each machine learning algorithm was the measure. Results A total of 4720 adults (mean [SD] age, 40.0 [20.6] years) and 2427 children (mean [SD] age, 12.3 [3.4] years) in NHANES 2003-2004 and 4765 adults (mean [SD] age, 45.2 [19.9] years) and 2539 children (mean [SD] age, 12.1 [3.4] years) in NHANES 2005-2006 were included in the study. 
The random forest algorithm successfully reidentified the demographic and 20-minute aggregated PA data of 4478 adults (94.9%) and 2120 children (87.4%) in NHANES 2003-2004 and 4470 adults (93.8%) and 2172 children (85.5%) in NHANES 2005-2006 (P < .001 for all). The linear SVM algorithm successfully reidentified the demographic and 20-minute aggregated PA data of 4043 adults (85.6%) and 1695 children (69.8%) in NHANES 2003-2004 and 4041 adults (84.8%) and 1705 children (67.2%) in NHANES 2005-2006 (P < .001 for all). Conclusions and Relevance This study suggests that current practices for deidentification of accelerometer-measured PA data might be insufficient to ensure privacy. This finding has important policy implications because it appears to show the need for deidentification that aggregates the PA data of multiple individuals to ensure privacy for single individuals.
Article
Columbia Open Health Data (COHD) is a publicly accessible database of electronic health record (EHR) prevalence and co-occurrence frequencies between conditions, drugs, procedures, and demographics. COHD was derived from Columbia University Irving Medical Center’s Observational Health Data Sciences and Informatics (OHDSI) database. The lifetime dataset, derived from all records, contains 36,578 single concepts (11,952 conditions, 12,334 drugs, and 10,816 procedures) and 32,788,901 concept pairs from 5,364,781 patients. The 5-year dataset, derived from records from 2013–2017, contains 29,964 single concepts (10,159 conditions, 10,264 drugs, and 8,270 procedures) and 15,927,195 concept pairs from 1,790,431 patients. Exclusion of rare concepts (count ≤ 10) and Poisson randomization enable data sharing by eliminating risks to patient privacy. EHR prevalences are informative of healthcare consumption rates. Analysis of co-occurrence frequencies via relative frequency analysis and observed-expected frequency ratio are informative of associations between clinical concepts, useful for biomedical research tasks such as drug repurposing and pharmacovigilance. COHD is publicly accessible through a web application-programming interface (API) and downloadable from the Figshare repository. The code is available on GitHub.
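The two privacy safeguards and the observed-expected frequency ratio named in this abstract can be illustrated with a minimal Python sketch. It is not the COHD implementation: the function names (`poisson_sample`, `protect_counts`, `observed_expected_ratio`) are hypothetical, the Poisson draw uses each true count as its mean, and the expected co-occurrence count assumes the two concepts are independent. Only the exclusion threshold (count ≤ 10) comes from the abstract.

```python
import random

MIN_COUNT = 10  # from the abstract: concepts with count <= 10 are excluded


def poisson_sample(lam, rng):
    """Draw from Poisson(lam) by counting unit-rate exponential arrivals in [0, lam]."""
    n = 0
    t = rng.expovariate(1.0)
    while t <= lam:
        n += 1
        t += rng.expovariate(1.0)
    return n


def protect_counts(counts, rng):
    """Drop rare concepts, then replace each remaining count with a Poisson draw.

    Assumption: the true count is used as the Poisson mean.
    """
    return {
        concept: poisson_sample(count, rng)
        for concept, count in counts.items()
        if count > MIN_COUNT
    }


def observed_expected_ratio(pair_count, count_a, count_b, n_patients):
    """Observed co-occurrence count divided by the count expected under independence."""
    expected = count_a * count_b / n_patients
    return pair_count / expected


if __name__ == "__main__":
    rng = random.Random(0)
    raw = {"rare_condition": 7, "common_condition": 500}  # made-up counts
    print(protect_counts(raw, rng))
    print(observed_expected_ratio(20, 100, 50, 1000))
```

In this sketch, a ratio above 1 indicates that two concepts co-occur more often than independence would predict. The perturbed counts depend on the random seed, so the released numbers differ slightly from the true ones while remaining close on average.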
Article
We present an ultrafast method for comparing personal genomes. We transform the standard genome representation (lists of variants relative to a reference) into “genome fingerprints” via locality sensitive hashing. The resulting genome fingerprints can be meaningfully compared even when the input data were obtained using different sequencing technologies, processed using different pipelines, represented in different data formats and relative to different reference versions. Furthermore, genome fingerprints are robust to up to 30% missing data. Because of their reduced size, computation on the genome fingerprints is fast and requires little memory. For example, we could compute all-against-all pairwise comparisons among the 2504 genomes in the 1000 Genomes data set in 67 s at high quality (21 μs per comparison, on a single processor), and achieved a lower quality approximation in just 11 s. Efficient computation enables scaling up a variety of important genome analyses, including quantifying relatedness, recognizing duplicative sequenced genomes in a set, population reconstruction, and many others. The original genome representation cannot be reconstructed from its fingerprint, effectively decoupling genome comparison from genome interpretation; the method thus has significant implications for privacy-preserving genome analytics.
Article
Objective: Our objective is to create a source of synthetic electronic health records that is readily available; suited to industrial, innovation, research, and educational uses; and free of legal, privacy, security, and intellectual property restrictions. Materials and Methods: We developed Synthea, an open-source software package that simulates the lifespans of synthetic patients, modeling the 10 most frequent reasons for primary care encounters and the 10 chronic conditions with the highest morbidity in the United States. Results: Synthea adheres to a previously developed conceptual framework, scales via open-source deployment on the Internet, and may be extended with additional disease and treatment modules developed by its user community. One million synthetic patient records are now freely available online, encoded in standard formats (eg, Health Level-7 [HL7] Fast Healthcare Interoperability Resources [FHIR] and Consolidated-Clinical Document Architecture), and accessible through an HL7 FHIR application program interface. Discussion: Health care lags other industries in information technology, data exchange, and interoperability. The lack of freely distributable health records has long hindered innovation in health care. Approaches and tools are available to inexpensively generate synthetic health records at scale without accidental disclosure risk, lowering current barriers to entry for promising early-stage developments. By engaging a growing community of users, the synthetic data generated will become increasingly comprehensive, detailed, and realistic over time. Conclusion: Synthetic patients can be simulated with models of disease progression and corresponding standards of care to produce risk-free realistic synthetic health care records at scale.
Article
Background: Relationships between air quality and health are well-described, but little information is available about the joint associations between particulate air pollution, ambient temperature, and respiratory morbidity. Objectives: To evaluate associations between concentrations of particulate matter ≤2.5 microns in diameter (PM2.5) and exacerbation of existing asthma and modification of the associations by ambient air temperature. Methods: Data from 50,356 adult 2006-2010 Asthma Call-back Survey respondents were linked by interview date and county of residence to estimates of daily averages of PM2.5 and maximum air temperature. Associations between 14-day average PM2.5 and the presence of any asthma symptoms during the 14 days leading up to and including the interview date were evaluated using binomial regression. We explored variation by air temperature using similar models, stratified into quintiles of the 14-day average maximum temperature. Results: Among adults with active asthma, 57.1% reported asthma symptoms within the past 14 days and 14-day average PM2.5 ≥7.07 µg·m(-3) was associated with an estimated 4 to 5% higher asthma symptom prevalence. In the range of 4.00 to 7.06 µg·m(-3) of PM2.5, each µg·m(-3) increase was associated with a 3.4% (95% confidence interval: 1.1, 5.7) increase in symptom prevalence; across categories of temperature from 1.1 to 80.5°F, each µg·m(-3) increase was associated with increased symptom prevalence (1.1-44.4°F: 7.9%; 44.5-58.6°F: 6.9%; 58.7-70.1°F: 2.9%; 70.2-80.5°F: 7.3%). Conclusions: These results suggest that each unit increase in PM2.5 may be associated with an increase in the prevalence of asthma symptoms, even at levels as low as 4.00 to 7.06 µg·m(-3).
Article
Objective: This study aimed to develop a novel, regulatory-compliant approach for openly exposing integrated clinical and environmental exposures data: the Integrated Clinical and Environmental Exposures Service (ICEES). Materials and methods: The driving clinical use case for research and development of ICEES was asthma, which is a common disease influenced by hundreds of genes and a plethora of environmental exposures, including exposures to airborne pollutants. We developed a pipeline for integrating clinical data on patients with asthma-like conditions with data on environmental exposures derived from multiple public data sources. The data were integrated at the patient and visit level and used to create de-identified, binned, "integrated feature tables," which were then placed behind an OpenAPI. Results: Our preliminary evaluation results demonstrate a relationship between exposure to high levels of particulate matter ≤2.5 µm in diameter (PM2.5) and the frequency of emergency department or inpatient visits for respiratory issues. For example, 16.73% of patients with average daily exposure to PM2.5 >9.62 µg/m3 experienced 2 or more emergency department or inpatient visits for respiratory issues in year 2010 compared with 7.93% of patients with lower exposures (n = 23 093). Discussion: The results validated our overall approach for openly exposing and sharing integrated clinical and environmental exposures data. We plan to iteratively refine and expand ICEES by including additional years of data, feature variables, and disease cohorts. Conclusions: We believe that ICEES will serve as a regulatory-compliant model and approach for promoting open access to and sharing of integrated clinical and environmental exposures data.
Pfaff, E.R. et al. All roads lead to FHIR: an extensible clinical data conversion pipeline. American Medical Informatics Association 2019 Informatics Summit, San Francisco, CA, March 25-28, 2019. Abstract.