
Clinical Data: Sources and Types, Regulatory Constraints, Applications

Clinical and Translational Science

May 2019
12(4)

DOI:10.1111/cts.12638

License
CC BY-NC 4.0

Authors:

Stanley C. Ahalt

University of North Carolina at Chapel Hill

Christopher G Chute

Johns Hopkins Medicine

Karamarie Fecho

Copperline Professional Solutions

Gwênlyn Glusman

Institute for Systems Biology


Access to clinical data is critical for the advancement of translational research. However, the numerous regulations and policies that surround the use of clinical data, although critical to ensure patient privacy and protect against misuse, often present challenges to data access and sharing. In this article, we provide an overview of clinical data types and associated regulatory constraints and inferential limitations. We highlight several novel approaches that our team has developed for openly exposing clinical data.



Citation: Clin Transl Sci (2019) 12, 329–333; doi:10.1111/cts.12638

COMMENTARY

Clinical Data: Sources and Types, Regulatory Constraints, Applications

Stanley C. Ahalt¹,†, Christopher G. Chute², Karamarie Fecho¹,*, Gustavo Glusman³, Jennifer Hadlock³, Casey Overby Taylor², Emily R. Pfaff⁴, Peter N. Robinson⁵, Harold Solbrig², Casey Ta⁶, Nicholas Tatonetti⁶ and Chunhua Weng⁶

The Biomedical Data Translator Consortium


BACKGROUND

Recognizing the need to respect and protect patient privacy, numerous regulations have been established to govern the use of clinical data by researchers, including the federal Health Insurance Portability and Accountability Act of 1996 (HIPAA) and the European Union General Data Protection Regulation. Institution-specific guidelines and governing bodies such as institutional review boards (IRBs) also address research involving patient data and other sensitive data available in electronic medical records (e.g., administrative data), in part as a result of concerns regarding the liability of healthcare providers and institutions.¹,²

The Biomedical Data Translator (Translator) program, funded by the National Center for Advancing Translational Sciences, aims to facilitate the transformation of basic science discoveries into clinically actionable knowledge and leverage clinical expertise to drive research innovations.³,⁴ Access to clinical data is central to the vision of the program. Yet, the program's dedication to open science adds complexity to the regulatory, technical, and cultural challenges that already surround access to clinical data.

We review here the types of clinical data sets that can be derived from paper or electronic medical records, their applications and limitations, and their associated regulatory constraints, focusing primarily on compliance requirements mandated in the United States under HIPAA (Table 1). We briefly describe several clinical data types that are commonly employed in clinical and translational research, including fully identified clinical data, HIPAA-limited clinical data, deidentified clinical data, and synthetic data. We highlight several novel approaches for openly exposing clinical data that we have developed as part of the Translator program, namely, HIPAA Safe Harbor Plus (HuSH+) clinical data, clinical profiles, Columbia Open Health Data (COHD), and the Integrated Clinical and Environmental Exposures Service (ICEES).

TYPES OF CLINICAL DATA SETS

Fully identified clinical data sets

Fully identified clinical data sets comprise observational patient data, including direct patient identifiers (i.e., protected health information (PHI)), as defined in the privacy rule issued under HIPAA. Access requires a specific research hypothesis, study approval by an IRB, a full or partial waiver of HIPAA-informed consent, and typically a secure workspace. For investigators not affiliated with a specific institution, additional regulations and approvals may apply, including a data use agreement (DUA) with the provider institution. Fully identified clinical data sets may be used for clinical interpretation and scientific inference and discovery. However, as with all data sets but especially observational administrative data sets, issues of data quality and integrity must be taken into account when drawing conclusions.¹

HIPAA-limited clinical data sets

HIPAA-limited clinical data sets comprise observational patient data with limited PHI: dates such as admission, discharge, service, and dates of birth and death; city, state, and zip codes of five or more digits; and ages in years, months, days, or hours. HIPAA-limited clinical data sets may be used or disclosed for purposes of research, public health, or healthcare operations without obtaining patient authorization or a waiver of HIPAA-informed consent but with IRB approval and (in some cases) a fully executed DUA. HIPAA-limited clinical data sets may be used for clinical interpretation and scientific inference and discovery but with the understanding that certain data elements have been removed from the data and/or transformed (e.g., age vs. birth date).

¹Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA; ²Johns Hopkins University, Baltimore, Maryland, USA; ³Institute for Systems Biology, Seattle, Washington, USA; ⁴North Carolina Translational and Clinical Sciences Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA; ⁵The Jackson Laboratory, Farmington, Connecticut, USA; ⁶Columbia University, New York, New York, USA. *Correspondence: Karamarie Fecho (kfecho@copperlineprofessionalsolutions.com)

Received: January 24, 2019; accepted: March 27, 2019. doi:10.1111/cts.12638

†Authors are listed alphabetically.

Deidentified clinical data sets

Deidentified clinical data sets comprise observational patient data from which all PHI elements have been removed. Access to deidentified clinical data sets does not require IRB approval, although an IRB Request for Determination of Human Subjects Research is advised. In addition, a fully executed DUA is sometimes required. Deidentified clinical data sets may be used for clinical interpretation and scientific inference and discovery but to a lesser extent than HIPAA-limited clinical data sets because key variables or covariates may have been removed from the data. For instance, dates are required to make inferences regarding seasonal patterns in clinical outcomes and correlations with natural disasters, system-related issues such as protocol changes, and regulatory issues such as new black-box warnings.

Table 1 Clinical data types, regulatory access restrictions, and applications

| Clinical data type | Brief description | Regulatory access restrictions | Applications |
|---|---|---|---|
| Fully identified clinical data sets | Observational patient data derived from paper-based or electronic medical records | IRB approval is required; an executed data use agreement is possibly requiredᵃ | Clinical interpretation and scientific inference and discovery |
| HIPAA-limited clinical data sets | Observational patient data containing only a limited set of HIPAA-defined PHI | IRB approval is required; an executed data use agreement is possibly requiredᵃ | Clinical interpretation and scientific inference and discovery, but with the understanding that certain data elements have been removed from the data and/or transformed |
| Deidentified clinical data sets | Observational patient data, but with all HIPAA-defined PHI elements removed | IRB approval is not requiredᵇ; IRB "Request for Determination of Human Subjects Research" is typically recommended; an executed data use agreement is possibly required | Clinical interpretation and scientific inference and discovery, but with the understanding that inferences regarding time and potentially other factors cannot be made |
| HuSH+ clinical data sets | Observational patient data, fully compliant with HIPAA Safe Harbor, but unlike deidentified clinical data sets, HuSH+ clinical data sets have been altered such that (i) real patient identifiers (including geocodes) have been replaced with random patient identifiers and (ii) dates (including birth dates) have been shifted by a random number of days (maximum of ±50 days), with all dates for a given patient shifted by the same number of days. Data are derived from UNC Health Care System | An executed data use agreement is requiredᶜ | Clinical interpretation and scientific inference and discovery, but with the understanding that any inferences based on date/time and location (geocode) cannot be made with precision, and all other inferences must consider date/time and location as potentially hidden covariates |
| Clinical profiles | Statistical profiles of disease and associated phenotypic presentation derived from observational patient data. Data are derived from Johns Hopkins Medicine | IRB approval is required to generate clinical profiles; no other restrictions apply | Clinical interpretation and scientific inference, but with the understanding that the data represent statistical profiles |
| Synthetic clinical data sets | Realistic, but not real, observational patient data generated statistically using population distributions of observational patient data | None | Feasibility assessments and algorithm validation; generation of clinical profiles |
| COHD | Counts of observational clinical co-occurrences (e.g., co-occurrences of specific diagnoses and prescribed medications), as well as their relative frequency and observed–expected frequency ratio. Data are derived from Columbia University Irving Medical Center | None | Clinical interpretation and scientific inference, but with the understanding that the data are restricted to co-occurrences |
| ICEES | Patient-level or visit-level counts of observational patient data integrated at the patient and visit level with a variety of environmental exposures derived from multiple public data sources. Data are derived from UNC Health Care System and a variety of public data sources on environmental exposures | IRB approval is required to generate ICEES integrated feature tables; no other restrictions apply | Clinical interpretation and scientific inference, but with the understanding that the raw data have been transformed (e.g., binned or categorized) |

COHD, Columbia Open Health Data; HIPAA, Health Insurance Portability and Accountability Act; HuSH+, HIPAA Safe Harbor Plus; ICEES, Integrated Clinical and Environmental Exposures Service; IRB, institutional review board; PHI, protected health information; UNC, University of North Carolina.

ᵃIndividual institutions may require a secure workspace for data access and use. ᵇWhile HIPAA and IRB regulations do not apply, institutional approvals may be required. ᶜHuSH+ clinical data sets were conceptualized and created by UNC as part of the National Center for Advancing Translational Sciences–funded Biomedical Data Translator program. The institution requires a fully executed data use agreement for access to the data.

HuSH+ clinical data sets

HuSH+ clinical data sets were created by Translator team members as a hybrid deidentification approach that is completely compliant with HIPAA and provides restricted access to observational patient data from the UNC Health Care System. HuSH+ clinical data sets differ from deidentified clinical data sets in that (i) real patient identifiers (including geocodes) have been replaced with random patient identifiers and (ii) dates (including birth dates) have been shifted by a random number of days (maximum of ±50 days), with all dates for a given patient shifted by the same number of days. Access to HuSH+ clinical data does not require IRB approval but does require a fully executed DUA per institutional mandate. HuSH+ clinical data sets may be used in a limited fashion for clinical interpretation and scientific inference and discovery. The main considerations are that any inferences based on date/time and location (geocode) cannot be made with precision or correlated with seasonal trends or specific events, and all other inferences must consider date/time and location as potentially hidden covariates.
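The two HuSH+ alterations described above can be sketched in a few lines of Python. This is a minimal illustration, not the UNC implementation: the record layout (dictionaries with a `patient_id` key and `datetime.date` values), the function name `hush_plus_transform`, and the identifier format are all assumptions made for the example.

```python
import random
from datetime import date, timedelta

MAX_SHIFT_DAYS = 50  # maximum shift of +/- 50 days, as stated in the text


def hush_plus_transform(records, rng):
    """Replace patient IDs with random IDs and shift every date field.

    `records` is a list of dicts with a hypothetical 'patient_id' key;
    any value that is a datetime.date is treated as a date to shift.
    All dates for one patient receive the same random offset.
    """
    id_map = {}     # real patient ID -> random patient ID
    shift_map = {}  # real patient ID -> per-patient date offset
    transformed = []
    for record in records:
        real_id = record["patient_id"]
        if real_id not in id_map:
            # A production system would use a cryptographically secure source.
            id_map[real_id] = f"P{rng.getrandbits(64):016x}"
            offset = rng.randint(-MAX_SHIFT_DAYS, MAX_SHIFT_DAYS)
            shift_map[real_id] = timedelta(days=offset)
        shifted = {
            key: (value + shift_map[real_id] if isinstance(value, date) else value)
            for key, value in record.items()
        }
        shifted["patient_id"] = id_map[real_id]
        transformed.append(shifted)
    return transformed
```

Because every date for a given patient moves by the same offset, intervals such as age at visit or time between visits are preserved, whereas calendar-based inferences (season, specific events) are not.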

Clinical profiles

Clinical profiles have been developed as part of the Translator program and represent statistical profiles of disease and associated phenotypic presentations derived from observational patient data from Johns Hopkins Medicine using the Health Level Seven International Fast Healthcare Interoperability Resources common data model. At present, clinical profiles include data on demographics, diagnoses, disease comorbidities, symptoms, medications, procedures, and laboratory measures. IRB approval is required to generate clinical profiles but once generated, clinical profiles can be openly shared. Institutional restrictions may apply, however. Clinical profiles can be used for clinical interpretation and scientific inference and discovery but with the understanding that they represent statistical summaries of patient populations and only indirectly represent patient-level observations. Multiple computational tools and example output files are openly available for creating and using clinical profiles (see Supplemental Information on Clinical Profiles in Further Reading).

Synthetic clinical data sets

Synthetic clinical data sets comprise realistic (but not real) data generated statistically by applying simulation techniques to population distributions of observational patient data. Synthetic clinical data sets can be openly shared. A publicly available example, the Synthetic Mass data set, was generated using the Synthea method⁵ to simulate patient-level and population-level data on patients who reside in the state of Massachusetts. A similar open effort is Simulacrum, which is based on observational patient data held by Public Health England's National Cancer Registration and Analysis Service. The data include realistic patient histories with clinically relevant patient encounters; as such, the data can be used for feasibility assessments and algorithm validation but not for clinical interpretation or scientific inference and discovery.
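As a rough illustration of generating records from population distributions, the Python sketch below draws each feature independently from a marginal distribution. It is a deliberately simplified stand-in, not the Synthea method or the Simulacrum approach; the feature names and probabilities in `POPULATION_DISTRIBUTIONS` are invented for the example.

```python
import random

# Hypothetical marginal distributions; the numbers are illustrative only.
POPULATION_DISTRIBUTIONS = {
    "sex": {"female": 0.52, "male": 0.48},
    "age_group": {"0-17": 0.20, "18-64": 0.60, "65+": 0.20},
    "asthma": {True: 0.08, False: 0.92},
}


def synthesize_patients(n, distributions, rng):
    """Draw n synthetic patients, sampling each feature independently."""
    patients = []
    for i in range(n):
        patient = {"synthetic_id": i}
        for feature, distribution in distributions.items():
            values = list(distribution.keys())
            weights = list(distribution.values())
            patient[feature] = rng.choices(values, weights=weights, k=1)[0]
        patients.append(patient)
    return patients
```

Sampling features independently ignores correlations between them (for example, between age and diagnosis), which is one reason data of this kind suit feasibility assessments and algorithm validation rather than inference.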

COHD

Translator team members have pioneered the use of clinical co-occurrence tables as part of the COHD initiative.⁶ COHD provides open access to observational patient data from Columbia University Irving Medical Center in the form of co-occurrence counts of pairs of concepts or clinical feature variables (e.g., medications and diagnoses), as well as their relative frequency and observed–expected frequency ratio. The data are publicly accessible via an open web interface or Application Programming Interface. Risks to patient privacy are mitigated by excluding rare features (counts ≤10) and perturbing the counts according to the Poisson distribution. The data can be used to derive insights into questions of clinical relevance and importance for translational research. For instance, an individual user may wish to know the frequency of asthma among African American patients (Figure 1a). A search of the COHD service reveals that there are 11,716 African American patients with a diagnosis of asthma among 208,438 African American patients (5.62%). For comparison, a second search reveals that there are 29,913 white patients with a diagnosis of asthma among 601,167 white patients (4.98%).
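The privacy safeguards and the prevalence arithmetic in this section can be sketched as follows. This is an illustration under stated assumptions, not COHD's code: it assumes that "perturbing the counts according to the Poisson distribution" means drawing the released count from a Poisson distribution whose mean is the true count, and the function names are hypothetical.

```python
import math
import random

RARE_THRESHOLD = 10  # counts <= 10 are excluded, as stated in the text


def sample_poisson(mean, rng):
    """Draw from Poisson(mean) using Knuth's method in chunks of <= 500.

    Chunking avoids floating-point underflow of exp(-mean) for large means;
    the sum of independent Poisson draws is Poisson with the summed mean.
    """
    total = 0
    remaining = float(mean)
    while remaining > 0:
        chunk = min(remaining, 500.0)
        threshold = math.exp(-chunk)
        k, p = 0, rng.random()
        while p > threshold:
            k += 1
            p *= rng.random()
        total += k
        remaining -= chunk
    return total


def release_count(true_count, rng):
    """Return a perturbed count, or None when the feature is too rare."""
    if true_count <= RARE_THRESHOLD:
        return None
    return sample_poisson(true_count, rng)


def prevalence_percent(concept_count, group_count):
    """Relative frequency of a concept within a group, as a percentage."""
    return 100.0 * concept_count / group_count
```

With the counts quoted in the text, `prevalence_percent(11716, 208438)` rounds to 5.62 and `prevalence_percent(29913, 601167)` rounds to 4.98, matching the percentages reported above.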

ICEES

ICEES was designed by Translator team members as a novel extension of COHD.⁷ Specifically, ICEES permits open access to observational patient data from the UNC Health Care System that have been integrated at the patient and visit level with environmental exposures data (e.g., airborne and roadway pollutants, socioeconomic factors) derived from multiple public sources. A complex data extraction and integration software pipeline has been developed to create ICEES integrated feature tables.⁸ The tables are generated using PHI (geocodes and dates), but the data are then binned or recoded and stripped of PHI. Thus, the ICEES pipeline must be executed under an approved IRB protocol, but subsequent steps are not subject to IRB regulation, and ICEES is publicly accessible via an Application Programming Interface. ICEES provides a number of functionalities for clinical interpretation and scientific inference and discovery. For example, Figure 1b demonstrates that for COHORT:60 (African Americans with asthma-like conditions in calendar year 2010), the percentage of patients with two or more annual emergency department or inpatient visits for respiratory issues is higher among patients with high average daily exposure to particulate matter ≤2.5 μm in diameter than among patients with low average daily exposure to particulate matter ≤2.5 μm in diameter (21.10% vs. 8.90%, P < 0.0001, N = 6,379), thus replicating published literature on the association between airborne pollutant exposures and asthma exacerbations.⁹ The data additionally suggest that African Americans with asthma-like conditions have relatively high exposure to particulate matter, with ~95% of the cohort exposed to ≥9.63 μg/m³ average daily particulate matter ≤2.5 μm in diameter.
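The row percentages and the chi-squared statistic reported for COHORT:60 can be rechecked from the four cell counts shown in Figure 1b. The Python sketch below is an independent recalculation, not ICEES code; the function names are hypothetical, and the p-value it returns assumes 1 degree of freedom for a 2x2 table, which may differ from the convention used by the service and is therefore not expected to equal the `p_value` printed in the figure.

```python
import math

# Cell counts from Figure 1b: rows are low / high PM2.5 exposure bins,
# columns are TotalEDInpatientVisits < 2 and >= 2.
COHORT_60_TABLE = [[297, 29], [4776, 1277]]


def row_percentages(table):
    """Percentage of each row total that falls in each column."""
    return [[100.0 * value / sum(row) for value in row] for row in table]


def chi_square_2x2(table):
    """Pearson chi-squared statistic (no continuity correction) and
    a p-value assuming 1 degree of freedom."""
    (a, b), (c, d) = table
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value
```

Applied to `COHORT_60_TABLE`, the statistic agrees with the `chi_squared` value of 28.2841 shown in the figure, and the second-column row percentages round to 8.90% and 21.10%, the figures quoted in the text.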

Figure 1 Example queries, including input parameters and output, for Columbia Open Health Data (COHD) (a) and the Integrated Clinical and Environmental Exposures Service (ICEES) (b). AvgDailyPM2.5Exposure = average daily patient exposure to PM2.5 (μg/m³) over a 1-year study period; TotalEDInpatientVisits = total number of emergency department or inpatient visits for respiratory issues during a 1-year study period. The study period shown here is for calendar year 2010. AvgDailyPM2.5Exposure <3 range: 1.58, 9.63 μg/m³; AvgDailyPM2.5Exposure ≥3 range: 9.63, 17.33 μg/m³. ID, identifier; PM2.5, airborne particulate matter ≤2.5 μm in diameter.

(a) COHD example queries

Input: Asthma (ID #317009) and Black or African American (ID #8516)

Output:

Input: Asthma (ID #317009) and White (ID #8527)

Output:

(b) ICEES example query

Input:

Feature variables: AvgDailyPM2.5Exposures < 3, TotalEDInpatientVisits <
Version of data: 1.0.0
Table: patient
Year: 2010
Cohort ID: COHORT:60

Output:

+----------------------------+------------------------------+-------------------------------+---------+
| feature                    | TotalEDInpatientVisits < 2   | TotalEDInpatientVisits >= 2   |         |
+============================+==============================+===============================+=========+
| AvgDailyPM2.5Exposure < 3  | 297 91.10%                   | 29 8.90%                      | 326     |
|                            | 5.85% 4.66%                  | 2.22% 0.45%                   | 5.11%   |
+----------------------------+------------------------------+-------------------------------+---------+
| AvgDailyPM2.5Exposure >= 3 | 4776 78.90%                  | 1277 21.10%                   | 6053    |
|                            | 94.15% 74.87%                | 97.78% 20.02%                 | 94.89%  |
+----------------------------+------------------------------+-------------------------------+---------+
|                            | 5073                         | 1306                          | 6379    |
|                            | 79.53%                       | 20.47%                        | 100.00% |
+----------------------------+------------------------------+-------------------------------+---------+

+-------------+---------------+
| p_value     | chi_squared   |
+=============+===============+
| 3.16593e-06 | 28.2841       |
+-------------+---------------+

Clinical fingerprints

Although not a new clinical data type per se, Translator teams have been working to develop privacy-preserving analytic approaches to visualize and compare patient data, including genomic data and clinical records in semistructured JavaScript Object Notation or eXtensible Markup Language formats. Genomic data typically consist of lists of variants relative to a reference allele sorted by position. Genome fingerprints capture the unique patterns generated by pairs of consecutive single-nucleotide variants as patient-level matrices or fingerprints.¹⁰ The correlation between two fingerprints reflects the degree of relatedness between two genomes. Clinical fingerprints similarly transform clinical records from the Fast Healthcare Interoperability Resources format into numerical vectors that greatly simplify their comparison. Translator team members are working to adapt this methodology for application to the ICEES integration pipeline and incorporation into the ICEES integrated feature tables.
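The fingerprint idea described in this section, counting patterns formed by pairs of consecutive single-nucleotide variants and then correlating the resulting vectors, can be illustrated with a heavily simplified Python sketch. It is not the published genome fingerprinting algorithm, which includes further steps not reproduced here; the variant tuple layout `(position, ref, alt)` and the function names are assumptions made for the example.

```python
from itertools import product
from math import sqrt

BASES = "ACGT"
# The 12 possible single-nucleotide substitutions, e.g. "AC" means A -> C.
SNV_TYPES = [ref + alt for ref, alt in product(BASES, BASES) if ref != alt]
SNV_INDEX = {snv: i for i, snv in enumerate(SNV_TYPES)}


def fingerprint(variants):
    """Count pairs of consecutive SNV types in a flattened 12 x 12 matrix.

    `variants` is an iterable of (position, ref, alt) tuples on one sequence.
    """
    ordered = sorted(variants)
    size = len(SNV_TYPES)
    vector = [0.0] * (size * size)
    for (_, ref1, alt1), (_, ref2, alt2) in zip(ordered, ordered[1:]):
        vector[SNV_INDEX[ref1 + alt1] * size + SNV_INDEX[ref2 + alt2]] += 1.0
    return vector


def pearson_correlation(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return covariance / sqrt(var_x * var_y)
```

Two fingerprints built this way can be compared with `pearson_correlation` without exchanging the underlying variant lists, which is the privacy-preserving property the text highlights; the same pattern of mapping a record to a fixed-length numerical vector is what the clinical fingerprints apply to clinical records.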

CONCLUSION

In this article, we described various types of clinical data sets and associated inferential limitations and regulatory constraints, focusing primarily on compliance requirements mandated in the United States under HIPAA. We highlighted several novel approaches that we have developed as part of the Translator program to openly expose observational patient data, while respecting and protecting patient privacy. We recognize that each of these approaches retains a residual risk of patient reidentification; thus, we continue to work with experts in regulatory protections and computer security to ensure that those risks remain minimal. Although the Translator approaches are designed to be disease-agnostic and generalizable, they were developed to comply with HIPAA and institutional guidelines; as such, our approaches may need to be modified prior to adoption elsewhere. Nonetheless, through these open services, we hope to accelerate clinical and translational science and foster biomedical discovery.

Supporting Information. Supplementary information accompanies this paper on the Clinical and Translational Science website (www.cts-journal.com). The Further Reading includes supplementary information on Clinical Profiles, Synthetic Clinical Datasets, COHD, and ICEES, as well as relevant regulatory information and information on related large-scale patient de-identification and data-sharing efforts.

Clinical Data: Sources and Types, Regulatory Constraints, Applications.

Acknowledgments. The authors acknowledge and appreciate the contributions provided by the following individuals: Chris Bizon, Steve Cox, Ashok Krishnamurthy, Lisa Stillwell, and Hao Xu of the University of North Carolina Renaissance Computing Institute; James Champion of the North Carolina Translational and Clinical Sciences Institute; David B. Peden of the University of North Carolina School of Medicine; Sarav Arunachalam of the University of North Carolina Institute for the Environment; Max Robinson of the Institute for Systems Biology; and Stefano Rensi of Stanford University.

Funding. Support for this project was provided by the National Center for Advancing Translational Sciences, National Institutes of Health through the Biomedical Data Translator program (awards 1OT3TR002019, 1OT3TR002020, 1OT3TR002025, 1OT3TR002026, 1OT3TR002027, 1OT2TR002514, 1OT2TR002515, 1OT2TR002517, 1OT2TR002520, 1OT2TR002584) and the Clinical and Translational Sciences Award program (award UL1TR002489).

Conflict of Interest. All authors declared no competing interests for this work.

1. Harman, L.B., Flite, C.A. & Bond, K. Electronic health records: privacy, confidentiality, and security. Virtual Mentor 14, 712–719 (2012).

2. Na, L., Yang, C., Lo, C.C., Zhao, F., Fukuoka, Y. & Aswani, A. Feasibility of reidentifying individuals in large national physical activity data sets from which protected health information has been removed with use of machine learning. JAMA Network Open 1, e186040 (2018).

3. The Biomedical Data Translator Consortium. The Biomedical Data Translator program: conception, culture, and community. Clin. Transl. Sci. 12, 92–94 (2019). https://doi.org/10.1111/cts.12592.

4. The Biomedical Data Translator Consortium. Toward a universal biomedical data translator. Clin. Transl. Sci. 12, 86–90 (2019). https://doi.org/10.1111/cts.12591.

5. Walonoski, J. et al. Synthea: an approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. J. Am. Med. Inform. Assoc. 25, 230–238 (2018).

6. Ta, C., Dumontier, M., Hripcsak, G., Tatonetti, N. & Weng, C. Columbia Open Health Data, clinical concept prevalence and co-occurrence from electronic health records. Sci. Data 5, 180273 (2018).

7. Fecho, K. et al. A novel approach for exposing and sharing clinical data: the Translator Integrated Clinical and Environmental Exposures Service. J. Am. Med. Inform. Assoc. (in press). https://doi.org/10.1093/jamia/ocz042

8. Pfaff, E.R. et al. All roads lead to FHIR: an extensible clinical data conversion pipeline. American Medical Informatics Association 2019 Informatics Summit, San Francisco, CA, March 25–28, 2019. Abstract.

9. Mirabelli, M.C., Vaidyanathan, A., Flanders, W.D., Qin, X. & Garbe, P. Outdoor PM2.5, ambient air temperature, and asthma symptoms in the past 14 days among adults with active asthma. Environ. Health Perspect. 124, 1882–1890 (2016).

10. Glusman, G., Mauldin, D.E., Hood, L.E. & Robinson, M. Ultrafast comparison of personal genomes via precomputed genome fingerprints. Front. Genet. 8, 136 (2017).

published by Wiley Periodicals, Inc. on behalf of the American Society for Clinical Pharmacology and Therapeutics. This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.

Available via license: CC BY-NC

Content may be subject to copyright.

Open Application of Statistical and Machine Learning Models to Explore the Impact of Environmental Exposures on Health and Disease: An Asthma Use Case

Article

Full-text available

Oct 2021
Int J Environ Res Publ Health

ICEES (Integrated Clinical and Environmental Exposures Service) provides a disease-agnostic, regulatory-compliant approach for openly exposing and analyzing clinical data that have been integrated at the patient level with environmental exposures data. ICEES is equipped with basic features to support exploratory analysis using statistical approaches, such as bivariate chi-square tests. We recently developed a method for using ICEES to generate multivariate tables for subsequent application of machine learning and statistical models. The objective of the present study was to use this approach to identify predictors of asthma exacerbations through the application of three multivariate methods: conditional random forest, conditional tree, and generalized linear model. Among seven potential predictor variables, we found five to be of significant importance using both conditional random forest and conditional tree: prednisone, race, airborne particulate exposure, obesity, and sex. The conditional tree method additionally identified several significant two-way and three-way interactions among the same variables. When we applied a generalized linear model, we identified four significant predictor variables, namely prednisone, race, airborne particulate exposure, and obesity. When ranked in order by effect size, the results were in agreement with the results from the conditional random forest and conditional tree methods as well as the published literature. Our results suggest that the open multivariate analytic capabilities provided by ICEES are valid in the context of an asthma use case and likely will have broad value in advancing open research in environmental and public health.

Unlocking the Power of Health Datasets and Registries: The Need for Urgent Institutional and National Ownership and Governance Regulations for Research Advancement

Preprint

Full-text available

Jun 2023

Ahmed S. Bahammam

Health datasets have immense potential to drive research advancements and improve healthcare outcomes. However, realizing this potential requires careful consideration of governance and ownership frameworks. This article explores the importance of nurturing governance and ownership models that facilitate responsible and ethical use of health datasets for research purposes. We highlight the importance of adopting governance and ownership models that enable responsible and ethical utilization of health datasets and clinical data registries for research purposes. The article addresses the important local and international regulations related to the utilization of health data/medical records in research, and emphasizes the urgent need for developing clear institutional and national guidelines on data access, sharing, and utilization, ensuring transparency, privacy, and data protection. By establishing robust governance structures and fostering ownership among stakeholders, collaboration, innovation, and equitable access to health data can be promoted, ultimately unlocking its full power for transformative research and improving global health outcomes.

Unlocking the Power of Health Datasets and Registries: The Need for Urgent Institutional and National Ownership and Governance Regulations for Research Advancement

Article

Full-text available

Jul 2023

Ahmed S. Bahammam

Progress Toward a Universal Biomedical Data Translator

Article

Full-text available

Jun 2022
CTS-CLIN TRANSL SCI

Clinical, biomedical, and translational science has reached an inflection point in the breadth and diversity of available data and the potential impact of such data to improve human health and well‐being. However, the data are often siloed, disorganized, and not broadly accessible due to discipline‐specific differences in terminology and representation. To address these challenges, the Biomedical Data Translator Consortium has developed and tested a pilot knowledge graph–based ‘Translator’ system capable of integrating existing biomedical data sets and ‘translating’ those data into insights intended to augment human reasoning and accelerate translational science. Having demonstrated feasibility of the Translator system, the Translator program has since moved into development, and the Consortium has made significant progress in the research, design, and implementation of an operational system. Herein, we describe the current system’s architecture, performance, and quality of results. We apply Translator to several real‐world use cases developed in collaboration with subject‐matter experts. Finally, we discuss the scientific and technical features of Translator and compare those features to other state‐of‐the‐art biomedical graph‐based question‐answering systems.

Development and Application of an Open Tool for Sharing and Analyzing Integrated Clinical and Environmental Exposures Data: an Asthma Use Case (Preprint)

Article

Jul 2021

Background: The Integrated Clinical and Environmental Exposures Service (ICEES) serves as an open-source, disease-agnostic, regulatory-compliant framework and approach for openly exposing and exploring clinical data that have been integrated at the patient level with a variety of environmental exposures data. ICEES is equipped with tools to support basic statistical exploration of the integrated data in a completely open manner. Objective: This study aims to further develop and apply ICEES as a novel tool for openly exposing and exploring integrated clinical and environmental data. We focus on an asthma use case. Methods: We queried the ICEES open application programming interface (OpenAPI) using a functionality that supports chi-square tests between feature variables and a primary outcome measure, with a Bonferroni correction for multiple comparisons (α=.001). We focused on 2 primary outcomes that are indicative of asthma exacerbations: annual emergency department (ED) or inpatient visits for respiratory issues; and annual prescriptions for prednisone. Results: Of the 157,410 patients within the asthma cohort, 26,332 (16.73%) had 1 or more annual ED or inpatient visits for respiratory issues, and 17,056 (10.84%) had 1 or more annual prescriptions for prednisone. We found that close proximity to a major roadway or highway, exposure to high levels of particulate matter ≤2.5 μm (PM2.5) or ozone, female sex, Caucasian race, low residential density, lack of health insurance, and low household income were significantly associated with asthma exacerbations (P
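The statistical step named in this abstract, a chi-square test between a feature variable and an outcome followed by a Bonferroni correction, can be sketched as follows. This is a minimal illustration, not the ICEES implementation: the function names, the 2x2 counts, and the family-wise alpha of 0.05 are all assumptions made for the example.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

def bonferroni_significant(p_values, alpha=0.05):
    """Compare each p-value with alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Hypothetical counts: rows = feature level, columns = (outcome present, outcome absent).
tables = [
    [[120, 880], [80, 920]],   # feature associated with the outcome
    [[100, 900], [100, 900]],  # feature unrelated to the outcome
]
p_values = [chi_square_2x2(t)[1] for t in tables]
print(bonferroni_significant(p_values))  # [True, False]
```

The closed-form p-value holds only for 2x2 tables; larger contingency tables would need a general chi-square distribution function.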

Recent Developments in Privacy-Preserving Mining of Clinical Data

Article

Nov 2021

With the dramatic improvements in both the capability to collect personal data and the capability to analyze large amounts of data, increasingly sophisticated and personal insights are being drawn. These insights are valuable for clinical applications but also open up possibilities for identification and abuse of personal information. In this article, we survey recent research on classical methods of privacy-preserving data mining. Looking at dominant techniques and recent innovations to them, we examine the applicability of these methods to the privacy-preserving analysis of clinical data. We also discuss promising directions for future research in this area.

Practices, norms, and aspirations regarding the construction, validation, and reuse of code sets in the analysis of real-world data

Preprint

Oct 2021

Objective: Code sets play a central role in analytic work with clinical data warehouses, as components of phenotype, cohort, or analytic variable algorithms representing specific clinical phenomena. Code set quality has received critical attention and repositories for sharing and reusing code sets have been seen as a way to improve quality and reduce redundant effort. Nonetheless, concerns regarding code set quality persist. In order to better understand ongoing challenges in code set quality and reuse, and address them with software and infrastructure recommendations, we determined it was necessary to learn how code sets are constructed and validated in real-world settings. Methods: Survey and field study using semi-structured interviews of a purposive sample of code set practitioners. Open coding and thematic analysis on interview transcripts, interview notes, and answers to open-ended survey questions. Results: Thirty-six respondents completed the survey, of whom 15 participated in follow-up interviews. We found great variability in the methods, degree of formality, tools, expertise, and data used in code set construction and validation. We found universal agreement that crafting high-quality code sets is difficult, but very different ideas about how this can be achieved and validated. A primary divide exists between those who rely on empirical techniques using patient-level data and those who only rely on expertise and semantic data. We formulated a method- and process-based model able to account for observed variability in formality, thoroughness, resources, and techniques. Conclusion: Our model provides a structure for organizing a set of recommendations to facilitate reuse based on metadata capture during the code set development process. 
It classifies validation methods by the data they depend on — semantic, empirical, and derived — as they are applied over a sequence of phases: (1) code collection; (2) code evaluation; (3) code set evaluation; (4) code set acceptance; and, optionally, (5) reporting of methods used and validation results. This schematization of real-world practices informs our analysis of and response to persistent challenges in code set development. Potential re-users of existing code sets can find little evidence to support trust in their quality and fitness for use, particularly when reusing a code set in a new study or database context. Rather than allowing code set sharing and reuse to remain separate activities, occurring before and after the main action of code set development, sharing and reuse must permeate every step of the process in order to produce reliable evidence of quality and fitness for use.

Digital Health Technologies for Medical Devices – Real World Evidence Collection – Challenges and Solutions Towards Clinical Evidence

Article

Aug 2022

The need for sufficient clinical evidence and the collection of real-world evidence (RWE) is at the forefront of medical device and drug regulations; however, the collection of clinical data can be a time-consuming and costly process. The advancement of Digital Health Technologies (DHTs) is transforming the way health data can be collected, analysed, and shared, presenting an opportunity for the implementation of DHTs in clinical research to aid with obtaining clinical evidence, particularly RWE. DHTs can provide a more efficient and timely way of collecting numerous types of clinical data (e.g., physiological and behavioural data) and can be beneficial with regard to participant recruitment, data management, and cost reduction. Recent guidelines and regulations on the use of RWE within regulatory decision-making processes open the door for the wider implementation of DHTs. However, challenges and concerns remain regarding the use of DHTs (such as data security and privacy). Nevertheless, the implementation of DHTs in clinical research presents a promising opportunity for providing meaningful and patient-centred data to aid with regulatory decisions.

Enabling Longitudinal Exploratory Analysis of Clinical COVID Data

Conference Paper

Oct 2021

An approach for open multivariate analysis of integrated clinical and environmental exposures data

Article

Sep 2021

The Integrated Clinical and Environmental Exposures Service (ICEES) provides regulatory-compliant open access to sensitive patient data that have been integrated with public exposures data. ICEES was designed initially to support dynamic cohort creation and bivariate contingency tests. The objective of the present study was to develop an open approach to support multivariate analyses using existing ICEES functionalities and abiding by all regulatory constraints. We first developed an open approach for generating a multivariate table that maintains contingencies between clinical and environmental variables using programmatic calls to the open ICEES application programming interface. We then applied the approach to data on a large cohort (N = 22,365) of patients with asthma or related conditions and generated an eight-feature table. Due to regulatory constraints, data loss was incurred with the incorporation of each successive feature variable, from a starting sample size of N = 22,365 to a final sample size of N = 4,556 (20.4%), but data loss was < 10% until the addition of the final two feature variables. We then applied a generalized linear model to the subsequent dataset and focused on the impact of seven select feature variables on asthma exacerbations, defined as annual emergency department or inpatient visits for respiratory issues. We identified five feature variables—sex, race, obesity, prednisone, and airborne particulate exposure—as significant predictors of asthma exacerbations. We discuss the advantages and disadvantages of ICEES open multivariate analysis and conclude that, despite limitations, ICEES can provide a valuable resource for open multivariate analysis and can serve as an exemplar for regulatory-compliant informatic solutions to open patient data, with capabilities to explore the impact of environmental exposures on health outcomes.

Feasibility of Reidentifying Individuals in Large National Physical Activity Data Sets From Which Protected Health Information Has Been Removed With Use of Machine Learning

Article

Dec 2018

Importance Despite data aggregation and removal of protected health information, there is concern that deidentified physical activity (PA) data collected from wearable devices can be reidentified. Organizations collecting or distributing such data suggest that the aforementioned measures are sufficient to ensure privacy. However, no studies, to our knowledge, have been published that demonstrate the possibility or impossibility of reidentifying such activity data. Objective To evaluate the feasibility of reidentifying accelerometer-measured PA data, which have had geographic and protected health information removed, using support vector machines (SVMs) and random forest methods from machine learning. Design, Setting, and Participants In this cross-sectional study, the National Health and Nutrition Examination Survey (NHANES) 2003-2004 and 2005-2006 data sets were analyzed in 2018. The accelerometer-measured PA data were collected in a free-living setting for 7 continuous days. NHANES uses a multistage probability sampling design to select a sample that is representative of the civilian noninstitutionalized household (both adult and children) population of the United States. Exposures The NHANES data sets contain objectively measured movement intensity as recorded by accelerometers worn during all walking for 1 week. Main Outcomes and Measures The primary outcome was the ability of the random forest and linear SVM algorithms to match demographic and 20-minute aggregated PA data to individual-specific record numbers, and the percentage of correct matches by each machine learning algorithm was the measure. Results A total of 4720 adults (mean [SD] age, 40.0 [20.6] years) and 2427 children (mean [SD] age, 12.3 [3.4] years) in NHANES 2003-2004 and 4765 adults (mean [SD] age, 45.2 [19.9] years) and 2539 children (mean [SD] age, 12.1 [3.4] years) in NHANES 2005-2006 were included in the study. 
The random forest algorithm successfully reidentified the demographic and 20-minute aggregated PA data of 4478 adults (94.9%) and 2120 children (87.4%) in NHANES 2003-2004 and 4470 adults (93.8%) and 2172 children (85.5%) in NHANES 2005-2006 (P < .001 for all). The linear SVM algorithm successfully reidentified the demographic and 20-minute aggregated PA data of 4043 adults (85.6%) and 1695 children (69.8%) in NHANES 2003-2004 and 4041 adults (84.8%) and 1705 children (67.2%) in NHANES 2005-2006 (P < .001 for all). Conclusions and Relevance This study suggests that current practices for deidentification of accelerometer-measured PA data might be insufficient to ensure privacy. This finding has important policy implications because it appears to show the need for deidentification that aggregates the PA data of multiple individuals to ensure privacy for single individuals.

Columbia Open Health Data, clinical concept prevalence and co-occurrence from electronic health records

Article

Nov 2018

Columbia Open Health Data (COHD) is a publicly accessible database of electronic health record (EHR) prevalence and co-occurrence frequencies between conditions, drugs, procedures, and demographics. COHD was derived from Columbia University Irving Medical Center’s Observational Health Data Sciences and Informatics (OHDSI) database. The lifetime dataset, derived from all records, contains 36,578 single concepts (11,952 conditions, 12,334 drugs, and 10,816 procedures) and 32,788,901 concept pairs from 5,364,781 patients. The 5-year dataset, derived from records from 2013–2017, contains 29,964 single concepts (10,159 conditions, 10,264 drugs, and 8,270 procedures) and 15,927,195 concept pairs from 1,790,431 patients. Exclusion of rare concepts (count ≤ 10) and Poisson randomization enable data sharing by eliminating risks to patient privacy. EHR prevalences are informative of healthcare consumption rates. Analysis of co-occurrence frequencies via relative frequency analysis and observed-expected frequency ratio are informative of associations between clinical concepts, useful for biomedical research tasks such as drug repurposing and pharmacovigilance. COHD is publicly accessible through a web application-programming interface (API) and downloadable from the Figshare repository. The code is available on GitHub.
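The privacy and association steps described in this abstract can be sketched roughly as below. This is an illustrative sketch only, not COHD's code. The order of operations (suppress first, then perturb), the expected count under independence (count_a × count_b / N), and all names and counts are assumptions made for the example.

```python
import math
import random

MIN_COUNT = 10  # counts at or below this value are excluded, per the abstract

def poisson_sample(lam, rng):
    """Draw from Poisson(lam) using Knuth's multiplication method (suitable for modest lam)."""
    limit = math.exp(-lam)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k

def release_counts(true_counts, rng):
    """Drop rare concepts, then replace each remaining count with a Poisson draw around it."""
    return {
        concept: poisson_sample(count, rng)
        for concept, count in true_counts.items()
        if count > MIN_COUNT
    }

def relative_frequency(pair_count, base_count):
    """Share of patients with the base concept who also have the paired concept."""
    return pair_count / base_count

def observed_expected_ratio(pair_count, count_a, count_b, n_patients):
    """Observed co-occurrence divided by the count expected if the concepts were independent."""
    expected = count_a * count_b / n_patients
    return pair_count / expected

# Hypothetical concept counts; "rare condition" falls under the suppression threshold.
rng = random.Random(0)
released = release_counts({"asthma": 400, "albuterol": 300, "rare condition": 7}, rng)
print(sorted(released))                                # ['albuterol', 'asthma']
print(relative_frequency(150, 400))                    # 0.375
print(observed_expected_ratio(150, 400, 300, 10_000))  # 12.5
```

A ratio above 1 indicates that two concepts co-occur more often than independence would predict; taking its logarithm makes over- and under-representation symmetric around zero.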

The Biomedical Data Translator Program: Conception, Culture, and Community: The Biomedical Data Translator Consortium

Article

Nov 2018

Toward A Universal Biomedical Data Translator

Article

Nov 2018

Ultrafast Comparison of Personal Genomes via Precomputed Genome Fingerprints

Article

Sep 2017

We present an ultrafast method for comparing personal genomes. We transform the standard genome representation (lists of variants relative to a reference) into “genome fingerprints” via locality sensitive hashing. The resulting genome fingerprints can be meaningfully compared even when the input data were obtained using different sequencing technologies, processed using different pipelines, represented in different data formats and relative to different reference versions. Furthermore, genome fingerprints are robust to up to 30% missing data. Because of their reduced size, computation on the genome fingerprints is fast and requires little memory. For example, we could compute all-against-all pairwise comparisons among the 2504 genomes in the 1000 Genomes data set in 67 s at high quality (21 μs per comparison, on a single processor), and achieved a lower quality approximation in just 11 s. Efficient computation enables scaling up a variety of important genome analyses, including quantifying relatedness, recognizing duplicative sequenced genomes in a set, population reconstruction, and many others. The original genome representation cannot be reconstructed from its fingerprint, effectively decoupling genome comparison from genome interpretation; the method thus has significant implications for privacy-preserving genome analytics.
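The timing figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch assumes that "all-against-all" means every unordered pair of distinct genomes is compared once; the function names are illustrative.

```python
def pairwise_comparisons(n_items):
    """Number of unordered pairs among n_items distinct items."""
    return n_items * (n_items - 1) // 2

def microseconds_per_comparison(total_seconds, n_items):
    """Average time per pairwise comparison, in microseconds."""
    return total_seconds / pairwise_comparisons(n_items) * 1e6

n_genomes = 2504
print(pairwise_comparisons(n_genomes))                           # 3133756
print(round(microseconds_per_comparison(67, n_genomes), 1))      # 21.4
print(round(microseconds_per_comparison(11, n_genomes), 1))      # 3.5
```

Under that assumption, 67 s over roughly 3.1 million pairs works out to about 21 μs per comparison, matching the figure reported in the abstract.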

Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record

Article

Sep 2017

Objective: Our objective is to create a source of synthetic electronic health records that is readily available; suited to industrial, innovation, research, and educational uses; and free of legal, privacy, security, and intellectual property restrictions. Materials and Methods: We developed Synthea, an open-source software package that simulates the lifespans of synthetic patients, modeling the 10 most frequent reasons for primary care encounters and the 10 chronic conditions with the highest morbidity in the United States. Results: Synthea adheres to a previously developed conceptual framework, scales via open-source deployment on the Internet, and may be extended with additional disease and treatment modules developed by its user community. One million synthetic patient records are now freely available online, encoded in standard formats (eg, Health Level-7 [HL7] Fast Healthcare Interoperability Resources [FHIR] and Consolidated-Clinical Document Architecture), and accessible through an HL7 FHIR application program interface. Discussion: Health care lags other industries in information technology, data exchange, and interoperability. The lack of freely distributable health records has long hindered innovation in health care. Approaches and tools are available to inexpensively generate synthetic health records at scale without accidental disclosure risk, lowering current barriers to entry for promising early-stage developments. By engaging a growing community of users, the synthetic data generated will become increasingly comprehensive, detailed, and realistic over time. Conclusion: Synthetic patients can be simulated with models of disease progression and corresponding standards of care to produce risk-free realistic synthetic health care records at scale.

Outdoor PM2.5, Ambient Air Temperature, and Asthma Symptoms in the Past 14 Days among Adults with Active Asthma

Article

Jul 2016

Background: Relationships between air quality and health are well-described, but little information is available about the joint associations between particulate air pollution, ambient temperature, and respiratory morbidity. Objectives: To evaluate associations between concentrations of particulate matter ≤2.5 microns in diameter (PM2.5) and exacerbation of existing asthma and modification of the associations by ambient air temperature. Methods: Data from 50,356 adult 2006-2010 Asthma Call-back Survey respondents were linked by interview date and county of residence to estimates of daily averages of PM2.5 and maximum air temperature. Associations between 14-day average PM2.5 and the presence of any asthma symptoms during the 14 days leading up to and including the interview date were evaluated using binomial regression. We explored variation by air temperature using similar models, stratified into quintiles of the 14-day average maximum temperature. Results: Among adults with active asthma, 57.1% reported asthma symptoms within the past 14 days and 14-day average PM2.5 ≥7.07 µg·m⁻³ was associated with an estimated 4 to 5% higher asthma symptom prevalence. In the range of 4.00 to 7.06 µg·m⁻³ of PM2.5, each µg·m⁻³ increase was associated with a 3.4% (95% confidence interval: 1.1, 5.7) increase in symptom prevalence; across categories of temperature from 1.1 to 80.5°F, each µg·m⁻³ increase was associated with increased symptom prevalence (1.1-44.4°F: 7.9%; 44.5-58.6°F: 6.9%; 58.7-70.1°F: 2.9%; 70.2-80.5°F: 7.3%). Conclusions: These results suggest that each unit increase in PM2.5 may be associated with an increase in the prevalence of asthma symptoms, even at levels as low as 4.00 to 7.06 µg·m⁻³.

A novel approach for exposing and sharing clinical data: the Translator Integrated Clinical and Environmental Exposures Service

Article

Apr 2019
J AM MED INFORM ASSN

Objective: This study aimed to develop a novel, regulatory-compliant approach for openly exposing integrated clinical and environmental exposures data: the Integrated Clinical and Environmental Exposures Service (ICEES). Materials and methods: The driving clinical use case for research and development of ICEES was asthma, which is a common disease influenced by hundreds of genes and a plethora of environmental exposures, including exposures to airborne pollutants. We developed a pipeline for integrating clinical data on patients with asthma-like conditions with data on environmental exposures derived from multiple public data sources. The data were integrated at the patient and visit level and used to create de-identified, binned, "integrated feature tables," which were then placed behind an OpenAPI. Results: Our preliminary evaluation results demonstrate a relationship between exposure to high levels of particulate matter ≤2.5 µm in diameter (PM2.5) and the frequency of emergency department or inpatient visits for respiratory issues. For example, 16.73% of patients with average daily exposure to PM2.5 >9.62 µg/m³ experienced 2 or more emergency department or inpatient visits for respiratory issues in 2010 compared with 7.93% of patients with lower exposures (n = 23 093). Discussion: The results validated our overall approach for openly exposing and sharing integrated clinical and environmental exposures data. We plan to iteratively refine and expand ICEES by including additional years of data, feature variables, and disease cohorts. Conclusions: We believe that ICEES will serve as a regulatory-compliant model and approach for promoting open access to and sharing of integrated clinical and environmental exposures data.
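The idea of a binned feature table, and the kind of percentage comparison reported in this abstract, can be sketched as follows. This is a simplified illustration rather than the ICEES pipeline: the two cut points come from the abstract, while the function names, bin labels, and patient records are hypothetical.

```python
from collections import Counter

PM25_CUT = 9.62  # µg/m³, the exposure threshold quoted in the abstract
VISIT_CUT = 2    # "2 or more" visits, as in the abstract

def bin_record(avg_daily_pm25, total_visits):
    """Replace raw values with coarse bins so that no exact measurement is exposed."""
    exposure_bin = "high" if avg_daily_pm25 > PM25_CUT else "low"
    visit_bin = ">=2" if total_visits >= VISIT_CUT else "<2"
    return exposure_bin, visit_bin

def feature_table(records):
    """Count patients in each (exposure bin, visit bin) cell."""
    return Counter(bin_record(pm25, visits) for pm25, visits in records)

def percent_with_frequent_visits(table, exposure_bin):
    """Percentage of patients in an exposure bin who fall in the '>=2' visit bin."""
    frequent = table[(exposure_bin, ">=2")]
    total = frequent + table[(exposure_bin, "<2")]
    return 100 * frequent / total

# Hypothetical (average daily PM2.5, annual ED or inpatient visits) records.
records = [(12.1, 3), (10.4, 0), (11.0, 1), (9.9, 2),
           (5.2, 0), (8.8, 1), (7.5, 0), (9.62, 4)]
table = feature_table(records)
print(percent_with_frequent_visits(table, "high"))  # 50.0
print(percent_with_frequent_visits(table, "low"))   # 25.0
```

Only the cell counts would sit behind an open interface in a design like this; the patient-level rows stay private.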

Electronic Health Records: Privacy, Confidentiality, and Security

Article

Sep 2012

All roads lead to FHIR: an extensible clinical data conversion pipeline

E R Pfaff

Pfaff, E.R. et al. All roads lead to FHIR: an extensible clinical data conversion pipeline. American Medical Informatics Association 2019 Informatics Summit, San Francisco, CA, March 25-28, 2019. Abstract.
