Figure: Patient identifiers defined in the HIPAA "Safe Harbor" legislation.

Source publication
Article
Full-text available
In the United States, the Health Insurance Portability and Accountability Act (HIPAA) protects the confidentiality of patient data and requires the informed consent of the patient and approval of the Institutional Review Board to use data for research purposes, but these requirements can be waived if data is de-identified. For clinical data to be consid...

Context in source publication

Context 1
... laws typically require the informed consent of the patient and approval of the Institutional Review Board (IRB) to use data for research purposes, but these requirements are waived if data is de-identified, or if patient consent is not possible (e.g., data mining of retrospective records). For clinical data to be considered de-identified, the HIPAA "Safe Harbor" technique requires 18 data elements (called PHI: Protected Health Information) to be removed, as shown in Figure 1 [3]. ...
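To make the Safe Harbor rule concrete, here is a minimal, illustrative sketch (not the method of the source publication): it lists the 18 identifier categories and applies toy regular expressions for a few machine-recognizable ones. The patterns are simplified assumptions; free-text identifiers such as names require the NLP approaches surveyed in this literature.

```python
import re

# The 18 HIPAA Safe Harbor identifier categories (paraphrased from 45 CFR 164.514(b)(2)).
SAFE_HARBOR_CATEGORIES = [
    "names", "geographic subdivisions smaller than a state",
    "dates (except year) and ages over 89", "telephone numbers", "fax numbers",
    "email addresses", "social security numbers", "medical record numbers",
    "health plan beneficiary numbers", "account numbers",
    "certificate/license numbers", "vehicle identifiers and license plates",
    "device identifiers and serial numbers", "URLs", "IP addresses",
    "biometric identifiers", "full-face photographs and comparable images",
    "any other unique identifying number, characteristic, or code",
]

# Toy patterns for a few machine-recognizable categories (illustrative, not exhaustive).
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def scrub(text: str) -> str:
    """Replace each matched identifier with a bracketed category tag."""
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text

if __name__ == "__main__":
    note = "Seen 03/14/2019; call 555-123-4567 or jdoe@example.com; SSN 123-45-6789."
    print(scrub(note))  # Seen [DATE]; call [PHONE] or [EMAIL]; SSN [SSN].
```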

Similar publications

Article
Full-text available
Healthcare organizations must de-identify patient records before sharing data. Many organizations rely on the Safe Harbor Standard of the HIPAA Privacy Rule, which enumerates 18 identifiers that must be suppressed (eg, ages over 89). An alternative model in the Privacy Rule, known as the Statistical Standard, can facilitate the sharing of more deta...
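As a minimal illustration of one of the enumerated Safe Harbor suppressions (ages over 89), a sketch rather than the Statistical Standard the article advocates:

```python
def generalize_age(age: int) -> str:
    # Safe Harbor: ages over 89 are collapsed into a single "90 or older" bucket.
    return "90+" if age > 89 else str(age)

assert generalize_age(47) == "47"
assert generalize_age(93) == "90+"
```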
Article
Full-text available
Background: Text-based patient medical records are a vital resource in medical research. In order to preserve patient confidentiality, however, the U.S. Health Insurance Portability and Accountability Act (HIPAA) requires that protected health information (PHI) be removed from medical records before they can be disseminated. Manual de-identificati...
Article
Full-text available
Maintaining data security and privacy in an era of cybersecurity is a challenge. The enormous and rapidly growing amount of health-related data available today raises numerous questions about data collection, storage, analysis, comparability and interoperability but also about data protection. The US Health Insurance Portability and Accountability Act (HIPAA...

Citations

... [1]), we gave the redactors instructions as to what types of names to retain (names of famous people and authors) and what types of names to redact (names of instructors and students), but asked them to use their judgment in selecting which was which. In many past papers in other fields (see review in [8]), human redactors are given a list of known student names for redaction, but doing so here would have biased the comparison in favor of human redactors by giving them information the LLM did not have. In total, human coders redacted 2134 words, with 1282 posts containing at least one redaction (36.6% of the total posts). ...
Conference Paper
Full-text available
Education is increasingly taking place in learning environments mediated by technology. This transition has made it easier to collect student-generated data, including comments in discussion forums and chats. Although these data are extremely valuable to researchers, they often contain sensitive information like names, locations, social media links, and other personally identifying information (PII) that must be carefully redacted before the data can be used for research, in order to protect student privacy. Historically, this task of redacting PII has been painstakingly conducted by humans; more recently, some researchers have attempted to use regular expressions and supervised machine-learning methods. Nowadays, with the recent high performance shown by Large Language Models in a wide range of tasks, they have become another alternative to be explored for de-identifying educational data. In this work, we assess GPT-4's performance in de-identifying data from discussion forums in 9 Massive Open Online Courses. Our results show an average recall of 0.958 for identifying PII that needs to be redacted, suggesting that it is an appropriate tool for this purpose. Our tool is also successful at identifying cases missed by humans when redacting data. These findings indicate that GPT-4 can not only increase the efficiency but also enhance the quality of the redaction process. However, the precision of such redaction is considerably worse (0.526), over-redacting names and locations that do not represent PII, showing a need for further improvement.
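The recall (0.958) and precision (0.526) reported above follow the standard definitions; the sketch below computes them for hypothetical span-level redaction output, where over-redaction lowers precision and missed PII lowers recall.

```python
def precision_recall(gold: set, predicted: set) -> tuple[float, float]:
    """Span-level precision and recall for a redaction system."""
    tp = len(gold & predicted)   # correctly redacted spans
    fp = len(predicted - gold)   # over-redactions (hurt precision)
    fn = len(gold - predicted)   # missed PII (hurts recall)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical spans: (post id, start offset, end offset).
gold = {("post1", 3, 5), ("post2", 0, 2)}
pred = {("post1", 3, 5), ("post2", 7, 9), ("post3", 1, 4)}
p, r = precision_recall(gold, pred)
print(f"precision={p:.3f} recall={r:.3f}")  # precision=0.333 recall=0.500
```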
... Stringent regulations like the General Data Protection Regulation (GDPR) [3] in Europe mandate anonymization of personal information in many contexts [4]. However, applying existing de-identification techniques to public administration introduces unique challenges due to the vast scale and heterogeneity of data involved [5]. ...
... It outlines where the data came from, how it was collected, and how it is structured for the experiments. 5. Pre-training on Masked Language Task: This section covers the initial phase of the machine learning pipeline. ...
Preprint
Full-text available
Recent advances in Natural Language Processing have demonstrated the effectiveness of pretrained language models like BERT for a variety of downstream tasks. We present GiusBERTo, the first BERT-based model specialized for anonymizing personal data in Italian legal documents. GiusBERTo is trained on a large dataset of Court of Auditors decisions to recognize entities to anonymize, including names, dates, and locations, while retaining contextual relevance. We evaluate GiusBERTo on a held-out test set and achieve 97% token-level accuracy. GiusBERTo provides the Italian legal community with an accurate and tailored BERT model for de-identification, balancing privacy and data protection.
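In practice, a token-classification model of this kind is applied roughly as sketched below. The public English NER checkpoint used here is only a stand-in (GiusBERTo's public availability is not assumed); replacing entity spans right-to-left keeps character offsets valid.

```python
from transformers import pipeline

# Stand-in public NER model; substitute your own fine-tuned de-identification model.
ner = pipeline("token-classification", model="dslim/bert-base-NER",
               aggregation_strategy="simple")

def anonymize(text: str, labels=("PER", "LOC")) -> str:
    """Replace recognized entity spans with their label tag."""
    for ent in sorted(ner(text), key=lambda e: e["start"], reverse=True):
        if ent["entity_group"] in labels:
            text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

print(anonymize("Mario Rossi appeared before the court in Rome."))
# -> [PER] appeared before the court in [LOC].
```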
... Anonymizing patient data before using it for research is therefore the gold standard for preserving privacy in research, but automated anonymization of medical documents such as clinical letters is not trivial [14,15]. Medical documents exist in a plethora of formats that vary between hospitals and even departments, making it challenging to find a universal solution for anonymization. Current approaches often rely on time-consuming, expensive and imprecise manual work [16] or Named Entity Recognition (NER) keyword search requiring costly software. ...
Preprint
Full-text available
Background: Medical research with real-world clinical data can be challenging due to privacy requirements. Ideally, patient data are handled in a fully pseudonymised or anonymised way. However, this can make it difficult for medical researchers to access and analyze large datasets or to exchange data between hospitals. De-identifying medical free text is particularly difficult due to the diverse documentation styles and the unstructured nature of the data. However, recent advancements in natural language processing (NLP), driven by the development of large language models (LLMs), have revolutionized the ability to extract information from unstructured text.
Methods: We hypothesize that LLMs are highly effective tools for extracting patient-related information, which can subsequently be used to de-identify medical reports. To test this hypothesis, we conduct a benchmark study using eight locally deployable LLMs (Llama-3 8B, Llama-3 70B, Llama-2 7B, Llama-2 70B, Llama-2 7B "Sauerkraut", Llama-2 70B "Sauerkraut", Mistral 7B, and Phi-3-mini) to extract patient-related information from a dataset of 100 real-world clinical letters. We then remove the identified information using our newly developed LLM-Anonymizer pipeline.
Results: Our results demonstrate that the LLM-Anonymizer, when used with Llama-3 70B, achieved a success rate of 98.05% in removing text characters carrying personal identifying information. When evaluating the performance in relation to the number of characters manually identified as containing personal information and identifiable characteristics, our system missed only 1.95% of personal identifying information and erroneously redacted only 0.85% of the characters.
Conclusion: We provide our full LLM-based Anonymizer pipeline under an open source license with a user-friendly web interface that operates on local hardware and requires no programming skills. This powerful tool has the potential to significantly facilitate medical research by enabling the secure and efficient de-identification of clinical free text data on premise, thereby addressing key challenges in medical data sharing.
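The two-step design described here (extract identifiers with an LLM, then remove them from the letter) reduces to the following pattern. This is a sketch, not the published LLM-Anonymizer code: `ask_llm` is a hypothetical stub for a locally deployed model, and the code assumes the model returns a JSON list of exact surface strings.

```python
import json
import re

def ask_llm(prompt: str) -> str:
    """Hypothetical stub: route the prompt to a locally deployed LLM and return its reply."""
    raise NotImplementedError("wire this to your local inference backend")

EXTRACTION_PROMPT = (
    "Extract all personal identifiers (patient name, date of birth, address, "
    "phone number, insurance number) from the letter below. "
    "Return a JSON list of the exact strings as they appear.\n\nLetter:\n{letter}"
)

def deidentify(letter: str) -> str:
    raw = ask_llm(EXTRACTION_PROMPT.format(letter=letter))
    identifiers = json.loads(raw)  # assumes the model returned valid JSON
    # Remove longer strings first so substrings of names are not left behind.
    for ident in sorted(identifiers, key=len, reverse=True):
        letter = re.sub(re.escape(ident), "[REDACTED]", letter)
    return letter
```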
... Then, we iteratively coded the identified tuples by relying on selective coding techniques, a process to identify and refine categories at a highly generalizable degree [65]. In all 14 coding iterations, one author continuously compares, relates, and associates categories and properties and discusses the coding results with another author. We modified some tuples during the coding process in two ways. ...
[Use-case table spilled into this excerpt: DD2: Staging of cancer [48]; DD3: Objective assessment in image interpretation [48]; DD4: Genomic cancer therapy [49]; DD5: Voice analysis for Parkinson's disease [50]; DD6: Electroencephalography analysis to detect seizures [51]; DD7: Facial analysis for detection of rare disease [52]; BR1: De novo drug design [6]; BR2: Predictive biomarkers in aging for drug development [53]; BR3: De-identification of private health information [54]; BR4: Genomic splicing in research [55]; CA1: Emergency triage [56]; CA2: Predictions of mortality in the intensive care unit [57]; CA3: Operating room scheduling [34]; CA4: Automated text summarization [58]; T1: Prediction of the required insulin [8]; T2: Prediction of vasopressor medication dosage [35]; T3: Chatbots for patients [59]; IR1: Intelligent prosthesis [36]; IR2: AI-based surgery robots [60]; IR3: Workflow detection for human-robot surgery [61]]
... E7 validates that "our conviction is […] that administrational tasks generate the greatest added value and benefit for doctors and caregivers." Administrative tasks include the creation of case summaries (use case CA4) or the automated de-identification of private health information in electronic health records (use case BR2) [54]. E8 says that resource optimization enables "more time for direct contact with patients." ...
Article
Full-text available
Artificial intelligence (AI) applications pave the way for innovations in the healthcare (HC) industry. However, their adoption in HC organizations is still nascent as organizations often face a fragmented and incomplete picture of how they can capture the value of AI applications on a managerial level. To overcome adoption hurdles, HC organizations would benefit from understanding how they can capture AI applications’ potential. We conduct a comprehensive systematic literature review and 11 semi-structured expert interviews to identify, systematize, and describe 15 business objectives that translate into six value propositions of AI applications in HC. Our results demonstrate that AI applications can have several business objectives converging into risk-reduced patient care, advanced patient care, self-management, process acceleration, resource optimization, and knowledge discovery. We contribute to the literature by extending research on value creation mechanisms of AI to the HC context and guiding HC organizations in evaluating their AI applications or those of the competition on a managerial level, to assess AI investment decisions, and to align their AI application portfolio towards an overarching strategy.
... The availability of large-scale, machine-readable data enables the successful implementation of queries that could retrieve private or sensitive information and harm individuals captured in such data. Experts have in the past explored and discussed this issue using specially trained, fine-tuned models (Ahmed et al. 2021; Meystre et al. 2010). However, complete de-identification remains a challenging issue, one that has become more fraught with AI advancement and increased data access. ...
Article
Full-text available
Understanding “how to optimize the production of scientific knowledge” is paramount to those who support scientific research—funders as well as research institutions—to the communities served, and to researchers. Structured archives can help all involved to learn what decisions and processes help or hinder the production of new knowledge. Using artificial intelligence (AI) and large language models (LLMs), we recently created the first structured digital representation of the historic archives of the National Human Genome Research Institute (NHGRI), part of the National Institutes of Health. This work yielded a digital knowledge base of entities, topics, and documents that can be used to probe the inner workings of the Human Genome Project, a massive international public-private effort to sequence the human genome, and several of its offshoots like The Cancer Genome Atlas (TCGA) and the Encyclopedia of DNA Elements (ENCODE). The resulting knowledge base will be instrumental in understanding not only how the Human Genome Project and genomics research developed collaboratively, but also how scientific goals come to be formulated and evolve. Given the diverse and rich data used in this project, we evaluated the ethical implications of employing AI and LLMs to process and analyze this valuable archive. As the first computational investigation of the internal archives of a massive collaborative project with multiple funders and institutions, this study will inform future efforts to conduct similar investigations while also considering and minimizing ethical challenges. Our methodology and risk-mitigating measures could also inform future initiatives in developing standards for project planning, policymaking, enhancing transparency, and ensuring ethical utilization of artificial intelligence technologies and large language models in archive exploration.
... To reuse such records and conduct health data-related studies, the task of de-identification has become essential [4][5][6]. This is necessary to protect the confidentiality of personal data in EHRs and comply with government regulations set in our case by the French Data Protection Authority, the Commission Nationale de l'Informatique et des Libertés (CNIL). ...
... Automatic de-identification of electronic health records is generally considered a task of named entity recognition, which enables the extraction of personal information from unstructured medical text [5,21]. ...
Article
Full-text available
Background: Electronic health records (EHRs) contain valuable information for clinical research; however, the sensitive nature of healthcare data presents security and confidentiality challenges. De-identification is therefore essential to protect personal data in EHRs and comply with government regulations. Named entity recognition (NER) methods have been proposed to remove personal identifiers, with deep learning-based models achieving better performance. However, manual annotation of training data is time-consuming and expensive. The aim of this study was to develop an automatic de-identification pipeline for all kinds of clinical documents based on a distant supervised method to significantly reduce the cost of manual annotations and to facilitate the transfer of the de-identification pipeline to other clinical centers.
Methods: We proposed an automated annotation process for French clinical de-identification, exploiting data from the eHOP clinical data warehouse (CDW) of the CHU de Rennes and national knowledge bases, as well as other features. In addition, this paper proposes an assisted data annotation solution using the Prodigy annotation tool. This approach aims to reduce the cost required to create a reference corpus for the evaluation of state-of-the-art NER models. Finally, we evaluated and compared the effectiveness of different NER methods.
Results: A French de-identification dataset was developed in this work, based on EHRs provided by the eHOP CDW at Rennes University Hospital, France. The dataset was rich in terms of personal information, and the distribution of entities was quite similar in the training and test datasets. We evaluated a Bi-LSTM + CRF sequence labeling architecture, combined with Flair + FastText word embeddings, on a test set of manually annotated clinical reports. The model outperformed the other tested models with a significant F1-score of 96.96%, demonstrating the effectiveness of our automatic approach for de-identifying sensitive information.
Conclusions: This study provides an automatic de-identification pipeline for clinical notes, which can facilitate the reuse of EHRs for secondary purposes such as clinical research. Our study highlights the importance of using advanced NLP techniques for effective de-identification, as well as the need for innovative solutions such as distant supervision to overcome the challenge of limited annotated data in the medical domain.
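The Bi-LSTM + CRF with Flair + FastText embeddings evaluated above corresponds roughly to the following Flair training sketch; the corpus paths, label scheme, and hyperparameters are placeholders, not the authors' exact configuration.

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# CoNLL-style column corpus: token in column 0, BIO de-identification tag in column 1.
corpus = ColumnCorpus("data/deid", {0: "text", 1: "ner"},
                      train_file="train.txt", dev_file="dev.txt", test_file="test.txt")
tag_dictionary = corpus.make_label_dictionary(label_type="ner")

# Stacked French FastText word embeddings + contextual Flair embeddings.
embeddings = StackedEmbeddings([
    WordEmbeddings("fr"),
    FlairEmbeddings("fr-forward"),
    FlairEmbeddings("fr-backward"),
])

# Bi-LSTM sequence labeler with a CRF decoding layer.
tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,
                        tag_dictionary=tag_dictionary, tag_type="ner", use_crf=True)

ModelTrainer(tagger, corpus).train("models/deid", max_epochs=20)
```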
... [32-35] This is compounded by the challenge of externally validating NLP models, which risk leaking protected health information (PHI) when trained on clinical notes [36-40]. Objectives: In this study, we aim to develop and externally validate ML models to predict postoperative AKI using structured data and medical concepts extracted from raw clinical notes as concept unique identifiers (CUIs). We further aim to compare different approaches to modeling CUI data to determine which method demonstrates the highest discrimination and calibration. ...
Article
Full-text available
Objectives: To develop and externally validate machine learning models using structured and unstructured electronic health record data to predict postoperative acute kidney injury (AKI) across inpatient settings.
Materials and Methods: Data for adult postoperative admissions to the Loyola University Medical Center (2009-2017) were used for model development and admissions to the University of Wisconsin-Madison (2009-2020) were used for validation. Structured features included demographics, vital signs, laboratory results, and nurse-documented scores. Unstructured text from clinical notes was converted into concept unique identifiers (CUIs) using the clinical Text Analysis and Knowledge Extraction System. The primary outcome was the development of Kidney Disease Improvement Global Outcomes stage 2 AKI within 7 days after leaving the operating room. We derived unimodal extreme gradient boosting machines (XGBoost) and elastic net logistic regression (GLMNET) models using structured-only data and multimodal models combining structured data with CUI features. Model comparison was performed using the area under the receiver operating characteristic curve (AUROC), with DeLong's test for statistical differences.
Results: The study cohort included 138,389 adult patient admissions (mean [SD] age 58 [16] years; 11,506 [8%] African-American; and 70,826 [51%] female) across the 2 sites. Of those, 2,959 (2.1%) developed stage 2 AKI or higher. Across all data types, XGBoost outperformed GLMNET (mean AUROC 0.81 [95% confidence interval (CI), 0.80-0.82] vs 0.78 [95% CI, 0.77-0.79]). The multimodal XGBoost model incorporating CUIs parameterized as term frequency-inverse document frequency (TF-IDF) showed the highest discrimination performance (AUROC 0.82 [95% CI, 0.81-0.83]) over unimodal models (AUROC 0.79 [95% CI, 0.78-0.80]).
Discussion: A multimodal approach with structured data and TF-IDF weighting of CUIs increased model performance over structured-data-only models.
Conclusion: These findings highlight the predictive power of CUIs when merged with structured data for clinical prediction models, which may improve the detection of postoperative AKI.
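The TF-IDF parameterization of CUIs can be reproduced in miniature as below; the structured features, CUI codes, and labels are toy stand-ins for illustration, not the study's data or full pipeline.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier

# Toy inputs: per admission, structured features plus a "document" of space-joined
# CUIs (e.g., C0022660 = acute kidney injury) extracted from notes.
structured = np.array([[58, 1, 120.0], [71, 0, 95.0]])  # e.g., age, sex, a vital sign
cui_docs = ["C0022660 C0035078", "C0011849"]
labels = np.array([1, 0])  # stage >= 2 AKI within 7 days

# TF-IDF weighting over the CUI vocabulary, concatenated with structured features.
vectorizer = TfidfVectorizer(token_pattern=r"C\d{7}")
X = hstack([csr_matrix(structured), vectorizer.fit_transform(cui_docs)])

model = XGBClassifier(n_estimators=100, eval_metric="logloss")
model.fit(X, labels)
print(model.predict_proba(X)[:, 1])
```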
... Second, we find systems that focus on the so-called protected health information (PHI) (D1, C2), i.e., identifying personal health information (e.g., name, age, special medical conditions). PHI extraction is concerned with recognizing, processing, and anonymizing information such as name, age, and medical record numbers (Meystre et al., 2010). Besides these two extraction tasks, we identify encoding (Casey et al., 2021; Miranda-Escalada et al., 2020) and querying as further tasks. ...
Conference Paper
Electronic health records (EHR) have significantly amplified the volume of information accessible in the healthcare sector. Nevertheless, this information load also translates into elevated workloads for clinicians engaged in extracting and generating patient information. Natural Language Processing (NLP) aims to overcome this problem by automatically extracting and structuring relevant information from medical texts. While other methods related to artificial intelligence have been implemented successfully in healthcare (e.g., computer vision in radiology), NLP still lacks commercial success in this domain. The lack of a structured overview of NLP systems is exacerbating the problem, especially with the emergence of new technologies like generative pre-trained transformers. Against this background, this paper presents a taxonomy to inform integration decisions of NLP systems into healthcare IT landscapes. We contribute to a better understanding of how NLP systems can be integrated into daily clinical contexts. In total, we reviewed 29 papers and 36 commercial NLP products.
... Of the 18 articles, eight examined free text health data from electronic health records [15, 19-21, 23-26]. Another eight articles mentioned big data [12, 28, 30-35] but did not elaborate further on data type. ...
... [Table of identifier types reported across the reviewed articles:
- Individuals' identifiers (such as credit card records) and interaction privacy (e.g., use of voice/fingerprint) [30]
- Key attributes (e.g., ID, name, social security), quasi-identifiers (e.g., birth date, zip code, position, job, blood type), sensitive attributes (e.g., salary, medical examinations, credit card releases) [12, 31-33]
- 7 types of PHII, including personal names, ages, geographical locations, hospitals and healthcare organisations, dates, contact information, IDs [19]
- PHI: patient name, phone number, physician name, medical history; PII: names, addresses, contact numbers [22]
- 18 categories of PHI according to HIPAA, quasi-identifiers, and 9 categories of personal information according to the China Civil Code (name, birthday, ID number, biometric information, home address, phone number, email address, health condition information, and personal tracking information) [23]
- PHI according to HIPAA, plus doctor's name and years extracted from dates [20]
- Direct identifiers (e.g., name, mailing address, email, social security number, phone number or driver's license number) and indirect identifiers (e.g., birth date, postal code, and sex)] ...
... [27, 28] Nine articles referred to methods like rule-based automated learning, i.e., methods created to de-identify text data automatically, using HMS Scrubber, an open-source de-identification tool that employs a three-step process to remove PHII from medical documents [36], and DE-ID, a rule-based automated system that uses sets of rules, pattern-matching algorithms, and dictionaries to identify PHII in medical documents [19-21]. Machine learning approaches such as MIST (the MITRE Identification Scrubber Toolkit, software that uses samples of de-identified text that enable it to learn the contextual features necessary for accuracy) were mentioned in four articles [19-21, 24]; the Health Information De-identification (HIDE) system was mentioned in two articles [19, 20]. ...
Article
Full-text available
Introduction: Using data in research often requires that the data first be de-identified, particularly in the case of health data, which often include Personal Identifiable Information (PII) and/or Personal Health Identifying Information (PHII). There are established procedures for de-identifying structured data, but de-identifying clinical notes, electronic health records, and other records that include free text data is more complex. Several different ways to achieve this are documented in the literature. This scoping review identifies categories of de-identification methods that can be used for free text data.
Methods: We adopted an established scoping review methodology to examine review articles published up to May 9, 2022, in Ovid MEDLINE; Ovid Embase; Scopus; the ACM Digital Library; IEEE Xplore; and Compendex. Our research question was: What methods are used to de-identify free text data? Two independent reviewers conducted title and abstract screening and full-text article screening using the online review management tool Covidence.
Results: The initial literature search retrieved 3,312 articles, most of which focused primarily on structured data. Eighteen publications describing methods of de-identification of free text data met the inclusion criteria for our review. The majority of the included articles focused on removing categories of personal health information identified by the Health Insurance Portability and Accountability Act (HIPAA). The de-identification methods they described combined rule-based methods or machine learning with other strategies such as deep learning.
Conclusion: Our review identifies and categorises de-identification methods for free text data as rule-based methods, machine learning, deep learning and a combination of these and other approaches. Most of the articles we found in our search refer to de-identification methods that target some or all categories of PHII. Our review also highlights how de-identification systems for free text data have evolved over time and points to hybrid approaches as the most promising approach for the future.
... The goal of the deidentification process in unstructured EHR text notes is to identify SHI by inspecting entire medical records. Deidentification by medical experts is time-consuming, error-prone, and expensive [6]. In contrast, automated deidentification techniques based on recent advances in artificial intelligence can be used to simplify the entire process [7]. ...
Article
Full-text available
Background: Electronic health records (EHRs) in unstructured formats are valuable sources of information for research in both the clinical and biomedical domains. However, before such records can be used for research purposes, sensitive health information (SHI) must be removed in several cases to protect patient privacy. Rule-based and machine learning–based methods have been shown to be effective in deidentification. However, very few studies investigated the combination of transformer-based language models and rules.
Objective: The objective of this study is to develop a hybrid deidentification pipeline for Australian EHR text notes using rules and transformers. The study also aims to investigate the impact of pretrained word embedding and transformer-based language models.
Methods: In this study, we present a hybrid deidentification pipeline called OpenDeID, which is developed using an Australian multicenter EHR-based corpus called the OpenDeID Corpus. The OpenDeID corpus consists of 2,100 pathology reports with 38,414 SHI entities from 1,833 patients. The OpenDeID pipeline incorporates a hybrid approach of associative rules, supervised deep learning, and pretrained language models.
Results: OpenDeID achieved a best F1-score of 0.9659 by fine-tuning the Discharge Summary BioBERT model and incorporating various preprocessing and postprocessing rules. The OpenDeID pipeline has been deployed at a large tertiary teaching hospital and has processed over 8,000 unstructured EHR text notes in real time.
Conclusions: The OpenDeID pipeline is a hybrid deidentification pipeline to deidentify SHI entities in unstructured EHR text notes. The pipeline has been evaluated on a large multicenter corpus. External validation will be undertaken as part of our future work to evaluate the effectiveness of the OpenDeID pipeline.
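A hybrid of the kind described here (rules before and after a transformer NER pass) can be sketched as follows. The accession-number pattern, whitelist, and public stand-in NER model are assumptions for illustration, not the OpenDeID implementation or its fine-tuned Discharge Summary BioBERT.

```python
import re
from transformers import pipeline

# Preprocessing rule: pattern-based SHI such as specimen/accession numbers
# (hypothetical format) that a token classifier may miss.
ACCESSION = re.compile(r"\b[A-Z]{2}\d{6,8}\b")

# Stand-in public NER model; OpenDeID fine-tunes Discharge Summary BioBERT instead.
ner = pipeline("token-classification", model="dslim/bert-base-NER",
               aggregation_strategy="simple")

SAFE_TERMS = {"Melanoma"}  # postprocessing whitelist: clinical terms never to redact

def deidentify(note: str) -> str:
    note = ACCESSION.sub("[ID]", note)  # rule-based pass
    for ent in sorted(ner(note), key=lambda e: e["start"], reverse=True):
        surface = note[ent["start"]:ent["end"]]
        if surface in SAFE_TERMS:       # postprocessing rule overrides the model
            continue
        note = note[:ent["start"]] + f"[{ent['entity_group']}]" + note[ent["end"]:]
    return note

print(deidentify("Specimen AB1234567 for John Smith, reviewed at Westmead Hospital."))
```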