Figure - available from: BMC Medical Research Methodology
Left censoring and feature construction

Source publication
Article
Abstract
Background: The goal of our study is to examine the impact of the lookback length when engineering features to use in developing predictive models using observational healthcare data. Using a longer lookback for feature engineering gives more insight about patients but increases the issue of left-censoring. Methods: We used five US observati...
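The trade-off described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the function name `lookback_features`, the event codes, and the dates are all hypothetical, and it assumes each patient has a known observation start date. A patient counts as left-censored for a given lookback when their observation starts after the beginning of the lookback window, so part of the window is unobserved.

```python
from datetime import date, timedelta


def lookback_features(events, index_date, observation_start, lookback_days):
    """Build binary features from one patient's events in a lookback window.

    events: iterable of (event_date, code) tuples (hypothetical format).
    Returns (features, left_censored), where features is the set of codes
    recorded in [index_date - lookback_days, index_date) and left_censored
    is True when observation began after the window start.
    """
    window_start = index_date - timedelta(days=lookback_days)
    features = {
        code
        for event_date, code in events
        if window_start <= event_date < index_date
    }
    left_censored = observation_start > window_start
    return features, left_censored


# Hypothetical patient: observed from 2012-06-01, index date 2013-01-01.
events = [(date(2012, 7, 10), "copd"), (date(2012, 12, 1), "smoking")]
index_date = date(2013, 1, 1)
observation_start = date(2012, 6, 1)

short = lookback_features(events, index_date, observation_start, 90)
long = lookback_features(events, index_date, observation_start, 365)
print(short)
print(long)
```

In this toy case the 365-day lookback picks up one more code than the 90-day lookback, but it also extends before the patient's observation start, so absence of a code in the early part of the window cannot be read as absence of the condition.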

Citations

... The articles that perform predictive analysis on other than cancerous data partially use different machine learning and deep learning methods. One of these studies is Hardin et al. [46] that uses the OHDSI PLP module for the development of predictive models. Since these excluded studies also contain a valuable source of information for the current review, detailed information of the most important excluded articles and the finally included five articles can be obtained in the attached Supplementary Table S1 (color-coded in grey). ...
Article
The current generation of sequencing technologies has led to significant advances in identifying novel disease-associated mutations and generated large amounts of data in a high-throughput manner. Such data in conjunction with clinical routine data are proven to be highly useful in deriving population-level and patient-level predictions, especially in the field of cancer precision medicine. However, data harmonization across multiple national and international clinical sites is an essential step for the assessment of events and outcomes associated with patients, which is currently not adequately addressed. The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) is an internationally established research data repository introduced by the Observational Health Data Science and Informatics (OHDSI) community to overcome this issue. To address the needs of cancer research, the genomic vocabulary extension was introduced in 2020 to support the standardization of subsequent data analysis. In this review, we evaluate the current potential of the OMOP CDM to be applicable in cancer prediction and how comprehensively the genomic vocabulary extension of the OMOP can serve current needs of AI-based predictions. For this, we systematically screened the literature for articles that use the OMOP CDM in predictive analyses in cancer and investigated the underlying predictive models/tools. Interestingly, we found 248 articles, of which most use the OMOP for harmonizing their data, but only 5 make use of predictive algorithms on OMOP-based data and fulfill our criteria. The studies present multicentric investigations, in which the OMOP played an essential role in discovering and optimizing machine learning (ML)-based models. 
Ultimately, the use of the OMOP CDM leads to standardized data-driven studies for multiple clinical sites and enables a more solid basis utilizing, e.g., ML models that can be reused and combined in early prediction, diagnosis, and improvement of personalized cancer care and biomarker discovery.
Article
Background: This study used machine learning to develop a 3-year lung cancer risk prediction model with large real-world data in a mostly younger population. Methods: Over 4.7 million individuals, aged 45-65 years with no history of any cancer or lung cancer screening, diagnostic, or treatment procedures, with an outpatient visit in 2013 were identified in the Optum® De-identified Electronic Health Record (EHR) Dataset. A Least Absolute Shrinkage and Selection Operator model was fit using all available data in the 365 days prior. Temporal validation was assessed with recent data. External validation was assessed with data from Mercy Health Systems EHR and Optum® De-Identified Clinformatics Data Mart. Racial inequities in model discrimination were assessed with xAUCs. Results: The model AUC was 0.76. Top predictors included age, smoking, race, ethnicity, and diagnosis of chronic obstructive pulmonary disease. The model identified a high-risk group with lung cancer incidence 9 times the average cohort incidence, representing 10% of lung cancer patients. Model performed well temporally and externally, while performance was reduced for Asians and Hispanics. Conclusions: A high-dimensional model trained using big data identified a subset of patients with high lung cancer risk. The model demonstrated transportability to EHR and claims data, while underscoring the need to assess racial disparities when using machine learning methods. Impact: This internally and externally validated real-world data-based lung cancer prediction model is available on an open-source platform for broad sharing and application. Model integration into an EHR system could minimize physician burden by automating identification of high-risk patients.