Horizontal versus vertical partitioning. (a) Individual-level data for 3 variables held in 6 data files, one for each study. (b) Eight variables (for the same subjects) stored in 3 datasets (D1, D2 and D3) held by 3 distinct studies.

Source publication
Article
Full-text available
Background: DataSHIELD (Data Aggregation Through Anonymous Summary-statistics from Harmonised Individual levEL Databases) has been proposed to facilitate the co-analysis of individual-level data from multiple studies without physically sharing the data. In a previous paper, we investigated whether DataSHIELD could protect participant confidentiali...

Citations

... A special feature of SAVVY was the way data from the 17 RCTs were shared and analyzed: in a big collaborative effort, data had been gathered within 10 sponsor organizations (nine pharmaceutical companies and one academic trial center). In order to avoid challenges with data sharing, SAVVY used an approach familiar from health informatics (see, e.g., Budin et al. [24]). A standardized data structure was defined [10], based on which SAS and R macros were developed by the academic project group members. ...
Article
Full-text available
Background The SAVVY project aims to improve the analyses of adverse events (AEs) in clinical trials through the use of survival techniques that appropriately deal with varying follow-up times and competing events (CEs). This paper summarizes key features and conclusions from the various SAVVY papers. Methods Summarizing several papers reporting theoretical investigations using simulations and an empirical study including randomized clinical trials from several sponsor organizations, biases from ignoring varying follow-up times or CEs are investigated. The bias of commonly used estimators of the absolute (incidence proportion and one minus Kaplan-Meier) and relative (risk and hazard ratio) AE risk is quantified. Furthermore, we provide a cursory assessment of how pertinent guidelines for the analysis of safety data deal with the features of varying follow-up time and CEs. Results SAVVY finds that, both for avoiding bias and for categorizing evidence on the treatment effect on AE risk, the choice of estimator is key and more important than features of the underlying data such as the percentage of censoring, CEs, the amount of follow-up, or the value of the gold standard. Conclusions The choice of the estimator of the cumulative AE probability and the definition of CEs are crucial. Whenever varying follow-up times and/or CEs are present in the assessment of AEs, SAVVY recommends using the Aalen-Johansen estimator (AJE) with an appropriate definition of CEs to quantify AE risk. There is an urgent need to improve pertinent clinical trial guidelines for reporting AEs so that incidence proportions and one minus Kaplan-Meier estimators are finally replaced by the AJE with an appropriate definition of CEs.
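To make the estimator contrast concrete, here is a minimal, hedged R sketch (not SAVVY's actual SAS/R macros) comparing one minus Kaplan-Meier, which treats competing events as censoring, with the Aalen-Johansen estimator of the cumulative AE probability; the data are simulated, and the rates and evaluation time point are arbitrary illustrations, assuming only the 'survival' package.

# Illustrative sketch: simulated AE/CE/censoring times, estimator comparison at t = 6
library(survival)

set.seed(1)
n      <- 500
t_ae   <- rexp(n, rate = 0.10)             # latent time to adverse event (AE)
t_ce   <- rexp(n, rate = 0.15)             # latent time to competing event (CE)
t_cens <- runif(n, 0, 12)                  # varying follow-up / administrative censoring
time   <- pmin(t_ae, t_ce, t_cens)
event  <- factor(ifelse(time == t_cens, "censored",
                 ifelse(time == t_ae, "AE", "CE")),
                 levels = c("censored", "AE", "CE"))

# One minus Kaplan-Meier: CEs are (incorrectly) handled as censored observations
km <- survfit(Surv(time, event == "AE") ~ 1)
one_minus_km <- 1 - summary(km, times = 6)$surv

# Aalen-Johansen: a multi-state survfit with a factor status accounts for CEs properly
aj  <- survfit(Surv(time, event) ~ 1)
aje <- summary(aj, times = 6)$pstate[, "AE"]

c(one_minus_KM = one_minus_km, Aalen_Johansen = aje)

In this setup, the one-minus-Kaplan-Meier value typically exceeds the Aalen-Johansen cumulative incidence, illustrating the upward bias that SAVVY quantifies.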
... Initiatives such as the Health Data Consortium work toward the ethical sharing of data, whereas technologies such as DataSHIELD offer methods for secure, privacy-preserving analysis across different data sets, promoting data consistency and integrity without compromising patient privacy. 8 In organ transplantation, effective data harmonization is crucial because of the intricate interplay of factors and the need for large, high-quality data sets to accurately analyze continuously improving outcomes. In this overview, we describe the terminology and principles of data harmonization, present tools for data harmonization that emerged in other fields of healthcare and discuss the potential of these new tools for use in organ transplantation. ...
... 47 DataSHIELD is the most prominent tool that adopts this concept of taking the analysis to the data, not the data to the analysis, ensuring data owners and officers retain control over the data. 8 It allows co-analysis of individual-level data from multiple studies or sources without the need for physical transfer of the actual data. Differential privacy, trusted execution environments, and secure data enclaves are additional tools, frameworks, and environments that serve a similar purpose. ...
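As a rough illustration of this "take the analysis to the data" workflow, the following R sketch uses the standard DataSHIELD client packages (DSI, DSOpal, dsBaseClient); the server URL, credentials, table reference and variable name are placeholders for a hypothetical deployment, not an actual study.

# Illustrative sketch: log in to a (hypothetical) DataSHIELD/Opal server and request
# a non-disclosive aggregate; individual-level records never leave the server.
library(DSI)
library(DSOpal)
library(dsBaseClient)

builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1",
               url    = "https://opal.example.org",       # placeholder URL
               user   = "dsuser", password = "secret",    # placeholder credentials
               table  = "project.cohort",                 # placeholder table reference
               driver = "OpalDriver")
logindata <- builder$build()

conns <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D")
ds.mean("D$age", datasources = conns)   # only summary statistics are returned to the analyst
DSI::datashield.logout(conns)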
Article
Full-text available
In organ transplantation, accurate analysis of clinical outcomes requires large, high-quality data sets. Not only are outcomes influenced by a multitude of factors such as donor, recipient, and transplant characteristics and posttransplant events but they may also change over time. Although large data sets already exist and are continually expanding in transplant registries and health institutions, these data are rarely combined for analysis because of a lack of harmonization. Promoted by the digitalization of the healthcare sector, effective data harmonization tools became available, with potential applications also for organ transplantation. We discuss herein the present problems in the harmonization of organ transplant data and offer solutions to enhance its accuracy through the use of emerging new tools. To overcome the problem of inadequate representation of transplantation-specific terms, ontologies and common data models particular to this field could be created and supported by a consortium of related stakeholders to ensure their broad acceptance. Adopting clear data-sharing policies can diminish administrative barriers that impede collaboration between organizations. Secure multiparty computation frameworks and the artificial intelligence (AI) approach federated learning can facilitate decentralized and harmonized analysis of data sets, without sharing sensitive data and compromising patient privacy. A common image data model built upon a standardized format would be beneficial to AI-based analysis of pathology images. Implementation of these promising new tools and measures, ideally with the involvement and support of transplant societies, is expected to produce improved integration and harmonization of transplant data and greater accuracy in clinical decision-making, enabling improved patient outcomes.
... Our work is different in that it focuses on a typical deployment of a common medical research platform and that its content has been, in large parts, abstracted away from country-specific requirements. Previous work has also focused on compliance for deployments of specific research systems (see the work by Wallace et al. (37) and by Budin-Ljøsne et al. (38) for an example on the DataSHIELD software). To the best of our knowledge, our work is the first to target OHDSI deployments. ...
Article
Full-text available
Introduction The open-source software offered by the Observational Health Data Sciences and Informatics (OHDSI) collective, including the OMOP-CDM, serves as a major backbone for many real-world evidence networks and distributed health data analytics platforms. While container technology has significantly simplified deployments from a technical perspective, regulatory compliance can remain a major hurdle for the setup and operation of such platforms. In this paper, we present OHDSI-Compliance, a comprehensive set of document templates designed to streamline the data protection and information security-related documentation and coordination efforts required to establish OHDSI installations. Methods To decide on a set of relevant document templates, we first analyzed the legal requirements and associated guidelines with a focus on the General Data Protection Regulation (GDPR). Moreover, we analyzed the software architecture of a typical OHDSI stack and related its components to the different general types of concepts and documentation identified. Then, we created those documents for a prototypical OHDSI installation, based on the so-called Broadsea package, following relevant guidelines from Germany. Finally, we generalized the documents by introducing placeholders and options at places where individual institution-specific content will be needed. Results We present four documents: (1) a record of processing activities, (2) an information security concept, (3) an authorization concept, as well as (4) an operational concept covering the technical details of maintaining the stack. The documents are publicly available under a permissive license. Discussion To the best of our knowledge, there are no other publicly available sets of documents designed to simplify the compliance process for OHDSI deployments. While our documents provide a comprehensive starting point, local specifics need to be added, and, due to the heterogeneity of legal requirements in different countries, further adaptations might be necessary.
... The reason is that the AJE is the standard (non-parametric) estimator that accounts for CEs, censoring, and varying follow-up times simultaneously, and, being non-parametric, does not rely on restrictive parametric assumptions, such as constant hazards. Any other estimator of AE probability, such as incidence proportion, probability transform incidence density, or one minus Kaplan-Meier, delivers biased estimates in general. ...
... 2) What is the bias of common estimators that quantify the relative risk of experiencing an AE between two treatment arms in an RCT? 3) Can trial characteristics be identified that help explain the bias in estimators? 4) How does the use of potentially biased estimators impact qualification of AE probabilities and relative effects in regulatory settings? ...
... An important but largely unrecognized aspect when quantifying AE risk is the likely presence of CEs. Gooley et al. [8] define a CE as "We shall define a competing risk as an event whose occurrence either precludes the occurrence of another event under examination or fundamentally alters the probability of occurrence of this other event." whereas the ICH E9(R1) estimands addendum [9] defines an intercurrent event as "Events occurring after treatment initiation that affect either the interpretation or the existence of the measurements associated with the clinical question of interest." These two definitions appear to be, if not the same, then at least very related. ...
Preprint
Full-text available
Background The SAVVY project aims to improve the analyses of adverse events (AEs) in clinical trials through the use of survival techniques that appropriately deal with varying follow-up times and competing events (CEs). This paper summarizes key features and conclusions from the various SAVVY papers. Methods Through theoretical investigations using simulations and in an empirical study including randomized clinical trials from several sponsor organisations, biases from ignoring varying follow-up times or CEs are investigated. The bias of commonly used estimators of the absolute and relative AE risk is quantified. Furthermore, we provide a cursory assessment of how pertinent guidelines for the analysis of safety data deal with the features of varying follow-up time and CEs. Results SAVVY finds that, both for avoiding bias and for categorizing evidence on the treatment effect on AE risk, the choice of estimator is key and more important than features of the underlying data such as the percentage of censoring, CEs, the amount of follow-up, or the value of the gold standard. Conclusions The choice of the estimator of the cumulative AE probability and the definition of CEs are crucial. SAVVY recommends using the Aalen-Johansen estimator (AJE) with an appropriate definition of CEs whenever the risk for AEs is to be quantified. There is an urgent need to improve the guidelines for reporting AEs so that incidence proportions or one minus Kaplan-Meier estimators are finally replaced by the AJE with an appropriate definition of CEs.
... The Medical Informatics Initiative (MII) was launched to develop infrastructure for the integration of clinical data from patient care and medical research and to facilitate data sharing among university hospitals while conforming to privacy regulations [9]. One widely used operating platform for distributed computing is DataSHIELD (Data Aggregation Through Anonymous Summary-statistics from Harmonized Individual-level Databases), employed also by MIRACUM, one of the four MII consortia [10][11][12]. It has since been extended to facilitate deep learning-based analyses and Big Data analyses from distributed individual patient data [13,14]. ...
Article
Full-text available
The current state‐of‐the‐art analysis of central nervous system (CNS) tumors through DNA methylation profiling relies on the tumor classifier developed by Capper and colleagues, which centrally harnesses DNA methylation data provided by users. Here, we present a distributed‐computing‐based approach for CNS tumor classification that achieves a comparable performance to centralized systems while safeguarding privacy. We utilize the t‐distributed stochastic neighbor embedding (t‐SNE) model for dimensionality reduction and visualization of tumor classification results in two‐dimensional graphs in a distributed approach across multiple sites (DistSNE). DistSNE provides an intuitive web interface ( https://gin-tsne.med.uni-giessen.de ) for user‐friendly local data management and federated methylome‐based tumor classification calculations for multiple collaborators in a DataSHIELD environment. The freely accessible web interface supports convenient data upload, result review, and summary report generation. Importantly, an increasing sample size, as achieved through distributed access to additional datasets, allows DistSNE to improve cluster analysis and enhance predictive power. Collectively, DistSNE enables simple and fast classification of CNS tumors using large‐scale methylation data from distributed sources, while maintaining privacy and allowing easy and flexible network expansion to other institutes. This approach holds great potential for advancing human brain tumor classification and fostering collaborative precision medicine in neuro‐oncology.
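For readers unfamiliar with the dimensionality-reduction step, the following is a minimal, purely local R sketch using the Rtsne package on simulated methylation-like data; it is not the DistSNE implementation, and the federated DataSHIELD machinery is deliberately omitted.

# Local illustration only: embed high-dimensional, simulated methylation-like profiles
# into two dimensions with t-SNE and plot them coloured by a simulated tumor class.
library(Rtsne)

set.seed(42)
n_samples   <- 150
n_probes    <- 1000
beta_vals   <- matrix(runif(n_samples * n_probes), nrow = n_samples)  # simulated beta values
tumor_class <- factor(sample(c("A", "B", "C"), n_samples, replace = TRUE))

emb <- Rtsne(beta_vals, dims = 2, perplexity = 30, check_duplicates = FALSE)
plot(emb$Y, col = as.integer(tumor_class), pch = 19,
     xlab = "t-SNE 1", ylab = "t-SNE 2",
     main = "2-D embedding of simulated methylation-like profiles")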
... Hagedorn et al. [48] compared MapReduce and Apache Spark in processing and managing "Big Spatial Data" and found that their team's tool (STARK) recorded faster execution times in some instances. Other researchers developed a management and analytical tool known as DataSHIELD (Data Aggregation Through Anonymous Summary-statistics from Harmonised Individual levEL Databases), which manages data workflows "without physically transferring or sharing the data and without providing any direct access to individual-level data" [49]. DataSHIELD was designed to conform to privacy and confidentiality laws in the United Kingdom and address ethical concerns about sharing sensitive research data [49,50]. ...
Article
Full-text available
Many agencies and organizations, such as the U.S. Geological Survey, handle massive geospatial datasets and their auxiliary data and are thus faced with challenges in storing data and ingesting it, transferring it between internal programs, and egressing it to external entities. As a result, these agencies and organizations may inadvertently devote unnecessary time and money to conveying data with standards that are missing or outdated. This research aims to evaluate the components of data conveyance systems, such as transfer methods, tracking, and automation, to guide their improved performance. Specifically, organizations face the challenges of slow dispatch time and manual intervention when conveying data into, within, and from their systems. Conveyance often requires skilled workers when the system depends on physical media such as hard drives, particularly when terabyte transfers are required. In addition, incomplete or inconsistent metadata may necessitate manual intervention, process changes, or both. A proposed solution is organization-wide guidance for efficient data conveyance. That guidance involves systems analysis to outline a data management framework, which may include understanding the minimum requirements of data manifests, specification of transport mechanisms, and improving automation capabilities.
... The "Leipzig Health Atlas (LHA)" [21] consists of an i2b2 and a seek instance, that also rely on imported central data. In addition, tools with a different focus were not investigated further, i.e. the tool "Oncology Data Retrieval Systems (OncDRS)" [14] with a focus on genomic data and "DataSHIELD" [22] which is a multi-purpose tool for distributed computing. Finally the "COVID-Curated and the Open aNalysis aNd rEsearCh platform (CO-CONNECT)" [23] was excluded as the project terminated in October 2022 and there is no indication for further provision or development. ...
Article
Full-text available
Introduction: The increasing need for secondary use of clinical study data requires FAIR infrastructures, i.e. infrastructures that provide findable, accessible, interoperable and reusable data. It is crucial for data scientists to assess the number and distribution of cohorts that meet complex combinations of criteria defined by the research question. This so-called feasibility test is increasingly offered as a self-service, where scientists can filter the available data according to specific parameters. Early feasibility tools have been developed for biosamples or image collections. They are of high interest for clinical study platforms that federate multiple studies and data types, but they pose specific requirements on the integration of data sources and data protection. Methods: Mandatory and desired requirements for such tools were acquired from two user groups - primary users and staff managing a platform's transfer office. Open-source feasibility tools were sought through different literature search strategies and evaluated on their adaptability to the requirements. Results: We identified seven feasibility tools that we evaluated based on six mandatory properties. Discussion: We determined five feasibility tools to be the most promising candidates for adaptation to a clinical study research data platform: the Clinical Communication Platform, the German Portal for Medical Research Data, the Feasibility Explorer, Medical Controlling, and the Sample Locator.
... If patient data cannot be provided in sufficient volume, for example due to a lack of consent, the data may also be analyzed using federated computing methods (e.g. DataSHIELD or similar [11,12]). Federated learning works without the direct release of patient data from an institution: instead, (partial) results are provided indirectly and, in return, sites receive algorithms that analyze these data locally. ...
Article
Digitization in the healthcare sector and the support of clinical workflows with artificial intelligence (AI), including AI-supported image analysis, represent a great challenge and equally a promising perspective for preclinical and clinical nuclear medicine. In Germany, the Medical Informatics Initiative (MII) and the Network University Medicine (NUM) are of central importance for this transformation. This review article outlines these structures and highlights their future role in enabling privacy-preserving federated multi-center analyses with interoperable data structures harmonized between site-specific IT infrastructures. The newly founded working group “Digitization and AI” in the German Society of Nuclear Medicine (DGN) as well as the Fach- und Organspezifische Arbeitsgruppe (FOSA, specialty- and organ-specific working group) founded for the field of nuclear medicine (FOSA Nuklearmedizin) within the NUM aim to initiate and coordinate measures in the context of digital medicine and (image-)data-driven analyses for the DGN.
... Cohort-specific descriptions of the methods for ascertaining and defining variables are documented in the EU Child Cohort Network catalogue (https://data-catalogue.molgeniscloud.org/catalogue/catalogue/#/) and the Maelstrom Catalogue (http://maelstromresearch.org) for studies in LifeCycle and EUCAN-Connect, respectively. Data were analysed remotely through the R-based, open-source software DataSHIELD, which allows federated analysis through one-stage and two-stage IPD meta-analysis approaches with active disclosure controls [63,64,65,66]. Fourteen cohorts gave permission to analyse their data via DataSHIELD, and two cohorts (AOF, CHILD) via data transfer agreements. ...
Article
Full-text available
Background: Preterm birth is the leading cause of perinatal morbidity and mortality and is associated with adverse developmental and long-term health outcomes, including several cardiometabolic risk factors and outcomes. However, evidence about the association of preterm birth with later body size derives mainly from studies using birth weight as a proxy of prematurity rather than an actual length of gestation. We investigated the association of gestational age (GA) at birth with body size from infancy through adolescence. Methods and findings: We conducted a two-stage individual participant data (IPD) meta-analysis using data from 253,810 mother-child dyads from 16 general population-based cohort studies in Europe (Denmark, Finland, France, Italy, Norway, Portugal, Spain, the Netherlands, United Kingdom), North America (Canada), and Australasia (Australia) to estimate the association of GA with body mass index (BMI) and overweight (including obesity) adjusted for the following maternal characteristics as potential confounders: education, height, prepregnancy BMI, ethnic background, parity, smoking during pregnancy, age at child's birth, gestational diabetes and hypertension, and preeclampsia. Pregnancy and birth cohort studies from the LifeCycle and the EUCAN-Connect projects were invited and were eligible for inclusion if they had information on GA and minimum one measurement of BMI between infancy and adolescence. Using a federated analytical tool (DataSHIELD), we fitted linear and logistic regression models in each cohort separately with a complete-case approach and combined the regression estimates and standard errors through random-effects study-level meta-analysis providing an overall effect estimate at early infancy (>0.0 to 0.5 years), late infancy (>0.5 to 2.0 years), early childhood (>2.0 to 5.0 years), mid-childhood (>5.0 to 9.0 years), late childhood (>9.0 to 14.0 years), and adolescence (>14.0 to 19.0 years). GA was positively associated with BMI in the first decade of life, with the greatest increase in mean BMI z-score during early infancy (0.02, 95% confidence interval (CI): 0.00; 0.05, p < 0.05) per week of increase in GA, while in adolescence, preterm individuals reached similar levels of BMI (0.00, 95% CI: -0.01; 0.01, p 0.9) as term counterparts. The association between GA and overweight revealed a similar pattern of association with an increase in odds ratio (OR) of overweight from late infancy through mid-childhood (OR 1.01 to 1.02) per week increase in GA. By adolescence, however, GA was slightly negatively associated with the risk of overweight (OR 0.98 [95% CI: 0.97; 1.00], p 0.1) per week of increase in GA. Although based on only four cohorts (n = 32,089) that reached the age of adolescence, data suggest that individuals born very preterm may be at increased odds of overweight (OR 1.46 [95% CI: 1.03; 2.08], p < 0.05) compared with term counterparts. Findings were consistent across cohorts and sensitivity analyses despite considerable heterogeneity in cohort characteristics. However, residual confounding may be a limitation in this study, while findings may be less generalisable to settings in low- and middle-income countries. Conclusions: This study based on data from infancy through adolescence from 16 cohort studies found that GA may be important for body size in infancy, but the strength of association attenuates consistently with age. By adolescence, preterm individuals have on average a similar mean BMI to peers born at term.
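To illustrate the two-stage approach described above, here is a small hedged R sketch: stage one stands in for the per-cohort regressions that, in the actual study, were run federatedly in DataSHIELD, while stage two pools the cohort-specific coefficients with a random-effects meta-analysis using the metafor package; every number below is invented purely for illustration.

# Hypothetical two-stage IPD meta-analysis sketch (illustrative numbers only).
# Stage 1 (per cohort, normally executed remotely so individual-level data stay local):
# a regression of BMI z-score on gestational age yields a coefficient and standard error.
library(metafor)

est <- c(0.021, 0.018, 0.025, 0.015)   # invented per-cohort GA coefficients (BMI z-score per week)
se  <- c(0.006, 0.009, 0.007, 0.010)   # invented standard errors

# Stage 2: random-effects study-level meta-analysis pooling the cohort estimates
pooled <- rma(yi = est, sei = se, method = "REML")
summary(pooled)   # overall effect estimate, 95% CI, and heterogeneity statistics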
... Cohort-specific descriptions of the methods for ascertaining and defining variables are documented in the EU Child Cohort Network catalogue (https://data-catalogue.molgeniscloud.org/catalogue/catalogue/#/) and the Maelstrom Catalogue (http://maelstromresearch.org) for studies in LifeCycle and EUCAN-Connect, respectively. Data were analysed remotely through the R-based, open-source software DataSHIELD, which allows federated analysis through one-stage and two-stage IPD meta-analysis approaches with active disclosure controls [63,64,65,66]. Fourteen cohorts gave permission to analyse their data via DataSHIELD, and two cohorts (AOF, CHILD) via data transfer agreements. ...
Preprint
Full-text available
Background Preterm birth is the leading cause of perinatal morbidity and mortality, and is associated with adverse developmental and long-term health outcomes, including several cardio-metabolic risk factors. However, evidence about the association of preterm birth with later body size derives mainly from studies using birth weight as a proxy of prematurity rather than the actual length of gestation. We investigated the association of gestational age at birth (GA) with body size from infancy through adolescence. Methods and Findings We conducted a two-stage Individual Participant Data (IPD) meta-analysis using data from 253,810 mother-child dyads from 16 general population-based cohort studies in Europe, North America and Australasia to estimate the association of GA with standardized Body Mass Index (BMI) and overweight (including obesity) adjusted for confounders. Using a federated analytical tool (DataSHIELD), we fitted linear and logistic regression models in each cohort separately, and combined the regression estimates and standard errors through random-effects study-level meta-analysis providing an overall effect estimate at early infancy (>0.0-0.5 years), late infancy (>0.5-2.0 years), early childhood (>2.0-5.0 years), mid-childhood (>5.0-9.0 years), late childhood (>9.0-14.0 years) and adolescence (>14.0-19.0 years). GA was positively associated with BMI in the first decade of life, with mean differences in BMI z-score of 0.01-0.02 per week of increase in GA; however, preterm infants reached similar levels of BMI as term infants by adolescence. The association of GA with risk of overweight revealed a similar pattern of results from late infancy through mid-childhood, with increased odds of overweight (OR 1.01-1.02) per week increase in GA. By adolescence, however, GA was slightly negatively associated with risk of overweight (OR 0.98 [95% CI: 0.97; 1.00]) per week of increase in GA, and children born very preterm had increased odds of overweight (OR 1.46 [95% CI: 1.03; 2.08]) compared with those born at term. The findings were consistent across cohorts and sensitivity analyses, despite considerable heterogeneity in cohort characteristics. Conclusion Higher GA is potentially clinically important for higher BMI in infancy, while the association attenuates consistently with age. By adolescence, preterm children have on average a similar mean BMI to those born at term.