Article
PDF Available

Applied Logistic Regression

Authors:

Abstract

Introduction to the Logistic Regression Model; Multiple Logistic Regression; Interpretation of the Fitted Logistic Regression Model; Model-Building Strategies and Methods for Logistic Regression; Assessing the Fit of the Model; Application of Logistic Regression with Different Sampling Models; Logistic Regression for Matched Case-Control Studies; Special Topics; References; Index.
... Finally, we tested the utility of NODDI and tensor metrics in predicting cognitive outcomes. We used the general rule that an AUC of 0.5 suggests no ability to predict, while an AUC of 0.7 to 0.8 indicates acceptable prediction [43]. NODDI metrics, notably the NDI of the hippocampal cingulum, proved to be superior predictors of clinical outcomes compared with demographic variables such as age, gender, and education. ...
Preprint
Full-text available
INTRODUCTION: Diffusion tensor imaging has been used to assess white matter (WM) changes in the early stages of Alzheimer's disease (AD). However, the tensor model is necessarily limited by its assumptions. Neurite Orientation Dispersion and Density Imaging (NODDI) can offer insights into microstructural features of WM change. We assessed whether NODDI detects AD-related changes in medial temporal lobe WM more sensitively than traditional tensor metrics. METHODS: Standard diffusion and NODDI metrics were calculated for medial temporal WM tracts from 199 older adults drawn from ADNI3 who also received PET to measure pathology and neuropsychological testing. RESULTS: NODDI measures in medial temporal tracts were more strongly correlated with cognitive performance and pathology than standard measures. The combination of NODDI and standard metrics exhibited the strongest prediction of cognitive performance in random forest analyses. CONCLUSIONS: NODDI metrics offer additional insights into the contributions of WM degeneration to cognitive outcomes in the aging brain.
... Logistic regression is a form of generalized linear regression. For a regression or classification problem, a cost function is first specified, the optimal model parameters are then solved for iteratively by an optimization method, and the fitted model's effectiveness is finally confirmed on test data [9]. ...
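A minimal sketch of this specify-the-cost / optimize / test workflow, assuming scikit-learn and synthetic data purely for illustration (nothing below comes from the cited study):

```python
# Illustrative sketch: logistic regression workflow on synthetic data.
# The cost is the (regularized) log loss, minimized iteratively by the solver,
# and the fitted model is then checked on a held-out test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # iterative optimization

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("test log loss:", log_loss(y_test, model.predict_proba(X_test)))
```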
Article
Full-text available
In the wine industry, red wine is a common fruit wine made from grapes. As the number of people who enjoy red wine grows, the industry is receiving more and more attention, and so is the quality of red wine. To better evaluate red wine quality, an evaluation model is established based on 200 collected red wine samples and the corresponding data on 11 indices. First, the dimensionality of the index data is reduced using principal component analysis and factor analysis. Then, k-means clustering, the sum-of-squared-deviations method, and the class-average algorithm are used to perform cluster analysis on the data processed by principal component analysis or factor analysis. Next, logistic regression analysis is used to test the accuracy of the data classification. Finally, Fisher's discriminant method is used to perform discriminant analysis on the data and establish a model. A score function is obtained and used to calculate the quality score of each group of wines, and corresponding suggestions are given according to the quality evaluation results.
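A rough, self-contained sketch of such a pipeline (dimension reduction, clustering, a logistic-regression check, and Fisher discriminant scoring) using scikit-learn; the synthetic data and all parameter choices below are stand-ins, not the wine dataset or the paper's settings:

```python
# Rough sketch of the described pipeline on synthetic stand-in data (not the wine dataset).
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, _ = make_blobs(n_samples=200, n_features=11, centers=3, random_state=0)

# 1) Reduce the 11 indices to a few principal components.
X_pc = PCA(n_components=4).fit_transform(StandardScaler().fit_transform(X))

# 2) Cluster the reduced data into quality groups.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_pc)

# 3) Check how well the grouping can be recovered by logistic regression.
acc = LogisticRegression(max_iter=1000).fit(X_pc, labels).score(X_pc, labels)
print("logistic-regression check accuracy:", acc)

# 4) Fisher discriminant analysis yields score (discriminant) functions for samples.
lda = LinearDiscriminantAnalysis().fit(X_pc, labels)
print("discriminant scores of first sample:", lda.decision_function(X_pc[:1]))
```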
... Logistic regression [5]: Logistic regression is a generalized linear regression analysis model. It is a supervised learning method in machine learning, mainly used to solve binary and multi-class classification problems. ...
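For the multi-class case mentioned here, a minimal illustration (assuming scikit-learn and its built-in iris data, which are not part of the cited work):

```python
# Minimal illustration: the same logistic-regression estimator handles
# binary labels and, as here, a three-class problem.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                     # three classes
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:3]))                             # predicted class labels
print(clf.predict_proba(X[:3]).round(3))              # class probabilities summing to 1
```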
Article
Full-text available
This paper uses logistic regression, SVM, and decision trees from machine learning to analyse 67 data items from question C of the 2022 GCSE Cup National Student Mathematical Modelling Competition. The data were analysed using systematic clustering, the clustering coefficients were examined, and the number of clusters K was further determined by the "elbow" method, yielding a clear classification pattern among the glasses; a grid search method was then used to classify them. The results show that newly excavated glass artefacts are classified by their PbO content: if the lead oxide (PbO) content is below 5.46%, they are considered high-potassium glasses, and otherwise lead-barium glasses. The silica content was further used as a boundary to divide the high-potassium glass into two subclasses, and the lead-barium glasses were divided into three subclasses using the lead oxide and silica content as boundaries. Using a series of models and algorithms, the classification patterns of the different types of glass and their subclasses were clarified, and the results were tested for reasonableness and sensitivity. Such a model can be used to classify newly excavated glass artefacts and can also be adapted for the identification and analysis of other ancient artefacts.
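A compact sketch of the two generic steps named above, choosing K from the "elbow" of the within-cluster sum of squares and then tuning a classifier by grid search; the synthetic data, the SVM, and the parameter grid are illustrative assumptions, not the paper's setup:

```python
# Sketch of the two generic steps: elbow method for K, then grid search for a classifier.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_blobs(n_samples=67, centers=2, n_features=5, random_state=0)

# Elbow method: inspect the within-cluster sum of squares (inertia) as K grows;
# the bend ("elbow") suggests the number of clusters.
for k in range(1, 7):
    print(k, KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)

# Grid search over SVM hyperparameters for the final classification step.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}, cv=5)
grid.fit(X, y)
print("best params:", grid.best_params_, "cv accuracy:", grid.best_score_)
```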
... Logistic regression, or the logit model, refers to regression analyses for the (usually multiple) modelling of the distribution of dependent binary variables. The goal of this method is to determine the best-fitting model for describing the relationship between a dependent outcome variable and one or more independent explanatory variables [21]. Because the outcome variable "smoke spread confined to the unit of use" is binary, this type of modelling was chosen for this characteristic. ...
Article
On-site fire inspection by German fire brigades – fire tests in situ. The German fire brigades have been systematically recording significant fires in buildings since 2016. Part of this recording is checking whether protection objectives are violated. On behalf of the expert committee for preventive fire and hazard protection of the German fire brigades (Fachausschuss Vorbeugender Brand- und Gefahrenschutz, FA VB/G), the Munich Fire Brigade (Branddirektion München) evaluates these data in cooperation with the Technical University of Munich. This article presents the method and the first findings from these statistics. First trends show a considerably high number of smoke-spread events, the direct effect of fire-fighting operations, and the effect of organizational fire protection measures.
... In logistic regression, the dependent variable is considered to be binary. Logistic regression is widely used and is also applied, for example, within neural networks (Cucchiara 2012; Karlaftis and Vlahogianni 2011). ...
Book
Full-text available
For Germany alone, it is expected that services and products based on the use of artificial intelligence (AI) will generate revenues of 488 billion euros in 2025 - this would represent 13 percent of Germany’s gross domestic product. In important application sectors, the explainability of decisions made by AI is a prerequisite for acceptance by users, for approval and certification procedures, or for compliance with the transparency obligations required by the GDPR. The explainability of AI products is therefore one of the most important market success factors, at least in the European context.

The core of AI-based applications - by which we essentially mean machine learning applications here - is always the underlying AI models. These can be divided into two classes: white-box and black-box models. White-box models, such as decision trees based on comprehensible input variables, allow the basic comprehension of their algorithmic relationships. They are thus self-explanatory with respect to their mechanisms of action and the decisions they make. In the case of black-box models such as neural networks, it is usually no longer possible to understand the inner workings of the model due to their interconnectedness and multi-layered structure. However, at least for the explanation of individual decisions (local explainability), additional explanatory tools can be used in order to subsequently increase comprehensibility. Depending on the specific requirements, AI developers can fall back on established explanation tools, e.g. LIME, SHAP, Integrated Gradients, LRP, DeepLift or GradCAM, which, however, require expert knowledge. For mere users of AI, only a few good tools exist so far that provide intuitively understandable decision explanations (saliency maps, counterfactual explanations, prototypes or surrogate models).

The participants in the survey conducted as part of this study use popular representatives of white-box models (statistical/probabilistic models, decision trees) and black-box models (neural networks) to roughly the same extent today. In the future, however, according to the survey, a greater use of black-box models is expected, especially neural networks. This means that the importance of explanatory strategies will continue to increase in the future, while they are already an essential component of many AI applications today. The importance of explainability varies greatly depending on the industry. It is considered by far the most important in the healthcare sector, followed by the financial sector, the manufacturing sector, the construction industry and the process industry.

Four use cases were analyzed in more detail through in-depth interviews with proven experts. The use cases comprise image analysis of histological tissue sections as well as text analysis of doctors' letters from the health care domain, machine condition monitoring in manufacturing, and AI-supported process control in the process industry. Among these, model explanations that make the model-internal mechanisms of action comprehensible (global explainability) are only indispensable for the process control case as a strict approval requirement. In the other use cases, local explainability is sufficient as a minimum requirement. Global explainability, however, plays a key role in the acceptance of AI-supported products in the considered use cases related to manufacturing industries.
Furthermore, the use case analyses show that the selection of a suitable explanation strategy depends on the target groups, the data types used and the AI model used. The study analyzes the advantages and disadvantages of the established tools along these criteria and offers a corresponding decision support. Since white-box models are self-explanatory in terms of model action mechanisms and individual decisions, they should be preferred, whenever possible, for all applications that place high demands on comprehensibility, especially if they perform similarly well, or at least sufficiently well, compared to black-box models. It can be assumed that with the increasing use of AI in business, the need for reliable and intuitive explanation strategies will also increase significantly in the future. In order to meet this demand, the following technical and non-technical challenges currently need to be overcome:
· New and further development of suitable "hybrid" approaches that combine data- and knowledge-driven approaches, or white- and black-box modelling approaches respectively.
· Consideration of aspects from behavioural and cognitive science - such as the measurability of the quality of an explanation from the user's point of view, automated adaptations of explanations to users, and explainability of holistic AI systems - in order to improve explainable AI systems.
· Definition of application and risk classes from which the basic necessity of an explanation for given use cases can be derived.
· Definition of uniform requirements for the explainability of AI and thus the creation of clear regulatory specifications and approval guidelines corresponding to the application and risk classes.
· Creation of approval and (re)certification frameworks for systems continuously learning during operational deployment.
· Provision and implementation of comprehensive education and training programs for examiners and inspectors to verify the explainability of AI.
... There is no consensus on a cutoff for AUC values. Previous publications have suggested that an AUC between 0.7 and 0.8 is acceptable and greater than 0.8 is excellent 26,27, while the National Center on Response to Intervention's Technical Standard rates AUC values between 0.75 and 0.85 as 'partially convincing' and below 0.75 as 'unconvincing' 28. On the other hand, it has been recommended that no fixed value be enforced; instead, AUC values should be used to compare predictors within a single domain rather than against a strict cutoff [29][30][31][32]. ...
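A small sketch of the comparative use of AUC recommended here, scoring two simulated candidate predictors of the same outcome against each other rather than against a fixed cutoff (all numbers below are invented for illustration):

```python
# Sketch: compare two candidate predictors by AUC on the same outcome,
# rather than judging either against a fixed cutoff.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)                  # simulated binary outcome
pred_a = y + rng.normal(scale=1.0, size=500)      # more informative predictor
pred_b = y + rng.normal(scale=2.5, size=500)      # noisier predictor

auc_a, auc_b = roc_auc_score(y, pred_a), roc_auc_score(y, pred_b)
print(f"AUC A = {auc_a:.2f}, AUC B = {auc_b:.2f} -> prefer the higher within this domain")
```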
Preprint
Precision medicine is advancing patient care for complex human diseases. Discovery of biomarkers to diagnose specific subtypes within a heterogeneous diseased population is a key step towards realizing the benefits of precision medicine. However, popular statistical methods for evaluating candidate biomarkers, fold change and AUC, were designed for homogeneous data, and we evaluate their performance here. In general, these metrics overlook nearly ‘ideal’ biomarkers when they represent less than half of the diseased population. We introduce a new metric to address this shortfall and run a series of trials comprising simulated and biological data.
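The shortfall described, that a nearly ideal biomarker for a disease subtype earns only a modest overall AUC, can be illustrated with a toy simulation; this is not the authors' analysis, just an assumed example:

```python
# Toy simulation (not the paper's method): a biomarker elevated in only 30% of cases
# separates that subgroup almost perfectly yet earns only a modest overall AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
controls = rng.normal(0, 1, 1000)
subtype = rng.normal(5, 1, 300)          # the subgroup the marker truly flags
other_cases = rng.normal(0, 1, 700)      # diseased samples the marker does not flag

scores = np.concatenate([controls, subtype, other_cases])
labels = np.concatenate([np.zeros(1000), np.ones(1000)])
print("overall AUC:", round(roc_auc_score(labels, scores), 2))   # roughly 0.65
```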
... Logistic regression is a statistical model which, in its basic form, uses a logistic function to construct a classifier for a binary dependent variable from a set of (influential) independent factors [21]. From a mathematical point of view, a binary model has a dependent variable with two possible values, e.g., forward/backward, which can be labelled "0" and "1", and the corresponding probabilities of the labels lie in [0, 1]. ...
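In symbols (the standard textbook form, not quoted from the cited snippet), the basic binary logistic model maps a linear combination of the factors to a probability in [0, 1]:

```latex
P(Y = 1 \mid x) \;=\; \pi(x) \;=\;
\frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}
     {1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}},
\qquad
\operatorname{logit}\pi(x) \;=\; \ln\!\frac{\pi(x)}{1 - \pi(x)}
\;=\; \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p .
```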
Article
Full-text available
Background: Ubiquitylation is an important post-translational modification of proteins that not only plays a central role in cellular coding, but is also closely associated with the development of a variety of diseases. The specific selection of substrates by the E3 ligase is the key step in ubiquitylation. As various high-throughput analytical techniques continue to be applied to the study of ubiquitylation, a large amount of ubiquitylation site data and records of E3-substrate interactions continue to be generated. Biomedical literature is an important vehicle for information on E3-substrate interactions in ubiquitylation and related new discoveries, as well as an important channel for researchers to obtain such up-to-date data. The continuous explosion of ubiquitylation-related literature poses a great challenge to researchers in acquiring and analyzing the information. Therefore, automatic annotation of these E3-substrate interaction sentences from the available literature is urgently needed. Results: In this research, we proposed a model based on representation and attention-mechanism-based deep learning methods to automatically annotate E3-substrate interaction sentences in biomedical literature. Focusing on sentences that mention an E3 protein, we applied several natural language processing methods and a Long Short-Term Memory (LSTM)-based deep learning classifier to train the model. Experimental results proved the effectiveness of our proposed model, and the proposed attention-mechanism deep learning method outperforms other statistical machine learning methods. We also created a manual corpus of E3-substrate interaction sentences, in which the E3 proteins and substrate proteins are labeled, in order to construct our model. The corpus and model proposed by our research are a useful and valuable resource for the advancement of ubiquitylation-related research. Conclusion: Having the entire manual corpus of E3-substrate interaction sentences readily available in electronic form will greatly facilitate subsequent text mining and machine learning analyses. Automatic annotation of ubiquitylation sentences stating E3 ligase-substrate interactions benefits significantly from semantic representation and deep learning. The model enables rapid information access and can assist in further screening of key ubiquitylation ligase substrates for in-depth studies.
Article
Full-text available
Background and aims: Obesity is an independent risk factor for cardiovascular disease development. Here, we aimed to examine and compare the predictive values of three novel obesity indices, lipid accumulation product (LAP), visceral adiposity index (VAI), and triglyceride-glucose (TyG) index, for cardiovascular subclinical organ damage. Methods: A total of 1,773 healthy individuals from the Hanzhong Adolescent Hypertension Study cohort were enrolled. Anthropometric, biochemical, urinary albumin-to-creatinine ratio (uACR), brachial-ankle pulse wave velocity (baPWV), and Cornell voltage-duration product data were collected. Furthermore, the potential risk factors for subclinical organ damage were investigated, with particular emphasis on examining the predictive value of the LAP, VAI, and TyG index for detecting subclinical organ damage. Results: LAP, VAI, and TyG index exhibited a significant positive association with baPWV and uACR. However, only LAP and VAI were found to have a positive correlation with Cornell product. While the three indices did not show an association with electrocardiographic left ventricular hypertrophy, higher values of LAP and TyG index were significantly associated with an increased risk of arterial stiffness and albuminuria. Furthermore, after dividing the population into quartiles, the fourth quartiles of LAP and TyG index showed a significant association with arterial stiffness and albuminuria when compared with the first quartiles, in both unadjusted and fully adjusted models. Additionally, the concordance index (C-index) values for LAP, VAI, and TyG index were reasonably high for arterial stiffness (0.856, 0.856, and 0.857, respectively) and albuminuria (0.739, 0.737, and 0.746, respectively). Lastly, the analyses of continuous net reclassification improvement (NRI) and integrated discrimination improvement (IDI) demonstrated that the TyG index exhibited significantly higher predictive values for arterial stiffness and albuminuria compared with LAP and VAI. Conclusion: LAP, VAI, and, especially, TyG index demonstrated utility in screening cardiovascular subclinical organ damage among Chinese adults in this community-based sample. These indices have the potential to function as markers for early detection of cardiovascular disease in otherwise healthy individuals.
Article
Post-translational modifications (PTMs) either enhance a protein's activity in various sub-cellular processes or degrade its activity, which leads to failure of intracellular processes. Tyrosine nitration (NT) is a modification that degrades a protein's activity and thereby initiates and propagates various diseases, including neurodegenerative, cardiovascular, and autoimmune diseases and carcinogenesis. Identification of NT modification supports the development of novel therapies and drug discovery for the associated diseases. Identification of NT modification in biochemical labs is expensive, time-consuming, and error-prone. To supplement this process, several computational approaches have been proposed. However, these approaches fail to precisely identify NT modification, owing to the extraction of irrelevant, redundant, and less discriminative features from protein sequences. This paper presents the NTpred framework, which is competent in extracting comprehensive features from raw protein sequences using four different sequence encoders. To reap the benefits of different encoders, it generates four additional feature spaces by fusing different combinations of the individual encodings. Furthermore, it eradicates irrelevant and redundant features from the eight different feature spaces through a Recursive Feature Elimination process. Selected features of the four individual encodings and the four feature-fusion vectors are used to train eight different Gradient Boosted Tree classifiers. The probability scores from the trained classifiers are utilized to generate a new probabilistic feature space, which is used to train a Logistic Regression classifier. On the BD1 benchmark dataset, the proposed framework outperforms the existing best-performing predictor in 5-fold cross-validation and independent test evaluation, with a combined improvement of 13.7% in MCC and 20.1% in AUC. Similarly, on the BD2 benchmark dataset, the proposed framework outperforms the existing best-performing predictor with a combined improvement of 5.3% in MCC and 1.0% in AUC. NTpred is publicly available for further experimentation and predictive use at: https://sds_genetic_analysis.opendfki.de/PredNTS/.
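The last stage described, turning the probability scores of the boosted-tree classifiers into features for a logistic regression, is a stacking scheme. A schematic sketch of that idea (illustrative only, with invented synthetic data and two stand-in feature "encodings"; it is not the NTpred implementation):

```python
# Schematic stacking sketch: out-of-fold probabilities from gradient-boosted trees
# become the features of a final logistic-regression classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=400, n_features=40, random_state=0)
views = [X[:, :20], X[:, 20:]]        # stand-ins for two different sequence encodings

# One boosted-tree model per feature space; collect its out-of-fold P(class 1).
prob_features = np.column_stack([
    cross_val_predict(GradientBoostingClassifier(random_state=0), V, y,
                      cv=5, method="predict_proba")[:, 1]
    for V in views
])

meta = LogisticRegression().fit(prob_features, y)     # final probabilistic feature space
print("meta-classifier accuracy:", meta.score(prob_features, y))
```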
Article
Full-text available
Objective: To enable interactive visualization of phenome-wide association studies (PheWAS) on electronic health records (EHR). Materials and Methods: Current PheWAS technologies require familiarity with command-line interfaces and lack end-to-end data visualizations. pyPheWAS Explorer allows users to examine group variables, test assumptions, design PheWAS models, and evaluate results in a streamlined graphical interface. Results: A cohort of attention deficit hyperactivity disorder (ADHD) subjects and matched non-ADHD controls is examined. pyPheWAS Explorer is used to build a PheWAS model including sex and deprivation index as covariates, and the Explorer’s result visualization for this model reveals known ADHD comorbidities. Discussion: pyPheWAS Explorer may be used to rapidly investigate potentially novel EHR associations. Broader applications include deployment for clinical experts and preliminary exploration tools for institutional EHR repositories. Conclusion: pyPheWAS Explorer provides a seamless graphical interface for designing, executing, and analyzing PheWAS experiments, emphasizing exploratory analysis of regression types and covariate selection.
Article
Full-text available
Estimation of the covariance structure of spatial processes is a fundamental prerequisite for problems of spatial interpolation and the design of monitoring networks. We introduce a nonparametric approach to global estimation of the spatial covariance structure of a random function Z(x, t) observed repeatedly at times ti (i = 1, …, T) at a finite number of sampling stations xi (i = 1, 2, …, N) in the plane. Our analyses assume temporal stationarity but do not assume spatial stationarity (or isotropy). We analyze the spatial dispersions var(Z(xi, t) − Z(xj, t)) as a natural metric for the spatial covariance structure and model these as a general smooth function of the geographic coordinates of station pairs (xi, xj). The model is constructed in two steps. First, using nonmetric multidimensional scaling (MDS) we compute a two-dimensional representation of the sampling stations for which a monotone function of interpoint distances δij approximates the spatial dispersions. MDS transforms the problem into one for which the covariance structure, expressed in terms of spatial dispersions, is stationary and isotropic. Second, we compute thin-plate splines to provide smooth mappings of the geographic representation of the sampling stations into their MDS representation. The composition of this mapping f and a monotone function g derived from MDS yields a nonparametric estimator of var(Z(xa, t) − Z(xb, t)) for any two geographic locations xa and xb (monitored or not) of the form g(|f(xa) − f(xb)|). By restricting the monotone function g to a class of conditionally nonpositive definite variogram functions, we ensure that the resulting nonparametric model corresponds to a nonnegative definite covariance model. We use biorthogonal grids, introduced by Bookstein in the field of morphometrics, to depict the thin-plate spline mappings that embody the nature of the anisotropy and nonstationarity in the sample covariance matrix. An analysis of mesoscale variability in solar radiation monitored in southwestern British Columbia demonstrates this methodology.
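A compressed sketch of the two-step construction described above, assuming standard scientific-Python tools (scikit-learn's non-metric MDS and isotonic regression, SciPy's thin-plate-spline RBF interpolator) and placeholder data; it illustrates the idea of estimating g(|f(xa) − f(xb)|) but omits the variogram-class constraint on g and is not the authors' implementation:

```python
# Sketch only: nonparametric spatial-dispersion model var(Z(xa,t) - Z(xb,t)) ~ g(|f(xa) - f(xb)|).
import numpy as np
from sklearn.manifold import MDS
from sklearn.isotonic import IsotonicRegression
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(0)
N, T = 20, 100
coords = rng.uniform(0, 100, size=(N, 2))   # geographic coordinates of the N stations
Z = rng.normal(size=(T, N))                 # placeholder for the repeated observations

# Spatial dispersions d_ij = var(Z(x_i, t) - Z(x_j, t)) serve as the dissimilarity matrix.
disp = (Z[:, :, None] - Z[:, None, :]).var(axis=0)

# Step 1: non-metric MDS finds a 2-D configuration whose inter-point distances,
# through some monotone function, approximate the dispersions.
config = MDS(n_components=2, metric=False, dissimilarity="precomputed",
             random_state=0).fit_transform(disp)
mds_dist = np.linalg.norm(config[:, None, :] - config[None, :, :], axis=-1)

# Monotone link g fitted by isotonic regression (variogram-class constraint omitted here).
iu = np.triu_indices(N, k=1)
g = IsotonicRegression(out_of_bounds="clip").fit(mds_dist[iu], disp[iu])

# Step 2: thin-plate-spline mapping f from geographic space into the MDS plane.
f = RBFInterpolator(coords, config, kernel="thin_plate_spline")

def dispersion(xa, xb):
    """Estimated var(Z(xa,t) - Z(xb,t)) = g(|f(xa) - f(xb)|) for arbitrary locations."""
    fa, fb = f(np.atleast_2d(xa))[0], f(np.atleast_2d(xb))[0]
    return float(g.predict([np.linalg.norm(fa - fb)])[0])

print(dispersion(coords[0], [50.0, 50.0]))   # works for monitored or unmonitored sites
```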
Book
Introduction; Multivariate Normal Distribution; Wishart Distribution; Hotelling's T2 Distribution; Multivariate Beta Distributions; Rao's Distribution; Multivariate Skewness and Kurtosis.
Article
There are a number of possible designs for case-control studies. The simplest uses two separate simple random samples, but an actual study may use more complex sampling procedures. Typically, stratification is used to control for the effects of one or more risk factors in which we are interested. It has been shown (Anderson, 1972, Biometrika 59, 19-35; Prentice and Pyke, 1979, Biometrika 66, 403-411) that the unconditional logistic regression estimators apply under stratified sampling, so long as the logistic model includes a term for each stratum. We consider the case-control problem with stratified samples and assume a logistic model that does not include terms for strata, i.e., for fixed covariates the (prospective) probability of disease does not depend on stratum. We assume knowledge of the proportion sampled in each stratum as well as the total number in the stratum. We use this knowledge to obtain the maximum likelihood estimators for all parameters in the logistic model including those for variables completely associated with strata. The approach may also be applied to obtain estimators under probability sampling.
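Schematically (standard notation, assumed rather than quoted from the abstract), the two models contrasted here differ only in whether each stratum s receives its own intercept:

```latex
\text{with stratum terms (Anderson; Prentice and Pyke):}\quad
\operatorname{logit} P(D = 1 \mid x,\ \text{stratum } s) \;=\; \alpha_s + \beta^{\mathsf{T}} x,
\qquad
\text{without stratum terms (this paper):}\quad
\operatorname{logit} P(D = 1 \mid x) \;=\; \alpha + \beta^{\mathsf{T}} x .
```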