Article

Generalized Linear Models

Taylor & Francis
Journal of the American Statistical Association
... We fitted a Generalized Linear Model (GLM) [88] relating NPV to field-measured windthrow tree-mortality (see Section 2.1). The GLMs were fitted for each satellite. We described the distribution of residuals using the binomial family and used the logit link function, under which linear predictors must be transformed back to the scale of the observations by means of the inverse link function [88]. ...
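For readers who want to reproduce this kind of fit, here is a minimal sketch in Python with statsmodels; the data and variable names below are hypothetical stand-ins for the paper's NPV and tree-mortality measurements:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-ins for the paper's variables: NPV fractions from spectral
# mixture analysis and field-measured windthrow tree-mortality proportions.
rng = np.random.default_rng(0)
npv = rng.uniform(0.0, 1.0, 30)
mortality = (1.0 / (1.0 + np.exp(-(-2.0 + 4.0 * npv)))
             + rng.normal(0.0, 0.05, 30)).clip(0.01, 0.99)

# Binomial family with the (default) logit link.
X = sm.add_constant(npv)
fit = sm.GLM(mortality, X, family=sm.families.Binomial()).fit()

# The linear predictor lives on the logit scale; the inverse link maps it
# back to the scale of the observations, which is what fit.predict() does.
eta = X @ fit.params
mu = 1.0 / (1.0 + np.exp(-eta))  # inverse logit, equals fit.predict(X)
```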
Article
Full-text available
Windthrow (i.e., trees broken and uprooted by wind) is a major natural disturbance in Amazon forests. Images from medium-resolution optical satellites combined with extensive field data have allowed researchers to assess patterns of windthrow tree-mortality and to monitor forest recovery over decades of succession in different regions. Although satellites with high spatial resolution have become available in the last decade, they have not yet been employed for the quantification of windthrow tree-mortality. Here, we address how increasing the spatial resolution of satellites affects plot-to-landscape estimates of windthrow tree-mortality. We combined forest inventory data with Landsat 8 (30 m pixel), Sentinel 2 (10 m), and WorldView 2 (2 m) imagery over an old-growth forest in the Central Amazon that was disturbed by a single windthrow event in November 2015. Remote sensing estimates of windthrow tree-mortality were produced from Spectral Mixture Analysis and evaluated with forest inventory data (i.e., ground truth) by using Generalized Linear Models. Field measured windthrow tree-mortality (3 transects and 30 subplots) crossing the entire disturbance gradient was 26.9 ± 11.1% (mean ± 95% CI). Although the three satellites produced reliable and statistically similar estimates (from 26.5% to 30.3%, p < 0.001), Landsat 8 had the most accurate results and efficiently captured field-observed variations in windthrow tree-mortality across the entire gradient of disturbance (Sentinel 2 and WorldView 2 produced the second and third best results, respectively). As expected, mean-associated uncertainties decreased systematically with increasing spatial resolution (i.e., from Landsat 8 to Sentinel 2 and WorldView 2). However, the overall quality of model fits showed the opposite pattern. We suggest that this reflects the influence of relatively minor disturbances, such as defoliation and crown damage, and the fast growth of natural regeneration, which were neither measured in the field nor can be captured by coarser-resolution imagery. Our results validate the reliability of Landsat imagery for assessing plot-to-landscape patterns of windthrow tree-mortality in dense and heterogeneous tropical forests. Satellites with high spatial resolution can improve estimates of windthrow severity by allowing the quantification of crown damage and mortality of lower canopy and understory trees. However, this requires the validation of remote sensing metrics using field data at compatible scales.
... is a diagonal matrix of weights; X is a matrix of covariates. We will use Fisher's scoring [5]: ...
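As a concrete illustration of Fisher scoring with a diagonal weight matrix (the W of the excerpt) and covariate matrix X, here is a minimal IRLS sketch for the canonical logit case, where Fisher scoring coincides with Newton-Raphson (Python/NumPy, illustrative only):

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Fisher scoring / IRLS for a logistic GLM (illustrative sketch)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                        # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))       # mean via inverse logit
        w = mu * (1.0 - mu)                   # diagonal of the weight matrix W
        z = eta + (y - mu) / w                # working response
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```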
... 4. A random number u, 0 ≤ u ≤ 1, is chosen; if α ≥ u, the newly created point θ* is accepted and becomes θ_{t+1}; if not, the value is rejected, and the procedure stays at the previous point. 5. Steps 2, 3, and 4 are repeated until there are sufficient observations. ...
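The quoted steps are the standard Metropolis accept/reject loop; a compact random-walk sketch follows (Python; the log-posterior `log_post` and step size are hypothetical):

```python
import numpy as np

def metropolis(log_post, theta0, n_samples, step=0.5, seed=0):
    """Random-walk Metropolis sampler matching steps 4-5 above (sketch)."""
    rng = np.random.default_rng(seed)
    theta, samples = np.asarray(theta0, dtype=float), []
    while len(samples) < n_samples:                   # step 5: repeat
        proposal = theta + step * rng.normal(size=theta.shape)  # new point theta*
        log_alpha = log_post(proposal) - log_post(theta)
        u = rng.uniform()                             # step 4: draw u in (0, 1)
        if np.log(u) < log_alpha:                     # accept: theta* becomes next point
            theta = proposal
        # otherwise reject and stay at the previous point
        samples.append(theta.copy())
    return np.array(samples)
```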
Article
Full-text available
Methods of estimating the parameters of generalized linear models are considered for the case of insurance premium payments to clients. The iteratively reweighted least squares method, the Adam optimization algorithm, and the Markov chain Monte Carlo method were implemented. Insurance indicators and the target variable were randomly generated, owing to the lack of public access to insurance data. For the target variable, the normal, exponential, and Pareto distributions with the corresponding link functions were used. Based on the quality metrics of model fitting, conclusions were drawn regarding the quality of the models' construction.
... E.g., if only the first two components (X_1, X_2) = (x_1, x_2) of X are observed, we would like to compute the conditional expectation E[µ(X) | X_1 = x_1, X_2 = x_2], (1.2); this is further motivated in Example 2.1, below. Such conditional expectations (1.2) are of interest in many practical problems, e.g., they enter the SHapley Additive exPlanation (SHAP) of Lundberg-Lee [24], see also Aas et al. [1]; they are of interest in discrimination-free insurance pricing, see Lindholm et al. [21]; and they are also useful in a variable importance analysis, similar to the anova (analysis-of-variance/analysis-of-deviance) and the drop1 analyses for generalized linear models (GLMs) in the R statistical software [29], see also Section 2.3.2 in McCullagh-Nelder [28]. Moreover, it is known that the partial dependence plot (PDP) of Friedman [10] and Zhao-Hastie [37] for marginal explanation cannot correctly reflect the dependence structure in the features X. Below, we provide an alternative proposal, called the marginal conditional expectation plot (MCEP), that mitigates this deficiency. ...
... Precisely this problem questions the magnitudes in the VPI plot, because different extrapolations give us different magnitudes of increases in deviance losses. Next, we study an anova analysis, similarly to the one offered in the R package for GLMs; we also refer to Section 2.3.2 in McCullagh-Nelder [28]. The anova analysis recursively adds feature components to the regression model, and analyzes the change of (out-of-sample) deviance loss provided by the inclusion of each new feature component. ...
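Such an anova-style analysis can be sketched by adding features one at a time and tracking the out-of-sample deviance (Python with statsmodels and pandas; the data-frame inputs and the Poisson family are assumptions for illustration, not the paper's setup):

```python
import pandas as pd
import statsmodels.api as sm

def anova_path(y_tr, X_tr, y_te, X_te, order, family=sm.families.Poisson()):
    """Recursively add feature columns (in `order`) and record the
    out-of-sample deviance after each inclusion (illustrative sketch)."""
    rows, cols = [], []
    for feat in order:
        cols.append(feat)
        fit = sm.GLM(y_tr, sm.add_constant(X_tr[cols]), family=family).fit()
        mu = fit.predict(sm.add_constant(X_te[cols]))
        rows.append((feat, family.deviance(y_te, mu)))
    return pd.DataFrame(rows, columns=["added_feature", "oos_deviance"])
```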
Preprint
Full-text available
A very popular model-agnostic technique for explaining predictive models is the SHapley Additive exPlanation (SHAP). The two most popular versions of SHAP are a conditional expectation version and an unconditional expectation version (the latter is also known as interventional SHAP). Except for tree-based methods, usually the unconditional version is used (for computational reasons). We provide a (surrogate) neural network approach which allows us to efficiently calculate the conditional version for both neural networks and other regression models, and which properly considers the dependence structure in the feature components. This proposal is also useful to provide drop1 and anova analyses in complex regression models which are similar to their generalized linear model (GLM) counterparts, and we provide a partial dependence plot (PDP) counterpart that considers the right dependence structure in the feature components.
... Hence, we utilized the processed descriptions and propose an approach for topic extraction based on GloVe, UMAP, and FKM (Keyword Clustering), while also investigating its effectiveness by training models with two baseline topic modeling approaches to compare with (Topic Modeling). Finally, we assigned posterior cluster memberships to the documents to clarify the more or less exploitable topics through the coefficients extracted from a Generalized Linear Model (GLM) [58]. Briefly, we first collected the publicly available CVE data feeds from NVD and applied the necessary procedures to clean and form the datasets of this study (Data Collection and Preprocessing). ...
Article
Full-text available
Security vulnerabilities constitute one of the most important weaknesses of hardware and software security that can cause severe damage to systems, applications, and users. As a result, software vendors should prioritize the most dangerous and impactful security vulnerabilities by developing appropriate countermeasures. As we acknowledge the importance of vulnerability prioritization, in the present study, we propose a framework that maps newly disclosed vulnerabilities with topic distributions, via word clustering, and further predicts whether this new entry will be associated with a potential exploit Proof Of Concept (POC). We also provide insights on the current most exploitable weaknesses and products through a Generalized Linear Model (GLM) that links the topic memberships of vulnerabilities with exploit indicators, thus distinguishing five topics that are associated with relatively frequent recent exploits. Our experiments show that the proposed framework can outperform two baseline topic modeling algorithms in terms of topic coherence by improving LDA models by up to 55%. In terms of classification performance, the conducted experiments—on a quite balanced dataset (57% negative observations, 43% positive observations)—indicate that the vulnerability descriptions can be used as exclusive features in assessing the exploitability of vulnerabilities, as the “best” model achieves accuracy close to 87%. Overall, our study contributes to enabling the prioritization of vulnerabilities by providing guidelines on the relations between the textual details of a weakness and the potential application/system exploits.
... The GLM is an extension of the ordinary linear model that allows response variables to have a non-normal distribution. It has been used to predict daily precipitation and temperature in previous studies (McCullagh and Nelder, 1989; Segond et al., 2006; Asong et al., 2016). In GLM, response variables are predicted using Equation (8) (McCullagh and Nelder, 1989): ...
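Equation (8) itself is not reproduced in the excerpt; the generic GLM form it presumably instantiates is

g(E[Y_i]) = β_0 + β_1 x_{i1} + ... + β_p x_{ip},

where Y_i follows an exponential-family distribution and g is the link function (an identity link for temperature-like variables and a log-type link for precipitation-like variables are common choices).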
Article
In the past two decades, data-driven modeling has become a popular approach for different modeling tasks. This paper presents an evaluation of the performance of five widely used data-driven approaches (i.e., generalized linear model, lasso regression, support vector machine, neural networks, and random forest) for the modeling of the Etobicoke Creek watershed in Ontario, Canada. The models are built with eleven years of meteorological and hydrometric data from local stations, and the performance is examined by the Nash-Sutcliffe efficiency coefficient, coefficient of determination, mean absolute percentage error, and root mean squared error. The results show all the models are able to generate acceptable predictions and random forest has the highest accuracy. This study can provide support for the selection of hydrological modeling approaches in future studies.
... Non-deep learning machine learning classifiers: To assess if there is any merit in using deep learning models for SUMOylation prediction, we compare SUMOnets with two machine learning methods, logistic regression [35] and gradient boosted decision trees that we build and train. For the latter, we use the XGBoost implementation [5]. ...
Preprint
Full-text available
SUMOylation is a reversible post-translational protein modification in which SUMOs (small ubiquitin-like modifiers) covalently attach to a specific lysine residue of the target protein. This process is vital for many cellular events. Aberrant SUMOylation is associated with several diseases, including Alzheimer's, cancer, and diabetes. Therefore, accurate identification of SUMOylation sites is essential to understanding cellular processes and the pathologies that arise with their disruption. We present three deep neural architectures, SUMOnets, that take the peptide sequence centered on the candidate SUMOylation site as input and predict whether the lysine could be SUMOylated. Each of these models, SUMOnet-1, -2, and -3, relies on a different composition of deep sequential learning architectural units, such as bidirectional Gated Recurrent Units (biGRUs) and convolutional layers. We evaluate these models on the benchmark dataset with three different representations of the input peptide sequence. SUMOnet-3 achieves 75.8% AUPR and 87% AUC scores, corresponding to an approximately 5% improvement over the closest state-of-the-art SUMOylation predictor and a 16% improvement over GPS-SUMO, the most widely adopted tool. We also evaluate the models on a challenging subset of the test data formed based on the absence and presence of known SUMOylation motifs. Even though the performance of all methods degrades on this subset, SUMOnet-3 remains the best predictor in these challenging cases. The SUMOnet-3 framework is available as an open-source project and a Python library at https://github.com/berkedilekoglu/SUMOnet.
... In order to avoid this, it is common to introduce a known and strictly increasing link function h: [a, b] → ℝ and assume that regression data (x, h(Y)) for the transformed outcome variable follow the linear model of Section 3.1. This is analogous to the link functions of generalized linear models [76], although here we focus on transformations of quantiles rather than of expected values. Since quantiles are preserved by monotone transformations, it follows from (88) that the CQs satisfy ...
Article
Full-text available
In this article, we survey and unify a large class of L-functionals of the conditional distribution of the response variable in regression models. This includes robust measures of location, scale, skewness, and heavy-tailedness of the response, conditionally on covariates. We generalize the concepts of L-moments (G. Sillito, Derivation of approximants to the inverse distribution function of a continuous univariate population from the order statistics of a sample, Biometrika 56 (1969), no. 3, 641–650), L-skewness, and L-kurtosis (J. R. M. Hosking, L-moments: analysis and estimation of distributions using linear combinations of order statistics, J. R. Stat. Soc. Ser. B Stat. Methodol. 52 (1990), no. 1, 105–124) and introduce order numbers for a large class of L-functionals through orthogonal series expansions of quantile functions. In particular, we motivate why location, scale, skewness, and heavy-tailedness have order numbers 1, 2, (3,2), and (4,2), respectively, and describe how a family of L-functionals, with different order numbers, is constructed from Legendre, Hermite, Laguerre, or other types of polynomials. Our framework is applied to models where the relationship between quantiles of the response and the covariates follows a transformed linear model, with a link function that determines the appropriate class of L-functionals. In this setting, the distribution of the response is treated parametrically or nonparametrically, and the response variable is either censored/truncated or not. We also provide a framework for asymptotic theory of estimates of L-functionals and illustrate our approach by analyzing the arrival time distribution of migrating birds. In this context, a novel version of the coefficient of determination is introduced, which makes use of the abovementioned orthogonal series expansion.
... The dependent variable y is the count variable being modeled, and its value depends on those of the independent variables in x. The latter equation could also be formulated as a generalized linear model (GLM) [7]. In epidemiology, Poisson regression is valuable because it allows for the estimation of the relative risk (RR) associated with a unit increase in a pollutant. ...
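The relative-risk interpretation follows directly from the log link: for a single-pollutant Poisson model,

log E[Y] = β_0 + β_1 x  ⟹  RR = E[Y | x + 1] / E[Y | x] = exp(β_1),

so a unit increase in the pollutant multiplies the expected count by exp(β_1).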
... Hereto, we employ the frequency-severity approach (Ohlsson and Johansson, 2010; Frees et al., 2014), where we model the claim frequency and claim severity separately. To simulate the number of claims and the claim costs as a function of the policyholder and contract-specific characteristics, we rely on the generalized linear model (GLM) framework (McCullagh and Nelder, 1989). We use a Poisson GLM with log link as the data-generating model for the number of claims (Ohlsson and Johansson, 2010; Quijano Xacur and Garrido, 2015): N_ij ~ Poi(w_ij exp(x_ij^T β^cf)). ...
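A minimal simulation of this claim-frequency specification (Python/NumPy; the features, coefficients, and exposures below are hypothetical, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
# Hypothetical policyholder features x_ij (intercept, scaled age, urban flag)
X = np.column_stack([np.ones(n),
                     rng.integers(18, 80, n) / 50.0,
                     rng.integers(0, 2, n)])
beta_cf = np.array([-2.0, 0.3, 0.5])   # assumed frequency coefficients
w = rng.uniform(0.1, 1.0, n)           # exposures w_ij in policy-years

lam = w * np.exp(X @ beta_cf)          # Poisson mean: exposure times log-link mean
claims = rng.poisson(lam)              # simulated claim counts N_ij
```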
Preprint
Full-text available
Traditionally, the detection of fraudulent insurance claims relies on business rules and expert judgement which makes it a time-consuming and expensive process (Óskarsdóttir et al., 2022). Consequently, researchers have been examining ways to develop efficient and accurate analytic strategies to flag suspicious claims. Feeding learning methods with features engineered from the social network of parties involved in a claim is a particularly promising strategy (see for example Van Vlasselaer et al. (2016); Tumminello et al. (2023)). When developing a fraud detection model, however, we are confronted with several challenges. The uncommon nature of fraud, for example, creates a high class imbalance which complicates the development of well performing analytic classification models. In addition, only a small number of claims are investigated and get a label, which results in a large corpus of unlabeled data. Yet another challenge is the lack of publicly available data. This hinders not only the development of new methods, but also the validation of existing techniques. We therefore design a simulation machine that is engineered to create synthetic data with a network structure and available covariates similar to the real life insurance fraud data set analyzed in Óskarsdóttir et al. (2022). Further, the user has control over several data-generating mechanisms. We can specify the total number of policyholders and parties, the desired level of imbalance and the (effect size of the) features in the fraud generating model. As such, the simulation engine enables researchers and practitioners to examine several methodological challenges as well as to test their (development strategy of) insurance fraud detection models in a range of different settings. Moreover, large synthetic data sets can be generated to evaluate the predictive performance of (advanced) machine learning techniques.
... The variance from a Gamma process is given as σ² = µ²/α (McCullagh & Nelder 1983), and 1/α is the measure of overall dispersion. Variance structures can be modelled by a smoothed non-parametric function g(x_i | τ) and basis coefficients for the variance function τ (Rice & Silverman 1991, Silverman 1985). ...
Preprint
Full-text available
Grassland ecosystems support a wide range of species and provide key services including food production, carbon storage, biodiversity support, and flood mitigation. However, yield stability in these grassland systems is not yet well understood, with recent evidence suggesting that water stress throughout summer and warmer temperatures in late summer reduce yield stability. In this study, we investigate how grassland yield stability in the Park Grass Experiment, UK, has changed over time by developing a Bayesian time-varying autoregressive and time-varying generalised autoregressive conditional heteroskedasticity model using the variance-parameterised Gamma likelihood function.
... Having derived synthetic individual level data, for the purposes of model fitting, for all models except MCML, the data were then collapsed (summing cases, averaging doses) into the 5 dose groups given in Table 1. Poisson linear relative risk generalised linear models 41 were fitted to this grouped data, with rates given by expression (3), using as offsets the number per group in Table 1. Models were fitted using four separate methods: ...
Preprint
Full-text available
There is direct evidence of risks at moderate and high levels of radiation dose for highly radiogenic cancers such as leukaemia and thyroid cancer. For many cancer sites, however, it is necessary to assess risks via extrapolation from groups exposed at moderate and high levels of dose, about which there are substantial uncertainties. Crucial to the resolution of this area of uncertainty is the modelling of the dose-response relationship and the importance of both systematic and random dosimetric errors for analyses in the various exposed groups. It is well recognised that measurement error can alter substantially the shape of this relationship and hence the derived population risk estimates. Particular attention has been devoted to the issue of shared errors, common in many datasets, and particularly important in occupational settings. We propose a modification of the regression calibration method which is particularly suited to studies in which there is a substantial amount of shared error, and in which there may also be curvature in the true dose response. This method can be used in settings where there is a mixture of Berkson and classical error. In fits to synthetic datasets in which there is substantial upward curvature in the true dose response, and varying (and sometimes substantial) amounts of classical and Berkson error, we show that the coverage probabilities of all methods for the linear coefficient α are near the desired level, irrespective of the magnitudes of assumed Berkson and classical error, whether shared or unshared. However, the coverage probabilities for the quadratic coefficient β are generally too low for the unadjusted and regression calibration methods, particularly for larger magnitudes of the Berkson error, whether this is shared or unshared. In contrast, Monte Carlo maximum likelihood yields coverage probabilities for β that are uniformly too high. The extended regression calibration method yields coverage probabilities that are too low when shared and unshared Berkson errors are both large, although otherwise it performs well, and its coverage is generally better than that of the other three methods. A notable feature is that for all methods apart from extended regression calibration the estimates of the quadratic coefficient β are substantially upwardly biased.
... A brief overview of the models that will be applied in this paper is highlighted here. For more in-depth understanding, the reader can refer to McCullagh and Nelder [38], Long [39], Cameron and Trivedi [40], Agresti [41], Winkelmann [42], and Zhou et al. [43]. ...
Article
Full-text available
Climate finance stakeholders across Africa have long sought to understand the complex nature of the climate cash flow architecture. Distribution models are critical mathematical tools for generating the general characteristics of the cash flow that are used to inform policy decisions. In this paper, we undertake a comprehensive investigation of the climate funds flowing into sub-Saharan Africa (SSA) by suggesting candidate climate finance models that can be used by policy makers to design simulations that can aid in assessing climate risks, identify more efficient climate finance schemes, and obtain optimal control parameter settings under different scenarios. This is achieved by considering climate finance as a form of insurance. Different dimensions of the data are examined following four distinct groupings of the data set. This is to account for different views of risk by the various climate finance participants. The frequency and severity of the approved funds are analyzed with the aid of various mathematical distribution models and regression analyses. The dynamics of a given variable relative to varying scenarios are examined. The findings obtained confirm the presence of emerging risks induced by the nature of the flow. Central Africa, for instance, records the fewest theme-specific projects; mitigation finance accounts for more than half of all climate funds; and, sector-wise, adaptation finance is concentrated mainly in the energy sector. The perpetuation of the observed inequalities across themes, subregions, and sector-specific climate-related projects portends grave consequences as these risks accumulate over time. The Burr mixture model best fitted the distribution of approved project costs, and the factors driving the frequency and severity of approved projects ranged from the Central Africa subregion to projects in the general environment sector. One of the policy recommendations emphasized was the need to adopt a risk-adjusted distribution model for climate finance allocation in SSA.
... To estimate the values of the vector β = (β_0, . . . , β_p)^t, given a sample of size n of the variables (X, Y), the maximum likelihood method and iterative algorithms based on weighted least squares are used [10]. The estimator of β, β̂ = (β̂_0, . . . ...
Article
Full-text available
The Hosmer-Lemeshow (HL) test is currently a hypothesis test frequently used to assess the goodness of fit of logistic regression models. However, it has several drawbacks related to sample size and has therefore undergone numerous modifications in recent years. In this work, we studied the stability of the test's decisions as the number of groups g is varied. When the chosen model is appropriate, we find that the HL test performs well, and its performance does not depend on the sample size; in this situation, using the recommended number of groups has no notable effect. In contrast, when the fitted model is inadequate, the HL test is sensitive to sample size and its performance is poor, especially in small samples. Moreover, in large samples, the choice of the number of groups does have an effect.
... Statistical analysis. Data were analyzed by Generalized Linear Models (GZLM), a generalization of General Linear Models used to fit regression models for univariate data presumed to follow the exponential class of distributions 107 . Estimations were adjusted to Linear, Gamma, or Tweedie probability distributions according to the Akaike Information Criterion (AIC). ...
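The family-selection step described here can be mimicked with statsmodels by comparing AICs across candidate families (Python sketch with synthetic positive-valued data; the Tweedie case requires a quasi-AIC and is omitted from this illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 200)
y = rng.gamma(shape=2.0, scale=np.exp(0.5 + x) / 2.0)  # positive, right-skewed
X = sm.add_constant(x)

candidates = {
    "linear (Gaussian)": sm.families.Gaussian(),
    "Gamma (log link)": sm.families.Gamma(link=sm.families.links.Log()),
}
aic = {name: sm.GLM(y, X, family=fam).fit().aic for name, fam in candidates.items()}
best = min(aic, key=aic.get)  # family with the lowest AIC wins
```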
Article
Full-text available
In fear conditioning with time intervals between the conditioned (CS) and unconditioned (US) stimuli, a neural representation of the CS must be maintained over time to be associated with the later US. Usually, temporal associations are studied by investigating individual brain regions. However, the effect of the interval at the network level remains unknown, leaving unexplored the functional connections that cooperate for the CS transient memory and its fear association. We investigated the functional network supporting temporal associations using a task in which a 5-s interval separates the contextual CS from the US (CFC-5s). We quantified c-Fos expression in forty-nine brain regions of male rats following the CFC-5s training, used c-Fos correlations to generate functional networks, and analyzed them by graph theory. Control groups were trained in contextual fear conditioning, in which the CS and US overlap. The CFC-5s training additionally activated subdivisions of the basolateral, lateral, and medial amygdala; the prelimbic, infralimbic, perirhinal, postrhinal, and intermediate entorhinal cortices; and the ventral CA1 and subiculum. The CFC-5s network had increased amygdala centrality and higher amygdala internal and external connectivity with the retrosplenial cortex, thalamus, and hippocampus. Amygdala and thalamic nuclei were network hubs. Functional connectivity among these brain regions could support CS transient memories and their association.
... Fininsa and Yuen (2001) demonstrated how to create a deviance table for the final reduced multiple-variable models. The likelihood ratio test was used to evaluate the importance of the variables and was compared to the Chi-square value when variables were added to the reduced model (McCullagh and Nelder, 1989). Analysis of deviance, odds ratios, and standard errors of new variables in a reduced model demonstrated the role of independent variables, and of variable classes within the independent variables, in determining the importance of FBG incidence. ...
Article
Full-text available
Physoderma fungal species cause faba bean gall (FBG), which devastates faba bean (Vicia faba L.) in the Ethiopian highlands. In three regions (Amhara, Oromia, and Tigray), the relative importance, distribution, intensity, and association with factors affecting FBG damage were assessed for the 2019 (283 fields) and 2020 (716 fields) main cropping seasons. A logistic regression model was used to associate biophysical factors with FBG incidence and severity. The Amhara region has the highest prevalence of FBG (95.7%), followed by Tigray (83.3%) and the Oromia region (54%). Maximum FBG incidence (78.1%) and severity (32.8%) were recorded in the Amhara and Tigray areas, respectively. Chocolate spot was most prevalent in West Shewa, the Finfinne Special Zone, and North Shewa of the Oromia region. Ascochyta blight was prevalent in North Shewa, West Shewa, and Southwest Shewa of Oromia, and in South Gondar of Amhara. Faba bean rust was detected in all zones except South Gondar and North Shewa, and root rot disease was detected in all zones except South Gondar, South Wollo, and North Shewa of Amhara. Crop growth stage, cropping system, altitude, weed density, and fungicide use were all found to affect the incidence and severity of FBG. The podding and maturity stages, mono-cropping, altitude (>2,400 m), high weed density, and absence of fungicide were associated with increased disease intensities. However, crop rotation, low weed infestation, and fungicide usage were identified as potential management options to reduce FBG disease.
... Because some replaced tokens have their exact rank labels unknown (represented by the special value −1), the loss of the TQR task cannot be directly formulated as the loss of standard ordinal regression (McCullagh and Nelder, 1989). To address this challenge, we set the loss of the TQR task with K levels to be the summation of K −1 binary cross entropy losses: ...
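A sketch of that masked ordinal loss (Python/NumPy; the array shapes and the −1 convention follow the excerpt, everything else is illustrative):

```python
import numpy as np

def tqr_loss(probs, ranks, K):
    """Sum of K-1 binary cross entropies for ordinal ranks (sketch).

    probs: (n, K-1) predicted P(rank > k) for thresholds k = 1..K-1
    ranks: (n,) integer labels in 1..K, or -1 when the rank is unknown
    """
    keep = ranks != -1                           # mask tokens with unknown rank
    p = np.clip(probs[keep], 1e-7, 1.0 - 1e-7)
    # binary target for threshold k: 1 iff the true rank exceeds k
    t = (ranks[keep][:, None] > np.arange(1, K)[None, :]).astype(float)
    bce = -(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))
    return bce.sum(axis=1).mean()                # sum over the K-1 thresholds
```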
... Descriptive statistics were used to describe patient and clinical characteristics. To account for the complex sampling of the NIS, survey-adjusted methods (the survey-weighted generalized linear model, svyglm) were used to produce estimates representative of the US population (Bernhardt et al. 2015; McCullagh and Nelder 1989). Total charges and LOS were not normally distributed, hence we log-transformed them and present the geometric means. ...
Article
Full-text available
Purpose Cancer-related fatigue (CRF) is a devastating complication with limited recognized clinical risk factors. We examined characteristics among solid and liquid cancers utilizing machine learning (ML) approaches for predicting CRF. Methods We utilized the 2017 National Inpatient Sample database and employed generalized linear models to assess the association between CRF and the outcome of burden of illness among hospitalized solid and non-solid tumor patients. We further applied lasso, ridge, and Random Forest (RF) to build our linear and non-linear ML models. Results The 2017 database included 196,330 prostate (PCa), 66,385 leukemia (Leuk), 107,245 multiple myeloma (MM), and 41,185 cancers of the lip, oral cavity, and pharynx (CLOP) patients; among them, there were 225, 140, 125, and 115 CRF patients, respectively. CRF was associated with a higher burden of illness among Leuk and MM, and higher mortality among PCa. For the PCa patients, both the test and the training data had the best areas under the ROC curve [AUC = 0.91 (test) vs. 0.90 (train)] for both lasso and ridge ML. For CLOP, this was 0.86 and 0.79 for ridge, 0.87 and 0.84 for lasso, and 0.82 for both test and train for RF; for the Leuk cohort, it was 0.81 (test) and 0.76 (train) for both ridge and lasso. Conclusion This study provided an effective platform to assess potential risks and outcomes of CRF in patients hospitalized for the management of solid and non-solid tumors. Our study showed that ML methods performed well in predicting CRF among solid and liquid tumors.
... One of the best-known models for positive-valued series is the (multiplicative) autoregressive conditional duration (ACD) process proposed by Engle and Russell (1998) (see also Engle, 2002; Aknouche and Francq, 2023). Compared to the strong ARMA model, the ACD equation allows direct data modeling without any prior transformation, which improves forecasting accuracy (McCullagh and Nelder, 1989; Engle and Russell, 1998; Engle, 2002). In addition, the conditional variance of the standard ACD model is stochastic and time-varying, leading to better flexibility in modeling (Bhogal and Variyam, 2019; Hautsch, 2012; Pacurar, 2008). ...
... We used generalized linear models (GLM) with a Gaussian error distribution and an identity link function (McCullagh and Nelder 1989) to examine the effect of habitat (dry sand, wet sand, and mud; categorical data) on the number of invertebrates caught in the pitfall traps and collected in the core samples. We analysed data from each study site separately, as in the unregulated river section the number of invertebrates was higher than in the regulated one (Kozik et al., 2022). ...
... While one could in principle describe the situation of interest by either Binomial or Poisson models, the latter are somewhat more adequate, especially if the number of event counts is small compared to the population sizes. The Poisson GLM with a random effect can be seen as a special case of a generalised mixed model [12]. ...
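One common way to write such a Poisson GLM with a group-level random effect (a standard formulation, not a quotation from [12]) is

Y_ij | u_i ~ Poisson(µ_ij),  log µ_ij = log N_ij + x_ij^T β + u_i,  u_i ~ N(0, σ²),

where log N_ij is a fixed offset for the population size.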
Article
Full-text available
The COVID-19 pandemic has spread rapidly across the world, affecting millions of people and generating serious health, social, and economic consequences. All South East Asian countries have experienced the pandemic, with varying degrees of intensity and response. As the pandemic progresses, it is important to track and analyse disease trends and patterns to guide public health policy and treatments. In this paper, we carry out a sequential cross-sectional study to produce reliable weekly COVID-19 death rates (deaths out of cases) for South East Asian countries for the calendar years 2020, 2021, and 2022. The main objectives of this study are to characterise the trends and patterns of COVID-19 death rates in South East Asian countries through time, and to compare COVID-19 rates among countries and regions in South East Asia. Our raw data are (daily) case and death counts acquired from "Our World in Data", which, for some countries and time periods, suffer from sparsity (zero or small counts) and therefore require a modelling approach in which information is adaptively borrowed from the overall dataset where needed. A sequential cross-sectional design is therefore utilised, examining the data week by week across all countries. Methodologically, this is achieved through a two-stage random-effect shrinkage approach, with estimation facilitated by nonparametric maximum likelihood.
... To check whether the presumptions of Ordinary Least Squares hold true, diagnostic tests were performed on the models, including tests for autocorrelation, heteroscedasticity, and multicollinearity. According to McCullagh (2018), the models should be linear in the variables, the independent variables should be uncorrelated with the error term, and the error term should have zero mean. ...
Article
Full-text available
Human capital investment has a crucial role in economic growth, and as such, it has been regarded as a significant aspect of government spending. The Gini index averaged 41.6 percent in 2018, higher than the 20% level generally recognized as representing near-perfect equality, suggesting that Kenya has been suffering from high income disparity. There has been a widespread belief that income inequality and human capital investment are mutually exclusive. The theoretical and empirical approaches in the literature provide mixed findings on the relationship. From 1990 to 2019, this study examined the effect of human capital investment on income inequality in Kenya while adjusting for interest rates and GDP per capita. The study adopted a causal research design to determine whether a cause-and-effect association between the variables occurs. The time series data were subjected to diagnostic tests to ensure the presumptions of ordinary least squares held. Health expenditure was found to have a negative and statistically significant effect on income inequality after controlling for the interest rate and GDP per capita. After accounting for changes in interest rates and GDP per capita, the results show that education investment has a negative and statistically insignificant effect on income inequality. The human development index was found to have a negative and statistically significant effect on income disparity, which was verified by a robustness check. An inverted U was found using the Kuznets test, which was performed to broaden the scope of the research but yielded an insignificant result. The study recommends the formulation and implementation of policies that adhere to the Abuja Declaration on Health, which requires that 15% of government expenditure be allocated to health. The study also recommends strict adherence to the 100% transition from primary to post-primary education. The study's conclusions are pertinent to the development and implementation of successful policies that encourage human capital investment, resulting in a decrease in Kenya's levels of income inequality.
... The generalized estimating equations (GEE) were developed using the quasi-likelihood function [16]. To apply this function, certain assumptions about the distribution of the dependent variable must be respected. ...
Conference Paper
Full-text available
Analyzing the relationship between vehicle-pedestrian crashes and the various characteristics of the infrastructure and its surroundings is an essential step in identifying measures capable of promoting road safety. This work studies the influence of variables associated with the built environment, the pedestrian infrastructure, and the road infrastructure (e.g., land use, sidewalk width, and carriageway width) on the number of pedestrian crashes; its conclusions can serve as a basis for policies aimed at reducing the number of road accidents. The World Health Organization (WHO) reports that road traffic causes around 1.2 million deaths worldwide every year [1]. In Portugal, according to the 2017 Annual Road Accident Report of the National Road Safety Authority (ANSR), pedestrians are the second type of road user with the highest percentage of road deaths, around 22%, corresponding to 2 deaths and 7 serious injuries per 100 victims [2]. As for the types of areas and locations where pedestrian accidents occur, according to the European Commission, pedestrians face the greatest risk on urban roads, given that 69% of pedestrian deaths occur within such areas [3]. All of this points to a clear need for research into the most influential causes and factors in pedestrian accidents within urban areas. Only with this knowledge can policies aimed at reducing the number of pedestrian accidents be designed, thereby promoting clean modes of transport [4, 5, 6]. In the search for solutions to the traffic accidents of vulnerable road users such as pedestrians, a number of studies have already been carried out in the field of road safety. These works show that the factors directly related to risk exposure, accident frequency, and accident severity are the pedestrian traffic volume, the motorized traffic volume, and its operating speeds [7, 8, 9]. However, safety problems in pedestrian travel can often be related to imbalances between the design and the use of public space [10]. Recent studies show that road safety can be improved by treating public spaces, namely the pedestrian infrastructure, the road infrastructure, and the built environment, given that not all environments are
... McCullagh and Nelder showed that the logit model belongs to the family of generalized linear models [47] (see also [48]) when the response variables are dichotomous or binomial. They also developed the statistical details of the model, including the estimation of its parameters using the reweighted iterative least-squares method, its goodness-of-fit forms, and its hypothesis tests, among others. ...
Article
Full-text available
This article presents the results of a study on opinions on the elements and spaces of the historical urban landscape in Ibarra, Ecuador. This research aimed to propose an objective way of interpreting historical landscapes based on the opinions of people who frequent those places. Our hypothesis was that personal characteristics (e.g., age, gender, educational level, and frequency of visits) condition people’s judgments of urban landscapes, and we aimed to establish which of these characteristics were the most influential. A survey was conducted in the place of study, and passersby were asked to mention three elements and spaces that they liked or disliked. The methodology had two parts: a descriptive statistical analysis that was used to locate each point on a map and a logistic regression model to study the relationships between people’s opinions and their personal characteristics. The results show that (1) it was possible to demonstrate the elements and spaces that were liked and disliked in proportion graphs and planimetry and (2) that an explanatory analysis of opinions could be carried out using a logistic regression model to study significant characteristics. We found that the frequency of visits was the most significant characteristic for the elements and spaces that were disliked. We also concluded that the results of this study could provide objective tools for obtaining the opinions of people and combining them with planimetry. Additionally, the results could be used to establish priorities for urban authorities regarding improvements and interventions for elements and spaces that people like or dislike.
... Various statistical procedures have been developed to overcome the over-dispersion problem in the last several years. A common approach is to apply the Poisson quasi-likelihood (McCullagh and Nelder 2019), specifying how the mean depends on the explanatory variables and writing the variance as a multiplicative constant times the mean. Besides, the negative binomial model is another valuable technique for over-dispersion. ...
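The two variance specifications mentioned here are usually written as

quasi-Poisson: Var(Y) = φ µ,  negative binomial: Var(Y) = µ + α µ²,

with dispersion parameters φ > 1 and α > 0 capturing the extra-Poisson variability.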
Article
Full-text available
Count data in environmental epidemiology or ecology often display substantial over-dispersion, and failing to account for the over-dispersion could result in biased estimates and underestimated standard errors. This study develops a new generalized linear model family to model over-dispersed count data by assuming that the response variable follows the discrete Lindley distribution. An iterative weighted least squares procedure is developed to fit the model. Furthermore, asymptotic properties of the estimators and goodness-of-fit statistics are also derived. Lastly, some simulation studies and empirical data applications are carried out, and the generalized discrete Lindley linear model shows better performance than the Poisson distribution model.
... To assess the effect of feeding-site characteristics on scavenger behaviour unrelated to competition or predation within the guild, we used modelling in the linear family to test whether one particular independent variable changes the relationship of another independent variable with the response. We used the backward elimination procedure (with p = 0.05 as a threshold) to build and compare sets of generalised linear mixed models (GLMM) testing the effect on each scavenger species, with the camera site fitted as a random factor and (1) season, (2) cause of death, (3) consumption level, (4) vegetation density, (5) visibility, (6) distance to human disturbance, and (7) snow presence as fixed effects [31]. We used GLMM to enable the modelling of variables measured at multiple time scales with an unbalanced design. ...
Article
Full-text available
Scavenging guilds often have several trophic levels with varying dominance and intra-guild predation, competition, and interaction. Apex predators can control subordinate predators by limiting their numbers and affecting behaviour but also supply a continuous food source by abandoning carcasses. Camera traps monitored the scavenger guild in Alpe di Catenaia, Tuscan Apennine, for three years to determine intraguild interactions and the behaviour response. Wild boar visited most feeding sites but only scavenged in 1.4% of their visits. Red fox was the most frequent scavenger, traded vigilance and feeding equally, and selected low vegetation density, while marten invested more in feeding than vigilance. Marten was the prime follower, appearing within the shortest time after another scavenger had left the site. Red fox occasionally looked upwards, possibly to detect birds of prey. Badger showed scarcely any vigilance, did not feed much on carcasses but scent-marked abundantly. Wolves showed the highest vigilance in proportion to feeding at carcasses among the scavengers. Sites with good visibility were selected by all scavengers except martens who selected poor visibility and new moon illumination. Scavengers were mostly nocturnal, showed weak responses to twilight hours or lunar illumination, and all but red fox avoided human disturbance areas.
... In this article, the classification models used in the analysis of breast cancer are presented [16]. In addition, metrics are provided to evaluate the performance of the models. The confusion matrix is used to analyze the performance of the classification models, and the sample is divided into four classes: true positive (TP), false positive (FP), true negative (TN), and false negative (FN) [20]. These metrics make it possible to evaluate the ability of the models to correctly predict the diagnostic classes of breast cancer. ...
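The metrics derived from those four cells can be computed directly (Python sketch; the 0/1 label arrays are an assumption for illustration):

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    """Accuracy, precision, and recall from the confusion-matrix cells."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "accuracy":  (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall":    tp / (tp + fn) if tp + fn else 0.0,
    }
```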
Preprint
Cancer is a tumor that affects people worldwide, with a higher incidence in females but not excluding males. It ranks among the top five deadliest types of cancer and is particularly prevalent in less developed countries with deficient healthcare programs. Finding the best algorithm for effective breast cancer prediction with minimal error is crucial. In this scientific article, we employed the SMOTE method in conjunction with the R package Shiny to enhance the algorithms and improve prediction accuracy. We classified the tumor types as benign and malignant (B/M). Various algorithms were analyzed using a Kaggle dataset, and our study identified logistic regression as the superior algorithm. We evaluated algorithm performance using confusion matrices to visualize results and the ROC curve to obtain a comprehensive measure of performance. Additionally, we calculated precision by dividing the number of correct predictions by the total number of predictions.
Keywords: breast cancer, SMOTE, benign, malignant.
... Several mathematical or statistical models were developed for predicting malaria case incidence. Generalized Linear Models (GLM) [2][3][4] were used in the literature. Examples of GLM include the Poisson regression developed first by Nelder and Wedderburn [5], the negative binomial (NB) regression [3], the quasi-Poisson regression [5] and the zero-inflated regression [6]. ...
Article
Full-text available
Affecting millions of individuals yearly, malaria is one of the most dangerous and deadly tropical diseases. It is a major global public health problem, with an alarming spread of the parasite transmitted by mosquito (Anopheles). Various studies have emerged that construct mathematical and statistical models for malaria incidence forecasting. In this study, we formulate a generalized linear model based on Poisson and negative binomial regression models for forecasting malaria incidence, taking into account climatic variables (such as the monthly rainfall, average temperature, and relative humidity), other predictor variables (the insecticide-treated bed-nets (ITNs) distribution and Artemisinin-based combination therapy (ACT)), and the history of malaria incidence in Dakar, Fatick, and Kedougou, three different endemic regions of Senegal. A forecasting algorithm is developed by taking the meteorological explanatory variable X_j at time t − ℓ_j, where t is the observation time and ℓ_j is the lag in X_j that maximizes its correlation with the malaria incidence. We saturated the rainfall variable in order to reduce over-forecasting. The results of this study show that the Poisson regression model is more adequate than the negative binomial regression model for accurately forecasting malaria incidence while taking into account some explanatory variables. Applying the saturation where over-forecasting was observed noticeably increases the quality of the forecasts.
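The lag-selection rule described in this abstract can be sketched as follows (Python/NumPy; assumes equally spaced, e.g. monthly, series of equal length):

```python
import numpy as np

def best_lag(x, y, max_lag=12):
    """Return the lag l maximizing corr(x_{t-l}, y_t), per the abstract."""
    best, best_r = 0, -np.inf
    for lag in range(max_lag + 1):
        x_lagged = x[:len(x) - lag] if lag else x   # x at time t - lag
        r = np.corrcoef(x_lagged, y[lag:])[0, 1]    # aligned with y at time t
        if r > best_r:
            best, best_r = lag, r
    return best, best_r
```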
... Thus, comparisons to GM will be performed on the logit scale [24]. ...
Article
Full-text available
Monitoring of clinical trials is a fundamental process required by regulatory agencies. It assures the compliance of a center to the required regulations and the trial protocol. Traditionally, monitoring teams relied on extensive on-site visits and source data verification. However, this is costly, and the outcome is limited. Thus, central statistical monitoring (CSM) is an additional approach recently embraced by the International Council for Harmonisation (ICH) to detect problematic or erroneous data by using visualizations and statistical control measures. Existing implementations have been primarily focused on detecting inlier and outlier data. Other approaches include principal component analysis and distribution of the data. Here we focus on the utilization of comparisons of centers to the Grand mean for different model types and assumptions for common data types, such as binomial, ordinal, and continuous response variables. We implement the usage of multiple comparisons of single centers to the Grand mean of all centers. This approach is also available for various non-normal data types that are abundant in clinical trials. Further, using confidence intervals, an assessment of equivalence to the Grand mean can be applied. In a Monte Carlo simulation study, the applied statistical approaches have been investigated for their ability to control type I error and the assessment of their respective power for balanced and unbalanced designs which are common in registry data and clinical trials. Data from the German Multiple Sclerosis Registry (GMSR) including proportions of missing data, adverse events and disease severity scores were used to verify the results on Real-World-Data (RWD).
... The effects of different environmental parameters on the structure of the P. ferruginea population (adult density and adult mean size) were modelled using generalized linear models (GLMs; McCullagh and Nelder, 1989). Models were fitted using normal distributions and identity link functions to determine which of the parameters gave the best fit to the data (see Cayuela, 2010). ...
Article
Full-text available
The critically endangered species Patella ferruginea (Gastropoda, Patellidae), endemic to the western Mediterranean, has breeding populations in both natural and artificial habitats, the latter generally linked to port infrastructures. Over the past decade, the temporal change of this species’ population has been monitored (structure and density) using exhaustive censuses along Ceuta’s coast (Strait of Gibraltar), one of the few stronghold populations within the entire Mediterranean basin. This study focuses on the population dynamics of P. ferruginea in Ceuta and the environmental factors that affect the structure of this population, such as wave exposure, coastline heterogeneity, substratum roughness, substratum lithology, and chlorophyll-a concentration. Different potential negative interactions were also considered: angling, shell fishing, bathing in the intertidal, bathing near the intertidal, recreational boating, and temporary migrant campsites nearby. The results show that over the period 2011-2021 the estimated size of the P. ferruginea population increased by 200%, from 55,902 to 168,463 individuals (of which 131,776 are adults). The subpopulation with the greatest increase over these years was the one settled on dolomitic rip-raps inside Ceuta’s harbor, with an increase of 1,288%. The results of the present study indicate that Ceuta hosts the main population of this endangered species across its distributional range (western Mediterranean), making it a source population for the southern Iberian Peninsula whose preservation must be prioritized. Statistical modelling has shown that the adult density of P. ferruginea is positively influenced by coastal heterogeneity, habitat area, and substratum roughness, but negatively by vertical inclination, concentration of chlorophyll-a, and anthropogenic impact. These results also support the concept of "Artificial Marine Micro-Reserves" as a new area-based conservation measure in accordance with the IUCN guidelines, as these will contribute to setting up a network of source populations that promote genetic flow among populations, with eventual recolonization throughout the species' original distribution.
... Generalized Linear Models (GLM) were applied to assess the effect of biophysical-socioeconomic variables on CAFS tree species richness and tree density (Table 1). GLM is a flexible generalization of linear models that allows non-normality in the response variable and a non-linear relationship with explanatory variables (McCullagh & Nelder 1989). First, we graphically explored data variability to select an error distribution. ...
Preprint
Full-text available
Specialty coffee (SC) production enables farmers to earn premium prices for high-quality coffee. In Bolivia, some coffee-based agroforestry systems (CAFS) produce SC. However, while many Bolivian families’ livelihoods depend on coffee, studies on SC-producing CAFS remain scarce. Yet, research on tree diversity, CAFS management, and the factors affecting tree diversity can offer novel insights on agroforestry. We sampled 24 farms in three villages located in the Caranavi municipality. We analyzed the farms' main characteristics, biophysical variables, shade tree diversity, tree uses, management practices, and farmers’ socioeconomic background. Additionally, we surveyed 50 coffee farmers to collect information about their preferences for shade tree species and tree characteristics. Then, we investigated whether farmers’ socioeconomic and farm biophysical variables affect CAFS tree species richness and tree density using generalized linear models (GLM). Our results showed that the studied farms are small, certified properties (average: 2.6 ha) managed by families; we observed that CAFS provide farmers with valuable products besides SC. We identified 85 tree species that provide principally shade for coffee, as well as fruits, timber, lumber, and medicines. Moreover, farmers mostly prefer shade tree species that offer them useful and marketable products, while tree characteristics are preferred according to their benefits to coffee and farmers. GLM revealed that socioeconomic and biophysical variables related to management and landscape composition affect tree species richness and density. These results suggest that management and landscape are influential factors driving CAFS tree diversity. Hence, factors fostering farmers’ ability to manage their CAFS for biodiversity and household wellbeing should be promoted.
... Probably the single most relevant method for prediction problems is multivariate regression in the form of generalized linear models (GLMs) (Nelder and Wedderburn, 1972;McCullagh and Nelder, 1989). These offer a wide range of tools capable of automatically finding the best linear combination of predictor variables for most types of target variables. ...
Article
Full-text available
Recent years have seen a sharp increase in the generation and use of mineral trace-element data in geological research. This is largely due to the advent of rapid and affordable laser-ablation inductively coupled plasma mass-spectrometry (LA-ICP-MS). However, while much new data is being generated and published, relatively little work has been done to develop appropriate methods for its statistical analysis and interpretation, and indeed, experimental design. In fact, several characteristic features of the data require careful consideration during evaluation and interpretation to avoid biased results. In particular, the commonly hierarchical structure of mineral trace-element data and its compositional nature must be taken into account to generate meaningful and robust results. Unfortunately, these features are not appropriately considered in most current studies. This review provides a general overview of the special features of mineral trace-element data and their consequences for statistical analysis and interpretation, as well as study design. Specifically, it highlights the need for 1) the use of log- or log-ratio-transformations for statistical analysis, 2) careful preparation of the raw data prior to analysis, including an appropriate treatment of missing values, and 3) the application of statistical methods suited to hierarchical data structures. These points, as well as the consequences of neglecting them, are illustrated with relevant examples from ore geology. However, the general principles described in this review also apply to mineral trace-element datasets collected in other fields of the geosciences, as well as other fields dealing with compositional data.
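As a concrete illustration of point 1, a centred log-ratio (CLR) transformation, one common log-ratio approach for compositional data, might look like the following sketch; the element concentrations are invented, and zeros or missing values would need treatment before taking logarithms:

```python
# Hypothetical sketch of a centred log-ratio (CLR) transformation for
# compositional trace-element data. Values are invented; zeros or missing
# values must be handled before taking logarithms.
import numpy as np

def clr(composition):
    """Log of each part relative to the row-wise geometric mean."""
    composition = np.asarray(composition, dtype=float)
    log_gmean = np.mean(np.log(composition), axis=-1, keepdims=True)
    return np.log(composition) - log_gmean

# three samples, four trace elements (ppm); each row is one composition
x = np.array([[120.0, 35.0,  8.0, 2.0],
              [ 90.0, 50.0, 12.0, 1.5],
              [150.0, 20.0,  5.0, 3.0]])
print(clr(x))  # rows now sum to ~0 and live in unconstrained real space
```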
... Generalized linear models (GLMs) are often used for time series analyses [87] and it is not the aim of this paper to explain all their cases and formulas in detail. However, the idea of a GLM consists of three components: a random component specifying the distribution of the response, a systematic component (the linear predictor), and a link function connecting the mean of the response to the linear predictor. ...
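In standard textbook notation (not taken from the cited paper), these three components can be written as:

```latex
\begin{aligned}
Y_i &\sim \mathcal{D}(\mu_i) && \text{random component: an exponential-family distribution} \\
\eta_i &= \mathbf{x}_i^{\top}\boldsymbol{\beta} && \text{systematic component: the linear predictor} \\
g(\mu_i) &= \eta_i && \text{link function relating the mean to the linear predictor}
\end{aligned}
```

For a Poisson regression, for instance, g is the natural logarithm, so log(mu_i) equals the linear predictor.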
Article
Full-text available
The aim of this study was to investigate the potential impact of guided imagery (GI) on attentional control and cognitive performance and to explore the relationship between guided imagery, stress reduction, alpha brainwave activity, and attentional control using common cognitive performance tests. Executive function was assessed through the use of attentional control tests, including the anti-saccade, Stroop, and Go/No-go tasks. Participants underwent a guided imagery session while their brainwave activity was measured, followed by attentional control tests. The study’s outcomes provide fresh insights into the influence of guided imagery on brainwave activity, particularly in terms of attentional control. The findings suggest that guided imagery has the potential to enhance attentional control by augmenting the alpha power and reducing stress levels. Given the limited existing research on the specific impact of guided imagery on attentional control, the study’s findings carry notable significance.
... As the outcome data are counts (or equivalently rates), the statistical model should follow approximately a Poisson distribution (Cameron and Trivedi, 2013). We thus used generalized linear models (McCullagh and Nelder, 1989) in which the natural logarithm of rates (or counts with a fixed offset for the logarithm of the population) was regressed against a linear combination of covariables. The counts were over-dispersed, so we used quasi-likelihood Poisson and negative binomial models to account for extra Poisson variability. ...
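A minimal sketch of this strategy under simulated data (the covariate and population values are invented): a Poisson GLM with a log-population offset, refitted with a Pearson-based dispersion estimate (quasi-Poisson) and, alternatively, with a negative binomial family:

```python
# Hypothetical sketch: log-link count models with a fixed offset for
# log(population); data are simulated for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
ndvi = rng.uniform(0.2, 0.8, n)                      # covariable
pop = rng.integers(10_000, 500_000, size=n)          # county population
counts = rng.poisson(np.exp(-9 + 2.0 * ndvi) * pop)  # toy case counts

X = sm.add_constant(ndvi)
# scale="X2" estimates dispersion from the Pearson chi-square,
# i.e. a quasi-Poisson fit for extra-Poisson variability
quasi = sm.GLM(counts, X, family=sm.families.Poisson(),
               offset=np.log(pop)).fit(scale="X2")
# A negative binomial family is an alternative for over-dispersed counts
# (its dispersion parameter alpha is fixed at the default here)
negbin = sm.GLM(counts, X, family=sm.families.NegativeBinomial(),
                offset=np.log(pop)).fit()
print(quasi.params, negbin.params)
```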
Article
Full-text available
Lyme disease (LD) is the most common vector-borne illness in the USA. Incidence is related to specific environmental conditions such as temperature, metrics of land cover, and vertebrate species diversity. To determine whether greenness, as measured by the Normalized Difference Vegetation Index (NDVI), and other selected indices of land cover were associated with the incidence of LD in the northeastern USA, we conducted an ecological analysis of incidence rates of LD in counties of 15 “high” incidence states and the District of Columbia for 2000–2018. Annual counts of LD by county were obtained from the US Centers for Disease Control, and values of NDVI were acquired from the Moderate Resolution Imaging Spectroradiometer instrument aboard the Terra and Aqua satellites. County-specific values of human population density and area of land and water were obtained from the US Census. Using quasi-Poisson regression, multivariable associations were estimated between the incidence of LD and NDVI, land cover variables, human population density, and calendar year. We found that LD incidence increased by 7.1% per year (95% confidence interval: 6.8–8.2%). Land cover variables showed complex non-linear associations with incidence: average county-specific NDVI showed a “u-shaped” association, the standard deviation of NDVI showed a monotonic upward relationship, population density showed a decreasing trend, and areas of land and water showed “n-shaped” relationships. We found an interaction between the average and standard deviation of NDVI: within the highest average-NDVI category, increased standard deviation of NDVI showed the greatest increase in rates. These associations cannot be interpreted as causal but indicate that certain patterns of land cover may have the potential to increase exposure to infected ticks and thereby contribute indirectly to increased rates of LD. Public health interventions could make use of these results in informing people where risks may be high.
... We will also perform dynamic network analyses using two main types of models. The first involves summarizing network change from one time point to the next via a specific measure and then modeling it with standard models within the generalized linear model (GLM) framework [41]. The resulting logistic regression will allow us to estimate how respondent-level characteristics are related to the likelihood of relationship dissolution; fitting a conditional logistic model to the same outcome will allow exploration of the association between alter- and/or relationship-level characteristics and the likelihood of dissolution [42]. ...
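A minimal sketch of the first model type, assuming invented tie-level data (the variable names are illustrative, not from the protocol): a logistic regression for the probability that a relationship dissolves between waves:

```python
# Hypothetical sketch: logistic regression (a binomial GLM) for
# relationship dissolution; data are simulated for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 120
age = rng.integers(18, 60, n)
years_known = rng.exponential(3.0, n)
# in this toy example, shorter-lived ties dissolve more often
p = 1 / (1 + np.exp(-(0.5 - 0.4 * years_known)))
ties = pd.DataFrame({
    "dissolved": rng.binomial(1, p),   # 1 = tie gone at follow-up
    "age": age,
    "years_known": years_known,
})

model = smf.logit("dissolved ~ age + years_known", data=ties)
print(model.fit().summary())
```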
Article
Full-text available
Background: Sexual minority men (SMM) who engage in condomless anal sex and injection drug use are at increased risk for hepatitis C virus (HCV) infection. Additionally, studies have found racial disparities in HCV cases across the United States. However, very few epidemiological studies have examined factors associated with HCV infection in HIV-negative Black and Latino SMM. This paper describes the rationale, design, and methodology of a prospective epidemiological study to quantify the HCV prevalence and incidence and investigate the individual and environmental-level predictors of HCV infection among HIV-negative Black and Latino SMM in the Southern U.S. Methods: Beginning in September 2021, 400 Black and Latino SMM, aged 18 years and above, will be identified, recruited and retained over 12 months of follow-up from two study sites: the greater Washington, DC and Dallas, TX areas. After written informed consent, participants will undergo integrated HIV/STI testing, including HCV, HIV, syphilis, gonorrhea, and chlamydia. Subsequently, participants will complete a quantitative survey-including a social and sexual network inventory-and an exit interview to review test results and confirm participants' contact information. Individual, interpersonal, and environmental factors will be assessed at baseline and follow-up visits (6 and 12 months). The primary outcomes are HCV prevalence and incidence. Secondary outcomes are sexual behavior, substance use, and psychosocial health. Results: To date (March 2023) a total of 162 participants have completed baseline visits at the DC study site and 161 participants have completed baseline visits at the Texas study site. Conclusion: This study has several implications that will directly affect the health and wellness of Black and Latino SMM. Specifically, our results will inform more focused HCV clinical guidelines (i.e., effective strategies for HCV screening among Black/Latino SMM), intervention development, other prevention and treatment activities, and the development of patient assistance programs for the treatment of HCV among uninsured persons, especially in Deep South states that have yet to expand Medicaid.
... To evaluate the influence of fishing activities (distance of gillnet stakes from dugong feeding areas and total area of gillnet stakes) on the density of dugong feeding trails, a generalized linear model (McCullagh & Nelder 1989) with a negative binomial distribution and logarithmic link function was fitted. The explanatory variables included site, time, distance from gillnet stakes, and area of gillnet stakes, while the response variable was dugong feeding trail density. ...
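A rough sketch of such a model, using the predictors described above but invented data (toy trail counts per plot stand in for density):

```python
# Hypothetical sketch: negative binomial GLM for feeding-trail counts;
# predictor names follow the description above, data are simulated.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 60
plots = pd.DataFrame({
    "site":     rng.choice(["east", "west"], n),  # categorical predictor
    "dist_net": rng.uniform(50, 1500, n),         # metres to gillnet stakes
    "net_area": rng.uniform(0.05, 1.5, n),        # km^2 of gillnet stakes
})
mu = np.exp(1.5 + 0.001 * plots["dist_net"] - 0.8 * plots["net_area"])
plots["trails"] = rng.poisson(mu)                 # toy counts

# the NegativeBinomial family uses the log link by default
model = smf.glm("trails ~ site + dist_net + net_area", data=plots,
                family=sm.families.NegativeBinomial())
print(model.fit().summary())
```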
Article
Full-text available
Fishing provides an important food source for humans, but it also poses a threat to many marine ecosystems and species. Declines in wildlife populations due to fishing activities can remain undetected without effective monitoring methods that guide appropriate management actions. In this study, we combined the use of unmanned aerial vehicle-based imaging (drones) with machine learning to develop a monitoring method for identifying hotspots of dugong foraging based on their feeding trails and associated seagrass beds. We surveyed dugong hotspots to evaluate the influence of gillnet fishing activities on dugong feeding grounds at Inhaca Island, Southern Mozambique. The results showed that drones and machine learning can accurately identify and monitor dugong feeding trails and seagrass beds, with F1 scores of 80% and 93.3%, respectively. Feeding trails were observed in all surveyed months, with the highest density occurring in August (6,040 ± 4,678 trails/km²). There was a clear overlap of dugong foraging areas and gillnet fishing grounds, with a statistically significant positive correlation between fishing areas and the frequency of dugong feeding trails. Dugongs were found to feed mostly in Saco East, where the number of gillnet stakes was 3.7 times lower and the area covered by gillnets was 2.6 times lower than in Saco West. This study highlights the clear potential of drones and machine learning to study and monitor animal behavior in the wild, particularly in hotspots and remote areas. We encourage the establishment of effective management strategies to monitor and control the use of gillnets, thereby avoiding the accidental bycatch of dugongs.
Article
Full-text available
Seagrass is a globally vital marine resource that plays an essential role in combating climate change, protecting coastlines, ensuring food security, and enriching biodiversity. However, global climate change and human activities have led to dramatic environmental changes severely affecting seagrass growth and development. Therefore, it is crucial to understand accurately how environmental changes affect seagrass distribution. In this study, we selected the seagrass distribution area in Hainan, China, as the study area and proposed an ensemble model combining five machine learning models to predict the potential distribution of seagrass. Fifteen environmental variables were entered into the model, and the results showed that the ensemble model provided the highest accuracy (Area Under Curve (AUC) = 0.91). The environmental variables were then classified into regional and site explanations with the help of explainable artificial intelligence (XAI) methods. The difference in the contribution of regional and site environmental variables is demonstrated, and the model provides a more reasonable explanation at the site level. Shapley value (SHAP) and partial dependence plot (PDP) analyses explain the importance of environmental variables in the seagrass distribution model and the effect of interactions between environmental variables on the predictions, in effect opening the machine-learning black box. Further evidence that explainable artificial intelligence can account for the effects of environmental variables in seagrass distribution models will help improve environmental understanding in seagrass conservation.
Thesis
Background: In the United States (US), cigarette smoking and weight status have been considered major public health concerns in recent years due to a higher incidence of all-cause mortality and respiratory diseases such as asthma and COPD among those who smoked in the past 30 days or have an underweight or obese weight status than among those who do not smoke cigarettes or whose weight status is normal weight or overweight. The health burden associated with cigarette smoking and weight status in the US adult population has not been consistent across sociodemographic factors such as sex/gender, socioeconomic status (SES), and race/ethnicity. The associations among cigarette smoking, weight status and all-cause mortality; cigarette smoking, weight status and asthma; and cigarette smoking, weight status and COPD are not entirely understood, nor is how disparities may contribute to these associations. Each of this dissertation’s three aims addresses a specific research question about the associations among cigarette smoking and weight status with all-cause mortality, asthma, and COPD, as well as which factors may contribute to health disparities in these associations. The first aim sought to determine whether weight status was a mediator between cigarette smoking and all-cause mortality among adults with past 30 day smoking in the US. The second aim sought to determine whether weight status is a mediator between cigarette smoking and asthma, and cigarette smoking and COPD. The third aim sought to determine which factors were a source of health disparities in the associations among cigarette smoking and weight status with all-cause mortality, asthma, or COPD. Methods: The study population included US adults with past 30 day smoking, drawn from nationally representative samples of the National Health and Nutrition Examination Survey (NHANES). For all three aims, cigarette smoking, asthma, and COPD were self-reported, while weight status was measured on-site and all-cause mortality was collected through death records. The first and second studies included causal mediation analyses with weight status as the mediator of the associations between cigarette smoking and all-cause mortality, cigarette smoking and asthma, and cigarette smoking and COPD using the NHANES datasets from 2003-2018 and 2013-2018, respectively. For the third study, Structural Equation Models (SEM) were implemented to determine which factors related to health disparities may contribute to the associations among cigarette smoking, weight status, all-cause mortality, asthma, or COPD using the NHANES 2003-2018 dataset (for all-cause mortality) and the NHANES 2013-2018 dataset (for asthma and COPD). Results: In the mediation analysis between cigarette smoking and all-cause mortality with weight status as a mediator, the total effect (TE) for the model with only physiological factors was -1.94 (95% CI=-2.67, -0.04; p<0.001), with an average direct effect (DE) of -1.82 (95% CI=-2.51, -0.56; p<0.001) and an average indirect effect (IE) of -0.118 (95% CI=-0.19, -0.03; p=0.004). The TE for the model adjusted for physiological and sociodemographic factors was -1.54 (95% CI=-2.20, 0.01; p=0.048), with an average DE of -1.49 (95% CI=-2.18, -0.01; p=0.048) and an average IE of -0.049 (95% CI=-0.052, 0.02; p=0.518).
For the mediation analysis between cigarette smoking and asthma, and cigarette smoking and COPD, with weight status as the mediator, the TE for asthma was 0.0009 (p=0.016), with an average DE of 0.0009 (p=0.016) and an average IE of 0.00003 (p=0.232). For COPD, the TE was 0.00166 (p<0.001), the average DE was 0.00174 (p<0.001), and the average IE was -0.00008 (p=0.46). The Prevalence Ratio (PR) of having asthma and COPD was 1.03 (95% CI=1.00, 1.06; p=0.1032) and 1.04 (95% CI=1.03, 1.05; p<0.001), respectively. For the third aim, sex/gender was a significant factor in the associations among cigarette smoking, weight status and all-cause mortality; cigarette smoking, weight status and asthma; and cigarette smoking, weight status, and COPD. Race/ethnicity was significant only in the associations of cigarette smoking, weight status, and all-cause mortality, and of cigarette smoking, weight status, and COPD among Hispanic Mexican and Non-Hispanic White individuals. Conclusions: Findings from this dissertation showed that weight status was not a mediator between cigarette smoking and all-cause mortality, cigarette smoking and asthma, or cigarette smoking and COPD when considering physiological and sociodemographic factors. The findings also indicated that sex/gender contributes to health disparities in these associations. Smoking cessation and harm reduction interventions to reduce the incidence of all-cause mortality, asthma, and COPD due to cigarette smoking should be tailored by sex/gender.
Article
Full-text available
Biodiversity loss in river ecosystems is much faster and more severe than in terrestrial systems, and spatial conservation and restoration plans are needed to halt this erosion. Reliable and highly resolved data on the state of and change in biodiversity and species distributions are critical for effective measures. However, high‐resolution maps of fish distribution remain limited for large riverine systems. Coupling data from global satellite sensors with broad‐scale environmental DNA (eDNA) and machine learning could enable rapid and precise mapping of the distribution of river organisms. Here, we investigated the potential for combining these methods using a fish eDNA dataset from 110 sites sampled along the full length of the Rhone River in Switzerland and France. Using Sentinel 2 and Landsat 8 images, we generated a set of ecological variables describing both the aquatic and the terrestrial habitats surrounding the river corridor. We combined these variables with eDNA‐based presence and absence data on 29 fish species and used three machine-learning models to assess environmental suitability for these species. Most models showed good performance, indicating that ecological variables derived from remote sensing can approximate the ecological determinants of fish species distributions, but water‐derived variables had stronger associations than the terrestrial variables surrounding the river. The species range mapping indicated a significant transition in species occupancy along the Rhone, from its source in the Swiss Alps to its outlet into the Mediterranean Sea in southern France. Our study demonstrates the feasibility of combining remote sensing and eDNA to map species distributions in a large river. This method can be expanded to any large river to support conservation schemes.
Article
Full-text available
Climate change has been associated with both latitudinal and elevational shifts in species’ ranges. The extent, however, to which climate change has driven recent range shifts alongside other putative drivers remains uncertain. Here, we use the changing distributions of 378 European breeding bird species over 30 years to explore the putative drivers of recent range dynamics, considering the effects of climate, land cover, other environmental variables, and species’ traits on the probability of local colonisation and extinction. On average, species shifted their ranges by 2.4 km/year. These shifts, however, were significantly different from expectations due to changing climate and land cover. We found that local colonisation and extinction events were influenced primarily by initial climate conditions and by species’ range traits. By contrast, changes in climate suitability over the period were less important. This highlights the limitations of using only climate and land cover when projecting future changes in species’ ranges and emphasises the need for integrative, multi-predictor approaches for more robust forecasting.
Article
Full-text available
Recently, two habitat-selection modelling methods have gained increasing relevance in the scientific literature: step selection functions (SSF) and MaxEnt. Despite their similarity, these methods are rarely used in the same context: the former is applied to models based on movement data, the latter in species distribution studies. Motivated by the difficulty SSF has in estimating convergent models, I compared the predictive accuracy of MaxEnt models fitted to movement data. As a case study, I used jaguar location data from five Latin American countries and built both types of models using climate and land-use data available from satellite imagery. I compared the performance of the two models by cross-validation, measuring the area under the curve (AUC) on the test dataset. The SSF models had a mean accuracy of 0.5510 ± 0.0147, compared with 0.7544 ± 0.0185 for the equivalent MaxEnt models. I attribute these differences, in part, to the difficulty that SSF models and conditional logistic regressions have in converging on their estimates. I therefore recommend MaxEnt models for predictive tasks such as the design of nature reserves or wildlife corridors.
Preprint
Mechanistic and correlative models are two types of species distribution models (SDMs). They each have distinct foci, conceptual foundations, and levels of dependency on data availability, leading to potentially different estimates of species’ ecological niches and distributions. Mechanistic SDMs integrate detailed biological processes, making it possible to account for species’ biotic interactions. Despite their assumed importance, biotic interactions remain uncommon in species distribution modeling. In this study, we applied an ensemble of multiple correlative SDMs, a mechanistic SDM of the focal species (prey) alone, and a mechanistic SDM of the predator-prey interactions to compare the predictions of correlative and mechanistic approaches and assess their relative strengths and limitations. The correlative and mechanistic approaches produced both considerable and subtle differences in their predictions for each aphid species, which calls for prior knowledge of species’ presence data or life histories. Our mechanistic SDMs allowed for the assessment of the relative significance of abiotic and biotic factors, along with their interactions, in determining species’ habitat suitability. Additionally, we predict that aphid habitat suitability decreases across continents due to the effect of predation. However, this decrease may be offset or enhanced by the interaction between predation and climate change in different regions. This suggests the necessity of accounting for biotic interactions and the interplay between abiotic and biotic factors in mechanistic approaches. Our research highlights the impact of model philosophies in SDM studies and underscores the importance of selecting a modeling approach in line with the study’s objectives. Furthermore, our study suggests that mechanistic SDMs could serve as a valuable complement for assessing the robustness of correlative SDM predictions.
Article
Full-text available
The threat of invasive species to biodiversity and ecosystem structure is exacerbated by the increasingly concerning outlook of predicted climate change and other human influences. Developing preventative management strategies for invasive plant species before they establish is crucial for effective management. To examine how climate change may impact habitat suitability, we modeled the current and future habitat suitability of two terrestrial species, Geranium lucidum and Pilosella officinarum, and two aquatic species, Butomus umbellatus and Pontederia crassipes, that are relatively new invasive plant species regionally and are currently spreading in the Pacific Northwest (PNW, North America), a region of unique natural areas, vibrant economic activity, and an increasing human population. Using North American presence records, downscaled climate variables, and human influence data, we developed an ensemble model of six algorithms to predict potential habitat suitability under current conditions and projected climate scenarios RCP 4.5, 7.0, and 8.5 for 2050 and 2080. One terrestrial species (P. officinarum) showed declining habitat suitability in future climate scenarios (contracted distribution), while the other terrestrial species (G. lucidum) showed increased suitability over much of the region (expanded distribution overall). The two aquatic species were predicted to have only moderately increased suitability, suggesting aquatic plant species may be less impacted by climate change. Our research provides a template for regional-scale modelling of invasive species of concern, thus assisting local land managers and practitioners in informing current and future management strategies and in prioritizing limited available resources for species with expanding ranges.
Chapter
Given the rise in loan defaults, especially after the onset of the COVID-19 pandemic, it is necessary to predict whether customers might default on a loan for risk management. This paper proposes an early warning system architecture using anomaly detection, motivated by the unbalanced nature of loan default data in the real world. Most customers do not default on their loans; only a tiny percentage do, resulting in an unbalanced dataset. We aim to evaluate potential anomaly detection methods for their suitability in handling unbalanced datasets. We conduct a comparative study of different classification and anomaly detection approaches on a balanced and an unbalanced dataset. The classification algorithms compared are logistic regression and stochastic gradient descent classification. The anomaly detection methods are isolation forest and angle-based outlier detection (ABOD). We compare them using standard evaluation metrics such as accuracy, precision, recall, F1 score, training and prediction time, and area under the receiver operating characteristic (ROC) curve. The results show that these anomaly detection methods, particularly isolation forest, perform significantly better on unbalanced loan default data and are more suitable for real-world applications. Keywords: Anomaly detection, Unbalanced dataset, Early warning system, Loan default
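A minimal sketch of the anomaly-detection idea with an isolation forest, under invented features and an assumed ~1% default rate (not the chapter's actual data or tuning):

```python
# Hypothetical sketch: scoring rare loan defaults as anomalies with an
# isolation forest. Features and contamination rate are illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 990 "normal" customers and 10 outlier-like defaulters; two toy
# features, e.g. debt-to-income ratio and annual income
normal = rng.normal(loc=[0.3, 50_000], scale=[0.1, 10_000], size=(990, 2))
defaults = rng.normal(loc=[0.9, 5_000], scale=[0.05, 2_000], size=(10, 2))
X = np.vstack([normal, defaults])

# contamination = expected fraction of anomalies (here ~1%)
clf = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = clf.predict(X)  # -1 = anomaly, 1 = normal
print((labels == -1).sum(), "accounts flagged for review")
```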
Article
Full-text available
Forecasting habitat suitability and connectivity can be central both to controlling the range expansion of invasive species and to promoting native species conservation, especially under changing climate conditions. This study aimed to identify and prioritize areas in Spain in which to control the expansion of one of the most harmful invasive species in Europe, the American mink, while conserving its counterpart, the endangered European mink, under current and future conditions. We used ensemble habitat suitability and dynamic connectivity models to predict species ranges and movement routes considering likely climate change under three emission scenarios. Then, using habitat availability metrics, we prioritized areas for invasive mink control and native mink conservation and classified them into different management zones reflecting the overlap between species and the threat from American to European minks. Results suggest that both species are likely to experience declines in habitat and connectivity under climate change scenarios, with significantly larger declines by the end of the century for European minks (72 and 80%, respectively) than for American minks (41 and 32%). Priority areas for management of both species varied over time and across emission scenarios, with a general shift in priority habitat towards the north-east of the study area. Our findings demonstrate how habitat suitability and dynamic connectivity approaches can guide long-term management strategies to control invasive species and conserve native species while accounting for likely landscape changes. The simultaneous study of invasive and native species can support prioritized management action and inform management planning of the intensity, extent, and techniques of intervention depending on the overlap between species.
Article
This research addresses one of the most important and widely used nonlinear regression models for statistical applications, the binary logistic regression model, and estimates its parameters using the weighted least squares method. In the applied part, the model was used to fit data on patients with heart disease. By comparing the causes of the actual deaths with the causes of the estimated deaths, the model was found to be suitable for this type of data, smoking was identified as the leading cause of death, and the weighted least squares estimation (WLSE) method proved accurate in estimating the model parameters.
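As a rough illustration of the weighted least squares idea, logistic-regression parameters can be estimated by iteratively reweighted least squares (IRLS); the sketch below uses invented data and is not the paper's implementation:

```python
# Minimal IRLS sketch for a binary logistic model: each iteration solves
# a weighted least squares problem. Data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + x
true_beta = np.array([-0.5, 1.2])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))

beta = np.zeros(2)
for _ in range(25):                      # IRLS iterations
    mu = 1 / (1 + np.exp(-X @ beta))     # fitted probabilities
    W = mu * (1 - mu)                    # weights = Var(y_i)
    z = X @ beta + (y - mu) / W          # working response
    # weighted least squares step: solve (X'WX) beta = X'Wz
    XtW = X.T * W
    beta = np.linalg.solve(XtW @ X, XtW @ z)
print(beta)  # should be close to true_beta
```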
Article
Full-text available
Most studies on dogs’ cognitive skills in understanding human communication have been conducted on pet dogs, effectively making them the model for the species. However, pet dogs are just a minor and particular sample of the world’s total dog population, which would instead be better represented by free-ranging dogs. Since free-ranging dogs still face the selective forces of the domestication process, they represent an important study subject for investigating the effect that this process has had on dogs’ behavior and cognition. Although only a few studies on free-ranging dogs (specifically village dogs) have been conducted so far, the results are intriguing. In fact, village dogs seem to place a high value on social contact with humans and understand some aspects of human communication. In this study we aimed to investigate village dogs’ ability to understand a subtle human communicative cue, human facial expressions, and compared them with pet dogs, who have already provided evidence of this social skill. We tested whether subjects were able to distinguish between neutral, happy, and angry human facial expressions in a test mimicking a potential real-life situation, where the experimenter repeatedly performed one facial expression while eating some food and ultimately dropped it on the ground. We found evidence that village dogs, like pet dogs, can distinguish between subtle human communicative cues, since they showed a higher frequency of aversive gazes (looking away) in the angry condition than in the happy condition. However, we did not find other behavioral effects of the different conditions, likely due to the low intensity of the emotional expressions performed. We suggest that village dogs’ ability to distinguish between human facial expressions could provide them with an advantage for surviving in a human-dominated environment.
Conference Paper
Full-text available
Decision tree classifiers are widely used in machine learning due to their interpretability and versatility. However, they suffer from limitations such as overfitting, lack of interpretability, and suboptimal performance on complex datasets. In this paper, we propose EnhancedTree+, a novel approach to address these limitations and enhance the effectiveness of decision tree classifiers. EnhancedTree+ incorporates advanced splitting criteria, ensemble techniques, and pruning mechanisms to improve accuracy, interpretability, and the handling of complex datasets. Extensive experimentation and performance evaluations demonstrate the superiority of EnhancedTree+ over traditional approaches. The proposed approach achieves higher accuracy, provides more meaningful insights into the decision-making process, and exhibits robustness in handling diverse data characteristics. This research contributes to the advancement of decision tree classifiers and their practical applications in various domains.