Book

Classification and Regression Trees (CART)

Authors:
... We tested several methods and found similar results regardless of the variable importance metric used. Due to computational considerations, we implement a node purity metric, sometimes called recursive partitioning, as described in [53,54]. ...
... In the following explanations for node purity variable importance, it is perhaps useful to imagine the scenario where the outcomes Z_i take the values 0 and 1. When each constituent decision tree is formed in producing the random forest model, nodes are split based on some impurity metric relating to the outcomes Z_i [53]. For a given node, A, node impurity is defined as ...
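The truncated definition above typically refers to the Gini index. As a hedged illustration of that idea (a minimal sketch on toy data, not the implementation described in [53,54]), the impurity of a node with 0/1 outcomes is 2p(1−p), and node-purity importance accumulates the impurity decrease of every split made on a given variable:

```python
import numpy as np

def gini_impurity(z):
    """Gini impurity of a node whose outcomes z take values 0/1."""
    p = np.mean(z)                      # fraction of 1s in the node
    return 2.0 * p * (1.0 - p)          # equals 1 - p^2 - (1 - p)^2

def impurity_decrease(z_parent, z_left, z_right):
    """Weighted decrease in impurity produced by one binary split."""
    n, nl, nr = len(z_parent), len(z_left), len(z_right)
    return (gini_impurity(z_parent)
            - (nl / n) * gini_impurity(z_left)
            - (nr / n) * gini_impurity(z_right))

# toy node: 6 zeros and 4 ones, split into two child nodes
parent = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:5], parent[5:]
print(gini_impurity(parent), impurity_decrease(parent, left, right))
```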
... The technical details are left to more complete works on random forests, e.g. see [53], and Appendix B. ...
Article
Full-text available
Tumor microenvironments (TMEs) contain vast amounts of information on a patient’s cancer through their cellular composition and the spatial distribution of tumor cells and immune cell populations. Exploring variations in TMEs between patient groups, as well as determining the extent to which this information can predict outcomes such as patient survival or treatment success with emerging immunotherapies, is of great interest. Moreover, in the face of a large number of cell interactions to consider, we often wish to identify specific interactions that are useful in making such predictions. We present an approach to achieve these goals based on summarizing spatial relationships in the TME using spatial K functions, and then applying functional data analysis and random forest models to both predict outcomes of interest and identify important spatial relationships. This approach is shown in simulation experiments to be effective at both identifying important spatial interactions and controlling the false discovery rate. We further used the proposed approach to interrogate two real data sets of Multiplexed Ion Beam Images of TMEs in triple negative breast cancer and lung cancer patients. The methods proposed are publicly available in a companion R package funkycells.
... tree implementations were proposed over the years. For example, ID3 (Quinlan, 1986), CART (Li et al., 1984), C4.5/C5.0 (Quinlan, 2004, 2014), to name a few. ...
... Here, there is a total of |S_j| = 2^(|Q_j|−1) − 1 possible binary splits. However, it is easy to show that one can order the categories by the corresponding mean of their response variables, and only consider the splits along this ordered list (Li et al., 1984). This leads to a total of |Q_j| − 1 candidate splits. ...
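As a hedged sketch of the trick described above (toy data, not the cited implementation), the following enumerates only the |Q_j| − 1 splits along the list of categories ordered by mean response, instead of all 2^(|Q_j|−1) − 1 subsets:

```python
import numpy as np

def ordered_category_splits(x_cat, y):
    """Candidate binary splits of a categorical feature, following the CART
    trick: sort categories by their mean response and cut the sorted list.
    Returns |Q_j| - 1 candidate (left, right) pairs instead of
    2**(|Q_j| - 1) - 1 arbitrary subsets."""
    cats = np.unique(x_cat)
    means = {c: y[x_cat == c].mean() for c in cats}
    ordered = sorted(cats, key=lambda c: means[c])
    return [(set(ordered[:k]), set(ordered[k:])) for k in range(1, len(ordered))]

# toy example with 4 categories -> only 3 candidate splits to evaluate
x = np.array(["a", "a", "b", "b", "c", "c", "d", "d"])
y = np.array([1.0, 2.0, 9.0, 8.0, 4.0, 5.0, 0.0, 1.0])
for left, right in ordered_category_splits(x, y):
    print(left, "|", right)
```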
... It is well-known that large trees tend to overfit the data (high variance and low bias) while smaller trees might not capture all the relationships between the features (high bias and low variance). A popular solution is cross-validated pruning of the tree (Li et al., 1984). ...
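A hedged sketch of such cross-validated pruning, using scikit-learn's cost-complexity pruning path as a stand-in for the procedure in the cited reference (the dataset and fold count are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate complexity parameters come from the cost-complexity pruning path;
# the alpha with the best cross-validated accuracy selects the pruned tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))
scores = [
    (cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                     X, y, cv=5).mean(), a)
    for a in alphas
]
best_score, best_alpha = max(scores)
print(f"best ccp_alpha={best_alpha:.5f}, cv accuracy={best_score:.3f}")
```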
Article
Full-text available
Multi-target learning (MTL) is a popular machine learning technique which considers simultaneous prediction of multiple targets. MTL schemes utilize a variety of methods, from traditional linear models to more contemporary deep neural networks. In this work we introduce a novel, highly interpretable, tree-based MTL scheme which exploits the correlation between the targets to obtain improved prediction accuracy. Our suggested scheme applies cross-validated splitting criterion to identify correlated targets at every node of the tree. This allows us to benefit from the correlation among the targets while avoiding overfitting. We demonstrate the performance of our proposed scheme in a variety of synthetic and real-world experiments, showing a significant improvement over alternative methods. An implementation of the proposed method is publicly available at the first author's webpage.
... Notably, the computational demands of the CHAID algorithm arise from the calculation of chi-square statistics for each potential split. In response to the need for managing both classification and regression tasks involving categorical and numerical data, and to address the challenges associated with CHAID, classification and regression trees (CART) was introduced (Breiman 1984). This algorithm utilizes the Gini index as a measure of impurity to establish splitting criteria. ...
... It also provides unbiased variable selection, enhancing its reliability and accuracy in modeling tasks (Kang et al. 2010; Lee 2021). CART was further developed by introducing a non-parametric method called quick unbiased efficient statistical tree (QUEST) (Breiman 1984). This algorithm is highly recognized for its ability to handle large data sets and its unbiased nature, as it does not make any assumptions about the underlying distribution of the data (Lee and Lee 2015; Song et al. 2020). ...
Article
Full-text available
This paper provides a comprehensive review of tree-based models and their application in condition assessment and prediction of water, wastewater, and sewer pipe failures. Tree-based models have gained significant attention in recent years due to their effectiveness in capturing complex relationships between parameters of systems and their ability to handle large data sets. This study explores a range of tree-based models, including decision trees and ensemble trees utilizing bagging, boosting, and stacking strategies. The paper thoroughly examines the strengths and limitations of these models, specifically in the context of assessing the pipes’ condition and predicting their failures. In most cases, tree-based algorithms outperformed other prevalent models. Random forest was found to be the most frequently used approach in this field. Moreover, the models successfully predicted the failures when augmented with a richer failure data set. Finally, it was identified that existing evaluation metrics might not necessarily be suitable for assessing the prediction models in the water and sewer networks.
... In this problem, using the fork-join system parameters, we trained classical ML [36,37] algorithms and a neural network to predict the sojourn time of the fork-join system for the case when a linear function approximates the tasks' PH intensities at the servers. We generated a synthetic dataset consisting of approximately 150 thousand records. ...
... We generated a synthetic dataset consisting of approximately 150 thousand records. In the experiment, we used classical tree-based algorithms, decision trees [36] and gradient boosting [37], as well as a neural network. During training, the dataset was divided into 80% for training, 15% for testing, and 5% for validation. ...
Article
Full-text available
This paper presents a study of fork–join systems. The fork–join system breaks down each customer into numerous tasks and processes them on separate servers. Once all tasks are finished, the customer is considered completed. This design enables the efficient handling of customers. The customers enter the system in a MAP flow. This helps create a more realistic and flexible representation of how customers arrive. It is important for modeling various real-life scenarios. Customers are divided into K ≥ 2 tasks and assigned to different subsystems. The number of tasks matches the number of subsystems. Each subsystem has a server that processes tasks, and a buffer that temporarily stores tasks waiting to be processed. The service time of a task by the k-th server follows a PH (phase-type) distribution with an irreducible representation (β_k, S_k), 1 ≤ k ≤ K. An analytical solution was derived for the case of K = 2 when the input MAP flow and service time follow a PH distribution. We have efficient algorithms to calculate the stationary distribution and performance characteristics of the fork–join system for this case. In general cases, this paper suggests using a combination of Monte Carlo and machine learning methods to study the performance of fork–join systems. In this paper, we present the results of our numerical experiments.
... According to Li et al. (1984), the decisions in decision tree-based classification are based on the tree structure created for different feature values during the training process. The parent node's sub-branch is selected on the basis of the matching value in the feature vector. ...
Article
Full-text available
Statistics across different countries point to breast cancer being among severe cancers with a high mortality rate. Early detection is essential when it comes to reducing the severity and mortality of breast cancer. Researchers proposed many computer-aided diagnosis/detection (CAD) techniques for this purpose. Many perform well (over 90% classification accuracy, sensitivity, specificity, and F1 score); nevertheless, there is still room for improvement. This paper reviews literature related to breast cancer and the challenges faced by the research community. It discusses the common stages of breast cancer detection/diagnosis using CAD models along with deep learning and transfer learning (TL) methods. In recent studies, deep learning models outperformed handcrafted feature extraction and classification, and the semantic segmentation of ROI images achieved good results. An accuracy of up to 99.8% has been obtained using these techniques. Furthermore, using TL, researchers combine the power of both pre-trained deep learning-based networks and traditional feature extraction approaches.
... This section introduces the machine learning models to be assessed in the following sections and describes the method used to quantify their corresponding prediction accuracy. The decision tree repeatedly divides the dataset according to judgment conditions and finally yields a tree-structured classification or regression model [25]. This research employs a regression decision tree. ...
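A minimal sketch of a regression decision tree of the kind described above, using scikit-learn on synthetic data as a stand-in for the study's dataset (all parameters are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the study's data: the tree repeatedly splits the
# feature space on threshold conditions and predicts the mean of each leaf.
X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)
print("R^2 on held-out data:", round(reg.score(X_test, y_test), 3))
```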
Article
Full-text available
Time-domain numerical simulation is generally considered an accurate method to predict the mooring system performance, but it is also time and resource-consuming. This paper attempts to completely replace the time-domain numerical simulation with machine learning approaches, using a catenary anchor leg mooring (CALM) system design as an example. An adaptive sampling method is proposed to determine the dataset of various parameters in the CALM mooring system in order to train and validate the generated machine learning models. Reasonable prediction accuracy is achieved by the five assessed machine learning algorithms, namely random forest, extremely randomized trees, K-nearest neighbor, decision tree, and gradient boosting decision tree, among which random forest is found to perform the best if the sampling density is high enough.
... The RF is an ensemble learning algorithm that aggregates multiple CART (Classification and Regression Tree) decision trees [100]. Decision trees are objective methods that do not require any a priori assumptions for determining the unknown type; instead, predictions are obtained by voting or by computing the mean of the outputs, depending on whether the task is classification or regression. ...
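As a hedged illustration of the voting/mean aggregation described above (synthetic data, scikit-learn as a stand-in for the cited implementation), the forest's prediction can be approximately reproduced by combining its constituent trees by hand:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Aggregate the constituent CART-style trees by majority vote (for regression
# one would average instead). Note: scikit-learn itself averages the trees'
# predicted probabilities, so the hard vote usually, but not always, agrees.
votes = np.stack([tree.predict(X) for tree in rf.estimators_])   # one row per tree
hard_vote = (votes.mean(axis=0) > 0.5).astype(int)               # majority over trees
agreement = np.mean(hard_vote == rf.predict(X))
print(f"hard vote agrees with rf.predict on {agreement:.1%} of samples")
```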
Article
Full-text available
This study focuses on exploring the indication and importance of selenium (Se) and tellurium (Te) in distinguishing different genetic types of ore deposits. Traditional views suggest that dispersed elements are unable to form independent deposits, but are hosted within deposits of other elements as associated elements. Based on this, the study collected trace elemental data of pyrite, sphalerite, and chalcopyrite in various types of Se-Te bearing deposits. The optimal end-elements for distinguishing different genetic type deposits were recognized by principal component analysis (PCA) and the silhouette coefficient method, and discriminant diagrams were drawn. However, support vector machine (SVM) calculation of the decision boundary shows low accuracy, revealing the limitations in binary discriminant visualization for ore deposit type discrimination. Consequently, two machine learning algorithms, random forest (RF) and SVM, were used to construct ore genetic type classification models on the basis of trace elemental data for the three types of metal sulfides. The results indicate that the RF classification model for pyrite exhibits the best performance, achieving an accuracy of 94.5% and avoiding overfitting errors. In detail, according to the feature importance analysis, Se exhibits higher Shapley Additive Explanations (SHAP) values in volcanogenic massive sulfide (VMS) and epithermal deposits, especially the latter, where Se is the most crucial distinguishing element. By comparison, Te shows a significant contribution to distinguishing Carlin-type deposits. Conversely, in porphyry- and skarn-type deposits, the contributions of Se and Te were relatively lower. In conclusion, the application of machine learning methods provides a novel approach for ore genetic type classification and discrimination research, enabling more accurate identification of ore genetic types and contributing to the exploration and development of mineral resources.
... An alternative to multiplicative interaction methods, such as RAMP, is the popular family of tree-based interaction methods (Breiman et al., 1984; Breiman, 1996, 2001; Meinshausen, 2010). In particular, RF achieves robust and accurate prediction performance while mitigating overfitting and leveraging high-order interactions. ...
Article
Full-text available
Predictive modeling often ignores interaction effects among predictors in high-dimensional data because of analytical and computational challenges. Research in interaction selection has been galvanized along with methodological and computational advances. In this study, we aim to investigate the performance of two types of predictive algorithms that can perform interaction selection. Specifically, we compare the predictive performance and interaction selection accuracy of both penalty-based and tree-based predictive algorithms. Penalty-based algorithms included in our comparative study are the regularization path algorithm under the marginality principle (RAMP), the least absolute shrinkage and selection operator (LASSO), the smoothly clipped absolute deviation (SCAD), and the minimax concave penalty (MCP). The tree-based algorithms considered are random forest (RF) and iterative random forest (iRF). We evaluate the effectiveness of these algorithms under various regression and classification models with varying structures and dimensions. We assess predictive performance using the mean squared error for regression and accuracy, sensitivity, specificity, balanced accuracy, and F1 score for classification. We use interaction coverage to judge the algorithm’s efficacy for interaction selection. Our findings reveal that the effectiveness of the selected algorithms varies depending on the number of predictors (data dimension) and the structure of the data-generating model, i.e., linear or nonlinear, hierarchical or non-hierarchical. There was at least one scenario that favored each of the algorithms included in this study. However, from the general pattern, we are able to recommend one or more specific algorithm(s) for some specific scenarios. Our analysis helps clarify each algorithm’s strengths and limitations, offering guidance to researchers and data analysts in choosing an appropriate algorithm for their predictive modeling task based on their data structure.
... Decision tree model. Decision trees used in data mining are of two main types: a classification tree (the predicted result is the class identifier) and a regression tree (the predicted result is a real number) [28]. Decision trees split the space of objects according to some set of splitting rules. ...
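A minimal sketch contrasting the two tree types named above, using scikit-learn's text export to show the splitting rules (the datasets and depth are arbitrary illustrative choices, not those of the cited work):

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_text

# Classification tree: the prediction in each leaf is a class identifier.
Xc, yc = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(Xc, yc)
print(export_text(clf))

# Regression tree: the prediction in each leaf is a real number (the leaf mean).
Xr, yr = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(Xr, yr)
print(export_text(reg))
```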
Conference Paper
The comprehensive informatization of society makes an increasing amount of agricultural yield data and climate data available. Thus, it becomes possible to use the available data to forecast grain yields and grain prices, to analyze crop losses and more. Climatic factors play a decisive role in wheat yield fluctuations. The use of climate databases makes it possible to build yield prognostic models, which allow the future yield to be estimated in advance (3 months ahead). Pre-harvest yield forecasting can assist grain producers in making the necessary arrangements for storage and marketing of the crop. Such forecasts will also help farmers in planning the logistics of their business. In this work, the average decadal temperature values of April, May, and June and the monthly amounts of precipitation for six regions of the chernozem zone of Ukraine were chosen to study the impact of climate on the wheat yield. The task of this research is to assess the impact of climatic factors on detrended wheat yield values using machine learning techniques. The work uses an innovative approach, according to which detrended yield values are divided into two groups, labeled as "low yield" and "high yield". Five machine learning models were used as classifiers, which were adapted to the available data and demonstrated a classification accuracy of about 80% on test samples. The linear discriminant analysis and the logistic regression model are the most effective classifiers and provide 87% classification accuracy.
... The approach will utilize supervised machine learning techniques, and the data will be retrieved from the social media Twitter. The novel contribution of this research is comparing 9 machine learning algorithms, namely: Random Forest [13], Decision Tree [14], Naïve Bayes [15], XGB [16], LGBM [17], AdaBoost [18], Voting Classifier [19], K-Nearest Neighbor [20], Logistic regression [21]. The aim is to identify the optimal and most reliable sentiment analysis technique that can be used in real time and on demand to monitor public sentiment and its underlying concerns towards the candidates of the upcoming 2024 Indonesian presidential election during the dynamic political situation of the campaign period. ...
... Considering the high nonlinearity between aerosol loading, surface emission and satellite longwave radiances, we develop a machine learning retrieval approach using the Random Forest Regression (RFR) model (Dong et al., 2023; Hutengs & Vohland, 2016). We first apply the relative importance (defined as the Gini importance, Breiman et al., 1984) feature selection method to identify independent predictors and build a generalized model. The relative importance of each factor is shown in Figure S2 in Supporting Information S1, and the retrieval algorithm is trained between 14 selected predictors (as outlined in Table S1 in Supporting Information S1), encompassing six radiance bands, RH, OZO10, OZO50, OZO500, OZO850, SATAZI, SATZEN, TOPO, and the targeted variables (MODIS 550 nm AOD and ERA-5 SKT). ...
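As a hedged sketch of Gini-importance-based feature selection of the kind described above (synthetic data standing in for the satellite predictors; the 95% cutoff is an arbitrary illustrative threshold, not the cited study's criterion):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the radiance/ancillary predictors and the AOD target.
X, y = make_regression(n_samples=400, n_features=20, n_informative=6, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ is the impurity-based ("Gini") relative importance;
# keep the predictors that together account for 95% of the total importance.
order = np.argsort(rf.feature_importances_)[::-1]
cumulative = np.cumsum(rf.feature_importances_[order])
selected = order[: int(np.searchsorted(cumulative, 0.95) + 1)]
print("selected feature indices:", selected)
```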
Article
Full-text available
Aerosol remote sensing typically relies on reflected shortwave radiation and thus lacks nighttime retrievals. Here we made an original attempt to retrieve nighttime aerosol optical depth (AOD) by utilizing longwave measurements in the atmospheric window region from the Atmospheric InfraRed Sounder (AIRS) instrument. A machine‐learning based algorithm is developed using AIRS longwave radiance and auxiliary data as the input and AOD from Moderate Resolution Imaging Spectroradiometer (MODIS) as well as reanalysis surface temperature as the output. Independent validation indicates good agreement with lunar AOD derived from surface photometers. An overall increase in nighttime AOD compared to daytime is also uncovered, which is further corroborated by surface and space lidar measurements. The theoretical basis of the algorithm is further verified using radiative transfer simulations. Our approach substantially extends the potential of hyperspectral longwave measurements and offers valuable insights into nighttime aerosol properties.
... Decision trees find applications in data mining, medicine, finance, and marketing. Random Forest Classifier (RFC) [28]-[30]: used in a variety of application areas; particularly effective for classification and regression problems with large datasets. ...
Article
In recent years, the widespread availability of internet access has brought both advantages and disadvantages. Users now enjoy numerous benefits, including unlimited access to vast amounts of information and seamless communication with others. However, this accessibility also exposes users to various threats, including malicious software and deceptive practices, leading to victimization of many individuals. Common issues encountered include spam emails, fake websites, and phishing attempts. Given the essential nature of internet usage in contemporary society, the development of systems to protect users from such malicious activities has become imperative. Accordingly, this study utilized eight prominent machine learning algorithms to identify spam URLs using a large dataset. Since the dataset only contained URL information and spam classification, additional feature extractions such as URL length and the number of digits were necessary. The inclusion of such features enhances decision-making processes within the framework of machine learning, resulting in more efficient detection. As the effectiveness of feature extraction significantly impacts the results of the methods, the study initially conducted feature extraction and trained models based on the weight of features. This paper proposes a data correlated matrix approach for spam URL detection using machine learning algorithms. The distinctive aspect of this study lies in the feature extraction process applied to the dataset, aimed at discerning the most impactful features, and subsequently training models while considering the weighting of these features. The entire dataset was used without any reduction in data. Experimental findings indicate that tree-based machine learning algorithms yield superior results. Among all applied methods, the Random Forest approach achieved the highest success rate, with a detection rate of 96.33% for the non-spam class. Additionally, a combined and weighted calculation method yielded an accuracy of 94.16% for both spam and non-spam data.
... It is interpretable and can handle both numerical and categorical data. We use the CART (classification and regression tree) implementation of DT [11]. DT construction is a process of recursively partitioning a dataset into subsets, based on the values of the features. ...
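A hedged, from-scratch sketch of this recursive partitioning idea (greedy Gini splits on toy data; deliberately simplified and not the CART implementation of [11]):

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedy search for the (feature, threshold) pair with the lowest
    weighted child impurity, i.e. one step of recursive partitioning."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (j, t, score)
    return best

def grow(X, y, depth=0, max_depth=3):
    """Recursively partition the data until the node is pure or too deep."""
    if depth == max_depth or gini(y) == 0.0:
        return {"leaf": int(np.bincount(y).argmax())}
    j, t, _ = best_split(X, y)
    mask = X[:, j] <= t
    return {"feature": j, "threshold": float(t),
            "left": grow(X[mask], y[mask], depth + 1, max_depth),
            "right": grow(X[~mask], y[~mask], depth + 1, max_depth)}

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
print(grow(X, y))
```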
Article
Full-text available
Machine learning (ML) algorithms can handle complex genomic data and identify predictive patterns that may not be apparent through traditional statistical methods. They have become popular tools for medical applications including prediction, diagnosis or treatment of complex diseases like rheumatoid arthritis (RA). RA is an autoimmune disease in which genetic factors play a major role. Among the most important genetic factors predisposing to the development of this disease and serving as genetic markers are HLA-DRB and non-HLA genes single nucleotide polymorphisms (SNPs). Another marker of RA is the presence of anticitrullinated peptide antibodies (ACPA), which is correlated with severity of RA. We use genetic data of SNPs in four non-HLA genes (PTPN22, STAT4, TRAF1, CD40 and PADI4) to predict the occurrence of ACPA positive RA in the Polish population. This work is a comprehensive comparative analysis, wherein we assess and juxtapose various ML classifiers. Our evaluation encompasses a range of models, including logistic regression, k-nearest neighbors, naïve Bayes, decision tree, boosted trees, multilayer perceptron, and support vector machines. The top-performing models demonstrated closely matched levels of accuracy, each distinguished by its particular strengths. Among these, we highly recommend the use of a decision tree as the foremost choice, given its exceptional performance and interpretability. The sensitivity and specificity of the ML models are about 70%, which is satisfactory. In addition, we introduce a novel feature importance estimation method characterized by its transparent interpretability and global optimality. This method allows us to thoroughly explore all conceivable combinations of polymorphisms, enabling us to pinpoint those possessing the highest predictive power. Taken together, these findings suggest that non-HLA SNPs make it possible to determine the group of individuals more prone to developing rheumatoid arthritis and to further implement a more precise preventive approach.
... A decision tree comprises decision nodes, which segment data, and leaf nodes that determine the target value. DT regression uses this structure to predict numeric outcomes through iterative data partitioning [23,24]. The Random Forest (RF) algorithm employs the technique of bagging to generate a collection of decision trees. RF regression combines multiple decision trees that are trained on different subsets of data [25]. ...
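As a hedged illustration of the bagging idea described above (synthetic data; the number of trees and tree depth are arbitrary), each tree is fitted to its own bootstrap resample and the predictions are averaged:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
rng = np.random.default_rng(0)

# Bagging by hand: every tree sees a different bootstrap resample of the data,
# and the ensemble prediction is the mean of the individual tree predictions.
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))       # bootstrap sample with replacement
    trees.append(DecisionTreeRegressor(max_depth=6).fit(X[idx], y[idx]))

ensemble_pred = np.mean([t.predict(X) for t in trees], axis=0)
print("ensemble RMSE:", round(float(np.sqrt(np.mean((ensemble_pred - y) ** 2))), 2))
```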
... 17 MO algorithms included in the proposed MMOMML are ABC (Karaboga & Basturk, 2007), FA (Yang, 2010), GA (Goldberg, 1989; Holland, 1975), DE (Storn & Price, 1997), WOA (Mirjalili & Lewis, 2016), PSO (Kennedy, 2011), CS (Yang & Deb, 2009), TLBO (Sahu et al., 2015), FBI (Chou & Nguyen, 2020), FPA (Yang, 2012), RRA (Merrikh-Bayat, 2015), WCA (Eskandar et al., 2012), GWO (Mirjalili et al., 2014), SOS (Cheng & Prayogo, 2014), EFO (Abedinpourshotorban et al., 2016), GSA (Rashedi et al., 2009), and equilibrium optimizer (EO) (Faramarzi et al., 2020). The 15 ML models included in the proposed MMOMML are linear regression (Galton, 1886; Pearson, 1904), multivariate regression (Kendall, 1957), logistic regression (Conolly, 1958), multivariate adaptive regression splines (MARS) (Friedman, 1991), classification and regression tree (CART) (Gordon et al., 1984), least square support vector regression (LSSVR) (Suykens & Vandewalle, 1999), SVR (Cortes & Vapnik, 1995; Vapnik et al., 1996), ANN (McCulloch & Pitts, 1943; Rumelhart et al., 1986), ANFIS (Jang, 1993), RBFNN (Moody & Darken, 1989), random forest (RF) (Breiman, 2001), LogitBoost (Friedman et al., 2000), XGB (Chen & Guestrin, 2016), LightGBM (LGBM) (Ke et al., 2017), and AdaBoost (AB) (Freund & Schapire, 1997; Hastie et al., 2009). ...
Article
Full-text available
Machine learning (ML) presents a promising method for predicting mechanical properties in structural engineering, particularly within complex nonlinear structures under extreme conditions. Despite its potential, research has shown a disproportionate focus on concrete structures, leaving steel structures less explored. Furthermore, the prevalent combination of metaheuristic optimization (MO) and ML in existing studies is often subjective, pointing to a significant gap in identifying and leveraging more effective hybrid models. To bridge these gaps, this study introduces a novel system named the Multiple Metaheuristic Optimizers – Multiple Machine Learners (MMOMML) system, designed for predicting mechanical strength in steel structures. The MMOMML system amalgamates 17 MO algorithms with 15 ML techniques, generating 255 hybrid models, including numerous novel configurations not previously examined. With a user-friendly interface, MMOMML enables structural engineers to tackle inference challenges efficiently, regardless of their coding proficiency. This capability is convincingly demonstrated through two practical applications: steel beams’ shear strength and steel cellular beams’ elastic buckling. By offering a versatile and robust tool, the MMOMML system meets construction engineers’ and researchers’ practical and research needs, marking a significant advancement in the field.
... The surrogate model with the least RMSE and an acceptable training time is the best solution. In this work, the following models are considered and compared: Polynomial Regression (PR) [25], Gaussian Process Regression (GPR) [26], Regression Tree (RT) [27], and Support Vector Regression (SVR) [28]. ...
Article
Full-text available
Cable-stayed bridges have commonly been built for crossing large-span obstacles, such as rivers, valleys, and existing structures. Obtaining an optimum design for a cable-stayed bridge is challenging, due to the large number of design variables and design constraints that are typically nonlinear and usually conflict with each other. Therefore, it is a reasonable alternative to turn the large and complex optimization problem into two sub-problems, i.e., optimizing the internal force distribution by adjusting the cable prestressing forces, and optimizing the other sizing or geometrical parameters. However, conventional methods are lacking in efficiency when dealing with the problem of optimization of cable forces in the first sub-problem, under the circumstance that iteration between the two sub-problems is required. To address this, this paper presents a surrogate-model-assisted method to construct a cable forces predictor ahead of the structural optimization process, so that cable forces can be effectively predicted rather than optimized in each iterative round. Additionally, B-spline interpolation curve is adopted for variable condensation when sampling for the surrogate model. Finally, the structure optimization in the second sub-problem is performed by leveraging an optimization program based on particle swarm optimization method. The performance of the proposed framework is tested with a practical engineering application. Results show that the proposed method showcases good efficiency and accuracy. The theoretical raw material consumption of the towers and the cables is 32% lower than the original design.
... [Figure caption: The blue regions stand for the bandgap and the blue dot at k_1 is the top of the valence band. The bottom of the conduction band changes from the k_2 wavevector (blue point) for an indirect bandgap to the k_1 wavevector (red dots) for a direct bandgap.] Herein, we conduct a broad analysis using data science and interpretable Machine Learning methods such as Decision Trees (DT) [18] and Random Forests (RF) [19]. In particular, we perform a descriptive analysis using the VAX method [20], extracting Jumping Emerging Patterns (JEPs, descriptive logic rules) [21,22] from Machine Learning models, and then look for causal relations or insights to explain why a material has a direct or indirect band gap. ...
Article
Full-text available
Having a direct or indirect band gap can influence the potential applications of a semiconductor, for indirect band gap materials are usually not suitable for optoelectronic devices. Even though this is a fundamental property of semiconducting materials, discussed in textbooks, no unified theory exists to explain why a material has a direct or indirect band gap. Here we used an interpretable machine learning model, the multiVariate dAta eXplanation (VAX) method, to gather information from a dataset of materials extracted from the Materials Project. The dataset contains more than 10000 entries, and atomic properties such as the number of electrons, electronic affinity and orbital energies were used as features to build random forest models that successfully explain the directness of the band gaps. Our results indicate that symmetry is an important feature that dictates the target property, which is the reason why our analysis is made based on sub-groups with similar structures. These sub-groups include materials with zincblende, rocksalt, wurtzite, and perovskite structures. Besides the symmetry of the materials, the existence or not of d bands and the relative energy of atomic orbitals were found to be important in defining whether a material’s band gap is direct or indirect. In conclusion, interpretable machine learning methods such as VAX can be useful in obtaining physical interpretation from materials databases.
... The RF is useful for handling different types of predictor variables without prior data transformation or outlier elimination [35]. Random forests for regression are formed by growing a certain number of decision trees based on the CART algorithm [36]. Each regression tree is trained based on the generated dataset of randomly selected variables. ...
Article
Full-text available
Controlling groundwater table decline could mitigate land subsidence and induced environmental hazards in over-explored areas. Nevertheless, this becomes a challenge in the multi-layered porous system as (in)elastic deformation simultaneously occurs due to vast spatiotemporal variability in the groundwater table. In this study, SBAS-InSAR was used to estimate annual land deformation during 2017–2022 in a specific region of North China Plain, in which aquifers are composed of many layers of fine-grained compressible sediments and the groundwater table has experienced a prolonged decline. The random forest (RF) was applied to establish the nonlinear relationship between accumulated deformation and its potential driving factors, including the depth to the groundwater table (GWD) and its change rate, and the compressible sediment thickness. Results show that the marked subsidence and uplift co-exist in the region even though the groundwater table has risen widely since the South–North Water Diversion Project. The land subsidence is attributed to inelastic compaction of the thick compressible deposits in depression cone centers, where the GWD is over 40 m and 90 m in the shallow and deep aquifers, respectively. In contrast, the marked uplift is primarily attributed to fast rising of the groundwater table (e.g., −2.44 m/a). The RF predictions suggest that, to control the subsidence, the GWD should be less than 20 and 70 m in the shallow and deep aquifers, respectively, and the rising rate of the GWD should increase to 2–5 times of current rates in the depression cones. To mitigate the marked uplift, the rising rate of the GWD should reduce to 1/2–1/5 of the current rates in the shallow aquifers. The uneven deformations of sediments in the depression cone centers and uplift in their boundaries may exacerbate geohazards. Therefore, it is vital to implement appropriate governance of groundwater recovery in the multi-layered porous system.
... Its strength lies in its ability to translate complex regression relationships between dependent and independent variables into readily interpretable decision trees. Essentially, it partitions heterogeneous datasets into homogeneous subgroups based on a chosen target variable (Gordon et al., 1984). The algorithm builds a hierarchical model between covariates and soil properties, structured as a decision tree (Bittencourt & Clarke, 2003). ...
Article
Accurate estimation of particle size distribution across a large area is crucial for proper soil management and conservation, ensuring compatibility with capabilities and enabling better selection and adaptation of precision agricultural techniques. The study investigated the performance of tree-based models, ranging from simpler options like CART to sophisticated ones like XGBoost, in predicting soil texture over a wide geographic region. Models were constructed using remotely sensed plant and soil indexes as covariates. Variable selection employed the Boruta approach. Training and testing data for machine learning models consisted of particle size distribution results from 622 surface soil samples collected in southeastern Turkey. The XGBoost Clay model emerged as the most accurate predictor, with an R² value of 0.74. Its superiority was further underlined by a 21.36% relative improvement in XGBoost Clay RMSE compared to RF Clay and 44.5% compared to CART Clay. Similarly, the R² values for XGBoost Silt and XGBoost Sand models reached 0.71 and 0.75 in predicting sand and silt content, respectively. Among the considered covariates, the normalized ratio vegetation index and slope angle had the highest impact on clay content (21%), followed by topographic position index and simple ratio clay index (20%), while terrain ruggedness index had the least impact (18%). These results highlight the effectiveness of the Boruta approach in selecting an adequate number of variables for digital mapping, suggesting its potential as a viable option in this field. Furthermore, the findings of this study suggest that remote sensing data can effectively contribute to digital soil mapping, with tree-based model development leading to improved prediction performance.
Article
Full-text available
Forests provide crucial ecosystem services and are increasingly threatened by invasive plant species. The spread of these invasive species affects biodiversity and has become a trending topic due to its impact on endemic species. Therefore, it is imperative to implement conservation measures to protect native species, such as mapping and monitoring invasive plant species in the forest realm. Mapping understory herb invasive plant species within forest categories is challenging; for example, species such as Ageratum conyzoides and Cassia tora do not occur in distinct clusters, making them difficult to distinguish from the surrounding forest. In this paper, phenology plays a vital role in analysing both inter- and intra-species separability by examining temporal curves for different vegetation indices that capture plant growth during the green and senescence periods. Machine learning algorithms, including regression tree-based algorithms, decision tree-based algorithms, and probabilistic algorithms, were used to determine the most effective algorithm for pixel-based classification. The Support Vector Machine (SVM) classifier was the most effective method, with an overall accuracy of 90.28% and a kappa of 0.88. The findings indicate that machine learning algorithms remain effective for pixel-based classification of understory invasive plant species from the forest class. Thus, this study shows a technical method to distinguish invasive plant species from the forest class, which can help forest managers locate invasion sites to eradicate them and conserve native biodiversity.
Article
Full-text available
The statistical relationship between sensor signature features and lubricant solid particle contamination conditions in a spherical roller bearing has been investigated in this study. The influence of particle size and concentration of solid contaminants in the lubricant on the RMS parameter of time-domain acoustic emission, vibration, and sound sensor signals is examined. Machine learning algorithms are trained with time domain statistical features derived from sensor signatures to predict the lubricant conditions. Decision trees, bagging tree ensembles, and support vector machines are used to build ML models. Decision Tree models are built using classification and regression tree algorithms with three distinct split criteria, namely Gini, twoing, and maximum deviance. A bagged tree ensemble model is constructed using the decision tree as a base learner. In the support vector machine, the kernel trick is used to optimize the classification boundaries. Models built using acoustic emission signature features predict lubricant conditions with better accuracy compared to models constructed using sound and vibration signature features. A feature-level fusion approach is implemented by combining the vibration, sound, and acoustic emission features at the feature level to improve the prediction power of machine learning models. The bagged tree ensemble and support vector machine models, which are trained using fused features, predict lubricant conditions in spherical roller bearings with an accuracy of around 99%.
Article
Full-text available
Phytoplankton are the foundation of marine ecosystems and play a crucial role in determining the optical properties of seawater, which are critical for remote sensing applications. However, passive remote sensing techniques are limited to obtaining data from the near surface, and cannot provide information on the vertical distribution of the subsurface phytoplankton. In contrast, active LiDAR technology can provide detailed profiles of the subsurface phytoplankton layer (SPL). Nevertheless, the large amount of data generated by LiDAR brought a challenge, as traditional methods for SPL detection often require manual inspection. In this study, we investigated the application of supervised machine learning algorithms for the automatic recognition of SPL, with the aim of reducing the workload of manual detection. We evaluated five machine learning models—support vector machine (SVM), linear discriminant analysis (LDA), a neural network, decision trees, and RUSBoost—and measured their performance using metrics such as precision, recall, and F3 score. The study results suggest that RUSBoost outperforms the other algorithms, consistently achieving the highest F3 score in most of the test cases, with the neural network coming in second. To improve accuracy, RUSBoost is preferred, while the neural network is more advantageous due to its faster processing time. Additionally, we explored the spatial patterns and diurnal fluctuations of SPL captured by LiDAR. This study revealed a more pronounced presence of SPL at night during this experiment, thereby demonstrating the efficacy of LiDAR technology in the monitoring of the daily dynamics of subsurface phytoplankton layers.
Article
Full-text available
We assessed predictive models (PMs) for diagnosing Pneumocystis jirovecii pneumonia (PCP) in AIDS patients seen in the emergency room (ER), aiming to guide empirical treatment decisions. Data from suspected PCP cases among AIDS patients were gathered prospectively at a reference hospital's ER, with diagnoses later confirmed through sputum PCR analysis. We compared clinical, laboratory, and radiological data between PCP and non-PCP groups, using the Boruta algorithm to confirm significant differences. We evaluated ten PMs tailored for various ERs resource levels to diagnose PCP. Four scenarios were created, two based on X-ray findings (diffuse interstitial infiltrate) and two on CT scans (“ground-glass”), incorporating mandatory variables: lactate dehydrogenase, O2sat, C-reactive protein, respiratory rate (> 24 bpm), and dry cough. We also assessed HIV viral load and CD4 cell count. Among the 86 patients in the study, each model considered either 6 or 8 parameters, depending on the scenario. Many models performed well, with accuracy, precision, recall, and AUC scores > 0.8. Notably, nearest neighbor and naïve Bayes excelled (scores > 0.9) in specific scenarios. Surprisingly, HIV viral load and CD4 cell count did not improve model performance. In conclusion, ER-based PMs using readily available data can significantly aid PCP treatment decisions in AIDS patients.
Thesis
Full-text available
Using constraint logic programming, the goal of this thesis is to develop several constraint acquisition techniques for the situations where we have error-free data. Such situations render the majority of ML techniques unusable, and new approaches are required. The proposed constraint acquisition techniques are applied to two use cases: the search for new sharp-bounds conjectures for eight combinatorial objects and constraint acquisition from a single valid short-term production schedule. The contributions of the thesis include (i) a constraint model to acquire Boolean-arithmetic expressions from data, (ii) an automatically generated database of anti-rewriting constraints that prevent the generation of simplifiable Boolean-arithmetic equations, (iii) a number of formulae synthesis techniques which can acquire a single formula combining several learning biases, (iv) the acquisition of a variety of scheduling constraints such as temporal, resource, calendar and shift constraints, and in this latter case (v) the generation of a MiniZinc scheduling model.
Article
Achieving precise control in laser-based powder bed fusion of polymers is crucial for ensuring the structural integrity of aerospace and automotive components. Closed-loop feedback control systems using process monitoring techniques, such as infrared thermography, have the potential to provide reliable production by controlling the temperature of the melt. However, challenges arise from complex interactions among variables, such as part geometry, scan strategy, and laser parameters, affecting the melted polymer's temperatures and subsequent particle fusion. Thus, the correlation between thermal signals and resulting part density still needs to be clarified. In this work, a machine learning algorithm is trained to predict local porosity or, rather, solidity based on thermal and temporal features extracted from the melt's temperature-time profile. This enables statistical techniques to assess the contribution of the melt's thermal and temporal features in the decision-making process for evaluating porosity, along with the influence of voxel size and configuration. The in-situ process signature is measured using infrared thermography, and the porosity is analyzed by X-ray micro-computed tomography. The 2D thermal data is first converted into voxel information and then stitched with the 3D microCT data in a second step. The resulting 3D thermal features and porosity matrices are downsampled and utilized to train a machine learning algorithm (LightGBM). Models with high prediction accuracy are achieved using a small voxel size to avoid over-homogenizing features and by utilizing thermal signals from adjacent voxels to determine porosity in a volume element. The strongest predictor of resulting porosity is the peak temperature of the melt during laser exposure. Interlayer effects, such as sufficient reheating of subsurface layers, are the second-highest indicator for dense parts. Furthermore, the model's performance is also affected by intra-layer effects, including the peak temperature of adjacent voxels and the cooling behavior after laser exposure. This research has several implications for industry, as it enables the detection of process defects based on in-situ process monitoring data without post-process material testing. Moreover, identifying thermal signal ranges that lead to the highest porosity can reduce the number of experiments needed for material qualification processes.
Article
The accurate prediction of blast-induced ground vibration due to underground ring blasting is a prominent need for ensuring the safety of structures. Different site-specific empirical equations are available for the prediction of ground vibration. These empirical equations are best suited when the monitoring and blasting locations are present in the same medium. The change in the medium alters the behavior of wave propagation. Hence, existing empirical equations have limitations in peak particle velocity (PPV) prediction when the blasting location is an underground hard rock mine and the monitoring location is the ground surface. This is because the underground metal mine comprises different levels containing voids in the form of excavated or paste-filled stopes. It is very difficult to predict the magnitude of PPV on the surface in such instances. Therefore, this study has been carried out to predict the PPV at surface due to underground blasting. In this paper, PPV data was recorded at surface for 207 ring blasts. Furthermore, the PPV has also been measured at different underground locations for 47 ring blasts. Different empirical equations, along with k-nearest neighbor (KNN) and random forest (RF) machine learning models, were developed for the prediction of PPV. Most of the empirical models have higher accuracy in the prediction of PPV at an underground location. This shows that scaled distance-based empirical predictors are best suited when the monitoring and blasting media are the same. However, the empirical models do not predict PPV accurately when the monitoring location is the ground surface and the blast is conducted underground. The machine learning models are better suited for PPV prediction in such cases. Based on the analysis performed for the case study site, the RF model predicts PPV at surface with the highest accuracy. The coefficient of determination and root mean square error for the RF model used for predicting PPV at ground surface are 0.94 and 0.438 mm/s, respectively. The RF-based model is also the best suited among all the models for predicting PPV at underground locations.
Article
Full-text available
Objectives The elimination of mother-to-child transmission (MTCT) of syphilis has been set as a public health priority. However, an instrument to predict the MTCT of syphilis is not available. We aimed to develop and validate an intuitive nomogram to predict the individualised risk of MTCT in pregnant women with syphilis in China. Design Retrospective cohort study. Setting Data was acquired from the National Information System of Prevention of MTCT of Syphilis in Guangdong province between 2011 and 2020. Participants A total of 13 860 pregnant women with syphilis and their infants were included and randomised 7:3 into the derivation cohort (n=9702) and validation cohort (n=4158). Primary outcome measures Congenital syphilis. Results Among 13 860 pregnant women with syphilis and their infants included, 1370 infants were diagnosed with congenital syphilis. Least absolute shrinkage and selection operator regression and multivariable logistic regression showed that age, ethnicity, registered residence, marital status, number of pregnancies, transmission route, the timing of syphilis diagnosis, stage of syphilis, time from first antenatal care to syphilis diagnosis and toluidine red unheated serum test titre were predictors of MTCT of syphilis. A nomogram was developed based on the predictors, which demonstrated good calibration and discrimination with an area under the curve of the receiver operating characteristic of 0.741 (95% CI: 0.728 to 0.755) and 0.731 (95% CI: 0.710 to 0.752) for the derivation and validation cohorts, respectively. The net benefit of the predictive models was positive, demonstrating a significant potential for clinical decision-making. We have also developed a web calculator based on this prediction model. Conclusions Our nomogram exhibited good performance in predicting individualised risk for MTCT of syphilis, which may help guide early and personalised prevention for MTCT of syphilis.
Article
PURPOSE: Crop diseases can cause significant reductions in yield, subsequently impacting a country’s economy. The current research is concentrated on detecting diseases in three specific crops – tomatoes, soybeans, and mushrooms – using a real-time dataset collected for tomatoes and two publicly accessible datasets for the other crops. The primary emphasis is on employing datasets with exclusively categorical attributes, which poses a notable challenge to the research community. METHODS: After applying label encoding to the attributes, the datasets undergo four distinct preprocessing techniques to address missing values. Following this, the SMOTE-N technique is employed to tackle class imbalance. Subsequently, the pre-processed datasets are subjected to classification using three ensemble methods: bagging, boosting, and voting. To further refine the classification process, the metaheuristic Ant Lion Optimizer (ALO) is utilized for hyper-parameter tuning. RESULTS: This comprehensive approach results in the evaluation of twelve distinct models. The top two performers are then subjected to further validation using ten standard categorical datasets. The findings demonstrate that the hybrid model II-SN-OXGB surpasses all other models as well as the current state-of-the-art in terms of classification accuracy across all thirteen categorical datasets. II utilizes the Random Forest classifier to iteratively impute missing feature values, employing a nearest features strategy. Meanwhile, SMOTE-N (SN) serves as an oversampling technique particularly for categorical attributes, again utilizing nearest neighbors. Optimized (using ALO) Xtreme Gradient Boosting (OXGB) sequentially trains multiple decision trees, with each tree correcting errors from its predecessor. CONCLUSION: Consequently, the model II-SN-OXGB emerges as the optimal choice for addressing classification challenges in categorical datasets. Applying the II-SN-OXGB model to crop datasets can significantly enhance disease detection, which in turn enables farmers to take timely and appropriate measures to prevent yield losses and mitigate the economic impact of crop diseases.
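As a hedged sketch of the boosting idea mentioned above, i.e. each tree correcting the errors of its predecessor (a bare-bones residual-fitting loop on synthetic data, not XGBoost/OXGB itself; the learning rate, depth, and number of rounds are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)

# Bare-bones boosting: each new tree is fit to the residual errors left by
# the current ensemble, and its shrunken prediction is added on top.
learning_rate, prediction = 0.1, np.zeros_like(y, dtype=float)
for _ in range(100):
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)

print("training RMSE:", round(float(np.sqrt(np.mean((y - prediction) ** 2))), 2))
```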
Preprint
Full-text available
With the growth of internet of things (IoT) devices, cyber-attacks, such as distributed denial of service, that exploit vulnerable devices infected with malware have increased. Therefore, vendors and users must keep their device firmware updated to eliminate vulnerabilities and quickly handle unknown cyberattacks. However, it is difficult for both vendors and users to continually keep the devices safe because vendors must provide updates quickly and the users must continuously manage the conditions of all deployed devices. Therefore, to ensure security, it is necessary for a system to adapt autonomously to changes in cyberattacks. In addition, it is important to consider network-side security that detects and filters anomalous traffic at the gateway to comprehensively protect those devices. This paper proposes a self-adaptive anomaly detection system for IoT traffic, including unknown attacks. The proposed system comprises a honeypot server and a gateway. The honeypot server continuously captures traffic and adaptively generates an anomaly detection model using real-time captured traffic. Thereafter, the gateway uses the generated model to detect anomalous traffic. Thus, the proposed system can adapt to unknown attacks to reflect pattern changes in anomalous traffic based on real-time captured traffic. Three experiments were conducted to evaluate the proposed system: a virtual experiment using pre-captured traffic from various regions across the world, a demonstration experiment using real-time captured traffic, and a virtual experiment using a public dataset containing the traffic generated by malware. The results of all experiments showed that the detection model with the dynamic update method achieved higher accuracy for traffic anomaly detection than the pre-generated detection model. The experimental results indicate that a system adaptable in real time to evolving cyberattacks is a novel approach for ensuring the comprehensive security of IoT devices against both known and unknown attacks.
Keywords: internet of things, machine learning, honeypot, traffic anomaly detection
Article
Full-text available
Background Web-based self-help interventions for parents of children with ADHD and other externalizing disorders have been proven to be effective. In order to recommend individualized and optimized interventions, a better understanding of the acceptance and utilization of this innovative treatment approach is needed. Previous research has frequently employed subjective reports of utilization, but the validity of these studies may be limited. Methods Data from the German WASH study were used. Participants (n = 276) were randomly assigned to the intervention condition (a) web-based self-help or (b) web-based self-help with optional telephone-based support calls. Data collection took place at baseline (T1) and 12 weeks later (T2). Utilization data were tracked using a log file generated for each participant at T2. Prediction models were calculated using CART (Classification and Regression Trees), a method known mostly from the field of machine learning. Results Acceptance of the intervention, as defined in this paper, was very high on both objective (89.4% took up the intervention) and subjective measures (91.4% reported having used the intervention and 95.3% reported they would recommend the intervention to a friend). The average number of logins corresponded to recommendations. Predictors of acceptance and predictors of utilization were similar and included, e.g., the child's externalizing symptoms, parental psychopathology, and above all additional telephone-based support by counselors. Conclusions Through a detailed identification of acceptance and utilization, and the predictors thereof, we were able to gain a better understanding of the acceptance and utilization of web-assisted self-help for a parent management intervention in the treatment of children with ADHD and ODD. These findings can be used to recommend web-based interventions to particularly suitable families. It should be noted that some form of support is required for an intensive engagement with the content of the program. Trial Registration The protocol of the study (German Clinical Trials Register DRKS00013456 conducted on January 3rd, 2018) was approved by the Ethics Committee of the University Hospital, Cologne.
Chapter
Imbalanced datasets pose a significant and longstanding challenge to machine learning algorithms, particularly in binary classification tasks. Over the past few years, various solutions have emerged, with a substantial focus on the automated generation of synthetic observations for the minority class, a technique known as oversampling. Among the various oversampling approaches, the Synthetic Minority Oversampling Technique (SMOTE) has recently garnered considerable attention as a highly promising method. SMOTE generates new observations by creating points along the line segment connecting two existing minority class observations. Nevertheless, the performance of SMOTE frequently hinges upon the specific selection of these observation pairs for resampling. This research introduces Genetic Methods for OverSampling (GM4OS), a novel oversampling technique that addresses this challenge. In GM4OS, individuals are represented as pairs of objects. The first object takes the form of a GP-like function operating on vectors, while the second object adopts a GA-like genome structure containing pairs of minority class observations. By co-evolving these two elements, GM4OS conducts a simultaneous search for the most suitable resampling pairs and the most effective oversampling function. Experimental results, obtained on ten imbalanced binary classification problems, demonstrate that GM4OS consistently outperforms, or yields results at least comparable to, linear regression both on its own and when combined with SMOTE.
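The interpolation step that the abstract attributes to SMOTE can be written in a few lines; the sketch below shows plain SMOTE point generation, not the GM4OS co-evolutionary search itself.

```python
# Illustration of the SMOTE interpolation step: a synthetic minority
# observation is placed at a random position on the segment between two
# existing minority observations.  This is plain SMOTE, not GM4OS.
import numpy as np

rng = np.random.default_rng(0)

def smote_point(x_i, x_j, rng):
    """Return x_i + u * (x_j - x_i) with u ~ Uniform(0, 1)."""
    u = rng.uniform()
    return x_i + u * (x_j - x_i)

minority = np.array([[1.0, 2.0], [1.5, 2.4], [0.8, 1.7]])
synthetic = smote_point(minority[0], minority[1], rng)
print(synthetic)   # lies on the segment between the first two minority points
```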
Article
The rapid acceleration of global warming has led to an increased burden of high temperature-related diseases (HTDs), highlighting the need for advanced evidence-based management strategies. We have developed a conceptual framework aimed at alleviating the global burden of HTDs, grounded in the One Health concept. This framework refines the impact pathway and establishes systematic data-driven models to inform the adoption of evidence-based decision-making, tailored to distinct contexts. We collected extensive national-level data from authoritative public databases for the years 2010–2019. The burdens of five categories of disease causes – cardiovascular diseases, infectious respiratory diseases, injuries, metabolic diseases, and non-infectious respiratory diseases – were designated as intermediate outcome variables. The cumulative burden of these five categories, referred to as the total HTD burden, was the final outcome variable. We evaluated the predictive performance of eight models and subsequently introduced twelve intervention measures, allowing us to explore optimal decision-making strategies and assess their corresponding contributions. Our model selection results demonstrated the superior performance of the Graph Neural Network (GNN) model across various metrics. Utilizing simulations driven by the GNN model, we identified a set of optimal intervention strategies for reducing disease burden, specifically tailored to the seven major regions: East Asia and Pacific, Europe and Central Asia, Latin America and the Caribbean, Middle East and North Africa, North America, South Asia, and Sub-Saharan Africa. Sectoral mitigation and adaptation measures, acting upon our categories of Infrastructure & Community, Ecosystem Resilience, and Health System Capacity, exhibited particularly strong performance for various regions and diseases. Seven out of twelve interventions were included in the optimal intervention package for each region, including raising low-carbon energy use, increasing energy intensity, improving livestock feed, expanding basic health care delivery coverage, enhancing health financing, addressing air pollution, and improving road infrastructure. The outcome of this study is a global decision-making tool, offering a systematic methodology for policymakers to develop targeted intervention strategies to address the increasingly severe challenge of HTDs in the context of global warming.
Article
Full-text available
Background and objectives Hypertension is one of the most serious risk factors and the leading cause of mortality in patients with cardiovascular diseases (CVDs). It is necessary to accurately predict mortality in patients suffering from CVDs with hypertension. Therefore, this paper proposes a novel cost-sensitive deep neural network (CSDNN)-based mortality prediction model for out-of-hospital acute myocardial infarction (AMI) patients with hypertension on imbalanced data. Methods The synopsis of our research is as follows. First, the experimental data are extracted from the Korea Acute Myocardial Infarction Registry-National Institutes of Health (KAMIR-NIH) and preprocessed with several approaches. Then the imbalanced experimental dataset is divided into training data (80%) and test data (20%). After that, we design the proposed CSDNN-based mortality prediction model, which can handle the skewed class distribution between the majority and minority classes in the training data. The threshold moving technique is also employed to enhance the performance of the proposed model. Finally, we evaluate the performance of the proposed model using the test data and compare it with other commonly used machine learning (ML) and data sampling-based ensemble models. Moreover, the hyperparameters of all models are optimized through random search strategies with a 5-fold cross-validation approach. Results and discussion In the results, the proposed CSDNN model with the threshold moving technique yielded the best performance on imbalanced data. Additionally, our proposed model outperformed the best ML model and the classic data sampling-based ensemble model with AUC improvements of 2.58% and 2.55%, respectively. It aids in decision-making and offers precise mortality prediction for AMI patients with hypertension.
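A minimal sketch of the two ingredients named here, assuming a small Keras network: cost-sensitive training via class weights and threshold moving on validation data. The architecture, weights, and synthetic data are illustrative and not the KAMIR-NIH model.

```python
# Hedged sketch of (1) cost-sensitive training via class weights and
# (2) threshold moving, i.e. picking the decision cut-off on validation data
# instead of using 0.5.  Everything here is illustrative.
import numpy as np
import tensorflow as tf
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, stratify=y,
                                            random_state=0)

# Cost-sensitive weighting: penalize errors on the rare (positive) class more.
w_pos = (y_tr == 0).sum() / (y_tr == 1).sum()

model = tf.keras.Sequential([
    tf.keras.Input(shape=(X.shape[1],)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X_tr, y_tr, epochs=20, batch_size=64,
          class_weight={0: 1.0, 1: w_pos}, verbose=0)

# Threshold moving: choose the cut-off that maximizes F1 on validation data.
probs = model.predict(X_val, verbose=0).ravel()
thresholds = np.linspace(0.05, 0.95, 19)
best_t = max(thresholds, key=lambda t: f1_score(y_val, probs >= t))
print("chosen threshold:", best_t)
```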
Article
Full-text available
Professional bicycle racing is a popular sport that has attracted significant attention in recent years. The evolution and ubiquitous use of sensors allow cyclists to measure many metrics, including power, heart rate, speed, cadence, and more, in training and racing. In this paper we explore, for the first time, the assignment of a subset of a team’s cyclists to an upcoming race. We introduce RaceFit, a model that recommends, based on recent workouts and past assignments, cyclists for participation in an upcoming race. RaceFit consists of binary classifiers that are trained on pairs of a cyclist and a race, described by their relevant properties (features): the cyclist’s demographic properties and features extracted from their workout data from recent weeks, as well as properties of the race, such as its distance and elevation gain. Two main approaches are introduced: recommending for each stage in a race and aggregating the stage-level recommendations to the race level, or recommending for the entire race directly. Model training is based on binary labels that represent the participation of a cyclist in a race (or in a stage) in past events. We evaluated RaceFit rigorously on a large dataset of three pro-cycling teams’ cyclists and race data, achieving up to 80% precision@i. The first experiment showed that using TP or STRAVA data yields the same performance. The best-performing configuration of the framework uses a 5-week time window, imputation was effective, and the CatBoost classifier performed best. However, with any of these parameters the model always performed better than the baselines, in which cyclists are assigned based on their popularity in historical data. Additionally, we present the top-ranked predictive features.
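A hedged sketch of the pairwise formulation: each training row concatenates a cyclist's recent-workout summary with race properties and is labeled by past participation. Feature names and values are invented placeholders, and CatBoost appears only because the abstract reports it performed best.

```python
# Hedged sketch of the cyclist-race pair formulation.  Each row pairs a
# cyclist's recent-workout summary (over a fixed time window) with race
# properties, labeled by whether the cyclist took part.  All values are
# placeholders.
import pandas as pd
from catboost import CatBoostClassifier

pairs = pd.DataFrame({
    # cyclist features, aggregated over the last 5 weeks of workouts
    "avg_weekly_hours": [18.5, 12.0, 20.1, 9.5, 16.3, 14.2],
    "avg_power_wkg":    [4.1, 3.6, 4.4, 3.2, 3.9, 3.7],
    "age":              [24, 31, 27, 35, 29, 22],
    # race features
    "race_distance_km": [180, 180, 210, 210, 160, 160],
    "race_elevation_m": [2400, 2400, 3600, 3600, 900, 900],
    # label: did this cyclist ride this race?
    "participated":     [1, 0, 1, 0, 1, 1],
})

X, y = pairs.drop(columns="participated"), pairs["participated"]
model = CatBoostClassifier(iterations=200, depth=4, verbose=0).fit(X, y)

# Rank the team's cyclists for a race by predicted participation probability.
print(model.predict_proba(X)[:, 1])
```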
Article
Full-text available
Given that roofing contractors in the construction industry have the highest fatality rate among specialty contractors, understanding the root causes of incidents among roofers is critical for improving safety outcomes. This study applied frequency analysis and decision tree data-mining techniques to analyze roofers’ fatal and non-fatal accident reports. The frequency analysis yielded insights into the leading causes of accidents, with falls to a lower level (83%) being the most frequent, followed by incident sources relating to structures and surfaces (56%). The most common injuries experienced by roofing contractors were fractures (49%) and concussions (15%), especially for events occurring in residential buildings, maintenance and repair work, small projects (i.e., $50,000 or less), and on Mondays. According to the decision tree analysis, the most important factor for determining the nature of the injury is the nonfragile injured body part, followed by injury caused by coating works. The decision tree also produced decision rules that provide an easy interpretation of the underlying associations between the factors leading to incidents. The decision tree models developed in this study can be used to predict the nature of potential injuries and to strategically select the most effective injury-prevention strategies.
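As a small illustration of how such decision rules can be derived, the sketch below fits a CART classifier on made-up incident attributes and prints its rules; the variables and codes are hypothetical, not the study's accident-report fields.

```python
# Hedged sketch: fit a CART classifier on incident attributes and print the
# resulting decision rules.  Variables and codes are illustrative only.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

reports = pd.DataFrame({
    "fall_to_lower_level":  [1, 1, 0, 1, 0, 1, 0, 1],
    "residential_building": [1, 0, 1, 1, 0, 1, 0, 0],
    "coating_work":         [0, 1, 0, 0, 1, 0, 1, 0],
    "project_small":        [1, 1, 0, 1, 0, 0, 1, 1],
    # nature of injury: 0 = fracture, 1 = concussion
    "injury":               [0, 1, 0, 0, 1, 0, 1, 0],
})

X, y = reports.drop(columns="injury"), reports["injury"]
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The exported rules play the role of the interpretable decision rules
# mentioned in the abstract.
print(export_text(tree, feature_names=list(X.columns)))
```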
Chapter
Impervious surfaces change the natural hydrology due to lower levels of water infiltration, increasing stream peak flows and flood risks. Concentrated storm-water runoff over the landscape can contribute to pollutants and contamination of drinking water, streams, and aquifers. In the past decade (2010–2020), the North Central Province of Sri Lanka has experienced a series of anomalously severe flash flood events during the annual monsoon rains from December to January. While regional paddy production has experienced successes and failures, the failures have dominated due to adverse climate conditions. This study aims to develop supervised machine learning-based geospatial analytics models to classify spatial and temporal impervious surface cover changes. Following the literature on remote sensing in conjunction with machine learning, we deploy Google Earth Engine-based machine learning algorithms under the local climate zone (LCZ) classification workflow to predict the imperviousness of surfaces in the northern part of Sri Lanka during 2013–2020. The ground truth for the training dataset is established via Google Earth images and field survey data from urban areas such as the Anuradhapura and Polonnaruwa districts of the North Central Province. Random forest (RF) and classification and regression tree (CART) classifiers were used to train and test the data extracted from the Landsat imagery. CART classification gives promising results: performance measures (F1 scores) for impervious, vegetation, water, agriculture, and bare land are 0.71, 0.96, 0.96, 0.91, and 0.91, respectively. The predictive model, with a pixel density analysis conducted at the lowest level of local administrative divisions, appears practically and conceptually appealing for both aggregated and disaggregated urban systems.
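A minimal sketch of this kind of Earth Engine workflow, assuming a hypothetical training-point asset and a Landsat 8 surface reflectance composite; the asset ID, band list, and 'landcover' property are assumptions for illustration, not the study's actual inputs.

```python
# Hedged sketch: sample training points from a Landsat composite, train both
# CART and random forest classifiers in Google Earth Engine, and classify the
# image.  Asset ID, bands, and the 'landcover' property are assumptions.
import ee

ee.Initialize()

bands = ["SR_B2", "SR_B3", "SR_B4", "SR_B5", "SR_B6", "SR_B7"]
composite = (ee.ImageCollection("LANDSAT/LC08/C02/T1_L2")
             .filterDate("2019-01-01", "2019-12-31")
             .median()
             .select(bands))

# Labeled ground-truth points (impervious, vegetation, water, agriculture,
# bare land) with an integer 'landcover' property -- hypothetical asset.
points = ee.FeatureCollection("users/example/nc_province_training_points")

training = composite.sampleRegions(collection=points,
                                   properties=["landcover"], scale=30)

cart = ee.Classifier.smileCart().train(features=training,
                                       classProperty="landcover",
                                       inputProperties=bands)
rf = ee.Classifier.smileRandomForest(100).train(features=training,
                                                classProperty="landcover",
                                                inputProperties=bands)

cart_map = composite.classify(cart)
rf_map = composite.classify(rf)
```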
Conference Paper
Over the past few decades, the growing population in developing countries has significantly impacted land use and land cover (LULC), resulting in a threat to natural resources. Therefore, monitoring LULC changes in critical areas is crucial for effective land-use planning and policy-making. Google Earth Engine (GEE) cloud computing is a new platform that processes geospatial data and classifies LULC over vast areas using machine-learning classification algorithms. In this study, we tested several classification models using Python and GEE to evaluate their accuracy and reliability in reproducing the LULC of a watershed located in Uruguay. We aimed to address the limited availability of GEE models. Our findings indicate that the Histogram-based Gradient Boosting Classifier outperforms the other models, delivering a 21% performance improvement over the model implemented in GEE.
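A hedged sketch of the model-comparison step in scikit-learn, with HistGradientBoostingClassifier among the candidates; the per-pixel feature matrix X and labels y are assumed to be prepared elsewhere.

```python
# Hedged sketch: fit several scikit-learn classifiers on the same labeled
# pixels and compare their cross-validated accuracy.  X (per-pixel spectral
# bands) and y (LULC labels) are assumed to be prepared already.
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

candidates = {
    "CART": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Hist. gradient boosting": HistGradientBoostingClassifier(random_state=0),
}

def compare(X, y):
    """Print mean and spread of 5-fold accuracy for each candidate model."""
    for name, clf in candidates.items():
        scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```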
Article
Financial distress prediction has been a prominent research field for several decades. Accurate prediction of financial distress not only helps to safeguard the interests of investors but also improves the ability of managers to manage financial risks. Prior studies predominantly rely on accounting metrics derived from financial statements to predict financial distress. Our research takes a step further by incorporating media news to enhance the accuracy of financial distress prediction. Based on data from Chinese listed companies, seven classifiers are established to verify the additional value of media news in improving the financial distress prediction performance of models. Experimental results demonstrate that including media news in predictive models is effective, as it yields better performance compared with models that rely solely on accounting features. Moreover, the random forest model is a reliable tool for financial distress prediction due to its superior ability to capture complex feature relationships. Evaluation indicators, statistical tests, and Bayesian A/B tests further confirm that the inclusion of media news can significantly improve the identification of financially distressed companies.
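A minimal sketch of the feature-combination idea, assuming invented accounting and news columns: a random forest is scored with and without the media-news block. None of the values or column names come from the study.

```python
# Hedged sketch: accounting ratios are joined with simple media-news features,
# and a random forest is compared with and without the news block.  All
# columns and values are illustrative.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

firms = pd.DataFrame({
    # accounting features
    "current_ratio":  [1.8, 0.9, 2.4, 0.7, 1.2, 0.8, 0.6, 1.5],
    "debt_to_assets": [0.35, 0.72, 0.28, 0.81, 0.55, 0.78, 0.88, 0.47],
    "roa":            [0.06, -0.02, 0.09, -0.05, 0.01, -0.04, -0.08, 0.03],
    # media-news features
    "news_count":     [12, 44, 8, 61, 25, 52, 70, 18],
    "news_sentiment": [0.3, -0.4, 0.5, -0.6, -0.1, -0.5, -0.7, 0.1],
    # label: 1 = financially distressed
    "distress":       [0, 1, 0, 1, 0, 1, 1, 0],
})

y = firms["distress"]
acct_only = firms[["current_ratio", "debt_to_assets", "roa"]]
acct_news = firms.drop(columns="distress")

rf = RandomForestClassifier(n_estimators=300, random_state=0)
print("accounting only:", cross_val_score(rf, acct_only, y, cv=4, scoring="roc_auc").mean())
print("with media news:", cross_val_score(rf, acct_news, y, cv=4, scoring="roc_auc").mean())
```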