Figure 1 illustrates the notions of nodes, terminal nodes and depth.

Source publication
Article
We consider the problem of predicting a categorical variable based on groups of inputs. Some methods have already been proposed to build classification rules based on groups of variables (e.g. group lasso for logistic regression). However, to our knowledge, no tree-based approach has been proposed to tackle this issue. Here, we propose the Tree...

Contexts in source publication

Context 1
... In other words, T_h is the deepest subtree of T_max whose terminal nodes have a depth less than or equal to h. For example, Table 1 gives the terminal nodes for the sequence of subtrees of the tree displayed in Figure 1. ...
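As a side illustration (not the authors' code), the terminal nodes of each T_h can be read off by truncating the tree at depth h: every node of depth h, plus every terminal node of smaller depth, becomes a leaf of T_h. A minimal Python sketch with a hypothetical dict-based tree (the node labels mimic, but are not taken from, Figure 1):

```python
# Minimal sketch (not the authors' implementation): terminal nodes of the
# depth-h subtree T_h, i.e. the tree truncated at depth h.
# A node is assumed to be a dict: {"name": str, "children": [nodes]}.

def terminal_nodes(node, h, depth=0):
    """Return the leaves of T_h: nodes at depth h, and leaves above depth h."""
    if depth == h or not node["children"]:
        return [node["name"]]
    leaves = []
    for child in node["children"]:
        leaves.extend(terminal_nodes(child, h, depth + 1))
    return leaves

# Hypothetical tree (labels t1, t2, ... are assumptions for illustration):
tree = {"name": "t1", "children": [
    {"name": "t2", "children": [
        {"name": "t4", "children": []},
        {"name": "t5", "children": []}]},
    {"name": "t3", "children": []}]}

print(terminal_nodes(tree, 1))  # ['t2', 't3']
print(terminal_nodes(tree, 2))  # ['t4', 't5', 't3']
```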
Context 2
... t_8, t_9, t_5, t_10, t_11, t_7. Table 1: Terminal nodes for the subtrees in Figure 1. ...
Context 3
... TPLDA yields a less complex partition of the input space without loss of accuracy (Figure 2). The associated trees are displayed in Appendix B (see Figure 13). ...
Context 4
... As far as the importance score is concerned, the use of a penalized Gini criterion seems to improve its ability to identify the truly relevant group (Table 5 and Figures 10 and 11). Nonetheless, the large noisy group still appears among the five most important groups, even when a penalty function is added to the Gini impurity criterion. ...
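The excerpt does not give the penalty's exact form. In a hedged sketch, the Gini decrease of a split can be divided by an increasing function of the size of the splitting group, so that large groups must earn proportionally larger decreases to be selected; the choice pen(d) = sqrt(d) below is a placeholder assumption, not necessarily the paper's penalty:

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def penalized_gini_decrease(y, y_left, y_right, group_size,
                            pen=lambda d: np.sqrt(d)):
    """Decrease in Gini impurity for a split, divided by a penalty on the
    size of the group of variables used to split. pen(d) = sqrt(d) is a
    placeholder assumption, not necessarily the paper's choice."""
    n, nl, nr = len(y), len(y_left), len(y_right)
    decrease = gini(y) - (nl / n) * gini(y_left) - (nr / n) * gini(y_right)
    return decrease / pen(group_size)

# Example with hypothetical labels: a pure split on a group of 4 variables.
y = np.array([0, 0, 1, 1]); yl = np.array([0, 0]); yr = np.array([1, 1])
print(penalized_gini_decrease(y, yl, yr, group_size=4))  # 0.25 = 0.5 / sqrt(4)
```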
Context 5
... points out that the group importance score might not be reliable for estimating the importance of large noisy groups when some noisy groups are very large compared to the truly relevant groups. This lack of robustness is also highlighted by the high variability of the estimated importance of the large noisy group (Figures 10 and 11). So, when group sizes vary greatly, the estimated importance scores of the largest groups should be interpreted with caution. ...
Context 6
... The importance score gives information about the relative importance of each group and also allows groups of genes to be selected. Figure 12 displays the measure of importance of each group on the leukemia data. The measure of importance highlights four groups: the first, the third, the sixth and the fifteenth. ...
Context 7
... Additional figures illustrating the TPLDA method on a simple example. Figure 13 displays the two trees built by CART and TPLDA in the simple example used to illustrate TPLDA in Section 2.4. As mentioned previously, in this example the TPLDA tree is much simpler than the CART tree. ...
Context 8
... in the first split, the FDA used to split the entire data space overfits the training set. This can be seen in Figure 14: the training misclassification error for TLDA decreases much faster and falls below the Bayes error from the first split, while the test misclassification error for TLDA remains stable. Consequently, after applying the pruning procedure, which removes the less informative nodes, the final TLDA tree is trivial in at least 25% of the simulations (Table 9). ...
Context 9
... The results are given in Table 10. Figure 15 displays the group selection frequencies of CART and CARTD. Overall, the two pruning methods lead to similar CART trees and thus to similar classification rules. ...
Context 10
... figures about the simulation studies are displayed in this subsection. Figure 16 displays the predictive performances of TPLDA, CART and GL in the first three experiments. Figure 17 shows the distribution of the importance score for each group in the first three experiments. ...
Context 11
... Figure 16 displays the predictive performances of TPLDA, CART and GL in the first three experiments. Figure 17 shows the distribution of the importance score for each group in the first three experiments. Figures 10 and 11 display the distribution of the importance score for each group in the fourth and fifth experiments, according to the penalty function. ...
Context 12
... Figure 17 shows the distribution of the importance score for each group in the first three experiments. Figures 10 and 11 display the distribution of the importance score for each group in the fourth and fifth experiments, according to the penalty function. ...

Similar publications

Article
Background This article addresses the automatic classification of reconstructed neurons through their morphological features. The purpose was to extend the capabilities of the L-Measure software. Methods New morphological features were developed, based on modifications of the conventional Sholl analysis. The lengths of the compartments, as well as...

Citations

... The variable selection feature of tree methods is particularly well suited to the framework of multivariate functional data and allows predicting the response through more complex but still interpretable relationships. It represents, in some way, a generalization of the finite-dimensional setting presented in Poterie et al. (2019) to the case of multivariate functional data. The procedure consists in splitting a node of the tree by successively selecting an optimal discriminant score (according to some impurity measure) among discriminant scores obtained from MFPLS regression models with different subsets of predictors. ...
... These groups of variables define the candidates used to split the node of the tree (the score is calculated with only the variables in the candidate group). Inspired by the methodology of Poterie et al. (2019), our algorithm is composed of two main steps. In a nutshell, with the help of the MFPLS methodology, the first step provides the results of the splitting according to the candidate groups G_1, ...
... Select the optimal split according to the group G*, which maximizes the decrease of the impurity function Q (see Poterie et al. (2019) for more details), ...
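Taken together, these fragments describe a two-step split: fit one discriminant score per candidate group of variables, then keep the group whose induced split most reduces node impurity. A minimal sketch of the selection step, using the Gini index as Q and scikit-learn's plain LDA as a stand-in for the penalized/MFPLS discriminant scores (the function names and data layout are assumptions, not the authors' code):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def gini(y):
    """Gini impurity Q(t) of the labels in a node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_group_split(X, y, groups):
    """For each candidate group of columns, fit a discriminant score and
    split on its predicted class; keep the group G* maximizing the impurity
    decrease Delta Q = Q(t) - p_L * Q(t_L) - p_R * Q(t_R).
    LDA is a stand-in for the paper's penalized/MFPLS scores."""
    best_group, best_dq = None, -np.inf
    for g, cols in enumerate(groups):
        score = LinearDiscriminantAnalysis().fit(X[:, cols], y)
        left = score.predict(X[:, cols]) == score.classes_[0]
        if left.all() or not left.any():
            continue  # degenerate split, skip
        dq = gini(y) - left.mean() * gini(y[left]) \
                     - (1 - left.mean()) * gini(y[~left])
        if dq > best_dq:
            best_group, best_dq = g, dq
    return best_group, best_dq  # index of G* and its Delta Q
```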
Article
Classification (supervised-learning) of multivariate functional data is considered when the elements of the random functional vector of interest are defined on different domains. In this setting, PLS classification and tree PLS-based methods for multivariate functional data are presented. From a computational point of view, we show that the PLS components of the regression with multivariate functional data can be obtained using only the PLS methodology with univariate functional data. This offers an alternative way to present the PLS algorithm for multivariate functional data. Numerical simulation and real data applications highlight the performance of the proposed methods.
... To address these limitations and explore more sophisticated solutions, researchers have introduced novel tree-based approaches that offer enhanced accuracy, interpretability, and efficiency. One such method is Tree Penalized Linear Discriminant Analysis (TPLDA), which builds a classification rule based on groups of variables, rendering the resulting trees more readily understandable and computationally less demanding (Poterie, Dupuy, Monbet, & Rouviere, 2019). ...
... One possible approach involves using a similarity formula to identify the attribute that exhibits the highest degree of similarity, which then becomes the splitting node (Zaim, Ramdani, & Haddi, 2018). Another method utilizes a Penalized Linear Discriminant Analysis (PLDA), which is based on clusters of variables, to construct a classification rule (Poterie, Dupuy, Monbet, & Rouviere, 2019). In certain scenarios, decision trees employ distinct splitting mechanisms, such as categorizing and uncategorizing child nodes, to enhance performance (Zeng & Chen, 2019). ...
Article
Machine learning, an integral component of Artificial Intelligence (AI), empowers systems to autonomously enhance their performance through experiential learning. This paper presents a comprehensive overview of the Classification Tree Algorithm's pivotal role in the realm of machine learning. This algorithm simplifies the process of categorizing new instances into predefined classes, leveraging their unique attributes. It has firmly established itself as a cornerstone within the broader landscape of classification techniques. This paper delves into the multifaceted concepts, terminologies, principles, and ideas that orbit the Classification Tree Algorithm. It sheds light on the algorithm's essence, providing readers with a clearer and more profound understanding of its inner workings. By synthesizing a plethora of existing research, this endeavor contributes to the enrichment of the discourse surrounding classification tree algorithms. In summary, the Classification Tree Algorithm plays a fundamental role in machine learning, facilitating data classification, and empowering decision-making across domains. Its adaptability, alongside emerging variations and innovative techniques, ensures its continued relevance in the ever-evolving landscape of artificial intelligence and data analysis.
... The variable selection feature of tree methods is particularly well suited to the framework of multivariate functional data and allows predicting the response through more complex but still interpretable relationships. It represents, in some way, a generalization of the finite-dimensional setting presented in Poterie et al. (2019) to the case of multivariate functional data. The procedure consists in splitting a node of the tree by successively selecting an optimal discriminant score (according to some impurity measure) among discriminant scores obtained from MFPLS regression models with different subsets of predictors. ...
... Select the optimal split according to the group G*, which maximizes the decrease of impurity ∆Q (see Poterie et al. (2019) for more details), ...
... In order to avoid overfitting, a pruning method can be employed. Here, we use the same technique as in Poterie et al. (2019), i.e., the optimal depth of the decision tree (m*) is estimated using a validation set. ...
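The pruning step quoted here amounts to choosing the depth with the lowest validation error. A minimal sketch, with scikit-learn's DecisionTreeClassifier standing in for the actual group-based tree (an assumption; the cited works grow their own trees):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def optimal_depth(X_train, y_train, X_val, y_val, max_depth=10):
    """Estimate the optimal depth m* on a validation set: grow trees of
    increasing depth and keep the depth minimizing validation error."""
    errors = []
    for m in range(1, max_depth + 1):
        tree = DecisionTreeClassifier(max_depth=m).fit(X_train, y_train)
        errors.append(np.mean(tree.predict(X_val) != y_val))
    return int(np.argmin(errors)) + 1  # m*
```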
Preprint
Classification (supervised-learning) of multivariate functional data is considered when the elements of the random functional vector of interest are defined on different domains. In this setting, PLS classification and tree PLS-based methods for multivariate functional data are presented. From a computational point of view, we show that the PLS components of the regression with multivariate functional data can be obtained using only the PLS methodology with univariate functional data. This offers an alternative way to present the PLS algorithm for multivariate functional data. Numerical simulation and real data applications highlight the performance of the proposed methods.
... Inspired by the methodology of Poterie et al. (2019), our algorithm is composed of two main steps. In a nutshell, with the help of partial least squares, the first step provides potential splits according to the groups, and the second one selects the best splitting candidate using the Gini criterion. ...
... The strategy used to prevent over-fitting is the same as in Poterie et al. (2019). The motivation is to find the optimal depth of our model. ...
Preprint
Classification of multivariate functional data is explored in this paper, particularly for functional data defined on different domains. Using the partial least squares (PLS) regression, we propose two classification methods. The first one uses the equivalence between linear discriminant analysis and linear regression. The second is a decision tree based on the first technique. Moreover, we prove that multivariate PLS components can be estimated using univariate PLS components. This offers an alternative way to calculate PLS for multivariate functional data. Finite sample studies on simulated data and real data applications show that our algorithms are competitive with linear discriminant on principal components scores and black-boxes models.
... The former divides the tree structure into further branches while the latter is associated with specific classes/clusters. Recent studies also demonstrate the good performance of DT-based approaches when dealing with correlated input variables [93], which is crucial when analyzing systems with uncertainties [94]. ...
Article
This paper proposes an approach that combines reduced-order models with machine learning in order to create physics-informed digital twins to predict high-dimensional output quantities of interest, such as neutron flux and power distributions in nuclear reactor cores. The digital twin is designed to solve forward problems given input parameters, as well as to solve inverse problems given some extra measurements. Offline, we use reduced-order modeling, namely, the proper orthogonal decomposition (POD) to assemble physics-based computational models that are accurate enough for fast predictive digital twin. The machine learning techniques, namely, k-nearest-neighbors (KNN) and decision trees (DT) are used to formulate the input-parameter-dependent coefficients of the reduced basis, whereafter the high-fidelity fields are able to be reconstructed. Online, we use the real time input parameters to rapidly reconstruct the neutron field in the core based on the adapted physics-based digital twin. The effectiveness of the framework is illustrated through a real engineering problem in nuclear reactor physics - reactor core simulation in the life cycle of HPR1000 governed by the two-group neutron diffusion equations affected by input parameters, i.e., burnup, control rod inserting step, power level and temperature of the coolant, which shows potential applications for on-line monitoring purpose.
Article
Although interaction effects can be exploited to improve predictions and allow for valuable insights into covariate interplay, they are given limited attention in analysis. Interaction forests are a variant of random forests for categorical, continuous, and survival outcomes that explicitly models quantitative and qualitative interaction effects in bivariable splits performed by the trees constituting the forests. The new effect importance measure (EIM) associated with interaction forests allows for ranking of covariate pairs with respect to their interaction effects' importance to prediction. Using EIM, separate importance value lists for univariable effects, quantitative interaction effects, and qualitative interaction effects are obtained. In the spirit of interpretable machine learning, the bivariable split types of interaction forests target easily interpretable and communicable interaction effects. To learn about the nature of the interplay between covariates identified as interacting, it is convenient to visualise their estimated bivariable influence. Functions that perform this task are provided in the R package diversityForest, which implements interaction forests. In a large-scale empirical study using 220 data sets, interaction forests tended to deliver better predictions than conventional random forests and competing random forest variants that use multivariable splitting. In a simulation study, EIM delivered considerably better rankings for the relevant quantitative and qualitative interaction effects than competing approaches. These results indicate that interaction forests are suitable tools for the challenging task of identifying and making use of easily interpretable and communicable interaction effects in predictive modelling.
Article
Increasing a personal debt burden implies greater financial vulnerability and threats for macroeconomic stability. It also generates a risk of the households over-indebtedness. The assessment of over-indebtedness is conducted with the use of various objective and subjective measures based on the micro-level data. The aim of the study is to investigate over-indebted households in Poland using a unique dataset obtained from the CATI survey. We discuss and compare the usefulness of various over-indebtedness measures across different socio-economic characteristics. Due to the differences in over-indebtedness across single measures, we perform a more complex assessment using a mix of indicators. As an alternative to other commonly criticised over-indebtedness measures, we apply the “below the poverty line” (BPL) measure. In order to obtain the profile of over-indebted households, we use classification and regression tree analysis as an alternative to logit or probit models. We find that DSTI (“debt service to income”) ratio underestimates the extent of over-indebtedness in vulnerable groups of households in comparison with the BPL. We highlight the necessity to use different measures depending on the adopted definition of over-indebtedness. A psychological burden of debts is particularly strong among older and poorly educated respondents. We also find that the age structure of over-indebted households in Poland differs from this structure in countries with a broader access to consumer credits. Our results can be used to enrich the methods of assessing the household over-indebtedness.