
Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid


Abstract and Figures

Naive-Bayes induction algorithms were previously shown to be surprisingly accurate on many classification tasks even when the conditional independence assumption on which they are based is violated. However, most studies were done on small databases. We show that in some larger databases, the accuracy of Naive-Bayes does not scale up as well as decision trees. We then propose a new algorithm, NBTree, which induces a hybrid of decision-tree classifiers and Naive-Bayes classifiers: the decision-tree nodes contain univariate splits, as in regular decision trees, but the leaves contain Naive-Bayes classifiers. The approach retains the interpretability of Naive-Bayes and decision trees, while resulting in classifiers that frequently outperform both constituents, especially in the larger databases tested.
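As a rough illustration of the hybrid structure the abstract describes, here is a minimal sketch in Python. The class names are illustrative (not from the paper), and NBTree's actual split-selection criterion, which chooses splits by estimating the utility of the resulting Naive-Bayes leaves, is omitted; the sketch only shows how a univariate decision-tree node routes an example to a Naive-Bayes leaf.

```python
from collections import Counter, defaultdict

class NBLeaf:
    """Leaf node: a Laplace-smoothed Naive-Bayes model over categorical features."""
    def __init__(self, X, y):
        self.classes = sorted(set(y))
        self.priors = Counter(y)                # class counts
        self.n = len(y)
        # counts[c][j][v]: number of class-c examples with value v for feature j
        self.counts = {c: defaultdict(Counter) for c in self.classes}
        self.values = defaultdict(set)          # observed values per feature
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                self.counts[yi][j][v] += 1
                self.values[j].add(v)

    def predict(self, x):
        def score(c):
            p = self.priors[c] / self.n         # prior P(c)
            for j, v in enumerate(x):           # times smoothed P(v | c) per feature
                k = len(self.values[j])
                p *= (self.counts[c][j][v] + 1) / (self.priors[c] + k)
            return p
        return max(self.classes, key=score)

class SplitNode:
    """Internal node: a univariate equality split on one categorical feature."""
    def __init__(self, feature, children, default):
        self.feature, self.children, self.default = feature, children, default

    def predict(self, x):
        # route the example down the branch matching its value for the split feature
        return self.children.get(x[self.feature], self.default).predict(x)
```

Classification walks split nodes as in an ordinary decision tree, then hands the example to the Naive-Bayes model stored at the leaf instead of returning a majority class.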
[Figure: learning curves of accuracy vs. number of instances for Naive-Bayes and C4.5 on nine datasets: DNA, waveform-40, led24, shuttle, letter, adult, chess, mushroom, and satimage.]
[Figure: accuracy differences (NBTree - C4.5, NBTree - NB) and error ratios (NBTree/C4.5, NBTree/NB) across 28 datasets: tic-tac-toe, chess, letter, vehicle, vote, monk1, segment, satimage, flare, iris, led24, mushroom, vote1, adult, shuttle, soybean-large, DNA, ionosphere, breast (L), crx, breast (W), german, pima, heart, glass, cleve, waveform-40, glass2, and primary-tumor.]
... Dataset. Adult [10] contains a diverse set of attributes pertaining to individuals in the United States. The dataset is often used to predict whether an individual's annual income exceeds 50,000 dollars, making it a popular choice for binary classification tasks. ...
Preprint
Full-text available
The fairness of AI decision-making has garnered increasing attention, leading to the proposal of numerous fairness algorithms. In this paper, we aim not to address this issue by directly introducing fair learning algorithms, but rather by generating entirely new, fair synthetic data from biased datasets for use in any downstream tasks. Additionally, the distribution of test data may differ from that of the training set, potentially impacting the performance of the generated synthetic data in downstream tasks. To address these two challenges, we propose a diffusion model-based framework, FADM: Fairness-Aware Diffusion with Meta-training. FADM introduces two types of gradient induction during the sampling phase of the diffusion model: one to ensure that the generated samples belong to the desired target categories, and another to make the sensitive attributes of the generated samples difficult to classify into any specific sensitive attribute category. To overcome data distribution shifts in the test environment, we train the diffusion model and the two classifiers used for induction within a meta-learning framework. Compared to other baselines, FADM allows for flexible control over the categories of the generated samples and exhibits superior generalization capability. Experiments on real datasets demonstrate that FADM achieves better accuracy and optimal fairness in downstream tasks.
... Traditionally, methods for solving this task can be classified into two categories: those based on statistical models and those based on reinforcement learning (RL) models. Firstly, among the previous works based on statistical models, Bayesian-based models are particularly prominent, known for their low complexity [5], [6]. They define symptom probing as a feature selection task, using entropy functions to identify optimal features and maximize information gain as the training objective. ...
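The entropy-based symptom probing this excerpt describes reduces to computing the information gain of each candidate feature and asking about the one with the highest gain. A minimal illustration (function names are my own, not taken from the cited models):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Expected reduction in label entropy from observing one feature (e.g. a symptom)."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature_values):
        subset = [l for fv, l in zip(feature_values, labels) if fv == v]
        gain -= (len(subset) / n) * entropy(subset)  # weighted entropy after the split
    return gain
```

A Bayesian-style prober would then ask about the symptom maximizing this gain, condition on the answer, and repeat on the reduced candidate set until a diagnosis threshold is reached.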
Article
Full-text available
Automated diagnosis, as a temporary medical supplement, has gained significant attention in research in recent years. Existing methods employ sequence generation approaches to inquire about symptoms and diagnose diseases. However, these methods ignore the fact that: 1) doctors utilize their past experience and similar cases to aid in diagnosis in real-world scenarios; 2) doctors inquire about key symptoms that serve as vital diagnostic evidence within limited conversations. To address these issues, we propose an end-to-end model, KDPoG. First, in addition to using symptom and attribute embeddings, we propose patient-oriented graph-enhanced representation learning, built from a patient-oriented graph and learned with heterogeneous graph convolutional networks. Furthermore, on top of an encoder built with gated attention units, we propose knowledge-guided attention mechanism learning, which incorporates conditional probabilities of co-occurrence between symptom pairs. Finally, we use two linear layers as the classification module to perform symptom probing and disease diagnosis. We conduct extensive experiments on four public datasets, which demonstrate that our proposed model outperforms the state-of-the-art methods. We achieve an average absolute improvement of over 2% in disease diagnosis accuracy. In particular, on the Muzhi-10 dataset, we observe an absolute improvement of over 14.7% in symptom recall rate.
... AdultIncome dataset [16] contains income data extracted from the United States Census Bureau. The dataset includes 14 features related to personal data, such as race, age, and education. ...
Article
Feature selection is a widely studied technique whose goal is to reduce the dimensionality of the problem by removing irrelevant features. It has multiple benefits, such as improved efficacy, efficiency and interpretability of almost any type of machine learning model. Feature selection techniques may be divided into three main categories, depending on the process used to remove the features known as Filter, Wrapper and Embedded. Embedded methods are usually the preferred feature selection method that efficiently obtains a selection of the most relevant features of the model. However, not all models support an embedded feature selection that forces the use of a different method, reducing the efficiency and reliability of the selection. Neural networks are an example of a model that does not support embedded feature selection. As neural networks have shown to provide remarkable results in multiple scenarios such as classification and regression, sometimes in an ensemble with a model that includes an embedded feature selection, we attempt to embed a feature selection process with a general-purpose methodology. In this work, we propose a novel general-purpose layer for neural networks that removes the influence of irrelevant features. The Feature-Aware Drop Layer is included at the top of the neural network and trained during the backpropagation process without any additional parameters. Our methodology is tested with 17 datasets for classification and regression tasks, including data from different fields such as Health, Economic and Environment, among others. The results show remarkable improvements compared to three different feature selection approaches, with reliable, efficient and effective results.
Preprint
Full-text available
In location-based resource allocation scenarios, the distances between each individual and the facility are desired to be approximately equal, thereby ensuring fairness. Individually fair clustering is often employed to achieve the principle of treating all points equally, which can be applied in these scenarios. This paper proposes a novel algorithm, tilted k-means (TKM), aiming to achieve individual fairness in clustering. We integrate the exponential tilting into the sum of squared errors (SSE) to formulate a novel objective function called tilted SSE. We demonstrate that the tilted SSE can generalize to SSE and employ the coordinate descent and first-order gradient method for optimization. We propose a novel fairness metric, the variance of the distances within each cluster, which can alleviate the Matthew Effect typically caused by existing fairness metrics. Our theoretical analysis demonstrates that the well-known k-means++ incurs a multiplicative error of O(k log k), and we establish the convergence of TKM under mild conditions. In terms of fairness, we prove that the variance generated by TKM decreases with a scaled hyperparameter. In terms of efficiency, we demonstrate the time complexity is linear with the dataset size. Our experiments demonstrate that TKM outperforms state-of-the-art methods in effectiveness, fairness, and efficiency.
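The "exponential tilting of the SSE" mentioned in this preprint typically takes a log-sum-exp form; as a hedged reading (the exact objective is defined in the cited preprint, and the notation here is mine: t is the tilt hyperparameter and c_pi(i) the center assigned to point x_i):

```latex
\min_{c_1,\dots,c_k}\;
\frac{1}{t}\,\log\!\left(\frac{1}{n}\sum_{i=1}^{n}
  \exp\!\bigl(t\,\lVert x_i - c_{\pi(i)}\rVert^{2}\bigr)\right)
```

As t approaches 0 this recovers the ordinary (mean) SSE, while larger t up-weights the worst-served points, which matches the individual-fairness and variance-reduction behavior the abstract describes.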
Chapter
In recent years, deep neural networks (DNNs) have been used in a wide range of applications. However, there is a societal concern about the ability of DNNs to make sound and equitable decisions, particularly when they are used in sensitive areas where valuable resources are allocated, such as education, loans, and employment. Before reliable deployment of DNNs in such sensitive domains, it is essential to perform fairness testing, i.e., to generate as many instances as possible that uncover fairness violations. However, current testing methods are still restricted in the aspects of interpretability, performance, and generalizability. To overcome these challenges, we propose a new DNN fairness testing framework that differs from previous work in several key aspects: (1) interpretable—it quantitatively interprets DNNs’ fairness violations for the biased decision; (2) effective—it uses the interpretation results to guide the generation of more diverse instances in less time; (3) generic—it can handle both structured and unstructured data. A large number of DNNs are used to evaluate the performance of our method. For example, on structured datasets, the instances generated by our method can also be exploited to increase the fairness of the biased DNNs.
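A fairness violation of the kind such testing searches for is commonly operationalized as a pair of inputs that differ only in a sensitive attribute yet receive different predictions. A minimal generic check along those lines (the model, attribute index, and function name are placeholders, not the chapter's actual framework):

```python
def find_fairness_violations(model, inputs, sensitive_idx, sensitive_values):
    """Return pairs of inputs differing only in the sensitive attribute
    but receiving different predictions from the model."""
    violations = []
    for x in inputs:
        base = model(x)
        for v in sensitive_values:
            if v == x[sensitive_idx]:
                continue
            x_alt = list(x)
            x_alt[sensitive_idx] = v            # flip only the sensitive attribute
            if model(tuple(x_alt)) != base:     # prediction changed: violation found
                violations.append((x, tuple(x_alt)))
    return violations
```

Interpretation-guided generators like the one described aim to find such pairs far faster than this exhaustive flip-and-compare sweep, by steering the search toward regions where the model's decision is most attribute-sensitive.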
Chapter
Although deep neural networks (DNNs) have shown superior performance in different software systems, they also display malfunctioning and can even lead to irreversible catastrophes. Hence, it is significant to detect the misbehavior of DNN-based software and enhance the quality of DNNs. Test input prioritization is a highly effective approach to ensure the quality of DNNs. This method involves prioritizing test inputs in such a way that inputs that are more likely to reveal bugs or issues are identified early on, even with limited time and manual labeling efforts. Nevertheless, current prioritization methods still have limitations in three aspects: certifiability, effectiveness, and generalizability. To overcome the challenges, we propose a test input prioritization technique designed based on a movement cost perspective of test inputs in DNNs’ feature space. Our method differs from previous works in three key aspects: (1) certifiable—it provides a formal robustness guarantee for the movement cost; (2) effective—it leverages formally guaranteed movement costs to identify malicious bug-revealing inputs; and (3) generic—it can be applied to various tasks, data, models, and scenarios. Extensive evaluations across two tasks (i.e., classification and regression), six data forms, four model structures, and two scenarios (i.e., white box and black box) demonstrate our method’s superior performance. For instance, it improves prioritization effectiveness by 53.97% on average compared with baselines. Its robustness and generalizability are 1.41 to 2.00 times and 1.33 to 3.39 times those of the baselines on average, respectively.
Chapter
Federated learning (FL), in which multiple clients collaborate to train a federated model without exchanging their individual data, is a method of distributed machine learning. Although federated learning has achieved unprecedented success in data privacy preservation, its vulnerability to “free-rider” attacks is attracting increasing attention. A number of defenses against free-rider attacks have been proposed for FL. Nevertheless, these methods may not protect against highly disguised free-riders. Furthermore, when more than 20% of the clients are free-riders, the effectiveness of these defenses may drop dramatically. To tackle these challenges, we reconceptualize the defense problem from a new perspective, i.e., the frequency of model weight evolution. We gain a new insight: the frequency of model weight evolution differs significantly between free-riders and benign clients during the FL training process. Motivated by this insight, a novel defense method based on the frequency of model weight evolution is proposed. In particular, the frequency of weight changes during the local training process is first collected. Each client records this in a WEF-Matrix for its local model and uploads it, together with its model weights, to the server at each iteration. The server then separates free-riders from benign clients based on differences in the WEF-Matrix. Finally, the server uses a personalized method to deliver different global models to the corresponding clients, thus preventing free-riders from obtaining high-value models. Combined experiments on five datasets and five models show that our method defends better than the state-of-the-art baselines and identifies free-riders at an early stage of training. Furthermore, we also verify the effectiveness of our method against adaptive attacks and visualize the WEF-Matrix during training to explain its effectiveness.
Article
Federated Learning (FL) provides a privacy-preserving and decentralized approach to collaborative machine learning for multiple FL clients. The contribution estimation mechanism in FL is extensively studied within the database community, which aims to compute fair and reasonable contribution scores as incentives to motivate FL clients. However, designing such methods involves challenges in three aspects: effectiveness, robustness, and efficiency. Firstly, contribution estimation methods should utilize the data utility information of various client coalitions rather than that of individual clients to ensure effectiveness. Secondly, we should beware of adverse clients who may exploit tactics like data replication or label flipping. Thirdly, estimating contribution in FL can be time-consuming due to enumerating various client coalitions. Despite numerous proposed methods to address these challenges, each possesses distinct advantages and limitations based on specific settings. However, existing methods have yet to be thoroughly evaluated and compared in the same experimental framework. Therefore, a unified and comprehensive evaluation framework is necessary to compare these methods under the same experimental settings. This paper conducts an extensive survey of contribution estimation methods in FL and introduces a comprehensive framework to evaluate their effectiveness, robustness, and efficiency. Through empirical results, we present extensive observations, valuable discoveries, and an adaptable testing framework that can facilitate future research in designing and evaluating contribution estimation methods in FL.
Article
Full-text available
Ph.D. thesis, Department of Computer Science, Stanford University, 1995. Copyright by the author.
Article
Presented here are results on almost sure convergence of estimators of regression functions subject to certain moment restrictions. Two somewhat different notions of almost sure convergence are studied: unconditional and conditional given a training sample. The estimators are local means derived from certain recursive partitioning schemes.
Conference Paper
The paper presents a case study in examining the bias of two particular formalisms: decision trees and linear threshold units. The immediate result is a new hybrid representation, called a perceptron tree, and an associated learning algorithm called the perceptron tree error correction procedure. The longer-term result is a model for exploring issues related to understanding representational bias and constructing other useful hybrid representations.
Article
Although successful in medical diagnostic problems, inductive learning systems were not widely accepted in medical practice. In this paper two different approaches to machine learning in medical applications are compared: the system for inductive learning of decision trees Assistant, and the naive Bayesian classifier. Both methodologies were tested in four medical diagnostic problems: localization of primary tumor, prognostics of recurrence of breast cancer, diagnosis of thyroid diseases, and rheumatology. The accuracy of automatically acquired diagnostic knowledge from stored data records is compared, and the interpretation of the knowledge and the explanation ability of the classification process of each system is discussed. Surprisingly, the naive Bayesian classifier is superior to Assistant in classification accuracy and explanation ability, while the interpretation of the acquired knowledge seems to be equally valuable. In addition, two extensions to naive Bayesian classifier are briefly described: dealing with continuous attributes, and discovering the dependencies among attributes.