Université de Tunis
Institut Supérieur de Gestion
École Doctorale Sciences de Gestion
LARODEC

On Feature Selection Methods for Credit Scoring

THESIS
Submitted for the degree of Doctorat en Informatique de Gestion
Publicly presented and defended by
Ms. Bouaguel Waad
** ** 2015

Jury Members
Mr. name name, Professor, Université de ****, President
Mr. LIMAM Mohamed, Professor, Université de Tunis, Thesis Supervisor
Mr. AYADI Mohamed, Professor, Université de Tunis, Reviewer
Mme. CHAOUACHI Jouhaina, Associate Professor, Université de Carthage, Reviewer
Mr. name name, Professor, Université de ****, Member
Acknowledgements
I would like to express my special appreciation and thanks to my advisor, Professor
Mohamed Limam: you have been a tremendous mentor for me. I would like to thank
you for encouraging my research and for allowing me to grow as a research scientist.
Your advice on both my research and my career has been priceless. This thesis
would not have been possible without your help, support and patience.
Furthermore, I would also like to thank, with much appreciation, Dr. Ghazi Bel
Mufti for his useful comments, remarks and engagement throughout the learning process
of this thesis.
I take this opportunity to express my deepest gratitude to the jury members. It
is my honor that they accepted the invitation and spent their precious time helping
me to improve this thesis.
I would especially like to acknowledge the financial, academic and technical support
of the Institut Supérieur de Gestion, and I also thank all LARODEC members
for their support and assistance since the start of my master's work.
A special thanks to my family. Words cannot express how grateful I am to my
mother, father, brother and sister for all the sacrifices that you have made on my behalf.
Your prayers for me are what have sustained me thus far.
I would like to dedicate my thesis, and to express my deepest appreciation, to my
beloved fiancé, who has always been my support at all times.
Abstract
Credit granting is a fundamental question with which every credit institution is confronted, and one of the most complex tasks it has to deal with. This task is based on analyzing and judging a large number of received credit requests. Typically, credit scoring databases are large and characterized by redundant and irrelevant features. With so many features, classification methods become more computationally demanding. This difficulty can be addressed by using feature selection methods. Many feature selection methods have been proposed in the literature, such as filter and wrapper methods. Filter methods select the best features by evaluating the fundamental properties of the data, making them fast and simple to implement. However, they are sensitive to redundancy, and the many filter criteria proposed in previous work lead to the selection trouble of choosing among them. Wrapper methods select the best features according to the classifier's accuracy, making the results well matched to the predetermined classification algorithm. However, they typically lack generality, since the resulting subset of features is tied to the bias of the classifier used. The purpose of this thesis is to build simple and robust credit scoring models based on selecting the most relevant features. Three feature selection methods are proposed. First, we propose a new filter rank aggregation method based on optimization using genetic algorithms and similarity. Second, we introduce an ensemble wrapper feature selection method based on an improved exhaustive search. Combining both methods is a natural choice to benefit from their advantages and avoid their shortcomings; thus, a three-stage feature selection using quadratic programming is considered. Our three methods are evaluated on four real credit datasets using different performance criteria. Results show that the feature subsets selected by the proposed methods are either superior or at least as adequate as those selected by their competitor methods.
Keywords: Feature selection, filter, wrapper, hybrid, rank aggregation.
Résumé

Credit granting is a fundamental question with which every credit institution is confronted; it is one of the most complex tasks it has to handle. This task is based on analyzing and judging a large quantity of received credit requests. Typically, the databases used in credit scoring are very large and are characterized by the presence of redundant and non-significant variables. With so many variables, classification methods become more complex. This difficulty can be resolved by using feature selection methods. Numerous feature selection methods have been proposed in the literature. On the one hand, filter methods choose the best variables by evaluating fundamental properties of the data, which makes them fast and easy to implement; however, they ignore redundancy between variables, and the many filter methods proposed in previous work lead to a selection problem. On the other hand, wrapper methods choose the best variables according to the classification rate produced by a classifier, so the result is well matched to the classification algorithm used; however, these methods lack generality, since the resulting subset of variables is biased by the classifier used. The goal of this thesis is to build simple and robust credit scoring models while selecting the most relevant variables. First, we propose a new filter rank aggregation method based on optimization, genetic algorithms and similarity. Second, we present an ensemble wrapper feature selection method based on an improved exhaustive search. Combining the two methods seems a natural choice to benefit from their advantages and avoid their shortcomings; thus, we propose a third, three-stage selection method using quadratic programming. We evaluated our three methods on four real credit datasets using different performance criteria. The results show that the feature subsets chosen by the proposed methods are superior or at least as adequate as those chosen by their competitor methods.
Keywords (Mots clés): feature selection, filter, wrapper, hybrid, rank aggregation.
List of Abbreviations and Symbols Used
CS Credit Scoring.
DA Discriminant Analysis.
DT Decision Tree.
LR Logistic Regression.
SVM Support Vector Machines.
ANN Artificial Neural Networks.
KNN K-Nearest-Neighbor.
LinR Linear Regression.
Ω The population of credit applicants.
χ The space of observations.
n Number of instances.
x Matrix of observations.
d Number of features.
Y Vector of class labels.
X Original feature set.
w Weight vector associated with the features.
W Weight vector associated with the ranked lists.
γ Stopping criterion.
F Feature subset.
m Number of filters to be aggregated.
L List of ranked features.
r Feature rank.
σ Optimal rank list.
D Distance function.
k Cardinality of a ranked list.
GA Genetic Algorithms.
MaxRel Maximal Relevance.
mRMR Minimal-Redundancy-Maximal-Relevance.
Q Matrix describing the coefficients of the quadratic terms.
Z d-dimensional row vector describing the coefficients of the linear terms.
α Tradeoff between relevance and redundancy.
PCC Pearson correlation coefficient.
MI Mutual information.
χ² Chi-squared test.
O_i Observed frequency.
E_i Expected theoretical frequency.
ANOVA Analysis of variance.
Table of Contents
Acknowledgements ............................... i
Abstract ...................................... ii
Résumé ...................................... iii
List of Abbreviations and Symbols Used .................. iv
List of figures ................................... x
List of tables ................................... xii
List of algorithms ................................ xv
Introduction ................................... 1
Chapter 1 Overview of Feature Selection in Credit Scoring .... 4
1.1 Introduction................................ 5
1.2 Credit scoring: state of the art . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Background of credit scoring . . . . . . . . . . . . . . . . . . . 6
1.2.2 Basic notations in credit scoring . . . . . . . . . . . . . . . . . 8
1.2.3 Properties of financial data . . . . . . . . . . . . . . . . . . . 9
1.3 Basics of feature selection . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.1 Search direction . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.2 Search strategy .......................... 12
1.3.3 Evaluation function . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.4 Stopping criterion . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4 Feature selection algorithms . . . . . . . . . . . . . . . . . . . . . . . 19
1.4.1 Filter methods .......................... 19
1.4.2 Wrapper methods . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.4.3 Embedded methods . . . . . . . . . . . . . . . . . . . . . . . . 21
1.4.4 Hybrid methods ......................... 22
1.4.5 Comparison of feature selection algorithms . . . . . . . . . . . 23
1.5 Datasets description and pre-processing . . . . . . . . . . . . . . . . . 24
1.5.1 Datasets description . . . . . . . . . . . . . . . . . . . . . . . 24
1.5.2 Data pre-processing . . . . . . . . . . . . . . . . . . . . . . . . 26
1.6 Performance metrics for feature selection . . . . . . . . . . . . . . . . 28
1.7 Conclusion................................. 29
Chapter 2 A Filter Rank Aggregation Approach Based on Opti-
mization, Genetic Algorithm and Similarity for Credit
Scoring ............................. 31
2.1 Introduction................................ 32
2.2 Filter framework ............................ 32
2.2.1 Feature weighting methods . . . . . . . . . . . . . . . . . . . . 32
2.2.2 Subset search methods . . . . . . . . . . . . . . . . . . . . . . 33
2.2.3 Issue I: Selection trouble and rank aggregation . . . . . . . . 34
2.2.4 Issue II: Incomplete ranking and disjoint ranking for similar
features .............................. 37
2.3 New approach for filter feature selection . . . . . . . . . . . . . . . . 38
2.3.1 Optimization problem . . . . . . . . . . . . . . . . . . . . . . 39
2.3.2 Solution to optimization problem using genetic algorithm . . . 41
2.3.3 A rank aggregation based on similarity . . . . . . . . . . . . . 45
2.4 Experimental investigations . . . . . . . . . . . . . . . . . . . . . . . 49
2.4.1 Results and discussion . . . . . . . . . . . . . . . . . . . . . . 50
2.5 Conclusion................................. 60
Chapter 3 Ensemble Wrapper Feature Selection ........... 61
3.1 Introduction................................ 61
3.2 Wrapper framework .......................... 62
3.2.1 Issue I: Evaluation using a single classifier . . . . . . . . . . . 63
3.2.2 Issue II: Subset generation and search strategies . . . . . . . 63
3.3 New approach for wrapper feature selection . . . . . . . . . . . . . . 64
3.3.1 Primary dimensionality reduction step: similarity study . . . . 65
3.3.2 Subset generation step: speeding up exhaustive search by heuris-
tics................................. 66
3.3.3 Evaluation step: effects of using multiple classifiers . . . . . . 68
3.4 Experimental Investigations . . . . . . . . . . . . . . . . . . . . . . . 73
3.4.1 Results and discussion for the same-type approach . . . . . . 73
3.4.2 Results and discussion for the mixed-type approach . . . . . . 82
3.5 Conclusion................................. 84
Chapter 4 A Three-Stage Feature Selection Using Quadratic Pro-
gramming for Credit Scoring ................ 86
4.1 Introduction................................ 86
4.2 Hybrid framework ........................... 87
4.3 New Approach for hybrid feature selection . . . . . . . . . . . . . . . 87
4.3.1 Stage I: feature-based filtering . . . . . . . . . . . . . . . . . . 88
4.3.2 Stage II: fusion using quadratic programming . . . . . . . . . 89
4.3.3 Stage III: Feature-based wrapping . . . . . . . . . . . . . . . . 92
4.4 Experimental investigations . . . . . . . . . . . . . . . . . . . . . . . 94
4.4.1 Results and discussion . . . . . . . . . . . . . . . . . . . . . . 95
4.5 Conclusion................................. 105
Conclusions and future works ......................... 106
Bibliography ................................... 109
Publications .................................... 115
Appendix A Feature categories and datasets description ....... 117
A.1 Feature categories ............................ 117
A.1.1 Qualitative features . . . . . . . . . . . . . . . . . . . . . . . . 117
A.1.2 Quantitative features . . . . . . . . . . . . . . . . . . . . . . . 118
A.2 Datasets description .......................... 119
A.2.1 Australian dataset . . . . . . . . . . . . . . . . . . . . . . . . 119
A.2.2 German dataset ......................... 120
A.2.3 HMEQ dataset ......................... 123
A.2.4 Tunisian dataset . . . . . . . . . . . . . . . . . . . . . . . . . 123
Appendix B Classification methods .................... 126
B.1 Artificial Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . 126
B.2 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . 127
B.3 Decision Trees ............................. 127
B.4 K-nearest-Neighbor . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
List of Figures
Figure 1.1 The process of credit scoring. . . . . . . . . . . . . . . . . . . 9
Figure 1.2 Key steps of feature selection process. . . . . . . . . . . . . . . 12
Figure 1.3 The process of filter feature selection. . . . . . . . . . . . . . . 20
Figure 1.4 The process of wrapper feature selection. . . . . . . . . . . . . 22
Figure 1.5 Data pre-processing flowchart . . . . . . . . . . . . . . . . . . 27
Figure 2.1 General scheme of filter rank aggregation. . . . . . . . . . . . 36
Figure 2.2 A summary flowchart of the proposed genetic algorithm rank
aggregation ............................ 44
Figure 2.3 Flowchart summarizing the rank aggregation approach based
on similarity ........................... 45
Figure 2.4 Illustrative example of the first scenario . . . . . . . . . . . . 46
Figure 3.1 A flowchart combining heuristic and exhaustive search . . . . 67
Figure 3.2 A wrapper approach combining multiple classifiers for feature
selection. ............................. 69
Figure 4.1 A view of feature relevance categories . . . . . . . . . . . . . 89
Figure 4.2 The proposed process of merging features selected by three fil-
ters in the fusion method . . . . . . . . . . . . . . . . . . . . . 90
Figure 4.3 Redundancy analysis for highly ranked features . . . . . . . . 92
Figure 4.4 Flowchart of the proposed three-stage feature selection fusion . 93
Figure A.1 Feature categories. . . . . . . . . . . . . . . . . . . . . . . . 117
xi
List of Tables
Table 1.1 Taxonomy of filter feature selection methods. . . . . . . . . . 20
Table 1.2 Summary and comparison of feature selection methods. . . . . 24
Table 1.3 Summary of datasets used for evaluating the feature selection
methods. .............................. 25
Table 1.4 Confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Table 2.1 Parameters of experimental environment for genetic algorithm. 50
Table 2.2 Summary of the best performance results achieved by the set of
feature selection methods for the four datasets within the filter
framework. ............................. 51
Table 2.3 Performance comparison of the new filter method and the other
feature selection methods for the Australian dataset. . . . . . . 52
Table 2.4 Performance comparison of the new filter method and the other
feature selection methods for the German dataset. . . . . . . . 53
Table 2.5 Performance comparison of the new filter method and the other
feature selection methods for the HMEQ dataset. . . . . . . . . 54
Table 2.6 Performance comparison of the new filter method and the other
feature selection methods for the Tunisian dataset. . . . . . . . 55
Table 2.7 Summary of F-measures for all feature selection methods with
the four classification methods in filter framework. . . . . . . . 57
Table 2.8 Tests of between-subjects effects in filter framework. . . . . . . 58
Table 2.9 Multiple comparisons table for feature selection methods in filter
framework. ............................. 59
Table 3.1 General properties of some classification algorithms. . . . . . . 70
Table 3.2 Summary of used classifiers within each family. . . . . . . . . . 70
Table 3.3 Summary of the possible combination of all classifiers, where the
number of classifiers to be combined is two. . . . . . . . . . . . 71
Table 3.4 Summary of the possible combination of all classifiers, where the
number of classifiers to be combined is three. . . . . . . . . . . 72
Table 3.5 Performance comparison of the new wrapper method and the
other feature selection methods for the Australian dataset. . . . 74
Table 3.6 Performance comparison of the new wrapper method and the
other feature selection methods for the German dataset. . . . . 75
Table 3.7 Performance comparison of the new wrapper method and the
other feature selection methods for the HMEQ dataset. . . . . 77
Table 3.8 Performance comparison of the new wrapper method and the
other feature selection methods for the Tunisian dataset. . . . . 78
Table 3.9 Summary of F-measures for all aggregation methods with the
four classification methods in wrapper framework. . . . . . . . 81
Table 3.10 Tests of between-subjects effects in wrapper framework. . . . . 81
Table 3.11 Multiple comparisons table for classifier levels in wrapper frame-
work. ................................ 82
Table 3.12 Total number of evaluated subsets and selected features by 2
classifiers mixed-type combinations and associated F-measure
rates for the Australian Dataset. . . . . . . . . . . . . . . . . . 83
Table 3.13 Total number of evaluated subsets and selected features by 3
classifiers mixed-type combinations and associated F-measure
rates for the Australian Dataset. . . . . . . . . . . . . . . . . . 84
Table 4.1 Number of remaining features after Stage I . . . . . . . . . . . 94
Table 4.2 Summary of the best performance results achieved by the set of
feature selection methods for the four datasets within the hybrid
framework. ............................. 96
Table 4.3 Classification results for the three stage feature selection for the
Australian dataset ......................... 97
Table 4.4 Classification results for the three stage feature selection for the
German dataset .......................... 98
Table 4.5 Classification results for the three stage feature selection for the
HMEQ dataset .......................... 99
Table 4.6 Classification results for the three stage feature selection for the
Tunisian dataset .......................... 100
Table 4.7 Tests of between-subjects effects in hybrid framework. . . . . . 102
Table 4.8 Summary of F-measures for all feature selection methods with
the four classification methods in hybrid framework. . . . . . . 103
Table 4.9 Multiple comparisons table for feature selection methods in hy-
bridframework. .......................... 104
Table 4.10 Multiple comparisons table for different classifiers in hybrid frame-
work. ................................ 105
List of Algorithms
1.1 Relief algorithm ............................. 15
1.2 Generalized filter feature selection algorithm . . . . . . . . . . . . . . 19
1.3 Generalized wrapper feature selection algorithm . . . . . . . . . . . . 21
2.4 Rank aggregation based on similarity . . . . . . . . . . . . . . . . . . 47
Introduction
Motivations for feature selection in credit scoring
Failures of financial institutions are generally related to their inability to control
an ensemble of financial risks. Different kinds of risks exist, but the most important
one is credit risk. The credit granting decision is an important and widely studied topic
in the lending industry. The set of decision models and their underlying methods
that serve lenders in granting consumer credit is called credit scoring (CS) (Zhang
et al. 2010).
The general scheme in CS is to use the credit history of previous customers to compute
a new applicant's default risk (Tsai and Wu 2008; Thomas 2009). The collected
portfolio, i.e., the collection of booked loans, is used to build a CS model that identifies
the association between the applicant's characteristics and his or her creditworthiness.
Generally, portfolios used for the scoring task are voluminous, in the range of several
thousand records. These portfolios are characterized by noise, missing values, redundant
or irrelevant features and complex distributions (Piramuthu 2006). The number of considered
features is called the data dimension, and high dimensionality in the feature space has
advantages but also some serious shortcomings. In fact, as the number of features
increases, more computation is required and model accuracy and scoring interpretability
are reduced (Liu and Schumann 2005; Howley et al. 2006). One solution is to perform
feature selection on the original data.
Feature selection is a term commonly used in machine learning to describe the existing
set of methods to reduce a dataset to a convenient size for processing and investigation.
This process involves not only a pre-defined cutoff on the number of features that can
be considered when building a credit model but also the choice of appropriate features
based on their relevance to the study (Fernandez 2010). Further, it is often the case
that finding the correct subset of predictive features is an important problem in its
own right.
Research questions in feature selection
Three main classes of feature selection are identified in the literature (Rodriguez
et al. 2010): filter, wrapper and hybrid feature selection methods.
Usually, filter methods choose the best features by using some informative measure.
Various filtering methods and their modifications have been proposed in the literature,
leading to the selection trouble of how to choose the best criterion for a
specific feature selection task (Wu et al. 2009). This question is still an open research
field. In order to handle this issue, ensemble methods, i.e., rank aggregation,
can be an interesting solution (Dietterich 2000; Dittman et al. 2013). Aggregation
methods provide robust results where the issue of selecting the appropriate filter is
alleviated to some extent (Saeys et al. 2008). However, in many cases, where rankings
are incomplete or highly similar features are given divergent rankings, effective
rank aggregation becomes a difficult task (Sculley 2007). These difficulties may be
addressed by considering the similarity between features in the various ranked lists in
addition to their rankings. The intuition is that similar features should receive similar
rankings, given an appropriate measure of similarity.
Filter feature selection does not take into account the properties of the classifier, as
it performs statistical tests on the variables. Therefore, the results obtained from a wrapper
differ from those of a filter, because the former actually takes into consideration
the classifier's properties. In fact, using a single classifier in the wrapper evaluation
process may influence the final selection result, because each particular classifier has
its own specificity and nature (Chrysostomou 2008). When the classifier is
changed, the result may differ, because of the classifier's bias, in terms of the amount of time,
the accuracy and the number of selected features. As such, a possible improvement
for this drawback is to use an ensemble of classifiers and combine their outcomes.
Nevertheless, we have limited knowledge about the effects of using multiple
classifiers in feature selection applied to CS tasks, especially the effects of using
different numbers of classifiers of the same or of different natures (Chrysostomou et al.
2008). Thus, we focus on how the number and the nature of the classifiers used affect
the number of selected features and the accuracy of the credit model.
In order to find the best subset of features to be evaluated, the ideal approach is
to perform a complete search of the whole search space (Chan et al. 2010). However,
searching all possibilities is sometimes unrealistic (Liu and Yu 2005). Hence, in order
to minimize the number of evaluations performed by the classifier while at the same time
maintaining accuracy, we look for a combined search algorithm that reduces the
number of possible candidates using a mixture of complete and heuristic search
methods.
Usually, hybrid feature selection methods, combining the two discussed approaches,
are needed to serve more complicated purposes (Wu et al. 2009). In fact, constructing
a hybrid feature selection process endowed with the advantages of both filters and
wrappers is a very interesting research question. The challenge here is how to
make these two methods work together in order to hide the shortcomings of each one.
Thesis structure
This thesis is organized as follows. Chapter 1 reviews the necessary background
for this work and the relevant literature. Chapter 2 presents a new
rank aggregation approach based on optimization, genetic algorithms (GA) and similarity
for CS. Chapter 3 introduces an ensemble wrapper feature selection based
on an improved exhaustive search for CS. Chapter 4 presents a hybrid three-stage
feature selection approach using quadratic programming for CS. Finally, the conclusion
summarizes the key findings and their limitations and describes some possible future
directions.
Chapter 1
Overview of Feature Selection in Credit Scoring
Contents
1.1 Introduction .......................... 5
1.2 Credit scoring: state of the art . . . . . . . . . . . . . . . 5
1.2.1 Background of credit scoring . . . . . . . . . . . . . . . . . 6
1.2.2 Basic notations in credit scoring . . . . . . . . . . . . . . . 8
1.2.3 Properties of financial data . . . . . . . . . . . . . . . . 9
1.3 Basics of feature selection . . . . . . . . . . . . . . . . . . 10
1.3.1 Search direction . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.2 Search strategy ........................ 12
1.3.3 Evaluation function . . . . . . . . . . . . . . . . . . . . . . 14
1.3.4 Stopping criterion . . . . . . . . . . . . . . . . . . . . . . . 18
1.4 Feature selection algorithms . . . . . . . . . . . . . . . . . 19
1.4.1 Filter methods ......................... 19
1.4.2 Wrapper methods . . . . . . . . . . . . . . . . . . . . . . . 20
1.4.3 Embedded methods . . . . . . . . . . . . . . . . . . . . . . 21
1.4.4 Hybrid methods ........................ 22
1.4.5 Comparison of feature selection algorithms . . . . . . . . . 23
1.5 Datasets description and pre-processing . . . . . . . . . . 24
1.5.1 Datasets description . . . . . . . . . . . . . . . . . . . . . . 24
1.5.2 Data pre-processing . . . . . . . . . . . . . . . . . . . . . . 26
1.6 Performance metrics for feature selection . . . . . . . . . 28
1.7 Conclusion ........................... 29
1.1 Introduction
Feature selection is a fundamental topic in CS. As such, this chapter provides an
overview of CS in Section (1.2). Then, in order to provide a foundation for the most
commonly used feature selection methods in CS, a brief introduction to the basics
of feature selection is given in Section (1.3), namely search direction, search strategy,
evaluation function and stopping criterion. Subsequently, Section (1.4) explains how
feature selection is performed using filter, wrapper and hybrid methods, with
some examples and a brief comparison between these three approaches. Then, Section
(1.5) introduces the datasets used throughout this thesis. In Section (1.6) the
performance measures are given.
1.2 Credit scoring: state of the art
Credit risk is one of the major issues in financial research (Matjaz 2012; Jiang 2009).
Over the past few years, many companies have fallen apart and been forced into bankruptcy
or into significantly constrained business activity because of the deteriorated financial
and economic situation (Haizhou and Jianwu 2011). When banks are unprepared for a
variation in economic activity, they will probably suffer huge credit losses.
Credit risk clearly increases in an economic depression, and this
effect is amplified when bank experts underestimate or overestimate the creditworthiness
of credit applicants. Explaining why some companies or individuals default while
others do not, identifying the main factors that drive credit risk, and understanding how to build
robust credit models are very important for financial stability.
1.2.1 Background of credit scoring
CS is basically a way of recognizing different groups in a population when one
cannot see the characteristics that separate the groups, but only related ones (Thomas
2000). This idea of differentiating between groups in the same population was first
introduced in statistics by Fisher (1936), who wanted to distinguish between three
varieties of iris using measurements of the physical size of the plants. Durand
(1941) was then the first to recognize that the same techniques could be used to discriminate
between good and bad loans; his research was done in the context of a research project
for the US National Bureau of Economic Research. Since then, CS has been a true success,
and banks have adopted it for their predictive activities (Thomas et al. 2002).
CS consists of the evaluation of the risk related to lending money to an organization
or a person. In the past few years, the business of credit products has increased
enormously. Approximately every day, individuals' and companies' records of past
lending and repaying transactions are collected and evaluated (Hand and Henley
1997). This information is used by lenders such as banks to evaluate an individual's
or company's means and willingness to repay a loan. According to Yang (2001), the
set of collected information makes the decision maker's task simpler because it helps
determine whether to extend the credit duration or to modify a previously approved credit
limit, and it helps quantify the probability of default, bankruptcy or fraud associated with a
company or a person. When assessing the risk related to credit products, different
problems arise depending on the context and the different types of credit applicants.
Sadatrasoul et al. (2013) summarize the different kinds of scoring as follows: application
scoring, behavioral scoring, collection scoring and fraud detection.
Application scoring
Application scoring refers to the assessment of the creditworthiness of new applicants.
It quantifies the risks associated with credit requests by evaluating the social,
demographic, financial and other data collected at the time of the application.
Behavioral scoring
Behavioral scoring involves principles that are similar to application scoring with the
difference that it refers to existing customers. As a consequence, the analyst already
has evidence of the borrower’s behavior with the lender. Behavioral scoring models
analyze the consumer’s behavioral patterns to support dynamic portfolio management
processes.
Collection scoring
Collection scoring is used to divide customers with different levels of insolvency into
groups, separating those who require more decisive actions from those who don’t
need to be attended to immediately. These models are distinguished according to the
degree of delinquency (early, middle, late recovery) and allow a better management of
delinquent customers, from the first signs of delinquency (30-60 days) to subsequent
phases and debt write-off.
Fraud detection
Fraud scoring models rank the applicants according to the relative likelihood that an
application may be fraudulent.
In this thesis we address the application scoring problem, also known as consumer CS.
In this context, the term credit refers to an amount of money that is lent to a
credit applicant by a financial institution and that must be repaid with interest at
regular intervals. The probability that an applicant will default must be estimated
from the information about the applicant provided at the time of the credit application,
and the estimate serves as the basis for an accept or reject decision. According to
Sadatrasoul et al. (2013), accurate classification benefits both the creditor, in terms of
increased profit or reduced loss, and the applicant, in terms of avoiding overcommitment.
Deciding whether or not to grant credit is generally carried out by banks and various
other organizations. It is an economic activity that has seen rapid growth over the
last 30 years.
Traditional methods of deciding whether to grant credit to a particular individual
use human judgment of the risk of default based on experience of previous decisions
(Thomas et al. 2002). Nevertheless, economic demands resulting from the rising
number of credit requests, combined with the emergence of new machine learning
technology, have led to the development of sophisticated models to help the credit granting
decision.
The statistical CS models, called scorecards or classifiers, use predictors from ap-
plication forms and other sources to estimate the probabilities of defaulting. A credit
granting decision is taken by comparing the estimated probability of defaulting with
a suitable threshold (Bardos 2001). Standard statistical methods used in the industry
for developing scorecards are discriminant analysis (DA), linear regression (LinR) and
logistic regression (LR). Despite their simplicity, Tufféry (2007) and Thomas (2009)
show that both DA and LR require strong assumptions on the data. Hence, other
models based on data mining methods have been proposed. These models do not produce
scorecards, but they directly indicate the class of the credit applicant (Jiang 2009).
Artificial intelligence methods such as decision trees (DT), artificial neural networks
(ANN), K-nearest-neighbor (KNN) and support vector machines (SVM) can be used
as alternative methods for CS (Bellotti and Crook 2009). These methods extract
knowledge from training datasets without any assumption on the data distributions.
The classification methods are described in Appendix (B). A brief summary of
the classification methods used in this thesis is included in Chapter (3).
1.2.2 Basic notations in credit scoring
In what follows, we present the main notation used in this CS context.
Let Ω denote the population of credit applicants. We denote by χ the space of observations
in $\mathbb{R}^d$, defined by the random variable $X$:
\[
X:\ \Omega \to \chi \subseteq \mathbb{R}^d, \qquad i \mapsto x_i = (x_i^1, x_i^2, \ldots, x_i^d). \tag{1.1}
\]
We have $n$ individuals described by $d$ variables, as shown by the matrix $x$ given below:
\[
x = \begin{pmatrix} x_1^1 & \cdots & x_1^d \\ \vdots & \ddots & \vdots \\ x_n^1 & \cdots & x_n^d \end{pmatrix}.
\]
Let $X$ denote the set of features such that $X = (X^1, X^2, \ldots, X^d)$. The $n$ observations
are divided into two groups, where the group label of an applicant $i$ is represented
through the modalities $\{0, 1\}$ of a binary target variable $Y$: the label 0 represents
a bad applicant and 1 represents a good one. We denote by $Y = (y_1, \ldots, y_n)$ the vector
of class labels for the $n$ instances. Figure (1.1) summarizes the process of CS and its
basic notions.
Figure 1.1: The process of credit scoring.
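To fix these notations with a concrete object, the following is a minimal NumPy sketch, not part of the thesis, of a toy matrix of observations x (n = 5 applicants, d = 3 features) and its label vector Y; all values are invented for illustration.

```python
import numpy as np

# Rows are applicants x_i, columns are the features X^1, ..., X^d.
x = np.array([
    [35, 1200.0, 2],   # x_1 = (x_1^1, x_1^2, x_1^3)
    [52, 3400.0, 0],
    [23,  800.0, 1],
    [41, 2100.0, 3],
    [30, 1500.0, 1],
])

# Binary target Y: 0 = bad applicant, 1 = good applicant.
y = np.array([1, 1, 0, 1, 0])

n, d = x.shape          # n = 5 instances, d = 3 features
print(x[:, 0])          # values of feature X^1 across all applicants
```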
1.2.3 Properties of financial data
According to Hand and Henley (1997), CS portfolios are frequently voluminous;
portfolios of well over 100,000 applicants measured on more than 100 variables are quite
common. These portfolios are characterized by missing values, complex distributions
and redundant or irrelevant features (Piramuthu 2006). Clearly, applicants'
characteristics will vary from one situation to another: an applicant looking for a
small loan will be asked for information which is different from that asked of another
applicant requesting a big loan. Furthermore, the data which may be used in a credit
model are always subject to changing legislation (Hand and Henley 1997).
Based on the initial application forms filled in by the credit applicants, some are
accepted or rejected on the basis of a few obvious characteristics. Further information is then
collected on the remaining credit applicants using additional forms. This process of
collecting the borrower's information allows banks to avoid losing time on obviously
unworthy applicants, while allowing quick decisions.
As in any classification problem, choosing the appropriate features to be
included in the credit model is an important task. One might try to use as many
features as possible. However, the more the number of features grows, the more
computation is required, and model accuracy and scoring interpretability are reduced
(Liu and Schumann 2005; Howley et al. 2006). There are also other practical issues:
with too many questions or a lengthy vetting procedure, applicants will be deterred
and will go elsewhere. Based on Hand and Henley (1997), the standard statistical and
pattern recognition strategy is to explore a large number of features and to identify
an effective subset, typically of 10 to 12 features, to be considered for building
the credit model.
1.3 Basics of feature selection
There are two main approaches to dimensionality reduction. The first one is
feature extraction, where the input data are transformed into a reduced representation
set of features, so new attributes are generated from the initial ones. The second
category is feature selection, in which a subset of existing features is selected
without a transformation. Generally, feature selection is preferred over feature
extraction, since it keeps all information about the importance of each single feature,
whereas the variables obtained by feature extraction are usually not interpretable (Giudici
2003).
Conserving the information of each feature provides much simplicity and interpretability
to financial data processing. Hence, feature selection is more appropriate to our
study. Feature selection is an important framework in knowledge discovery, and especially
in financial applications, not only for the insight gained from determining predictive
modeling features, but also for the improved performance, understandability
and accuracy of the credit models.
The idea behind feature selection is to reduce the effect of tricky features in the
dataset, where tricky and unneeded features include:
Irrelevant features: those that can never contribute to improving the predictive
accuracy of the credit model, where accuracy is how close a measured value
is to the actual or true value. The algorithm may nevertheless mistakenly include
them in the model; removing such features reduces the dimension of the search
space and speeds up the learning algorithm.
Redundant features: those that can replace others in a feature subset because
they carry essentially the same information. For example, a dataset
may include two features that provide similar information, such as date of birth and
age. Typically, feature redundancy is defined in terms of feature correlation,
where two features are redundant to each other if they are correlated, as sketched in the example below.
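To make the correlation view of redundancy concrete, here is a minimal sketch, not from the thesis, that flags feature pairs whose absolute Pearson correlation exceeds a threshold; the helper name `redundant_pairs`, the threshold of 0.95 and the toy data are all illustrative assumptions.

```python
import numpy as np

def redundant_pairs(x, threshold=0.95):
    """Return index pairs (j, k) of features whose absolute Pearson
    correlation exceeds `threshold`, i.e. candidates for removal."""
    corr = np.corrcoef(x, rowvar=False)  # d x d feature correlation matrix
    d = corr.shape[1]
    return [(j, k) for j in range(d) for k in range(j + 1, d)
            if abs(corr[j, k]) > threshold]

# Toy data: the third feature is a linear copy of the first,
# like "age in months" derived from "age in years".
rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=100)
income = rng.uniform(800, 5000, size=100)
x = np.column_stack([age, income, age * 12])
print(redundant_pairs(x))  # [(0, 2)]
```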
According to Rodriguez et al. (2010), a successful feature selection: a) reduces the
dimensionality of the feature space, b) speeds up and reduces the cost of a learning
algorithm, and c) obtains the feature subset which is most relevant to classification.
Feature selection algorithms are typically composed of the following four components:
search direction, search strategy, evaluation function and stopping criterion. Figure
(1.2) gives a flowchart presenting the general process of feature selection based on
these four components.

Figure 1.2: Key steps of feature selection process.
1.3.1 Search direction
Choosing the starting point in the process of searching for the most important features
is the first issue to be considered when performing feature selection on the original
feature set. Once the starting point is defined, the search direction is determined
(Liu and Motoda 1998; Yun et al. 2007). The search for the most relevant feature
subset may start with an empty set and successively add the most relevant features;
in this case, the search direction is called forward. On the other hand, the
search may begin with the full set and successively remove less relevant features.
In this case, the direction is called backward. Other starting points can also be
used: the search may start from both ends, adding and removing features at the same time,
i.e., bidirectional search. The search may also begin with a random subset of features in order
to avoid being trapped in local optima (Liu and Yu 2005).
1.3.2 Search strategy
Once the starting point and the search direction are decided, the search strategy
must be chosen. The search strategy is a fundamental part of the process of subset
generation. Typically, for a dataset of d features, 2^d possible subsets are candidates
for further examination (Yun et al. 2007). Even for a moderate d, the search space
may be too large for a complete search (Kwang 2002). Consequently, two strategies
have been explored in the literature, as discussed by Liu and Yu (2005) and also by
Legrand and Nicoloyannis (2005): exhaustive and heuristic search.
Exhaustive search
An exhaustive search methodically examines all possible feature subsets and picks
the optimal subset of features. Since exhaustive search examines all possible subsets,
it is always guaranteed to find the optimal result. However, as the number of features grows,
exhaustive search rapidly becomes impractical, because the search space is of
order O(2^d).
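For intuition, a minimal sketch, under stated assumptions, of exhaustive subset generation: every one of the 2^d - 1 non-empty subsets is evaluated with a generic `score` callable, a stand-in for whatever evaluation function is used. This is only feasible for small d.

```python
from itertools import combinations

def exhaustive_search(features, score):
    """Evaluate every non-empty subset of `features` (2^d - 1 candidates)
    with the user-supplied `score` function and return the best one."""
    best_subset, best_score = None, float("-inf")
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            s = score(subset)
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score

# Toy score standing in for classification accuracy:
# prefer subsets whose elements sum close to 10.
best, _ = exhaustive_search([1, 2, 3, 4, 5], lambda sub: -abs(sum(sub) - 10))
print(best)  # (1, 4, 5)
```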
Heuristic search
Naturally, a search does not have to be exhaustive in order to guarantee good or
acceptable results (Legrand and Nicoloyannis 2005). Heuristic methods are a set of
realistic and practical approaches that are easier to put into practice. Such a search
strategy does not always guarantee an optimal solution, but a greedy heuristic may
nonetheless yield locally optimal solutions that approximate a global optimum. Many
research works have discussed heuristic search in CS. Wang et al. (2012) proposed a
novel approach to feature selection called RSFS, based on rough sets and scatter search;
in RSFS, conditional entropy is used as the heuristic to search for optimal solutions.
Falangis and Glen (2010) proposed a variety of heuristic feature selection methods for
CS problems with large numbers of observations; these heuristic procedures, which are
based on a mixed integer programming model for maximizing classification accuracy,
were applied to three CS datasets and proved to be efficient. The two most popular
categories of heuristic search strategies are sequential search and random search.
Sequential search
Sequential search includes forward selection, backward elimination and bidirectional
search (Chan et al. 2010). These approaches consider local changes to
the feature subset during the search, where a local change is basically adding or
removing a single feature from the subset. This kind of approach is known for
its efficiency in generating fast results, as the search space is typically of
order O(d^2).
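A minimal sketch of greedy forward selection under the same assumed `score` interface: starting from the empty set, the single best feature is added at each step, so only O(d^2) subsets are evaluated instead of 2^d, at the cost of possibly missing the global optimum.

```python
def forward_selection(features, score):
    """Greedy forward search: repeatedly add the single feature that most
    improves `score`; stop when no addition helps."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining:
        gains = [(score(selected + [f]), f) for f in remaining]
        candidate_score, candidate = max(gains)
        if candidate_score <= best_score:
            break  # no single feature improves the current subset
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score

# Same toy score as in the exhaustive example above.
print(forward_selection([1, 2, 3, 4, 5], lambda s: -abs(sum(s) - 10)))
# ([5, 4, 1], 0)
```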
Random search
Generally, it starts by selecting a random feature subset and may proceed
in two different ways: either the random search follows a classical sequential search
and adds randomness to it, or it generates the next subset in a completely
random manner (Liu and Yu 2005).
1.3.3 Evaluation function
Feature selection methods search for the best subset that optimally describes the
target variable. Once candidate subsets are generated, each one is evaluated and
compared with the other subsets according to an evaluation criterion. As established
by Dash and Liu (2003), a subset is always optimal relative to the chosen evaluation
criterion, which means that the best subset chosen using one evaluation criterion may
not be the same using another. Many criteria have been proposed in previous
work, as discussed by Kumar and Kumar (2011). Dash and Liu (2003) grouped the
evaluation functions into five categories: distance, information, dependence, consistency
and classifier error rate. Liu and Yu (2005), on the other hand, divided
evaluation criteria into two classes based on their dependency on the classification
algorithm that will finally be applied to the selected feature subset. Considering
these categorizations, we divide the evaluation functions as given below.
Independent criteria
Independent criteria are, by definition, independent of the classification algorithm
used and are generally employed in filter methods. They evaluate the relevance of a
feature or feature subset by exploiting the intrinsic characteristics of the data without
involving any classification algorithm. In the following, we discuss the most well-known
independent criteria.
Distance measures
As discussed by Dash and Liu (2003) and by Liu and Yu (2005), in a binary
context, distance measures, also known as separability, divergence or discrimination
measures, study the difference between the two class-conditional probabilities.
In other words, a feature X^j is chosen over another feature X^{j'} if it
induces a greater difference between the two class-conditional probabilities than
X^{j'}; if the difference is zero, the two features are indistinguishable.
Relief is one of the most famous feature selection methods based on distance
measures. This method uses the Euclidean distance to build, for a randomly chosen
instance x_i, a triplet consisting of x_i and its two nearest instances in x: one from
the same class, the nearest hit, nearhit(x), and one from the opposite class, the nearest
miss, nearmiss(x). A routine then updates the feature weight vector for every sampled
triplet and determines the average feature weight relevance. Features with average
weights above a given threshold are selected. Algorithm (1.1) gives a more detailed
picture of the Relief method, where w = (w_1, ..., w_d) is a weight vector
associated with X and T is the number of iterations.
Algorithm 1.1 Relief algorithm
Require: x: matrix of observations; T: number of iterations.
Ensure: selected feature subset.
1: initialize the weight vector to zero: w = 0.
2: for cpt = 1 ... T do
3:   pick a random instance x_i from x
4:   for j = 1 ... d do
5:     w_j = w_j + (x_i^j - nearmiss(x)^j)^2 - (x_i^j - nearhit(x)^j)^2
6:   end for
7: end for
8: the chosen feature set is {X^j | w_j > threshold}
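The following is a simplified NumPy sketch of Algorithm (1.1), assuming purely numeric features and Euclidean distance; the single-nearest-neighbor form and the function name are illustrative choices, not the thesis's implementation.

```python
import numpy as np

def relief(x, y, n_iter=100, seed=None):
    """Simplified Relief (Algorithm 1.1): for each sampled instance,
    add the squared gap to its nearest miss and subtract the squared
    gap to its nearest hit, then average the weights."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)
        dist = np.linalg.norm(x - x[i], axis=1)
        dist[i] = np.inf                     # exclude the instance itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest hit
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest miss
        w += (x[i] - x[miss]) ** 2 - (x[i] - x[hit]) ** 2
    return w / n_iter

# Features whose average weight exceeds a chosen threshold are kept:
# selected = np.flatnonzero(relief(x, y, seed=0) > threshold)
```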
Information measures
The information theory approach has proved effective in solving many
problems, as discussed by Kumar and Kumar (2011). One of these problems is
feature selection, where information-theoretic quantities can be exploited as metrics or
as optimization criteria. These measures, typically used with filter feature
selection methods, include mutual information (MI). They provide a solid theoretical
framework for measuring the relation between the classes and one or more features
(Bonev 2010).
Formally, the MI of two continuous random variables $X^j$ and $X^{j'}$ is defined as
follows:
\[
MI(X^j, X^{j'}) = \int\!\!\int p(x^j, x^{j'}) \log \frac{p(x^j, x^{j'})}{p(x^j)\, p(x^{j'})} \, dx^j \, dx^{j'}, \tag{1.2}
\]
where $p(x^j, x^{j'})$ is the joint probability density function and $p(x^j)$ and $p(x^{j'})$ are
the marginal probability density functions. In the case of discrete random variables,
the double integral becomes a summation, where $p(x^j, x^{j'})$ is the joint
probability mass function and $p(x^j)$ and $p(x^{j'})$ are the marginal probability
mass functions. MI measures the relevance of features by taking into account the
amount of information shared by two variables (Kumar and Kumar 2011). A large MI
value indicates a strong dependence between the two variables, while a value of zero
indicates that they are statistically independent. Many authors have proposed feature
selection methods based on MI with different evaluation functions, such as Kumar and
Kumar (2011) and Al-Ani and Deriche (2001).
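In practice, MI between each feature and the class is estimated from data rather than computed from known densities. Below is a minimal sketch with scikit-learn's `mutual_info_classif`, assuming scikit-learn is available; the synthetic data are invented for illustration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
informative = rng.normal(size=n)
noise = rng.normal(size=n)
x = np.column_stack([informative, noise])
# The class label depends (noisily) on the first feature only.
y = (informative + 0.1 * rng.normal(size=n) > 0).astype(int)

mi = mutual_info_classif(x, y, random_state=0)
print(mi)  # first entry clearly larger than the second (near zero)
```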
Dependency measures
As discussed by Dash and Liu (2003) and Yu and Liu (2003), dependency measures,
or correlation measures, quantify the ability to predict the value of one
variable from the value of another. If the correlation between a feature and the class
is adopted as an evaluation function, the above definition becomes: a
feature is relevant if it is strongly associated with the class. In other words, if
the correlation of feature $X^j$ with the class variable is higher than the correlation
of feature $X^{j'}$ with the class variable, then feature $X^j$ is considered more
predictive.
The Pearson correlation coefficient (PCC) for continuous features is a simple but
effective measure used in a wide variety of feature selection methods (Rodriguez
et al. 2010). Formally, PCC is defined by
\[
PCC = \frac{\operatorname{cov}(X^j, X^{j'})}{\sqrt{\operatorname{var}(X^j)\, \operatorname{var}(X^{j'})}}, \tag{1.3}
\]
where cov is the covariance of the variables and var is the variance of each variable.
Another popular feature selection method is Pearson's chi-squared ($\chi^2$) test.
This test is usually used with nominal or categorical variables; it can also be applied
to numerical variables by converting them into nominal or categorical types. The first
step in performing the $\chi^2$ test for independence is to convert the raw data into a
contingency table. Then, the independence between each variable and the target
variable is measured using the contingency table. The $\chi^2$ statistic is defined by
\[
\chi^2 = \sum_{i=1}^{c} \frac{(O_i - E_i)^2}{E_i}, \tag{1.4}
\]
where $O_i$ is an observed frequency, $E_i$ is the expected theoretical frequency
asserted by the hypothesis of independence, and $c$ is the number of cells in the
contingency table.
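A minimal sketch of both criteria on toy data, assuming NumPy and SciPy: Equation (1.3) is computed directly, and Equation (1.4) is obtained via `scipy.stats.chi2_contingency` on a small contingency table whose frequencies are invented for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# PCC (Equation 1.3) between a continuous feature and the class labels.
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
target = np.array([0, 0, 1, 1, 1])
pcc = (np.cov(feature, target)[0, 1]
       / np.sqrt(feature.var(ddof=1) * target.var(ddof=1)))
print(pcc)

# Chi-squared test (Equation 1.4): categorical feature (rows)
# against the class label (columns: bad, good).
table = np.array([[30, 10],    # e.g. renters
                  [15, 45]])   # e.g. homeowners
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)
```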
Consistency measures
A consistency measure evaluates how far a feature subset is from consistency with
the class labels. Consistency is established when the dataset restricted to the selected
features alone is consistent, that is, when no two instances have the same feature
values while having different class labels (Arauzo-Azofra et al. 2008).
According to Arauzo-Azofra et al. (2008), checking consistency usually goes hand
in hand with looking for a small feature set, because with fewer features it becomes
more likely that two instances share the same values while belonging to different
classes, so the consistency hypothesis is more easily rejected. In any case, the search
for small feature sets is the common goal of feature selection methods, so this is not
a particularity of consistency methods. The most basic of these measures simply tests
whether the training dataset is consistent with the selected features; its output is just
a boolean value.
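A minimal sketch of such a boolean consistency measure, assuming hashable feature values; the function name and the toy data are illustrative.

```python
def is_consistent(x, y, subset):
    """True if no two instances share the same values on the selected
    features while having different class labels."""
    seen = {}
    for row, label in zip(x, y):
        key = tuple(row[j] for j in subset)
        if key in seen and seen[key] != label:
            return False
        seen[key] = label
    return True

x = [[1, 0, 7], [1, 0, 3], [0, 1, 7]]
y = [0, 1, 1]
print(is_consistent(x, y, [0, 1]))     # False: rows 0 and 1 collide
print(is_consistent(x, y, [0, 1, 2]))  # True: feature 2 separates them
```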
Dependent criteria
Dependent criteria are generally used with wrapper feature selection methods, where
the performance of a specific scoring algorithm determines which features are
selected. When using dependent criteria we generally obtain superior results, as the
selected features are well matched to the predetermined mining algorithm. However,
this approach also tends to be more computationally expensive, and the selection may
not be suitable for other scoring algorithms (Chrysostomou 2008).
In general, classification accuracy is widely used as the primary dependent criterion.
Features are selected using the classification algorithm and later used
in predicting the class labels of unseen instances. Usually, the resulting accuracy is
high, but it is computationally costly to estimate accuracy for every feature subset
(Yu and Liu 2004).
1.3.4 Stopping criterion
The final step in the process of feature selection is to choose a stopping criterion for
the search over feature subsets. The stopping criterion depends on the degree of
dependency of the evaluation function used. As discussed by Chrysostomou (2008),
when independent criteria are used, a commonly used stopping criterion is the ordering of
the features according to some relevance score. When dealing with a dependent
evaluation function, one might stop adding or removing features when there is no
further improvement in the accuracy of the current feature subset. Some frequently
used stopping criteria are:
- The search is complete, i.e., all feature subsets have been evaluated.
- The smallest feature subset with the highest discriminant power is found, so the
search algorithm stops the searching process.
- A specific bound is reached, where the bound can be a particular number of
features or of iterations.
- A subsequent addition, or deletion, of any feature does not produce a better
subset.
1.4 Feature selection algorithms
1.4.1 Filter methods
A feature selection algorithm is considered a filter if it filters out all unwanted features
(Molina et al. 2002; Blum and Langley 1997). According to Forman (2008), a filter
technique is a pre-selection process that is independent of the classification
algorithm applied afterwards. The process of filter methods is illustrated in Algorithm (1.2) (Yu and Liu
2004).
Algorithm 1.2 Generalized filter feature selection algorithm
Require: X: all features; F_0: a subset of features from which to start the search, F_0 ⊆ X; γ: a stopping criterion.
Ensure: F_best: selected feature subset.
1: initialize: F_best = F_0.
2: γ_best = eval(F_0, X, M); evaluate F_0 by an independent criterion M.
3: while the stopping criterion γ is not reached do
4:   F = generate(X); generate a subset for evaluation.
5:   γ_F = eval(F, X, M); evaluate the current subset F by M.
6:   if γ_F is better than γ_best then
7:     γ_best = γ_F.
8:     F_best = F.
9:   end if
10: end while
11: return F_best.
Filter methods typically evaluate the importance of features by looking at the intrinsic
properties of the data (Saeys et al. 2007). Basically, in the filter approach, a relevance
score is assigned to each feature in the dataset, and the features are then ordered according to
their relevance scores. In general, features with high scores are selected and low-scoring
features are eliminated (Chrysostomou 2008). Once all features are ranked,
the selected features are introduced as inputs to the classifier. Figure (1.3) illustrates
the filter feature selection process.
Figure 1.3: The process of filter feature selection.

Filters can be exceptionally effective since they easily scale down high-dimensional
data. They are computationally fast and simple since the selection criterion is
completely independent of the classifier (Guyon and Elisseeff 2003). Several ranking
criteria for filter methods have been proposed in the literature. Examples of the
commonly used filter ranking criteria are summarized in Table (1.1).
Table 1.1: Taxonomy of filter feature selection methods.

Model search | Advantages | Disadvantages | Examples
Univariate | fast, scalable, independent of the classifier | ignores feature dependencies, ignores interaction with the classifier | PCC, χ2, entropy
Multivariate | models feature dependencies, independent of the classifier, better computational complexity than wrapper methods | slower and less scalable than univariate techniques, ignores interaction with the classifier | correlation-based feature selection (CFS)
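To make the generic filter procedure of Algorithm (1.2) concrete, the following minimal sketch (ours, not taken from the cited works) scores each feature independently with an empirical mutual information estimate on already-discretized data and keeps the k best; the toy matrix and function names are purely illustrative.

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Empirical MI (in nats) between two discrete vectors."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * np.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def filter_rank(X, y, k):
    """Score each feature independently of any classifier, sort by
    decreasing score, and return the indices of the k best features."""
    scores = [mutual_information(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:k], scores

# toy usage on an already-discretized design matrix
X = np.array([[0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]])
y = np.array([0, 1, 0, 1])
best, scores = filter_rank(X, y, k=2)
print(best, np.round(scores, 3))
```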
1.4.2 Wrapper methods
Wrapper methods use a specific classifier and its resulting classification performance
to select features. While filter methods treat the problem of finding the best feature
subset independently of the learning step, wrapper methods use the model accuracy
within the feature subset search. They use search methods to pick subsets of variables
and evaluate their importance based on the estimated classification accuracy
(Rodriguez et al. 2010). Details of the wrapper process are described in Algorithm
(1.3) (Yu and Liu 2004).
Algorithm 1.3 Generalized wrapper feature selection algorithm
Require: X: all features.
  F0: a subset of features from which to start the search, F0 ⊆ X.
  δ: a stopping criterion.
Ensure: Fbest: selected feature subset.
1: initialize: Fbest = F0.
2: γbest = eval(F0, X, A); evaluate F0 by a classification algorithm A.
3: while δ is not reached do
4:   F = generate(X); generate a subset for evaluation.
5:   γ = eval(F, X, A); evaluate the current subset F by A.
6:   if γ is better than γbest then
7:     γbest = γ.
8:     Fbest = F.
9:   end if
10: end while
11: return Fbest.
According to Kohavi and John (1997), a wrapper model incorporates the classification
algorithm into the feature selection process and treats it as a perfect "black box".
In other words, it is not necessary to know the classification algorithm or how it works;
only its ability to score a candidate solution on the validation set is used.
Wrappers use a search procedure in the space of possible features, and then gen-
erate and evaluate various subsets in order to find the best one. The evaluation of a
specific subset of features is obtained by training and testing a specific classification
model repetitively, rendering this approach tailored to a specific classification algo-
rithm. To search the space of all feature subsets, a search algorithm is then 'wrapped'
around the classification model. Figure (1.4) illustrates the process of wrapper feature
selection.

Figure 1.4: The process of wrapper feature selection.
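As a concrete illustration of the wrapper idea, the sketch below (our own, using scikit-learn rather than the Weka tools employed in this thesis) wraps a greedy forward search around a classifier's cross-validated accuracy; the dataset and all parameters are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_wrapper(X, y, clf, max_features):
    """Greedy forward selection: at each step add the feature whose
    inclusion gives the best cross-validated accuracy, stop when no
    candidate improves the current subset."""
    selected, best_score = [], 0.0
    while len(selected) < max_features:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        scores = {j: cross_val_score(clf, X[:, selected + [j]], y,
                                     cv=5, scoring="accuracy").mean()
                  for j in candidates}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:      # no further improvement: stop
            break
        selected.append(j_best)
        best_score = scores[j_best]
    return selected, best_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
subset, acc = forward_wrapper(X, y, LogisticRegression(max_iter=1000), 5)
print(subset, round(acc, 3))
```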
1.4.3 Embedded methods
In embedded feature selection methods the search for an optimal subset of features is
built into the classifier construction; i.e. feature selection occurs naturally as a part
of the learner. Typically, these methods use all features as input to generate a model.
Then, they evaluate the model to infer the relevance of the features. As a result, they
directly link feature relevance to the learner used to model the relationship (Tuv
et al. 2009).
Just like wrappers, embedded methods are specific to a given learning algorithm: the
classifier has its own feature selection mechanism and the two interact, so feature
dependencies are implicitly taken into account. Embedded methods are also far less
computationally intensive than wrapper methods.
As discussed earlier, the similarity to wrapper methods lies in the link to the
classification stage. This link is even stronger in embedded methods, where feature
selection is built into the classifier construction. Embedded methods offer the same
advantages as wrapper methods concerning the interaction between feature selection
and classification. However, since embedded approaches are algorithm-specific, they
are not adequate for our requirements.
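As a small illustration of the embedded idea, and not a method used in this thesis, an L1-penalized logistic regression performs selection while the classifier itself is trained, since the penalty drives the coefficients of unhelpful features to zero (a sketch via scikit-learn; data and parameters are made up):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)
# the L1 penalty performs feature selection while the classifier is fitted
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)
kept = np.flatnonzero(clf.coef_[0])   # features with non-zero weights
print("selected features:", kept)
```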
1.4.4 Hybrid methods
The hybrid model attempts to take advantage of other feature selection approaches by
using their different evaluation criteria in different search stages. In case the chosen
feature selection technique proves too slow to allow complex search schemes over a
large number of candidate features, it may be more practical to introduce another
fast but less accurate feature selection method to pre-filter some of the unwanted
features. In this way, only the more promising features are eventually presented to
the primary, slower feature selection technique.
Many hybrid feature selection methods have been proposed in the past few years to
construct accurate CS models. An interesting hybrid filter-wrapper approach is introduced
by Huang et al. (2007), where a genetic algorithm is used to optimize the
parameters of an SVM classifier and the feature subset simultaneously, without reducing
the SVM classification accuracy. Cho et al. (2010) propose a hybrid method for
effective bankruptcy prediction, based on the combination of variable selection using
decision trees and case-based reasoning using the Mahalanobis distance with variable
weights.
In general, hybrid algorithms focus on combining filter and wrapper algorithms
to achieve the best possible performance, with an accuracy similar to that of wrappers
and a time complexity similar to that of filters.
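A hypothetical two-stage sketch of this filter-then-wrapper idea (ours; scikit-learn names and toy data): a cheap univariate filter keeps a pool of promising candidates, and the expensive wrapper search runs only inside that pool.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# stage 1 (filter): cheap univariate MI scores keep the 10 best candidates
pre = SelectKBest(mutual_info_classif, k=10).fit(X, y)
pool = list(np.flatnonzero(pre.get_support()))

# stage 2 (wrapper): greedy forward search over the pre-filtered pool only
clf, selected, best = LogisticRegression(max_iter=1000), [], 0.0
for _ in range(5):
    rest = [j for j in pool if j not in selected]
    scores = {j: cross_val_score(clf, X[:, selected + [j]], y, cv=5).mean()
              for j in rest}
    j = max(scores, key=scores.get)
    if scores[j] <= best:                 # stop once accuracy stalls
        break
    selected.append(j)
    best = scores[j]
print("final features:", selected, round(best, 3))
```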
1.4.5 Comparison of feature selection algorithms
Numerous feature selection techniques are available. In order to better understand
the inner workings of each technique and the commonalities and differences among
them, we present a categorizing framework in Table (1.2), based on the previous
discussions.
Comparing feature selection methods is not an easy task, since it depends on numerous
factors. Feature selection methods can be compared according to different
purposes. For the general purpose of irrelevancy removal, filters are good choices as
they are unbiased and fast. On the other hand, to improve classification performance,
wrappers should be preferred over filters since they are better matched to the classification
task. Sometimes, hybrid feature selection methods are needed to serve more
complicated purposes.
In terms of time, a feature selection method that is theoretically complex may
take longer to select relevant features than one regarded as theoretically simple. The
time concern is also about whether the feature selection process is time critical or not.
When time is not an important issue, complete search methods are recommended
to achieve optimality; otherwise, heuristic search methods should be selected for
fast results. Time constraints can also affect the choice of feature selection model, as
different models have different computational complexities. The filter model is
preferred in applications where applying a particular classifier is too costly.
Table 1.2: Summary and comparison of feature selection methods.

 | Filter | Wrapper | Hybrid
Evaluation criterion | distance, information, dependency and consistency | predictive accuracy | independent criteria, dependent criteria
Search | feature weighting, subset search | exhaustive, heuristic | mixture
Characteristics | unbiased and fast, robust against overfitting, reasonable computation cost, reasonable statistical scalability | achieves higher optimality, interacts with the classifier, considers dependencies | takes advantage of other feature selection approaches
1.5 Datasets description and pre-processing
1.5.1 Datasets description
The datasets adopted herein for evaluation are four real-world datasets: two
datasets from the UCI repository of machine learning databases, the Australian and
German credit datasets (http://archive.ics.uci.edu/ml/datasets.html), a dataset from
a Tunisian bank, and the HMEQ dataset. Table (1.3) displays the characteristics of
these datasets.
Table 1.3: Summary of datasets used for evaluating the feature selection methods.
Names Australian German HMEQ Tunisian
Total instances 690 1000 5960 2970
Nominal features 6 13 2 11
Numeric features 8 7 10 11
Total features 14 20 12 22
Number of classes 2 2 2 2
Australian credit dataset:
The Australian dataset presents an interesting mixture of attributes: continuous, nominal
with small numbers of values, and nominal with larger numbers of values, with few
missing values. Appendix (A) contains the complete list of variables used in this
dataset. The dataset is composed of 690 instances, of which 307 are creditworthy while
383 are not. All attribute names and values have been changed to meaningless symbols
for confidentiality. This dataset was used in the European StatLog project, which
involved comparing the performance of machine learning, statistical, and neural network
algorithms on datasets from real-world industrial areas including medicine,
finance, image analysis, and engineering design.
German credit dataset:
The German credit dataset is often used by credit specialists for classification purposes.
This dataset covers a sample of 1000 credit consumers, where 700 instances
are creditworthy and 300 are not. For each applicant 21 variables are available, i.e.,
7 numerical and 13 categorical input features plus a target attribute, with information
pertaining to past and current customers who borrowed from a German bank
(http://www.stat.uni-muenchen.de/service/datenarchiv/kredit/kredite.html).
Among the 20 input variables assumed to affect the target variable we mention:
duration of credit in months, repayment behavior on other loans, value of savings or
stocks, stability of employment, and other running credits. Appendix (A) contains
the complete list of variables used in this dataset.
HMEQ credit dataset:
The HMEQ dataset is composed of 5960 instances describing recent home equity
loans, where 4771 instances are creditworthy and 1189 are not. The target is a binary
variable that indicates whether an applicant eventually defaulted. For each applicant,
12 input variables were recorded, where 10 are continuous, 1 is binary and 1 is
nominal; more details are provided in Appendix (A).
Tunisian credit dataset:
The Tunisian dataset covers a sample of 2970 instances of credit consumers, where 2523
instances are creditworthy while 446 are not. Each credit applicant is described by a
binary target variable and a set of 22 input variables, where 11 features are numerical
and 11 are categorical (see Appendix (A)).
1.5.2 Data pre-processing
In this section, we describe the adopted data pre-processing steps. Each dataset is
cleaned from missing values, then it is discretized and split into training and testing
samples as shown in Figure (1.5).
Missing value replacement
Most financial datasets contain missing values that should be properly handled. Many
methods dealing with missing values are available. The simplest one is to remove all
instances with missing values; this is suitable when the missing data are not important.
Another simple way is to substitute missing values with the corresponding
mean or median values over all instances. In this work, we estimate missing values
with the average or the mode of each feature, depending on whether it is numerical
or categorical.
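A minimal pandas sketch of this rule (mean for numerical features, mode for categorical ones); the tiny applicant table and its column names are hypothetical.

```python
import pandas as pd

def impute(df):
    """Replace missing values: mean for numerical columns,
    mode for categorical ones (the rule adopted above)."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# toy usage on a tiny applicant table (hypothetical column names)
df = pd.DataFrame({"income": [1200.0, None, 900.0],
                   "job": ["clerk", "clerk", None]})
print(impute(df))
```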
Figure 1.5: Data pre-processing flowchart
Features discretization
For simplicity, each variable is discretized, knowing that the discretization of continuous
features depends on the context; here we are in the supervised learning setting.
The discretization step is performed prior to the learning process.
Several tools can be used for this, and we selected the Weka 3.7.0 machine learning
package (Bouckaert et al. 2009) for its simplicity.
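The discretization itself is done in Weka; as a rough, unsupervised stand-in for illustration only (not the Weka filter actually used), the sketch below shows equal-width binning with pandas:

```python
import pandas as pd

# equal-width binning: a crude unsupervised stand-in for Weka's filter
ages = pd.Series([19, 23, 31, 40, 58, 62], name="age")
bins = pd.cut(ages, bins=3, labels=["low", "mid", "high"])
print(bins.tolist())   # ['low', 'low', 'low', 'mid', 'high', 'high']
```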
Splitting datasets
Datasets for the scoring task are usually extremely large. In order to reduce the
complexity of the classification tools and to increase the accuracy of the scoring
models, sampling becomes necessary, as stated by Fernandez (2010). In order to obtain
a calibrated model, the credit database should be split. Sampled subsets are expected
to be balanced and to cover the complete database. We therefore split each dataset
into a training sample, used for the feature selection approaches and the various
classification models, and a test sample, used to check the reliability of the models
constructed in the learning step.
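A sketch of such a split with scikit-learn; the 70/30 ratio and the stratification option are our assumptions, chosen so the good/bad proportion stays identical in both samples:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.7], random_state=0)
# stratify keeps the good/bad ratio identical in both samples
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
print(len(X_tr), len(X_te))   # 700 300
```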
1.6 Performance metrics for feature selection
The performance of our proposed methods is evaluated using standard information
retrieval performance measures: precision, recall and the F-measure. In a
classification context, the precision is calculated as the ratio of the number of credit
applicants correctly identified by the model as positives Y = 1, i.e., true positives (TP),
to the total number of applicants predicted as positive. This total is the number of
applicants correctly identified as positive plus the number of applicants incorrectly
classified as positive, i.e., false positives (FP).
The recall, also known as the TP rate or sensitivity, measures how often a classification
model assigns the right class to a credit applicant. It is defined as the proportion
of TP against the total number of applicants that actually belong to the positive
class. This total is the number of TP plus the count of false negatives (FN), the
applicants that were not labeled as belonging to the positive class but should have
been.
A precision of 1 for a class C means that every applicant assigned to this class does
indeed belong to it, but it says nothing about the number of applicants from this
class that were not correctly classified. A recall of 1 means that every applicant from
class C is labeled as belonging to class C, but it says nothing about the number of
applicants incorrectly labeled as belonging to class C. In general, there is an inverse
relationship between precision and recall: it is possible to increase one at the cost of
reducing the other. The F-measure combines recall and precision into a global measure.
In general, the terms TP, TN, FP, and FN evaluate the results of the classifier. The
terms positive (P) and negative (N) refer to the classifier's prediction, and the terms
true (T) and false (F) refer to whether that prediction corresponds to the external
observation.
The four outcomes can be presented in a confusion matrix, as follows:

Table 1.4: Confusion matrix

                    Observed (Y = 1) | Observed (Y = 0) | Total
Predicted (Y = 1) | TP               | FP               | TP + FP
Predicted (Y = 0) | FN               | TN               | FN + TN
Total             | TP + FN          | FP + TN          | n
Precision, recall and the F-measure are then given by:

\[ \mathrm{Precision} = \frac{|TP|}{|TP| + |FP|}, \tag{1.5} \]

\[ \mathrm{Recall} = \frac{|TP|}{|TP| + |FN|}, \tag{1.6} \]

\[ \text{F-measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \tag{1.7} \]
The cited performance measures are obtained when the cut-off is 0.5. However,
changing this threshold modifies the previous results and makes it possible to catch
a greater number of good or bad applicants. Graphical tools can also be used as an
evaluation criterion instead of a scalar criterion. In this thesis we use the area under
the receiver operating characteristic (ROC) curve to evaluate the effect of the selected
features on the classification models. The ROC curve shows how the errors change
as the threshold varies: it plots correctly classified positive instances against misclassified
negative ones, allowing us to find the middle ground between specificity and
sensitivity. An area of 1 represents a perfect test; an area equal to or below 0.5
represents a worthless test. The combination of features that gives the highest area
under the ROC curve is therefore considered the most suitable for the classification
task.
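The following sketch computes these measures with scikit-learn on a toy vector of scores, applying the 0.5 cut-off for the scalar metrics and the threshold-free AUC for the ROC criterion (all numbers are made up):

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score)

y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_prob = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.55])
y_pred = (y_prob >= 0.5).astype(int)        # cut-off of 0.5, as above

print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F-measure:", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_prob))     # threshold-free
```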
1.7 Conclusion
This chapter gives an overview of CS and feature selection methods. A brief state of
the art of the commonly used feature selection methods, namely filter, wrapper and
hybrid feature selection methods, is given. Filter methods do not use classifiers;
instead, they use independent criteria and the characteristics of the dataset to select
relevant features. Wrapper methods, on the other hand, are classifier dependent.
Hybrid methods, finally, present a mixture between filters and wrappers. In the
following chapters three feature selection methods are proposed; details about these
methods and their related results are presented in Chapters 2, 3 and 4.
Chapter 2
A Filter Rank Aggregation Approach Based on Op-
timization, Genetic Algorithm and Similarity for
Credit Scoring
Contents
2.1 Introduction .......................... 32
2.2 Filter framework ........................ 32
2.2.1 Feature weighting methods . . . . . . . . . . . . . . . . . . 32
2.2.2 Subset search methods . . . . . . . . . . . . . . . . . . . . . 33
2.2.3 Issue I: Selection trouble and rank aggregation . . . . . . . 34
2.2.4 Issue II: Incomplete ranking and disjoint ranking for similar
features ............................. 37
2.3 New approach for filter feature selection . . . . . . . . . 38
2.3.1 Optimization problem . . . . . . . . . . . . . . . . . . . . . 39
2.3.2 Solution to optimization problem using genetic algorithm . 41
2.3.3 A rank aggregation based on similarity . . . . . . . . . . . 45
2.4 Experimental investigations . . . . . . . . . . . . . . . . . 49
2.4.1 Results and discussion . . . . . . . . . . . . . . . . . . . . . 50
2.5 Conclusion ........................... 60
2.1 Introduction
Filters are very commonly used feature selection methods. This chapter thus discusses
the major issues of this approach and presents a new approach based on rank
aggregation, GA and similarity. In Section (2.2) we give a brief reminder of the filter
framework and of two major issues that arise when dealing with filtering methods:
the selection trouble and the issue of disjoint ranking for similar features. Then,
we present our new approach in Section (2.3) and the experimental investigations in
Section (2.4).
2.2 Filter framework
According to Yu and Liu (2003) filter methods can be grouped into two categories:
feature weighting methods and subset search methods. This categorization is based on
whether they evaluate the relevance of features separately or through feature subsets.
In what follows, we present the advantages and shortcomings of some well known
feature selection methods in each category.
2.2.1 Feature weighting methods
In feature weighting methods, weights are assigned to each feature independently and
features are ranked based on their relevance to the target variable. Relief is a famous
algorithm that studies feature relevance (Kira and Rendell 1992). Algorithm (1.1) in
Chapter (1) presents the basic concepts of this method. The fundamental idea of Relief
is to estimate the relevance of features according to how well their values separate
instances that are near each other, both within the same class and across different
classes (Yu and Liu 2003).
For a dataset with n instances and d features the complexity of Relief is of order
O(nd), which makes it very practical for datasets with large numbers of instances
and features, such as CS datasets. Although simple, Relief does not remove redundant
features: if the weights of such features exceed a particular threshold, they will all
be selected even though many of them are highly correlated with each other (Kira and
Rendell 1992).
In general, feature weighting methods have similar shortcomings to Relief: they
are good at capturing the relevance of features to the target variable but fail to capture
the redundancy among features.
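A minimal sketch of the basic Relief idea for a binary problem (our simplification: Euclidean nearest hit/miss and a fixed random seed, without the sampling refinements of the original algorithm):

```python
import numpy as np

def relief(X, y, seed=0):
    """Basic Relief for binary classes: reward features whose values
    separate the nearest miss and penalize those separating the nearest hit."""
    n, d = X.shape
    w = np.zeros(d)
    for i in np.random.default_rng(seed).permutation(n):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                                      # exclude self
        hit = np.argmin(np.where(y == y[i], dist, np.inf))    # nearest same class
        miss = np.argmin(np.where(y != y[i], dist, np.inf))   # nearest other class
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / n
    return w

# toy data: feature 0 separates the classes, feature 1 is noise
X = np.array([[0.1, 5.0], [0.2, 1.0], [0.9, 5.1], [1.0, 0.9]])
y = np.array([0, 0, 1, 1])
print(np.round(relief(X, y), 3))   # feature 0 receives the larger weight
```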
2.2.2 Subset search methods
Subset search methods use a particular evaluation measure which captures the rele-
vance of each subset. In this way not only is relevance considered, but redundant
features are also identified within the selected subset. In this context Hall (2000) used
a correlation measure to evaluate the relevance of feature subsets. He based his
work on the hypothesis that a good feature subset contains features highly correlated
with the target variable, yet uncorrelated with each other. His proposed approach,
named CFS, also uses a heuristic search to find candidate subsets to be evaluated.
According to Arauzo-Azofra et al. (2008), correlation measures efficiently decrease
irrelevance and redundancy. Yu and Liu (2003) recommended two main approaches
to measuring correlation: one based on classical linear correlation between random
variables and the other based on information theory.
Many correlation coefficients can be used under the first approach, but the most
common are PCC and χ2 (see Section (1.3)). According to Yu and Liu (2003), PCC
is not able to capture correlations that are not linear. Another limitation is that
its calculation requires all features to have numerical values. On the other hand, χ2
is used to investigate whether two distributions of categorical variables differ. To
overcome these shortcomings, correlation measures based on information theory can
be used.
The second approach, based on information theory, measures how much knowledge
two variables carry about each other. MI is a well-known information-theoretic measure
that captures nonlinear dependencies between variables (for more details see Section
(1.3)).
In general, subset search methods need to evaluate all possible subsets; consequently,
a search is performed to find the candidate subsets. These methods therefore
suffer from time complexity issues, which makes them impractical for high-dimensional
data.
Feature ranking makes use of a scoring function computed from the values $(x_i^j, y_i)$
using one of the criteria discussed above, such as weighting, consistency and correlation.
It is assumed that a high score is indicative of a valuable variable and that
variables are sorted in decreasing order of the scoring function. Even when feature
ranking is not optimal, it can be preferable to other feature subset selection
methods because of its computational and statistical scalability. It is computationally
efficient since it requires only the computation of d scores and the sorting of those scores.
It is statistically robust against overfitting because, although it introduces bias, it may
have considerably less variance (Hastie et al. 2001).
2.2.3 Issue I: Selection trouble and rank aggregation
Given the variety of filter-based methods, it is difficult to identify which of the filter
criteria would provide the best output for the experiments. The question is then how
to choose the best criterion for a specific feature selection task. Wu et al. (2009) call
this problem the selection trouble. There exists no universal solution for this problem
short of evaluating all existing methods and then establishing a general conclusion, which
is an impossible task. A more practical approach is to independently apply a mixture
of the available methods and combine their results.
Combining the preference lists produced by such individual rankers into a single, better
ranking is known as rank aggregation. Rank aggregation methods have emerged as an
important tool for combining information in CS. Ensemble feature selection methods,
i.e., rank aggregation, use an idea similar to ensemble learning for classification
(Dietterich 2000). In a first step, a number of different feature selectors, i.e., rankers,
are used; then the outputs of these separate selectors are aggregated and returned as
the final ensemble result.
Ensemble methods have been widely applied to bring together a set of classifiers
for building robust predictive models. It has been shown that these ensemble classi-
fiers are competitive with other individual classifiers and in some cases are superior.
Recently, there have been studies applying the ensemble concept to the process of
feature selection (Dittman et al. 2013). Rank aggregation could be used to improve
the robustness of the individual feature selection methods. Different rankers may
yield different ranking lists that can be considered as local optima in the space of fea-
ture subsets and ensemble feature selection might give a better approximation to the
optimal ranking of features. Also, the representational power of a particular feature
selector might constrain its search space such that optimal subsets cannot be reached.
Ensemble feature selection could help in alleviating this problem by aggregating the
outputs of several feature selectors (Saeys et al. 2008).
As discussed earlier, rank aggregation has many merits. However, with ensemble
feature selection the question is how to aggregate the results of the individual rankers.
A number of different rank aggregation methods have been proposed in the literature.
Some of them are easy to set up, like the mean, median, highest rank or lowest rank
aggregations, and some are more involved (Dittman et al. 2013).
All rank aggregation methods assume that the ranked lists being combined assign
a value to each feature, from 1 to d, where rank 1 is assigned to the most relevant
feature, the second best feature gets 2, and so on until the least relevant feature is
assigned d. Simple rank aggregation methods use a straightforward way to find the
final aggregated list: once each feature has been given a single value based on the
mean, median, highest, or lowest value, all features are ranked based on these new
values. For example, mean aggregation simply takes the mean of the feature's rank
across all the lists and uses this as that feature's value. Likewise, median aggregation
takes the median rank value across all the lists being combined, using the mean of the
middle two values if there is an even number of lists. Highest rank and lowest rank
use related strategies: either the highest (best, smallest) or the lowest (worst, largest)
rank value across all the lists is assigned as the value for the feature
in question. Figure (2.1) shows the general rank aggregation process used to obtain a
consensus rank list from m individual filters.
Figure 2.1: General scheme of filter rank aggregation.
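The four simple schemes just described fit in a few lines; the sketch below (ours) assumes every filter produces a full ranking of all d features, with rank 1 being the best:

```python
import numpy as np

def aggregate(rank_lists, how="mean"):
    """Combine m full rank lists (rows: lists, columns: features;
    entry = rank of that feature, 1 = best) into one consensus ordering."""
    R = np.asarray(rank_lists, dtype=float)
    combined = {"mean": R.mean(axis=0), "median": np.median(R, axis=0),
                "highest": R.min(axis=0), "lowest": R.max(axis=0)}[how]
    return np.argsort(combined)          # feature indices, best first

ranks = [[1, 3, 2, 4],    # filter 1: feature 0 ranked best
         [2, 1, 3, 4],    # filter 2: feature 1 ranked best
         [1, 2, 4, 3]]    # filter 3
for how in ("mean", "median", "highest", "lowest"):
    print(how, aggregate(ranks, how))
```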
Simple ranking methods are easy to set up. However, in many cases it is possible for
two features to end up tied, even if this was not the case in any of the lists being com-
bined and even when these features do not have any tie of similarity (Dittman et al.
2013). Recent work in the area of rank aggregation methods has developed unique
and innovative approaches. These new methods can focus on different aspects of the
ranking process, including comparing results to randomly generated results. Kolde
et al. (2012) proposed an approach that detects features ranked consistently better
than expected under the null hypothesis of uncorrelated inputs and assigns a
significance score to each feature. The underlying probabilistic model makes the
algorithm parameter-free and robust to outliers, noise and errors. Other research
focused on giving more weight to top ranking features or combining well known ag-
gregation methods. In this work we use rank aggregation from another perspective:
we aim to find the best list, the one as close as possible to all the individual ordered
lists together, and this can be seen as an optimization problem. More details are
given in Section (2.3).
2.2.4 Issue II: Incomplete ranking and disjoint ranking for similar features
The rankings provided by different filters may in many cases be incomplete or even
disjoint. Incomplete rankings come in two forms:

- In the first form, some or all of the filters may each provide rankings for only the k best features and ignore the remaining features provided in the beginning (Sculley 2007). Assume we have 7 features {X1, X2, X3, X4, X5, X6, X7}, where {X1, X3} are not among the most relevant features. In this case one of the filters may provide a ranking just over the set {X2, X4, X5, X6, X7} and ignore X1 and X3.

- In the second form, the used filters may provide complete rankings over a limited subset of the available features, due to incomplete knowledge (Sculley 2007). Taking the same example of 7 features {X1, X2, X3, X4, X5, X6, X7}, suppose only information about features {X3, X5, X6} is available. In this case one of the filters may provide a ranking just over the set {X3, X5, X6} and ignore the set {X1, X2, X4, X7}.

Incomplete rankings are common in many financial applications, but they are not
the only problem with rank aggregation. In fact the majority of rankings involve sets
of similar features, and despite the similarity between these features they are not
ranked similarly, which, in addition to the problem of incomplete rankings, may lead
to noisy rankings.
Let us give an illustrative example. Assume we have 7 features {X1, X2, X3, X4, X5,
X6, X7}, where X2 and X5 are highly similar but not identical. We consider the following
two rank lists from two different filters:

list one is {X3, X2, X7, X5}

and list two is {X2, X7, X3, X4, X1}.

If we have no preference for either one, then standard methods of rank aggregation
may produce the rankings in the following way:

Aggregation 1: {X2, X3, X7, X5, X4, X1}.
Aggregation 2: {X3, X2, X7, X5, X4, X1}.

If we want to take advantage of similarity in rank aggregation, we need a new
aggregation method. The latter should use the additional information provided by a
defined similarity measure. Therefore, a more acceptable ranking that agrees with
our point of view is:

{X3, X2, X5, X7, X4, X1}.
To avoid disjoint ranking for similar features, we present in Section (2.3.3)
a simple approach that extends any standard aggregation method in order to take
similarity into account.
2.3 New approach for filter feature selection
In this section we propose a novel approach for filter feature selection: a two-stage
filter feature selection model. In the first stage, an optimization function and a GA
are used to solve the selection trouble and the rank aggregation problem and to sort
the features according to their relevance. In the second stage, an algorithm is proposed
to solve the problem of disjoint ranking for similar features and to eliminate
redundant ones.
2.3.1 Optimization problem
The aim of rank aggregation when dealing with feature selection is to find the best
list, the one as close as possible to all the individual ordered lists together. This
can be seen as an optimization problem in which we look for $\operatorname{argmin}(D, \sigma)$, where
argmin gives a list σ at which the distance D to a randomly selected ordered list
is minimized. In this optimization framework the objective function is given by

\[ f(\sigma) = \sum_{i=1}^{m} W_i \times D(\sigma, L_i), \tag{2.1} \]

where $W_i$ denotes the weight associated with the list $L_i$, $D$ is a distance function
measuring the distance between a pair of ordered lists, $m$ is the number of lists and
$L_i$ is the $i$th ordered list, of cardinality $k$. The best solution is then to look for $\sigma^*$,
which minimizes the total distance between $\sigma^*$ and the $L_i$:

\[ \sigma^* = \operatorname*{argmin}_{\sigma} \sum_{i=1}^{m} W_i \times D(\sigma, L_i). \tag{2.2} \]
Measuring the distance between two ranking lists is a classical problem and several
well-studied metrics are known (Carterette 2009; Kumar and Vassilvitskii 2010),
including the Kendall tau distance and the Spearman footrule distance. Before defining
these two distance measures and their corresponding weighted versions, some notation
is needed.
For each feature $X_j \in L_i$, $r(X_j)$, $j = 1, \dots, d$, denotes the rank of this feature, where
$r(X_j) = 1$ is associated with the feature at the top of $L_i$, that is, the most important
one, and $r(X_j) = d$ with the feature at the bottom, the least important one with
regard to the target concept. All other ranks correspond to the features in between.
Note that ranks are always positive, and a higher rank indicates a lower preference
in the list.
Spearman footrule distance
The Spearman footrule distance between two given ranking lists $L$ and $\sigma$ is defined
as the sum, over all unique elements from both ordered lists combined, of the
absolute differences between their ranks. Formally, the Spearman footrule distance
between $L$ and $\sigma$ is given by

\[ \mathrm{Spearman}(L, \sigma) = \sum_{X \in (L \cup \sigma)} |r_L(X) - r_\sigma(X)|. \tag{2.3} \]

The Spearman footrule distance is a very simple way to compare two ordered lists:
the smaller its value, the more similar the lists. When the two lists to be compared
have no elements in common, the metric is $k(k+1)$.
Kendall’s tau distance
Kendall's tau distance between two ordered rank lists $L$ and $\sigma$ is given by the number
of pairwise adjacent transpositions needed to transform one list into the other (Dinu
and Manea 2006). This distance can be seen as the number of pairwise disagreements
between the two rankings. Hence, the formal definition of the Kendall tau distance
is:

\[ \mathrm{Kendall}(L, \sigma) = \sum_{X_j, X_{j'} \in (L \cup \sigma)} K_{X_j X_{j'}}, \tag{2.4} \]

where

\[ K_{X_j X_{j'}} =
\begin{cases}
0 & \text{if } r_L(X_j) < r_L(X_{j'}),\ r_\sigma(X_j) < r_\sigma(X_{j'}) \\
  & \text{or } r_L(X_j) > r_L(X_{j'}),\ r_\sigma(X_j) > r_\sigma(X_{j'}), \\
1 & \text{if } r_L(X_j) > r_L(X_{j'}),\ r_\sigma(X_j) < r_\sigma(X_{j'}) \\
  & \text{or } r_L(X_j) < r_L(X_{j'}),\ r_\sigma(X_j) > r_\sigma(X_{j'}), \\
p & \text{if } r_L(X_j) = r_L(X_{j'}) = k + 1,\ r_\sigma(X_j) = r_\sigma(X_{j'}) = k + 1.
\end{cases} \tag{2.5} \]
40
Chapter 2: Filter rank aggregation
That is, if we have no knowledge of the relative position of $X_j$ and $X_{j'}$ in one of
the lists, we have several choices: impose no penalty (0), a full penalty (1), or a partial
penalty $p$ such that $0 < p < 1$.
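A small sketch of both distances, under the assumption that the two lists rank the same full feature set, so that ranks can be read off positions and the partial-penalty case of Equation (2.5) never arises:

```python
def footrule(r1, r2):
    """Spearman footrule: sum of absolute rank differences (Eq. 2.3)."""
    feats = set(r1) | set(r2)
    return sum(abs(r1.index(x) - r2.index(x)) for x in feats)

def kendall(r1, r2):
    """Kendall tau distance: number of pairwise disagreements (Eq. 2.4),
    restricted here to full lists over the same features."""
    feats = list(set(r1) | set(r2))
    d = 0
    for i, a in enumerate(feats):
        for b in feats[i + 1:]:
            # a disagreement: the pair is ordered differently in the two lists
            d += (r1.index(a) - r1.index(b)) * (r2.index(a) - r2.index(b)) < 0
    return d

L = ["X3", "X2", "X7", "X5", "X4", "X1"]
s = ["X2", "X7", "X3", "X5", "X4", "X1"]
print(footrule(L, s), kendall(L, s))   # 4 2
```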
Weighted distance
In case the only information available about the individual lists is the rank order, the
Spearman footrule distance and the Kendall tau distance are adequate measures.
However, the presence of any additional information about the individual lists may
improve the final aggregation. Typically with filter methods, weights are assigned to
each feature independently and then the features are ranked based on their relevance
to the target variable. It would be beneficial to integrate these weights w into our
aggregation scheme. Hence, the weight associated with each feature is taken as its
average score across all the ranked feature lists: we add the normalized scores associated
with each list and divide the sum by the number of lists. According to Pihur et al.
(2009) the weighted Spearman footrule distance between the two lists $L$ and $\sigma$ is
given by

\[ w.\mathrm{Spearman}(L, \sigma) = \sum_{X \in (L \cup \sigma)} |w(r_L(X)) - w(r_\sigma(X))| \times |r_L(X) - r_\sigma(X)|, \tag{2.6} \]

where $w(r_L(X))$ and $w(r_\sigma(X))$ denote the weights associated with the feature $X$
with rank $r$ in the lists $L$ and $\sigma$. Analogously to the weighted Spearman footrule
distance, the weighted Kendall tau distance is given by (Pihur et al. 2009):

\[ w.\mathrm{Kendall}(L, \sigma) = \sum_{X_j, X_{j'} \in (L \cup \sigma)} |w(r_L(X_j)) - w(r_\sigma(X_{j'}))| \, K_{X_j X_{j'}}. \tag{2.7} \]
2.3.2 Solution to optimization problem using genetic algorithm
The optimization problem introduced in Section (2.3.1) is a typical integer programming
problem. As far as we know, there is no efficient exact solution to this kind of
problem. One possible approach would be to perform a complete search; however,
this is too time demanding to be applicable in real applications, and more practical
solutions are needed.
The introduced method uses a GA for rank aggregation. GAs were developed by
Holland (1992) to imitate the mechanism of genetic models of natural evolution and
selection. GAs are powerful tools for solving complex combinatorial problems, where
a combinatorial problem involves choosing the best subset of components from a pool
of possible components so that the mixture has some desired quality (Clegg
et al. 2009). GAs are computational models of evolution. They work on the basis of
a set of candidate solutions. Each candidate solution is called a "chromosome", and
the whole set of solutions is called a "population". The algorithm moves from one
population of chromosomes to a new population in an iterative fashion, where each
iteration is called a "generation". In our case, the GA proceeds in the following
steps: initialization, selection, cross-over and mutation.
Initialization
Once a set of rank lists has been generated by several filtering methods, it
is necessary to create an initial population to be used as the starting point
for the genetic algorithm, where each individual in the population represents a possible
solution, i.e., an ordered rank list. This starting population is obtained by randomly
selecting a set of ordered rank lists.
Despite the success of GAs on a wide collection of problems, the choice of the
population size is still an issue. Gotshall and Rylander (2000) showed that the larger
the population size, the better the chance that it contains the optimal solution. However,
increasing the population size increases the number of generations. For good
results, the population size should depend on the length of the ordered lists and on
the number of unique elements in these lists. From empirical studies over a wide
range of problems, a population size between 30 and 100 is usually recommended
(Pihur et al. 2009).
Selection
Once the initial population is fixed, we need to select new members for the next
generation. Each element in the current population is evaluated on the basis
of its overall fitness, given by Equation (2.1). Depending on which distance is used,
new members, i.e., rank lists, are produced by selecting high-performing elements.
Cross-over
The selected members are then crossed over with the cross-over probability. Crossover
randomly selects a point in two selected lists and exchanges the remaining segments
of these lists to create new ones. Crossover therefore combines the features of two
lists to create two similar ranked lists.
Mutation
If only the crossover operator is used to produce the new generation, a problem
may arise when all the ranked lists in the initial population have the same value at a
particular rank: all future lists will then have the same value at this rank. To overcome
this unwanted situation a mutation operator is used. Mutation operates by randomly
changing one or more elements of any list; it acts as a population perturbation operator.
Typically mutation does not occur frequently, so the mutation rate is of the order of
0.001 (Pihur et al. 2009).
Figure (2.2) presents a flowchart summarizing the fundamental steps of the pro-
posed rank aggregation method using GAs.
Figure 2.2: A summary flowchart of the proposed genetic algorithm rank aggregation
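A toy sketch of the whole loop (ours), with equal list weights, truncation selection, a one-point order-preserving crossover and a swap mutation; the thesis's exact operators and the parameters of Table (2.1) may differ:

```python
import random

def footrule(perm, lst):
    """Spearman footrule between a candidate permutation and one input list."""
    return sum(abs(perm.index(x) - lst.index(x)) for x in perm)

def ga_aggregate(lists, pop=30, gens=50, p_mut=0.05, seed=0):
    """Tiny GA: chromosomes are permutations of the features; fitness is
    the summed footrule distance to all input lists (Eq. 2.1, equal weights)."""
    rng = random.Random(seed)
    feats = lists[0]
    fitness = lambda perm: sum(footrule(perm, l) for l in lists)
    popn = [rng.sample(feats, len(feats)) for _ in range(pop)]
    for _ in range(gens):
        popn.sort(key=fitness)
        parents = popn[:pop // 2]                 # selection: keep the best
        children = []
        while len(children) < pop - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(feats))    # order-preserving crossover
            child = a[:cut] + [x for x in b if x not in a[:cut]]
            if rng.random() < p_mut:              # mutation: swap two genes
                i, j = rng.sample(range(len(feats)), 2)
                child[i], child[j] = child[j], child[i]
            children.append(child)
        popn = parents + children
    return min(popn, key=fitness)

lists = [["X3", "X2", "X7", "X5", "X4", "X1"],
         ["X2", "X7", "X3", "X4", "X1", "X5"]]
print(ga_aggregate(lists))
```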
2.3.3 A rank aggregation based on similarity
In this section we solve the problem of disjoint ranking for similar features and eliminate
redundant ones. First, we apply a simple algorithm that incorporates similarity
knowledge into the final ranking in order to handle disjoint rankings of similar
features. Then, redundant features are eliminated by comparing the relevance of
each pair of redundant features to the target class. Figure (2.3) gives a summary of
the rank aggregation approach based on similarity.
Figure 2.3: Flowchart summarizing the rank aggregation approach based on similarity
Solution to disjoint ranking for similar features
First, we perform feature selection using three different methods, namely Relief,
χ2 and MI, and three different rankings are obtained. Second, an aggregation of the
three ranking lists is performed using the genetic rank aggregation algorithm proposed
in Section (2.3.2), yielding a combined list InitialR.
In each iteration we study the similarity between the first feature in InitialR, i.e.,
the feature with r(Xj) = 1, which we denote Var, and the remaining features in this list,
using a function denoted SIM. Before using the SIM function, the possible similarities
between the features in X are summarized in the full MI matrix, where each element
represents the pairwise similarity between two features Xj and Xj'.
This matrix is in general a (d × d) symmetric positive semi-definite matrix taking
values in [0, 1], with diagonal values equal to 0. A large value indicates a close
relationship between the variables Xj and Xj'.
The SIM function compares the similarity between features using this matrix. If Var
has at least 80% similarity with any of the features in the list InitialR, the function SIM
returns TRUE; otherwise it returns FALSE.
Depending on the result of the SIM function, two possible scenarios arise, according
to whether or not Var has a strong link of similarity with the remaining features.
First scenario: if the function SIM returns FALSE, then Var does not have any
strong connection with any other feature in the list InitialR. In this case there is no
disagreement among the rankings, and Var is removed from the list InitialR and added
to FinalR, which is the final aggregated list that will be used for classification.
Let us return to the illustrative example of Section (2.2.4), where we aggregate
two ranking lists {X3, X2, X7, X5} and {X2, X7, X3, X4, X1}. If the obtained aggregated
list is {X3, X2, X7, X5, X4, X1}, then the function SIM studies the similarity
between feature X3 and {X2, X7, X5, X4, X1} and returns FALSE. Then X3 is
added to FinalR, and {X2, X7, X5, X4, X1} is investigated in the next iteration. Figure
(2.4) illustrates the first scenario.
Figure 2.4: Illustrative example of the first scenario
Second scenario: if the SIM function returns TRUE, then the feature Var has a
strong link of similarity with one of the other features in InitialR. We then check
whether these two features have divergent rankings in spite of their strong similarity.
First, we use the function PLUS-SIM, which returns the feature most similar to Var in
the list InitialR. Then, we examine whether the result of PLUS-SIM is the feature
ranked immediately after Var in InitialR. If the feature with the next rank is indeed
the most similar one, then Var and its neighbor are removed from InitialR and added
to FinalR. Otherwise, we use the functions DIST-POS and PERMUT to move the
similar features closer together. More details are given in Algorithm (2.4), together
with a detailed description of the different functions used in this approach.
Algorithm 2.4 Rank aggregation based on similarity
Require: InitialR: initial rank aggregation.
Ensure: FinalR: final rank list.
1: while InitialR ≠ ∅ do
2:   Var = InitialR[1].
3:   Varlist = SUBLIST(InitialR, 2).
4:   if SIM(Var, Varlist) == FALSE then
5:     FinalR = CONCAT(FinalR, Var).
6:     InitialR = Varlist.
7:   else
8:     Var-next = Varlist[1].
9:     if Var-next == PLUS-SIM(Var, Varlist) then
10:      FinalR = CONCAT(FinalR, Var).
11:      FinalR = CONCAT(FinalR, Var-next).
12:      REMOVE(Var-next, Varlist).
13:      InitialR = Varlist.
14:    else
15:      while Var-next ≠ PLUS-SIM(Var, Varlist) do
16:        if DIST-POS(Var-next, PLUS-SIM(Var, Varlist), Varlist) > 1 then
17:          PERMUTE(Var-next, Var, InitialR).
18:        else
19:          PERMUTE(PLUS-SIM(Var, InitialR), Var-next, InitialR).
20:        end if
21:      end while
22:    end if
23:  end if
24: end while
25: Return FinalR.
47
Chapter 2: Filter rank aggregation
SIM(E, L) returns: false, true
Takes a list L and a feature E and checks whether feature E has a similarity with
one of the elements of list L. If the similarity with one of the elements of the list
exceeds 80%, the function returns true; otherwise false.

CONCAT(L, E) returns: list
Takes a list L and appends the second argument E to the end of list L.

POS(E, L) returns: number
Searches for feature E in list L and returns its position in L, or zero if feature E
was not found in L.

PLUS-SIM(E, L) returns: feature
Searches for the feature in list L with the highest similarity to feature E.

SUBLIST(L, P) returns: list
Returns the list of the elements in list L, starting at the specified position P.

REMOVE(E, L)
Removes the element E given as argument from list L.

DIST-POS(E1, E2, L) returns: number
Counts the number of positions between two given elements E1 and E2 in list L.

PERMUT(E1, E2, L)
Swaps the positions of the two features E1 and E2 in list L.
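A simplified sketch of this pass (our reading of Algorithm (2.4): instead of iterative permutations, the most similar partner above the threshold is pulled directly behind the current feature; the similarity function is a stand-in for the MI matrix):

```python
def similarity_aggregate(initial, sim, threshold=0.8):
    """Walk through the aggregated list and move each feature's most
    similar partner (similarity above the threshold) right behind it."""
    final, rest = [], list(initial)
    while rest:
        var, rest = rest[0], rest[1:]
        final.append(var)
        partners = [x for x in rest if sim(var, x) >= threshold]
        if partners:
            best = max(partners, key=lambda x: sim(var, x))  # PLUS-SIM
            rest.remove(best)
            final.append(best)                               # keep them adjacent
    return final

# toy MI-style similarity: only X2 and X5 are strongly similar
pairs = {frozenset(("X2", "X5")): 0.9}
sim = lambda a, b: pairs.get(frozenset((a, b)), 0.0)
print(similarity_aggregate(["X3", "X2", "X7", "X5", "X4", "X1"], sim))
# -> ['X3', 'X2', 'X5', 'X7', 'X4', 'X1'], matching the example of Section 2.2.4
```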
Removing unwanted features
Once the selection trouble is solved and a consensus list of features is obtained,
we come across the issue of choosing the appropriate number of features to retain.
A list of sorted features does not by itself provide the optimal feature subset. In
general, a predefined small number of features is retained from the consensus list in
order to build the final model. If the number of retained features is too small or too
large, the final classification results may be degraded.
Although most of the features that had a disjoint ranking in Section (2.3.3) are
relevant, the underlying concepts can be concisely captured using only a few features,
while keeping all of them has a substantially detrimental effect on the accuracy of the
credit model. So, while solving the problem of disjoint ranking, we use a marker to
tag each pair of treated features as similar items. A matrix MATS is then created to
store the pairs of similar features, where each row of MATS contains a feature and
its similar items. Then, we examine each row of MATS in the light of the computed
MI in order to identify the feature that supplies the most information about the
target class. As a result, the feature with the highest MI is kept and the other similar
features are removed from the aggregated list. Consider again the illustrative example
of Section (2.2.4), and suppose that after dealing with the problem of disjoint ranking
we obtain the list {X3, X2, X5, X7, X4, X1} introduced before. Features X2 and X5
are highly similar, and looking into the MI results we notice that X5 has the highest
MI; consequently X2 is removed from the list.
2.4 Experimental investigations
Our feature selection ensemble is composed of three different filter selection algorithms,
namely Relief, χ2 and MI (see Section (2.2)). These algorithms are available
in the Weka 3.7.0 machine learning package (Bouckaert et al. 2009).
The aggregation of these filters is first performed by our GA approach with the
Kendall (GA-K) and Spearman (GA-S) distances, and then compared to the mean,
median, highest rank and lowest rank aggregations (see Section (2.2.3)). These aggregation
methods were tested using a Matlab implementation of the R package "RobustRankAggreg"
written by Kolde et al. (2012), and compared to the results given by the
individual feature selection methods. In this chapter we use four different classifiers,
namely DT, SVM, ANN and KNN, all available in the Weka 3.7.0 machine learning
package (Bouckaert et al. 2009).
The parameter settings for the GA are given in Table (2.1). These parameters were
chosen based on the results of several preliminary runs of the proposed approach. A
10-fold cross-validation is used to compare each classifier's performance against the others.
Table 2.1: Parameters of experimental environment for genetic algorithm.
Parameter Value
Size of population 100
Mutation rate 0.001
Crossover rate 0.7
Number of generations 10
2.4.1 Results and discussion
First, the three feature selection methods Relief, χ2 and MI are applied to the
datasets and three rankings of features are obtained. Next, the obtained rankings
are aggregated using the available aggregation methods. Then, we pick a number of
top-ranked features to get a few feature subsets, and DT, SVM, ANN and KNN
classify the datasets using these feature subsets. Results are presented in Tables (2.3)-(2.6),
where the best results are shown in bold.
From Table (2.2) we notice that for the Australian dataset, GA-K provides the
highest precision in comparison with the other feature selection methods, except when
DT is used as the classifier, where GA-S achieves the best precision. With the German
dataset, GA-K achieves the highest precision with all classifiers. For the HMEQ
dataset, GA-K achieves the highest precision with the SVM and ANN classifiers, while
the highest precision for the other two classifiers is achieved by GA-S. Finally, for the
Tunisian dataset, the highest precision rate is also achieved by the GA-K aggregation
for the DT, SVM and ANN classifiers, while the highest precision rate for the KNN
classifier is achieved by the highest rank aggregation.
We further investigate the recall results for the set of feature selection methods.
For the Australian dataset the best recalls are achieved by GA-S for the DT and
Table 2.2: Summary of the best performance results achieved by the set of feature
selection methods for the four datasets within the filter framework.
DT SVM ANN KNN
Relief
MI
χ2
Mean
Median
Highest rank
Lowest rank
GA-K ⊕ ⊕ ⊕  ⊕ ⊕ ⊕ ⊕  ⊕⊗⊕
GA-S ⊕⊗⊕⊕ ⊗⊕ ⊗⊕
⊕ Precision, Recall, F-measure, ROC Area.
Colors: Australian, German, HMEQ, Tunisian.
KNN, while for the SVM and ANN classifiers the best rates are achieved by the
lowest rank aggregation. Looking at the German dataset results in Table (2.4), we see
that the highest recall is achieved three times by GA-K, except for ANN, where the
best recall is given by GA-S. Table (2.5) shows that the highest recall for the HMEQ
dataset is obtained with GA-K for KNN, with the mean aggregation for DT
and SVM, and with χ2 for the ANN classifier. For the Tunisian dataset, Table (2.6)
shows that the mean aggregation achieves the highest recall twice, with ANN and DT,
followed by GA-S for the KNN classifier and the median aggregation for SVM.
Table (2.3) shows that GA-S achieves the best F-measure three times, except for
KNN, where the highest F-measure for the Australian dataset is achieved with GA-K.
From Tables (2.4) and (2.5) we see that the highest F-measure is achieved by GA-K,
except for ANN, where the highest rank aggregation gives the best performance with
the HMEQ dataset. Finally, Table (2.6) shows that GA-K achieves the highest
F-measure twice, with SVM and DT, and GA-S does the same with ANN and KNN.
Table 2.3: Performance comparison of the new filter method and the other feature
selection methods for the Australian dataset.
Precision Recall F-Measure ROC Area
Decision Tree
Relief 0.786 0.917 0.846 0.655
MI 0.930 0.870 0.900 0.642
χ2 0.932 0.860 0.905 0.680
Mean 0.931 0.890 0.910 0.700
Median 0.931 0.888 0.909 0.713
Highest rank 0.920 0.943 0.931 0.689
Lowest rank 0.900 0.902 0.901 0.681
GA-K 0.946 0.923 0.934 0.727
GA-S 0.952 0.950 0.951 0.762
Support Vector Machine
Relief 0.795 0.898 0.843 0.702
MI 0.931 0.870 0.900 0.711
χ2 0.918 0.935 0.927 0.690
Mean 0.923 0.943 0.928 0.720
Median 0.921 0.945 0.932 0.721
Highest rank 0.933 0.940 0.936 0.707
Lowest rank 0.894 0.980 0.935 0.705
GA-K 0.945 0.921 0.933 0.898
GA-S 0.943 0.942 0.943 0.890
Artificial Neural Network
Relief 0.885 0.926 0.905 0.653
MI 0.929 0.873 0.902 0.700
χ2 0.926 0.924 0.926 0.683
Mean 0.927 0.934 0.931 0.752
Median 0.925 0.937 0.931 0.755
Highest rank 0.929 0.940 0.934 0.732
Lowest rank 0.896 0.975 0.933 0.726
GA-K 0.931 0.953 0.941 0.728
GA-S 0.929 0.883 0.905 0.742
K-Nearest Neighbor
Relief 0.784 0.881 0.829 0.801
MI 0.890 0.892 0.891 0.789
χ2 0.912 0.799 0.851 0.788
Mean 0.920 0.886 0.902 0.825
Median 0.926 0.906 0.915 0.859
Highest rank 0.940 0.932 0.936 0.832
Lowest rank 0.944 0.930 0.937 0.834
GA-K 0.950 0.920 0.934 0.860
GA-S 0.942 0.942 0.942 0.866
If we look at the ROC area results, we notice from Tables (2.4) and (2.5) that GA-K
achieves the highest values, except for the German dataset with ANN and the HMEQ
dataset with SVM, where GA-S gives
Table 2.4: Performance comparison of the new filter method and the other feature
selection methods for the German dataset.
Precision Recall F-Measure ROC Area
Decision Tree
Relief 0.682 0.555 0.669 0.631
MI 0.516 0.534 0.525 0.621
χ2 0.737 0.477 0.579 0.600
Mean 0.750 0.542 0.612 0.682
Median 0.750 0.545 0.613 0.727
Highest rank 0.788 0.605 0.684 0.689
Lowest rank 0.700 0.642 0.669 0.760
GA-K 0.792 0.701 0.743 0.795
GA-S 0.756 0.697 0.725 0.789
Support Vector Machine
Relief 0.517 0.511 0.514 0.692
MI 0.603 0.534 0.566 0.701
χ2 0.705 0.489 0.577 0.622
Mean 0.766 0.552 0.627 0.780
Median 0.756 0.560 0.643 0.781
Highest rank 0.762 0.623 0.685 0.766
Lowest rank 0.708 0.602 0.650 0.802
GA-K 0.823 0.812 0.817 0.812
GA-S 0.812 0.799 0.805 0.809
Artificial Neural Network
Relief 0.556 0.511 0.533 0.605
MI 0.612 0.534 0.572 0.589
χ2 0.721 0.500 0.591 0.602
Mean 0.781 0.586 0.656 0.689
Median 0.778 0.591 0.671 0.677
Highest rank 0.770 0.600 0.674 0.678
Lowest rank 0.765 0.602 0.673 0.700
GA-K 0.821 0.706 0.759 0.781
GA-S 0.819 0.708 0.759 0.786
K-Nearest Neighbor
Relief 0.703 0.688 0.695 0.702
MI 0.740 0.700 0.719 0.751
χ2 0.697 0.700 0.698 0.743
Mean 0.751 0.723 0.736 0.750
Median 0.749 0.730 0.739 0.800
Highest rank 0.720 0.763 0.740 0.791
Lowest rank 0.719 0.758 0.738 0.802
GA-K 0.820 0.811 0.815 0.813
GA-S 0.817 0.800 0.808 0.810
the best performance. For the Australian dataset, the highest results are achieved twice
by GA-S, once by the mean aggregation (for ANN) and once by GA-K (for the SVM
classifier). Finally, for the Tunisian dataset, Table (2.6) shows that GA-K achieves the
Table 2.5: Performance comparison of the new filter method and the other feature
selection methods for the HMEQ dataset.
Precision Recall F-Measure ROC Area
Decision Tree
Relief 0.747 0.800 0.736 0.782
MI 0.814 0.831 0.801 0.791
χ2 0.818 0.832 0.798 0.760
Mean 0.821 0.981 0.887 0.786
Median 0.808 0.926 0.863 0.788
Highest rank 0.906 0.921 0.913 0.806
Lowest rank 0.842 0.922 0.880 0.805
GA-K 0.920 0.921 0.921 0.822
GA-S 0.923 0.912 0.917 0.815
Support Vector Machine
Relief 0.845 0.807 0.728 0.722
MI 0.822 0.828 0.784 0.755
χ2 0.822 0.828 0.784 0.690
Mean 0.830 0.987 0.902 0.702
Median 0.823 0.906 0.862 0.689
Highest rank 0.905 0.945 0.924 0.744
Lowest rank 0.900 0.891 0.895 0.742
GA-K 0.966 0.933 0.949 0.810
GA-S 0.942 0.940 0.941 0.812
Artificial Neural Network
Relief 0.663 0.715 0.688 0.689
MI 0.681 0.788 0.730 0.781
χ2 0.838 0.974 0.901 0.763
Mean 0.850 0.966 0.904 0.745
Median 0.848 0.971 0.905 0.723
Highest rank 0.897 0.842 0.980 0.788
Lowest rank 0.870 0.880 0.875 0.801
GA-K 0.902 0.972 0.935 0.825
GA-S 0.896 0.955 0.924 0.822
K-Nearest Neighbor
Relief 0.734 0.817 0.773 0.691
MI 0.805 0.820 0.812 0.799
χ2 0.688 0.801 0.740 0.760
Mean 0.822 0.822 0.822 0.801
Median 0.842 0.820 0.830 0.810
Highest rank 0.830 0.826 0.828 0.808
Lowest rank 0.828 0.821 0.824 0.806
GA-K 0.843 0.900 0.870 0.850
GA-S 0.850 0.867 0.858 0.823
best results with DT and SVM, while GA-S achieves the highest results with ANN
and KNN.
The computed values of recall, precision and the F-measure are used to
Table 2.6: Performance comparison of the new filter method and the other feature
selection methods for the Tunisian dataset.
Precision Recall F-Measure ROC Area
Decision Tree
Relief 0.876 0.888 0.882 0.681
MI 0.885 0.883 0.884 0.680
χ2 0.876 0.880 0.879 0.623
Mean 0.860 0.962 0.913 0.796
Median 0.871 0.899 0.884 0.791
Highest rank 0.901 0.907 0.904 0.793
Lowest rank 0.889 0.902 0.895 0.799
GA-K 0.922 0.912 0.917 0.813
GA-S 0.917 0.908 0.912 0.811
Support Vector Machine
Relief 0.845 0.807 0.728 0.682
MI 0.822 0.828 0.784 0.651
χ2 0.822 0.828 0.784 0.645
Mean 0.830 0.987 0.902 0.762
Median 0.889 0.975 0.930 0.755
Highest rank 0.922 0.907 0.914 0.800
Lowest rank 0.881 0.880 0.880 0.815
GA-K 0.967 0.952 0.959 0.831
GA-S 0.966 0.923 0.944 0.823
Artificial Neural Network
Relief 0.827 0.847 0.830 0.703
MI 0.822 0.852 0.826 0.700
χ2 0.833 0.850 0.832 0.623
Mean 0.875 0.964 0.917 0.802
Median 0.881 0.951 0.914 0.801
Highest rank 0.905 0.901 0.894 0.729
Lowest rank 0.878 0.888 0.887 0.725
GA-K 0.924 0.902 0.912 0.822
GA-S 0.916 0.943 0.929 0.826
K-Nearest Neighbor
Relief 0.788 0.800 0.794 0.810
MI 0.821 0.688 0.748 0.799
χ² 0.753 0.677 0.713 0.780
Mean 0.809 0.801 0.805 0.812
Median 0.811 0.799 0.805 0.821
Highest rank 0.950 0.688 0.700 0.798
Lowest rank 0.940 0.691 0.796 0.792
GA-K 0.889 0.888 0.888 0.900
GA-S 0.887 0.901 0.893 0.905
measure the performance of the feature selection methods. The differences between any two feature selection methods may be due to chance or may reflect a genuine difference between them. To rule out the possibility that the difference is due to
chance and to confirm our conclusions, statistical hypothesis testing is used.
Analysis of variance (ANOVA) is a form of statistical hypothesis testing mainly used in the analysis of experimental data to test the equality of three or more population means. Here, we are interested in determining whether the mean values of a given performance measure differ significantly according to the feature selection method and the classification method used. A two-way ANOVA is performed to conclude on the differences between the feature selection methods and between the classification methods: the first factor is represented by the feature selection methods and the second by the classification methods. ANOVA then tests the null hypothesis H0 that the means of all groups of factor 1 are equal, the null hypothesis H0 that the means of all groups of factor 2 are equal, and whether the relationship between one factor and the dependent variable, i.e. the level of F-measure, changes across the levels of the other factor. Several hypotheses are thus jointly tested in a two-way ANOVA. H0 and the alternative hypothesis H1 for factor 1 are
$$H_0:\ \mu^{1}_{\mathrm{Relief}} = \mu^{1}_{\mathrm{MI}} = \mu^{1}_{\chi^{2}} = \mu^{1}_{\mathrm{Median}} = \mu^{1}_{\mathrm{Mean}} = \mu^{1}_{\mathrm{Highest}} = \mu^{1}_{\mathrm{Lowest}} = \mu^{1}_{\mathrm{GA\text{-}S}} = \mu^{1}_{\mathrm{GA\text{-}K}}$$
(performances of the selection methods are equal), versus
$$H_1:\ \exists\, t,\ \mu^{1}_{t} \neq \mu^{1}_{i},\qquad i, t \in \{\mathrm{Relief}, \mathrm{MI}, \chi^{2}, \mathrm{Median}, \mathrm{Mean}, \mathrm{Highest}, \mathrm{Lowest}, \mathrm{GA\text{-}K}, \mathrm{GA\text{-}S}\},\ i \neq t$$
(at least one feature selection method's mean performance differs from the others).
H0 and H1 for factor 2 are
$$H_0:\ \mu^{2}_{\mathrm{DT}} = \mu^{2}_{\mathrm{SVM}} = \mu^{2}_{\mathrm{ANN}} = \mu^{2}_{\mathrm{KNN}}$$
(performances of the classifiers are equal), versus
$$H_1:\ \exists\, t,\ \mu^{2}_{t} \neq \mu^{2}_{i},\qquad i, t \in \{\mathrm{DT}, \mathrm{SVM}, \mathrm{KNN}, \mathrm{ANN}\},\ i \neq t$$
(at least one classifier's mean performance differs from the others).
Interaction between the two factors is given by the following hypotheses:
H0: There is no interaction between the two factors,
versus
H1: There is an interaction between the two factors
ANOVA analyzes variance by separating it into two parts: within-groups variability and between-groups variability. The F statistic then indicates whether the between-groups variability is significantly greater than the within-groups variability. If the F statistic is significant (p-value ≤ 0.05), at least one group mean is significantly different from one or more of the others, and we reject H0. To set up the two-way ANOVA we use the data in Table (2.7); the results obtained from this table are summarized in Table (2.8).
Table 2.7: Summary of F-measures for all feature selection methods with the four classification methods in filter framework (within each classifier block, the rows correspond to the Australian, German, HMEQ and Tunisian datasets).
              Relief  MI      χ²      Mean    Median  Highest Lowest  GA-K    GA-S
DT  Australian 0,856  0,900   0,905   0,910   0,909   0,931   0,901   0,934   0,951
    German     0,669  0,525   0,579   0,612   0,613   0,684   0,669   0,743   0,725
    HMEQ       0,736  0,801   0,798   0,887   0,863   0,913   0,880   0,921   0,917
    Tunisian   0,882  0,884   0,879   0,913   0,884   0,904   0,895   0,917   0,912
SVM Australian 0,843  0,900   0,927   0,928   0,932   0,936   0,935   0,933   0,943
    German     0,514  0,566   0,577   0,627   0,643   0,685   0,650   0,817   0,805
    HMEQ       0,728  0,784   0,784   0,902   0,862   0,924   0,895   0,949   0,941
    Tunisian   0,728  0,784   0,784   0,902   0,930   0,914   0,880   0,959   0,944
ANN Australian 0,905  0,902   0,926   0,931   0,931   0,934   0,933   0,941   0,905
    German     0,533  0,572   0,591   0,656   0,671   0,674   0,673   0,759   0,759
    HMEQ       0,688  0,730   0,901   0,904   0,905   0,980   0,875   0,935   0,924
    Tunisian   0,830  0,826   0,832   0,917   0,914   0,894   0,887   0,912   0,929
KNN Australian 0,829  0,891   0,851   0,902   0,915   0,936   0,937   0,934   0,942
    German     0,695  0,719   0,698   0,736   0,739   0,740   0,738   0,815   0,808
    HMEQ       0,773  0,812   0,740   0,822   0,830   0,828   0,824   0,870   0,858
    Tunisian   0,794  0,748   0,713   0,805   0,805   0,700   0,796   0,888   0,893
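As a hedged illustration, not part of the original experiments, the two-way ANOVA of Table (2.8) can be reproduced from the data of Table (2.7) with statsmodels. The sketch below assumes the 144 F-measures are stored in long form in a hypothetical file filter_fmeasures.csv with columns method, classifier and fmeasure (one row per method-classifier-dataset cell):

import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Hypothetical long-form version of Table 2.7: 144 rows, one per
# (selection method, classifier, dataset) cell with its F-measure.
df = pd.read_csv("filter_fmeasures.csv")

# Two-way ANOVA with interaction; Type III sums of squares as in Table 2.8.
model = ols("fmeasure ~ C(method) * C(classifier)", data=df).fit()
print(anova_lm(model, typ=3))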
The actual result of the two-way ANOVA, namely whether either of the two factors or their interaction is statistically significant, is shown in Table (2.8). The particular rows we are interested in are the "Selection Method", "Classifier" and "Selection Method * Classifier" rows.
Table 2.8: Tests of between-subjects effects in filter framework.
Source                          Type III Sum of Squares   DF    Mean Square   F          Sig. (p-value)
Corrected Model                 0,359a                    35    0,010         0,766      0,815
Intercept                       98,109                    1     98,109        7337,855   0,000
Selection Method                0,304                     8     0,038         2,840      0,007
Classifier                      0,006                     3     0,002         0,160      0,923
Selection Method * Classifier   0,048                     24    0,002         0,151      1,000
Error                           1,444                     108   0,013
Total                           99,912                    144
Corrected Total                 1,802                     143
Dependent Variable: F-measure
These rows inform us whether the independent variables ("Selection Method" and "Classifier") and their interaction ("Selection Method * Classifier") have a statistically significant effect on the dependent variable, F-measure. It is important to look first at the "Selection Method * Classifier" interaction, as this determines how the results can be interpreted. The "Sig." column shows that there is no significant interaction between the two factors, which means that the effect of one factor on the F-measure is the same at every fixed level of the other factor.
We also report the results for "Selection Method" and "Classifier", but again these need to be interpreted in the context of the interaction result. We can see from the table that there is no statistically significant difference in mean F-measure between the different classifiers (p-value = 0,923), but that there are statistically significant differences between the different feature selection methods (p-value = 0,007).
ANOVA only tells us whether there is any difference between the groups. If we want to know where the differences are, additional analysis is needed. We therefore use the Tukey post hoc test to perform multiple comparisons of the different feature selection methods, obtaining the multiple comparisons table shown in Table (2.9).
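As a sketch under the same assumptions as the ANOVA code above, the Tukey comparisons of Table (2.9) can be computed with statsmodels from the hypothetical long-form data frame df:

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# df as in the ANOVA sketch: one F-measure per (method, classifier, dataset) cell.
tukey = pairwise_tukeyhsd(endog=df["fmeasure"], groups=df["method"], alpha=0.05)
print(tukey.summary())  # pairwise mean differences and adjusted p-values, cf. Table 2.9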
Table 2.9: Multiple comparisons table for feature selection methods in filter framework.
Selection   Selection   Mean diff.    Sig.   |  Selection   Selection   Mean diff.    Sig.
method (I)  method (J)  (I-J)                |  method (I)  method (J)  (I-J)
χ²          GA-K        -0,108875*    0,009  |  Mean        χ²          0,054313      0,187
            GA-S        -0,104438*    0,012  |              GA-K        -0,054562     0,185
            Highest     -0,068250     0,098  |              GA-S        -0,050125     0,223
            Lowest      -0,055188     0,180  |              Highest     -0,013937     0,734
            Mean        -0,054313     0,187  |              Lowest      -0,000875     0,983
            Median      -0,053813     0,191  |              Median      0,000500      0,990
            MI          0,008813      0,830  |              MI          0,063125      0,125
            Relief      0,030125      0,463  |              Relief      0,084438*     0,041
GA-K        χ²          0,108875*     0,009  |  Median      χ²          0,053813      0,191
            GA-S        0,004437      0,914  |              GA-K        -0,055062     0,181
            Highest     0,040625      0,323  |              GA-S        -0,050625     0,218
            Lowest      0,053688      0,192  |              Highest     -0,014437     0,725
            Mean        0,054562      0,185  |              Lowest      -0,001375     0,973
            Median      0,055062      0,181  |              Mean        -0,000500     0,990
            MI          0,117688*     0,005  |              MI          0,062625      0,128
            Relief      0,139000*     0,001  |              Relief      0,083938*     0,042
GA-S        χ²          0,104438*     0,012  |  MI          χ²          -0,008813     0,830
            GA-K        -0,004437     0,914  |              GA-K        -0,117688*    0,005
            Highest     0,036188      0,378  |              GA-S        -0,113250*    0,007
            Lowest      0,049250      0,231  |              Highest     -0,077063     0,062
            Mean        0,050125      0,223  |              Lowest      -0,064000     0,120
            Median      0,050625      0,218  |              Mean        -0,063125     0,125
            MI          0,113250*     0,007  |              Median      -0,062625     0,128
            Relief      0,134563*     0,001  |              Relief      0,021313      0,603
Highest     χ²          0,068250      0,098  |  Relief      χ²          -0,030125     0,463
            GA-K        -0,040625     0,323  |              GA-K        -0,139000*    0,001
            GA-S        -0,036188     0,378  |              GA-S        -0,134563*    0,001
            Lowest      0,013062      0,750  |              Highest     -0,098375*    0,018
            Mean        0,013937      0,734  |              Lowest      -0,085313*    0,039
            Median      0,014437      0,725  |              Mean        -0,084438*    0,041
            MI          0,077063      0,062  |              Median      -0,083938*    0,042
            Relief      0,098375*     0,018  |              MI          -0,021313     0,603
Lowest      χ²          0,055188      0,180
            GA-K        -0,053688     0,192
            GA-S        -0,049250     0,231
            Highest     -0,013062     0,750
            Mean        0,000875      0,983
            Median      0,001375      0,973
            MI          0,064000      0,120
            Relief      0,085313*     0,039
From Table (2.9) we notice some repetition of the results, but regardless of which row we choose to read from, we are interested in the differences between (1) GA-K and the individual feature selection methods, i.e. Relief, MI and χ², and (2)
GA-S and the individual feature selection methods. From the results, we can see that there is a statistically significant difference between the results obtained by GA-K and GA-S and the individual results (p-value < 0.05).
2.5 Conclusion
In this chapter, we investigated the effect of fusing a set of ranking methods. First, we conducted a preliminary study in which the issue of rank aggregation is presented as an optimization problem solved using GA and distance measures. Second, we focused on solving the problem of disjoint rankings for similar features and on choosing the right number of features from the final ranked list; for that, we related the similarity of the features to their rankings. We evaluated the proposed approach on four credit datasets. Results show a generally beneficial effect of aggregating feature rankings compared with rankings produced by single methods. We also compared the proposed approach with four well-known aggregation methods; its results are either superior to or at least as adequate as those of the other aggregation methods. The second method for selecting the most important features is wrapper feature selection. Details about this method are presented in the next chapter.
Chapter 3
An Ensemble Wrapper Feature Selection Based on
an Improved Exhaustive Search for Credit Scoring
Contents
3.1 Introduction
3.2 Wrapper Framework
  3.2.1 Issue I: Evaluation using a single classifier
  3.2.2 Issue II: Subset generation and search strategies
3.3 New approach for wrapper feature selection
  3.3.1 Primary dimensionality reduction step: similarity study
  3.3.2 Subset generation step: speeding up exhaustive search by heuristics
  3.3.3 Evaluation step: effects of using multiple classifiers
3.4 Experimental Investigations
  3.4.1 Results and discussion for the same-type approach
  3.4.2 Results and discussion for the mixed-type approach
3.5 Conclusion
3.1 Introduction
Wrapper feature selection usually selects a subset of the most relevant features with respect to the classification performance given by a particular classifier.
Although efficient, wrapper feature selection, as pointed out in the introduction, has some limitations, since its results depend on the search strategy and on the use of a single classifier in the evaluation process. Thus, we introduce a new approach based on ensemble methods that deals with the major issues of the wrapper approach. We give in Section (3.2) some details on the wrapper framework and discuss subset generation, search strategies and the use of multiple classifiers as an evaluation function. Then, we give the main points of our new approach in Section (3.3). In Section (3.4) we present the results of recall, precision and F-measure obtained on four credit datasets.
3.2 Wrapper Framework
The main idea of wrapper feature selection is to remove unwanted features from the data by using the predictive accuracy of a particular classifier. It has been shown that wrappers generally outperform filters (Liu and Schumann 2005) in terms of accuracy, since they are tuned to the specific interactions between the classifier and the dataset. However, wrapper methods have practical and theoretical limitations (Chrysostomou 2008). Wrappers typically lack generality, since the resulting subset of features is tied to the bias of the classifier used in the evaluation function: the optimal feature subset is specific to the classifier under consideration. Also, finding the optimal feature subset comes with a high computational cost. This cost depends on the number of times the classifier is trained during the evaluation process, on the number of subsets to be investigated and on the size of these feature subsets.
In fact, the number of subsets and their size depend on the search strategy used. If a complete search is used, the number of subsets increases along with the time complexity; if a heuristic is used, not all subsets are investigated and some interesting combinations of features may be missed. In the following, we focus on two of the discussed wrapper shortcomings: the bias of the classifier and the subset generation process.
3.2.1 Issue I: Evaluation using a single classifier
Using a single classifier in the wrapper process may favor one candidate subset over the others. In fact, the difference in biases and assumptions of each classifier may influence the final result in terms of accuracy and execution time (Chrysostomou et al. 2008). According to Chrysostomou (2008), when the classifier used for evaluation is changed, the set of kept features changes, and as a result different levels of accuracy are obtained, inducing a lack of generality in the produced model. The level of complexity of the classifier is also a fundamental factor to be investigated. In theory, a complex classifier may take much longer to choose the best subset of features than a classifier considered simple. For example, when SVM is used as the evaluation function in the process of finding the best feature subset, it may take more time to identify the most relevant features than LR or KNN.
The number of classifiers used in the combination framework may also affect the evaluation process. If a small number of classifiers is used, it is likely that we obtain a good set of relevant features, given the high level of harmony among the classifiers. However, if a large number of classifiers is used, we may end up with fewer relevant features, since the level of agreement between the classifiers will probably be low when more classifiers are required to agree on the relevance of a feature.
Given this important limitation of using a single classifier, we consider using more than one classifier within the wrapper feature selection framework. Consequently, we may notably improve the overall accuracy. In fact, we look for mutually approved sets of significant features. Such sets will possibly give higher classification accuracies and reduce the biases of individual classifiers.
3.2.2 Issue II: Subset generation and search strategies
The ideal feature selection approach is an exhaustive search over the full set of features to find an optimal subset. However, as the number of features increases, the exhaustive search rapidly becomes impractical even for a moderate d (Chan et al.
2010). Among the many ways in which feature subsets can be generated, three basic schemes are available in the literature, namely forward selection, backward elimination and the random scheme (Liu and Yu 2005).
Forward selection and backward elimination are considered heuristics. Generally, sequential generation can help in getting a valid subset in a reasonable time, but still cannot find an optimal set. This is due to the fact that the generation scheme uses a heuristic to approach an optimal subset by sequentially selecting the best feature, as in the forward case, or removing the worst, as in the backward case. Using such a generator will without doubt speed up the selection process. However, if the search falls into a local optimum it cannot turn back: the generator has no way to escape the local optimum, because what has been removed cannot be added and what has been added cannot be removed. This is a major shortcoming of sequential schemes.
To avoid this problem we may use the random generation scheme, which adds randomness to the fixed rule of sequential generation and helps avoid getting stuck at local optima. Although the random generation scheme may improve sequential results, it still does not guarantee finding an optimal subset. This can be further elaborated in terms of search strategies (Yun et al. 2007).
Hence, in order to reduce the search space, we propose to first reduce the number of features by forward selection and backward elimination, so that the exhaustive search method can handle the generation process in a realistic time. In this way, the selected feature set is much better in terms of accuracy than those from forward selection or backward elimination alone, and is computed much faster than with the purely exhaustive method.
3.3 New approach for wrapper feature selection
In this section we propose a novel approach for wrapper feature selection. We consider building a three-stage wrapper feature selection model. First, a primary dimensionality reduction step is conducted on the original feature space based on a similarity study using prior knowledge. This step is
used in order to reduce the search space. Second, the subset generation step is performed using a mixture of heuristic and exhaustive search methods. The final step studies the evaluation process of wrapper feature selection and the effect of using multiple classifiers of the same and of different natures.
3.3.1 Primary dimensionality reduction step: similarity study
The first step of our proposed approach is designed specifically to select less redundant features without sacrificing quality. Redundancy is measured by a similarity measure between a preselected set of features and the remaining features in the dataset. In this step we enhance an existing set of preselected features by adding features that complement the existing set while still having strong predictive power.
In CS we may already have a set of features preselected using prior information. In fact, experts in banks have years of experience with particular categories of credit and knowledge about which features are more important. This knowledge is generally obtained through years of use of classical feature selection methods. Thus, a possible improvement of the exhaustive search is to use this prior knowledge to eliminate redundant features before generating the candidate subsets. Since our goal is to take advantage of any additional information about the features, we may want to select a set of features complementary to those preselected by bank experts. We study the effect of using prior information on relevant feature complexity.
1. First, we split the feature set into two sets. The first regroups the features assumed to be more relevant according to some prior knowledge; the second contains the remaining ones.
2. Once the two sets are obtained we conduct a similarity study and construct a similarity matrix. In this step MI is chosen as the similarity measure given its efficiency (see Chapter 1, Section 1.3 for more information about MI).
3. Then, we take each feature from the remaining set and investigate its level of similarity with the features of the first set. If the similarity is over 80%, the evaluated feature is eliminated; otherwise it is retained for further examination (see the sketch below).
The first part of Figure (3.2) shows a simplified flow chart of the dimensionality
reduction before the exhaustive search.
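A minimal Python sketch of this elimination step follows. It is an illustration rather than the exact implementation: it assumes the features have been discretized, uses scikit-learn's normalized mutual information (in [0, 1]) as the similarity measure with the 80% threshold of step 3, and the function name drop_redundant and its dictionary-based inputs are ours:

from sklearn.metrics import normalized_mutual_info_score

def drop_redundant(preselected, candidates, threshold=0.8):
    """Keep a candidate feature only if its normalized MI with every
    expert-preselected feature stays at or below the threshold.
    Both arguments map feature names to discretized value arrays."""
    kept = []
    for name, column in candidates.items():
        similarity = max(normalized_mutual_info_score(column, ref)
                         for ref in preselected.values())
        if similarity <= threshold:  # similarity over 80% means redundancy
            kept.append(name)
    return kept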
3.3.2 Subset generation step: speeding up exhaustive search by heuristics
Once the redundant feature elimination step is performed, the search space is reduced and the exhaustive search becomes less expensive. Although the search space is reduced by the previous step, this search method still poses the problem of being computationally prohibitive. In fact, an exhaustive search method is an enumerative search method that considers all possible feature combinations. According to Chan et al. (2010), this method is practical if the number of features is less than 10; using more than 10 features would be costly in terms of computational time. In this case, specific heuristics can be used to reduce the set of candidate solutions to a manageable size. We expect that using the first step combined with a heuristic will reduce the search space to fewer than 10 features, which makes the exhaustive search a realistic task, considering that all datasets in this research have fewer than 40 features.
In theory, each search strategy has its particular effects on the selected feature subset and on the performance of the induction algorithm. When the heuristic is changed, the result may differ in terms of the number of selected features. As such, we extend the idea of ensemble methods to search strategies. In the following, we propose to run several heuristics in order to get diverse results. We use both sequential forward feature selection and backward feature elimination as parts of a combined feature selection. Figure (3.1) illustrates the proposed combination process for an example of 10 features.
In the first step, the forward selection and backward elimination methods are simultaneously applied to the reduced feature set, resulting in two different intermediate
feature lists. Each list includes a set of complementary variables.
Figure 3.1: A flowchart combining heuristic and exhaustive search
In the second step, the two lists are merged into a single list with the most relevant features, and the non-selected features are eliminated. Since some selected features may appear in one of the intermediate feature lists and not in the other, these features must be re-weighted in order to take their degree of relevance into consideration. If one feature is selected by both forward and backward search while another is selected only once, the feature selected more often is considered more relevant. Consequently, the resulting features are re-weighted according to their number of appearances in the intermediate lists: the weight is 1 if a feature appears in both intermediate feature subsets, and 0.5 otherwise. In the third step, a complete search is used on the weighted features, as sketched below.
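The following Python sketch illustrates the combined generation scheme of Figure (3.1) under simplifying assumptions: scikit-learn's SequentialFeatureSelector stands in for the forward and backward heuristics, a logistic regression is a placeholder evaluator, and the 1/0.5 weights are recorded as bookkeeping (a fuller implementation could use them to order the exhaustive search). Function and variable names are ours:

import itertools
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def combined_search(X, y, n_keep=5, cv=5):
    """Forward and backward heuristics, merge and weight, then exhaustive search."""
    base = LogisticRegression(max_iter=1000)
    forward = SequentialFeatureSelector(
        base, n_features_to_select=n_keep, direction="forward").fit(X, y)
    backward = SequentialFeatureSelector(
        base, n_features_to_select=n_keep, direction="backward").fit(X, y)

    f_idx = set(np.flatnonzero(forward.get_support()))
    b_idx = set(np.flatnonzero(backward.get_support()))
    # weight 1 when a feature is kept by both heuristics, 0.5 when by only one
    weights = {i: 1.0 if i in (f_idx & b_idx) else 0.5 for i in (f_idx | b_idx)}

    # exhaustive search over the merged, now small, feature list
    best_subset, best_score = None, -np.inf
    merged = sorted(weights)
    for r in range(1, len(merged) + 1):
        for subset in itertools.combinations(merged, r):
            score = cross_val_score(base, X[:, list(subset)], y, cv=cv).mean()
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, weights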
3.3.3 Evaluation step: effects of using multiple classifiers
Many classification methods have been proposed to deal with the creditworthiness problem on the basis of information from past applicants. The most common statistical methods to evaluate applicants' solvency are LR and DA (Paleologo et al. 2010). Unfortunately, these two methods need some fundamental assumptions on the data (Šušteršič et al. 2009). In addition to traditional methods, different machine learning and artificial intelligence methods have been used, such as DT, ANN, SVM and many others. Although the majority of these methods are simple and do not need assumptions on the data, they need a good mechanism to search for optimal model parameters and feature subsets.
Each of these individual methods produces a single discrimination rule, and each has qualities and restrictions which may influence the feature evaluation process. No one can confirm for sure the superiority of one classifier over another. Rather than trying to optimize the accuracy of one classifier, it is better to integrate multiple classifiers. This approach has been recognized as successful, achieving better performance and higher predictive precision in the learning process (Hsieh and Hung 2010; Chen and Li 2010). Here, the same ensemble concept is adopted in the feature evaluation process as part of the pre-processing stage. Figure (3.2) shows how the results of a set of classifiers are merged to form a new evaluation function.
Classifiers
Many statistical classifiers rely on assumptions such as a normal distribution and the absence of multicollinearity. But what if the data are not normal? In that case the statistical family is not appropriate because of its number of restrictions.
The algorithms chosen in this study are representative of the most popular families of classifier models and were selected to form committees of experts, in order to test various classifier combination schemes. Therefore, this section focuses only on the general aspects of each family. As the details of the algorithms are not the main concern of
Figure 3.2: A wrapper approach combining multiple classifiers for feature selection.
this thesis, only a conceptual description of the algorithms is given in Table (3.1); more details are given in Appendix (B).
Among the most popular classifier models, four families were selected, namely DT, SVM, KNN and ANN.
Table 3.1: General properties of some classification algorithms.
Classification Algorithms
DA LR DT SVM ANN KNN
Assumptions
numeric variables yes yes
normally distributed variables yes
equal covariance matrices yes
problem of interaction yes yes
problem of multicollinearity yes yes
normalization of variables yes yes
sensitive to class proportions yes yes yes
Output Score yes yes yes yes yes
Class yes yes yes yes
Classifier arrangement approaches
In this section, two different classifier arrangement approaches are used within the wrapper evaluation process, namely the same-type approach and the mixed-type approach. The same-type approach combines classifiers from the same family and uses them within the wrapper framework to select the relevant features; for example, classifiers belonging to the SVM family are combined together. The mixed-type approach combines classifiers from different families.
Table 3.2: Summary of used classifiers within each family.
DT:   J48, RandomForest (RF)
ANN:  MultilayerPerceptron (MP), VotedPerceptron (VP)
KNN:  K=1 (1NN), K=5 (5NN)
SVM:  Polynomial kernel (SVMP), Radial kernel (SVMR)
The same-type combinations use classifiers from the four families discussed above. More precisely, two classifiers are taken from the DT family, two from the ANN family, one from the KNN family used with two different numbers of neighbors (K=1 and K=5), and one from the SVM family used with two different kernels, namely the polynomial and the radial basis function kernels. In this way, features related to both linear and non-linear aspects can be identified. All considered classifiers are summarized in Table (3.2).
By using the second arrangement approach, we investigate how classifiers from different families work together and how their interaction affects feature selection. Classifiers are combined exhaustively, so that each classifier is used with every other classifier of a different nature. This leads to the construction of a total of 76 mixed-type classifier combinations, described in Tables (3.3)-(3.4), including 24 two-classifier mixed-type combinations and 52 three-classifier mixed-type combinations. In this way, both approaches help us obtain a complete picture of the effects of the nature and number of classifiers on feature selection.
Table 3.3: Summary of the possible combination of all classifiers, where the number
of classifiers to be combined is two.
Possible combinations
(J48+SVMP), (J48+SVMR), (J48+MP), (J48+VP), (J48+1NN), (J48+5NN), (RF+SVMP), (RF+SVMR), (RF+MP), (RF+VP), (RF+1NN), (RF+5NN), (SVMP+MP), (SVMP+VP), (SVMP+1NN), (SVMP+5NN), (SVMR+MP), (SVMR+VP), (SVMR+1NN), (SVMR+5NN), (MP+1NN), (MP+5NN), (VP+1NN), (VP+5NN).
Aggregation rules
Traditionally, the approach used to build a multi-classifier system is to experimentally compare the performance of several classifiers and select the best one. However, many alternative approaches based on combining multiple classifiers have emerged in recent years (Kuncheva et al. 2001). There are basically two classifier combination scenarios. In the first, all classifiers use the same representation of the input example; in this case, each classifier, for a given input example, produces an estimate of the same posterior class probability. In the second scenario, each classifier uses its own representation of the input example. For multiple classifiers using distinct representations, many existing schemes can be considered in which all the representations are used jointly to make a decision. From these we can derive the commonly used classifier
combination schemes such as the product rule, average rule, minimum rule, maximum
rule and majority voting schemes.
Table 3.4: Summary of the possible combination of all classifiers, where the number
of classifiers to be combined is three.
Possible combinations
(J48+RF+SVMP), (J48+RF+SVMR), (J48+RF+MP), (J48+RF+VP), (J48+RF+1NN), (J48+RF+5NN), (J48+SVMP+SVMR), (J48+MP+VP), (J48+1NN+5NN), (J48+SVMP+MP), (J48+SVMP+VP), (J48+SVMP+1NN), (J48+SVMP+5NN), (J48+SVMR+MP), (J48+SVMR+VP), (J48+SVMR+1NN), (J48+SVMR+5NN), (J48+MP+1NN), (J48+MP+5NN), (J48+VP+1NN), (J48+VP+5NN), (RF+SVMP+SVMR), (RF+MP+VP), (RF+1NN+5NN), (RF+SVMP+MP), (RF+SVMP+VP), (RF+SVMP+1NN), (RF+SVMP+5NN), (RF+SVMR+MP), (RF+SVMR+VP), (RF+SVMR+1NN), (RF+SVMR+5NN), (RF+MP+1NN), (RF+MP+5NN), (RF+VP+1NN), (RF+VP+5NN), (SVMP+SVMR+MP), (SVMP+SVMR+VP), (SVMP+SVMR+1NN), (SVMP+SVMR+5NN), (SVMP+MP+VP), (SVMP+1NN+5NN), (SVMP+MP+1NN), (SVMP+MP+5NN), (SVMP+VP+1NN), (SVMP+VP+5NN), (SVMR+MP+VP), (SVMR+1NN+5NN), (SVMR+MP+1NN), (SVMR+MP+5NN), (SVMR+VP+1NN), (SVMR+VP+5NN).
The simplest and most common way to aggregate is the simple arithmetic mean, also known as the average. This operator is interesting because it gives an
aggregated value that is smaller than the greatest argument and larger than the smallest one. The resulting aggregation is thus "a middle value"; this property is known as the compensation property. The minimum and the maximum are also basic aggregation operators: the minimum gives the smallest value of a set, while the maximum gives the greatest one (Kittler 1998). Majority vote is also a common classifier combination method, particularly used in classifier ensembles when the class labels of the classifiers are crisp (Kuncheva et al. 2001). In general, majority voting is a simple method that does not require any parameters to be trained or any additional information. A sketch of these rules is given below.
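These fixed rules can be written compactly. The following sketch, with names of our choosing, combines the class-probability estimates of several classifiers; it assumes each classifier outputs an array of shape [n_samples, n_classes]:

import numpy as np

def combine_predictions(probas, rule="average"):
    """Fuse per-classifier class-probability estimates with a fixed rule."""
    P = np.stack(probas)              # shape: [n_classifiers, n_samples, n_classes]
    if rule == "average":
        scores = P.mean(axis=0)
    elif rule == "product":
        scores = P.prod(axis=0)
    elif rule == "maximum":
        scores = P.max(axis=0)
    elif rule == "minimum":
        scores = P.min(axis=0)
    elif rule == "majority":          # crisp label per classifier, then count votes
        votes = P.argmax(axis=2)
        scores = np.stack([(votes == c).sum(axis=0)
                           for c in range(P.shape[2])], axis=1)
    else:
        raise ValueError(f"unknown rule: {rule}")
    return scores.argmax(axis=1)      # predicted class per sample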
3.4 Experimental Investigations
We considered information retrieval measures for the four datasets when the linear SVM was applied as a classifier on the set of features selected by the ensemble wrapper approach, using 10-fold cross-validation. The precision, recall, F-measure and ROC area of the feature subsets selected by the different combinations are given in Tables (3.5)-(3.8), where the best results are shown in bold.
Two approaches for wrapper evaluation are presented, namely the same-type approach and the mixed-type approach. Results for the first approach are investigated in Section (3.4.1) and those for the second approach in Section (3.4.2).
3.4.1 Results and discussion for the same-type approach
Analysis of features selected by DT family combination
Looking at the results produced by the DT family, we notice that the J48 classifier achieves in most cases the best individual results for the German, HMEQ and Tunisian datasets, except for the Australian dataset where the individual results produced by SVM were slightly better. The good performance of the wrapper using DT classifiers is explained by the nature of this family, which is well known for its highly accurate performance on financial data (Piramuthu 2004).
Table 3.5: Performance comparison of the new wrapper method and the other feature
selection methods for the Australian dataset.
Precision Recall F-Measure ROC Area
Decision Tree
J48 0.867 0.855 0.855 0.862
RF 0.863 0.851 0.851 0.858
Average 0.782 0.925 0.848 0.863
Product 0.864 0.852 0.853 0.859
Maximum 0.930 0.794 0.856 0.859
Minimum 0.866 0.855 0.855 0.862
Majority Vote 0.782 0.922 0.846 0.865
Support Vector Machine
SVMP 0.921 0.794 0.853 0.855
SVMR 0.930 0.799 0.860 0.862
Average 0.787 0.925 0.850 0.864
Product 0.866 0.855 0.855 0.861
Maximum 0.859 0.848 0.848 0.856
Minimum 0.927 0.794 0.855 0.858
Majority Vote 0.781 0.915 0.848 0.857
Artificial Neural Network
MP 0.860 0.849 0.850 0.856
VP 0.859 0.848 0.848 0.855
Average 0.862 0.851 0.851 0.857
Product 0.783 0.919 0.861 0.860
Maximum 0.862 0.851 0.851 0.857
Minimum 0.862 0.851 0.851 0.857
Majority Vote 0.864 0.853 0.854 0.858
K-Nearest Neighbor
1NN 0.865 0.852 0.852 0.860
5NN 0.859 0.848 0.848 0.855
Average 0.812 0.890 0.849 0.877
Product 0.811 0.866 0.838 0.883
Maximum 0.820 0.880 0.849 0.875
Minimum 0.824 0.823 0.822 0.876
Majority Vote 0.853 0.851 0.851 0.882
A closer look at Tables (3.5)-(3.8) shows that results are much better within the combination process. While some DT algorithms adopt a local search strategy, others are globally optimized algorithms. Combining a set of DT algorithms may therefore avoid some of their drawbacks, and the experimental results demonstrate that combination results are more effective
than individual ones.
Table 3.6: Performance comparison of the new wrapper method and the other feature
selection methods for the German dataset.
Precision Recall F-Measure ROC Area
Decision Tree
J48 0.735 0.750 0.723 0.635
RF 0.686 0.716 0.665 0.570
Average 0.740 0.930 0.824 0.583
Product 0.732 0.933 0.820 0.568
Maximum 0.741 0.930 0.825 0.585
Minimum 0.744 0.929 0.826 0.591
Majority Vote 0.740 0.934 0.826 0.635
Support Vector Machine
SVMP 0.490 0.700 0.576 0.500
SVMR 0.708 0.728 0.709 0.627
Average 0.695 0.722 0.678 0.583
Product 0.682 0.714 0.664 0.568
Maximum 0.697 0.723 0.680 0.585
Minimum 0.702 0.726 0.685 0.591
Majority Vote 0.699 0.724 0.679 0.584
Artificial Neural Network
MP 0.719 0.738 0.717 0.634
VP 0.703 0.726 0.701 0.614
Average 0.769 0.896 0.827 0.634
Product 0.769 0.894 0.825 0.645
Maximum 0.758 0.894 0.820 0.643
Minimum 0.717 0.737 0.712 0.625
Majority Vote 0.764 0.904 0.828 0.625
K-Nearest Neighbor
1NN 0.699 0.724 0.677 0.582
5NN 0.691 0.718 0.688 0.598
Average 0.745 0.917 0.822 0.592
Product 0.739 0.937 0.826 0.601
Maximum 0.749 0.899 0.817 0.597
Minimum 0.745 0.917 0.822 0.592
Majority Vote 0.742 0.914 0.819 0.587
As expected, the combination schemes have approximately similar performance, although the product, minimum and maximum rules tend to give the best precision rates while the average rule and the majority vote rule give the best recall
and ROC area for the DT family.
Analysis of features selected by SVM family combination
Going over the results presented in Tables (3.5)-(3.8), we notice some differences among the individual results of polynomial and radial SVM. For the four datasets, the performance with the radial SVM is slightly better. This result is due to the nature of the two kernels: in general, the polynomial kernel looks for linear characteristics within datasets while the radial kernel identifies both linear and non-linear aspects.
Overall, we notice that the same-type combinations with SVM improve the performance, meaning that the features selected within the combination process are more suitable for the CS task. Tables (3.5)-(3.8) show how the model performance changes with the different combination rules. We notice that the four performance measures increase with the combination. In fact, majority vote, minimum and average rule combinations give significantly higher ROC area and F-measure.
The good performance of the combinations obtained using the KNN family is the result of its natural simplicity. In fact, KNN is a non-parametric classification method that does not assume any parametric distribution of the random variables. Non-parametric models are very flexible, making them usually good classifiers in many situations (Li et al. 2011).
Analysis of features selected by KNN family combination
Tables (3.5)-(3.8) show that KNN classifiers always give better results when the size of the dataset does not exceed 1000 instances, as is the case for the German (1000 instances) and Australian (690 instances) datasets. For these datasets, KNN combinations give higher classification performance than combinations from the other families.
Table 3.7: Performance comparison of the new wrapper method and the other feature
selection methods for the HMEQ dataset.
Precision Recall F-Measure ROC Area
Decision Tree
J48 0.859 0.864 0.844 0.795
RF 0.857 0.860 0.838 0.785
Average 0.867 0.982 0.921 0.793
Product 0.863 0.983 0.918 0.787
Maximum 0.914 0.899 0.906 0.809
Minimum 0.855 0.852 0.853 0.806
Majority Vote 0.868 0.979 0.920 0.797
Support Vector Machine
SVMP 0.633 0.796 0.705 0.555
SVMR 0.843 0.804 0.724 0.619
Average 0.827 0.977 0.896 0.701
Product 0.809 0.815 0.759 0.662
Maximum 0.816 0.822 0.774 0.683
Minimum 0.800 0.819 0.778 0.691
Majority Vote 0.824 0.987 0.898 0.682
Artificial Neural Network
MP 0.693 0.638 0.664 0.677
VP 0.81 0.827 0.789 0.607
Average 0.868 0.871 0.869 0.877
Product 0.835 0.977 0.902 0.602
Maximum 0.811 0.829 0.793 0.734
Minimum 0.838 0.974 0.901 0.732
Majority Vote 0.911 0.930 0.920 0.879
K-Nearest Neighbor
1NN 0.852 0.837 0.791 0.803
5NN 0.837 0.824 0.766 0.812
Average 0.821 0.998 0.901 0.891
Product 0.850 0.825 0.766 0.881
Maximum 0.889 0.997 0.940 0.889
Minimum 0.821 0.996 0.900 0.842
Majority Vote 0.832 0.996 0.907 0.844
Analysis of features selected by ANN family combination
From Tables (3.5)-(3.8) we notice that, as with the combinations from the other classifier families, the final result is improved using ANN combinations. ANN classifiers are made of neurons and are excellent at extracting information from a dataset. During
the training process an ANN can be used to map an input to a desired output, classify data or learn patterns. Hence, ANN can also be used to perform feature selection indirectly (Ledesma et al. 2008).
Table 3.8: Performance comparison of the new wrapper method and the other feature
selection methods for the Tunisian dataset.
Precision Recall F-Measure ROC Area
Decision Tree
J48 0.722 0.850 0.781 0.597
RF 0.797 0.846 0.801 0.695
Average 0.858 0.985 0.917 0.652
Product 0.859 0.985 0.918 0.655
Maximum 0.866 0.985 0.921 0.653
Minimum 0.861 0.986 0.919 0.644
Majority Vote 0.858 0.987 0.917 0.649
Support Vector Machine
SVMP 0.722 0.850 0.781 0.500
SVMR 0.797 0.837 0.805 0.566
Average 0.861 0.962 0.909 0.666
Product 0.710 0.842 0.770 0.500
Maximum 0.860 0.968 0.911 0.563
Minimum 0.798 0.839 0.803 0.661
Majority Vote 0.859 0.968 0.910 0.656
Artificial Neural Network
MP 0.802 0.843 0.800 0.577
VP 0.826 0.857 0.816 0.562
Average 0.856 0.979 0.913 0.677
Product 0.865 0.984 0.921 0.659
Maximum 0.867 0.975 0.918 0.668
Minimum 0.888 0.855 0.871 0.731
Majority Vote 0.866 0.981 0.920 0.657
K-Nearest Neighbor
1NN 0.785 0.843 0.794 0.680
5NN 0.792 0.844 0.800 0.685
Average 0.855 0.977 0.912 0.775
Product 0.852 0.993 0.917 0.756
Maximum 0.864 0.925 0.893 0.746
Minimum 0.863 0.932 0.896 0.704
Majority Vote 0.866 0.985 0.921 0.753
Same-type approach summary
In summary, results for the four datasets lead to two clear conclusions. First, performance measures improve when using DT family classifiers, and results are especially good for the HMEQ and Tunisian datasets. Second, when the dataset size does not exceed 1000 instances, combinations with the KNN family give the best results.
A possible explanation for the first observation is that DT classifiers are sometimes considered embedded methods. These kinds of methods essentially perform feature selection within the learning process, which means that they are able to select relevant features on their own, using their own search strategy and their own splitting mechanism. In other words, DT classifiers select relevant features at two different stages: in the first stage, features are selected by the individual classifiers, and in the second, by the combination of DT classifiers. In this way, only features selected at both stages form the final feature subset, which is very likely to include features of high relevance.
The second observation concerns the combinations with the KNN family. In fact, the main advantage of KNN is that it can learn from a small set of examples, which explains the good performance with the Australian and German datasets. On the other hand, its major disadvantage is being computationally intensive for large datasets, since it uses all the training data as examples (Thomas et al. 2002).
In this section we investigated the effect of classifier nature on the final feature selection results. For a global picture, we want to know whether the observed results are due only to the classifiers' nature or to their interaction with the aggregation methods. Hence, we use a two-way ANOVA to analyze whether the mean values of F-measure change significantly with the levels of the two independent variables, classifier and aggregation method. The first factor is the classifier, with levels DT, SVM, ANN and KNN. The second factor is the aggregation method, with levels {Average, Product, Maximum, Minimum, Majority Vote}. To test the interaction we use the hypotheses presented below.
For the first factor, i.e. the classifier, H0 and H1 are given by:
$$H_0:\ \mu^{1}_{\mathrm{DT}} = \mu^{1}_{\mathrm{SVM}} = \mu^{1}_{\mathrm{ANN}} = \mu^{1}_{\mathrm{KNN}}$$
(performances of the classifiers are equal), versus
$$H_1:\ \exists\, t,\ \mu^{1}_{t} \neq \mu^{1}_{i},\qquad i, t \in \{\mathrm{DT}, \mathrm{SVM}, \mathrm{KNN}, \mathrm{ANN}\},\ i \neq t$$
(at least one classifier's mean performance differs from the others).
H0 and H1 for factor 2, i.e. the aggregation method, are
$$H_0:\ \mu^{2}_{\mathrm{Aver}} = \mu^{2}_{\mathrm{Prod}} = \mu^{2}_{\mathrm{Max}} = \mu^{2}_{\mathrm{Min}} = \mu^{2}_{\mathrm{MajV}}$$
(performances of the aggregation methods are equal), versus
$$H_1:\ \exists\, t,\ \mu^{2}_{t} \neq \mu^{2}_{i},\qquad i, t \in \{\mathrm{Aver}, \mathrm{Prod}, \mathrm{Max}, \mathrm{Min}, \mathrm{MajV}\},\ i \neq t$$
(at least one aggregation method's mean performance differs from the others).
For the interaction between the two factors:
H0: there is no interaction between the two factors, versus
H1: there is an interaction between the two factors.
To set up the two-way ANOVA we use the data in Table (3.9); the results obtained from this table are summarized in Table (3.10).
The results of the two-way ANOVA in Table (3.10) show that we do not have a significant interaction between the two factors, which means that the effect of one factor on the F-measure is the same at every fixed level of the other factor. We notice from Table (3.10) that there was
Table 3.9: Summary of F-measures for all aggregation methods with the four classification methods in wrapper framework (within each classifier block, the rows correspond to the Australian, German, HMEQ and Tunisian datasets).
              Average  Product  Maximum  Minimum  Majority Vote
DT  Australian 0,848   0,853    0,856    0,855    0,846
    German     0,824   0,820    0,825    0,826    0,826
    HMEQ       0,921   0,918    0,906    0,853    0,920
    Tunisian   0,917   0,918    0,921    0,919    0,917
SVM Australian 0,850   0,855    0,848    0,855    0,848
    German     0,678   0,664    0,680    0,685    0,679
    HMEQ       0,896   0,759    0,774    0,778    0,898
    Tunisian   0,909   0,770    0,911    0,803    0,910
ANN Australian 0,851   0,861    0,851    0,851    0,854
    German     0,827   0,825    0,820    0,712    0,828
    HMEQ       0,869   0,902    0,793    0,901    0,920
    Tunisian   0,913   0,921    0,918    0,871    0,920
KNN Australian 0,849   0,838    0,849    0,822    0,851
    German     0,822   0,826    0,817    0,822    0,819
    HMEQ       0,901   0,766    0,940    0,900    0,907
    Tunisian   0,912   0,917    0,893    0,896    0,921
Table 3.10: Tests of between-subjects effects in wrapper framework.
Source                            Type III Sum of Squares   DF   Mean Square   F           Sig. (p-value)
Corrected Model                   0,090a                    19   0,005         1,154       0,326
Intercept                         57,826                    1    57,826        14028,038   0,000
Aggregation Method                0,013                     4    0,003         0,768       0,550
Classifier                        0,063                     3    0,021         5,081       0,003
Aggregation Method * Classifier   0,015                     12   0,001         0,301       0,987
Error                             0,247                     60   0,004
Total                             58,163                    80
Corrected Total                   0,338                     79
Dependent Variable: F-measure
no statistically significant difference in mean F-measure between the different aggregation methods (p-value = 0,550), but there were statistically significant differences between the different classifiers (p-value = 0,003).
When ANOVA gives a significant result for the classification methods, this indicates that the results of at least one classifier differ from those of the other classifiers. Yet, the ANOVA test does not indicate which classifier's results caused the rejection of H0. In order to analyze the pattern of differences between means, we follow the ANOVA with pairwise comparisons. Results of pairwise comparisons for classifiers are given in Table (3.11).
Table 3.11: Multiple comparisons table for classifier levels in wrapper framework.
Classifier (I) Classifier (J) Mean difference (I-J) Sig.
ANN DT -0,01405 0,9
KNN -0,003 0,999
SVM 0,05790* 0,03
DT ANN 0,01405 0,9
KNN 0,01105 0,948
SVM 0,07195* 0,004
KNN ANN 0,003 0,999
DT -0,01105 0,948
SVM 0,06090* 0,02
SVM ANN -0,05790* 0,03
DT -0,07195* 0,004
KNN -0,06090* 0,02
From Table (3.11) we notice that there is a statistically significant difference between the results obtained by SVM and those of the other classifiers.
3.4.2 Results and discussion for the mixed-type approach
Because of the large number of combinations, the mixed-type approach is examined using the Australian dataset and results are summarized in Tables (3.12) and (3.13). Table (3.12) presents the 2-classifier mixed combinations while Table (3.13) presents the 3-classifier mixed combinations. In this way, we investigate the effect of classifier nature on feature selection, and we can see whether the number of classifiers within the combination framework also affects the selection. Tables (3.12) and (3.13) give:
- The measured F-measure for the features generated by the different combinations.
- The mean number of evaluated subsets, i.e. the first number between parentheses in each table, and the associated mean number of selected features, i.e. the second number between parentheses.
From Tables (3.12) and (3.13) we notice that the combinations with few classifiers selected the features that achieved the best F-measure with a smaller number of evaluated subsets. More specifically, the 2-classifier combinations produce an F-measure in the range [0.860, 0.874] with a number of evaluated subsets that does not
Table 3.12: Total number of evaluated subsets and selected features by 2 classifiers
mixed-type combinations and associated F-measure rates for the Australian Dataset.
Lowest F-measure (between 0.847 and 0.855): J48+1NN (79,3), RF+SVMR (82,2), RF+MP (111,6), RF+VP (104,5), RF+5NN (96,4), SVMP+5NN (116,6), SVMR+1NN (79,3), SVMR+5NN (127,9), MP+1NN (107,6), VP+1NN (127,6).
Intermediate F-measure (between 0.856 and 0.859): J48+SVMP (106,4), J48+SVMR (106,4), J48+VP (120,7), J48+5NN (105,4), SVMP+MP (112,5), SVMR+MP (112,7), SVMR+VP (116,7).
Highest F-measure (between 0.860 and 0.874): J48+MP (116,7), RF+SVMP (79,3), RF+1NN (96,4), SVMP+VP (88,4), SVMP+1NN (79,3), MP+5NN (121,7), VP+5NN (117,7).
exceed 121 evaluations. On the other hand, the 3-classifier combinations give the same rates but with a much larger number of evaluated subsets.
Table (3.12) shows that combining DT classifiers with ANN or KNN classifiers generally yields the lowest F-measure (RF+MP, RF+VP, RF+5NN, J48+1NN), and this is due to the difference in nature between these three types of classifiers. ANN classifiers identify relationships between features based on the available prior knowledge about the actual features in the dataset, whereas KNN classifiers select the most relevant features as those with the closest distance to a set of specified features called neighbors; for this family the resulting features depend on the number of chosen neighbors. The nature of DT classifiers is very dissimilar to that of ANN and KNN: they use a statistical measure to evaluate the relevance of features.
Results of Section (3.4.1) show that when the same-type arrangement approach is used, DT classifiers give nearly the best results. Table (3.13) shows that this is not the case when classifiers from this family are combined with classifiers from other families.
Table (3.13) shows that the majority of combinations with SVM classifiers selected a set of features achieving the best F-measure rates, essentially when SVM classifiers were combined with KNN classifiers. The fact that these combinations lead to high F-measure, despite considering classifiers
Table 3.13: Total number of evaluated subsets and selected features by 3 classifiers
mixed-type combinations and associated F-measure rates for the Australian Dataset.
Lowest F-measure (between 0.847 and 0.855): J48+RF+VP (136,7), J48+RF+5NN (131,9), J48+1NN+5NN (114,7), J48+SVMR+MP (146,7), J48+SVMR+1NN (75,2), J48+SVMR+5NN (141,10), J48+MP+5NN (122,7), J48+VP+5NN (112,7), RF+1NN+5NN (79,3), RF+SVMP+5NN (94,5), RF+SVMR+1NN (88,3), RF+SVMR+5NN (117,8), RF+MP+1NN (130,7), RF+MP+5NN (133,10), RF+VP+1NN (130,7), RF+VP+5NN (123,10).
Intermediate F-measure (between 0.856 and 0.859): J48+RF+SVMP (82,2), J48+RF+SVMR (75,2), J48+RF+MP (144,10), J48+MP+VP (126,7), J48+SVMP+SVMR (75,2), J48+SVMP+MP (126,7), J48+SVMP+VP (139,6), J48+SVMP+1NN (118,6), J48+SVMP+5NN (139,10), J48+SVMR+VP (189,7), J48+MP+1NN (165,10), RF+SVMP+SVMR (82,2), MP+SVMP+SVMR (139,6), VP+SVMP+SVMR (120,5), 1NN+SVMP+SVMR (82,2), 5NN+SVMP+SVMR (75,2), SVMP+MP+VP (109,5), SVMR+MP+VP (109,5), SVMR+VP+1NN (122,8).
Highest F-measure (between 0.860 and 0.874): J48+RF+1NN (79,3), J48+VP+1NN (139,7), RF+MP+VP (111,6), RF+SVMP+MP (132,6), RF+SVMP+VP (120,8), RF+SVMP+1NN (116,7), RF+SVMR+MP (126,9), RF+SVMR+VP (135,7), SVMP+1NN+5NN (108,7), SVMP+MP+1NN (132,8), SVMR+MP+5NN (132,10), SVMP+VP+1NN (118,10), SVMR+1NN+5NN (108,7), SVMR+MP+1NN (132,9), SVMR+MP+5NN (149,9), SVMR+VP+5NN (149,9), SVMP+VP+5NN (153,11).
from different families, could be due to particular similarities between these two families. KNN classifiers use a distance metric to decide which features are most relevant to the target variable; SVM classifiers use distance to select the most relevant features by measuring the distance of each feature to the hyperplane that best separates the classes of the target concept.
3.5 Conclusion
In this chapter we developed an ensemble wrapper feature selection approach for a CS application. The proposed approach is composed of three stages. In the first one, we performed a dimensionality reduction using bank experts' knowledge of the features, eliminating redundant features before generating the candidate subsets. In the second stage, a heuristic is used to reduce the search space to fewer than 10 features, which makes the exhaustive search a realistic task. In the final stage, the generated subsets were evaluated using a multi-classifier process involving two arrangement approaches, namely the same-type and mixed-type approaches. From the three stages, we showed that the use of prior information on relevant features induces a significant gain in complexity with improved generalization. We also showed that the number of classifiers and their nature have an important effect on wrapper feature selection results. This chapter and the previous one discussed two different concepts of feature selection, namely filter and wrapper methods; their combination is investigated in the next chapter.
Chapter 4
A Three-Stage Feature Selection Using Quadratic
Programming for Credit Scoring
Contents
4.1 Introduction
4.2 Hybrid Framework
4.3 New Approach for hybrid feature selection
  4.3.1 Stage I: feature-based filtering
  4.3.2 Stage II: fusion using quadratic programming
  4.3.3 Stage III: feature-based wrapping
4.4 Experimental investigations
  4.4.1 Results and discussion
4.5 Conclusion
4.1 Introduction
Chapters (2) and (3) reviewed the two most important methods for feature selection, respectively the filter and the wrapper methods, with proposed modifications for improvement. In general, we cannot show the superiority of one approach over the other, because there are strong mixed arguments in favor of both. This chapter explores a variety of filter and wrapper feature selection
methods to reduce non-relevant features. These two types of selection methods are complementary to each other. A fusion strategy is then proposed to sequentially combine the ranking criteria of multiple filters and a wrapper method. Evaluations are conducted on four credit datasets. This chapter is organized as follows. Section (4.2) describes hybrid feature selection. Section (4.3) proposes a three-stage fusion combining both strategies. Then, our proposed method is compared with some existing selection methods in Section (4.4), and conclusions are drawn in Section (4.5).
4.2 Hybrid Framework
As discussed in Chapter 1, there are two main classes of feature selection methods: the filter and the wrapper. Both approaches have their merits and shortcomings, and the superiority of one approach over the other is not settled. Rather than trying to optimize just one approach, it is better to integrate both in one compact feature selection model. Several merging approaches can be used for feature selection (Wu et al. 2009; Mak and Kung 2008): a fusion of several filters or wrappers, wrappers and filters merged in a parallel way, or a sequenced combination of filters and wrappers. Filter and wrapper methods require different resources and lead to different results. Combining both methods seems a natural choice to benefit from their advantages and avoid their shortcomings. Since the two methods consider two different selection criteria and we have no knowledge about the number of relevant features, a combination of both methods as a hybrid approach is proposed.
4.3 New Approach for hybrid feature selection
In order to improve the significance of the selected features, we propose in this section a three-stage approach combining filter and wrapper methods. In the first stage we use a set of filter-based methods to classify candidate features, based on their relevance level, into three main categories: high, average and poor relevance. Highly relevant features are kept as input to the second stage, average ones are kept as input to the third stage and the
last category is eliminated. In the second stage, an efficient method dealing with both redundancy and relevance is considered. At this stage we minimize the redundancy among the most relevant features while maximizing their relevance to the target class. To find the best trade-off between relevance and redundancy we formulate this problem as the optimization of a quadratic multi-objective function. Once the most relevant features are separated from the redundant ones we move to stage three. The latter takes as input the selected features of Stage II and combines them with the average-relevance features of Stage I. Then, a wrapper approach is trained on the resulting features. Figure (4.4) presents a flowchart of the different stages of the proposed approach.
4.3.1 Stage I: feature-based filtering
As discussed in Chapters 1 and 2, there are many ranking criteria for filters: Bayesian accuracy, T-statistics, PCC, Relief, entropy and many others. Unfortunately, choosing the best one is a difficult task that depends on many factors, such as the amount of available data, the data distribution and the types of descriptive features, among others. Rather than optimizing one single filter, we combine the results of multiple filters in the pre-selection process. Many methods can be adopted to find the best combination. In this work, we fuse the individual filters' outputs, i.e. the final ranking of each filter, assuming that each filter has the same effect on the final decision.
For the pre-filtering stage, and for simplicity, the number of filters is fixed at three. Each filter ranks features according to its particular criterion, resulting in three different rankings. Then, the result of each filter is divided into three subsets of identical size according to the features' level of relevance to the target variable. Figure (4.1) shows the different relevance categories. Three groups of features are obtained
for each filter: the highly significant, the average ones and those with poor relevance.
Figure 4.1: A view of feature relevance categories

Once the relevance levels are identified for each filter, the most relevant features produced by the three filters are merged into one single group, yielding a subset with the most relevant features. In a second step, the three groups of features with average relevance are grouped, and redundant features, or those which already appear in the first subset, are removed. The remaining features lack relevance and are not adequate for our study, so they are eliminated. Figure (4.2) illustrates Stage I for an example of 22 features from the
Tunisian dataset.
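As a concrete illustration of Stage I, the following sketch reproduces the three-filter ranking and the split into relevance thirds in R. The thesis uses Matlab's rankfeatures for this step; here the FSelector package stands in for it, and the function and variable names (stage1, class_name) are illustrative only.

# Sketch of Stage I: rank features with three filters, split each ranking
# into thirds (high / average / poor relevance), then merge the groups.
library(FSelector)

stage1 <- function(data, class_name) {
  f <- as.formula(paste(class_name, "~ ."))
  # Three filter criteria, each returning one importance score per feature
  scores <- list(chi2   = chi.squared(f, data),
                 relief = relief(f, data),
                 mi     = information.gain(f, data))
  split_thirds <- function(s) {
    ranked <- rownames(s)[order(-s$attr_importance)]
    k <- ceiling(length(ranked) / 3)
    list(high = ranked[1:k], avg = ranked[(k + 1):(2 * k)])
  }
  parts <- lapply(scores, split_thirds)
  high <- unique(unlist(lapply(parts, `[[`, "high")))
  # average group: drop duplicates and anything already in the high group
  avg <- setdiff(unique(unlist(lapply(parts, `[[`, "avg"))), high)
  list(high = high, average = avg)      # the poor third is discarded
}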
4.3.2 Stage II: fusion using quadratic programming
Filter algorithms frequently do not consider interactions between features. Moreover, the ranking lists resulting from Stage I may contain redundant information. Therefore, one common improvement direction for filter algorithms is to consider dependencies among variables, and an approach based on MI is proposed. This approach studies the redundancy among features starting from the highly ranked features selected in Stage I. The problem of feature redundancy is addressed by statistical machine learning methods as well as mathematical ones. Mathematical programming based approaches have proven successful in terms of classification accuracy for a wide range of applications. The proposed mathematical method is a quadratic programming formulation. The quadratic optimization process uses an objective function with quadratic and linear terms. Here, the quadratic term denotes the similarity among each pair of variables, whereas the linear term captures the correlation between each feature and the target variable.
Figure 4.2: The proposed process of merging features selected by three filters in the fusion method

Let us assume that the classifier learning problem involves N training samples and d variables. A quadratic programming problem minimizes a multivariate quadratic function subject to linear constraints as follows (Rodriguez et al. 2010):

\[ \min_{w} f(w) = \frac{1}{2} w^{T} Q w - Z^{T} w, \quad \text{subject to } w_i \geq 0 \text{ for all } i = 1, \ldots, d \text{ and } \sum_{i=1}^{d} w_i = 1, \tag{4.1} \]
where Z is a d-dimensional vector with non-negative entries giving the coefficients of the linear terms in the objective function, measuring how each feature is correlated with the target class (relevance); Q is a (d × d) symmetric positive semi-definite matrix whose entries are the coefficients of the quadratic terms, representing the similarity among variables (redundancy); and the weights of the variables are denoted by the d-dimensional column vector w.
Bazaraa et al. (1993) and Rodriguez et al. (2010) showed that a feasible solution exists for this kind of problem and that the constraint region is bounded. When the objective function f(w) is strictly convex for all feasible points, the problem has a unique local minimum which is also the global minimum. The conditions for solving quadratic programs, including the Lagrangian function and the Karush-Kuhn-Tucker conditions, are explained in detail in Bazaraa et al. (1993).
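Problem (4.1) maps directly onto the quadprog package used later in this chapter. The sketch below is a minimal illustration, not the authors' implementation: solve.QP minimizes (1/2) b'Db - d'b subject to A'b >= b0, with the first meq constraints treated as equalities, which matches (4.1).

# Solving (4.1) with quadprog (illustrative sketch)
library(quadprog)

solve_stage2 <- function(Q, Z) {
  d <- length(Z)
  # solve.QP requires a strictly positive definite Dmat; Q is only
  # guaranteed positive semi-definite, so a tiny ridge is added.
  Dmat <- Q + 1e-8 * diag(d)
  Amat <- cbind(rep(1, d), diag(d))  # first column: sum(w) = 1; rest: w_i >= 0
  bvec <- c(1, rep(0, d))
  sol <- solve.QP(Dmat, dvec = Z, Amat = Amat, bvec = bvec, meq = 1)
  sol$solution                       # the feature weight vector w
}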
Depending on the learning problem, the two terms can have different relative importance in the objective function. Therefore, a scalar parameter α is introduced as follows:

\[ \min f(w) = \frac{1}{2} (1 - \alpha) w^{T} Q w - \alpha Z^{T} w, \tag{4.2} \]

where w, Q and Z are defined as before and α ∈ [0, 1]. If α = 1, only relevance is considered; on the opposite, if α = 0, only independence between features is considered, that is, the features with higher weights are those which have lower similarity coefficients with the remaining features. Every dataset has its own best choice of the scalar α. However, a reasonable choice of α should balance the relation between relevance and redundancy, so a good estimate of α is needed. We know that the relevance and redundancy terms in Equation (4.2) are balanced when \( (1 - \alpha)\bar{Q} = \alpha\bar{Z} \), where \( \bar{Q} \) is the estimate of the mean value of the elements of the matrix Q and \( \bar{Z} \) is the estimate of the mean value of the elements of the vector Z. Hence we propose a practical estimate of α as follows:

\[ \hat{\alpha} = \frac{\bar{Q}}{\bar{Q} + \bar{Z}}. \tag{4.3} \]

After solving the quadratic programming optimization, the features with higher weights are considered to be better variables for subsequent classifier training. Figure (4.3) illustrates Stage II.
Figure 4.3: Redundancy analysis for highly ranked features
At this stage, given its efficiency, MI is chosen as the similarity measure. Hence, the quadratic term is \( q_{jj'} = MI(X_j, X_{j'}) \) and the linear one is \( z_j = MI(X_j, Y) \). Using the quadratic approach based on MI provides a new ranking of the highly ranked features selected in Stage I. This new ranking simultaneously takes into account the MI between all pairs of features and the relevance of each feature to the target.
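The pieces of Stage II can be assembled as in the sketch below, which builds Q and Z from MI, estimates α as in (4.3) and solves the rebalanced problem (4.2) with solve_stage2 from above. The thesis computes MI with Matlab's mutualinfo; the infotheo package is an R stand-in, and rank_stage2 is an illustrative name.

# Stage II sketch: MI-based Q and Z, alpha estimate (4.3), problem (4.2)
library(infotheo)

rank_stage2 <- function(X, y) {
  Xd <- discretize(X)                    # MI estimation needs discrete data
  Q <- as.matrix(mutinformation(Xd))     # q_jj' = MI(X_j, X_j') (redundancy)
  Z <- sapply(Xd, function(col) mutinformation(col, y))  # z_j = MI(X_j, Y)
  alpha <- mean(Q) / (mean(Q) + mean(Z)) # hat(alpha) = Qbar / (Qbar + Zbar)
  w <- solve_stage2((1 - alpha) * Q, alpha * Z)   # objective of (4.2)
  names(Z)[order(w, decreasing = TRUE)]  # features ranked by weight
}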
4.3.3 Stage III: feature-based wrapping
In Stages I and II we selected top-ranked features and removed redundant ones based
on MI and quadratic programming. Many studies, such as the one conducted by Peng et al. (2005), showed that simply combining highly discriminant features often does not give a feature set that yields the best classification performance. The
reason behind this is that the feature set is not an inclusive representation of the
characteristics of the target feature. Because features are selected according to their
discriminative powers, they are not maximally representative of the original space
covered by the entire dataset. The feature set may represent one or several dominant
characteristics of the target class, but these could still be small regions of the relevant
space covering the target class. Thus, the generalization ability of the feature set
could be limited.
Based on these facts, we propose to combine features selected in Stage II and
those having average relevance in Stage I. This combination aims to expand the
space covered by the feature set. The resulting feature set is the input to a wrapper
algorithm. In wrapper-based methods, feature selection is driven by the learning method, and the relevance of the features is evaluated by the accuracy of the classification method. Generally, we obtain a set with a very small number of non-redundant features giving high accuracy, since the characteristics of the features match very well those of the learning method.
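For concreteness, the sketch below shows a greedy forward wrapper of the kind described in Chapter 1; it is a simplification of the forward, backward and bi-directional searches actually used, and eval_acc is any user-supplied function returning, for instance, the cross-validated accuracy of the chosen classifier.

# Greedy forward wrapper (illustrative sketch)
wrapper_forward <- function(data, class_name, eval_acc) {
  selected <- character(0)
  candidates <- setdiff(names(data), class_name)
  best_acc <- 0
  while (length(candidates) > 0) {
    accs <- sapply(candidates, function(f)
      eval_acc(data[, c(selected, f, class_name), drop = FALSE], class_name))
    if (max(accs) <= best_acc) break  # stop when no candidate improves accuracy
    best_acc <- max(accs)
    pick <- candidates[which.max(accs)]
    selected <- c(selected, pick)
    candidates <- setdiff(candidates, pick)
  }
  selected
}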
Figure 4.4: Flowchart of the proposed three-stage feature selection fusion
4.4 Experimental investigations
In Stage I, we used three filters to rank features according to their level of significance. Matlab, used in this step, provides a whole set of tools for selecting diverse and discriminating features. We used the rankfeatures function as a simple way to rank features using an independent evaluation criterion for binary classification. To assess the significance of every feature in separating two labeled groups, many criteria could be used, such as χ², Relief, MI, relative entropy and others. For simplicity, the first three criteria are selected for our task. The number of features selected in Stage I is given in Table (4.1).
In Stage II, redundancy is reduced using the quadratic optimization with MI as a similarity measure. This stage is implemented in the R software using the quadprog package (Goldfarb and Idnani 1983), where results are obtained with α = 0.501 for the Australian dataset, α = 0.511 for the German dataset, α = 0.509 for the HMEQ dataset and α = 0.514 for the Tunisian dataset. This means that the best value of α is obtained when there is an almost equal tradeoff between relevance and redundancy. The MI for all features is measured using the function mutualinfo in Matlab. Stage III takes as input the features selected in Stage II and those classified as average-significance features. Then, a bi-directional wrapper is used.
Table 4.1: Number of remaining features after Stage I

            High  Average  Poor  Retained
Australian   7    3        4     10
German       7    9        4     16
HMEQ         5    5        2     10
Tunisian    10    9        2     19
The following filter and wrapper methods have been considered in the described experiments for the comparisons:

- Maximal Relevance (MaxRel) feature selection selects the features that have the highest relevance to the target class (Peng et al. 2005).

- The minimal-Redundancy-Maximum-Relevance (mRMR) algorithm chooses a subset of features with both minimum redundancy and maximum relevance. The mRMR algorithm selects features greedily, minimizing their redundancy with the features chosen in previous steps and maximizing their relevance to the class (Ding and Peng 2005); see the sketch after this list.

- Relief, χ² and MI: details about these filters are given in Chapter 1.

- Kullback-Leibler: this distance is used as a ranking criterion; features with the maximum Kullback-Leibler distance are selected as the most significant features.

- Forward, backward and bi-directional wrappers: details about these algorithms are given in Chapter 1.
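The following sketch makes the greedy mRMR rule concrete, reusing the MI-based Q (pairwise redundancy) and Z (relevance) defined in Stage II; it is an illustration of the rule described above, not the original implementation.

# Greedy mRMR sketch: relevance minus mean redundancy with selected features
mrmr <- function(Q, Z, k) {
  selected <- which.max(Z)             # start with the most relevant feature
  while (length(selected) < k) {
    rest <- setdiff(seq_along(Z), selected)
    score <- Z[rest] - rowMeans(Q[rest, selected, drop = FALSE])
    selected <- c(selected, rest[which.max(score)])
  }
  selected                             # indices of the k chosen features
}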
4.4.1 Results and discussion
Results for the four credit datasets using the previously quoted feature selection methods are summarized in Tables (4.3)-(4.6). Classification results represent the performance of each feature selection method for four different classification methods: DT, SVM, ANN and KNN, where the best results are shown in bold.
Performance of filters and wrappers
Tables (4.3)-(4.6) show the performances achieved by ANN, KNN, SVM and DT using six filters and three wrappers. Both filters and wrappers perform well as feature selectors for the scoring task. They may not always give the best set of features for the classification algorithm, but in most cases they do. There is obviously a strong similarity in the feature sets selected by the different approaches. A more detailed picture of the achieved results shows that the precision of wrappers is better than that of some of the studied filters. These results are confirmed by the AUC rates on the datasets, proving the superiority of wrappers in terms of precision. In some cases using wrappers is advantageous since they are able to achieve the same performance as filters with a smaller subset. In other cases, filters do a better job with significantly lower complexity than wrappers, even though they use limited information.
Performance of the New Approach
We notice from Tables (4.3)-(4.6) that in most cases the new approach achieves the best AUC. The higher the AUC, the better the discriminating capacity of the classifier. This means that the feature set chosen by the new approach provides the best combination of features, since it improves the capability of a credit model to correctly identify whether an applicant will pay back a loan.
Table 4.2: Summary of the best performance results achieved by the set of feature selection methods for the four datasets within the hybrid framework.

Rows: MaxRel Features, mRMR Features, χ², Kullback-Leibler, Relief, MI, Wrapper Bi-Directional, Wrapper Forward, Wrapper Backward, Three-Stage Approach. Columns: DT, SVM, ANN, KNN. Each cell mark records a best result; the mark's symbol denotes the metric (Precision, Recall, F-measure, ROC Area) and its color denotes the dataset (Australian, German, HMEQ, Tunisian).
From Table (4.2) we see that for the ANN and KNN classifiers the three-stage approach always achieves the highest area under the ROC curve. This is not the case for SVM, where the proposed approach achieves the highest AUC three times, and for DT, where it achieves the highest AUC twice. We also notice from Table (4.2) that, overall, our proposed approach achieves the best recall 15 times, the best ROC area 13 times, the best F-measure 14 times and the highest precision 11 times.
Consistent with the theoretical analysis for feature selection, the fusion approach
usually outperforms single wrappers or filters.
Table 4.3: Classification results for the three stage feature selection for the Australian
dataset.
Precision Recall F-Measure ROC Area
Decision Tree
MaxRel Features 0.603 0.672 0.636 0.812
mRMR Features 0.557 0.589 0.573 0.797
χ² 0.603 0.673 0.640 0.916
Kullback-leibler 0.721 0.661 0.690 0.819
Relief 0.586 0.560 0.546 0.799
MI 0.601 0.600 0.600 0.837
Wrapper Bi-Directional 0.739 0.771 0.750 0.798
Wrapper Forward 0.737 0.775 0.755 0.788
Wrapper Backwards 0.749 0.769 0.749 0.796
Three-Stage Approach 0.857 0.889 0.873 0.797
Support Vector Machine
MaxRel Features 0.839 0.871 0.854 0.792
mRMR Features 0.902 0.852 0.876 0.813
χ² 0.629 0.670 0.655 0.818
Kullback-leibler 0.931 0.861 0.894 0.820
Relief 0.795 0.898 0.843 0.803
MI 0.930 0.870 0.900 0.817
Wrapper Bi-Directional 0.912 0.845 0.876 0.807
Wrapper Forward 0.919 0.840 0.879 0.810
Wrapper Backwards 0.915 0.843 0.878 0.808
Three-Stage Approach 0.900 0.901 0.880 0.823
Artificial Neural Network
MaxRel Features 0.892 0.917 0.904 0.855
mRMR Features 0.931 0.880 0.905 0.847
χ² 0.790 0.720 0.700 0.950
Kullback-leibler 0.931 0.870 0.900 0.826
Relief 0.685 0.626 0.605 0.845
MI 0.729 0.673 0.702 0.844
Wrapper Bi-Directional 0.896 0.898 0.898 0.843
Wrapper Forward 0.899 0.892 0.897 0.845
Wrapper Backwards 0.900 0.889 0.898 0.842
Three-Stage Approach 0.895 0.944 0.919 0.856
K-Nearest Neighbor
MaxRel Features 0.782 0.843 0.815 0.757
mRMR Features 0.795 0.844 0.819 0.756
χ² 0.687 0.739 0.715 0.755
Kullback-leibler 0.831 0.770 0.800 0.724
Relief 0.674 0.726 0.701 0.750
MI 0.789 0.972 0.871 0.742
Wrapper Bi-Directional 0.893 0.924 0.903 0.754
Wrapper Forward 0.790 0.826 0.809 0.752
Wrapper Backwards 0.797 0.820 0.800 0.750
Three-Stage Approach 0.897 0.944 0.919 0.758
Table 4.4: Classification results for the three stage feature selection for the German
dataset.
Precision Recall F-Measure ROC Area
Decision Tree
MaxRel Features 0.620 0.632 0.620 0.727
mRMR Features 0.496 0.509 0.502 0.562
χ² 0.594 0.532 0.561 0.681
Kullback-leibler 0.624 0.589 0.606 0.710
Relief 0.582 0.555 0.568 0.691
MI 0.616 0.634 0.625 0.725
Wrapper Bi-Directional 0.776 0.751 0.782 0.705
Wrapper Forward 0.773 0.789 0.780 0.700
Wrapper Backwards 0.770 0.787 0.786 0.703
Three-Stage Approach 0.878 0.890 0.882 0.723
Support Vector Machine
MaxRel Features 0.506 0.565 0.583 0.713
mRMR Features 0.527 0.498 0.512 0.693
χ² 0.649 0.543 0.591 0.718
Kullback-leibler 0.667 0.532 0.590 0.708
Relief 0.513 0.532 0.522 0.679
MI 0.606 0.566 0.585 0.710
Wrapper Bi-Directional 0.858 0.693 0.760 0.677
Wrapper Forward 0.853 0.695 0.759 0.673
Wrapper Backwards 0.856 0.690 0.758 0.679
Three-Stage Approach 0.956 0.805 0.859 0.677
Artificial Neural Network
MaxRel Features 0.720 0.634 0.670 0.767
mRMR Features 0.697 0.555 0.616 0.742
χ² 0.729 0.600 0.657 0.749
Kullback-leibler 0.736 0.577 0.645 0.757
Relief 0.656 0.611 0.633 0.749
MI 0.712 0.634 0.672 0.767
Wrapper Bi-Directional 0.681 0.589 0.631 0.727
Wrapper Forward 0.683 0.583 0.635 0.729
Wrapper Backwards 0.680 0.585 0.629 0.720
Three-Stage Approach 0.781 0.589 0.651 0.769
K-Nearest Neighbor
MaxRel Features 0.703 0.630 0.668 0.775
mRMR Features 0.684 0.611 0.645 0.754
χ² 0.718 0.634 0.673 0.750
Kullback-leibler 0.716 0.634 0.670 0.762
Relief 0.617 0.611 0.614 0.759
MI 0.703 0.634 0.666 0.775
Wrapper Bi-Directional 0.651 0.577 0.616 0.761
Wrapper Forward 0.653 0.552 0.612 0.760
Wrapper Backwards 0.650 0.570 0.618 0.756
Three-Stage Approach 0.663 0.677 0.619 0.776
To be more precise, if we look into the results of Table (4.3), we notice that with the Australian dataset the proposed approach achieves the highest recall with the DT, SVM and ANN classifiers and the best F-measure with the DT, ANN and KNN classifiers.
Table 4.5: Classification results for the three stage feature selection for the HMEQ
dataset.
Precision Recall F-Measure ROC Area
Decision Tree
MaxRel Features 0.808 0.558 0.660 0.725
mRMR Features 0.730 0.558 0.632 0.713
χ² 0.782 0.598 0.677 0.757
Kullback-leibler 0.732 0.552 0.629 0.703
Relief 0.590 0.542 0.565 0.753
MI 0.790 0.598 0.680 0.757
Wrapper Bi-Directional 0.750 0.570 0.647 0.762
Wrapper Forward 0.748 0.564 0.627 0.767
Wrapper Backwards 0.742 0.568 0.643 0.760
Three-Stage Approach 0.788 0.794 0.791 0.888
Support Vector Machine
MaxRel Features 0.806 0.607 0.692 0.723
mRMR Features 0.843 0.530 0.650 0.707
χ² 0.813 0.579 0.616 0.752
Kullback-leibler 0.848 0.540 0.659 0.710
Relief 0.563 0.593 0.577 0.715
MI 0.823 0.577 0.678 0.737
Wrapper Bi-Directional 0.730 0.573 0.642 0.745
Wrapper Forward 0.739 0.570 0.643 0.755
Wrapper Backwards 0.735 0.579 0.647 0.759
Three-Stage Approach 0.721 0.846 0.778 0.839
Artificial Neural Network
MaxRel Features 0.691 0.561 0.619 0.720
mRMR Features 0.740 0.556 0.634 0.710
χ² 0.681 0.588 0.631 0.751
Kullback-leibler 0.737 0.558 0.635 0.707
Relief 0.563 0.515 0.537 0.653
MI 0.681 0.588 0.631 0.751
Wrapper Bi-Directional 0.670 0.589 0.626 0.750
Wrapper Forward 0.676 0.588 0.628 0.753
Wrapper Backwards 0.674 0.600 0.636 0.751
Three-Stage Approach 0.614 0.722 0.663 0.792
K-Nearest Neighbor
MaxRel Features 0.688 0.564 0.619 0.730
mRMR Features 0.694 0.564 0.622 0.707
χ² 0.671 0.701 0.685 0.747
Kullback-leibler 0.698 0.563 0.623 0.710
Relief 0.538 0.515 0.529 0.655
MI 0.675 0.599 0.634 0.750
Wrapper Bi-Directional 0.675 0.585 0.626 0.752
Wrapper Forward 0.672 0.588 0.627 0.755
Wrapper Backwards 0.671 0.583 0.623 0.753
Three-Stage Approach 0.674 0.704 0.688 0.774
Table 4.6: Classification results for the three stage feature selection for the Tunisian
dataset.
Precision Recall F-Measure ROC Area
Decision Tree
MaxRel Features 0.611 0.614 0.612 0.700
mRMR Features 0.620 0.623 0.628 0.701
χ² 0.610 0.615 0.613 0.702
Kullback-leibler 0.796 0.800 0.798 0.688
Relief 0.501 0.502 0.501 0.690
MI 0.590 0.620 0.604 0.679
Wrapper Bi-Directional 0.730 0.790 0.759 0.703
Wrapper Forward 0.706 0.722 0.713 0.702
Wrapper Backwards 0.689 0.736 0.702 0.678
Three-Stage Approach 0.862 0.960 0.908 0.716
Support Vector Machine
MaxRel Features 0.707 0.700 0.733 0.725
mRMR Features 0.805 0.787 0.795 0.700
χ² 0.650 0.742 0.693 0.670
Kullback-leibler 0.744 0.853 0.794 0.673
Relief 0.522 0.650 0.581 0.670
MI 0.604 0.647 0.608 0.708
Wrapper Bi-Directional 0.711 0.750 0.710 0.706
Wrapper Forward 0.706 0.722 0.713 0.710
Wrapper Backwards 0.774 0.846 0.786 0.680
Three-Stage Approach 0.852 0.990 0.916 0.752
Artificial Neural Network
MaxRel Features 0.812 0.826 0.818 0.712
mRMR Features 0.820 0.830 0.825 0.715
χ² 0.775 0.790 0.782 0.650
Kullback-leibler 0.805 0.805 0.805 0.699
Relief 0.788 0.870 0.827 0.702
MI 0.816 0.820 0.818 0.687
Wrapper Bi-Directional 0.889 0.856 0.872 0.713
Wrapper Forward 0.809 0.850 0.806 0.690
Wrapper Backwards 0.815 0.852 0.811 0.703
Three-Stage Approach 0.864 0.973 0.915 0.730
K-Nearest Neighbor
MaxRel Features 0.818 0.850 0.827 0.706
mRMR Features 0.795 0.846 0.807 0.710
χ² 0.766 0.789 0.777 0.623
Kullback-leibler 0.754 0.800 0.776 0.641
Relief 0.700 0.802 0.747 0.690
MI 0.722 0.850 0.781 0.650
Wrapper Bi-Directional 0.818 0.854 0.813 0.700
Wrapper Forward 0.789 0.840 0.800 0.686
Wrapper Backwards 0.800 0.848 0.802 0.680
Three-Stage Approach 0.859 0.981 0.916 0.723
Table (4.4) shows that for the German dataset the new approach achieves the highest recall with the DT, SVM and KNN classifiers, the best precision with DT, SVM and ANN, and the highest F-measure with DT and SVM, and that it produces the best AUC for ANN and KNN, which means that the features selected by this approach allow finding the best middle ground between specificity and sensitivity.
We notice from Tables (4.5)-(4.6) that the new approach achieves the best recall, F-measure and ROC area with the DT and SVM classifiers on both the HMEQ and Tunisian datasets. Table (4.5) also shows that for the HMEQ dataset the new approach achieves the best recall, F-measure and AUC with all classifiers.
A two-way ANOVA is performed on the F-measure results in order to assess the differences between the feature selection methods and between the classification methods. The first factor is the feature selection method, with levels {Relief, MI, MaxRel, mRMR, χ², Kullback, Bi-Directional, Forward, Backward, Three-Stage}. The second factor is the classification method, with levels {DT, SVM, ANN, KNN}.
Several hypotheses are jointly tested in the two-way ANOVA. The null hypothesis H_0 and the alternative hypothesis H_1 for the first factor, the feature selection method, are

\[ H_0:\ \mu^{1}_{\text{Relief}} = \mu^{1}_{\text{MI}} = \mu^{1}_{\text{MaxRel}} = \mu^{1}_{\text{mRMR}} = \mu^{1}_{\chi^2} = \mu^{1}_{\text{Kullback}} = \mu^{1}_{\text{BiDirectional}} = \mu^{1}_{\text{Forward}} = \mu^{1}_{\text{Backward}} = \mu^{1}_{\text{ThreeStage}} \]

(the performances of the selection methods are equal), versus

\[ H_1:\ \exists\, t:\ \mu^{1}_{t} \neq \mu^{1}_{i},\quad i, t \in \{\text{Relief}, \text{MI}, \chi^2, \text{MaxRel}, \text{mRMR}, \text{Kullback}, \text{BiDirectional}, \text{Forward}, \text{Backward}, \text{ThreeStage}\},\ i \neq t \]

(at least one feature selection method has a mean performance different from the others).
For the second factor, i.e. the classifier, H_0 and H_1 are given by

\[ H_0:\ \mu^{2}_{DT} = \mu^{2}_{SVM} = \mu^{2}_{ANN} = \mu^{2}_{KNN} \]

(the performances of the classifiers are equal), versus

\[ H_1:\ \exists\, t:\ \mu^{2}_{t} \neq \mu^{2}_{i},\quad i, t \in \{DT, SVM, KNN, ANN\},\ i \neq t \]

(at least one classifier has a mean performance different from the others).
For the interaction between the two factors:

H_0: there is no interaction between the two factors,

versus

H_1: there is an interaction between the two factors.
To set up the two-way ANOVA we use the data in Table (4.8); the results obtained from this analysis are summarized in Table (4.7).
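For reference, the analysis can be reproduced in R along the lines of the sketch below; the data frame layout and column names are assumptions (the F-measure values would come from Table (4.8)), and the placeholder response is random.

# Two-way ANOVA on F-measure (illustrative sketch)
results <- expand.grid(
  method     = c("Chi2", "Relief", "MI", "MaxRel", "mRMR", "Kullback",
                 "BiDirectional", "Forward", "Backward", "ThreeStage"),
  classifier = c("DT", "SVM", "ANN", "KNN"),
  dataset    = c("Australian", "German", "HMEQ", "Tunisian")
)
results$fmeasure <- runif(nrow(results))  # placeholder; use Table (4.8) values

fit <- aov(fmeasure ~ method * classifier, data = results)
summary(fit)    # main effects and interaction, cf. Table (4.7)
TukeyHSD(fit)   # pairwise comparisons, cf. Tables (4.9) and (4.10)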
Table 4.7: Tests of between-subjects effects in hybrid framework.

Source                          Type III Sum of Squares   DF    Mean Square   F          Sig. (p-value)
Corrected Model                 0.646                     39    0.017         1.518      0.045
Intercept                       81.140                    1     81.140        7432.361   0.000
Selection Method                0.418                     9     0.046         4.253      8.37E-05
Classifier                      0.095                     3     0.032         2.913      0.037
Selection Method * Classifier   0.133                     27    0.005         0.451      0.991
Error                           1.310                     120   0.011
Total                           83.096                    160
Corrected Total                 1.956                     159

Dependent variable: F-measure
The items of primary interest in this table are the effects listed under the "Source" column and the values under the "Sig." column. As in the previous hypothesis test, if the value of "Sig." is less than 0.05, the level set by the experimenter, then that effect is significant. From Table (4.7) we notice that there is no statistically significant interaction between the selection method factor and the classifier factor, but there are statistically significant differences between classifier levels and between selection method levels, where the p-value is less than 0.05 for both factors. A more detailed picture is given in Tables (4.9) and (4.10).
Table 4.8: Summary of F-measures for all feature selection methods with the four classification methods in hybrid framework (rows correspond to the Australian, German, HMEQ and Tunisian datasets).

DT
Dataset      χ²     Relief  MI     MaxRel  mRMR   Kullback  Directional  Forward  Backward  Three Stage
Australian   0.640  0.546   0.600  0.636   0.573  0.690     0.750        0.755    0.749     0.873
German       0.561  0.568   0.626  0.620   0.502  0.606     0.782        0.780    0.786     0.882
HMEQ         0.677  0.565   0.680  0.660   0.632  0.629     0.647        0.627    0.643     0.791
Tunisian     0.613  0.501   0.604  0.612   0.628  0.698     0.759        0.713    0.702     0.908

SVM
Australian   0.655  0.843   0.900  0.854   0.876  0.894     0.876        0.879    0.878     0.880
German       0.591  0.522   0.585  0.583   0.512  0.590     0.760        0.759    0.758     0.859
HMEQ         0.616  0.577   0.678  0.692   0.650  0.659     0.642        0.643    0.647     0.778
Tunisian     0.693  0.581   0.608  0.733   0.795  0.794     0.710        0.713    0.786     0.916

ANN
Australian   0.700  0.605   0.702  0.904   0.905  0.900     0.898        0.897    0.898     0.919
German       0.657  0.633   0.672  0.670   0.616  0.645     0.631        0.635    0.629     0.651
HMEQ         0.631  0.537   0.631  0.619   0.634  0.635     0.626        0.628    0.636     0.663
Tunisian     0.782  0.827   0.818  0.818   0.825  0.805     0.872        0.806    0.811     0.915

KNN
Australian   0.715  0.701   0.871  0.815   0.819  0.800     0.903        0.809    0.800     0.919
German       0.673  0.614   0.666  0.668   0.645  0.670     0.616        0.612    0.618     0.619
HMEQ         0.685  0.529   0.634  0.619   0.622  0.623     0.626        0.627    0.623     0.688
Tunisian     0.777  0.747   0.781  0.827   0.807  0.776     0.813        0.800    0.802     0.916
We notice from Table (4.9) that there is a statistically significant difference between the results obtained from (1) the three-stage approach and MI, where the p-value = 0.017, (2) the three-stage approach and mRMR, where the p-value = 0.015, (3) the three-stage approach and χ², where the p-value = 0.002, and (4) the three-stage approach and Relief, where the p-value < 0.05.
Table 4.9: Multiple comparisons table for feature selection methods in hybrid framework. Each entry gives the mean difference (I − J) with the significance level in parentheses; * marks differences significant at the 0.05 level.

Backward vs: Directional -0.00906 (1.000), χ² 0.06875 (0.695), Forward 0.00519 (1.000), Kullback 0.02200 (1.000), MaxRel 0.02725 (0.999), MI 0.04438 (0.971), mRMR 0.04531 (0.967), Relief 0.11687 (0.059), Three stage -0.08819 (0.343).
Directional vs: Backward 0.00906 (1.000), χ² 0.07781 (0.527), Forward 0.01425 (1.000), Kullback 0.03106 (0.998), MaxRel 0.03631 (0.993), MI 0.05344 (0.910), mRMR 0.05437 (0.900), Relief 0.12594* (0.029), Three stage -0.07913 (0.502).
χ² vs: Backward -0.06875 (0.695), Directional -0.07781 (0.527), Forward -0.06356 (0.782), Kullback -0.04675 (0.959), MaxRel -0.04150 (0.981), MI -0.02438 (1.000), mRMR -0.02344 (1.000), Relief 0.04812 (0.951), Three stage -0.15694* (0.002).
Forward vs: Backward -0.00519 (1.000), Directional -0.01425 (1.000), χ² 0.06356 (0.782), Kullback 0.01681 (1.000), MaxRel 0.02206 (1.000), MI 0.03919 (0.987), mRMR 0.04012 (0.985), Relief 0.11169 (0.086), Three stage -0.09338 (0.265).
Kullback vs: Backward -0.02200 (1.000), Directional -0.03106 (0.998), χ² 0.04675 (0.959), Forward -0.01681 (1.000), MaxRel 0.00525 (1.000), MI 0.02237 (1.000), mRMR 0.02331 (1.000), Relief 0.09487 (0.244), Three stage -0.11019 (0.095).
MaxRel vs: Backward -0.02725 (0.999), Directional -0.03631 (0.993), χ² 0.04150 (0.981), Forward -0.02206 (1.000), Kullback -0.00525 (1.000), MI 0.01713 (1.000), mRMR 0.01806 (1.000), Relief 0.08962 (0.320), Three stage -0.11544 (0.066).
MI vs: Backward -0.04438 (0.971), Directional -0.05344 (0.910), χ² 0.02438 (1.000), Forward -0.03919 (0.987), Kullback -0.02237 (1.000), MaxRel -0.01713 (1.000), mRMR 0.00094 (1.000), Relief 0.07250 (0.627), Three stage -0.13256* (0.017).
mRMR vs: Backward -0.04531 (0.967), Directional -0.05437 (0.900), χ² 0.02344 (1.000), Forward -0.04012 (0.985), Kullback -0.02331 (1.000), MaxRel -0.01806 (1.000), MI -0.00094 (1.000), Relief 0.07156 (0.644), Three stage -0.13350* (0.015).
Relief vs: Backward -0.11687 (0.059), Directional -0.12594* (0.029), χ² -0.04812 (0.951), Forward -0.11169 (0.086), Kullback -0.09487 (0.244), MaxRel -0.08962 (0.320), MI -0.07250 (0.627), mRMR -0.07156 (0.644), Three stage -0.20506* (0.000).
Three stage vs: Backward 0.08819 (0.343), Directional 0.07913 (0.502), χ² 0.15694* (0.002), Forward 0.09338 (0.265), Kullback 0.11019 (0.095), MaxRel 0.11544 (0.066), MI 0.13256* (0.017), mRMR 0.13350* (0.015), Relief 0.20506* (0.000).
Table 4.10: Multiple comparisons table for different classifiers in hybrid framework.

Classifier (I)  Classifier (J)  Mean difference (I-J)  Sig.
ANN             DT              0.06180*               0.045
ANN             KNN             0.01028                0.971
ANN             SVM             0.00803                0.986
DT              ANN             -0.06180*              0.045
DT              KNN             -0.05153               0.128
DT              SVM             -0.05378               0.103
KNN             ANN             -0.01028               0.971
KNN             DT              0.05153                0.128
KNN             SVM             -0.00225               1.000
SVM             ANN             -0.00803               0.986
SVM             DT              0.05378                0.103
SVM             KNN             0.00225                1.000
Table (4.10) shows that there is a significant difference between the results produced by the DT and ANN classifiers, where the computed p-value is 0.045.
4.5 Conclusion
Feature selection is an important task in CS. In this chapter we proposed fusing, in a first stage, a set of filter methods as a pre-selection step. The first stage is followed by a filter selection based on a quadratic optimization and a similarity study. Finally, the fusion is refined by a wrapper selection. Results show that the fusion performance is either superior to or at least as good as that of the filter and wrapper methods.
Conclusion
Credit-risk evaluation involves processing huge volumes of data. Consequently, it requires powerful data mining tools. Several methods developed in machine learning have been used for financial credit-risk evaluation and especially for CS. However, the majority of these tools are affected by the curse of dimensionality, and irrelevant features often degrade the performance of predictive models both in speed and in predictive accuracy. Hence, the use of an optimal feature subset becomes essential.
In this thesis, we reviewed the framework of feature selection and explained the basic concepts of the different feature selection models: filter, wrapper and hybrid. Some research questions related to each of these three categories were examined.
In Chapter 2 we investigated filter feature selection. We presented a brief reminder of the filter framework and two major issues arising with filtering methods, namely the selection trouble and the issue of disjoint ranking for similar features. Then, a new approach was introduced with experimental investigations in Chapter 2, based on three steps. In the first, we presented the feature selection problem as an optimization problem with the aim of finding the best list, one as close as possible to all individual ordered lists. In the next step we presented a solution to the optimization problem, which consists in using GAs. In the final step we used similarity in order to resolve the problem of disjoint ranking for similar features. The experiments for this chapter were conducted on four credit datasets. We compared the new approach with some well known aggregation methods and some individual filtering methods, and results show that ensemble methods improve precision, recall and F-measure, especially when similarity is considered.
The approach proposed in Chapter 3 is an ensemble method based on wrapper feature selection, combining a complete search with a multiple classifier system. We first focused on the search strategy and the choice of the starting point, proposing to reduce the search space to a manageable size based on a similarity study with prior knowledge. Then, a hybrid search strategy mixing heuristic and complete search was performed.
The final part of the approach proposed in Chapter 3 was dedicated to the evaluation process. In this step two different classifier arrangement approaches were used within the wrapper evaluation process, namely the same-type approach and the mixed-type approach. By using the second arrangement approach we wanted to investigate how classifiers from different families work together and how their interaction influences feature selection. Then, by using both approaches we obtained a complete picture of the influence of the nature of the classifiers on feature selection. The results obtained in this chapter show that the use of prior information and heuristics in the complete search induces a significant gain in complexity with improved generalization. Furthermore, the obtained results show that the number of classifiers and their nature have an important effect on wrapper feature selection.
The final contribution of this thesis is presented in Chapter 4, where a three-stage feature selection fusion using quadratic programming is proposed. In the first stage we used a set of filter-based methods to classify candidate attributes into three main categories based on their relevance level. We obtained the following feature relevance categories: high, average and poor. Highly relevant features were kept as input to the second stage, average ones were kept as input to the third stage and the last category was eliminated.
In the second stage an efficient method dealing with both redundancy and relevance was considered. In this stage we minimized the redundancy among the most relevant features while maximizing their relevance to the target class. To find the best combination of relevance and redundancy we formulated this problem as the optimization of a quadratic multi-objective function. Once the most relevant features
were separated from the redundant ones we moved to stage three. The latter took as input the features selected in stage two and combined them with the average relevant features of the first stage; then a wrapper approach was trained on the resulting features. Results show that the fusion performance is either superior to or at least as good as that of the filter and wrapper methods.
The datasets used were relatively small in size, i.e. fewer than 100 features. In fact, the datasets considered in the course of the evaluation have at most 23 features (the Tunisian dataset). It would be of interest to test our three proposed methods on larger datasets to provide more insight into their performance.
Bibliography
Al-Ani, A. and M. Deriche (2001). An optimal feature selection technique using
the concept of mutual information. In Proceedings of the Sixth International
Symposium on Signal Processing and its Applications, pp. 477–480.
Arauzo-Azofra, A., J. M. Benitez, and J. L. Castro (2008). Consistency measures
for feature selection. Journal of Intelligent Information Systems 30 (3), 273–292.
Bardos, M. (2001). Analyse discriminante: Application au risque et scoring fi-
nancier. Dunod.
Bazaraa, M., H. Sherali, and C. Shetty (1993). Nonlinear Programming Theory and
Algorithms. New York: John Wiley.
Bellotti, T. and J. Crook (2009). Support vector machines for credit scoring and
discovery of significant features. Expert Systems with Applications 36 (2), 3302–
3308.
Blum, A. L. and P. Langley (1997). Selection of relevant features and examples in
machine learning. Artificial Intelligence 97 (2), 245–271.
Bonev, B. (2010). Feature Selection based on Information Theory. Ph. D. thesis,
University of Alicante.
Bouckaert, R. R., E. Frank, M. Hall, R. Kirkby, P. Reutemann, A. Seewald, and
D. Scuse (2009). Weka manual (3.7.1).
Breiman, L., J. Friedman, R. Olshen, and C. Stone (1984). Classification and Re-
gression Trees. Monterey, CA: Wadsworth and Brooks.
Burges, J. (1998). A tutorial on support vector machines for pattern recognition.
data mining knowledge discovery 2 (2), 121–167.
Carterette, B. (2009). On rank correlation and the distance between rankings. In
Proceedings of the 32nd international ACM SIGIR conference on Research and
development in information retrieval, SIGIR ’09, New York, NY, USA, pp.
436–443. ACM.
Chan, Y. H., W. Y. N. Wing, S. Y. Daniel, and P. P. K. Chan (2010). Empirical
comparison of forward and backward search strategies in l-gem based feature
selection with rbfnn. In Proceedings of the International Conference on Machine
Learning and Cybernetics (ICMLC), pp. 1524–1527.
Chen, F. L. and F. C. Li (2010). Combination of feature selection approaches with
svm in credit scoring. Expert Systems with Applications 37 (7), 4902–4909.
Cho, S., H. Hong, and B. C. Ha (2010). A hybrid approach based on the combi-
nation of variable selection using decision trees and case-based reasoning using
the mahalanobis distance: For bankruptcy prediction. Expert Systems with Ap-
plications 37 (4), 3482–3488.
Chrysostomou, K., S. Y. Chen, and X. Liu (2008). Combining multiple classifiers
for wrapper feature selection. International Journal of Data Mining, Modelling
and Management 1 (1), 91–102.
Chrysostomou, K. A. (2008). The Role of Classifiers in Feature Selection: Num-
ber vs Nature. Ph. D. thesis, School of Information Systems, Computing and
Mathematics. Brunel University.
Clegg, J., J. F. Dawson, S. J. Porter, and M. H. Barley (2009). A genetic algorithm
for solving combinatorial problems and the effects of experimental error - ap-
plied to optimizing catalytic materials. QSAR & Combinatorial Science 28 (9),
1010–1020.
Dash, M. and H. Liu (2003). Consistency-based search in feature selection. Artificial
Intelligence 151 (1-2), 155–176.
Dietterich, T. G. (2000). Ensemble methods in machine learning. In Proceedings of
the First International Workshop on Multiple Classifier Systems, London, UK,
pp. 1–15. Springer-Verlag.
Ding, C. and H. Peng (2005). Minimum redundancy feature selection from mi-
croarray gene expression data. Journal of Bioinformatics and Computational
Biology 3 (2), 185–206.
Dinu, L. P. and F. Manea (2006). An efficient approach for the rank aggregation
problem. Theoretical Computer Science 359 (1), 455–461.
Dittman, D. J., T. M. Khoshgoftaar, R. Wald, and A. Napolitano (2013). Classifi-
cation performance of rank aggregation techniques for ensemble gene selection.
In C. Boonthum-Denecke and G. M. Youngblood (Eds.), Proceedings of the
International Conference of the Florida Artificial Intelligence Research Society
(FLAIRS). AAAI Press.
Durand, D. (1941). Risk elements in consumer instalment financing. National bu-
reau of economic research [New York].
Falangis, K. and J. Glen (2010). Heuristics for feature selection in mathematical
programming discriminant analysis models. Journal of the Operational Research
Society 61 (5), 804–812.
Fernandez, G. (2010). Statistical Data Mining Using SAS Applications. Chapman
& Hall/Crc: Data Mining and Knowledge Discovery. Taylor and Francis.
Fisher, R. (1936). The use of multiple measurements in taxonomic problems. Annals
of Eugenics 7 (2), 179–188.
Forman, G. (2008). BNS feature scaling: an improved representation over tf-idf for SVM text classification. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, New York, NY, USA, pp. 263–270. ACM.
Frydman, H., E. I. Altman, and D. L. Kao (1985). Introducing recursive parti-
tioning for financial classification: The case of financial distress. Journal of
Finance 40 (1), 269–91.
Giudici, P. (2003). Applied Data Mining: Statistical Methods for Business and In-
dustry. West Sussex PO19 8SQ, England: John Wiley & Sons Ltd, The Atrium,
Southern Gate, Chichester.
Goldfarb, D. and A. Idnani (1983). A numerically stable dual method for solving
strictly convex quadratic programs. Mathematical Programming 27 (1), 1–33.
Gotshall, S. and B. Rylander (2000). Optimal population size and the genetic
algorithm.
Guyon, I. and A. Elisseeff (2003). An introduction to variable and feature selection.
Journal of Machine Learning Research 3 (9), 1157–1182.
Haizhou, W. and L. Jianwu (2011). Credit scoring based on eigencredits and
svdd. In Proceedings of the International Conference on Applied Informatics
and Communication, pp. 32–40.
Hall, M. A. (2000). Correlation-based feature selection for discrete and numeric
class machine learning. In Proceedings of the International Conference on Ma-
chine Learning(ICML), pp. 359–366.
Hand, D. J. and W. E. Henley (1997). Statistical classification methods in con-
sumer credit scoring: A review. Journal of the Royal Statistical Society Series
A 160 (3), 523–541.
Hastie, T., R. Tibshirani, and J. Friedman (2001). The Elements of Statistical
Learning. Springer series in statistics. Springer New York Inc.
Holland, J. H. (1992). Adaptation in natural and artificial systems. Cambridge,
MA, USA: MIT Press.
Howley, T., M. G. Madden, M. L. O’Connell, and A. G. Ryder (2006). The ef-
fect of principal component analysis on machine learning accuracy with high-
dimensional spectral data. Knowledge Based Systems 19 (5), 363–370.
Hsieh, N. C. and L. P. Hung (2010). A data driven ensemble classifier for credit
scoring analysis. Expert Systems with Applications 37 (1), 534–545.
Huang, C. L., M. C. Chen, and C. J. Wang (2007). Credit scoring with a data
mining approach based on support vector machines. Expert Systems with Ap-
plications 33 (4), 847–856.
Jiang, Y. (2009). Credit scoring model based on the decision tree and the simulated
annealing algorithm. In Proceedings of the 2009 WRI World Congress on Com-
puter Science and Information Engineering, Washington, DC, USA, pp. 18–22.
IEEE Computer Society.
Kira, K. and L. A. Rendell (1992). A practical approach to feature selection. In
Proceedings of the ninth international workshop on Machine learning, San Fran-
cisco, CA, USA, pp. 249–256. Morgan Kaufmann Publishers Inc.
Kittler, J. (1998). Combining classifiers: A theoretical framework. Pattern Analysis
& Applications 1 (1), 18–27.
Kohavi, R. and G. H. John (1997). Wrappers for feature subset selection. Artificial
Intelligence 97 (1).
Kolde, R., S. Laur, P. Adler, and J. Vilo (2012). Robust rank aggregation for gene
list integration and meta-analysis. Bioinformatics 28 (4), 573–580.
Kumar, G. and K. Kumar (2011). A novel evaluation function for feature selection
based upon information theory. In Proceedings of the Canadian Conference on
Electrical and Computer Engineering (CCECE), pp. 395–399.
Kumar, R. and S. Vassilvitskii (2010). Generalized distances between rankings. In
Proceedings of the 19th international conference on World wide web, WWW’10,
New York, NY, USA, pp. 571–580. ACM.
Kuncheva, L. I., J. C. Bezdek, and P. W. Duin (2001). Decision templates for mul-
tiple classifier fusion: an experimental comparison. Pattern Recognition 34 (2),
299–314.
Kwang, L. (2002). Combining multiple feature selection methods. In Proceedings of
the Mid-Atlantic Student Workshop on Programming Languages and Systems.
Ledesma, S., G. Cerda, G. Aviña, D. Hernández, and M. Torres (2008). Feature
selection using artificial neural networks. In Proceedings of the 7th Mexican
International Conference on Artificial Intelligence: Advances in Artificial In-
telligence, MICAI ’08, Berlin, Heidelberg, pp. 351–359. Springer-Verlag.
Legrand, G. and N. Nicoloyannis (2005). Feature selection method using prefer-
ences aggregation. In Proceedings of the 4th international conference on Ma-
chine Learning and Data Mining in Pattern Recognition, MLDM’05, Berlin,
Heidelberg, pp. 203–217. Springer-Verlag.
Li, S., E. J. Harner, and D. Adjeroh (2011). Random KNN feature selection - a fast
and stable alternative to Random Forests. BMC Bioinformatics 12 (1), 450–461.
Liu, H. and H. Motoda (1998). Feature Selection for Knowledge Discovery and Data
Mining. Norwell, MA, USA: Kluwer Academic Publishers.
Liu, H. and L. Yu (2005). Toward integrating feature selection algorithms for clas-
sification and clustering. IEEE Transactions on Knowledge and Data Engineer-
ing 17 (4), 491–502.
Liu, Y. and M. Schumann (2005). Data mining feature selection for credit scoring
models. Journal of the Operational Research Society 56 (9), 1099–1108.
Mak, M. W. and S. Y. Kung (2008). Fusion of feature selection methods for pairwise
scoring svm. Neurocomputing 71 (16-18), 3104–3113.
Matjaz, V. (2012). Estimating probability of default and comparing it to credit rating classification by banks. Economic and Business Review 14 (4), 299–320.
Merbouha, A. and A. Mkhadri (2006). Méthodes de scoring non-paramétriques. Revue de Statistique Appliquée 56 (1), 5–26.
Molina, L. C., L. Belanche, and A. Nebot (2002). Feature selection algorithms: A
survey and experimental evaluation. In Proceedings of the IEEE International
Conference on Data Mining, pp. 306–313. IEEE Computer Society.
Paleologo, G., A. Elisseeff, and G. Antonini (2010). Subagging for credit scoring
models. European Journal of Operational Research 201 (2), 490–499.
Peng, H., F. Long, and C. Ding (2005). Feature selection based on mutual informa-
tion: criteria of max-dependency, max-relevance, and min-redundancy. IEEE
Transactions on Pattern Analysis and Machine Intelligence 27 (8), 1226–1238.
Pihur, V., S. Datta, and S. Datta (2009). RankAggreg, an R package for weighted
rank aggregation. BMC Bioinformatics 10 (1), 62–72.
Piramuthu, S. (2004). Evaluating feature selection methods for learning in data
mining applications. European Journal of Operational Research 156 (2), 483–494.
Piramuthu, S. (2006). On preprocessing data for financial credit risk evaluation.
Expert Systems with Applications 30 (3), 489–497.
Rodriguez, I., R. Huerta, C. Elkan, and C. S. Cruz (2010). Quadratic Programming
Feature Selection. Journal of Machine Learning Research 11 (4), 1491–1516.
Sadatrasoul, S. M., M. Gholamian, M. Siami, and H. Z. (2013). Credit scoring in
banks and financial institutions via data mining techniques: A literature review.
Journal of Artificial Intelligence and Data Mining 1 (2), 119–129.
Saeys, Y., T. Abeel, and Y. Peer (2008). Robust feature selection using ensemble
feature selection techniques. In Proceedings of the European conference on Ma-
chine Learning and Knowledge Discovery in Databases - Part II, ECML PKDD
’08, Berlin, Heidelberg, pp. 313–325. Springer-Verlag.
Saeys, Y., I. Inza, and P. Larrañaga (2007). A review of feature selection tech-
niques in bioinformatics. Bioinformatics 23 (19), 2507–2517.
Schebesch, K. B. and R. Stecking (2005). Support vector machines for classifying
and describing credit applicants: detecting typical and critical regions. Journal
of the Operational Research Society 56 (9), 1082–1088.
Sculley, D. (2007). Rank aggregation for similar items. In Proceedings of the Seventh
SIAM International Conference on Data Mining, pp. 587–592.
Thomas, L. (2009). Consumer credit models: pricing, profit, and portfolios. Oxford
University Press.
Thomas, L., J. Crook, and D. Edelman (2002). Credit Scoring and Its Applications.
Philadelphia, PA, USA: Society for Industrial and Applied Mathematics.
Thomas, L. C. (2000). A survey of credit and behavioural scoring: forecasting fi-
nancial risk of lending to consumers. International Journal of Forecasting 16 (2),
149–172.
Tsai, C. F. and J. W. Wu (2008). Using neural network ensembles for bankruptcy
prediction and credit scoring. Expert Systems with Applications 34 (4), 2639–
2649.
Tufféry, S. (2007). Data mining et statistique décisionnelle: l'intelligence des données. Éditions Ophrys.
Tuv, E., A. Borisov, G. Runger, K. Torkkola, I. Guyon, and A. R. Saffari (2009).
Feature selection with ensembles, artificial variables, and redundancy elimina-
tion. Journal of Machine Learning Research 10 (9).
Vapnik, V. (1995). The nature of statistical learning theory. New York, NY, USA:
Springer-Verlag New York, Inc.
Šušteršič, M., D. Mramor, and J. Zupan (2009). Consumer credit scoring models
with limited data. Expert Systems with Applications 36 (3), 4736–4744.
Wang, J., A. R. Hedar, S. Wang, and J. Ma (2012). Rough set and scatter search
metaheuristic based feature selection for credit scoring. Expert Systems with
Applications 39 (6), 6123–6128.
Wu, O., H. Zuo, M. Zhu, W. Hu, J. Gao, and H. Wang (2009). Rank
aggregation based text feature selection. In Proceedings of the Web Intelligence,
pp. 165–172.
Yang, L. (2001). New issues in credit scoring application. Working Papers 16/2001,
Institut für Wirtschaftsinformatik.
Yu, L. and H. Liu (2003). Feature selection for high-dimensional data: A fast
correlation-based filter solution. In Proceedings of the International Conference
on Machine Learning (ICML), pp. 856–863.
Yu, L. and H. Liu (2004). Efficient feature selection via analysis of relevance and
redundancy. Journal of Machine Learning Research 5 (1), 1205–1224.
Yun, C., D. Shin, H. Jo, J. Yang, and S. Kim (2007). An experimental study on
feature subset selection methods. In Proceedings of the 7th IEEE International
Conference on Computer and Information Technology, CIT ’07, Washington,
DC, USA, pp. 77–82. IEEE Computer Society.
Zhang, D., X. Zhou, S. C. H. Leung, and J. Zheng (2010). Vertical bagging decision
trees model for credit scoring. Expert Systems with Applications 37 (12), 7838–
7843.
Publications
The following are published articles out of this thesis.
Journal Papers
1. Bouaguel, W., Bel Mufti, G. and Limam, M. (2013), A three-stage feature
selection using quadratic programming for credit scoring. Applied Artificial
Intelligence: An International Journal, Vol.27, No.8, September, pp.721-742.
2. Bouaguel, W. and Bel Mufti, G. (2012), An improvement direction for filter
selection techniques using information theory measures and quadratic opti-
mization. International Journal of Advanced Research in Artificial Intelligence,
Vol.1, No:5, August, pp.7-11.
Conference Papers
1. Bouaguel, W., Bel Mufti, G. and Limam, M. (2013), Similarity aggregation: a
new version of rank aggregation applied to the credit scoring case. Mining In-
telligence and Knowledge Exploration, Lecture Notes in Computer Science,
Vol.8284, pp.618-628.
2. Bouaguel, W., Bel Mufti, G. and Limam, M. (2013), Rank aggregation for
filter feature selection in credit scoring. Mining Intelligence and Knowledge
Exploration, Lecture Notes in Computer Science, Vol.8284, pp.7-15.
3. Bouaguel, W., Ben Brahim, A. and Limam, M. (2013), Feature selection by
rank aggregation and genetic algorithms, Proceedings of the 5th International
Conference on Knowledge Discovery and Information Retrieval (KDIR 2013),
Vilamoura, Algarve, Portugal, 19 - 22 September, pp.33-40.
4. Ben Brahim, A., Bouaguel, W. and Limam, M. (2013), Feature selection ag-
gregation versus classifiers aggregation for several data dimensionalities, Pro-
ceedings of the International Conference on Control, Engineering & Information
Technology (CEIT 2013), Sousse, Tunisia, 4-7 June, pp.10-15.
5. Bouaguel, W., Bel Mufti, G. and Limam, M. (2013), On the effect of search
strategies on wrapper feature selection in credit scoring, Proceedings of the
International Conference on Control, Decision and Information Technologies
(CODIT 2013), Hammamet, Tunisia, 6-8 May, pp.60-67.
6. Bouaguel, W., Bel Mufti, G. and Limam, M. (2013), A fusion approach based
on wrapper and filter feature selection methods using majority vote and feature
weighting, Proceedings of the International Conference on Computer Applica-
tions Technology (ICCAT 2013), Sousse, Tunisia, 20- 22 January, pp.47-53.
7. Bouaguel, W., Bel Mufti, G. and Limam, M. (2012), Quadratic Programming
for Feature Selection in Credit Scoring, Proceedings of the Third Meeting on
Statistics and Data Mining (MSDM 2012), Hammamet, Tunisia, 14-15 March,
pp.7-14.
Book Chapters
1. Ben Brahim, A., Bouaguel, W. and Limam, M. (2014), Combining feature se-
lection and data classification using ensemble approaches: application to cancer
diagnosis and credit scoring. Case Studies in Intelligent Computing: Achievements and Trends, Taylor & Francis (Accepted).
2. Bouaguel, W., Bel Mufti, G. and Limam, M. (2014), A new feature selection
technique applied to credit scoring data using a rank aggregation approach
based on: optimization, genetic algorithm and similarity. Knowledge Discovery
& Data Mining (KDDM) for Economic Development: Applications, Strategies
and Techniques, Taylor & Francis (Conditional acceptance).
Appendix A
Feature categories and datasets description
A.1 Feature categories
In general, a feature can describe either a qualitative or quantitative characteristic of
a credit applicant. Examples of qualitative characteristics are gender, occupation and
marital status. Qualitative variables are also called categorical variables. Examples
of quantitative characteristics are age and the amount of a loan. Qualitative and quantitative
features can each be divided into two main categories, as depicted in Figure (A.1).
Figure A.1: Feature categories.
A.1.1 Qualitative features
Qualitative or categorical variables describe a quality or attribute of the individual.
Categorical features can be either nominal or ordinal. A nominal feature is a cat-
egorical feature that has two or more categories, but there is no intrinsic ordering
to the categories. For example, gender is a nominal variable having two categories, {male, female}, with no intrinsic ordering between them. Typically, nominal features are characterized by:
- No ordering of the different categories.
- No measure of distance between values.
- Categories can be listed in any order without affecting the relationship between them.
An ordinal feature is similar to a nominal feature. The difference between the two is that there is a clear ordering of the categories. For example, suppose you have a feature, educational experience, with modalities such as elementary school graduate, high school graduate and college graduate. These categories can be ordered as elementary school, high school and college graduate. Typically, ordinal features are characterized by:
- An implied ordering of the categories.
- The quantitative distance between levels is unknown.
- Distances between the levels may not be the same.
- The meaning of different levels may not be the same for different individuals.
A.1.2 Quantitative features
Quantitative features describe some quantity about the individual and are often mea-
sured or counted. These features can be either continuous or discrete. A continuous
variable is one that could take any value in an interval. A discrete feature is one
that can only take specific numeric values but those numeric values have a clear
quantitative interpretation.
Because qualitative data always have a limited number of alternative values, such
variables are also described as discrete. All numeric qualitative features are discrete,
while some quantitative features are discrete and some are continuous. For statistical
analysis, qualitative features can be converted into discrete numeric data by simply
counting the different values that appear.
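For instance, the conversion just described amounts to coding a factor in R (the values below are made up for illustration):

# Converting a qualitative feature into discrete numeric codes
occupation <- factor(c("clerk", "manager", "clerk", "technician"))
as.integer(occupation)   # 1 2 1 3: numeric category codes
table(occupation)        # counts of each distinct value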
A.2 Datasets description

This section reports the benchmark datasets used to evaluate the performance of the different feature selection methods. While the Australian and German datasets are downloaded from the UCI Machine Learning Repository, the HMEQ dataset is available with the SAS software and the Tunisian dataset results from information collected from a Tunisian bank.
A.2.1 Australian dataset
As discussed earlier in Chapter 1, the Australian dataset is composed of 690 instances, of which 307 are creditworthy while 383 are not. There are 6 numerical and 8 categorical features, and all feature names and values have been changed to meaningless symbols for confidentiality. The labels have been changed for the convenience of the statistical algorithms. For example, attribute 4 originally had 3 labels p, g, gg and these have been changed to labels 1, 2, 3.
Variable Description Type Description of modalities
A1 No description is available Categorical 2 modalities {0,1}; no description is available about the significance of each modality
A2 No description is available Continuous No modalities
A3 No description is available Continuous No modalities
A4 No description is available Categorical 3 modalities {1,2,3}; significance not documented
A5 No description is available Categorical 14 modalities {1,2,...,14}; significance not documented
A6 No description is available Categorical 9 modalities {1,2,...,9}; significance not documented
A7 No description is available Continuous No modalities
A8 No description is available Categorical 2 modalities {0,1}; significance not documented
A9 No description is available Categorical 2 modalities {0,1}; significance not documented
A10 No description is available Continuous No modalities
A11 No description is available Categorical 2 modalities {0,1}; significance not documented
A12 No description is available Categorical 3 modalities {1,2,3}; significance not documented
A13 No description is available Continuous No modalities
A14 No description is available Continuous No modalities
A15 Creditability Categorical 1: good applicant; 2: bad applicant
A.2.2 German dataset
The German dataset covers a sample of 1000 credit consumers, of which 700 instances are creditworthy and 300 are not. For each applicant, 21 numerically coded input variables are available. More details are given in the table below.
Variable Description Type Description of modalities
Alter Age Continuous No modalities
Beruf Occupation Categorical 1: unemployed / unskilled with no per-
manent residence
2: unskilled with permanent residence
3: skilled worker / skilled employee /
minor civil servant
4: executive / self-employed / higher
civil servant
Beszeit Has been employed by cur-
rent employer for
Categorical 1: unemployed
2: 1 year
3 : 1 .. < 4 years
4 : 4 .. < 7 years
120
Appendix A
5 :7 years
Beurge Further debtors / Guaran-
tors
Categorical 1: none
2: Co-Applicant
3: Guarantor
Bishkred Number of previous credits
at this bank (including the
running one)
Categorical 1: zero
2: one or two
3: three or four
4: five or more
Famges Marital Status / Sex Categorical 1: male: divorced / living apart
2: female: divorced / living apart /
married
3: male: single /married /widowed
4: female: single
Gastarb Foreign worker Categorical 1: yes
2: no
Hoehe Amount of credit in
”Deutsche Mark”
Continuous No modalites
Kredit Creditability Binary 0: not credit-worthy
1: credit-worthy
Laufkount Balance of current account Categorical 1: no running account
2: no balance or debit
3: 0 .. < 200 DM
4 : 200 DM / checking account for at
least 1 year
Laufzeit Duration in months Categorical No modalites
Moral Payment of previous cred-
its
Categorical 0: hesitant payment of previous credits
1: problematic running account / there
are further credits running but at other
banks
2: no previous credits / paid back all
previous credits
3: no problems with current credits at
this bank
4: paid back previous credits at this
bank
Pers Number of persons entitled
to maintenance
Categorical 1: 0 to 2
2: 3 or more
121
Appendix A
Rate Installment in % of avail-
able income
Categorical 1: 35
2: 25 ... < 35
3: 20 ... < 25
4 : <20
Telef Telephone Categorical 1: yes
2: no
Sparkont Value of savings or stocks Categorical 1:not available / no savings
2: <100 DM
3: 100 ... < 500 DM
4: 500 ... < 1000 DM
5 : 1000 DM
Verm Most valuable available as-
sets
Categorical 1: not available / no assets
2: Car / Other
3: Savings contract with a building so-
ciety / Life insurance
4: Ownership of house or land
Verw Purpose of credit Categorical 1: new car
2: used car
3: items of furniture
4: radio / television
5: household appliances
6: repair
7: education
8: vacation
9: retraining
10: business
Weitkred Further running credits Categorical 1: at other banks
2: at department store or mail order
house
3: no further running credits
Wohn Type of apartment Categorical 1: free apartment
2: rented flat
3: free apartment
Wohnzeit Living in current house-
hold for
Categorical 1:<1 year
2: 1 ... < 4 years
3: 4 ... < 7 years
4 : 7 years
A.2.3 HMEQ dataset
The HMEQ dataset is composed of 5960 instances, of which 4771 are creditworthy and 1189
are not. For each applicant, 12 input variables are available, described below.
Variable | Description | Type | Description of modalities
BAD | Creditability | Binary | 1: applicant defaulted on loan or seriously delinquent; 0: applicant paid loan
CLAGE | Age of oldest credit line in months | Continuous | No modalities
CLNO | Number of credit lines | Continuous | No modalities
DEBTINC | Debt-to-income ratio | Continuous | No modalities
DELINQ | Number of delinquent credit lines | Continuous | No modalities
DEROG | Number of major derogatory reports | Continuous | No modalities
JOB | Occupational categories | Categorical | 6 modalities {1, 2, ..., 6}; significance undisclosed
LOAN | Amount of the loan request | Continuous | No modalities
MORTDUE | Amount due on existing mortgage | Continuous | No modalities
NINQ | Number of recent credit inquiries | Continuous | No modalities
REASON | Reason for credit | Categorical | DebtCon: debt consolidation; HomeImp: home improvement
VALUE | Value of current property | Continuous | No modalities
YOJ | Years at present job | Continuous | No modalities
A.2.4 Tunisian dataset
The Tunisian dataset covers a sample of 2970 credit consumers, of which 2523 instances
are creditworthy and 446 are not. Each credit applicant is described by 22 input
variables, as detailed below.
Variable | Description | Type | Description of modalities
ZONE GEO | Geographic zone | Categorical | 1: North; 2: South; 3: Center
GEN | Gender of the credit applicant | Categorical | 1: female; 2: male
MKT | No description available | Categorical | 1: PAR; 2: PRF
NOUV SM | Profession of the applicant | Categorical | 1: liberal profession; 2: lawyer and similar; 3: private employee; 4: students / others; 5: doctor and similar; 6: retired; 7: government employee
AGE | Age | Continuous | No modalities
EPARGNE | Saving account | Categorical | 1: saving account; 0: no saving
TOTENG | Amount of other credits | Continuous | No modalities
DUR | Duration in months | Continuous | No modalities
CMR | No description available | Continuous | No modalities
CMR2 | No description available | Continuous | No modalities
AUTOR | No description available | Continuous | No modalities
MNTDEBLOC | No description available | Continuous | No modalities
ENCOUR | Amount of further running credits | Continuous | No modalities
MULTIBANC | Multiple running accounts in other banks | Categorical | 1: yes; 2: no
DOMICIL | No description available | Categorical | 1: transfer of delegation; 2: pension account; 3: no domiciliation; 4: directly paid wages
Safir | The applicant has a Safir account | Categorical | 1: normal account; 2: Safir
STAT FAM | Marital status | Categorical | 1: single; 2: divorced; 3: married; 4: widowed
SAL MEN | Net monthly salary | Continuous | No modalities
REV MEN | Net monthly income | Continuous | No modalities
NBR | Number of dependants under the age of 18 | Continuous | No modalities
RVS | No description available | Continuous | No modalities
Appendix B
Classification methods
This appendix presents the classification algorithms used to evaluate candidate feature
subsets in the wrapper framework. Understanding the foundations of these algorithms is
helpful for the comprehension of this work.
B.1 Artificial Neural Network
ANN were originally developed in the machine learning field and have become an important
data mining method. A neural network is composed of a set of elementary computational
units, called neurons, connected together through weighted connections. Every neuron,
also called a node, is an autonomous computational unit that receives as input the
description of an observation x_i = (x_i^1, ..., x_i^d), called the signal. Each signal
is attached to an importance weight, and the neuron combines the input signals, their
importance weights and a threshold value through a combination function. The combination
function produces a value called the potential, which an activation function transforms
into an output signal. The combination function is defined as

f(x_i) = \sum_{j=1}^{d} \beta_j x_i^j + \beta_0 = \beta^T x_i,    (B.1)

where \beta_j, j = 1, ..., d, is the weight associated with each signal. The final
output y_i of the neuron is decided according to the sign of f(x_i):

y_i = \begin{cases} 0 & \text{if } \beta^T x_i \le 0 \\ 1 & \text{otherwise} \end{cases}    (B.2)
ANNs have proved to be a powerful solution. However, their performance depends on the
initial conditions, the network topology and the training algorithm, which may be one
reason why ANN results vary for credit scoring.
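For illustration, here is a minimal Python sketch of the single neuron defined by equations (B.1) and (B.2); the weights and inputs are toy values chosen arbitrarily:

```python
import numpy as np

def neuron_output(x, beta, beta0):
    """Combination function (B.1) followed by the threshold
    activation (B.2): returns 1 if beta^T x + beta0 > 0, else 0."""
    potential = np.dot(beta, x) + beta0   # f(x_i) in (B.1)
    return 1 if potential > 0 else 0      # y_i in (B.2)

# Toy example: two input signals with assumed importance weights
x = np.array([0.8, 0.3])       # observation (x_i^1, x_i^2)
beta = np.array([0.5, -1.0])   # importance weights
beta0 = 0.1                    # bias / threshold term

print(neuron_output(x, beta, beta0))  # -> 1, since 0.5*0.8 - 1.0*0.3 + 0.1 = 0.2 > 0
```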
B.2 Support Vector Machines
Among the new methods for credit scoring, SVM is one of the most promising. The use of
SVM in financial applications has been examined in several previous works (Schebesch and
Stecking 2005; Huang et al. 2007; Bellotti and Crook 2009). SVM was first proposed by
Vapnik (1995) and has recently become one of the most widely applied methods in data
mining. There are many reasons for choosing SVM (Burges 1998): it requires fewer prior
assumptions about the input data and can perform a nonlinear mapping from the original
input space into a high-dimensional feature space, in which it constructs a linear
discriminant function to replace the nonlinear function in the original low-dimensional
input space. A simple description of the SVM algorithm is as follows. Given a training
set \{x_i, y_i\}_{i=1}^{n} with input vector x_i = [x_i^1, x_i^2, ..., x_i^d]^T and
target variable y_i \in \{+1, -1\}, the original formulation of the SVM algorithm
satisfies the following conditions:

\beta^T \phi(x_i) + b \ge +1 \quad \text{if } y_i = +1
\beta^T \phi(x_i) + b \le -1 \quad \text{if } y_i = -1,    (B.3)

which is equivalent to

y_i (\beta^T \phi(x_i) + b) - 1 \ge 0, \quad i = 1, ..., n,    (B.4)

where \beta represents the weight vector, b the bias, and \phi(x_i) a nonlinear mapping
function. Equation (B.4) amounts to constructing two parallel bounding hyperplanes at
opposite sides of a separating hyperplane \beta^T \phi(x_i) + b = 0 in the feature
space, with the margin width between both hyperplanes equal to 2 / ||\beta||_2. The
classifier then takes the decision function form sign(\beta^T \phi(x_i) + b).
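As an illustration of this formulation in practice, the following sketch trains a soft-margin SVM with scikit-learn on synthetic credit-like data; the RBF kernel (which plays the role of the mapping \phi) and the synthetic labeling rule are illustrative assumptions, not the setup of the cited studies:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                          # 6 synthetic applicant features
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0.5).astype(int)   # nonlinear "creditworthy" rule

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Scaling matters for SVM; the RBF kernel performs the nonlinear mapping phi(x)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```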
B.3 Decision Trees
According to Thomas et al. (2002), the idea of DT is to split the set of applications
into different sets and then identify each of these sets as good or bad depending on
the majority in that set. The idea was developed for general classification problems
by Breiman et al. (1984) and was first applied in this context by Frydman et al. (1985).
This method is very simple and can be described according to the following scheme
(Giudici 2003): a DT consists of nodes and edges, where the root node defines the first
split of the credit applicant sample. Each internal node splits the instance sample
into two subsets, and the operation is repeated until further division of the
sub-populations is no longer possible, i.e., until each terminal node contains
(mostly) individuals of a single class.
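A minimal sketch of this recursive splitting scheme using scikit-learn's DecisionTreeClassifier; the synthetic data and the depth limit are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for a credit applicant sample
X, y = make_classification(n_samples=400, n_features=5, random_state=0)

# Each internal node splits the sample in two; splitting stops when
# nodes are (nearly) pure or the depth limit is reached
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))  # textual view of the root and internal splits
```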
B.4 K-Nearest Neighbor
The main idea of KNN is to choose a distance measure on the space of application data
in order to measure how distant any two applicants are (Thomas et al. 2002). Then,
using a learning sample of past applicants represented by the couples (x_i, y_i), a new
applicant x_{i'} is classified as good or bad depending on the proportions of goods and
bads among the k nearest applicants from the learning sample. The two parameters needed
to run this approach are the distance metric and the number k of applicants constituting
the set of nearest neighbors. A commonly used distance metric with KNN is the Euclidean
distance, given by

d(x_i, x_{i'}) = \sqrt{\sum_{j=1}^{d} (x_i^j - x_{i'}^j)^2}.    (B.5)

This method is often used for heterogeneous data with missing values. Although simple,
the choice of the number of neighbors k remains a difficult task: this number is either
fixed beforehand or chosen by cross-validation (Merbouha and Mkhadri 2006).
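The following sketch illustrates both design choices of the method: the Euclidean metric of equation (B.5) and the selection of k by cross-validation. The candidate grid for k and the synthetic data are arbitrary assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# Euclidean distance is the default metric (Minkowski with p=2), matching (B.5)
knn = KNeighborsClassifier(metric="minkowski", p=2)

# Choose k by cross-validation instead of fixing it beforehand
search = GridSearchCV(knn, {"n_neighbors": [1, 3, 5, 7, 9, 11]}, cv=5)
search.fit(X, y)
print("best k:", search.best_params_["n_neighbors"])
```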