Applying Machine Learning to Customized Smell Detection:
A Multi-Project Study
Daniel Oliveira
doliveira@inf.puc-rio.br
Pontical Catholic University
Rio de Janeiro, RJ
Wesley K. G. Assunção
wesleyk@utfpr.edu.br
Federal University of Technology
Toledo, PR
Leonardo Souza
leo.sousa@sv.cmu.edu
Carnegie Mellon University
Silicon Valley, CA
Willian Oizumi
woizumi@inf.puc-rio.br
Pontical Catholic University
Rio de Janeiro, RJ
Alessandro Garcia
afgarcia@inf.puc-rio.br
Pontical Catholic University
Rio de Janeiro, RJ
Baldoino Fonseca
baldoino@ic.ufal.br
Federal University of Alagoas
Maceió, AL
ABSTRACT
Code smells are considered symptoms of poor implementation choices, which may hamper software maintainability. Hence, code smells should be detected as early as possible to avoid software quality degradation. Unfortunately, detecting code smells is not a trivial task. Some preliminary studies investigated and concluded that machine learning (ML) techniques are a promising way to better support smell detection. However, these techniques are hard to customize for an early and accurate detection of specific smell types. Moreover, ML techniques usually require numerous code examples for training (composing a relevant dataset) in order to achieve satisfactory accuracy. Unfortunately, such a dependency on a large validated dataset is impractical and leads to late detection of code smells. Thus, a prevailing challenge is the early customized detection of code smells taking into account the typically limited training data. In this direction, this paper reports a study in which we collected, from ten active projects, code smells that were actually refactored by developers, differently from studies that rely on code smells inferred by researchers. These smells were used for evaluating the accuracy of seven ML techniques regarding the early detection of code smells. Since we take into account smells that were considered important by developers, the ML techniques are able to customize the detection to focus on smells observed as relevant in the investigated systems. The results showed that all the analyzed techniques are sensitive to the type of smell and obtained good results for the majority of them, especially JRip and Random Forest. We also observed that the ML techniques did not need a high number of examples to reach their best accuracy results. This finding implies that ML techniques can be successfully used for early detection of smells without depending on the curation of a large dataset.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
SBES '20, October 21–23, 2020, Natal, Brazil
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-8753-8/20/09...$15.00
https://doi.org/10.1145/3422392.3422427
CCS CONCEPTS
• Software and its engineering → Software design engineering.
KEYWORDS
code smell, code smell detection, software quality
ACM Reference Format:
Daniel Oliveira, Wesley K. G. Assunção, Leonardo Souza, Willian Oizumi, Alessandro Garcia, and Baldoino Fonseca. 2020. Applying Machine Learning to Customized Smell Detection: A Multi-Project Study. In 34th Brazilian Symposium on Software Engineering (SBES '20), October 21–23, 2020, Natal, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3422392.3422427
1 INTRODUCTION
Code smells are considered symptoms of poor implementation choices, which make software systems hard to maintain [20]. Due to their harmfulness to software quality [1, 28, 50], code smells should be detected and removed as early as possible along the software lifecycle. Although removing code smells is of paramount importance to keep the internal quality of a software system, their detection is not an easy task. Their detection is mostly subject to developer judgment. In fact, identifying code smells is a subjective activity [24]. Developers have the knowledge to confirm the harmfulness of a smelly structure [13]. This knowledge varies according to the developer's experience, skills, and mastery of the source code being analyzed.
Existing literature suggests different strategies to detect code smells. The vast majority of these detection strategies are based on metrics and their respective thresholds [3, 32, 34]. These strategies tend to analyze each code fragment and employ some previously defined thresholds to classify the fragment as host (or not) of a specific smell. However, operationalizing such strategies is difficult. This difficulty stems from the fact that the operationalization (i.e., the threshold definition) of the strategy for detecting each smell type requires proper reasoning. Such an operationalization cannot be solely based on finding metrics and thresholds in compliance with the conceptual definition of a smell type. The operationalization also needs to be customized by considering various contextual information of the projects, to which only the developer has access.
Preliminary studies have analyzed the use of machine learning (ML) techniques as a promising way to detect code smells [2, 17, 18, 24, 25, 38]. These previous studies evaluate the use of training datasets containing numerous code examples annotated as smelly or non-smelly code by developers. From these training datasets, the ML techniques generate detection models that can successfully detect smells similar to the ones used for the training. However, these techniques are hard to customize, for example, to detect only certain types of code smells that are relevant for developers of specific projects. Moreover, ML techniques usually require numerous code examples to be trained. Obtaining that many instances with the desired properties to compose a relevant dataset can be time-consuming, leading to late detection of code smells. Thus, a prevailing challenge is the early customized detection of code smells taking into account the typically limited training data.
To overcome the described limitations, this paper reports a multi-project study in which we collected code smells that were actually refactored by developers, differently from studies that infer code smells based on metrics and thresholds. This study relies on ten active projects, with different sizes and belonging to distinct domains, and six types of code smells. We chose smell types that cover different system scopes, such as classes, methods, fields, and parameters. The composed dataset was used in the training and evaluation of seven ML techniques in terms of the accuracy regarding early detection of code smells. The accuracy was computed based on the traditional F-measure [27]. Differently from existing strategies, since we take into account smells that were considered important by developers, the ML techniques are able to customize the detection to focus on smells observed as relevant in the investigated systems.
The results pointed out that, when considering only relevant smells, the ML techniques have similar behavior for the same smell type. Besides that, they also provide good support for detecting code smells in projects with different sizes, since they do not need a high number of examples to reach high results. Based on the results obtained and the analysis performed, our study led to the following findings:
• Smell types vs. ML techniques. When detecting harmful smells, the smell types have more influence on accuracy than the ML techniques themselves.
• Effect of the metrics on the customized smell detection. Particularities of each code smell type affect the accuracy of the ML techniques, for example, the metrics (features) that best represent the smell.
• The subjectivity of detecting customized smells affects the accuracy of the ML techniques. Smells with more subjective definitions, i.e., more complex ones, tend to obtain lower accuracy, since the training set will have a higher variation in the values of the features.
• ML techniques using a small set of curated examples (based on previous refactorings by developers) can successfully support early customized detection of code smells. When trained on datasets that well represent what developers consider as relevant in practice, ML techniques are able to detect code smells early.
The contributions of this work are as follows. First, differently from previous studies, we focused on assessing the use of ML techniques only for detecting smells that ended up being refactored by a developer. The refactoring of the smelly code indicates that the developer, either consciously or not, confirmed the relevance of a smell. Those smells can be considered relevant to the program as their removal helped the developer to achieve a maintenance goal. Second, we assessed the accuracy of seven well-known ML techniques on detecting code smells using our customized dataset, considering different training samples, in which we gradually increase the number of examples used to perform ML technique training. Third, we make our dataset publicly available¹ for future studies and replication.
The remainder of this document is structured as follows. Section 2 describes the background related to code smells and ML techniques. Related work is described in Section 3. Section 4 details the design of our study. Section 5 presents all the results and analysis as well as the answers to our research questions. Section 6 describes the threats and limitations of the study. Finally, Section 7 presents the concluding remarks and discusses future work.
2 BACKGROUND
Our study encompasses two research topics, namely, code smells
and ML techniques. The next sections describe the six code smells
we considered and the seven ML techniques, respectively.
2.1 Code Smells
Poor code structures are often represented by the so-called code smells [19]. A code smell is considered a symptom of a poor implementation choice. As a consequence, software maintenance requires additional cost and effort on understanding and re-structuring the smelly code [8, 9, 30]. Due to their harmfulness to software maintainability [1, 28, 50], code smells should be detected and removed as early as possible along the software lifecycle.
There are several types of code smells described in the literature [19]. Each one characterizes a recurring poor code structure and affects a specific scope of program elements. For our study, we selected the six smell types listed in Table 1. These smell types are related to bad design decisions or reflect important maintainability aspects [49]. The first column represents the smell type name. The second column presents a brief description of the type. We have chosen these smell types due to the different scopes of a program affected by them, namely classes, methods, fields, and parameters.
Table 1: Types of Code Smells Investigated in this Study

Name                                  | Description
Complex Class (CC)                    | Classes that involve a lot of different but related parts.
Class Data Should be Private (CDSBP)  | Classes that expose their attributes unnecessarily.
God Class (GC)                        | Classes that tend to centralize the intelligence of the system.
Lazy Class (LC)                       | Classes that do not do enough.
Spaghetti Code (SC)                   | Code that has a complex and tangled structure.
Speculative Generality (SG)           | Unused classes, methods, fields, or parameters created for future features that never get implemented.
1https://smelldetection.github.io/
Previous studies proposed strategies to detect code smells [8, 15, 16, 33, 37, 39], including the use of ML techniques [17, 24, 29, 35]. However, these strategies are mostly not representative of how developers perform smell detection, since developers have particular ways to derive a detection strategy. In this sense, some studies suggest that developers customize their detection strategies according to their own previous experience in recognizing which smells are harmful and should be refactored [25, 26]. As aforementioned, smell detection takes into account these particular ways to derive a detection strategy. As a consequence, particular customizations largely differ from others, and it is rare or even impossible to derive a single detection strategy that is acceptable in every software project.
In fact, empowering smell detection strategies with customization can help developers and companies to consider code smells that are indeed harmful according to their quality standards. Therefore, customized strategies can identify and report to developers only the smells in which they are interested. Constant warnings from non-customized detection strategies can cause a waste of time on the inspection of irrelevant smells. Strategies without customization also hinder developers' concentration on harmful smells, or camouflage smells that are considered more harmful according to the developers' perception.
2.2 ML Techniques
To investigate the customization of smell detection, we analyzed seven ML techniques frequently used in the literature [24, 25]. These techniques involve different data analysis approaches, such as decision trees, regression analysis, and rule-based analysis, which are responsible for creating the classifier models. The seven ML techniques are listed below:
Naive Bayes: A probabilistic classifier based on the application of Bayes' theorem [36]. This technique is highly scalable and completely disregards the correlation between the variables in the training set. This classifier describes the probability of an event based on prior knowledge of conditions that might be related to the event.
Support Vector Machine (SVM): An implementation of support vector classification [47] that analyzes the data used for classification and regression analysis. SVM assigns new examples to one of the two categories introduced in the training set, making it a non-probabilistic binary linear classifier. To make this classification, SVM creates classification models that are a representation of examples as points in space. These points are mapped in such a way that the examples in each category are divided by a clear space that is as broad as possible. Each new instance is mapped into the same space and predicted as belonging to a category based on which side of the space it is placed.
Sequential Minimal Optimization (SMO): An implementation of John Platt's sequential minimal optimization algorithm to train a support vector classifier [43]. In other words, SMO is a technique for optimizing SVM training, expediting the training process and making it less complex. For that, SMO breaks the problem to be solved into a series of smallest possible sub-problems, which are solved analytically.
OneRule (OneR): A classification technique that generates a rule for each predictor in the data and then selects the rule with the lowest total error as its "single rule" [23]. To create these rules, this technique analyzes the training set, associating each value of a predictor with the most frequent category; in other words, if a specific value is usually classified as category A, then a rule is created linking them. After the rules are created, the technique chooses the one with the lowest total error.
Random Forest (RF): A classifier that builds numerous classification trees, representing a forest of random decision trees [22]. The RF technique adds extra randomness to the model during tree creation. Instead of looking for the best feature when partitioning nodes, it looks for the best feature in a random subset of features. This process creates great diversity, which generally leads to the generation of better models; besides that, this diversity also reduces overfitting.
JRip: An implementation of a propositional rule learner [10]. It is based on association rules with reduced error pruning, a very common and accurate technique found in decision tree algorithms. Differently from the other algorithms, JRip splits its training stage into two steps: a growing phase and a pruning phase. The first phase grows a rule by greedily adding antecedents (or conditions) to the rule until the rule is perfect (i.e., 100% accuracy). The second phase incrementally prunes each rule and allows the pruning of any final sequence of antecedents.
J48: A Java implementation of the C4.5 decision tree technique [44]. J48 builds decision trees from a training dataset. At each node of the tree, this technique chooses the data attribute that most effectively partitions its set of samples into subsets tending to one category or another. The partitioning criterion is the information gain: the attribute with the highest information gain is chosen to make the decision. This process is repeated on the smaller partitions.
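For reproducibility, the seven techniques above map onto standard classifier classes in recent Weka versions. The snippet below is a minimal sketch of how such classifiers could be instantiated through Weka's Java API; it is an illustration only, and any parameter tuning (e.g., the settings borrowed from [17]) is omitted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.rules.JRip;
import weka.classifiers.rules.OneR;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;

public class SmellClassifiers {

    // Builds one instance of each ML technique discussed in Section 2.2.
    // Note: in Weka, SMO is the bundled SVM trainer; a separate SVM
    // implementation (e.g., the LibSVM package or an R library) would be
    // needed to distinguish SVM and SMO as two different techniques.
    public static Map<String, Classifier> build() {
        Map<String, Classifier> classifiers = new LinkedHashMap<>();
        classifiers.put("Naive Bayes", new NaiveBayes());
        classifiers.put("SMO (SVM)", new SMO());
        classifiers.put("OneR", new OneR());
        classifiers.put("Random Forest", new RandomForest());
        classifiers.put("JRip", new JRip());
        classifiers.put("J48", new J48());
        return classifiers;
    }
}
```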
3 RELATED WORK
Several machine learning techniques have been applied to derive automated smell detection strategies (e.g., [17, 24, 25]). In addition to the detection of code smells, such techniques have also been applied in related activities such as prioritizing code smells [42] and detecting architectural smells [11].
To design and evaluate ML-based smell detection strategies, researchers often rely on large datasets of manually validated code smells. These datasets are large to facilitate the learning process, thereby leading to more accurate results. The study described in [35], for example, assessed the accuracy of the Support Vector Machine (SVM) in the detection of four types of code smell: Blob, Functional Decomposition, Spaghetti Code, and Swiss Army Knife. The SVM obtained an accuracy of up to 0.74.
In [2], the authors proposed the use of the Decision Tree technique to detect code smells. The authors used a single dataset containing a huge number of examples validated by a few developers. The results indicate that the Decision Tree is able to reach an accuracy of up to 0.78. Fontana et al. [17] presented a large study that compares and experiments with different configurations of machine learning techniques to detect four code smell types (Data Class, Large Class, Feature Envy, Long Method). To perform the training of these techniques, the authors used a dataset containing several examples of code smells manually validated by a few developers. J48 and Random Forest obtained the highest accuracy, reaching values of up to 0.95. However, a recent study [14] indicates that the dataset used by Fontana et al. [17] had a high influence on the accuracy obtained by the techniques.
Hozano et al. [24] analyzed the accuracy and efficiency of ML techniques in the detection of four distinct code smell types (God Class, Data Class, Feature Envy, Long Method). The results indicate that Random Forest is able to reach high accuracy and efficiency when detecting these smell types. Despite presenting important advances regarding the detection of code smells, this study and other ones reported in the literature are far from capturing how developers work on smell detection. The reason is that existing studies rely on the premise that either (i) it is possible to derive a universal strategy based on a large training dataset or (ii) each company will customize strategies for the context of each software project. Nevertheless, both premises are false.
There is evidence that the detection of code smells is highly sensitive to contextual factors, such as the software developer [12, 25]. This happens because each developer may have a particular way of deriving detection strategies for code smells based on previous experiences. Thus, a universal strategy would hardly present satisfactory results in any project. In addition, it is difficult to imagine that developers would spend time validating datasets of code smells for each software project, which makes it difficult to adopt existing techniques.
Another limitation of existing ML-based smell detection strategies regards data balancing [14]. This happens because the proportion of samples without code smells is usually much higher than the proportion of samples affected by validated smells. Some researchers [40, 41] investigated whether data balancing techniques are able to improve the accuracy of ML-based smell detection strategies. However, their results indicate that the existing techniques for data balancing are not capable of significantly improving accuracy. Therefore, there is still a need for other ways to improve existing ML-based smell detection strategies.
Given the aforementioned limitations, in this work we apply and evaluate a new way for the automated customization of early code smell detection strategies. For training the ML algorithms, we take a dataset of code smells that were refactored in practice. We conjecture that such refactorings are strong indicators of relevance, as they indicate that the developers actually spent effort on removing the smells. This prevents developers from receiving late warnings about smells that they may consider not harmful. Therefore, following this customization approach, the application of ML-based smell detection strategies can become viable in any project or company that has a history of refactorings carried out by the development team, even when the number of smell instances is small.
4 STUDY DESIGN
The goal of our study is to investigate the early customized detection of code smells taking into account the typically limited training data. Based on this goal, we derived two research questions, as follows.
RQ1. How accurate are the ML techniques on customizing the detection of smells? This RQ aims at investigating the accuracy of the seven ML techniques in customizing the detection of six smell types. A customized detection focuses only on smell instances in which developers are interested. These smells are considered more harmful based on developers' perceptions. In this way, the detection will allow the proper removal of these smells. Thus, it is important to investigate techniques that are able to detect harmful smells with high accuracy. Since ML techniques have been considered a promising way to detect code smells [4], these techniques are a strong candidate for customizing smell detection properly.
RQ2. How efficient are the ML techniques for early customized detection of smells? Our second RQ aims at analyzing the efficiency of the ML techniques in detecting smells, i.e., how accurately an ML technique detects smells as we gradually increase the number of examples used to perform its training. Although ML techniques have been considered a promising way to detect code smells, these techniques require annotated code smell examples to perform their training. However, the annotation of a large number of examples may require infeasible additional time and effort. Hence, it is important to analyze the accuracy of these techniques with a low number of examples used in the training set.
To answer the posed RQs, we rely on two aspects: (i) the type of smell analyzed; and (ii) the number of instances used to perform the training of the ML techniques, i.e., the training dataset. The next sections present the subject projects and how we collected and analyzed data regarding these two aspects.
4.1 Subject Projects
The instances of code smells considered in the scope of our study (see Table 1) were detected from ten open source Java projects: Apache Ant², Apache Derby³, Apache Tomcat⁴, Elastic Search⁵, ArgoUML⁶, Apache Xerces⁷, Google j2objc⁸, Presto DB⁹, Spring Framework¹⁰, and Achilles¹¹. We selected these projects because they have different sizes and are from distinct domains. A mix of domains is an interesting benchmark for companies that have a few projects in their portfolio but operate in multiple domains. Also, these projects were previously evaluated by existing smell detection techniques, which found that their source code contains a variety of suspicious code smells that enable the execution of our study [8].
4.2 Data Collection
Data to answer RQ1. We extracted 200 (100 smelly and 100 non-smelly) code fragments from the analyzed projects for each smell type. The smell detection process was performed using a detection tool [8]. This tool is based on a set of metrics and thresholds and has a high overall recall of 81% [15], i.e., it detects the vast majority of existing smells. These 200 code fragments per smell allow us to observe the behavior of the techniques on a diversity of code smell instances. After detecting the code smells, we selected
2https://ant.apache.org/
3https://db.apache.org/derby/
4http://tomcat.apache.org/
5https://www.elastic.co/
6https://argouml.tigris.org/
7http://xerces.apache.org/
8https://github.com/google/j2objc
9https://prestodb.io/
10https://spring.io/
11http://www.ganttproject.biz
only those that were directly refactored by developers. That is, the examples of smells used in the training dataset and in the evaluation of the ML techniques were evidently relevant to developers. These smells were considered important since they were identified and removed by the developer, so they were harmful and possibly hindered some development tasks. In this way, the tool's thresholds were only a starting point to identify the smells. Then, the projects' developers indicated the harmful smells. This makes the learning process be based on more relevant smell instances, i.e., not simply based on initial thresholds. Thus, we can verify whether the techniques are able to customize their detection for more relevant smell types.
To identify the refactorings that were related to the detected smells, we used the RefMiner tool [46, 48]. RefMiner is widely used in the literature [7, 8, 46, 48]. From this information, we could filter the analyzed smells, keeping those that directly underwent a refactoring. This filter also avoids bias regarding a single set of detection metrics/thresholds and ensures a relevant dataset.
Finally, the application of ML techniques requires collecting features for all smell instances. For this task, we used Understand¹², a tool to extract software features. Altogether, 42 features were considered. The complete list of features is publicly available¹. These features were used during the training process of the seven ML techniques. Since we are addressing smells considered harmful by the developer, these smells may not be aligned with the features proposed by the literature for each smell type [5, 32], since the literature evaluates all instances of smells [4]. Therefore, a high number of features allows the machine learning techniques to evaluate which ones best characterize a smell as harmful. Also, these features cover different information about classes, methods, fields, and parameters, indicating, e.g., the number of lines of the code fragments, relations of complexity within and between elements, and several other counters.
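As an illustration (not the authors' actual pipeline), metrics exported from a feature-extraction tool such as Understand could be assembled into one CSV per smell type, with the 42 metric columns followed by a smelly/non-smelly label, and loaded into Weka. The file name and column layout below are hypothetical assumptions.

```java
import java.io.File;

import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class DatasetLoader {

    // Loads a CSV with one row per code fragment: 42 metric columns
    // followed by a nominal label column (e.g., values yes/no).
    // The file name and column layout are illustrative assumptions.
    public static Instances load(String csvPath) throws Exception {
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File(csvPath));
        Instances data = loader.getDataSet();
        // The class attribute is the last column (the smelly/non-smelly label).
        data.setClassIndex(data.numAttributes() - 1);
        return data;
    }

    public static void main(String[] args) throws Exception {
        Instances godClassData = load("god_class_fragments.csv"); // hypothetical file
        System.out.println("Fragments: " + godClassData.numInstances()
                + ", features: " + (godClassData.numAttributes() - 1));
    }
}
```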
Data to answer RQ2. In order to evaluate the accuracy of the early customized detection of ML techniques, we adopted different sizes of training datasets, applied incrementally. The dataset of each subject project was split into six subsets of different sizes: 20, 40, 80, 120, 160, and 200 code smell instances. This division was made such that, during the evaluation, each increment (i.e., new set of instances) also includes the same instances of the preceding set. For example, the second set, with 40 instances, is composed of all instances of the first set plus 20 new ones.
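The nested subsets described above could be derived as sketched below. This is only an illustration under the assumption that the 200 classified instances of one smell are already held in a Weka Instances object; each subset is a prefix of the full dataset, so every increment reuses all instances of the preceding one.

```java
import java.util.ArrayList;
import java.util.List;

import weka.core.Instances;

public class IncrementalSubsets {

    // Subset sizes used in the study (Section 4.2).
    private static final int[] SIZES = {20, 40, 80, 120, 160, 200};

    // Returns nested subsets: each subset contains all instances of the
    // previous one plus the next block of instances.
    public static List<Instances> split(Instances all) {
        List<Instances> subsets = new ArrayList<>();
        for (int size : SIZES) {
            // Instances(source, first, toCopy) copies "size" instances
            // starting at index 0, so every subset is a prefix of "all".
            subsets.add(new Instances(all, 0, Math.min(size, all.numInstances())));
        }
        return subsets;
    }
}
```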
To assess the accuracy of the ML techniques, we computed the f-measure, which considers both recall and precision [27]. To compute these measures, the true positive (TP) elements represent the code fragments classified by the ML techniques as code smells that are actually real code smells. The false positive (FP) elements refer to the code fragments wrongly classified as code smells. Similarly, the true negatives (TN) represent the code fragments correctly classified as non-smelly. Finally, the false negatives (FN) represent real code smells classified as non-smelly. Based on that, we can compute recall, precision, and f-measure, as described in the equations below. The f-measure is widely used in previous studies [17, 24, 38] that assess ML techniques on detecting code smells.
12https://scitools.com/features/
• Recall (R): the number of code fragments correctly classified as code smells among the total of code smell instances in the data collection.

  R = TP / (TP + FN)    (1)

• Precision (P): the number of code fragments correctly classified as code smells among the total of code fragments classified as code smells by the ML technique.

  P = TP / (TP + FP)    (2)

• F-measure: the harmonic mean of precision and recall.

  F1 = 2 · (P · R) / (P + R)    (3)
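The three measures reduce to simple arithmetic over the confusion matrix. The sketch below computes them from TP, FP, and FN counts; the counts used in the example are hypothetical. In practice, Weka's Evaluation class reports the same values per class.

```java
public class AccuracyMeasures {

    // Recall: TP / (TP + FN), Equation (1).
    static double recall(int tp, int fn) {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    // Precision: TP / (TP + FP), Equation (2).
    static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    // F-measure: harmonic mean of precision and recall, Equation (3).
    static double fMeasure(int tp, int fp, int fn) {
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        return p + r == 0 ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Hypothetical counts for one smell type: 80 TP, 15 FP, 20 FN.
        System.out.printf("P=%.3f R=%.3f F1=%.3f%n",
                precision(80, 15), recall(80, 20), fMeasure(80, 15, 20));
    }
}
```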
4.3 Data Analysis
Using the datasets containing the classified (non-)smelly instances and the software features for each analyzed code fragment, we performed two different analyses. Each analysis aims at answering a research question.
Analysis to answer RQ1. Here we used the datasets to analyze the accuracy (in terms of f-measure) of the ML techniques on detecting a specific smell type. For each smell type, we calculated the overall accuracy of each technique by applying a 5-fold cross-validation procedure on the 200 classified instances.
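A minimal sketch of this per-smell evaluation step with Weka is shown below: a 5-fold cross-validation over the 200 classified instances, reporting the f-measure of the class of interest. It assumes the dataset loading and classifier construction sketched earlier; the index of the smelly class value and the random seed are assumptions.

```java
import java.util.Map;
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;

public class CrossValidation {

    // Runs 5-fold cross-validation for each technique on one smell dataset
    // and prints the f-measure of the class of interest.
    public static void evaluate(Instances data, Map<String, Classifier> classifiers)
            throws Exception {
        int smellyClassIndex = 0; // assumption: "smelly" is the first class value
        for (Map.Entry<String, Classifier> entry : classifiers.entrySet()) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(entry.getValue(), data, 5, new Random(1));
            System.out.printf("%s: F1=%.3f%n",
                    entry.getKey(), eval.fMeasure(smellyClassIndex));
        }
    }
}
```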
Analysis to answer RQ2. For this analysis, we evaluated the efficiency of the ML techniques, i.e., the accuracy of each ML technique as we increment the number of code smell examples used to perform the training of these techniques. In other words, we repeated the accuracy experiment six times, one for each subset of the respective smell. The repetition aimed to guarantee that both the training and test sets were composed of equal numbers of fragments classified as smelly or not.
4.4 Implementation Aspects
The ML techniques presented in Section 2.2 were implemented on top of Weka¹³ and the R Project¹⁴. Weka is an open source ML software, based on the Java programming language, containing a plethora of tools and algorithms [21]. R is a free software environment for statistical computing that is widely used for data mining, data analysis, and to implement ML techniques [31].
5 RESULTS AND DISCUSSION
This section presents and discusses the results of our multi-project study. The results are organized in terms of the two research questions presented in the previous section.
5.1 RQ1. How accurate are the ML techniques on customizing the detection of smells?
To evaluate the accuracy of ML techniques in detecting different types of code smells, Figure 1 presents the accuracy when considering the largest dataset of our study, namely 200 instances of code smells. The use of this dataset allows us to perform an analysis
13https://www.cs.waikato.ac.nz/ml/weka/
14https://www.r-project.org
of the overall accuracy of the ML techniques in the scenario with the largest number of instances for training. In the figure, the x-axis is divided per smell and presents sequentially the ML technique used to detect the respective code smell. The y-axis describes the accuracy values (in terms of f-measure) obtained by the ML technique on detecting the respective smell type. To improve readability, we attach the f-measure values in the table below the bars associated with each smell and highlight the highest results.
Figure 1: Overall accuracy reached by the ML Techniques on
customized smell detection.
5.1.1 Overall Accuracy. By analyzing Figure 1, we can observe that there is no technique with the best accuracy for all code smell types. This is interesting since previous studies observed that Random Forest reached the highest overall accuracy on detecting smells [17, 24, 38]. However, in our study, we noticed that, when detecting harmful smells, Random Forest did not have the best overall accuracy. Although Random Forest obtained only one best result (for CC), it also obtained results close to the best ones for four other types of smell, namely CDSBP, GC, LC, and SC. In summary, all the ML techniques have a similar accuracy when observing each smell type individually. The highest divergence (0.164) is observed in the detection of SG, between JRip and Sequential Minimal Optimization (SMO). Finally, all techniques achieved accuracy between 0.6 and 0.8 for the detection of CC and SC.
Finding 1. When detecting harmful smells, the types of smell have more influence on accuracy than the ML techniques themselves.
5.1.2 Accuracy per code smell type. In the following paragraphs, we discuss the results of each code smell type individually.
Complex Class. Regarding CC, Random Forest reached the best accuracy, equal to 0.715, while J48 and Sequential Minimal Optimization obtained the lowest values, with a slight difference between them. Note that none of the ML techniques reached accuracy above 0.8. We believe this happens because the developer's perception of what is a complex class can vary a lot. This subjectivity in the detection of CC can be so divergent that the ML techniques could not find a proper model that best represents these instances. In other words, since the identification of CC is hard even for our customized ML-based detection, its identification based on metrics and thresholds certainly is even more complex.
Spaghetti Code. Although slightly better, the accuracy obtained by the ML techniques for the customized detection of SC instances is similar to that for CC. None of the ML techniques was able to reach accuracy better than 0.8. For the SC smell, J48 reached the lowest accuracy (0.675). On the other hand, the highest accuracy (0.765) was obtained by JRip.
Class Data Should Be Private. For CDSBP, the ML techniques J48, JRip, and Random Forest reached results above 0.8, and the remaining techniques obtained accuracy below this value. Naive Bayes obtained only 0.662. Support Vector Machine, One Rule, and Sequential Minimal Optimization obtained values between 0.7 and 0.8.
Speculative Generality. The ML techniques were able to reach values higher than 0.9 for SG. JRip, once again, reached the highest accuracy, equal to 0.913, followed closely by One Rule, which also exceeded 0.9. Sequential Minimal Optimization obtained the worst accuracy, equal to 0.749. An important fact to note is that SG should be difficult to detect if we look at a single instance at a time, as the ML techniques do, because this smell occurs when a developer implements an element (i.e., methods, classes, and fields) that is never used. In other words, it should be necessary to look at this element across different versions to decide whether this smell exists. This contradicts the good results obtained by some algorithms.
God Class. Differently from the smells previously discussed, the results for GC reached high values. All ML techniques could reach an accuracy higher than 0.8. JRip reached the highest accuracy (0.877). Similarly to CDSBP, the worst result was obtained by Naive Bayes. Here we can note that the combination of different features (i.e., metrics) also seems to influence the accuracy obtained by the techniques, similar to what we discussed for the CC smell. However, for this code smell, some specific metrics, such as Lines of Code, can have a higher influence [32, 45]. Besides that, GC is usually associated with high values of the metrics.
Lazy Class. The customized detection of LC is by far the one for which the ML techniques obtained the best results. All techniques reached accuracy values better than 0.9. Another difference from the smells previously observed regards the Naive Bayes technique, which reached the best result, in contrast with its previous results. It is also possible to observe that Random Forest obtained accuracy very close to Naive Bayes. This similarity also occurs between Support Vector Machine and Sequential Minimal Optimization, as well as between J48 and JRip.
Finding 2. Particularities of each code smell type, for example the metrics (features) that best represent them, affect the accuracy of the ML techniques.
5.1.3 Smells with low accuracy. Our Finding 2 can be used to explain some of the lowest results. Although further studies are needed for a more in-depth analysis, a possible explanation for the cases with a low accuracy may be related to the number of features required to train each ML technique. All techniques received as input the same set of 42 features. However, not all these features contribute equally to identifying the smell. For example, the detection of GC using classic detection strategies relies on two features: lines of code and cohesion [32]; whereas the detection of CC relies on only one metric: cyclomatic complexity [39]. Nevertheless, the techniques received the same features to detect both smells. In this way, the ML techniques might mistakenly consider different features (i.e., the weight assigned to each metric) as relevant. For example, when the techniques were trained, they probably selected more relevant features to detect GC than to detect CC.
Furthermore, developers from distinct projects might not agree on the relevance of CC instances, as they can associate complexity with different features [24, 25]. This disagreement directly affects the accuracy of the customized detection of smells, since we are considering only harmful smells refactored by developers. For example, different developers can classify the smells as relevant considering different metrics, as already identified in existing studies [26]. Therefore, based on this discussion, we have our third finding.
Finding 3. Smells with more subjective definitions, i.e., more complex ones, tend to obtain lower accuracy, since the training set will have a higher variation in the values of the features caused by the divergence of developers' opinions about the relevance of a smell instance.
This variation in training is even more intense because we are working with 42 distinct features. Too many features may cause every training instance in the dataset to appear equidistant from all the other ones. Thus, if the distances between the instances appear equally alike, the techniques cannot find meaningful clusters to classify a code fragment as smelly or non-smelly. This scenario may have happened when training the detection of some smells (e.g., CC) but not other ones (e.g., GC). This issue is known as the Curse of Dimensionality [6].
Finally, the detection of GC is more robust than the detection of CC. Since detecting a GC usually requires the analysis of more metrics, the ML technique can be less affected by the curse of dimensionality when detecting GC than when detecting CC. However, this same explanation cannot be generalized to LC, which uses a simple detection strategy: fewer lines of code than the average lines of code of the system. Consequently, another factor plays an important role in the techniques' accuracy. Some detection strategies use metrics that compute the average value across the system. Detection strategies that use such average measurements tend to yield better accuracy. For example, LC and GC, whose detection uses average measurements, have high accuracy, whereas CC and SC do not use these averages [5, 32].
Even though we provided possible explanations for the differences in accuracy, we highlight that understanding the mechanisms of ML techniques is not trivial. Yet, unseen factors may contribute to the different accuracy among the different smell types.
RQ1 Answer: When customizing the detection of code smells, most of the ML techniques detected God Class, Speculative Generality, Lazy Class, and CDSBP with high accuracy. For Spaghetti Code and Complex Class, the ML techniques obtained a lower accuracy, but most of them still reached accuracy above 0.7.
5.2 RQ2. How efficient are the ML techniques for early customized detection of smells?
Differently from the previous section, which discussed overall accuracy, here our focus is the analysis of the early customized detection of code smells. In this regard, we verify whether the techniques reached high accuracy with a low number of instances for training. Figures 2 to 8 present the results that support such analysis. These figures represent the efficiency reached by the ML techniques on detecting each smell type. The x-axis describes the number of examples used (20/40/80/120/160/200) in the training phase of the techniques, divided per smell, whereas the y-axis represents the accuracy values obtained by each ML technique on detecting smells.
We observed that the ML techniques do not follow a unique behavior when the number of analyzed examples increases. Techniques such as Random Forest and Support Vector Machine had a significant increase in SG detection with the addition of new (non-)smelly instances to the training dataset. In contrast, for J48, a smaller training dataset resulted in the best accuracy. This same behavior can be seen in JRip when detecting CC, and in Naive Bayes when detecting CDSBP and GC. In general, the ML techniques reached results near their best results in this study on detecting the respective smell with a low number of examples. Some cases are exceptions, such as the detection of CC using Naive Bayes; in these cases, a dataset containing a low number of instances did not reach high results. We can also note that none of the algorithms needed more than 20 instances to reach accuracy above 0.8 for the LC and GC smells.
Finding 4. When trained on datasets that well represent what developers consider as relevant in practice, ML techniques are able to detect code smells early.
The results and analysis presented in this section allow us to provide the answer to RQ2.
RQ2 Answer: In most cases of our study, the ML techniques did not need a training dataset with numerous examples to reach their best detection results. In fact, the increase in the number of instances did not appear to have a direct relationship with the increase in accuracy for all ML techniques. Interestingly, our results contradict those found in previous studies [17, 24, 38], which stated that the techniques needed many examples to get good results.
6 LIMITATIONS AND THREATS TO VALIDITY
The threats to validity and the limitations of our study, along with
the ways we mitigate them, are presented next.
6.1 Threats to Validity
This section discusses the threats to validity.
Figure 2: J48 Efficiency
Figure 3: Naive Bayes Efficiency
Figure 4: Support Vector Machine Efficiency
Figure 5: One Rule Efficiency
Figure 6: JRip Efficiency
Figure 7: Random Forest Efficiency
Figure 8: Sequential Minimal Optimization Efficiency

Construct Validity: The datasets that supported our study were built from code fragments collected using rule-based strategies that rely on a set of metrics and thresholds. These thresholds are a threat, since they can bias the techniques' learning, given that the analyzed smelly fragments were filtered by these thresholds. To lessen this bias, we filtered the smells by selecting only those that caught the developers' attention enough to be refactored.
Internal and External Validity. The use of the Weka package and the R platform to implement the techniques analyzed in our study enabled us to experiment with a variety of configurations, which affect the training process of the techniques. In this context, the configurations considered in our experiments may impact the accuracy and efficiency of the techniques. In order to mitigate this threat, we configured all ML techniques according to the best settings defined in [17]. Indeed, [17] performed a variety of experiments in order to find the best adjustment for each technique.
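For illustration only, Weka classifiers expose their hyperparameters through option strings, so a configuration step like the sketch below could reproduce a chosen setting. The option values shown are Weka's J48 defaults, not the specific settings reported in [17].

```java
import weka.classifiers.trees.J48;
import weka.core.Utils;

public class ClassifierConfiguration {

    public static J48 configuredJ48() throws Exception {
        J48 j48 = new J48();
        // -C sets the pruning confidence factor and -M the minimum number of
        // instances per leaf. These are J48's default values; the actual
        // settings used in the study follow [17] and are not reproduced here.
        j48.setOptions(Utils.splitOptions("-C 0.25 -M 2"));
        return j48;
    }
}
```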
As far as external validity is concerned, the code fragments were extracted from ten Java projects. Although the implementation of these projects presents classes and methods with different characteristics (i.e., size and complexity), our results might not hold for other projects.
6.2 Limitations
This section discusses the limitations found during the study, which will be considered in future studies.
Number of Smells: The catalog of smell types presented in [20] categorizes the smells based on their area of action in the code. It also defines a higher number of smell types than those addressed in our empirical study. These additional smells can also harm the quality of the software, making their detection important. However, their detection through machine learning requires the evaluation of code fragments that are suspicious of containing these smells, which leads us to the second limitation.
Evaluated Projects: Ten different projects are currently covered in our dataset. However, all of these projects are open source projects written in the Java programming language. These common characteristics among the chosen projects may reduce the variety of particular manifestations of a smell type. A larger dataset, including both closed source and additional open source projects, could expose a wider variety of smell structures.
Classier Model Customization:
We observed that each ML
technique did not support general, highly-accurate detection of
all smell types. However, the achieved an improvement in their
accuracy when are analyzing a subset of specic smell types. This
improvement could be related to the classier model built by the
techniques. It is important to note that this model can be improved
manually changing the parameters during the technique imple-
mentation, or automatically through trial and error. Previous stud-
ies [
24
,
38
] suggest that this improvement by customization could
also be explored to better detect smells for specics developers.
Project-Sensitive Customization: Better behavior of an ML technique could perhaps also be observed if the training and the detection involve a single software project. Given this narrower scope, we would reduce the number of developers involved in the dataset. Thus, the ML techniques may be able to better adapt themselves during the training process. If we further narrow the scope to the system's modules, we will have code fragments with similar responsibilities and a subset of developers in charge. This change may allow the techniques to customize their detection for the specific concerns addressed by each module, hopefully further improving their accuracy, but at the cost of possibly not having a reasonable number of smelly instances to properly train the model.
7 CONCLUSION AND FUTURE WORK
This study analyzed the accuracy and efficiency of ML techniques for detecting code smells. First, we evaluated the accuracy of the ML techniques for customizing smell detection. Then, we analyzed the efficiency of the ML techniques by evaluating their accuracy according to the number of examples used to perform the training process.
The results indicated that, when detecting harmful smells, the types of smell have more influence on accuracy than the different strategies of the ML techniques. Indeed, particularities of each smell type, such as the features that best represent it, affect the accuracy of the ML techniques. That is, ML techniques tend to have low accuracy when detecting complex smell types considering only instances relevant for the developers. In this context, JRip and Random Forest reached the highest overall accuracy on detecting smells, while Naive Bayes obtained the lowest overall accuracy.
Regarding the techniques' efficiency, we observed a different result from previous studies. In our study, the increase in the number of instances in the training set did not appear to have a direct relationship with the increase in accuracy. Since the ML techniques do not need a high number of examples to reach their best results, the effort to train the techniques is reduced, enabling their use in projects of different sizes. Also, a reduced number of needed examples allows the techniques to detect smells early, enabling the removal of smells at the beginning of the software lifecycle.
As future work, we intend to investigate the accuracy of ML techniques on detecting other smell types. In addition, we also intend to replicate this study in controlled scenarios, reducing the analyzed scope per project and, after that, per system's modules. In this way, we expect to identify the behavior of the techniques in more specific contexts.
ACKNOWLEDGMENT
We thank CNPq (grants 427787/2018-1, 434969/2018-4, 312149/2016-
6, 141276/2020-7, and 408356/2018-9), CAPES/Procad (grant 175956),
CAPES/Proex, FAPPR (grant 51435), and FAPERJ (grant 200773/2019,
010002285/2019).
REFERENCES
[1]
Marwen Abbes, Foutse Khomh, Yann-Gael Gueheneuc, and Giuliano Antoniol.
2011. An empirical study of the impact of two antipatterns, blob and spaghetti
code, on programcomprehension. In 15th European Conference on Software Main-
tenance and Reengineering (CSMR). IEEE, 181–190.
[2]
Lucas Amorim, Evandro Costa, Nuno Antunes, Baldoino Fonseca, and Marcio
Ribeiro. 2015. Experience Report: Evaluating the Eectiveness of Decision Trees
for Detecting Code Smells. In Proceedings of the 2015 IEEE 26th International
Symposium on Software Reliability Engineering (ISSRE ’15). IEEE Computer Society,
Washington, DC, USA, 261–269. https://doi.org/10.1109/ISSRE.2015.7381819
[3]
Roberta Arcoverde, Isela Macia, Alessandro Garcia, and Arndt Von Staa. 2012.
Automatically detecting architecturally-relevant code anomalies. In 2012 Third
International Workshop on Recommendation Systems for Software Engineering
(RSSE). IEEE, 90–91.
[4]
Muhammad Ilyas Azeem, Fabio Palomba, Lin Shi, and Qing Wang. 2019. Machine
learning techniques for code smell detection: A systematic literature review
and meta-analysis. Information and Software Technology 108 (2019), 115 – 138.
https://doi.org/10.1016/j.infsof.2018.12.009
SBES ’20, October 21–23, 2020, Natal, Brazil Oliveira et al.
[5]
Gabriele Bavota, Andrea De Lucia, Massimiliano Di Penta, Rocco Oliveto, and
Fabio Palomba. 2015. An experimental investigation on the innate relationship
between quality and refactoring. Journal of Systems and Software ( JSS) 107 (2015),
1–14.
[6]
Richard Bellman. 1966. Dynamic programming. Science 153, 3731 (1966), 34–37.
[7]
Ana Carla Bibiano, Eduardo Fernandes, Daniel Oliveira, Alessandro Garcia, Mar-
cos Kalinowski, Baldoino Fonseca, Roberto Oliveira, Anderson Oliveira, and
Diego Cedrim. 2019. A Quantitative Study on Characteristics and Eect of
Batch Refactoring on Code Smells. In 13th International Symposium on Empirical
Software Engineering and Measurement (ESEM). 1–11.
[8]
Diego Cedrim, Alessandro Garcia, Melina Mongiovi, Rohit Gheyi, Leonardo
Sousa, Rafael de Mello, Baldoino Fonseca, Márcio Ribeiro, and Alexander Chávez.
2017. Understanding the impact of refactoring on smells. In ACM Joint European
Software Engineering Conference and Symposium on the Foundations of Software
Engineering (ESEC/FSE). 465–475.
[9]
Alexander Chávez, Isabella Ferreira, Eduardo Fernandes, Diego Cedrim, and
Alessandro Garcia. 2017. How does refactoring aect internal quality attributes?
A multi-project study. In Proceedings of the 31st Brazilian Symposium on Software
Engineering (SBES). 74–83.
[10]
William W. Cohen. 1995. Fast Eective Rule Induction. In Twelfth International
Conference on Machine Learning. Morgan Kaufmann, 115–123.
[11]
Warteruzannan Soyer Cunha and Valter Vieira de Camargo. 2019. Uma In-
vestigação da Aplicação de Aprendizado de Máquina para Detecção de Smells
Arquiteturais. In Anais do VII Workshop on Software Visualization, Evolution and
Maintenance (VEM) (Salvador). SBC, Porto Alegre, RS, Brasil, 78–85. https:
//doi.org/10.5753/vem.2019.7587
[12]
R. M. d. Mello, R. F. Oliveira, and A. F. Garcia. 2017. On the Inuence of Hu-
man Factors for Identifying Code Smells: A Multi-Trial Empirical Study. In 2017
ACM/IEEE International Symposium on Empirical Software Engineering and Mea-
surement (ESEM). 68–77. https://doi.org/10.1109/ESEM.2017.13
[13]
Rafael de Mello, Anderson Uchôa, Roberto Oliveira, Willian Oizumi, Jairo Souza,
Kleyson Mendes, Daniel Oliveira, Baldoino Fonseca, and Alessandro Garcia.
2019. Do Research and Practice of Code Smell Identication Walk Together? A
Social Representations Analysis. In 2019 ACM/IEEE International Symposium on
Empirical Software Engineering and Measurement (ESEM). IEEE, 1–6.
[14]
D. Di Nucci, F. Palomba, D. A. Tamburri, A. Serebrenik, and A. De Lucia. 2018.
Detecting code smells using machine learning techniques: Are we there yet?.
In 2018 IEEE 25th International Conference on Software Analysis, Evolution and
Reengineering (SANER). 612–621.
[15]
Eduardo Fernandes, Johnatan Oliveira, Gustavo Vale, Thanis Paiva, and Eduardo
Figueiredo. 2016. A review-based comparative study of bad smell detection tools.
In Proceedings of the 20th International Conference on Evaluation and Assessment
in Software Engineering (EASE). 18:1–18:12.
[16]
Francesca Arcelli Fontana, Pietro Braione, and Marco Zanoni. 2012. Automatic
detection of bad smells in code: An experimental assessment. Journal of Object
Technology 11, 2 (2012), 5–1.
[17]
Francesca Arcelli Fontana, Mika V. Mäntylä, Marco Zanoni, and Alessandro
Marino. 2015. Comparing and experimenting machine learning techniques
for code smell detection. Empirical Software Engineering (June 2015). https:
//doi.org/10.1007/s10664-015- 9378-4
[18]
Francesca Arcelli Fontana, Marco Zanoni, Alessandro Marino, and Mika V.
Mäntylä. 2013. Code Smell Detection: Towards a Machine Learning-Based Ap-
proach. 2013 IEEE International Conference on Software Maintenance (sep 2013),
396–399. https://doi.org/10.1109/ICSM.2013.56
[19] Martin Fowler. 1999. Refactoring (1 ed.). Addison-Wesley Professional.
[20]
Martin Fowler. 1999. Refactoring: Improving the Design of Existing Code. Addison-
Wesley, Boston, MA, USA.
[21]
Mark Hall, Eibe Frank, Georey Holmes, Bernhard Pfahringer, and Ian H Reute-
mann, Peter andWitten. 2009. The WEKA data mining software: an update. ACM
SIGKDD explorations newsletter 11, 1 (2009), 10–18.
[22]
Tin Kam Ho. 1995. Random decision forests. In Document analysis and recognition,
1995., proceedings of the third international conference on, Vol. 1. IEEE, 278–282.
[23]
R.C. Holte. 1993. Very simple classication rules perform well on most commonly
used datasets. Machine Learning 11 (1993), 63–91.
[24]
Mario Hozano, Nuno Antunes, Baldoino Fonseca, and Evandro Costa. 2017. Eval-
uating the Accuracy of Machine Learning Algorithms on Detecting Code Smells
for Dierent Developers. In Proceedings of the 19th International Conference on
Enterprise Information Systems. 474–482.
[25]
Mario Hozano, Alessandro Garcia, Nuno Antunes, Baldoino Fonseca, and Evandro
Costa. 2017. Smells Are Sensitive to Developers!: On the Eciency of (Un)Guided
Customized Detection. In Proceedings of the 25th International Conference on Pro-
gram Comprehension (Buenos Aires, Argentina) (ICPC ’17). IEEE Press, Piscataway,
NJ, USA, 110–120. https://doi.org/10.1109/ICPC.2017.32
[26]
Mário Hozano, Alessandro Garcia, Baldoino Fonseca, and Evandro Costa. 2018.
Are You Smelling It? Investigating How Similar Developers Detect Code Smells.
Information and Software Technology (IST) 93, C (Jan. 2018), 130–146. https:
//doi.org/10.1016/j.infsof.2017.09.002
[27]
Allen Kent, Madeline M Berry,Fred U Luehrs Jr, and James W Perry. 1955. Machine
literature searching VIII. Operational criteria for designing information retrieval
systems. American documentation 6, 2 (1955), 93–101.
[28]
Foutse Khomh, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, and Giuliano
Antoniol. 2011. An exploratory study of the impact of antipatterns on class
change- and fault-proneness. Empirical Software Engineering 17, 3 (Aug. 2011),
243–275. https://doi.org/10.1007/s10664-011- 9171-y
[29] F. Khomh, S. Vaucher, Y. G. Guéhéneuc, and H. Sahraoui. 2009. A Bayesian approach for the detection of code and design smells. In Proceedings of the 9th International Conference on Quality Software (QSIC '09). IEEE, 305–314.
[30] Miryung Kim, Thomas Zimmermann, and Nachiappan Nagappan. 2014. An empirical study of refactoring challenges and benefits at Microsoft. IEEE Transactions on Software Engineering 40, 7 (2014), 633–649.
[31] Brett Lantz. 2019. Machine Learning with R: Expert Techniques for Predictive Modeling. Packt Publishing Ltd.
[32] Michele Lanza and Radu Marinescu. 2007. Object-Oriented Metrics in Practice: Using Software Metrics to Characterize, Evaluate, and Improve the Design of Object-Oriented Systems. Springer Science & Business Media.
[33] Michele Lanza, Radu Marinescu, and Stéphane Ducasse. 2005. Object-Oriented Metrics in Practice. Springer-Verlag New York, Inc., Secaucus, NJ, USA.
[34] Isela Macia, Alessandro Garcia, Christina Chavez, and Arndt von Staa. 2013. Enhancing the detection of code anomalies with architecture-sensitive strategies. In Proceedings of the 17th European Conference on Software Maintenance and Reengineering (CSMR). IEEE, 177–186.
[35] Abdou Maiga, Nasir Ali, Neelesh Bhattacharya, Aminata Sabané, Yann-Gaël Guéhéneuc, and Esma Aïmeur. 2012. SMURF: A SVM-based Incremental Anti-pattern Detection Approach. 2012 19th Working Conference on Reverse Engineering (Oct. 2012), 466–475. https://doi.org/10.1109/WCRE.2012.56
[36] Tom M. Mitchell. 1997. Machine Learning. McGraw-Hill, Boston (Mass.), Burr Ridge (Ill.), Dubuque (Iowa). http://opac.inria.fr/record=b1093076
[37] M. J. Munro. 2005. Product Metrics for Automatic Identification of "Bad Smell" Design Problems in Java Source-Code. 11th IEEE International Software Metrics Symposium (METRICS) (2005), 15–15. https://doi.org/10.1109/METRICS.2005.38
[38] Daniel Oliveira. 2020. Towards Customizing Smell Detection and Refactorings. Master's dissertation. Pontifical Catholic University of Rio de Janeiro.
[39] Fabio Palomba, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, and Andrea De Lucia. 2014. Do They Really Smell Bad? A Study on Developers' Perception of Bad Code Smells. IEEE International Conference on Software Maintenance and Evolution (2014), 101–110. https://doi.org/10.1109/ICSME.2014.32
[40] Fabiano Pecorelli, Dario Di Nucci, Coen De Roover, and Andrea De Lucia. 2019. On the Role of Data Balancing for Machine Learning-Based Code Smell Detection. In Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation (Tallinn, Estonia) (MaLTeSQuE 2019). Association for Computing Machinery, New York, NY, USA, 19–24. https://doi.org/10.1145/3340482.3342744
[41] Fabiano Pecorelli, Dario Di Nucci, Coen De Roover, and Andrea De Lucia. 2020. A large empirical assessment of the role of data balancing in machine-learning-based code smell detection. Journal of Systems and Software 169 (2020), 110693. https://doi.org/10.1016/j.jss.2020.110693
[42] Fabiano Pecorelli, Fabio Palomba, Foutse Khomh, and Andrea De Lucia. 2020. Developer-Driven Code Smell Prioritization. In International Conference on Mining Software Repositories.
[43] J. Platt. 1998. Fast Training of Support Vector Machines using Sequential Minimal Optimization. In Advances in Kernel Methods - Support Vector Learning, B. Schoelkopf, C. Burges, and A. Smola (Eds.). MIT Press. http://research.microsoft.com/~jplatt/smo.html
[44] Ross Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
[45] José Amancio M. Santos, Manoel G. Mendonça, Cleber Pereira dos Santos, and Renato Lima Novais. 2014. The problem of conceptualization in god class detection: agreement, strategies and decision drivers. Journal of Software Engineering Research and Development 2 (2014), 1–33.
[46] Danilo Silva, Nikolaos Tsantalis, and Marco Tulio Valente. 2016. Why we refactor?. In FSE '16. 858–870.
[47] Ingo Steinwart and Andreas Christmann. 2008. Support Vector Machines. Springer Science & Business Media.
[48] Nikolaos Tsantalis, Victor Guana, Eleni Stroulia, and Abram Hindle. 2013. A multidimensional empirical study on refactoring activity. In 23rd Annual International Conference on Computer Science and Software Engineering. 132–146.
[49] Aiko Yamashita and Leon Moonen. 2012. Do code smells reflect important maintainability aspects?. In 2012 28th IEEE International Conference on Software Maintenance (ICSM). IEEE, 306–315.
[50] Aiko Yamashita and Leon Moonen. 2013. Exploring the Impact of Inter-smell Relations on Software Maintainability: An Empirical Study. In Proceedings of the 2013 International Conference on Software Engineering (San Francisco, CA, USA) (ICSE '13). IEEE Press, Piscataway, NJ, USA, 682–691. http://dl.acm.org/citation.cfm?id=2486788.2486878
Code smells are symptoms of poor design and implementation choices weighing heavily on the quality of produced source code. During the last decades several code smell detection tools have been proposed. However, the literature shows that the results of these tools can be subjective and are intrinsically tied to the nature and approach of the detection. In a recent work the use of Machine-Learning (ML) techniques for code smell detection has been proposed, possibly solving the issue of tool subjectivity giving to a learner the ability to discern between smelly and non-smelly source code elements. While this work opened a new perspective for code smell detection, it only considered the case where instances affected by a single type smell are contained in each dataset used to train and test the machine learners. In this work we replicate the study with a different dataset configuration containing instances of more than one type of smell. The results reveal that with this configuration the machine learning techniques reveal critical limitations in the state of the art which deserve further research.