Applying Machine Learning to Customized Smell Detection:
A Multi-Project Study
Daniel Oliveira
doliveira@inf.puc-rio.br
Pontical Catholic University
Rio de Janeiro, RJ
Wesley K. G. Assunção
wesleyk@utfpr.edu.br
Federal University of Technology
Toledo, PR
Leonardo Souza
leo.sousa@sv.cmu.edu
Carnegie Mellon University
Silicon Valley, CA
Willian Oizumi
woizumi@inf.puc-rio.br
Pontical Catholic University
Rio de Janeiro, RJ
Alessandro Garcia
afgarcia@inf.puc-rio.br
Pontical Catholic University
Rio de Janeiro, RJ
Baldoino Fonseca
baldoino@ic.ufal.br
Federal University of Alagoas
Maceió, AL
ABSTRACT
Code smells are considered symptoms of poor implementation choices, which may hamper software maintainability. Hence, code smells should be detected as early as possible to avoid software quality degradation. Unfortunately, detecting code smells is not a trivial task. Some preliminary studies investigated and concluded that machine learning (ML) techniques are a promising way to better support smell detection. However, these techniques are hard to customize for an early and accurate detection of specific smell types. Moreover, ML techniques usually require numerous code examples for training (composing a relevant dataset) in order to achieve satisfactory accuracy. Unfortunately, such a dependency on a large validated dataset is impractical and leads to late detection of code smells. Thus, a prevailing challenge is the early customized detection of code smells taking into account the typically limited training data. In this direction, this paper reports a study in which we collected, from ten active projects, code smells that were actually refactored by developers, differently from studies that rely on code smells inferred by researchers. These smells were used for evaluating the accuracy of seven ML techniques regarding the early detection of code smells. Since we take into account smells that were considered important by developers, the ML techniques are able to customize the detection to focus on smells observed as relevant in the investigated systems. The results showed that all the analyzed techniques are sensitive to the type of smell and obtained good results for the majority of them, especially JRip and Random Forest. We also observed that the ML techniques did not need a high number of examples to reach their best accuracy results. This finding implies that ML techniques can be successfully used for early detection of smells without depending on the curation of a large dataset.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
SBES '20, October 21–23, 2020, Natal, Brazil
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-8753-8/20/09...$15.00
https://doi.org/10.1145/3422392.3422427
CCS CONCEPTS
• Software and its engineering → Software design engineering.
KEYWORDS
code smell, code smell detection, software quality
ACM Reference Format:
Daniel Oliveira, Wesley K. G. Assunção, Leonardo Souza, Willian Oizumi, Alessandro Garcia, and Baldoino Fonseca. 2020. Applying Machine Learning to Customized Smell Detection: A Multi-Project Study. In 34th Brazilian Symposium on Software Engineering (SBES '20), October 21–23, 2020, Natal, Brazil. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3422392.3422427
1 INTRODUCTION
Code smells are considered symptoms of poor implementation choices, which make software systems hard to maintain [20]. Due to their harmfulness to software quality [1, 28, 50], code smells should be detected and removed as early as possible along the software lifecycle. Although removing code smells is of paramount importance to keep the internal quality of a software system, their detection is not an easy task. Their detection is mostly subject to developer judgment. In fact, identifying code smells is a subjective activity [24]. Developers have the knowledge to confirm the harmfulness of a smelly structure [13]. This knowledge varies according to the developer's experience, skills, and mastery of the source code being analyzed.
Existing literature suggests different strategies to detect code smells. The vast majority of these detection strategies are based on metrics and their respective thresholds [3, 32, 34]. These strategies tend to analyze each code fragment and employ some previously defined thresholds to classify the fragment as host (or not) of a specific smell. However, operationalizing such strategies is difficult. This difficulty stems from the fact that the operationalization (i.e., the threshold definition) of the strategy for detecting each smell type requires proper reasoning. Such an operationalization cannot be solely based on finding metrics and thresholds in compliance with the conceptual definition of a smell type. The operationalization also needs to be customized by considering various contextual information of the projects, to which only the developer has access.
Preliminary studies have analyzed the use of machine learning (ML) techniques as a promising way to detect code smells [2, 17, 18, 24, 25, 38]. These previous studies evaluate the use of training datasets containing numerous code examples annotated as smelly or non-smelly code by developers. From these training datasets, the ML techniques generate detection models that can successfully detect smells similar to the ones used for the training. However, these techniques are hard to customize, for example, to detect only certain types of code smells that are relevant for developers of specific projects. Moreover, ML techniques usually require numerous code examples to be trained. Obtaining that many instances with the desired properties to compose a relevant dataset can be time-consuming, leading to late detection of code smells. Thus, a prevailing challenge is the early customized detection of code smells taking into account the typically limited training data.
To overcome the described limitations, this paper reports a multi-project study in which we collected code smells that were actually refactored by developers, differently from studies that infer code smells based on metrics and thresholds. This study relies on ten active projects, with different sizes and belonging to distinct domains, and six types of code smells. We chose smell types that cover different system scopes, such as classes, methods, fields, and parameters. The composed dataset was used in the training and evaluation of seven ML techniques in terms of the accuracy regarding early detection of code smells. The accuracy was computed based on the traditional F-measure [27]. Differently from existing strategies, since we take into account smells that were considered important by developers, the ML techniques are able to customize the detection to focus on smells observed as relevant in the investigated systems.
The results pointed out that, when considering only relevant smells, the ML techniques have similar behavior for the same smell type. Besides that, they also provide good support for detecting code smells in projects with different sizes, since they do not need a high number of examples to reach high results. Based on the results obtained and the analysis performed, our study led to the following findings:
• Smell types vs. ML techniques. When detecting harmful smells, the smell types have more influence on accuracy than the ML techniques themselves.
• Effect of the metrics on the customized smell detection. Particularities of each code smell type affect the accuracy of the ML techniques, for example, the metrics (features) that best represent the smell.
• The subjectivity of detecting customized smells affects the accuracy of the ML techniques. Smells with more subjective definitions, i.e., more complex ones, tend to obtain lower accuracy, since the training set will have a higher variation in the values of the features.
• ML techniques using a small set of curated examples (based on previous refactorings by developers) can successfully support early customized detection of code smells. When trained on datasets that well represent what developers consider as relevant in practice, ML techniques are able to detect code smells early.
The contributions of this work are as follows. First, differently from previous studies, we focused on assessing the use of ML techniques only for detecting smells that ended up being refactored by a developer. The refactoring of the smelly code indicates that the developer, either consciously or not, confirmed the relevance of a smell. Those smells can be considered relevant to the program as their removal helped the developer to achieve a maintenance goal. Second, we assessed the accuracy of seven well-known ML techniques on detecting code smells using our customized dataset, considering different training samples, in which we gradually increase the number of examples used to perform ML technique training. Third, we make our dataset publicly available¹ for future studies and replication.
The remainder of this document is structured as follows. Section 2 describes the background related to code smells and ML techniques. Related work is described in Section 3. Section 4 details the design of our study. Section 5 presents all the results and analysis as well as the answers to our research questions. Section 6 describes the threats and limitations of the study. Finally, Section 7 presents the concluding remarks and discusses future work.
2 BACKGROUND
Our study encompasses two research topics, namely, code smells
and ML techniques. The next sections describe the six code smells
we considered and the seven ML techniques, respectively.
2.1 Code Smells
Poor code structures are often represented by the so-called code smells [19]. A code smell is considered a symptom of a poor implementation choice. As a consequence, software maintenance requires additional cost and effort on understanding and re-structuring the smelly code [8, 9, 30]. Due to their harmfulness to software maintainability [1, 28, 50], code smells should be detected and removed as early as possible along the software lifecycle.
There are several types of code smells described in the literature [19]. Each one characterizes a recurring poor code structure and affects a specific scope of program elements. For our study, we selected the six smell types listed in Table 1. These smell types are related to bad design decisions or reflect important maintainability aspects [49]. The first column represents the smell type name. The second column presents a brief description of the type. We have chosen these smell types due to the different scopes of a program affected by them, namely classes, methods, fields, and parameters.
Table 1: Types of Code Smells Investigated in this Study

Name                                  | Description
Complex Class (CC)                    | Classes that involve a lot of different but related parts.
Class Data Should be Private (CDSBP)  | Classes that expose their attributes unnecessarily.
God Class (GC)                        | Classes that tend to centralize the intelligence of the system.
Lazy Class (LC)                       | Classes that do not do enough.
Spaghetti Code (SC)                   | Code that has a complex and tangled structure.
Speculative Generality (SG)           | Unused classes, methods, fields, or parameters created for future features that never get implemented.
1https://smelldetection.github.io/
Previous studies proposed strategies to detect code smells [8, 15, 16, 33, 37, 39], including the use of ML techniques [17, 24, 29, 35]. However, these strategies are mostly not representative of how developers perform smell detection, since developers have particular ways to derive a detection strategy. In this sense, some studies suggest that developers customize their detection strategies according to their own previous experience in recognizing which smells are harmful and should be refactored [25, 26]. As aforementioned, smell detection takes into account these particular ways to derive a detection strategy. As a consequence, particular customizations largely differ from others, and it is rare or even impossible to derive a single detection strategy that is acceptable in every software project.
In fact, empowering smell detection strategies with customization can help developers and companies to consider code smells that are indeed harmful according to their quality standards. Therefore, customized strategies can identify and report to developers only the smells in which they are interested. Constant warnings from non-customized detection strategies can cause a waste of time on the inspection of irrelevant smells. Strategies without customization also hinder developers' concentration on harmful smells, or camouflage smells that are considered more harmful according to the developers' perception.
2.2 ML Techniques
To investigate the customization of smell detection, we analyzed seven ML techniques frequently used in the literature [24, 25]. These techniques involve different data analysis approaches, such as decision trees, regression analysis, and rule-based analysis, which are responsible for creating the classifier models. The seven ML techniques are listed below:
Naive Bayes: A probabilistic classifier based on the application of Bayes' theorem [36]. This technique is highly scalable and completely disregards the correlation between the variables in the training set. This classifier describes the probability of an event based on prior knowledge of conditions that might be related to the event.
Support Vector Machine (SVM): An implementation of support vector classification [47] that analyzes the data used for classification and regression analysis. SVM assigns new examples to one of the two categories introduced in the training set, making it a non-probabilistic binary linear classifier. To make this classification, SVM creates classification models that are a representation of examples as points in space. These points are mapped in such a way that the examples in each category are divided by a clear space that is as broad as possible. Each new instance is mapped into the same space and predicted as belonging to a category based on which side of the space it is placed.
Sequential Minimal Optimization (SMO): An implementation of John Platt's sequential minimal optimization algorithm to train a support vector classifier [43]. In other words, SMO is a technique for optimizing SVM training, expediting the training process and making it less complex. For that, SMO breaks the problem to be solved into a series of smallest possible sub-problems, which are solved analytically.
OneRule (OneR): A classification technique that generates a rule for each predictor in the data and then selects the rule with the lowest total error as its "single rule" [23]. To create these rules, this technique analyzes the training set, associating each value of a predictor with the most frequent category; in other words, if a specific value is usually classified as category A, then a rule is created linking them. After the rules are created, the technique chooses the one with the lowest total error.
Random Forest (RF): A classifier that builds numerous classification trees, representing a forest of random decision trees [22]. The RF technique adds extra randomness to the model during tree creation. Instead of looking for the best feature when partitioning nodes, it looks for the best feature in a random subset of features. This process creates great diversity, which generally leads to the generation of better models; besides that, this diversity also reduces overfitting.
JRip: An implementation of a propositional rule learner [10]. It is based on association rules with reduced error pruning, a very common and accurate technique found in decision tree algorithms. Differently from the other algorithms, JRip splits its training stage into two steps: a growing phase and a pruning phase. The first phase grows a rule by greedily adding antecedents (or conditions) to the rule until the rule is perfect (i.e., 100% accuracy). The second phase incrementally prunes each rule and allows the pruning of any final sequence of antecedents.
J48: A Java implementation of the C4.5 decision tree technique [44]. J48 builds decision trees from a training dataset. At each node of the tree, this technique chooses the data attribute that most effectively partitions its set of samples into subsets tending to one category or another. The partitioning criterion is the information gain: the attribute with the highest information gain is chosen to make the decision. This process is repeated on the smaller partitions.
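For reproducibility, the seven techniques above map onto standard classifier classes in recent Weka versions. The snippet below is a minimal sketch of how such classifiers could be instantiated through Weka's Java API; it is an illustration only, and any parameter tuning (e.g., the settings borrowed from [17]) is omitted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.rules.JRip;
import weka.classifiers.rules.OneR;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;

public class SmellClassifiers {

    // Builds one instance of each ML technique discussed in Section 2.2.
    // Note: in Weka, SMO is the bundled SVM trainer; a separate SVM
    // implementation (e.g., the LibSVM package or an R library) would be
    // needed to distinguish SVM and SMO as two different techniques.
    public static Map<String, Classifier> build() {
        Map<String, Classifier> classifiers = new LinkedHashMap<>();
        classifiers.put("Naive Bayes", new NaiveBayes());
        classifiers.put("SMO (SVM)", new SMO());
        classifiers.put("OneR", new OneR());
        classifiers.put("Random Forest", new RandomForest());
        classifiers.put("JRip", new JRip());
        classifiers.put("J48", new J48());
        return classifiers;
    }
}
```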
3 RELATED WORK
Several machine learning techniques have been applied to derive automated smell detection strategies (e.g., [17, 24, 25]). In addition to the detection of code smells, such techniques have also been applied in related activities such as prioritizing code smells [42] and detecting architectural smells [11].
To design and evaluate ML-based smell detection strategies, researchers often rely on large datasets of manually validated code smells. These datasets are large to facilitate the learning process, thereby leading to more accurate results. The study described in [35], for example, assessed the accuracy of the Support Vector Machine (SVM) in the detection of four types of code smell: Blob, Functional Decomposition, Spaghetti Code, and Swiss Army Knife. The SVM obtained an accuracy of up to 0.74.
In [2], the authors proposed the use of the Decision Tree technique to detect code smells. The authors used a single dataset containing a huge number of examples validated by a few developers. The results indicate that the Decision Tree is able to reach an accuracy of up to 0.78. Fontana et al. [17] presented a large study that compares and experiments with different configurations of machine learning techniques to detect four code smell types (Data Class, Large Class, Feature Envy, Long Method). To perform the training of these techniques, the authors used a dataset containing several examples of code smells manually validated by a few developers. J48 and Random Forest obtained the highest accuracy, reaching values of up to 0.95. However, a recent study [14] indicates that the dataset used by Fontana et al. [17] had a high influence on the accuracy obtained by the techniques.
Hozano et al. [24] analyzed the accuracy and efficiency of ML techniques in the detection of four distinct code smell types (God Class, Data Class, Feature Envy, Long Method). The results indicate that Random Forest is able to reach high accuracy and efficiency when detecting these smell types. Despite presenting important advances regarding the detection of code smells, this study and other ones reported in the literature are far from capturing how developers work on smell detection. The reason is that existing studies rely on the premise that either (i) it is possible to derive a universal strategy based on a large training dataset or (ii) each company will customize strategies for the context of each software project. Nevertheless, both premises are false.
There is evidence that the detection of code smells is highly sensitive to contextual factors, such as the software developer [12, 25]. This happens because each developer may have a particular way of deriving detection strategies for code smells based on previous experiences. Thus, a universal strategy would hardly present satisfactory results in any project. In addition, it is difficult to imagine that developers would spend time validating datasets of code smells for each software project, which makes it difficult to adopt existing techniques.
Another limitation of existing ML-based smell detection strategies regards data balancing [14]. This happens because the proportion of samples without code smells is usually much higher than the proportion of samples affected by validated smells. Some researchers [40, 41] investigated whether data balancing techniques are able to improve the accuracy of ML-based smell detection strategies. However, their results indicate that the existing techniques for data balancing are not capable of significantly improving accuracy. Therefore, there is still a need for other ways to improve existing ML-based smell detection strategies.
Given the aforementioned limitations, in this work we apply and evaluate a new way for the automated customization of early code smell detection strategies. For training the ML algorithms, we take a dataset of code smells that were refactored in practice. We conjecture that such refactorings are strong indicators of relevance, as they indicate that the developers actually spent effort on removing the smells. This prevents developers from receiving late warnings about smells that they may consider not harmful. Therefore, following this customization approach, the application of ML-based smell detection strategies can become viable in any project or company that has a history of refactorings carried out by the development team, even when the number of smell instances is small.
4 STUDY DESIGN
The goal of our study is to investigate the early customized detection of code smells taking into account the typically limited training data. Based on this goal, we derived two research questions, as follows.
RQ1. How accurate are the ML techniques on customizing the detection of smells? This RQ aims at investigating the accuracy of the seven ML techniques in customizing the detection of six smell types. A customized detection focuses only on smell instances in which developers are interested. These smells are considered more harmful based on developers' perceptions. In this way, the detection will allow the proper removal of these smells. Thus, it is important to investigate techniques that are able to detect harmful smells with high accuracy. Since ML techniques have been considered a promising way to detect code smells [4], these techniques are a strong candidate for customizing smell detection properly.
RQ2. How efficient are the ML techniques for early customized detection of smells? Our second RQ aims at analyzing the efficiency of the ML techniques in detecting smells, i.e., how accurately an ML technique detects smells as we gradually increase the number of examples used to perform its training. Although ML techniques have been considered a promising way to detect code smells, these techniques require annotated code smell examples to perform their training. However, the annotation of a large number of examples may require infeasible additional time and effort. Hence, it is important to analyze the accuracy of these techniques with a low number of examples used in the training set.
To answer the posed RQs, we rely on two aspects: (i) the type of smell analyzed; and (ii) the number of instances used to perform the training of the ML techniques, i.e., the training dataset. The next sections present the subject projects and how we collected and analyzed data regarding these two aspects.
4.1 Subject Projects
The instances of code smells considered in the scope of our study (see Table 1) were detected from ten open source Java projects: Apache Ant², Apache Derby³, Apache Tomcat⁴, Elastic Search⁵, ArgoUML⁶, Apache Xerces⁷, Google j2objc⁸, Presto DB⁹, Spring Framework¹⁰, and Achilles¹¹. We selected these projects because they have different sizes and are from distinct domains. A mix of domains is an interesting benchmark for companies that have a few projects in their portfolio but operate in multiple domains. Also, these projects were previously evaluated by existing smell detection techniques, which found that their source code contains a variety of suspicious code smells that enable the execution of our study [8].
4.2 Data Collection
Data to answer RQ1. We extracted 200 (100 smelly and 100 non-smelly) code fragments from the analyzed projects for each smell type. The smell detection process was performed using a detection tool [8]. This tool is based on a set of metrics and thresholds and has a high overall recall of 81% [15], i.e., it detects the vast majority of existing smells. These 200 code fragments per smell allow us to observe the behavior of the techniques on a diversity of code smell instances. After detecting the code smells, we selected
2https://ant.apache.org/
3https://db.apache.org/derby/
4http://tomcat.apache.org/
5https://www.elastic.co/
6https://argouml.tigris.org/
7http://xerces.apache.org/
8https://github.com/google/j2objc
9https://prestodb.io/
10https://spring.io/
11http://www.ganttproject.biz
only those that were directly refactored by developers. That is, the examples of smells used in the training dataset and in the evaluation of the ML techniques were evidently relevant to developers. These smells were considered important since they were identified and removed by the developer, so they were harmful and possibly hindered some development tasks. In this way, the tool's thresholds were only a starting point to identify the smells. Then, the projects' developers indicated the harmful smells. This makes the learning process be based on more relevant smell instances, i.e., not simply based on initial thresholds. Thus, we can verify whether the techniques are able to customize their detection for more relevant smell types.
To identify the refactorings that were related to the detected smells, we used the RefMiner tool [46, 48]. RefMiner is widely used in the literature [7, 8, 46, 48]. From this information, we could filter the analyzed smells, keeping those that directly underwent a refactoring. This filter also avoids bias regarding a single set of detection metrics/thresholds and ensures a relevant dataset.
Finally, the application of ML techniques requires collecting features for all smell instances. For this task, we used Understand¹², a tool to extract software features. Altogether, 42 features were considered. The complete list of features is publicly available¹. These features were used during the training process of the seven ML techniques. Since we are addressing smells considered harmful by the developer, these smells may not be aligned with the features proposed by the literature for each smell type [5, 32], since the literature evaluates all instances of smells [4]. Therefore, a high number of features allows the machine learning techniques to evaluate which ones best characterize a smell as harmful. Also, these features cover different information about classes, methods, fields, and parameters, indicating, e.g., the number of lines of the code fragments, relations of complexity within and between elements, and several other counters.
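As an illustration (not the authors' actual pipeline), metrics exported from a feature-extraction tool such as Understand could be assembled into one CSV per smell type, with the 42 metric columns followed by a smelly/non-smelly label, and loaded into Weka. The file name and column layout below are hypothetical assumptions.

```java
import java.io.File;

import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class DatasetLoader {

    // Loads a CSV with one row per code fragment: 42 metric columns
    // followed by a nominal label column (e.g., values yes/no).
    // The file name and column layout are illustrative assumptions.
    public static Instances load(String csvPath) throws Exception {
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File(csvPath));
        Instances data = loader.getDataSet();
        // The class attribute is the last column (the smelly/non-smelly label).
        data.setClassIndex(data.numAttributes() - 1);
        return data;
    }

    public static void main(String[] args) throws Exception {
        Instances godClassData = load("god_class_fragments.csv"); // hypothetical file
        System.out.println("Fragments: " + godClassData.numInstances()
                + ", features: " + (godClassData.numAttributes() - 1));
    }
}
```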
Data to answer RQ2. In order to evaluate the accuracy of the early customized detection of ML techniques, we adopted different sizes of training datasets, applied incrementally. The dataset of each subject project was split into six subsets of different sizes: 20, 40, 80, 120, 160, and 200 code smell instances. This division was made such that, during the evaluation, each increment (i.e., new set of instances) also includes the same instances of the preceding set. For example, the second set, with 40 instances, is composed of all instances of the first set plus 20 new ones.
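The nested subsets described above could be derived as sketched below. This is only an illustration under the assumption that the 200 classified instances of one smell are already held in a Weka Instances object; each subset is a prefix of the full dataset, so every increment reuses all instances of the preceding one.

```java
import java.util.ArrayList;
import java.util.List;

import weka.core.Instances;

public class IncrementalSubsets {

    // Subset sizes used in the study (Section 4.2).
    private static final int[] SIZES = {20, 40, 80, 120, 160, 200};

    // Returns nested subsets: each subset contains all instances of the
    // previous one plus the next block of instances.
    public static List<Instances> split(Instances all) {
        List<Instances> subsets = new ArrayList<>();
        for (int size : SIZES) {
            // Instances(source, first, toCopy) copies "size" instances
            // starting at index 0, so every subset is a prefix of "all".
            subsets.add(new Instances(all, 0, Math.min(size, all.numInstances())));
        }
        return subsets;
    }
}
```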
To assess the accuracy of the ML techniques, we computed the f-measure, which considers both recall and precision [27]. To compute these measures, the true positive (TP) elements represent the code fragments classified by the ML techniques as code smells that are actually real code smells. The false positive (FP) elements refer to the code fragments wrongly classified as code smells. Similarly, the true negatives (TN) represent the code fragments correctly classified as non-smelly. Finally, the false negatives (FN) represent real code smells classified as non-smelly. Based on that, we can compute recall, precision, and f-measure, as described in the equations below. The f-measure is widely used in previous studies [17, 24, 38] that assess ML techniques on detecting code smells.
12https://scitools.com/features/
• Recall (R): the number of code fragments correctly classified as code smells among the total of code smell instances in the data collection.

  R = TP / (TP + FN)    (1)

• Precision (P): the number of code fragments correctly classified as code smells among the total of code fragments classified as code smells by the ML technique.

  P = TP / (TP + FP)    (2)

• F-measure: the harmonic mean of precision and recall.

  F1 = 2 · (P · R) / (P + R)    (3)
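The three measures reduce to simple arithmetic over the confusion matrix. The sketch below computes them from TP, FP, and FN counts; the counts used in the example are hypothetical. In practice, Weka's Evaluation class reports the same values per class.

```java
public class AccuracyMeasures {

    // Recall: TP / (TP + FN), Equation (1).
    static double recall(int tp, int fn) {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    // Precision: TP / (TP + FP), Equation (2).
    static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    // F-measure: harmonic mean of precision and recall, Equation (3).
    static double fMeasure(int tp, int fp, int fn) {
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        return p + r == 0 ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Hypothetical counts for one smell type: 80 TP, 15 FP, 20 FN.
        System.out.printf("P=%.3f R=%.3f F1=%.3f%n",
                precision(80, 15), recall(80, 20), fMeasure(80, 15, 20));
    }
}
```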
4.3 Data Analysis
Using the datasets containing the classified (non-)smelly instances and the software features for each analyzed code fragment, we performed two different analyses. Each analysis aims at answering a research question.
Analysis to answer RQ1. Here we used the datasets to analyze the accuracy (in terms of f-measure) of the ML techniques on detecting a specific smell type. For each smell type, we calculated the overall accuracy of each technique by applying a 5-fold cross-validation procedure on the 200 classified instances.
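A minimal sketch of this per-smell evaluation step with Weka is shown below: a 5-fold cross-validation over the 200 classified instances, reporting the f-measure of the class of interest. It assumes the dataset loading and classifier construction sketched earlier; the index of the smelly class value and the random seed are assumptions.

```java
import java.util.Map;
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;

public class CrossValidation {

    // Runs 5-fold cross-validation for each technique on one smell dataset
    // and prints the f-measure of the class of interest.
    public static void evaluate(Instances data, Map<String, Classifier> classifiers)
            throws Exception {
        int smellyClassIndex = 0; // assumption: "smelly" is the first class value
        for (Map.Entry<String, Classifier> entry : classifiers.entrySet()) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(entry.getValue(), data, 5, new Random(1));
            System.out.printf("%s: F1=%.3f%n",
                    entry.getKey(), eval.fMeasure(smellyClassIndex));
        }
    }
}
```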
Analysis to answer RQ2. For this analysis, we evaluated the efficiency of the ML techniques, i.e., the accuracy of each ML technique as we increment the number of code smell examples used to perform the training of these techniques. In other words, we repeated the accuracy experiment six times, one for each subset of the respective smell. The repetition aimed to guarantee that both the training and test sets were composed of equal numbers of fragments classified as smelly or not.
4.4 Implementation Aspects
The ML techniques presented in Section 2.2 were implemented on top of Weka¹³ and the R Project¹⁴. Weka is an open source ML software, based on the Java programming language, containing a plethora of tools and algorithms [21]. R is a free software environment for statistical computing that is widely used for data mining, data analysis, and to implement ML techniques [31].
5 RESULTS AND DISCUSSION
This section presents and discusses the results of our multi-project study. The results are organized in terms of the two research questions presented in the previous section.
5.1 RQ1. How accurate are the ML techniques on customizing the detection of smells?
To evaluate the accuracy of ML techniques in detecting different types of code smells, Figure 1 presents the accuracy when considering the largest dataset of our study, namely 200 instances of code smells. The use of this dataset allows us to perform an analysis
13https://www.cs.waikato.ac.nz/ml/weka/
14https://www.r-project.org
of the overall accuracy of the ML techniques in the scenario with the largest number of instances for training. In the figure, the x-axis is divided per smell and presents sequentially the ML technique used to detect the respective code smell. The y-axis describes the accuracy values (in terms of f-measure) obtained by the ML technique on detecting the respective smell type. To improve readability, we attach the f-measure values in the table below the bars associated with each smell and highlight the highest results.
Figure 1: Overall accuracy reached by the ML Techniques on
customized smell detection.
5.1.1 Overall Accuracy. By analyzing Figure 1, we can observe that there is no technique with the best accuracy for all code smell types. This is interesting since previous studies observed that Random Forest reached the highest overall accuracy on detecting smells [17, 24, 38]. However, in our study, we noticed that, when detecting harmful smells, Random Forest did not have the best overall accuracy. Although Random Forest obtained only one best result (for CC), it also obtained results close to the best ones for four other types of smell, namely CDSBP, GC, LC, and SC. In summary, all the ML techniques have a similar accuracy when observing each smell type individually. The highest divergence (0.164) is observed in the detection of SG, between JRip and Sequential Minimal Optimization (SMO). Finally, all techniques achieved accuracy between 0.6 and 0.8 for the detection of CC and SC.
Finding 1. When detecting harmful smells, the types of smell have more influence on accuracy than the ML techniques themselves.
5.1.2 Accuracy per code smell type. In the following paragraphs, we discuss the results of each code smell type individually.
Complex Class. Regarding CC, Random Forest reached the best accuracy, equal to 0.715, while J48 and Sequential Minimal Optimization obtained the lowest values, with a slight difference between them. Note that none of the ML techniques reached accuracy above 0.8. We believe this happens because the developer's perception of what is a complex class can vary a lot. This subjectivity in the detection of CC can be so divergent that the ML techniques could not find a proper model that best represents these instances. In other words, since the identification of CC is hard even for our customized ML-based detection, its identification based on metrics and thresholds certainly is even more complex.
Spaghetti Code. Although slightly better, the accuracy obtained by the ML techniques for the customized detection of SC instances is similar to that for CC. None of the ML techniques was able to reach accuracy better than 0.8. For the SC smell, J48 reached the lowest accuracy (0.675). On the other hand, the highest accuracy (0.765) was obtained by JRip.
Class Data Should Be Private. For CDSBP, the ML techniques J48, JRip, and Random Forest reached results above 0.8, and the remaining techniques obtained accuracy below this value. Naive Bayes obtained only 0.662. Support Vector Machine, One Rule, and Sequential Minimal Optimization obtained values between 0.7 and 0.8.
Speculative Generality. The ML techniques were able to reach values higher than 0.9 for SG. JRip, once again, reached the highest accuracy, equal to 0.913, followed closely by One Rule, which also exceeded 0.9. Sequential Minimal Optimization obtained the worst accuracy, equal to 0.749. An important fact to note is that SG should be difficult to detect if we look at a single instance at a time, as the ML techniques do, because this smell occurs when a developer implements an element (i.e., methods, classes, and fields) that is never used. In other words, it should be necessary to look at this element across different versions to decide whether this smell exists. This contradicts the good results obtained by some algorithms.
God Class. Differently from the smells previously discussed, the results for GC reached high values. All ML techniques could reach an accuracy higher than 0.8. JRip reached the highest accuracy (0.877). Similarly to CDSBP, the worst result was obtained by Naive Bayes. Here we can note that the combination of different features (i.e., metrics) also seems to influence the accuracy obtained by the techniques, similar to what we discussed for the CC smell. However, for this code smell, some specific metrics, such as Lines of Code, can have a higher influence [32, 45]. Besides that, GC is usually associated with high values of the metrics.
Lazy Class. The customized detection of LC is by far the one for which the ML techniques obtained the best results. All techniques reached accuracy values better than 0.9. Another difference from the smells previously observed regards the Naive Bayes technique, which reached the best result, in contrast with its previous results. It is also possible to observe that Random Forest obtained accuracy very close to Naive Bayes. This similarity also occurs between Support Vector Machine and Sequential Minimal Optimization, as well as between J48 and JRip.
Finding 2. Particularities of each code smell type, for example the metrics (features) that best represent them, affect the accuracy of the ML techniques.
5.1.3 Smells with low accuracy. Our Finding 2 can be used to explain some of the lowest results. Although further studies are needed for a more in-depth analysis, a possible explanation for the cases with a low accuracy may be related to the number of features required to train each ML technique. All techniques received as input the same set of 42 features. However, not all these features contribute equally to identifying the smell. For example, the detection of GC using classic detection strategies relies on two features: lines of code and cohesion [32]; whereas the detection of CC relies on only one metric: cyclomatic complexity [39]. Nevertheless, the techniques received the same features to detect both smells. In this way, the ML techniques might mistakenly consider different features (i.e., the weight assigned to each metric) as relevant. For example, when the techniques were trained, they probably selected more relevant features to detect GC than to detect CC.
Furthermore, developers from distinct projects might not agree on the relevance of CC instances, as they can associate complexity with different features [24, 25]. This disagreement directly affects the accuracy of the customized detection of smells, since we are considering only harmful smells refactored by developers. For example, different developers can classify the smells as relevant considering different metrics, as already identified in existing studies [26]. Therefore, based on this discussion, we have our third finding.
Finding 3. Smells with more subjective definitions, i.e., more complex ones, tend to obtain lower accuracy, since the training set will have a higher variation in the values of the features caused by the divergence of developers' opinions about the relevance of a smell instance.
This variation in training is even more intense because we are working with 42 distinct features. Too many features may cause every training instance in the dataset to appear equidistant from all the other ones. Thus, if the distances between the instances appear equally alike, the techniques cannot find meaningful clusters to classify a code fragment as smelly or non-smelly. This scenario may have happened when training the detection of some smells (e.g., CC) but not other ones (e.g., GC). This issue is known as the Curse of Dimensionality [6].
Finally, the detection of GC is more robust than the detection of CC. Since detecting a GC usually requires the analysis of more metrics, the ML technique can be less affected by the curse of dimensionality when detecting GC than when detecting CC. However, this same explanation cannot be generalized to LC, which uses a simple detection strategy: fewer lines of code than the average lines of code of the system. Consequently, another factor plays an important role in the techniques' accuracy. Some detection strategies use metrics that compute the average value across the system. Detection strategies that use such average measurements tend to yield better accuracy. For example, LC and GC, whose detection uses average measurements, have high accuracy, whereas CC and SC do not use these averages [5, 32].
Even though we provided possible explanations for the differences in accuracy, we highlight that understanding the mechanisms of ML techniques is not trivial. Yet, unseen factors may contribute to the different accuracy among the different smell types.
RQ1 Answer: When customizing the detection of code smells, most of the ML techniques detected God Class, Speculative Generality, Lazy Class, and CDSBP with high accuracy. For Spaghetti Code and Complex Class, the ML techniques obtained a lower accuracy, but most of them still reached accuracy above 0.7.
5.2 RQ2. How efficient are the ML techniques for early customized detection of smells?
Differently from the previous section, which discussed overall accuracy, here our focus is the analysis of the early customized detection of code smells. In this regard, we verify whether the techniques reached high accuracy with a low number of instances for training. Figures 2 to 8 present the results that support such analysis. These figures represent the efficiency reached by the ML techniques on detecting each smell type. The x-axis describes the number of examples used (20/40/80/120/160/200) in the training phase of the techniques, divided per smell, whereas the y-axis represents the accuracy values obtained by each ML technique on detecting smells.
We observed that the ML techniques do not follow a unique behavior when the number of analyzed examples increases. Techniques such as Random Forest and Support Vector Machine had a significant increase in SG detection with the addition of new (non-)smelly instances to the training dataset. In contrast, for J48, a smaller training dataset resulted in the best accuracy. This same behavior can be seen in JRip when detecting CC, and in Naive Bayes when detecting CDSBP and GC. In general, the ML techniques reached results near their best results in this study on detecting the respective smell with a low number of examples. Some cases are exceptions, such as the detection of CC using Naive Bayes; in these cases, a dataset containing a low number of instances did not reach high results. We can also note that none of the algorithms needed more than 20 instances to reach accuracy above 0.8 for the LC and GC smells.
Finding 4. When trained on datasets that well represent what developers consider as relevant in practice, ML techniques are able to detect code smells early.
The results and analysis presented in this section allow us to provide the answer to RQ2.
RQ2 Answer: In most cases of our study, the ML techniques did not need a training dataset with numerous examples to reach their best detection results. In fact, the increase in the number of instances did not appear to have a direct relationship with the increase in accuracy for all ML techniques. Interestingly, our results contradict those found in previous studies [17, 24, 38], which stated that the techniques needed many examples to get good results.
6 LIMITATIONS AND THREATS TO VALIDITY
The threats to validity and the limitations of our study, along with
the ways we mitigate them, are presented next.
6.1 Threats to Validity
This section discusses the threats to validity.
Figure 2: J48 Efficiency
Figure 3: Naive Bayes Efficiency
Figure 4: Support Vector Machine Efficiency
Figure 5: One Rule Efficiency
Figure 6: JRip Efficiency
Figure 7: Random Forest Efficiency
Figure 8: Sequential Minimal Optimization Efficiency

Construct Validity: The datasets that supported our study were built from code fragments collected using rule-based strategies that rely on a set of metrics and thresholds. These thresholds are a threat, since they can bias the techniques' learning, given that the analyzed smelly fragments were filtered by these thresholds. To lessen this bias, we filtered the smells by selecting only those that caught the developers' attention enough to be refactored.
Internal and External Validity. The use of the Weka package and the R platform to implement the techniques analyzed in our study enabled us to experiment with a variety of configurations, which affect the training process of the techniques. In this context, the configurations considered in our experiments may impact the accuracy and efficiency of the techniques. In order to mitigate this threat, we configured all ML techniques according to the best settings defined in [17]. Indeed, [17] performed a variety of experiments in order to find the best adjustment for each technique.
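For illustration only, Weka classifiers expose their hyperparameters through option strings, so a configuration step like the sketch below could reproduce a chosen setting. The option values shown are Weka's J48 defaults, not the specific settings reported in [17].

```java
import weka.classifiers.trees.J48;
import weka.core.Utils;

public class ClassifierConfiguration {

    public static J48 configuredJ48() throws Exception {
        J48 j48 = new J48();
        // -C sets the pruning confidence factor and -M the minimum number of
        // instances per leaf. These are J48's default values; the actual
        // settings used in the study follow [17] and are not reproduced here.
        j48.setOptions(Utils.splitOptions("-C 0.25 -M 2"));
        return j48;
    }
}
```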
As far as external validity is concerned, the code fragments were extracted from ten Java projects. Although the implementation of these projects presents classes and methods with different characteristics (i.e., size and complexity), our results might not hold for other projects.
6.2 Limitations
This section discusses the limitations found during the study, which will be considered in future studies.
Number of Smells: The catalog of smell types presented in [20] categorizes the smells based on their area of action in the code. It also defines a higher number of smell types than those addressed in our empirical study. These additional smells can also harm the quality of the software, making their detection important. However, their detection through machine learning requires the evaluation of code fragments that are suspicious of containing these smells, which leads us to the second limitation.
Evaluated Projects: Ten different projects are currently covered in our dataset. However, all of these projects are open source projects written in the Java programming language. These common characteristics among the chosen projects may reduce the variety of particular manifestations of a smell type. A larger dataset, including both closed source and additional open source projects, could expose a wider variety of smell structures.
Classier Model Customization:
We observed that each ML
technique did not support general, highly-accurate detection of
all smell types. However, the achieved an improvement in their
accuracy when are analyzing a subset of specic smell types. This
improvement could be related to the classier model built by the
techniques. It is important to note that this model can be improved
manually changing the parameters during the technique imple-
mentation, or automatically through trial and error. Previous stud-
ies [
24
,
38
] suggest that this improvement by customization could
also be explored to better detect smells for specics developers.
Project-Sensitive Customization: Better behavior of an ML technique could perhaps also be observed if the training and the detection involve a single software project. Given this narrower scope, we would reduce the number of developers involved in the dataset. Thus, the ML techniques may be able to better adapt themselves during the training process. If we further narrow the scope to the system's modules, we will have code fragments with similar responsibilities and a subset of developers in charge. This change may allow the techniques to customize their detection for the specific concerns addressed by each module, hopefully further improving their accuracy, but at the cost of possibly not having a reasonable number of smelly instances to properly train the model.
7 CONCLUSION AND FUTURE WORK
This study analyzed the accuracy and efficiency of ML techniques for detecting code smells. First, we evaluated the accuracy of the ML techniques for customizing smell detection. Then, we analyzed the efficiency of the ML techniques by evaluating their accuracy according to the number of examples used to perform the training process.
The results indicated that, when detecting harmful smells, the types of smell have more influence on accuracy than the different strategies of the ML techniques. Indeed, particularities of each smell type, such as the features that best represent it, affect the accuracy of the ML techniques. That is, ML techniques tend to have low accuracy when detecting complex smell types considering only instances relevant for the developers. In this context, JRip and Random Forest reached the highest overall accuracy on detecting smells, while Naive Bayes obtained the lowest overall accuracy.
Regarding the techniques' efficiency, we observed a different result from previous studies. In our study, the increase in the number of instances in the training set did not appear to have a direct relationship with the increase in accuracy. Since the ML techniques do not need a high number of examples to reach their best results, the effort to train the techniques is reduced, enabling their use in projects of different sizes. Also, a reduced number of needed examples allows the techniques to detect smells early, enabling the removal of smells at the beginning of the software lifecycle.
As future work, we intend to investigate the accuracy of ML techniques on detecting other smell types. In addition, we also intend to replicate this study in controlled scenarios, reducing the analyzed scope per project and, after that, per system's modules. In this way, we expect to identify the behavior of the techniques in more specific contexts.
ACKNOWLEDGMENT
We thank CNPq (grants 427787/2018-1, 434969/2018-4, 312149/2016-
6, 141276/2020-7, and 408356/2018-9), CAPES/Procad (grant 175956),
CAPES/Proex, FAPPR (grant 51435), and FAPERJ (grant 200773/2019,
010002285/2019).
REFERENCES
[1]
Marwen Abbes, Foutse Khomh, Yann-Gael Gueheneuc, and Giuliano Antoniol.
2011. An empirical study of the impact of two antipatterns, blob and spaghetti
code, on programcomprehension. In 15th European Conference on Software Main-
tenance and Reengineering (CSMR). IEEE, 181–190.
[2]
Lucas Amorim, Evandro Costa, Nuno Antunes, Baldoino Fonseca, and Marcio
Ribeiro. 2015. Experience Report: Evaluating the Eectiveness of Decision Trees
for Detecting Code Smells. In Proceedings of the 2015 IEEE 26th International
Symposium on Software Reliability Engineering (ISSRE ’15). IEEE Computer Society,
Washington, DC, USA, 261–269. https://doi.org/10.1109/ISSRE.2015.7381819
[3]
Roberta Arcoverde, Isela Macia, Alessandro Garcia, and Arndt Von Staa. 2012.
Automatically detecting architecturally-relevant code anomalies. In 2012 Third
International Workshop on Recommendation Systems for Software Engineering
(RSSE). IEEE, 90–91.
[4]
Muhammad Ilyas Azeem, Fabio Palomba, Lin Shi, and Qing Wang. 2019. Machine
learning techniques for code smell detection: A systematic literature review
and meta-analysis. Information and Software Technology 108 (2019), 115 – 138.
https://doi.org/10.1016/j.infsof.2018.12.009
SBES ’20, October 21–23, 2020, Natal, Brazil Oliveira et al.
[5]
Gabriele Bavota, Andrea De Lucia, Massimiliano Di Penta, Rocco Oliveto, and
Fabio Palomba. 2015. An experimental investigation on the innate relationship
between quality and refactoring. Journal of Systems and Software ( JSS) 107 (2015),
1–14.
[6]
Richard Bellman. 1966. Dynamic programming. Science 153, 3731 (1966), 34–37.
[7]
Ana Carla Bibiano, Eduardo Fernandes, Daniel Oliveira, Alessandro Garcia, Mar-
cos Kalinowski, Baldoino Fonseca, Roberto Oliveira, Anderson Oliveira, and
Diego Cedrim. 2019. A Quantitative Study on Characteristics and Eect of
Batch Refactoring on Code Smells. In 13th International Symposium on Empirical
Software Engineering and Measurement (ESEM). 1–11.
[8]
Diego Cedrim, Alessandro Garcia, Melina Mongiovi, Rohit Gheyi, Leonardo
Sousa, Rafael de Mello, Baldoino Fonseca, Márcio Ribeiro, and Alexander Chávez.
2017. Understanding the impact of refactoring on smells. In ACM Joint European
Software Engineering Conference and Symposium on the Foundations of Software
Engineering (ESEC/FSE). 465–475.
[9]
Alexander Chávez, Isabella Ferreira, Eduardo Fernandes, Diego Cedrim, and
Alessandro Garcia. 2017. How does refactoring aect internal quality attributes?
A multi-project study. In Proceedings of the 31st Brazilian Symposium on Software
Engineering (SBES). 74–83.
[10]
William W. Cohen. 1995. Fast Eective Rule Induction. In Twelfth International
Conference on Machine Learning. Morgan Kaufmann, 115–123.
[11]
Warteruzannan Soyer Cunha and Valter Vieira de Camargo. 2019. Uma In-
vestigação da Aplicação de Aprendizado de Máquina para Detecção de Smells
Arquiteturais. In Anais do VII Workshop on Software Visualization, Evolution and
Maintenance (VEM) (Salvador). SBC, Porto Alegre, RS, Brasil, 78–85. https:
//doi.org/10.5753/vem.2019.7587
[12]
R. M. d. Mello, R. F. Oliveira, and A. F. Garcia. 2017. On the Inuence of Hu-
man Factors for Identifying Code Smells: A Multi-Trial Empirical Study. In 2017
ACM/IEEE International Symposium on Empirical Software Engineering and Mea-
surement (ESEM). 68–77. https://doi.org/10.1109/ESEM.2017.13
[13]
Rafael de Mello, Anderson Uchôa, Roberto Oliveira, Willian Oizumi, Jairo Souza,
Kleyson Mendes, Daniel Oliveira, Baldoino Fonseca, and Alessandro Garcia.
2019. Do Research and Practice of Code Smell Identication Walk Together? A
Social Representations Analysis. In 2019 ACM/IEEE International Symposium on
Empirical Software Engineering and Measurement (ESEM). IEEE, 1–6.
[14]
D. Di Nucci, F. Palomba, D. A. Tamburri, A. Serebrenik, and A. De Lucia. 2018.
Detecting code smells using machine learning techniques: Are we there yet?.
In 2018 IEEE 25th International Conference on Software Analysis, Evolution and
Reengineering (SANER). 612–621.
[15]
Eduardo Fernandes, Johnatan Oliveira, Gustavo Vale, Thanis Paiva, and Eduardo
Figueiredo. 2016. A review-based comparative study of bad smell detection tools.
In Proceedings of the 20th International Conference on Evaluation and Assessment
in Software Engineering (EASE). 18:1–18:12.
[16]
Francesca Arcelli Fontana, Pietro Braione, and Marco Zanoni. 2012. Automatic
detection of bad smells in code: An experimental assessment. Journal of Object
Technology 11, 2 (2012), 5–1.
[17]
Francesca Arcelli Fontana, Mika V. Mäntylä, Marco Zanoni, and Alessandro
Marino. 2015. Comparing and experimenting machine learning techniques
for code smell detection. Empirical Software Engineering (June 2015). https:
//doi.org/10.1007/s10664-015- 9378-4
[18]
Francesca Arcelli Fontana, Marco Zanoni, Alessandro Marino, and Mika V.
Mäntylä. 2013. Code Smell Detection: Towards a Machine Learning-Based Ap-
proach. 2013 IEEE International Conference on Software Maintenance (sep 2013),
396–399. https://doi.org/10.1109/ICSM.2013.56
[19] Martin Fowler. 1999. Refactoring (1 ed.). Addison-Wesley Professional.
[20]
Martin Fowler. 1999. Refactoring: Improving the Design of Existing Code. Addison-
Wesley, Boston, MA, USA.
[21]
Mark Hall, Eibe Frank, Georey Holmes, Bernhard Pfahringer, and Ian H Reute-
mann, Peter andWitten. 2009. The WEKA data mining software: an update. ACM
SIGKDD explorations newsletter 11, 1 (2009), 10–18.
[22]
Tin Kam Ho. 1995. Random decision forests. In Document analysis and recognition,
1995., proceedings of the third international conference on, Vol. 1. IEEE, 278–282.
[23]
R.C. Holte. 1993. Very simple classication rules perform well on most commonly
used datasets. Machine Learning 11 (1993), 63–91.
[24]
Mario Hozano, Nuno Antunes, Baldoino Fonseca, and Evandro Costa. 2017. Eval-
uating the Accuracy of Machine Learning Algorithms on Detecting Code Smells
for Dierent Developers. In Proceedings of the 19th International Conference on
Enterprise Information Systems. 474–482.
[25]
Mario Hozano, Alessandro Garcia, Nuno Antunes, Baldoino Fonseca, and Evandro
Costa. 2017. Smells Are Sensitive to Developers!: On the Eciency of (Un)Guided
Customized Detection. In Proceedings of the 25th International Conference on Pro-
gram Comprehension (Buenos Aires, Argentina) (ICPC ’17). IEEE Press, Piscataway,
NJ, USA, 110–120. https://doi.org/10.1109/ICPC.2017.32
[26]
Mário Hozano, Alessandro Garcia, Baldoino Fonseca, and Evandro Costa. 2018.
Are You Smelling It? Investigating How Similar Developers Detect Code Smells.
Information and Software Technology (IST) 93, C (Jan. 2018), 130–146. https:
//doi.org/10.1016/j.infsof.2017.09.002
[27]
Allen Kent, Madeline M Berry,Fred U Luehrs Jr, and James W Perry. 1955. Machine
literature searching VIII. Operational criteria for designing information retrieval
systems. American documentation 6, 2 (1955), 93–101.
[28]
Foutse Khomh, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, and Giuliano
Antoniol. 2011. An exploratory study of the impact of antipatterns on class
change- and fault-proneness. Empirical Software Engineering 17, 3 (Aug. 2011),
243–275. https://doi.org/10.1007/s10664-011- 9171-y
[29] F. Khomh, S. Vaucher, Y. G. Guéhéneuc, and H. Sahraoui. 2009. A Bayesian approach for the detection of code and design smells. In Proceedings of the 9th International Conference on Quality Software (QSIC '09). IEEE, 305–314.
[30] Miryung Kim, Thomas Zimmermann, and Nachiappan Nagappan. 2014. An empirical study of refactoring challenges and benefits at Microsoft. IEEE Transactions on Software Engineering 40, 7 (2014), 633–649.
[31] Brett Lantz. 2019. Machine Learning with R: Expert Techniques for Predictive Modeling. Packt Publishing Ltd.
[32] Michele Lanza and Radu Marinescu. 2007. Object-Oriented Metrics in Practice: Using Software Metrics to Characterize, Evaluate, and Improve the Design of Object-Oriented Systems. Springer Science & Business Media.
[33] Michele Lanza, Radu Marinescu, and Stéphane Ducasse. 2005. Object-Oriented Metrics in Practice. Springer-Verlag New York, Inc., Secaucus, NJ, USA.
[34] Isela Macia, Alessandro Garcia, Christina Chavez, and Arndt von Staa. 2013. Enhancing the detection of code anomalies with architecture-sensitive strategies. In Proceedings of the 17th European Conference on Software Maintenance and Reengineering (CSMR). IEEE, 177–186.
[35] Abdou Maiga, Nasir Ali, Neelesh Bhattacharya, Aminata Sabané, Yann-Gaël Guéhéneuc, and Esma Aïmeur. 2012. SMURF: A SVM-based Incremental Anti-pattern Detection Approach. 2012 19th Working Conference on Reverse Engineering (Oct. 2012), 466–475. https://doi.org/10.1109/WCRE.2012.56
[36] Tom M. Mitchell. 1997. Machine Learning. McGraw-Hill, Boston (Mass.), Burr Ridge (Ill.), Dubuque (Iowa). http://opac.inria.fr/record=b1093076
[37] M. J. Munro. 2005. Product Metrics for Automatic Identification of "Bad Smell" Design Problems in Java Source-Code. 11th IEEE International Software Metrics Symposium (METRICS) (2005), 15–15. https://doi.org/10.1109/METRICS.2005.38
[38] Daniel Oliveira. 2020. Towards Customizing Smell Detection and Refactorings. Master's dissertation. Pontifical Catholic University of Rio de Janeiro.
[39] Fabio Palomba, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, and Andrea De Lucia. 2014. Do They Really Smell Bad? A Study on Developers' Perception of Bad Code Smells. IEEE International Conference on Software Maintenance and Evolution (2014), 101–110. https://doi.org/10.1109/ICSME.2014.32
[40] Fabiano Pecorelli, Dario Di Nucci, Coen De Roover, and Andrea De Lucia. 2019. On the Role of Data Balancing for Machine Learning-Based Code Smell Detection. In Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation (Tallinn, Estonia) (MaLTeSQuE 2019). Association for Computing Machinery, New York, NY, USA, 19–24. https://doi.org/10.1145/3340482.3342744
[41] Fabiano Pecorelli, Dario Di Nucci, Coen De Roover, and Andrea De Lucia. 2020. A large empirical assessment of the role of data balancing in machine-learning-based code smell detection. Journal of Systems and Software 169 (2020), 110693. https://doi.org/10.1016/j.jss.2020.110693
[42] Fabiano Pecorelli, Fabio Palomba, Foutse Khomh, and Andrea De Lucia. 2020. Developer-Driven Code Smell Prioritization. In International Conference on Mining Software Repositories.
[43] J. Platt. 1998. Fast Training of Support Vector Machines using Sequential Minimal Optimization. In Advances in Kernel Methods - Support Vector Learning, B. Schoelkopf, C. Burges, and A. Smola (Eds.). MIT Press. http://research.microsoft.com/~jplatt/smo.html
[44] Ross Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
[45] José Amancio M. Santos, Manoel G. Mendonça, Cleber Pereira dos Santos, and Renato Lima Novais. 2014. The problem of conceptualization in god class detection: agreement, strategies and decision drivers. Journal of Software Engineering Research and Development 2 (2014), 1–33.
[46] Danilo Silva, Nikolaos Tsantalis, and Marco Tulio Valente. 2016. Why we refactor?. In FSE '16. 858–870.
[47] Ingo Steinwart and Andreas Christmann. 2008. Support Vector Machines. Springer Science & Business Media.
[48] Nikolaos Tsantalis, Victor Guana, Eleni Stroulia, and Abram Hindle. 2013. A multidimensional empirical study on refactoring activity. In 23rd Annual International Conference on Computer Science and Software Engineering. 132–146.
[49] Aiko Yamashita and Leon Moonen. 2012. Do code smells reflect important maintainability aspects?. In 2012 28th IEEE International Conference on Software Maintenance (ICSM). IEEE, 306–315.
[50] Aiko Yamashita and Leon Moonen. 2013. Exploring the Impact of Inter-smell Relations on Software Maintainability: An Empirical Study. In Proceedings of the 2013 International Conference on Software Engineering (San Francisco, CA, USA) (ICSE '13). IEEE Press, Piscataway, NJ, USA, 682–691. http://dl.acm.org/citation.cfm?id=2486788.2486878
Code smells are symptoms of poor design and implementation choices weighing heavily on the quality of produced source code. During the last decades several code smell detection tools have been proposed. However, the literature shows that the results of these tools can be subjective and are intrinsically tied to the nature and approach of the detection. In a recent work the use of Machine-Learning (ML) techniques for code smell detection has been proposed, possibly solving the issue of tool subjectivity giving to a learner the ability to discern between smelly and non-smelly source code elements. While this work opened a new perspective for code smell detection, it only considered the case where instances affected by a single type smell are contained in each dataset used to train and test the machine learners. In this work we replicate the study with a different dataset configuration containing instances of more than one type of smell. The results reveal that with this configuration the machine learning techniques reveal critical limitations in the state of the art which deserve further research.