Conference Paper

Automatic Metric Thresholds Derivation for Code Smell Detection

... The cooperative approach performs different activities in a cooperative way. Fontana et al. (2015) proposed a detection strategy for code smells, deriving the metric thresholds used to detect them from a benchmark of 74 Java software systems. ...
... So, they resorted to manual validation of the instances. In the proposed approach, by contrast, the researchers used deterministic code smell rules from the literature (Fontana F. A., 2015) to identify whether a class is smelly, and unsupervised learning is used instead of manual validation. The training dataset in the proposed work is also larger than in the previous work. ...
... These rules help to detect whether method instances are smelly. These detection rules were chosen because their metric thresholds are derived from a benchmark of 74 selected software systems (Fontana F. A., 2015). The metrics CC, CM, and FANOUT are used to identify the characteristics of the shotgun surgery smell. ...
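Detection strategies of this kind reduce to a conjunction of metric-versus-threshold comparisons. A minimal sketch for shotgun surgery is shown below; the threshold values are hypothetical placeholders, not the ones actually derived from the 74-system benchmark.

```python
# Hypothetical benchmark-derived thresholds (placeholders for illustration only).
THRESHOLDS = {"CC": 3, "CM": 6, "FANOUT": 5}

def is_shotgun_surgery(metrics):
    """Flag a method as shotgun surgery when every metric exceeds its threshold."""
    return all(metrics[name] > limit for name, limit in THRESHOLDS.items())

print(is_shotgun_surgery({"CC": 7, "CM": 9, "FANOUT": 6}))  # True
print(is_shotgun_surgery({"CC": 2, "CM": 9, "FANOUT": 6}))  # False
```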
Chapter
Full-text available
Code smell is an inherent property of software that results in design problems which make the software hard to extend, understand, and maintain. In the literature, several tools are used to detect code smells, but they are informally defined or subjective in nature, as reflected in their varying results. To resolve this, machine learning (ML) techniques have been proposed that learn to distinguish the characteristics of smelly and non-smelly code elements (classes or methods). However, the datasets used by these ML techniques are based on tools and manually validated code smell samples. In this article, instead of using tools and manual validation, the authors applied detection rules to identify smells and then used unsupervised learning for validation, constructing two smell datasets. Classification algorithms were then applied to these datasets to detect the code smells. The researchers found that all algorithms achieved high performance in terms of accuracy, F-measure, and area under ROC, yet the tree-based classifiers performed better than the other classifiers.
... One of the primary goals of code metric studies is to provide guidelines about metric thresholds for maintainability [3,4,8]. And yet, despite size being the most important metric, there is no systematic study that provides guidelines about the relationship between size and future maintainability. ...
... Getters/setters: Getters and setters, also known as accessor methods, are often generated automatically and create noise in code metric studies [8,31,50]. Similar to earlier studies [1,8], we filtered them out from our dataset before calculating any code metrics. If a method name starts with get, has a non-void return type, and does not have any parameters, then we identify this method as a getter. ...
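A minimal sketch of this filtering heuristic follows. The getter rule is taken directly from the excerpt; the setter rule is an assumed mirror image ("set" prefix, void return type, exactly one parameter) that the excerpt does not spell out.

```python
def is_getter(name, return_type, params):
    # Rule from the excerpt: "get" prefix, non-void return, no parameters.
    return name.startswith("get") and return_type != "void" and not params

def is_setter(name, return_type, params):
    # Assumed mirror image of the getter rule (not stated in the excerpt).
    return name.startswith("set") and return_type == "void" and len(params) == 1

# Accessors are dropped before any code metrics are computed.
methods = [("getName", "String", []), ("setName", "void", ["n"]), ("compute", "int", ["x"])]
non_accessors = [m for m in methods if not (is_getter(*m) or is_setter(*m))]
print(non_accessors)  # [('compute', 'int', ['x'])]
```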
Preprint
Full-text available
Code metrics have been widely used to estimate software maintenance effort. Metrics have generally been used to guide developer effort to reduce or avoid future maintenance burdens. Size is the simplest and most widely deployed metric. The size metric is pervasive because size correlates with many other common metrics (e.g., McCabe complexity, readability, etc.). Given the ease of computing a method's size, and the ubiquity of these metrics in industrial settings, it is surprising that no systematic study has been performed to provide developers with meaningful method size guidelines with respect to future maintenance effort. In this paper we examine the evolution of around 785K Java methods and show that developers should strive to keep their Java methods under 24 lines in length. Additionally, we show that decomposing larger methods to smaller methods also decreases overall maintenance efforts. Taken together, these findings provide empirical guidelines to help developers design their systems in a way that can reduce future maintenance.
... This threshold is used to understand whether a metric is a significant fault predictor. Fontana et al. [14] suggested a data-driven method that is repeatable and transparent and enables the extraction of thresholds. This method is also based on the metric's statistical distribution. ...
... Alan and Catal [35] provided an outlier detection algorithm used to improve the performance of fault predictors in the software engineering domain. The application domains for the threshold calculation method in the paper by Fontana et al. [14] are software quality and maintainability assessments, and they are especially useful for code smell detection rules. The scope of the paper by Herbold et al. [11] can determine the source of code discrepancies. ...
... Moreover, he used more than one category in another study [52]: the existing size, cohesion, and coupling metrics. Paper [14] applied cohesion and complexity metrics, which form a code smell metrics set. ...
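Distribution-based derivation of this kind can be sketched as reading thresholds off fixed percentiles of a metric's pooled benchmark values; the 70/80/90% points below are illustrative choices, not those of any specific cited method.

```python
def derive_thresholds(metric_values, percentiles=(0.70, 0.80, 0.90)):
    # Pool a metric's values across a benchmark and read off percentile cut points.
    ordered = sorted(metric_values)
    n = len(ordered)
    return {p: ordered[min(int(p * n), n - 1)] for p in percentiles}

benchmark_wmc = [8, 10, 12, 15, 18, 25, 30, 42, 60, 120]  # toy benchmark data
print(derive_thresholds(benchmark_wmc))  # {0.7: 42, 0.8: 60, 0.9: 120}
```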
Article
Full-text available
Several aspects of software product quality can be assessed and measured using product metrics. Without software metric threshold values, it is difficult to evaluate different aspects of quality. To this end, the interest in research studies that focus on identifying and deriving threshold values is growing, given the advantage of applying software metric threshold values to evaluate various software projects during their software development life cycle phases. The aim of this paper is to systematically investigate research on software metric threshold calculation techniques. In this study, electronic databases were systematically searched for relevant papers; 45 publications were selected based on inclusion/exclusion criteria, and research questions were answered. The results demonstrate the following important characteristics of studies: (a) both empirical and theoretical studies were conducted, a majority of which depends on empirical analysis; (b) the majority of papers apply statistical techniques to derive object-oriented metrics threshold values; (c) Chidamber and Kemerer (CK) metrics were studied in most of the papers, and are widely used to assess the quality of software systems; and (d) there is a considerable number of studies that have not validated metric threshold values in terms of quality attributes. From both the academic and practitioner points of view, the results of this review present a catalog and body of knowledge on metric threshold calculation techniques. The results set new research directions, such as conducting mixed studies on statistical and quality-related studies, studying an extensive number of metrics and studying interactions among metrics, studying more quality attributes, and considering multivariate threshold derivation.
... The first step in the detection process may require a specific format of the input data. Investigating the input type reveals the method used by each bad smell detection technique: metrics-based techniques use weight-based distance metrics, B-splines, and Bayesian approaches; rule-based techniques use BBN and multi-pattern matching algorithms; token-based techniques use similarity scoring, fingerprinting, token matching, bit-vector, longest-match, lexical comparison, and GCDA algorithms; and text-based techniques use matching algorithms, the linguistic antipattern detector (LAPD), and information retrieval (IR) algorithms. ...
... [Table excerpt: the metric categories used across previous studies (cohesion, size, coupling, complexity, similarity, inheritance, encapsulation, information hiding, and data abstraction), each mapped to the corresponding primary studies.] ...
Article
Software smells indicate design or code issues that might degrade the evolution and maintenance of software systems. Detecting and identifying these issues are challenging tasks. This paper explores, identifies, and analyzes the existing software smell detection techniques at the design and code levels. We carried out a systematic literature review (SLR) to identify and collect 145 primary studies related to smell detection in software design and code. Based on these studies, we address several questions related to the analysis of the existing smell detection techniques in terms of abstraction level (design or code), targeted smells, used metrics, implementation, and validation. Our analysis identified several categories of detection techniques. We observed that 57% of the studies did not use any performance measures, 41% of them omitted details on the targeted programming language, and the detection techniques were not validated in 14% of these studies. With respect to the abstraction level, only 18% of the studies addressed bad smell detection at the design level. This low coverage urges more focus on bad smell detection at the design level, so that smells can be handled at early stages. Finally, our SLR brings to the attention of the research community several opportunities for future research. Highlights:
• Identified and collected 145 primary studies (PS) related to smell detection in software design and code.
• Identified several categories of detection techniques.
• Observed that 57% of the studies did not use any performance measures.
• 41% of the PSs omitted details on the targeted programming language.
• Detection techniques were not validated in 14% of the PSs.
• Only 18% of the studies addressed bad smell detection at the design level.
• Identified a number of open issues.
... Since efficient software quality evaluation can be done only with reliable threshold values, the process of threshold derivation is very important. Different approaches for deriving threshold values are proposed in the literature [4], including approaches based on benchmark data, like [19,10,1,30,13,21]. They use software metric values as an input, and provide concrete threshold values for selected software metrics. ...
... Different software metric tools are available that enable the collection of software metric values. However, the implementation of the same software metric often varies across different tools [31,10,13,44,22,14,33], resulting in different values for the same software metric on the same input data. The set of supported metrics also differs among software metric tools, and, additionally, new tool-specific software metrics can be found. ...
... Different approaches for deriving threshold values are available in the literature [4]. Fontana et al. [13] categorize derivation approaches into (1) approaches based on observations, (2) error-based approaches, (3) approaches using machine learning, and (4) approaches that derive thresholds based on a statistical analysis of benchmark data. In the presented research, we focus on the latter. ...
Article
Full-text available
Without reliable software metric threshold values, efficient quality evaluation of software cannot be done. In order to derive reliable thresholds, we have to address several challenges that impact the final result. For instance, software metric implementations vary across software metric tools, and threshold values vary as a result of different threshold derivation approaches. In addition, the programming language is another important aspect. In this paper, we present the results of an empirical study aimed at comparing systematically obtained threshold values for nine software metrics in four object-oriented programming languages (i.e., Java, C++, C#, and Python). We addressed challenges in the threshold derivation domain through adjustments introduced to the benchmark-based threshold derivation approach. The data set was selected in a uniform way, allowing derivation repeatability, while input values were collected using a single software metric tool, enabling the comparison of derived thresholds among the chosen object-oriented programming languages. Within the performed empirical study, the comparison reveals that threshold values differ between programming languages.
... • 21% of studies are grouped under empirical studies. Studies (Yamashita et al., 2009;Li and Thompson, 2010;Olbrich et al., 2010;Zazworka et al., 2011a;Hermans et al., 2012;Yamashita and Moonen, 2012;Hall et al., 2014;Fontana et al., 2015c;Szőke et al., 2015;Ahmed et al., 2017;Karađuzović-Hadžiabdić & Spahić, 2018b;Wang et al., 2018;Pecorelli et al., 2019;Kumar and Ram, 2021;Pigazzini et al., 2021) are found to analyze the data using existing techniques of code smell detection. • 12% are found to target refactoring. ...
... This not only adds value to the subject data but also helps in setting a benchmark for the evaluation of code smell detection tools and techniques. Fontana et al. (2015a, 2015b, 2015c), Fontana and Spinelli (2011), Fontana and Zanoni (2011), Fontana et al. (2012), and Di Nucci et al. (2018) used the datasets of Tempero et al. (2010) in their works on code smells. These works are a great source of learning about code smells and were extended and used by other researchers. ...
Article
Code smells have been detected, predicted, and studied by researchers from several perspectives. This literature review was conducted to understand the tools and algorithms used to detect and analyze code smells and to summarize the research agenda. 114 studies published between 2009 and 2022 were selected for this review. The studies are analyzed in depth under the categorization of machine learning and non-machine learning, numbering 25 and 89, respectively. The studies are analyzed to gain insight into the algorithms, tools, and limitations of the techniques. Long Method, Feature Envy, and Duplicate Code are reported to be the most popular smells. 38% of the studies focused their research on the enhancement of tools and methods. Random Forest and JRip are found to give the best results among machine learning techniques. We extended the previous studies on code smell detection tools, reporting a total of 87 tools during the review. Java is found to be the dominant programming language in the study of smells.
... lines of code) than other components in the system [6]. Originally, GC was defined using a fixed threshold on the lines of code [6]; ARCAN, however, uses a variable benchmark based on the frequencies of the number of lines of code of the other packages in the system [23]. Adopting a benchmark to derive the detection threshold fits particularly well in this case, because what is considered a "large component" depends on the size of other components in the system under analysis and in many other systems. ...
... 3.3.2.2 Size-based smells: God Component is a smell that is detected based on the number of lines of code an artefact has (calculated by summing up the LOC of the directly contained files) and whether it exceeds a certain threshold. The threshold is calculated using an adaptive statistical approach that takes into consideration the number of LOC of the other packages in the system and in a benchmark of over 100 systems [23]. The adaptive threshold is defined in such a way that it is always larger than the median lines of code of the packages/components in the system and benchmark. ...
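The adaptive rule can be sketched as follows. The pooled 75th percentile is an illustrative assumption; the only property taken from the description above is that the result always exceeds the median package size of the system and benchmark combined.

```python
import statistics

def god_component_threshold(system_loc, benchmark_loc):
    # Pool package sizes from the analysed system and the benchmark.
    pooled = sorted(system_loc + benchmark_loc)
    median = statistics.median(pooled)
    p75 = pooled[int(0.75 * len(pooled))]  # illustrative percentile choice
    return max(p75, median + 1)            # always strictly above the median

system = [300, 450, 800, 1200]             # package LOC in the system
benchmark = [200, 350, 500, 900, 2500]     # package LOC from other systems
threshold = god_component_threshold(system, benchmark)
print([loc for loc in system if loc > threshold])  # packages flagged: [1200]
```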
Article
Full-text available
A key aspect of technical debt (TD) management is the ability to measure the amount of principal accumulated in a system. The current literature contains an array of approaches to estimate TD principal; however, only a few of them focus specifically on architectural TD, and none of them satisfies all three of the following criteria: being fully automated, freely available, and thoroughly validated. Moreover, a recent study has shown that many of the current approaches suffer from certain shortcomings, such as relying on hand-picked thresholds. In this paper, we propose a novel approach to estimate architectural technical debt principal based on machine learning and architectural smells to address such shortcomings. Our approach can estimate the amount of technical debt principal generated by a single architectural smell instance. To do so, we adopt novel techniques from Information Retrieval to train a learning-to-rank machine learning model (more specifically, a gradient boosting machine) that estimates the severity of an architectural smell and ensures the transparency of the predictions. Then, for each instance, we statically analyse the source code to calculate the exact number of lines of code creating the smell. Finally, we combine these two values to calculate the technical debt principal. To validate the approach, we conducted a case study and interviewed 16 practitioners, from both open source and industry, and asked them about their opinions on the TD principal estimations for several smells detected in their projects. The results show that for 71% of instances, practitioners agreed that the estimations provided were representative of the effort necessary to refactor the smell.
... Rule priority guidelines for default and custom-made rules can be found in the PMD project documentation. SonarQube LTS 6.7.7 detects a total of 413 rules, which are grouped based on type and severity. SonarQube categorizes the 413 rules under 3 types: Bugs, Code Smells, and Vulnerabilities. ...
... As such, we claim that the adoption of Checkstyle would be ideal when used in combination with additional SATs. To broaden the scope of the discussion, the poor performance achieved by the considered tools reinforces the preliminary research efforts to devise approaches for the automatic/adaptive configuration of SATs [18,19], as well as for the automatic derivation of proper thresholds to use when locating the presence of design issues in source code [20,21]. It might indeed be possible that the integration of those approaches into the inner workings of the currently available SATs could lead to a reduction of the number of false positives. ...
... Secondly, different detectors do not output the same results, making it even harder for developers to decide whether to refactor source code [5]. Finally, these detectors require thresholds to distinguish smelly from non-smelly components, which are hard to tune [6]. ...
... For each of the rule sets, the configuration file was downloaded directly from Checkstyle's guidelines. In order to start the analysis, the checkstyle-8.30-all.jar and the configuration file in question were saved in the directory where all the projects resided. ...
Article
Full-text available
Code smells are poor implementation choices that developers apply while evolving source code and that affect program maintainability. Multiple automated code smell detectors have been proposed: while most of them relied on heuristics applied over software metrics, a recent trend concerns the definition of machine learning techniques. However, machine learning-based code smell detectors still suffer from low accuracy: one of the causes is the lack of adequate features to feed machine learners. In this paper, we face this issue by investigating the role of static analysis warnings generated by three state-of-the-art tools to be used as features of machine learning models for the detection of seven code smell types. We conduct a three-step study in which we (1) verify the relation between static analysis warnings and code smells and the potential predictive power of these warnings; (2) build code smell prediction models exploiting and combining the most relevant features coming from the first analysis; (3) compare and combine the performance of the best code smell prediction model with the one achieved by a state-of-the-art approach. The results reveal the low performance of the models exploiting static analysis warnings alone, while we observe significant improvements when combining the warnings with additional code metrics. Nonetheless, we still find that the best model does not perform better than a random model, hence leaving open the challenges related to the definition of ad-hoc features for code smell prediction.
... Once the metrics are defined, we need to give them a severity rating. For this, we used the benchmark-based threshold derivation methodology proposed by Alves et al. [2], which follows three core principles [11]. One of the rated test smells is a test that tries to verify too many functionalities, which can lead to difficulty in understanding the test code [31]; such a test is hard to read and understand, and therefore more difficult to use as documentation, and it makes tests more dependent on each other and harder to maintain. ...
... We chose this technique as it (i) does not assume normality of the metric value distribution, (ii) uses a weight function (LOC) that emphasizes the metric's variability, and (iii) separates the thresholds into different risk categories. Furthermore, this state-of-the-art benchmarking technique has been used in many previous studies that needed to calculate thresholds for new metrics [1,3,5,11,29]. The benchmark-based threshold derivation enables us to define severity levels based on the representation of occurrences in the benchmark dataset. ...
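A hedged sketch of such a LOC-weighted, benchmark-based derivation: entities are ordered by metric value, each contributes its LOC as weight, and thresholds are read at fixed points of the weighted cumulative distribution. The 70/80/90% quantiles are common choices in this family of techniques, not values prescribed by the paper.

```python
def weighted_thresholds(entities, quantiles=(0.70, 0.80, 0.90)):
    """entities: list of (metric_value, loc) pairs from a benchmark."""
    total = sum(loc for _, loc in entities)
    thresholds, cumulative, pending = {}, 0, list(quantiles)
    for value, loc in sorted(entities):        # ascending by metric value
        cumulative += loc                      # each entity weighted by its LOC
        while pending and cumulative / total >= pending[0]:
            thresholds[pending.pop(0)] = value # risk-category boundary
    return thresholds

data = [(2, 100), (3, 300), (5, 250), (8, 200), (15, 150)]
print(weighted_thresholds(data))  # {0.7: 8, 0.8: 8, 0.9: 15}
```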
Conference Paper
Test smells are poor design decisions implemented in test code, which can have an impact on the effectiveness and maintainability of unit tests. Even though test smell detection tools exist, how to rank the severity of the detected smells is an open research topic. In this work, we investigate severity ratings for four test smells and their perceived impact on test suite maintainability as judged by developers. To accomplish this, we first analyzed some 1,500 open-source projects to elicit severity thresholds for commonly found test smells. Then, we conducted a study with developers to evaluate our thresholds. We found that (1) current detection rules for certain test smells are considered too strict by developers and (2) our newly defined severity thresholds are in line with the participants' perception of how test smells impact the maintainability of a test suite. Preprint [https://doi.org/10.5281/zenodo.3744281], data and material [https://doi.org/10.5281/zenodo.3611111].
... These tools usually use generic metric thresholds for classifying source code elements (such as classes and methods) of one or more systems into categories (e.g. low or high) [39,40,41,42,43,44]. For instance, Lanza and Marinescu [39] classify as long any method that has more than 20 lines of code (LOC) in Java systems. ...
... Threshold selection is a challenge because of the proneness to false positives [64]. A threshold that points out code smells that hold in the context of one application module may not necessarily make sense for other applications or for other modules of the same application [43]. Previous works suggest that deriving metric thresholds according to the application design context might reduce false code smell alarms [45,46,48]. ...
Preprint
Full-text available
Context: Software code review aims to find code anomalies early and to perform code improvements when they are less expensive. However, the issues and challenges faced by developers who do not apply code review practices regularly are unclear. Goal: Investigate the difficulties developers face in applying code review practices, without limiting the target audience to developers who already use this practice regularly. Method: We conducted a web-based survey with 350 Brazilian practitioners engaged in the software development industry. Results: Code review practices are widespread among Brazilian practitioners, who recognize their importance. However, there is no routine for applying these practices. In addition, practitioners report difficulties fitting static analysis tools into the software development process. One possible reason recognized by practitioners is that most of these tools use a single metric threshold, which might not be adequate to evaluate all system classes. Conclusion: Improving guidelines for fitting code review practices into the software development process could help to make them widely used. Additionally, future studies should investigate whether multiple metric thresholds that take source code context into account reduce static analysis tool false alarms. Finally, these tools should allow their use in distinct phases of the software development process.
... In the recent past, code smell has been actively studied by several researchers and many tools and techniques have been proposed to handle code smell situations [9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24]. Most of these approaches fall into three main categories, namely (i) metric-based; (ii) heuristic/rule-based; and (iii) machine learning-based. ...
Article
Full-text available
Code smells negatively impact software maintenance, and several mitigation tools and/or techniques have been devised in the past. However, their interpretation is subjective and threshold-dependent. To overcome this limitation, several supervised machine learning classifiers have been suggested in the past. However, their performance is highly dependent on the quality of the available dataset. An imbalanced dataset highly degrades a classifier's performance. Moreover, the size of the dataset (in terms of the features used for training and testing) directly affects the performance and time parameters of the classifier. Reduced dimensionality generally improves the performance and training/testing time of the classifier. Therefore, this paper proposes a new ensemble feature selection technique that helps significantly reduce the dataset's dimensionality (the features used for training and testing purposes). Moreover, the performance of the proposed approach is experimentally evaluated by considering six classifiers for identifying six code smells from the dataset. The suitability of different classifiers in detecting the considered code smells is evaluated in two cases, namely by considering all features at once and by considering the selected features suggested by the proposed approach. Based on the experimentation, we conclude that the proposed approach is capable of improving performance and achieving sufficiently high accuracy (>95%).
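As an illustration only (the concrete rankers and the voting rule below are assumptions, not the paper's technique), an ensemble feature-selection step can combine several selectors and keep the features a majority of them retain:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X = np.abs(X)  # chi2 requires non-negative feature values

k = 8  # number of features each selector keeps
votes = np.zeros(X.shape[1])
for scorer in (chi2, mutual_info_classif):
    votes += SelectKBest(scorer, k=k).fit(X, y).get_support()

# Third "selector": top-k features by random forest importance.
importances = RandomForestClassifier(random_state=0).fit(X, y).feature_importances_
ranks = np.argsort(np.argsort(importances))  # rank of each feature
votes += ranks >= X.shape[1] - k

selected = np.where(votes >= 2)[0]  # majority vote among the three selectors
print("kept features:", selected)
```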
... For the software metrics-based detection strategies, thresholds must be defined to identify code smells. Defining proper thresholds is a major challenge due to their innate complexity, and it has been addressed by several studies that propose methods for threshold derivation [14,17]. ...
Conference Paper
Code smells are symptoms of bad design choices implemented in the source code. Several code smell detection tools and strategies have been proposed over the years, including the use of machine learning algorithms. However, we lack empirical evidence on how expert feedback could improve machine learning based detection of code smells. This paper aims to propose and evaluate a conceptual strategy to improve machine-learning detection of code smells by means of continuous feedback. To evaluate the strategy, we follow an exploratory evaluation design to compare the results of smell detection before and after feedback provided by a service acting as a software expert. We focus on four code smells (God Class, Long Method, Feature Envy, and Refused Bequest) detected in 20 Java systems. As a result, we observed that continuous feedback improves the performance of code smell detection. For the detection of the class-level code smells, God Class and Refused Bequest, we achieved an average improvement in terms of F1 of 0.13 and 0.58, respectively, after 50 iterations of feedback. For the method-level code smells, Long Method and Feature Envy, the improvements in F1 were 0.66 and 0.72, respectively.
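The feedback loop itself can be sketched generically; the oracle callable standing in for the expert service is a hypothetical API, not the paper's implementation. After each iteration, an instance labelled by the expert is added to the training set and the classifier is retrained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def feedback_loop(X_train, y_train, X_pool, oracle, iterations=50):
    # Assumes the pool holds at least `iterations` unlabelled instances.
    model = RandomForestClassifier(random_state=0)
    for _ in range(iterations):
        model.fit(X_train, y_train)
        idx = np.random.randint(len(X_pool))     # pick an unlabelled instance
        label = oracle(X_pool[idx])              # expert feedback (hypothetical)
        X_train = np.vstack([X_train, X_pool[idx]])
        y_train = np.append(y_train, label)
        X_pool = np.delete(X_pool, idx, axis=0)  # remove it from the pool
    return model
```

An active-learning variant would instead query the instances the model is least certain about rather than sampling at random.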
... than the other components of the system. The threshold used to measure "large" is derived by adapting a benchmark in the same way as defined by Arcelli et al. [44] for automatic metric threshold derivation. GC is detected at the container level (e.g., packages in Java). ...
... On the other hand, few studies specified the use of maintenance support tools (RQ-M5), with the main tool being static code analyzers. The handling of the thresholds that trigger metrics for this type of software was mentioned as a problem in previous publications [60], with code being flagged as defective when it is not, and vice versa. Software has benefitted from such tools, but their use with factory or default configurations should be avoided. ...
Article
Full-text available
While some areas of software engineering knowledge present great advances with respect to the automation of processes, tools, and practices, areas such as software maintenance have scarcely been addressed by either industry or academia, leaving the solution of technical tasks to manual or semiautomatic work by human capital. In this context, machine learning (ML) techniques play an important role when it comes to improving maintenance processes and automation practices, which can accelerate delegated but highly critical stages once the software launches. The aim of this article is to gain a global understanding of the state of ML-based software maintenance through the compilation, classification, and analysis of a set of studies related to the topic. The study was conducted by applying a systematic mapping study protocol, characterized by the use of a set of stages that strengthen its replicability. The review identified a total of 3776 research articles that were subjected to four filtering stages, ultimately selecting 81 articles that were analyzed thematically. The results reveal an abundance of proposals that use neural networks applied to preventive maintenance, and case studies that incorporate ML in subjects of maintenance management and the management of the people who carry out these tasks. Likewise, a significant number of studies lack the minimum characteristics of replicability.
... Some researchers do not create full machine learning models but instead decide to only tune parameters of pre-created models [37,38]. Often, this tuning uses sophisticated statistical [39] or evolutionary [40] techniques. ...
Article
Full-text available
Context Code smells are patterns in source code associated with an increased defect rate and a higher maintenance effort than usual, but without a clear definition. Code smells are often detected using rules hard-coded in detection tools. Such rules are often set arbitrarily or derived from data sets tagged by reviewers without the necessary industrial know-how. Conclusions from studying such data sets may be unreliable or even harmful, since algorithms may achieve higher values of performance metrics on them than on models tagged by experts, despite not being industrially useful. Objective Our goal is to investigate the performance of various machine learning algorithms for automated code smell detection trained on a code smell data set (MLCQ) derived from actively developed and industry-relevant projects, with reviews performed by experienced software developers. Method We assign the severity of the smell to a code sample according to a consensus between the severities assigned by the reviewers, use the Matthews Correlation Coefficient (MCC) as our main performance metric to account for the entire confusion matrix, and compare median values to account for non-normal distributions of performance. We compare 6720 models built using eight machine learning techniques. The entire process is automated and reproducible. Results The performance of the compared techniques depends heavily on the analyzed smell. The median value of our performance metric for the best algorithm was 0.81 for Long Method, 0.31 for Feature Envy, 0.51 for Blob, and 0.57 for Data Class. Conclusions Random Forest and Flexible Discriminant Analysis performed the best overall, but in most cases the performance difference between them and the median algorithm was no more than 10% of the latter. The performance results were stable over multiple iterations. Although the F-score omits one quadrant of the confusion matrix (and thus may differ from MCC), in code smell detection the actual differences are minimal.
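For reference, the MCC used here as the main performance metric is computed from all four quadrants of the confusion matrix (this is the standard formula, not code from the study):

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    # Matthews Correlation Coefficient; 0.0 by convention when undefined.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(round(mcc(tp=40, tn=45, fp=5, fn=10), 2))  # 0.7
```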
... Additionally, the central component is also overloaded with responsibility and has a high coupling. This structure is thus not desirable, as it increases the potential effort necessary to make changes to all of the elements involved in the smell. Originally, GC was defined using a fixed threshold on the lines of code; ARCAN, however, uses a variable threshold-detection approach based on the frequencies of the number of lines of code of the other packages in the system (Arcelli Fontana et al. 2015). ...
Article
Full-text available
Architectural smells (AS) are notorious for their long-term impact on the Maintainability and Evolvability of software systems. The majority of research work has investigated this topic by mining software repositories of open source Java systems, making it hard to generalise and apply the findings to an industrial context and other programming languages. To address this research gap, we conducted an embedded multiple-case case study, in collaboration with a large industry partner, to study how AS evolve in industrial embedded systems. We detect and track AS in 9 C/C++ projects with over 30 releases for each project that span over two years of development, with over 20 million lines of code in the last release alone. In addition to these quantitative results, we also interview 12 of the developers and architects working on these projects, collecting over six hours of qualitative data about the usefulness of AS analysis and the issues they experienced while maintaining and evolving artefacts affected by AS. Our quantitative findings show how individual smell instances evolve over time, how long they typically survive within the system, how they overlap with instances of other smell types, and finally what the introduction order of smell types is when they overlap. Our qualitative findings, instead, provide insights into the effects of AS on the long-term maintainability and evolvability of the system, supported by several excerpts from our interviews. Practitioners also mention which parts of the AS analysis actually provide actionable insights that they can use to plan refactoring activities.
... lines of code) than other components in the system [Martin Lippert, 2006] (see Figure 1d). Originally, GC was defined using a fixed threshold on the lines of code; Arcan, however, uses a variable threshold-detection approach based on the frequencies of the number of lines of code of the other packages in the system [Arcelli Fontana et al., 2015]. God components aggregate too many concerns together in a single artefact and are generally a sign of a missing opportunity to split the component into multiple sub-components. ...
... HL is detected by simply looking at the number of incoming and outgoing dependencies a certain artifact has: if the sum of these dependencies surpasses a certain system-based threshold, then the artifact is marked as a hub. Finally, GC is detected using an automatically calculated variable threshold [31] with a precision of 100% [5]. Next, the results of Arcan were also validated in an industrial setting by two different studies: first on industrial C/C++ projects, obtaining 50% precision [32], and then on industrial Java projects, obtaining 70% precision. ...
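The Hub-Like Dependency check reduces to comparing each artifact's total degree against a system-based threshold. In the sketch below, the median plus the interquartile range is an illustrative stand-in, not Arcan's actual threshold rule.

```python
import statistics

def find_hubs(degrees):
    """degrees: {artifact: (fan_in, fan_out)}"""
    totals = {a: fi + fo for a, (fi, fo) in degrees.items()}
    values = sorted(totals.values())
    q1, _, q3 = statistics.quantiles(values, n=4)
    threshold = statistics.median(values) + (q3 - q1)  # illustrative rule
    return [a for a, total in totals.items() if total > threshold]

print(find_hubs({"core": (25, 30), "util": (5, 3), "io": (4, 6), "net": (3, 2)}))
# ['core']
```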
Article
Full-text available
Although architectural smells are one of the most studied type of architectural technical debt, their impact on maintenance effort has not been thoroughly investigated. Studying this impact would help to understand how much technical debt interest is being paid due to the existence of architecture smells and how this interest can be calculated. This work is a first attempt to address this issue by investigating the relation between architecture smells and source code changes. Specifically, we study whether the frequency and size of changes are correlated with the presence of a selected set of architectural smells. We detect architectural smells using the Arcan tool, which detects architectural smells by building a dependency graph of the system analyzed and then looking for the typical structures of the architectural smells. The findings, based on a case study of 31 open-source Java systems, show that 87% of the analyzed commits present more changes in artifacts with at least one smell, and the likelihood of changing increases with the number of smells. Moreover, there is also evidence to confirm that change frequency increases after the introduction of a smell and that the size of changes is also larger in smelly artifacts. These findings hold true especially in Medium–Large and Large artifacts.
... Smell detection methods and the number of surveyed studies in [28]. Metrics-based smell detection methods are relatively easy to implement; however, a non-trivial challenge posed by these methods is the choice of the thresholds, as pointed out by the software engineering community (see, for instance, [30,31]). On this point, Lacerda et al. [32] write the following: "There is no consensus on the standard threshold values for the detection of smells, which are the cause of the disparity in the results of different approaches." ...
Article
Full-text available
Many scholars have reported that the adoption of Model Driven Engineering (MDE) in industry is still marginal. Real-life case studies, complete with convincing empirical data about the quality of the developed source code, are an effective way to persuade industry that the adoption of MDE brings actual added value. This paper reports on the assessment of the quality of the code output by xGenerator: a Java technology platform for the development of enterprise Web applications, which implements the MDE paradigm. Two recent papers by Aniche and his colleagues were selected to carry out the measurements. The former study is about metrics and thresholds for MVC Web applications, while the latter presents a catalog of six smells tailored to MVC Web applications. A big merit of both of these proposals is that they fix the metric thresholds by taking into account the MVC software architecture. The results of the empirical assessment, carried out on a real-life project, showed that the quality of the code is high.
... To broaden the scope of the discussion, the poor performance achieved by the considered tools reinforces the preliminary research efforts to devise approaches for the automatic/adaptive configuration of static analysis tools [33,13] as well as for the automatic derivation of proper thresholds to use when locating the presence of design issues in source code [2,17]. It might indeed be possible that the integration of those approaches into the inner workings of the currently available static analysis tools could lead to a reduction of the number of false positive items. ...
Preprint
Full-text available
Background. Developers use Automated Static Analysis Tools (ASATs) to control for potential quality issues in source code, including defects and technical debt. Tool vendors have devised quite a number of tools, which makes it harder for practitioners to select the most suitable one for their needs. To better support developers, researchers have been conducting several studies on ASATs to favor the understanding of their actual capabilities. Aims. Despite the work done so far, there is still a lack of knowledge regarding (1) which source quality problems can actually be detected by static analysis tool warnings, (2) what is their agreement, and (3) what is the precision of their recommendations. We aim at bridging this gap by proposing a large-scale comparison of six popular static analysis tools for Java projects: Better Code Hub, CheckStyle, Coverity Scan, Findbugs, PMD, and SonarQube. Method. We analyze 47 Java projects and derive a taxonomy of warnings raised by 6 state-of-the-practice ASATs. To assess their agreement, we compared them by manually analyzing - at line-level - whether they identify the same issues. Finally, we manually evaluate the precision of the tools. Results. The key results report a comprehensive taxonomy of ASATs warnings, show little to no agreement among the tools and a low degree of precision. Conclusions. We provide a taxonomy that can be useful to researchers, practitioners, and tool vendors to map the current capabilities of the tools. Furthermore, our study provides the first overview on the agreement among different tools as well as an extensive analysis of their precision.
... Therefore, we normalized these values to fit uniformly into the [0, 1] range. There are several approaches published in the literature for normalizing software metrics or deriving appropriate threshold values [3,46,5,37,18]. The simplest method would be to take the maximum value in the bug dataset for each metric and divide all the values by this maximum. Even though this would transform the values into the desired range, it has some serious flaws. ...
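The flaw of max-scaling the authors allude to is easy to demonstrate: one outlier compresses every other value toward zero. A common alternative, sketched here with an assumed 90th-percentile clip point (an illustrative choice, not the paper's), keeps values in [0, 1] without that distortion.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 500], dtype=float)

naive = values / values.max()               # the outlier squashes everything else
p90 = np.percentile(values, 90)             # clip point: an illustrative choice
robust = np.clip(values, None, p90) / p90   # still in [0, 1], far less distorted

print(naive[:3])   # tiny, nearly indistinguishable values
print(robust[:3])  # the same values remain clearly separated
```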
Preprint
Full-text available
Forecasting the defect proneness of source code has long been a major research concern. Having an estimation of those parts of a software system that most likely contain bugs may help focus testing efforts, reduce costs, and improve product quality. Many prediction models and approaches have been introduced during the past decades that try to forecast buggy code elements based on static source code metrics, change and history metrics, or both. However, there is still no universal best solution to this problem, as the most suitable features and models vary from dataset to dataset and depend on the context in which we use them. Therefore, novel approaches and further studies on this topic are highly necessary. In this paper, we employ a chemometric approach, Partial Least Squares with Discriminant Analysis (PLS-DA), for predicting bug-prone classes in Java programs using static source code metrics. To the best of our knowledge, PLS-DA has never been used before as a statistical approach in the software maintenance domain for predicting software errors. In addition, we have used rigorous statistical treatments, including bootstrap resampling and randomization (permutation) tests, to evaluate and present the software engineering results. We show that our PLS-DA based prediction model achieves superior performance compared to the state-of-the-art approaches (i.e., an F-measure of 0.44-0.47 at the 90% confidence level) when no data re-sampling is applied, and comparable performance when applying up-sampling on the largest open bug dataset, while training the model is significantly faster, thus finding optimal parameters is much easier. In terms of completeness, which measures the amount of bugs contained in the Java classes predicted to be defective, PLS-DA outperforms every other algorithm: it found 69.3% and 79.4% of the total bugs with no re-sampling and with up-sampling, respectively.
... Another challenge is related to the correctness of outcomes. Too many false positives or false negatives can be produced when information related to the size, design, domain, and context of the analyzed dataset is not considered [31]. This increasing volume and variety of information has made it tedious for programmers to manually detect every possible code smell. ...
Article
Context Code smells are symptoms that something may be wrong in software systems and can cause complications in maintaining software quality. In the literature, many code smells exist, and their identification is far from trivial. Thus, several techniques have been proposed to automate code smell detection in order to improve software quality. Objective This paper presents an up-to-date review of simple and hybrid machine learning based code smell detection techniques and tools. Methods We collected all the relevant research published in this field up to 2020. We extracted the data from those articles and classified them into two major categories. In addition, we compared the selected studies based on several aspects such as code smells, machine learning techniques, datasets, programming languages used by the datasets, dataset size, evaluation approach, and statistical testing. Results The majority of empirical studies have proposed machine learning based code smell detection tools. Support vector machine and decision tree algorithms are frequently used by researchers. Along with this, a major proportion of research is conducted on open-source software (OSS) such as Xerces, Gantt Project, and ArgoUML. Furthermore, researchers have paid more attention to the Feature Envy and Long Method code smells. Conclusion We identified several areas of open research, such as the need for code smell detection techniques using hybrid approaches and the need for validation employing industrial datasets.
... While a number of heuristic-based techniques, relying on different types of software metrics, have been devised (e.g., [27,29,32]), a recent trend is represented by the use of machine learning approaches [4]. In particular, machine learning has the potential to address some common limitations of heuristic-based approaches: (1) the subjectivity with which their output is interpreted by developers [14,28], (2) the need for defining thresholds for the detection [15], and (3) the low agreement among them [13]. Indeed, machine learning may be exploited to combine multiple metrics, learning code smell instances considered relevant by developers without the specification of any threshold [4]. ...
Conference Paper
Full-text available
Code smells are poor implementation choices applied during software evolution that can affect source code maintainability. While several heuristic-based approaches have been proposed in the past, machine learning solutions have recently gained attention since they may potentially address some limitations of state-of-the-art approaches. Unfortunately, however, machine learning-based code smell detectors still suffer from low accuracy. In this paper, we aim at advancing the knowledge in the field by investigating the role of static analysis warnings as features of machine learning models for the detection of three code smell types. We first verify the potential contribution given by these features. Then, we build code smell prediction models exploiting the most relevant features coming from the first analysis. The main finding of the study is that the warnings given by the considered tools lead the performance of code smell prediction models to increase drastically with respect to what was reported by previous research in the field.
... The problem of deriving appropriate threshold values, in the context of software measurement and metric-based code smell detection, has been extensively investigated by several researchers, who applied various statistical methods and machine learning techniques on a large number of software projects [49], [50], [51], [52], [53]. Dig [54] showed that precision and recall can vary significantly for the same software system based on the selected threshold value. ...
Article
Full-text available
Refactoring detection is crucial for a variety of applications and tasks: (i) empirical studies about code evolution, (ii) tools for library API migration, and (iii) code reviews and change comprehension. However, recent research has questioned the accuracy of the state-of-the-art refactoring mining tools, which poses threats to the reliability of the detected refactorings. Moreover, the majority of refactoring mining tools depend on code similarity thresholds. Finding universal threshold values that can work well for all projects, regardless of their architectural style, application domain, and development practices, is extremely challenging. Therefore, in a previous work [1], we introduced the first refactoring mining tool that does not require any code similarity thresholds to operate. In this work, we extend our tool to support low-level refactorings that take place within the body of methods. To evaluate our tool, we created one of the most accurate, complete, and representative refactoring oracles to date, including 7,226 true instances for 40 different refactoring types detected by one (minimum) up to six (maximum) different tools, and validated by one up to four refactoring experts. Our evaluation showed that our approach achieves the highest average precision (99.6%) and recall (94%) among all competitive tools, and on median is 2.6 times faster than the second-fastest competitive tool.
... The detection strategy of the Brain Persistence Method relies on two metrics - structural complexity of methods and structural complexity of SQL queries - and two threshold values defining medium and high values for these metrics. We defined these threshold values based on the statistical measurement defined by Fontana et al. [6]. To do so, we computed the second and third quartiles, rounded their values, and considered values above the second quartile to be of medium complexity and values above the third quartile to be high. ...
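A minimal sketch of that quartile-based calibration, assuming the raw metric values are available as a flat list; the function name and data are illustrative:

```python
import numpy as np

def quartile_thresholds(values):
    """Derive MEDIUM/HIGH thresholds from a metric's distribution:
    values above the second quartile (median) count as medium,
    values above the third quartile as high, both rounded."""
    q2, q3 = np.percentile(values, [50, 75])
    return round(q2), round(q3)

# Illustrative data: structural complexity measured on a set of methods.
complexities = [1, 2, 2, 3, 3, 4, 5, 7, 9, 14, 21]
medium, high = quartile_thresholds(complexities)
print(f"medium > {medium}, high > {high}")
```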
Conference Paper
Full-text available
If, on one hand, frameworks allow programmers to reuse well-known architectural solutions, on the other hand they can make programmers unaware of important design decisions that should be followed during software construction, maintenance, and evolution. If programmers are unaware of these design decisions, there is a high risk of introducing design violations into the source code, and the accumulation of these violations might hinder software maintainability and evolvability. Static analysis tools might be employed to mitigate these problems by assisting the detection of recurring design violations in a given architectural pattern. In this work, we present MTV-Checker, a tool to assist the automatic detection of 5 design violations in Django-based web applications. We also conducted an empirical study in the context of the SUAP system, a large-scale Django-based information system with more than 175,000 lines of Python code currently deployed in more than 30 Brazilian institutions. Our results present the most recurrent violations, how they evolve along software evolution, and the opinions and experiences of software architects regarding these violations.
... On the basis of these detection rules, a class/method of a project is marked as smelly if one of the logical propositions shown in Table 2 is true, i.e., if the actual metrics computed on the class/method exceed the threshold values defined in the detection strategy. It is worth pointing out that the thresholds used by JCodeOdor were empirically calibrated on 74 systems belonging to the Qualitas Corpus dataset [115] and are derived from the statistical distribution of the metrics contained in the dataset [32,33]. ...
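A hedged sketch of what such a metrics-based detection rule looks like in practice. The metric names (WMC, ATFD, TCC) and threshold values below are illustrative examples in the style of a God Class strategy, not JCodeOdor's actual calibration:

```python
# A class is marked smelly when a logical proposition over its metric
# values and calibrated thresholds holds. Thresholds here are examples.
THRESHOLDS = {"WMC": 47, "ATFD": 5, "TCC": 0.33}

def is_god_class(m: dict) -> bool:
    return (m["WMC"] >= THRESHOLDS["WMC"]        # high class complexity
            and m["ATFD"] > THRESHOLDS["ATFD"]   # accesses much foreign data
            and m["TCC"] < THRESHOLDS["TCC"])    # low cohesion

print(is_god_class({"WMC": 60, "ATFD": 9, "TCC": 0.10}))  # True
```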
Preprint
Code smells represent sub-optimal implementation choices applied by developers when evolving software systems. The negative impact of code smells has been widely investigated in the past: besides developers' productivity and ability to comprehend source code, researchers empirically showed that the presence of code smells heavily impacts the change-proneness of the affected classes. On the basis of these findings, in this paper we conjecture that code smell-related information can be effectively exploited to improve the performance of change prediction models, i.e., models whose goal is to indicate to developers which classes are more likely to change in the future, so that they may apply preventive maintenance actions. Specifically, we exploit the so-called intensity index - a previously defined metric that captures the severity of a code smell - and evaluate its contribution when added as an additional feature in the context of three state-of-the-art change prediction models based on product, process, and developer-based features. We also compare the performance achieved by the proposed model with that of an alternative technique that considers the previously defined antipattern metrics, namely a set of indicators computed considering the history of code smells in files. Our results report that (i) the prediction performance of the intensity-including models is statistically better than that of the baselines and (ii) the intensity is a more powerful metric with respect to the alternative smell-related ones.
... The quantiles of a distribution are used by Vale and Figueiredo [31] and the quantiles of a weighted distribution by Alves et al. [2]. Bad smells are also defined and used by setting a threshold value on one or more variables by Lanza and Marinescu [16], Arcelli et al. [8], and Filó et al. [7]. None of these approaches comply with the properties of [26], basically because they are not based on information about the actual faultiness of modules, nor do they use any fault-proneness models. ...
Conference Paper
Background Binary Logistic Regression is widely used in Empirical Software Engineering to build estimation models, e.g., fault-proneness models, which estimate the probability that a given module is faulty, based on some measures of the module. Fault-proneness models are then used to build faultiness models, i.e., models that estimate whether a given module is faulty or non-faulty. Objective Because of the very nature of Binary Logistic Regression, there is always a range of values of the independent variable in which estimates are very close to the estimates that would be obtained via random estimation. These estimates are hardly accurate, and should be regarded as not reliable. For Binary Logistic Regression models used to build faultiness models---i.e., binary classifiers---a range where estimates are inaccurate can be regarded as an "uncertainty" area, where the model is uncertain about the faultiness of the given modules. We define and empirically validate a simple method to identify the uncertainty region. Method We compute the standard deviation of a probability estimate provided by a Binary Logistic Regression model. If the random estimate is within the range centered on the estimate and spanning a standard deviation, we regard the estimate as "too close" to the random estimate, hence unreliable. On the contrary, estimates that are far enough from the random estimate are considered reliable. This method was tested on 54 datasets from the PROMISE (now SEACRAFT) repository. Results Our results show that the variance of estimates can be effectively used to detect the uncertainty region. Estimates in the uncertainty area are rarely statistically significant, and always much less accurate than estimates obtained out of the uncertainty area. Conclusions Practitioners and researchers can use our results to assess the reliability of estimates obtained via Binary Logistic Regression models, and reject (or challenge) those estimates that fall in the uncertainty region.
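A sketch of the uncertainty-region idea under stated assumptions: a single-predictor logistic model, the observed fault rate taken as the "random estimate", and the standard deviation of each probability estimate obtained via the delta method. Data and reading of the criterion are illustrative, not the study's exact procedure:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
size = rng.lognormal(4.0, 1.0, 200)                  # illustrative module-size metric
faulty = (rng.random(200) < 1 / (1 + np.exp(3 - 0.01 * size))).astype(int)

X = sm.add_constant(size)
fit = sm.Logit(faulty, X).fit(disp=0)

eta = X @ fit.params                                 # linear predictor
p = 1 / (1 + np.exp(-eta))                           # estimated fault probability
var_eta = np.einsum("ij,jk,ik->i", X, fit.cov_params(), X)
sd_p = p * (1 - p) * np.sqrt(var_eta)                # delta-method std. dev. of p

random_estimate = faulty.mean()                      # assumption: "random" = base rate
uncertain = np.abs(p - random_estimate) < sd_p / 2   # range spanning one std. dev.
print(f"{uncertain.mean():.0%} of modules fall in the uncertainty region")
```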
Article
Fixing software bugs can be colossally expensive, especially if they are discovered in the later phases of the software development life cycle. As such, bug prediction has been a classic problem for the research community. As of now, the Google Scholar site generates ∼113,000 hits if searched with the “bug prediction” phrase. Despite this staggering effort by the research community, bug prediction research is criticized for not being decisively adopted in practice. A significant problem of the existing research is the granularity level (i.e., class/file level) at which bug prediction is historically studied. Practitioners find it difficult and time-consuming to locate bugs at the class/file level granularity. Consequently, method-level bug prediction has become popular in the last decade. We ask, are these method-level bug prediction models ready for industry use? Unfortunately, the answer is no. The reported high accuracies of these models dwindle significantly if we evaluate them in different realistic time-sensitive contexts. It may seem hopeless at first, but encouragingly, we show that future method-level bug prediction can be improved significantly. In general, we show how to reliably evaluate future method-level bug prediction models, and how to improve them by focusing on four different improvement avenues: building noise-free bug data, addressing concept drift, selecting similar training projects, and developing a mixture of models. Our findings are based on three publicly available method-level bug datasets, and a newly built bug dataset of 774,051 Java methods originating from 49 open-source software projects.
Article
Full-text available
Context: A code smell indicates a flaw in the design, implementation, or maintenance process that could degrade the software’s quality and potentially cause future disruptions. Since being introduced by Beck and Fowler, the term code smell has attracted several studies from researchers and practitioners. However, over time, studies are needed to discuss whether this issue is still interesting and relevant. Objective: Conduct a thorough systematic literature review to learn the most recent state of the art in the study of code smells, including detection methods, practices, and challenges, and to provide an overview of trends and the future relevance of the code smell topic: whether it is still developing or whether the discussion has shifted. Method: The search methodology was employed to identify pertinent scholarly articles from reputable databases such as ScienceDirect, IEEE Xplore, ACM Digital Library, SpringerLink, ProQuest, and CiteSeerX. The application of inclusion and exclusion criteria serves to filter the search results. In addition, forward and backward snowballing techniques are employed to enhance the comprehensiveness of the results. Results: The inquiry yielded 354 scholarly articles published over the timeframe spanning from January 2013 to July 2022. After inclusion, exclusion, and snowballing techniques were applied, 69 main studies regarding code smells were identified. Many researchers focus on detecting code smells, primarily via machine learning techniques and, to a lesser extent, deep learning methods. Additional subjects encompass the ramifications of code smells, code smells within specific contexts, the correlation between code smells and software metrics, and facets related to security, refactoring, and development habits. The contexts and types of code smells studied vary. Some tools used are JSpIRIT, aDoctor, CAME, and SonarQube. The studies also explore the concept of design smells and anti-pattern detection. While no single dominant technique for code smell detection has emerged, several aspects of code smell detection still need to be examined. Conclusion: The findings underscore the evolution of scholarly attention towards code smells over the years. This study identified significant journals and conferences and influential researchers in this field. The detection methods used include empirical, machine learning, and deep learning approaches. However, challenges include subjective interpretation and limited contextual applicability.
Article
Full-text available
The presence of code smells complicates the source code and can obstruct the development and functionality of a software project. Since they represent improper behavior that might have an adverse effect on software maintenance, code smells are behavioral in nature. Python is widely used for various software engineering activities and tends to contain code smells that affect its quality. This study investigates five code smells diffused in 20 Python software systems comprising 10,550 classes and analyses their severity index using metric distribution at the class level. Subsequently, a behavioral analysis is conducted over the considered modification period (phases) for code smells in classes undergoing change proneness. Furthermore, it helps to identify an accurate multinomial classifier for mining the severity index. The study witnesses the change in severity at the class level over the modification period by mapping its characteristics over various statistical functions and hypotheses. Our findings reveal that the Cognitive Complexity code smell is the most severe one. The remaining four smells are centered around the moderate range, having an average severity index value. The results suggest that the J48 algorithm was the most accurate multinomial classifier for classifying the severity of code smells, with 92.98% accuracy in combination with the AdaBoost method. The findings of our empirical evaluation can be beneficial for software developers to prioritize code smells in the pre-refactoring phase and can help manage code smells in forthcoming releases, subsequently saving ample time and resources spent in the development and maintenance of software projects.
Article
Background Developers use Static Analysis Tools (SATs) to control for potential quality issues in source code, including defects and technical debt. Tool vendors have devised quite a number of tools, which makes it harder for practitioners to select the most suitable one for their needs. To better support developers, researchers have been conducting several studies on SATs to favor the understanding of their actual capabilities. Aims Despite the work done so far, there is still a lack of knowledge regarding (1) what is their agreement, and (2) what is the precision of their recommendations. We aim at bridging this gap by proposing a large-scale comparison of six popular SATs for Java projects: Better Code Hub, CheckStyle, Coverity Scan, FindBugs, PMD, and SonarQube. Methods We analyze 47 Java projects applying 6 SATs. To assess their agreement, we compared them by manually analyzing - at line- and class-level - whether they identify the same issues. Finally, we evaluate the precision of the tools against a manually-defined ground truth. Results The key results show little to no agreement among the tools and a low degree of precision. Conclusion Our study provides the first overview on the agreement among different tools as well as an extensive analysis of their precision that can be used by researchers, practitioners, and tool vendors to map the current capabilities of the tools and envision possible improvements.
Article
Full-text available
Context Facebook’s React is a widely popular JavaScript library to build rich and interactive user interfaces (UI). However, due to the complexity of modern Web UIs, React applications can have hundreds of components and source code files. Therefore, front-end developers are facing increasing challenges when designing and modularizing React-based applications. As a result, it is natural to expect maintainability problems in React-based UIs due to suboptimal design decisions. Objective To help developers with these problems, we propose a catalog with twelve React-related code smells and a prototype tool to detect the proposed smells in React-based Web apps. Method The smells were identified by conducting a grey literature review and by interviewing six professional software developers. We also use the tool in the top-10 most popular GitHub projects that use React and conducted a historical analysis to check how often developers remove the proposed smells. Results We detect 2,565 instances of the proposed code smells. The results show that the removal rates range from 0.9% to 50.5%. The smell with the most significant removal rate is Large File (50.5%). The smells with the lowest removal rates are Inheritance Instead of Composition (IIC) (0.9%), and Direct DOM Manipulation (14.7%). Conclusion The list of React smells proposed in this paper as well as the tool to detect them can assist developers to improve the source code quality of React applications. While the catalog describes common problems with React applications, our tool helps to detect them. Our historical analysis also shows the importance of each smell from the developers’ perspective, showing how often each smell is removed.
Chapter
The presence of bad smells in code hampers software’s maintainability, comprehensibility, and extensibility. A type of code smell, which is common in software projects, is the “duplicated code” bad smell, also known as code clones. These types of smells generally arise in a software system due to the copy-paste-modify actions of software developers. They can either be exact copies or copies with certain modifications. Different clone detection techniques exist, which can be broadly classified as text-based, token-based, abstract syntax tree-based (AST-based), metrics-based, or program dependence graph-based (PDG-based) approaches, based on the amount of preprocessing required on the input source code. Researchers have also built clone detection techniques using a hybrid of two or more of the approaches described above. In this paper, we did a narrative review of the metrics-based techniques (solo or hybrid) reported in previously published studies and analyzed them for their quality in terms of run-time efficiency, accuracy values, and the types of clones they detect. This study can be helpful for practitioners to select an appropriate set of metrics, measuring all the code characteristics required for clone detection in a particular scenario. Keywords: Clone detection, Metrics-based techniques, Hybrid clone detection techniques, Categorization, Qualitative analysis
Article
Defect prediction is commonly used to reduce the effort from the testing phase of software development. A promising strategy is to use machine learning techniques to predict which software components may be defective. Features are key factors to the prediction’s success, and thus extracting significant features can improve the model’s accuracy. In particular, code smells are a category of those features that have been shown to improve the prediction performance significantly. However, Design code smells, a state-of-the-art collection of code smells based on the violations of the object-oriented programming principles, have not been studied in the context of defect prediction. In this paper, we study the performance of defect prediction models by training multiple classifiers for 97 real projects. We compare using Design code smells as features and using other Traditional smells from the literature and both. Moreover, we cluster and analyze the models’ performance based on the categories of Design code smells. We conclude that the models trained with both the Design code smells and the smells from the literature performed the best, with an improvement of 4.1% for the AUC score, compared to models trained with only Traditional smells. Consequently, Design smells are a good addition to the smells commonly studied in the literature for defect prediction.
Article
Code smells arise due to the consistent adoption of bad programming and implementation styles during the evolution of the software, which adversely affects software quality. They are focused on and prioritized for effective removal based on their severity. The study proposes a hybrid approach for inspecting severity based on code smell intensity in the Kotlin language and comparing code smells found to be equivalent in the Java language. The research work is examined on five common code smells, namely complex method, large class, long method, long parameter list, string literal duplication, and too many methods, over 30 open-source systems (15 Kotlin/15 Java). The experiment compares different machine learning algorithms for the computation of human-readable code smell detection rules for Kotlin, where the JRip algorithm proved to be the best machine learning algorithm, with 96% overall precision and 97% accuracy, validated at 10-fold cross-validation. Further, the severity of code smells at the class level is evaluated for prioritization of applications written in the Kotlin and Java languages. Moreover, the process of severity computation is semiautomated using the CART model, and thus, metric-based severity classification rules are achieved. The experimentation provides a complete understanding of the prioritization of code smells in Kotlin and Java and helps to attain prioritized refactoring, which will enhance the utilization of resources and minimize the overhead rework cost.
Article
Full-text available
Design smell detection has proven to be an efficient strategy to improve software quality and consequently decrease maintainability expenses. This work explores the influence of information about project context, expressed as project domain and size category, on the automatic detection of the god class design smell by machine learning techniques. A set of experiments using eight classifiers to detect god classes was conducted on a dataset containing 12,587 classes from 24 Java projects. The results show that classifiers change their behavior when they are used on datasets that differ in these kinds of project information. They also show that god class design smell detection can be improved by feeding machine learning classifiers with this project context information.
Article
In software development, it is easy to introduce code smells owing to the complexity of projects and the negligence of programmers. Code smells reduce code comprehensibility and maintainability, making programs error-prone. Hence, code smell detection is extremely important. Recently, machine learning-based techniques have become the mainstream detection approaches, showing promising performance. However, existing machine learning methods have two limitations: (1) most studies only focus on common smells, and (2) the proposed metrics are not effective when used for uncommon code smell detection, e.g., change barrier based code smells. To overcome these limitations, this paper investigates the detection of uncommon change barrier based code smells. We study three typical code smells, i.e., Divergent Change, Shotgun Surgery, and Parallel Inheritance, which all belong to change barriers. We analyze the characteristics of change barriers and extract domain-specific metrics to train a Logistic Regression model for detection. The experimental results show that our method achieves 81.8%–100% precision and recall, outperforming existing algorithms by 10%–30%. In addition, we analyze the correlation and importance of the utilized metrics. We find our domain-specific metrics are important for the detection of change barriers. The results would help practitioners better design detection tools for such code smells.
Article
Context Software metrics may be an effective tool to assess the quality of software, but to guide their use it is important to define their thresholds. Bad smells and faults also impact the quality of software. Extracting metrics from software systems is relatively low cost, since there are tools widely used for this purpose, which makes it feasible to apply software metrics to identify bad smells and to predict faults. Objective To inspect whether thresholds of object-oriented metrics may be used to aid bad smell detection and fault prediction. Method To direct this research, we have defined three research questions (RQ), two related to the identification of bad smells, and one for identifying faults in software systems. To answer these RQs, we have proposed detection strategies for the bad smells Large Class, Long Method, Data Class, Feature Envy, and Refused Bequest, based on metrics and their thresholds. To assess the quality of the derived thresholds, we have conducted two studies. The first one was conducted to evaluate their efficacy in detecting these bad smells on 12 systems. A second study was conducted to investigate, for each of the class-level software metrics DIT, LCOM, NOF, NOM, NORM, NSC, NSF, NSM, SIX, and WMC, whether the ranges of values determined by thresholds are useful to identify faults in software systems. Results Both studies confirm that metric thresholds may support the prediction of faults in software and are effective in the detection of bad smells. Conclusion The results of this work suggest practical applications of metric thresholds to identify bad smells and predict faults and hence support software quality assurance activities. Their use may help developers to focus their efforts on classes that tend to fail, thereby minimizing the occurrence of future problems.
Article
Full-text available
Code smells are sub-optimal implementation choices applied by developers that have the effect of negatively impacting, among others, the change-proneness of the affected classes. Based on this consideration, in this paper we conjecture that code smell-related information can be effectively exploited to improve the performance of change prediction models, i.e., models having the goal of indicating which classes are more likely to change in the future. We exploit the so-called intensity index—a previously defined metric that captures the severity of a code smell—and evaluate its contribution when added as additional feature in the context of three state of the art change prediction models based on product, process, and developer-based features. We also compare the performance achieved by the proposed model with a model based on previously defined antipattern metrics, a set of indicators computed considering the history of code smells in files. Our results report that (i) the prediction performance of the intensity-including models is statistically better than the baselines and, (ii) the intensity is a better predictor than antipattern metrics. We observed some orthogonality between the set of change-prone and non-change-prone classes correctly classified by the models relying on intensity and antipattern metrics: for this reason, we also devise and evaluate a smell-aware combined change prediction model including product, process, developer-based, and smell-related features. We show that the F-Measure of this model is notably higher than other models.
Chapter
Software that is badly written and prone to design problems often smells. Code smells result in design anomalies that make software hard to understand and maintain. Several tools and techniques available in the literature help in the detection of code smells. However, the severity of the smells in the code is often not known immediately, as it lacks visualization. In this paper, two method-level code smells, namely long method and feature envy, are visualized using Chernoff faces. Techniques proposed in the literature use either a knowledge-driven approach or a data-driven approach for code smell detection. In the proposed approach, a fusion of knowledge-driven and data-driven approaches is used to identify the most relevant features. These most relevant features are mapped to the 15 desired features of Chernoff faces to visualize the behavior of the code. The results show that almost 95% of the smells are visualized correctly. This helps in analyzing the programmer’s capability in maintaining the quality of source code.
Article
Code smells or bad smells are an accepted approach to identify design flaws in the source code. Although they have been explored by researchers, the interpretation of programmers is rather subjective. One way to deal with this subjectivity is to use machine learning techniques. This paper provides the reader with an overview of machine learning techniques and code smells found in the literature, aiming to determine which methods and practices are used when applying machine learning for code smell identification and which machine learning techniques have been used for this purpose. A mapping study was used to identify the techniques used for each smell. We found that Bloaters were the main kind of smell studied, addressed by 35% of the papers. The most commonly used technique was Genetic Algorithms (GA), used by 22.22% of the papers. Regarding the smells addressed by each technique, there was a high level of redundancy, such that the smells are covered by a wide range of algorithms. Nevertheless, Feature Envy stood out, being targeted by 63% of the techniques. When it comes to performance, the best average was provided by Decision Tree, followed by Random Forest, Semi-supervised, and Support Vector Machine Classifier techniques. 5 out of the 25 analyzed smells were not handled by any machine learning technique. Most studies focus on several code smells, and in general there is no outperforming technique, except for a few specific smells. We also found a lack of comparable results due to the heterogeneity of the data sources and of the provided results. We recommend the pursuit of further empirical studies to assess the performance of these techniques on a standardized dataset to improve comparison reliability and replicability.
Article
Full-text available
Software metrics have been developed to measure the quality of software systems. A proper use of metrics requires thresholds to determine whether the value of a metric is acceptable or not. Many approaches propose to define thresholds based on large analyses of software systems. However, it has been shown that thresholds depend greatly on the context of the project. Thus, there is a need for an approach that computes thresholds by taking this context into account. In this paper we propose such an approach, with the objective of reaching a trade-off between the representativeness of the threshold and the computation cost. Our approach is based on an unbiased selection of software entities and makes no assumptions on the statistical properties of the software metric values. It can therefore be used by anyone, ranging from developer to manager, for computing a representative metric threshold tailored to their context.
Conference Paper
Full-text available
Meaningful thresholds are essential for promoting source code metrics as an effective instrument to control the internal quality of software systems. Despite the increasing number of source code measurement tools, no publicly available tools support the extraction of metric thresholds. Moreover, earlier studies suggest that in larger systems a significant number of classes exceed recommended metric thresholds. Therefore, in our previous study we introduced the notion of a relative threshold, i.e., a pair including an upper limit and a percentage of classes whose metric values should not exceed this limit. In this paper we propose RTTOOL, an open source tool for extracting relative thresholds from the measurement data of a benchmark of software systems. RTTOOL is publicly available at http://aserg.labsoft.dcc.ufmg.br/rttool.
Conference Paper
Full-text available
Software metrics have many uses, e.g., defect prediction, effort estimation, and benchmarking an organization against peers and industry standards. In all these cases, metrics may depend on the context, such as the programming language. Here we aim to investigate if the distributions of commonly used metrics do, in fact, vary with six context factors: application domain, programming language, age, lifespan, the number of changes, and the number of downloads. For this preliminary study we select 320 nontrivial software systems from SourceForge. These software systems are randomly sampled from nine popular application domains of SourceForge. We calculate 39 metrics commonly used to assess software maintainability for each software system and use the Kruskal-Wallis test and the Mann-Whitney U test to determine if there are significant differences among the distributions with respect to each of the six context factors. We use Cliff's delta to measure the magnitude of the differences and find that all six context factors affect the distribution of 20 metrics and the programming language factor affects 35 metrics. We also briefly discuss how each context factor may affect the distribution of metric values. We expect our results to help software benchmarking and other software engineering methods that rely on these commonly used metrics to be tailored to a particular context.
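A minimal sketch of this kind of comparison on synthetic data, using the tests named above as implemented in SciPy and deriving Cliff's delta from the Mann-Whitney U statistic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
wmc_java = rng.lognormal(2.0, 0.8, 150)   # WMC values from Java projects (synthetic)
wmc_c = rng.lognormal(1.6, 0.8, 150)      # WMC values from C projects (synthetic)

h, p_kw = stats.kruskal(wmc_java, wmc_c)        # Kruskal-Wallis test
u, p_mw = stats.mannwhitneyu(wmc_java, wmc_c)   # Mann-Whitney U test

# Cliff's delta follows directly from U: d = 2U / (n*m) - 1
delta = 2 * u / (len(wmc_java) * len(wmc_c)) - 1
print(f"Kruskal-Wallis p={p_kw:.3g}, Mann-Whitney p={p_mw:.3g}, Cliff's delta={delta:.2f}")
```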
Conference Paper
Full-text available
Establishing credible thresholds is a central challenge for promoting source code metrics as an effective instrument to control the internal quality of software systems. To address this challenge, we propose the concept of relative thresholds for evaluating metrics data following heavy-tailed distributions. The proposed thresholds are relative because they assume that metric thresholds should be followed by most source code entities, but that it is also natural to have a number of entities in the “long-tail” that do not follow the defined limits. In the paper, we describe an empirical method for extracting relative thresholds from real systems. We also report a study on applying this method in a corpus with 106 systems. Based on the results of this study, we argue that the proposed thresholds express a balance between real and idealized design practices.
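A minimal sketch of the relative-threshold concept, assuming a pair (p, k) is read as "at least p% of classes should have a metric value of at most k"; the example data and threshold are illustrative:

```python
def complies(values, p_percent, upper_limit):
    """Check a relative threshold: at least p_percent of the entities
    must have a metric value of at most upper_limit; the rest may sit
    in the heavy tail without violating the rule."""
    within = sum(v <= upper_limit for v in values)
    return within / len(values) * 100 >= p_percent

methods_per_class = [3, 5, 7, 2, 9, 40, 6, 4, 11, 8]  # illustrative metric values
print(complies(methods_per_class, 90, 14))             # True: 9 of 10 classes comply
```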
Article
Full-text available
With the growing need for quality assessment of entire software systems in the industry, new issues are emerging. First, because most software quality metrics are defined at the level of individual software components, there is a need for aggregation methods to summarize the results at the system level. Second, because a software evaluation requires the use of different metrics, with possibly widely varying output ranges, there is a need to combine these results into a unified quality assessment. In this paper we derive, from our experience on real industrial cases and from the scientific literature, requirements for an aggregation method. We then present a solution through the Squale model for metric aggregation, a model specifically designed to address the needs of practitioners. We empirically validate the adequacy of Squale through experiments on Eclipse. Additionally, we compare the Squale model to both traditional aggregation techniques (e.g., the arithmetic mean), and to econometric inequality indices (e.g., the Gini or the Theil indices), recently applied to aggregation of software metrics. Copyright © 2012 John Wiley & Sons, Ltd.
Article
Full-text available
This paper documents a time series dataset on the evolution of seventeen object-oriented metrics extracted from ten open-source systems. By making this dataset public our goal is to assist researchers with interest in software evolution analysis and modeling.
Article
Full-text available
One of the goals of software engineering research is to achieve generality: Are the phenomena found in a few projects reflective of what goes on in others? Will a technique benefit more than just the projects it is evaluated on? The discipline of our community has gained rigor over the past twenty years and is now attempting to achieve generality through evaluation and study of an increasing number of software projects (sometimes hundreds!). However, quantity is not the only important component. Selecting projects that are representative of a larger body of software of interest is just as critical. Little attention has been paid to selecting projects in such a way that generality and representativeness is maximized or even quantitatively characterized and reported. In this paper, we present a general technique for quantifying how representative a sample of software projects is of a population across many dimensions. We also present a greedy algorithm for choosing a maximally representative sample. We demonstrate our technique on research presented over the past two years at ICSE and FSE with respect to a population of 20,000 active open source projects. Finally, we propose methods of reporting objective measures of representativeness in research.
Article
Full-text available
Code smells are structural characteristics of software that may indicate a code or design problem that makes software hard to evolve and maintain, and may trigger refactoring of code. Recent research is active in defining automatic detection tools to help humans in finding smells when code size becomes unmanageable for manual review. Since the definitions of code smells are informal and subjective, assessing how effective code smell detection tools are is both important and hard to achieve. This paper reviews the current panorama of the tools for automatic code smell detection. It defines research questions about the consistency of their responses, their ability to expose the regions of code most affected by structural decay, and the relevance of their responses with respect to future software evolution. It gives answers to them by analyzing the output of four representative code smell detectors applied to six different versions of GanttProject, an open source system written in Java. The results of these experiments cast light on what current code smell detection tools are able to do and what the relevant areas for further improvement are.
Article
Full-text available
We provide an overview of the approach developed by the Software Improvement Group for code analysis and quality consulting focused on software maintainability. The approach uses a standardized measurement model based on the ISO/IEC 9126 definition of maintainability and source code metrics. Procedural standardization in evaluation projects further enhances the comparability of results. Individual assessments are stored in a repository that allows any system at hand to be compared to the industry-wide state of the art in code quality and maintainability. When a minimum level of software maintainability is reached, the certification body of TÜV Informationstechnik GmbH issues a Trusted Product Maintainability certificate for the software product. Keywords: Software product quality, Benchmarking, Certification, Standardization
Conference Paper
Full-text available
A wide variety of software metrics have been proposed and a broad range of tools is available to measure them. However, the effective use of software metrics is hindered by the lack of meaningful thresholds. Thresholds have been proposed for a few metrics only, mostly based on expert opinion and a small number of observations. Previously proposed methodologies for systematically deriving metric thresholds have made unjustified assumptions about the statistical properties of source code metrics. As a result, the general applicability of the derived thresholds is jeopardized. We designed a method that determines metric thresholds empirically from measurement data. The measurement data for different software systems are pooled and aggregated after which thresholds are selected that (i) bring out the metric's variability between systems and (ii) help focus on a reasonable percentage of the source code volume. Our method respects the distributions and scales of source code metrics, and it is resilient against outliers in metric values or system size. We applied our method to a benchmark of 100 object-oriented software systems, both proprietary and open-source, to derive thresholds for metrics included in the SIG maintainability model.
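A sketch, under our reading of the method, of deriving thresholds from pooled benchmark data by weighting each entity by its share of code volume and reading thresholds off chosen weighted percentiles; the data and the percentile choices are illustrative:

```python
import numpy as np

def weighted_percentile_thresholds(metric, loc, percentiles=(70, 80, 90)):
    """Order entities by metric value, accumulate their share of code
    volume (LOC), and return the metric value at each chosen percentile
    of cumulative code volume."""
    order = np.argsort(metric)
    metric, loc = np.asarray(metric)[order], np.asarray(loc)[order]
    cum = np.cumsum(loc) / loc.sum() * 100          # cumulative % of code volume
    return [int(metric[np.searchsorted(cum, p)]) for p in percentiles]

mccabe = [1, 1, 2, 2, 3, 4, 6, 8, 12, 25]           # per-method complexity (pooled)
method_loc = [20, 25, 30, 30, 35, 40, 45, 50, 60, 80]
print(weighted_percentile_thresholds(mccabe, method_loc))
```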
Article
Full-text available
There are a large number of different definitions used for sample quantiles in statistical computer packages. Often within the same package one definition will be used to compute a quantile explicitly, while other definitions may be used when producing a boxplot, a probability plot, or a QQ plot. We compare the most commonly implemented sample quantile definitions by writing them in a common notation and investigating their motivation and some of their properties. We argue that there is a need to adopt a standard definition for sample quantiles so that the same answers are produced by different packages and within each package. We conclude by recommending that the median-unbiased estimator be used because it has most of the desirable properties of a quantile estimator and can be defined independently of the underlying distribution.
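NumPy exposes several of these competing sample-quantile definitions through the `method` argument of `np.quantile`; assuming the recommended median-unbiased estimator corresponds to `method="median_unbiased"`, a short demonstration of how the definitions can disagree on the same data:

```python
import numpy as np

# The same 90th percentile under three different sample-quantile
# definitions; threshold studies should therefore state which one they use.
x = [1, 2, 2, 3, 5, 8, 13, 21, 34, 55]
for m in ("linear", "median_unbiased", "nearest"):
    print(m, np.quantile(x, 0.9, method=m))
```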
Conference Paper
Full-text available
Software metrics offer us the promise of distilling useful information from vast amounts of software in order to track development progress, to gain insights into the nature of the software, and to identify potential problems. Unfortunately, however, many software metrics exhibit highly skewed, non-Gaussian distributions. As a consequence, usual ways of interpreting these metrics, for example in terms of "average" values, can be highly misleading. Many metrics, it turns out, are distributed like wealth, with high concentrations of values in selected locations. We propose to analyze software metrics using the Gini coefficient, a higher-order statistic widely used in economics to study the distribution of wealth. Our approach allows us not only to observe changes in software systems efficiently, but also to assess project risks and monitor the development process itself. We apply the Gini coefficient to numerous metrics over a range of software projects, and we show that many metrics not only display remarkably high Gini values, but that these values are remarkably consistent as a project evolves over time.
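A compact sketch of the Gini computation applied to a metric distribution: 0 means the metric value is spread evenly across entities, values near 1 mean it is concentrated in a few. The data is illustrative:

```python
import numpy as np

def gini(values):
    """Gini coefficient via the sorted-index formulation:
    G = sum((2i - n - 1) * v_i) / (n * sum(v)) for ascending v, 1-indexed i."""
    v = np.sort(np.asarray(values, dtype=float))
    n = len(v)
    index = np.arange(1, n + 1)
    return np.sum((2 * index - n - 1) * v) / (n * v.sum())

loc_per_class = [10, 12, 15, 20, 25, 30, 400]   # one huge class dominates
print(f"Gini = {gini(loc_per_class):.2f}")
```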
Conference Paper
Full-text available
In order to increase our ability to use measurement to support software development practice we need to do more analysis of code. However, empirical studies of code are expensive and their results are difficult to compare. We describe the Qualitas Corpus, a large curated collection of open source Java systems. The corpus reduces the cost of performing large empirical studies of code and supports comparison of measurements of the same artifacts. We discuss its design, organisation, and issues associated with its development.
Article
Full-text available
An empirical study of the relationship between object-oriented (OO) metrics and error-severity categories is presented. The focus of the study is to identify threshold values of software metrics using receiver operating characteristic curves. The study used the three releases of the Eclipse project and found threshold values for some OO metrics that separated no-error classes from classes that had high-impact errors. Although these thresholds cannot predict whether a class will definitely have errors in the future, they can provide a more scientific method to assess class error proneness and can be used by engineers easily. Copyright © 2009 John Wiley & Sons, Ltd.
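A sketch of ROC-based threshold identification on synthetic data; maximizing Youden's J statistic is one common way to pick the cut-off and is an assumption here, not necessarily the study's exact criterion:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(2)
# Synthetic WMC values: error-free classes tend to be simpler.
wmc = np.concatenate([rng.normal(10, 3, 100), rng.normal(25, 8, 40)])
has_error = np.concatenate([np.zeros(100), np.ones(40)])

fpr, tpr, thresholds = roc_curve(has_error, wmc)
best = np.argmax(tpr - fpr)                      # Youden's J = TPR - FPR
print(f"WMC threshold separating classes: {thresholds[best]:.1f}")
```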
Article
Full-text available
A single statistical framework, comprising power law distributions and scale-free networks, seems to fit a wide variety of phenomena. There is evidence that power laws appear in software at the class and function level. We show that distributions with long, fat tails in software are much more pervasive than previously established, appearing at various levels of abstraction, in diverse systems and languages. The implications of this phenomenon cover various aspects of software engineering research and practice.
Article
Full-text available
In this article, we present a novel algorithmic method for the calculation of thresholds for a metric set. To this aim, machine learning and data mining techniques are utilized. We define a data-driven methodology that can be used for efficiency optimization of existing metric sets, for the simplification of complex classification models, and for the calculation of thresholds for a metric set in an environment where no metric set yet exists. The methodology is independent of the metric set and therefore also independent of any language, paradigm, or abstraction level. In four case studies performed on large-scale open-source software, metric sets for C functions, C++ and C# methods, and Java classes are optimized and the methodology is validated.
Conference Paper
Full-text available
In order to support the maintenance of object-oriented software systems, the quality of their design must be evaluated using adequate quantification means. In spite of the current extensive use of metrics, if used in isolation, metrics are oftentimes too fine-grained to comprehensively quantify an investigated aspect of the design. To help the software engineer detect and localize design problems, the novel detection strategy mechanism is defined, so that deviations from good-design principles and heuristics are quantified in the form of metrics-based rules. Using detection strategies, an engineer can directly localize classes or methods affected by a particular design flaw (e.g., God Class), rather than having to infer the real design problem from a large set of abnormal metric values. In order to reach the ultimate goal of bridging the gap between qualitative and quantitative statements about design, the dissertation proposes a novel type of quality model, called factor-strategy. In contrast to traditional quality models that express the goodness of design in terms of a set of metrics, this novel model explicitly relates the quality of a design to its conformance with a set of essential principles, rules, and heuristics, which are quantified using detection strategies.
Conference Paper
Full-text available
Presents a novel way of using object-oriented design metrics to support the incremental development of object-oriented programs. Based on a quality model (the factor-criteria-metrics model), so-called multi-metrics relate a number of simple structural measurements to design principles and rules. Single components of an object-oriented program like classes or subsystems are analyzed to determine whether they conform to specific design goals. Concise measurement reports, together with detailed explanations of the obtained values, identify problem spots in system design and give hints for improvement. This allows the designer to measure and evaluate programs at an appropriate level of abstraction. This paper details the use of the multi-metrics approach for the design and improvement of a framework for industry and its use for graphical applications. Multi-metrics tools were used with several versions of the framework. The measurement results were used in design reviews to quantify the effects of efforts to reorganize the framework. The results showed that this approach was very effective at giving good feedback, even to very experienced software developers. It helped them to improve their software and to create stable system designs
Article
Full-text available
We present a comprehensive study of an implementation of the Smalltalk object-oriented system, one of the first and purest object-oriented programming environments, searching for scaling laws in its properties. We study ten system properties, including the distributions of variable and method names, inheritance hierarchies, class and method sizes, and the system architecture graph. We systematically found Pareto, or sometimes log-normal, distributions in these properties. This denotes that the programming activity, even when modeled from a statistical perspective, can in no way be simply modeled as a random addition of independent increments with finite variance, but exhibits strong organic dependencies on what has already been developed. We compare our results with similar ones obtained for large Java systems, reported in the literature or computed by ourselves for those properties never studied before, showing that the behavior found is similar in all the object-oriented systems studied. We show how the Yule process is able to stochastically model the generation of several of the power laws found, identifying the process parameters and comparing theoretical and empirical tail indexes. Lastly, we discuss how the distributions found are related to existing object-oriented metrics, like Chidamber and Kemerer's, and how they could provide a starting point for measuring the quality of a whole system, versus that of single classes. In fact, the usual evaluation of systems based on the mean and standard deviation of metrics can be misleading. It is more interesting to measure differences in the shape and coefficients of the data's statistical distributions.
Article
Full-text available
Given the central role that software development plays in the delivery and application of information technology, managers are increasingly focusing on process improvement in the software development area. This demand has spurred the provision of a number of new and/or improved approaches to software development, with perhaps the most prominent being object-orientation (OO). In addition, the focus on process improvement has increased the demand for software measures, or metrics with which to manage the process. The need for such metrics is particularly acute when an organization is adopting a new technology for which established practices have yet to be developed. This research addresses these needs through the development and implementation of a new suite of metrics for OO design. Metrics developed in previous research, while contributing to the field's understanding of software development processes, have generally been subject to serious criticisms, including the lack of a theoretical base. Following Wand and Weber (1989), the theoretical base chosen for the metrics was the ontology of Bunge (1977). Six design metrics are developed, and then analytically evaluated against Weyuker's (1988) proposed set of measurement principles. An automated data collection tool was then developed and implemented to collect an empirical sample of these metrics at two field sites in order to demonstrate their feasibility and suggest ways in which managers may use these metrics for process improvement
Article
Full-text available
A practical application of object-oriented measures is to predict which classes are likely to contain a fault. This is contended to be meaningful because object-oriented measures are believed to be indicators of psychological complexity, and classes that are more complex are likely to be faulty. Recently, a cognitive theory has been proposed suggesting that there are threshold effects for many object-oriented measures. This means that object-oriented classes are easy to understand as long as their complexity is below a threshold. Above that threshold their understandability decreases rapidly, leading to an increased probability of a fault. This occurs, according to the theory, due to an overflow of short-term human memory. If this theory is confirmed, then it would provide a mechanism that would explain the introduction of faults into object-oriented systems, and would also provide some practical guidance on how to design object-oriented programs. In this paper we empirically test this theory on two C++ telecommunications systems. We test for threshold effects in a subset of the Chidamber and Kemerer (CK) suite of measures. The dependent variable was the incidence of faults that lead to field failures. Our results indicate that there are no threshold effects for any of the measures studied. This means that there is no value for the studied CK measures where the fault-proneness changes from being steady to rapidly increasing. The results are consistent across the two systems. Therefore, we can provide no support to the posited cognitive theory.
Article
While software metrics are a generally desirable feature in the software management functions of project planning and project evaluation, they are of especial importance with a new technology such as the object-oriented approach. This is due to the significant need to train software engineers in generally accepted object-oriented principles. This paper presents theoretical work that builds a suite of metrics for object-oriented design. In particular, these metrics are based upon measurement theory and are informed by the insights of experienced object-oriented software developers. The proposed metrics are formally evaluated against a widely accepted list of software metric evaluation criteria.
Conference Paper
Several code smell detection tools have been developed, but they provide different results because smells can be subjectively interpreted and hence detected in different ways. Usually the detection techniques are based on the computation of different kinds of metrics, while other aspects related to the domain of the system under analysis, its size, and other design features are not taken into account. In this paper we propose an approach, currently under study, based on machine learning techniques. We outline some common problems faced in smell detection, and we describe the different steps of our approach and the algorithms we use for classification.
Conference Paper
We propose a new approach to aggregating software metrics from the micro-level of individual artifacts (e.g., methods, classes, and packages) to the macro-level of the entire software system. The approach is based on the Theil index, a well-known econometric measure of inequality. The Theil index makes it possible to study the impact of different categorizations of the artifacts, e.g., based on the development technology or developers' teams, on the inequality of the measured metric values. We apply the Theil index in a series of experiments. We have observed that the Theil index and the related notions provide valuable insights into the organization and evolution of software systems, as well as into sources of inequality.
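A minimal Theil-index sketch on illustrative data; its additive decomposability into between-group and within-group components is what makes it suitable for studying categorizations of artifacts:

```python
import numpy as np

def theil(values):
    """Theil T index: (1/n) * sum((x_i / mean) * ln(x_i / mean)).
    0 means perfect equality; larger values mean the metric is
    concentrated in fewer artifacts."""
    v = np.asarray(values, dtype=float)
    ratio = v / v.mean()
    return np.mean(ratio * np.log(ratio))

churn = [5, 8, 12, 14, 200, 220]   # per-file change counts (illustrative)
print(f"Theil index = {theil(churn):.2f}")
```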
Conference Paper
Almost every expert in Object-Oriented Development stresses the importance of iterative development. As you proceed with iterative development, you need to add function to the existing code base. If you are really lucky, that code base is structured just right to support the new function while still preserving its design integrity. Of course, most of the time we are not lucky; the code does not quite fit what we want to do. You could just add the function on top of the code base, but soon this leads to applying patch upon patch, making your system more complex than it needs to be. This complexity leads to bugs and cripples your productivity.
Article
Software engineering is a discipline in search of objective measures for factors that contribute to software quality. NPATH, which counts the acyclic execution paths through a function, is an objective measure of software complexity related to the ease with which software can be comprehensively tested.
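A small hypothetical function illustrating why NPATH complements cyclomatic complexity: sequential, independent decisions multiply the acyclic path count rather than add to it:

```python
# NPATH counts acyclic execution paths, so each independent if-statement
# below doubles the path count: NPATH = 2 * 2 = 4, while the cyclomatic
# complexity of the same function is only 3.
def validate(a: bool, b: bool) -> int:
    result = 0
    if a:            # 2 ways through this statement
        result += 1
    if b:            # x 2 ways -> 4 acyclic paths in total
        result += 2
    return result
```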
Article
The impact of software maintainability has become one of the most important aspects of past, present, and future software systems. Tools and models that can measure software maintainability will play an increasingly important role in the software industry. This article reviews two early attempts at software maintainability assessment and describes five recently developed models. Two of these are then applied to industrial software systems, and the results are evaluated. Each of the two models is shown to be effective in evaluating industrial software systems at the component, subsystem, and system levels.
Article
Despite the importance of software metrics and the large number of proposed metrics, they have not been widely applied in industry yet. One reason might be that, for most metrics, the range of expected values, i.e., reference values are not known. This paper presents results of a study on the structure of a large collection of open-source programs developed in Java, of varying sizes and from different application domains. The aim of this work is the definition of thresholds for a set of object-oriented software metrics, namely: LCOM, DIT, coupling factor, afferent couplings, number of public methods, and number of public fields. We carried out an experiment to evaluate the practical use of the proposed thresholds. The results of this evaluation indicate that the proposed thresholds can support the identification of classes which violate design principles, as well as the identification of well-designed classes. The method used in this study to derive software metrics thresholds can be applied to other software metrics in order to find their reference values.
Conference Paper
Power law distributions have been found in many natural and social phenomena, and more recently in the source code and run-time characteristics of Object-Oriented (OO) systems. A power law implies that small values are extremely common, whereas large values are extremely rare. We identify twelve new power laws relating to the static graph structures of Java programs. The graph structures analyzed represented different forms of OO coupling, namely, inheritance, aggregation, interface, parameter type and return type. Identification of these new laws provides the basis for predicting likely features of classes in future developments. The research ties together work in object-based coupling and World Wide Web structures.
Article
This paper describes a graph-theoretic complexity measure and illustrates how it can be used to manage and control program complexity. The paper first explains how the graph-theory concepts apply and gives an intuitive explanation of the graph concepts in programming terms. The control graphs of several actual Fortran programs are then presented to illustrate the correlation between intuitive complexity and the graph-theoretic complexity. Several properties of the graph-theoretic complexity are then proved which show, for example, that complexity is independent of physical size (adding or subtracting functional statements leaves complexity unchanged) and complexity depends only on the decision structure of a program.
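For reference, the measure described here, cyclomatic complexity, is defined on a program's control-flow graph with $E$ edges, $N$ nodes, and $P$ connected components as:

```latex
v(G) = E - N + 2P
```

which is why adding or removing straight-line statements leaves the value unchanged: only the decision structure alters the graph's cycle count.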
F. Arcelli Fontana, M. Zanoni, A. Marino, and M. V. Mäntylä, "Code smell detection: towards a machine learning-based approach," in Proc. 29th IEEE International Conference on Software Maintenance (ICSM 2013), ERA Track, Eindhoven, The Netherlands: IEEE, Sep. 2013, pp. 396–399. [Online]. Available: http://icsm2013.tue.nl/ICSM2013.pdf

F. Arcelli Fontana, P. Braione, and M. Zanoni, "Automatic detection of bad smells in code: An experimental assessment," Journal of Object Technology, vol. 11, no. 2, pp. 5:1–38, Aug. 2012. [Online]. Available: http://www.jot.fm/contents/issue_2012_08/article5.html

F. Zhang, A. Mockus, Y. Zou, F. Khomh, and A. E. Hassan, "How does context affect the distribution of software maintainability metrics?" in Proc. 29th IEEE International Conference on Software Maintenance (ICSM 2013), Eindhoven, The Netherlands, Sep. 2013, pp. 350–359.

L. H. Rosenberg, "Applying and interpreting object oriented metrics," Software Assurance Technology Center, NASA Goddard Space Flight Center, Tech. Rep., 2001.

T. McCabe, "A complexity measure," IEEE Transactions on Software Engineering, no. 4, pp. 308–320, 1976. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1702388

B. Nejmeh, "NPATH: a measure of execution path complexity and its applications," Communications of the ACM, vol. 31, no. 2, 1988.

L. Rosenberg, "Metrics for object oriented environment," in Proc. EFAITP/AIE 3rd Annual Software Metrics Conference, 1997.

M. Foucault, M. Palyart, J.-R. Falleri, and X. Blanc, "Computing contextual metric thresholds," in Proc. 29th Annual ACM Symposium on Applied Computing (SAC'14), Gyeongju, Republic of Korea: ACM, 2014, pp. 1120–1125.