Article · PDF available

Improving the prediction of continuous integration build failures using deep learning


Abstract and Figures

Continuous Integration (CI) aims at supporting developers in integrating code changes constantly and quickly through an automated build process. However, the build process is typically time- and resource-consuming, as running failed builds can take hours until the breakage is discovered, which may cause disruptions in the development process and delays in product release dates. Hence, preemptively detecting when a software state is most likely to trigger a failure during the build is of crucial importance for developers. Accurate build failure prediction techniques can cut the expenses of the CI build process by predicting its potential failures early. However, developing accurate prediction models is a challenging task, as it requires learning long- and short-term dependencies in the historical CI build data as well as extensive feature engineering to derive informative features to learn from. In this paper, we introduce DL-CIBuild, a novel approach that uses Long Short-Term Memory (LSTM)-based Recurrent Neural Networks (RNNs) to construct prediction models for CI build outcome prediction. The problem comprises a single series of CI build outcomes, and a model is required to learn from the series of past observations to predict the next CI build outcome in the sequence. In addition, we tailor a Genetic Algorithm (GA) to tune the hyper-parameters of our LSTM model. We evaluate our approach and investigate the performance of both cross-project and online prediction scenarios on a benchmark of 91,330 CI builds from 10 large and long-lived software projects that use the Travis CI build system. The statistical analysis of the obtained results shows that the LSTM-based model outperforms traditional Machine Learning (ML) models under both online and cross-project validation. DL-CIBuild also shows lower sensitivity to the training set size and effective robustness to concept drift.
Additionally, by considering several Hyper-Parameter Optimization (HPO) methods as baselines for GA, we demonstrate that the latter performs best.
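The single-series framing the abstract describes (learn from a window of past build outcomes, predict the next one) can be sketched as follows. This is an illustrative reconstruction, not the authors' DL-CIBuild code; the window size and the 0/1 outcome encoding are assumptions.

```python
# Hypothetical illustration: turning a chronological series of CI build
# outcomes (1 = failed, 0 = passed) into supervised (window, next) pairs,
# the shape of training sample an LSTM-style sequence model would consume.
def make_windows(outcomes, window_size):
    """Return (history, next_outcome) pairs from a single outcome series."""
    samples = []
    for i in range(len(outcomes) - window_size):
        history = outcomes[i:i + window_size]
        samples.append((history, outcomes[i + window_size]))
    return samples

builds = [0, 0, 1, 1, 0, 1, 0, 0]          # chronological build results
pairs = make_windows(builds, window_size=3)
# e.g. pairs[0] == ([0, 0, 1], 1): the model sees three past builds
# and learns to predict the fourth
```

An LSTM would then be trained on the `history` windows as input sequences and the `next_outcome` values as labels.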
... To mitigate CI build failures, various CI build prediction techniques [9, 23, 35–37, 52] have been developed to preemptively detect states in the CI process that are likely to cause build failures. This enables developers to take the necessary actions and avoid these failures. ...
... The CI build process is typically time- and resource-consuming, as running failed builds can take hours until the breakage is discovered. Prior studies [9, 15, 16, 36] found that CI build failures may cause disruptions in the development process and delays in product release dates in the context of open-source projects. This RQ seeks to understand the impact of CI build failures at Atlassian from the developers' perspective through a survey. ...
... The explanatory variables in the repository properties dimension account for the largest proportion of the Wald χ². Previous studies [9, 14, 36] point out that in open-source projects the ... To study the relationship between the explanatory variables and the response, we plot the odds produced by our models against each explanatory variable while holding the others at their median values. Figure 2 shows the relationship between the explanatory variables and the response variable with the 95% confidence interval (gray area). ...
Conference Paper
Full-text available
Continuous Integration (CI) build failures could significantly impact the software development process and teams, such as delaying the release of new features and reducing developers' productivity. In this work, we report on an empirical study that investigates CI build failures throughout product development at Atlassian. Our quantitative analysis found that the repository dimension is the key factor influencing CI build failures. In addition, our qualitative survey revealed that Atlassian developers perceive CI build failures as challenging issues in practice. Furthermore, we found that the CI build prediction can not only provide proactive insight into CI build failures but also facilitate the team's decision-making. Our study sheds light on the challenges and expectations involved in integrating CI build prediction tools into the Bitbucket environment, providing valuable insights for enhancing CI processes.
... -Effect size correction: SK-ESD uses Cohen's d (Cohen 2013) to measure the effect size between different clusters and merges clusters having a negligible effect size, i.e., d < 0.2. Figure 4 depicts the experimental design of our study that we adopt to answer the proposed research questions. First, we follow the 10-fold online validation process to divide our dataset into training and testing sets, similar to previous studies (Islam et al. 2022; Fan et al. 2018; Saidan et al. 2020; Saidani et al. 2022), since code review data is chronologically ordered. Given a dataset (i.e., a given project's data), we first sort it according to the creation date of code reviews, and then we divide the dataset into 11 equal folds. ...
... To evaluate the cost-effectiveness of CostAwareCR, we used P_opt and ACC scores, which have been widely adopted in previous studies (Mende and Koschke 2010; Shukla et al. 2018). Moreover, we followed the online validation process since our data is chronologically ordered and to prevent data leaks (Islam et al. 2022; Saidan et al. 2020; Saidani et al. 2022). Furthermore, we summarize the results of each obtained Pareto front by reporting the median and best metrics for each run, following the previous works of ( ) and Arcuri and Briand (2011). ...
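The chronological online validation described in these excerpts can be sketched as follows. The cumulative-training scheme (train on all earlier folds, test on the next one) and the record field names are assumptions for illustration, not the papers' exact protocol.

```python
# Hedged sketch of chronological "online validation": sort by creation
# date, cut into 11 equal folds, and at step i train on folds 1..i and
# test on fold i+1, so the test set is always strictly later in time.
def online_folds(records, n_folds=11):
    """Yield (train, test) splits over chronologically sorted records."""
    records = sorted(records, key=lambda r: r["created"])
    size = len(records) // n_folds
    folds = [records[i * size:(i + 1) * size] for i in range(n_folds)]
    for i in range(1, n_folds):
        train = [r for fold in folds[:i] for r in fold]
        yield train, folds[i]

data = [{"created": t, "merged": t % 2 == 0} for t in range(22)]
splits = list(online_folds(data))
# 10 train/test splits; no test record ever predates a training record
```

Splitting this way prevents the data leakage that a random shuffle would introduce into chronologically ordered review data.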
Article
Full-text available
Modern Code Review (MCR) is an essential practice in software engineering. MCR helps with the early detection of defects, prevents poor implementation practices, and brings other benefits such as knowledge sharing, team awareness, and collaboration. However, reviewing code changes is a hard and time-consuming task, requiring developers to prioritize code review tasks to optimize the time and effort spent on code review. Previous approaches attempted to prioritize code reviews based on their likelihood to be merged by leveraging Machine Learning (ML) models to maximize the prediction performance. However, these approaches did not consider the review effort dimension, which results in sub-optimal solutions for code review prioritization. It is thus important to consider the code review effort in code review request prioritization to help developers optimize their code review efforts while maximizing the number of merged code changes. To address this issue, we propose CostAwareCR, a multi-objective optimization-based approach to predict and prioritize code review requests based on their likelihood to be merged and their review effort, measured in terms of the size of the reviewed code. CostAwareCR uses the RuleFit algorithm to learn relevant features.
Then, our approach learns Logistic Regression (LR) model weights using the Non-dominated Sorting Genetic Algorithm II (NSGA-II) to simultaneously maximize (1) the prediction performance and (2) the cost-effectiveness. To evaluate the performance of CostAwareCR, we performed a large empirical study on 146,612 code reviews across 3 large organizations, namely LibreOffice, Eclipse and GerritHub. The obtained results indicate that CostAwareCR achieves promising Area Under the Curve (AUC) scores ranging from 0.75 to 0.77. Additionally, CostAwareCR outperforms various baseline approaches in terms of effort-awareness performance metrics, being able to prioritize the review of 87% of code changes by using only 20% of the effort. Furthermore, our approach achieved 0.92 in terms of the normalized area under the lift chart (P_opt), indicating that our approach is able to provide near-optimal code review prioritization based on the review effort.
Our results indicate that our multi-objective formulation is effective for learning models that trade off good cost-effectiveness against promising prediction performance.
... In this context, another approach using a similar genetic technique was introduced by Saidani et al. [20], where they focused on predicting failing builds in a CI context. They employed Long Short-Term Memory (LSTM) networks to capture the temporal correlations in historical CI build data and predict the outcome of the next CI build within a given sequence. ...
Preprint
The software industry is experiencing a surge in the adoption of Continuous Integration (CI) practices, both in commercial and open-source environments. CI practices facilitate the seamless integration of code changes by employing automated building and testing processes. Frameworks such as Travis CI and GitHub Actions have significantly contributed to simplifying and enhancing the CI process, rendering it more accessible and efficient for development teams. Despite the availability of these CI tools, developers continue to encounter difficulties in accurately flagging commits as either suitable for CI execution or as candidates for skipping, especially in large projects with many dependencies. Inaccurate flagging of commits can lead to resource-intensive test and build processes, as even minor commits may inadvertently trigger the Continuous Integration process. The problem of detecting CI-skip commits can be modeled as a binary classification task where we decide to either build a commit or to skip it. This study proposes a novel solution that leverages Deep Reinforcement Learning techniques to construct an optimal Decision Tree classifier that addresses the imbalanced nature of the data. We evaluate our solution by running within- and cross-project validation benchmarks on a diverse range of open-source projects hosted on GitHub, which showed superior results when compared with existing state-of-the-art methods.
... Previous research proposed methods to predict build failures using build history and change set features [7,27,31,45]. For instance, Hassan and Wang [27] used past build characteristics for predictions, while Chen et al. [7] introduced an adaptive model that adjusts based on the outcome of the last build. ...
Article
Full-text available
Compute resources that enable Continuous Integration (CI, i.e., the automatic build and test cycle applied to the change sets that development teams produce) are a shared commodity that organizations need to manage. To prevent (erroneous) builds from consuming a large amount of resources, CI service providers often impose a time limit. CI builds that exceed the time limit are automatically terminated. While imposing a time limit helps to prevent abuse of the service, builds that timeout (a) consume the maximum amount of resources that a CI service is willing to provide and (b) leave CI users without an indication of whether the change set will pass or fail the CI process. Therefore, understanding timeout builds and the factors that contribute to them is important for improving the stability and quality of a CI service. In this paper, we investigate the prevalence of timeout builds and the characteristics associated with them. By analyzing a curated dataset of 936 projects that adopt the CircleCI service and report at least one timeout build, we find that the median duration of a timeout build (19.7 minutes) is more than five times that of a build that produces a pass or fail result (3.4 minutes). To better understand the factors contributing to timeout builds, we model timeout builds using characteristics of project build history, build queued time, timeout tendency, size, and author experience based on data collected from 105,663 CI builds. Our model demonstrates a discriminatory power that vastly surpasses that of a random predictor (Area Under the Receiver Operating characteristic Curve, i.e., AUROC = 0.939) and is highly stable in its performance ( AUROC optimism = 0.0001). Moreover, our model reveals that the build history and timeout tendency features are strong indicators of timeout builds, with the timeout status of the most recent build accounting for the largest proportion of the explanatory power. 
A longitudinal analysis of the incidences of timeout builds (i.e., a study conducted over a period of time) indicates that 64.03% of timeout builds occur consecutively. In such cases, it takes a median of 24 hours before a build that passes or fails occurs. Our results imply that CI providers should exploit build history to anticipate timeout builds.
Article
Full-text available
The substantial volume of user feedback contained in application reviews significantly contributes to the development of human-centred software requirement engineering. The abundance of unstructured text data necessitates an automated analytical framework for decision-making. Language models can automatically extract fine-grained aspect-based sentiment information from application reviews. Existing approaches are constructed on general-domain corpora, and it is challenging to elucidate the internal workings of their recognition process, along with the factors contributing to the analysis results. To fully utilize software engineering domain-specific knowledge and accurately identify aspect-sentiment pairs from application reviews, we design a dependency-enhanced heterogeneous graph neural network architecture based on a dual-level attention mechanism. The heterogeneous information network with knowledge resources from the software engineering field is embedded into graph convolutional networks to consider the attribute characteristics of different node types. The relationship between aspect terms and sentiment terms in application reviews is determined by adjusting the dual-level attention mechanism. Semantic dependency enhancement is introduced to comprehensively model contextual relationships and analyze sentence structure, thereby distinguishing important contextual information. To our knowledge, this marks an initial effort to bring software engineering domain knowledge resources into deep neural networks to address fine-grained sentiment analysis issues. The experimental results on multiple public benchmark datasets indicate the effectiveness of the proposed automated framework in aspect-based sentiment analysis tasks for application reviews.
Article
The paper presents an architectural solution to the problem of including navigation services in the already existing information system of the university. The relevance of the study is due to the modern development of indoor navigation technologies and the widespread prevalence of mobile services. The problem statement considers the multi-level architecture of the university IS and its functional logical model in relation to the concept of creating an adaptive educational environment. Based on this, the need to use microservices and APIs for space-time navigation in a university environment is justified. The presented justification of the selected design solutions is based on the proposed generalized scenario of internal navigation in the university, including the setting and construction of routes. Movement at the university is determined by the schedule of events, which implies space-time tracking depending on the role of the user and the infrastructure of buildings. The generalization of our experience in implementing various applications using navigation and location, as well as the results of the modeling of use cases from various points of view, allowed us to build a domain model. It has been proven that such an organization of the data model can be generated from the information structure of the portal and other external systems. Based on the described design solutions, a microservice architecture for a space-time navigation system with a public API has been developed. The key advantage of this approach is not only ample opportunities to support various university activities, but also the creation of infrastructure mechanisms for the modernization and development of the information and educational ecosystem.
Conference Paper
Full-text available
Continuous Integration (CI) aims at supporting developers in integrating code changes quickly through automated building. However, there is a consensus that CI build failure is a major barrier that developers face, which prevents them from proceeding further with development. In this paper, we introduce BF-Detector, an automated tool to detect CI build failures. Based on an adaptation of the Non-dominated Sorting Genetic Algorithm II (NSGA-II), our tool aims at finding the best prediction rules based on two conflicting objective functions to deal with both minority and majority classes. We evaluated the effectiveness of our tool on a benchmark of 56,019 CI builds. The results reveal that our technique outperforms state-of-the-art approaches by providing a better balance between both failed and passed builds. The BF-Detector tool is publicly available, with a demo video, at: https://github.com/stilab-ets/BF-Detector CCS CONCEPTS • Software and its engineering → Software maintenance tools.
Preprint
Full-text available
Context: The ultimate goal of Continuous Integration (CI) is to support developers in integrating changes into production constantly and quickly through an automated build process. While CI provides developers with prompt feedback on several quality dimensions after each change, such frequent and quick changes may in turn compromise software quality without refactoring. Indeed, recent work emphasized the potential of CI in changing the way developers perceive and apply refactoring. However, we still lack empirical evidence to confirm or refute this assumption. Objective: We aim to explore and understand the evolution of refactoring practices, in terms of frequency, size and involved developers, after the switch to CI, in order to emphasize the role of this process in changing the way refactoring is applied. Method: We collect a corpus of 99,545 commits and 89,926 refactoring operations extracted from 39 open-source GitHub projects that adopt Travis CI and analyze the changes using Multiple Regression Analysis (MRA). Results: Our study delivers several important findings. We found that the adoption of CI is associated with a drop in refactoring size, as recommended, while refactoring frequency as well as the number (and rate) of developers who perform refactoring are estimated to decrease after the shift to CI, indicating that refactoring is less likely to be applied in a CI context. Conclusion: Our study uncovers insights about CI theory and practice and adds evidence to existing knowledge about CI practices, especially those related to quality assurance. Software developers need more customized refactoring tool support in the context of CI to better maintain and evolve their software systems.
Article
Full-text available
Context: Continuous Integration (CI) is a common practice in modern software development, and it is increasingly adopted in open-source as well as industrial software markets. CI aims at supporting developers in integrating code changes constantly and quickly through an automated build process. However, in such a context, the build process is typically time- and resource-consuming, which requires a high maintenance effort to avoid build failures. Objective: The goal of this study is to introduce an automated approach to cut the expenses of CI build time and provide support tools to developers by predicting the CI build outcome. Method: In this paper, we address the problem of CI build failure by introducing a novel search-based approach based on Multi-Objective Genetic Programming (MOGP) to build a CI build failure prediction model. Our approach aims at finding the best combination of CI build features and their appropriate threshold values, based on two conflicting objective functions to deal with both failed and passed builds. Results: We evaluated our approach on a benchmark of 56,019 builds from 10 large-scale and long-lived software projects that use the Travis CI build system. The statistical results reveal that our approach outperforms state-of-the-art techniques based on machine learning by providing a better balance between both failed and passed builds. Furthermore, we use the generated prediction rules to investigate which factors impact CI build results, and found that features related to (1) specific statistics about the project, such as team size, (2) last build information in the current build, and (3) the types of changed files are the most influential in indicating the potential failure of a given build. Conclusion: This paper proposes a multi-objective search-based approach for the problem of CI build failure prediction.
The performance of the models developed using our MOGP approach was statistically better than that of models developed using machine learning techniques. The experimental results show that our approach can effectively reduce both the false negative rate and the false positive rate of CI build failures in highly imbalanced datasets.
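The rule-based formulation this abstract describes (combinations of build features with thresholds, scored on two conflicting objectives over failed and passed builds) can be illustrated roughly as follows. The feature names, thresholds, and toy data are invented for illustration; this is not the MOGP search itself, only the representation and scoring such a search would evolve.

```python
# A candidate "prediction rule" is a list of (feature, threshold) conditions;
# a build is predicted to fail when every condition holds. The two objective
# scores (recall on failed builds vs. correctness on passed builds) conflict,
# which is why a multi-objective search is used.
def rule_predicts_failure(build, rule):
    return all(build[feat] >= thr for feat, thr in rule)

def objectives(builds, rule):
    """Return (true-positive rate on failed, true-negative rate on passed)."""
    failed = [b for b in builds if b["failed"]]
    passed = [b for b in builds if not b["failed"]]
    tpr = sum(rule_predicts_failure(b, rule) for b in failed) / len(failed)
    tnr = sum(not rule_predicts_failure(b, rule) for b in passed) / len(passed)
    return tpr, tnr

builds = [
    {"team_size": 9, "files_changed": 30, "failed": True},
    {"team_size": 2, "files_changed": 3, "failed": False},
]
rule = [("team_size", 5), ("files_changed", 10)]
# objectives(builds, rule) → (1.0, 1.0) on this toy sample
```

A genetic programming search would mutate and recombine such condition lists, keeping the rules that sit on the Pareto front of the two scores.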
Article
Full-text available
Machine learning algorithms have been used widely in various applications and areas. To fit a machine learning model into different problems, its hyper-parameters must be tuned. Selecting the best hyper-parameter configuration for machine learning models has a direct impact on the model’s performance. It often requires deep knowledge of machine learning algorithms and appropriate hyper-parameter optimization techniques. Although several automatic optimization techniques exist, they have different strengths and drawbacks when applied to different types of problems. In this paper, optimizing the hyper-parameters of common machine learning models is studied. We introduce several state-of-the-art optimization techniques and discuss how to apply them to machine learning algorithms. Many available libraries and frameworks developed for hyper-parameter optimization problems are provided, and some open challenges of hyper-parameter optimization research are also discussed in this paper. Moreover, experiments are conducted on benchmark datasets to compare the performance of different optimization methods and provide practical examples of hyper-parameter optimization. This survey paper will help industrial users, data analysts, and researchers to better develop machine learning models by identifying the proper hyper-parameter configurations effectively. Github code: https://github.com/LiYangHart/Hyperparameter-Optimization-of-Machine-Learning-Algorithms
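As a concrete instance of the HPO techniques such surveys cover, a minimal random-search loop over a discrete configuration space looks like this. The search space and the toy objective are made up for illustration; real use would plug in a model's cross-validated loss.

```python
import random

# Illustrative random-search HPO: sample configurations from the space,
# score each with the objective, and keep the best-scoring one.
def random_search(objective, space, n_trials=50, seed=0):
    rng = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        score = objective(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Hypothetical search space and objective (lower is better).
space = {"units": [16, 32, 64], "lr": [0.1, 0.01, 0.001]}
loss = lambda c: abs(c["units"] - 64) / 64 + abs(c["lr"] - 0.01)
best, score = random_search(loss, space)
```

Grid search, Bayesian optimization, and evolutionary methods differ mainly in how the next `cfg` is chosen, not in this evaluate-and-keep-best skeleton.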
Article
Full-text available
Continuous integration (CI) frameworks, such as Travis CI, are growing in popularity, encouraged by market trends towards speeding up the release cycle and building higher-quality software. A key facilitator of CI is to automatically build and run tests whenever a new commit is submitted/pushed. Despite the many advantages of using CI, it is known that the CI process can take a very long time to complete. One of the core causes of such delays is the fact that some commits (e.g., cosmetic changes) unnecessarily kick off the CI process. Therefore, the main goal of this paper is to automate the process of determining which commits can be CI skipped through the use of machine learning techniques. We first extracted 23 features from the historical data of ten software repositories. Second, we conducted a study on the detection of CI skip commits using machine learning, in which we built a decision tree classifier. We then examined the accuracy of the decision tree in detecting CI skip commits. Our results show that the decision tree can identify CI skip commits with an average AUC of 0.89. Furthermore, the top node analysis shows that the number of developers who changed the modified files, the CI-skip rules, and the commit message are the most important features for detecting CI skip commits. Finally, we investigate the generalizability of identifying CI skip commits by applying cross-project validation, and our results show that the general classifier achieves an average AUC of 0.74.
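The classification task described above can be illustrated with a single hand-written rule over commit features. The thresholds and field names below are invented; they only mirror the reported top features (developer count, commit message, change type) and are not the learned decision tree.

```python
# Toy illustration of CI-skip classification: each commit is a feature
# dictionary and the label says whether the CI build can be skipped.
# A learned decision tree would induce rules of this shape from data.
def is_ci_skip(commit):
    doc_only = all(f.endswith((".md", ".txt")) for f in commit["files"])
    few_devs = commit["num_devs_changed_files"] <= 1
    skip_tagged = "[ci skip]" in commit["message"].lower()
    return skip_tagged or (doc_only and few_devs)

commit = {"files": ["README.md"], "num_devs_changed_files": 1,
          "message": "Fix typo in docs"}
# → True: documentation-only change touched by a single developer
```

In the real setting, the tree learns such thresholds from the 23 extracted features rather than having them fixed by hand.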
Article
This paper searches for the optimal neural architecture by minimizing a proxy of the validation loss. Existing neural architecture search (NAS) methods discover the optimal neural architecture that best fits the validation examples given the up-to-date network weights. These intermediate validation results are invaluable but have not been fully explored. We propose to approximate the validation loss landscape by learning a mapping from neural architectures to their corresponding validation losses. The optimal neural architecture can then be easily identified as the minimum of this proxy validation loss landscape. To improve efficiency, a novel architecture sampling strategy is developed for the approximation of the proxy validation loss landscape. We also propose an operation importance weight (OIW) to balance the randomness and certainty of architecture sampling. The representation of neural architectures is learned through a graph autoencoder (GAE) over both architectures sampled during search and randomly generated architectures. We provide theoretical analyses of the validation loss estimator learned with our sampling strategy. Experimental results demonstrate that the proposed proxy validation loss landscape can be effective in both differentiable NAS and evolutionary-algorithm-based (EA-based) NAS.
Article
Continuous Integration (CI) consists of integrating the changes introduced by different developers more frequently through the automation of the build process. Nevertheless, the CI build process is seen as a major barrier that causes delays in product release dates. One of the main reasons for such delays is that some simple changes (i.e., changes that can be skipped) trigger the build, which represents an unnecessary overhead and is particularly painful for large projects. In order to cut off the expenses of CI build time, we propose in this paper SKIPCI, a novel search-based approach to automatically detect CI skip commits based on an adaptation of the Strength Pareto Evolutionary Algorithm (SPEA-2). Our approach aims to provide the optimal trade-off between two conflicting objectives to deal with both skipped and non-skipped commits. We evaluate our approach and investigate the performance of both within- and cross-project validations on a benchmark of 14,294 CI commits from 15 projects that use the Travis CI system. The statistical tests revealed that our approach shows a clear advantage over the baseline approaches, with average AUC scores of 92% and 84% for cross-validation and cross-project validation, respectively. Furthermore, the feature analysis reveals that documentation changes, terms appearing in the commit message, and the committer's experience are the most prominent features in CI skip detection. When it comes to the cross-project scenario, the results reveal that, besides documentation changes, there is a strong link between current and previous commit results. Moreover, we deployed and evaluated the usefulness of SKIPCI with our industrial partner. Qualitative results demonstrate the effectiveness of SKIPCI in providing relevant CI skip commit recommendations to developers for two large software projects from the practitioners' point of view.
Article
The lifespan, power density, and transient response make supercapacitors a component of choice for the electric vehicle and renewable energy industry. Supercapacitors’ long lifecycle often makes it difficult for designers to assess the system’s reliability over the complete product cycle. In the existing literature, the remaining useful life (RUL) estimations utilize up to 50% state of health (SOH) degradation data to successfully predict the RUL of the supercapacitors with reasonable accuracy, making them impractical in terms of time and resources required to collect the data. The time to acquire data imposes restrictions on developing a data-driven RUL prediction model for the supercapacitors. The objective of this study is to reliably predict the SOH degradation curve of the supercapacitors with the availability of less than 10% degradation data to avoid time and cost-consuming lifecycle testing. This study presents a novel combination of deep learning algorithm-Deep Belief Network (DBN) with Bayesian Optimization and HyperBand (BOHB) to predict the RUL of the supercapacitors in the early phases of degradation. The proposed method successfully predicts the degradation curve using the data of the initial 15 thousand cycles (less than 6% data for training in most of the cases), which is very promising since the supercapacitor has yet to show much degradation at this stage, thus reducing up to 54% time for the development of the RUL prediction model. The proposed model shows good accuracy with percent error and root mean squared error (RMSE) ranging from 0.05% to 2.2% and 0.8851 to 1.6326, respectively. The robustness of the model is also tested by injecting noise in the training data during training.