Assessing the need – Comparison of search results  

Source publication
Conference Paper
Full-text available
What projects contain more than 10,000 lines of code developed by fewer than 10 people and are still actively maintained with a high bug-fixing rate? To address the challenges of answering such enquiries, we develop an integrated search engine architecture that combines information from different types of software repositories from multiple source...

Similar publications

Article
Full-text available
The article presents the creation of KiwiDSM: a domain-specific language (DSL) tool that, supported by model-driven engineering (MDE), makes it possible to model the modules that make up a learning management system (LMS) in the communications area; this tool is platform-independent. The validation of the proposal...

Citations

... Orion: This system aimed to enable retrieving projects using complex search queries linking different artifacts of software development, such as source code, version control metadata, bug tracker tickets, and developer activities and interactions extracted from the hosting platform [3]. The project scaled to about 185K projects but is no longer maintained. ...
... Finding out whether a project is using a custom
    let wanted: HashMap<SnapshotId> = db
        .snapshots()
        .filter(|snapshot|
            snapshot.contains(
                "#include <memory_resource>"))
        .map(|snapshot| ...
Conference Paper
Full-text available
Analyzing massive code bases is a staple of modern software engineering research – a welcome side-effect of the advent of large-scale software repositories such as GitHub. Selecting which projects one should analyze is a labor-intensive process, and a process that can lead to biased results if the selection is not representative of the population of interest. One issue faced by researchers is that the interface exposed by software repositories only allows the most basic of queries. CodeDJ is an infrastructure for querying repositories composed of a persistent datastore, constantly updated with data acquired from GitHub, and an in-memory database with a Rust query interface. CodeDJ supports reproducibility, historical queries are answered deterministically using past states of the datastore; thus researchers can reproduce published results. To illustrate the benefits of CodeDJ, we identify biases in the data of a published study and, by repeating the analysis with new data, we demonstrate that the study’s conclusions were sensitive to the choice of projects.
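CodeDJ's query interface is a Rust API over an in-memory database. The sketch below illustrates that filter-then-sort query style in a self-contained way; the `Project` struct, its fields, and `select_popular` are toy stand-ins invented for illustration, not the actual CodeDJ/Djanco API:

```rust
// Toy stand-ins for a CodeDJ-style datastore; the real Djanco API differs.
#[derive(Debug, Clone)]
struct Project {
    name: String,
    stars: u32,
    commits: u32,
}

// Select projects the way a CodeDJ query would: filter on a criterion,
// then order the survivors by a key (here, most-starred first).
fn select_popular(projects: &[Project], min_commits: u32) -> Vec<String> {
    let mut picked: Vec<&Project> = projects
        .iter()
        .filter(|p| p.commits >= min_commits)
        .collect();
    picked.sort_by(|a, b| b.stars.cmp(&a.stars));
    picked.into_iter().map(|p| p.name.clone()).collect()
}

fn main() {
    let projects = vec![
        Project { name: "alpha".into(), stars: 120, commits: 500 },
        Project { name: "beta".into(), stars: 40, commits: 90 },
        Project { name: "gamma".into(), stars: 300, commits: 2000 },
    ];
    // Projects with at least 100 commits, most-starred first.
    println!("{:?}", select_popular(&projects, 100)); // ["gamma", "alpha"]
}
```

Because the whole query is ordinary Rust, a researcher can version it alongside the study, which is what makes the deterministic replay against a past datastore state possible.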
... Bissyandé et al. [35] presented Orion, a corpus of software projects collected from GitHub, Google Code [36] and Freecode [37]. To query Orion, a custom-designed DSL must be used. ...
Preprint
Almost every Mining Software Repositories (MSR) study requires, as first step, the selection of the subject software repositories. These repositories are usually collected from hosting services like GitHub using specific selection criteria dictated by the study goal. For example, a study related to licensing might be interested in selecting projects explicitly declaring a license. Once the selection criteria have been defined, utilities such as the GitHub APIs can be used to "query" the hosting service. However, researchers have to deal with usage limitations imposed by these APIs and a lack of required information. For example, the GitHub search APIs allow 30 requests per minute and, when searching repositories, only provide limited information (e.g., the number of commits in a repository is not included). To support researchers in sampling projects from GitHub, we present GHS (GitHub Search), a dataset containing 25 characteristics (e.g., number of commits, license, etc.) of 735,669 repositories written in 10 programming languages. The set of characteristics has been derived by looking for frequently used project selection criteria in MSR studies and the dataset is continuously updated to (i) always provide fresh data about the existing projects, and (ii) increase the number of indexed projects. The GHS dataset can be queried through a web application we built that allows to set many combinations of selection criteria needed for a study and download the information of matching repositories: https://seart-ghs.si.usi.ch.
... Only the ability to search software code makes it usable and valuable in practical life. Therefore, developers should make their code (in the case of open-source development) searchable and accessible using suitable metadata to describe it [49]. Often, such metadata relate to the code file itself, such as its name, size, extension, and attributes. ...
Article
Full-text available
Writing good software is not an easy task; it requires substantial coding experience and skill. Inexperienced software developers therefore struggle with this critical task. In this paper, we provide guidelines to help in this important context. It presents the most important best practices and recommendations for writing good software from a software engineering perspective, regardless of the software domain (whether desktop, mobile, web, or embedded), software size, and software complexity. The best practices provided in this paper are organized in a taxonomy of categories to ease the process of considering them while developing software. Furthermore, many useful, practical, and actionable recommendations are given in most categories for software developers to consider.
... The literature contains a large body of approaches that attempt to solve the vocabulary mismatch problem. They either 1) use a controlled vocabulary (Liu et al. 1999) maintained by experts in specific and restricted domains; or 2) automatically derive a thesaurus (Eckert et al. 2007), e.g., from word co-occurrence statistics in an exhaustive corpus; or 3) interactively expand user queries (Ruthven 2003), e.g., by recommending other terms from previous query logs; or 4) automatically expand queries (Carpineto et al. 2001) by adding words derived from the terms in the original query, e.g., augmenting a query containing integer with int; or 5) completely rewrite the query automatically (Gollapudi et al. 2011). Most of these approaches are not suitable in the setting of a code search engine, since i) the domain is not restricted, ii) the corpus is not finite, iii) query logs are not always available, iv) code terms and query terms may not share any stem words, and v) query terms remain valuable to be matched against identifiers in the code. ...
... There are broadly two ways of reformulating a query: a global approach uses a thesaurus, like WordNet, to enumerate related words and synonyms of the query terms; a more local approach iteratively tries to expand the query by considering extra terms that appear in the initial results obtained with the original query and are marked as relevant by the searcher. Query expansion has been shown to be effective in many natural language processing (NLP) tasks (Carpineto et al. 2001; Xu and Croft 1996). In code search research, query expansion has been used intensively in recent years: Wang et al. (2014) consider human intervention to rank search results. ...
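The global, thesaurus-based strategy described above amounts to a lookup that appends code-vocabulary synonyms to each query term. A minimal sketch follows; the thesaurus entries and the `expand_query` helper are illustrative inventions, not drawn from any cited system:

```rust
use std::collections::HashMap;

// Expand a free-form query by appending code-vocabulary synonyms
// for each term found in a (hand-made, illustrative) thesaurus.
fn expand_query(query: &str, thesaurus: &HashMap<&str, Vec<&str>>) -> Vec<String> {
    let mut expanded: Vec<String> = Vec::new();
    for term in query.split_whitespace() {
        expanded.push(term.to_string());
        if let Some(synonyms) = thesaurus.get(term) {
            for s in synonyms {
                expanded.push((*s).to_string());
            }
        }
    }
    expanded
}

fn main() {
    let mut thesaurus = HashMap::new();
    thesaurus.insert("integer", vec!["int"]);
    thesaurus.insert("string", vec!["str"]);
    // Original terms are kept, since they still match identifiers in code.
    println!("{:?}", expand_query("parse integer value", &thesaurus));
    // prints ["parse", "integer", "int", "value"]
}
```

Note that the original terms are retained rather than replaced, matching point v) above: query terms remain valuable for matching identifiers directly.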
Article
Full-text available
Source code terms such as method names and variable types are often different from conceptual words mentioned in a search query. This vocabulary mismatch problem can make code search inefficient. In this paper, we present COde voCABUlary (CoCaBu), an approach to resolving the vocabulary mismatch problem when dealing with free-form code search queries. Our approach leverages common developer questions and the associated expert answers to augment user queries with the relevant, but missing, structural code entities in order to improve the performance of matching relevant code examples within large code repositories. To instantiate this approach, we build GitSearch, a code search engine, on top of GitHub and Stack Overflow Q&A data. We evaluate GitSearch in several dimensions to demonstrate that (1) its code search results are correct with respect to user-accepted answers; (2) the results are qualitatively better than those of existing Internet-scale code search engines; (3) our engine is competitive against web search engines, such as Google, in helping users solve programming tasks; and (4) GitSearch provides code examples that are acceptable or interesting to the community as answers for Stack Overflow questions.
... Bissyande et al. [30] proposed an integrated search engine that searches for project entities and provides a uniform interface with a declarative query language. Linstead et al. [31] developed Sourcerer to take advantage of the textual aspect of software, its structural aspects, as well as any relevant metadata. ...
Article
Full-text available
Internet-scale open source software (OSS) production in various communities generates abundant reusable resources for software developers. However, finding the desired and mature software with keyword queries from a considerable number of candidates, especially for newcomers, is a significant challenge because current search services often fail to understand the semantics of user queries. In this paper, we construct a software term database (STDB) by analyzing tagging data in Stack Overflow and propose a correlation-based software search (CBSS) approach that performs correlation retrieval based on the term relevance obtained from STDB. In addition, we design a novel ranking method to optimize the initial retrieval result. We explore four research questions in four experiments to evaluate the effectiveness of the STDB and investigate the performance of the CBSS. The experiment results show that the proposed CBSS can effectively respond to keyword-based software searches and significantly outperforms other existing search services at finding mature software.
... Our prediction results highlight the reality that even though 63% of GitHub repositories are used for software development, only a small percentage of those repositories actually contain projects that software engineering researchers may be interested in studying. Dyer et al. (2013) and Bissyandé et al. (2013a) have created domain-specific languages, Boa and Orion respectively, to help researchers mine data about software repositories. Dyer et al. (2013) have used Boa to curate a sizable number of source code repositories from GitHub and SourceForge; however, only Java repositories are currently available. ...
Article
Full-text available
Software forges like GitHub host millions of repositories. Software engineering researchers have been able to take advantage of such a large corpus of potential study subjects with the help of tools like GHTorrent and Boa. However, the simplicity in querying comes with a caveat: there are limited means of separating the signal (e.g. repositories containing engineered software projects) from the noise (e.g. repositories containing homework assignments). The proportion of noise in a random sample of repositories could skew the study and may lead to researchers reaching unrealistic, potentially inaccurate, conclusions. We argue that it is imperative to have the ability to sieve out the noise in such large repository forges. We propose a framework, and present a reference implementation of the framework as a tool called reaper, to enable researchers to select GitHub repositories that contain evidence of an engineered software project. We identify software engineering practices (called dimensions) and propose means for validating their existence in a GitHub repository. We used reaper to measure the dimensions of 1,857,423 GitHub repositories. We then used manually classified data sets of repositories to train classifiers capable of predicting if a given GitHub repository contains an engineered software project. The performance of the classifiers was evaluated using a set of 200 repositories with known ground truth classification. We also compared the performance of the classifiers to other approaches to classification (e.g. number of GitHub Stargazers) and found our classifiers to outperform existing approaches. We found the stargazers-based classifier (with 10 as the threshold for number of stargazers) to exhibit high precision (97%) but a low recall (32%). On the other hand, our best classifier exhibited both high precision (82%) and high recall (86%).
The stargazers-based criterion offers precision but fails to recall a significant portion of the population.
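The precision and recall figures quoted above follow the standard confusion-matrix definitions. A small sketch computing both from raw counts; the counts used here are illustrative, chosen only to reproduce a 97%/32% split, and are not the paper's actual data:

```rust
// precision = TP / (TP + FP): of everything labeled positive, how much is right.
fn precision(tp: u32, fp: u32) -> f64 {
    tp as f64 / (tp + fp) as f64
}

// recall = TP / (TP + FN): of everything actually positive, how much was found.
fn recall(tp: u32, fn_: u32) -> f64 {
    tp as f64 / (tp + fn_) as f64
}

fn main() {
    // Illustrative counts: a classifier that labels very few repositories
    // as engineered can score high precision while missing most of the
    // population, like the stargazers baseline above (97% / 32%).
    println!("precision = {:.2}", precision(31, 1)); // 31/32 ~ 0.97
    println!("recall    = {:.2}", recall(31, 66));   // 31/97 ~ 0.32
}
```

This is why the paper reports both numbers: a threshold-style filter trades recall for precision, and either figure alone overstates the classifier's quality.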
... [Dyer et al. 2013] have curated a large number of Java repositories and provide a domain specific language to help researchers mine data about software repositories. Similarly, [Bissyande et al. 2013] have created Orion, a prototype for enabling unified search to retrieve projects using complex search queries linking different artifacts of software development, such as source code, version control metadata, bug tracker tickets, developer activities and interactions extracted from the hosting platform. Black Duck Open Hub (www.openhub.net) is a public directory of free and open source software, offering analytics and search services for discovering, evaluating, tracking, and comparing open source code and projects. ...
Article
Previous studies have shown that there is a non-trivial amount of duplication in source code. This paper analyzes a corpus of 4.5 million non-fork projects hosted on GitHub representing over 428 million files written in Java, C++, Python, and JavaScript. We found that this corpus has a mere 85 million unique files. In other words, 70% of the code on GitHub consists of clones of previously created files. There is considerable variation between language ecosystems. JavaScript has the highest rate of file duplication, only 6% of the files are distinct. Java, on the other hand, has the least duplication, 60% of files are distinct. Lastly, a project-level analysis shows that between 9% and 31% of the projects contain at least 80% of files that can be found elsewhere. These rates of duplication have implications for systems built on open source software as well as for researchers interested in analyzing large code bases. As a concrete artifact of this study, we have created DéjàVu, a publicly available map of code duplicates in GitHub repositories.
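The file-level duplication rates reported above amount to counting how many files repeat contents already seen in the corpus. A toy sketch of that measurement follows, using exact string contents and a hash set in place of DéjàVu's actual large-scale hashing pipeline:

```rust
use std::collections::HashSet;

// Fraction of files whose exact contents already appeared earlier
// in the corpus (i.e., files that are clones of a previous file).
fn duplication_rate(files: &[&str]) -> f64 {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut clones = 0usize;
    for &contents in files {
        if !seen.insert(contents) {
            clones += 1; // insert returned false: contents seen before
        }
    }
    clones as f64 / files.len() as f64
}

fn main() {
    // Four "files", two of which repeat the first one's contents.
    let corpus = ["fn a(){}", "fn b(){}", "fn a(){}", "fn a(){}"];
    println!("{}", duplication_rate(&corpus)); // 0.5
}
```

In the study's terms, a corpus of 428 million files with only 85 million unique ones corresponds to a rate of roughly 0.8 by this measure, i.e., about 80% clones when counting every repeated occurrence.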
... GHTorrent [5] and GH Archive [1] are designed for large-scale, language-independent repository mining, primarily focusing on GitHub metadata. Other frameworks for mining software artifacts exist, including tools to scalably build web-crawlers to extract diverse artifacts [11], search engines similar to GHTorrent but not restricted to GH [2], or proposals for fact-extraction-based approaches [6]. Other tools include Repograms [8] to visualize high-level commit information, and Chronos [9] to visualize changes over time to specific regions of code. ...
Conference Paper
Full-text available
Sites such as GitHub have created a vast collection of software artifacts that researchers interested in understanding and improving software systems can use. Current tools for processing such GitHub data tend to target project metadata and avoid source code processing, or process source code in a manner that requires significant effort for each language supported. This paper presents GitcProc, a lightweight tool based on regular expressions and source code blocks, which downloads projects and extracts their project history, including fine-grained source code information and development time bug fixes. GitcProc can track changes to both single-line and block source code structures and associate these changes to the surrounding function context with minimal set up required from users. We demonstrate GitcProc's ability to capture changes in multiple languages by evaluating it on C, C++, Java, and Python projects, and show it finds bug fixes and the context of source code changes effectively with few false positives.
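GitcProc's lightweight detection of development-time bug fixes can be approximated by matching fix-related keywords in commit messages. The keyword list and `looks_like_bug_fix` helper below are illustrative stand-ins, not GitcProc's actual regular-expression patterns:

```rust
// Flag a commit message as a likely bug fix if it mentions a fix keyword.
// The keyword list is illustrative; GitcProc's real patterns are richer
// regexes, and keyword matching like this admits some false positives.
fn looks_like_bug_fix(message: &str) -> bool {
    let keywords = ["fix", "bug", "defect", "patch", "fault"];
    let lower = message.to_lowercase();
    keywords.iter().any(|k| lower.contains(k))
}

fn main() {
    println!("{}", looks_like_bug_fix("Fix null pointer in parser")); // true
    println!("{}", looks_like_bug_fix("Add streaming API docs"));     // false
}
```

The appeal of this style, as the abstract notes, is that it is cheap and largely language-independent; the cost is the handful of false positives that pattern matching inevitably brings.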
... After reviewing a great deal of literature, we found that most studies in the area of open source software search focus on code search [12,19,20], while few researchers study the search of software project entities. Tegawendé F. Bissyandé proposed an integrated search engine, Orion [3], which focuses on searching for project entities and provides a uniform interface with a declarative query language. They conclude that their system helps users find relevant projects faster and more accurately than traditional search engines. ...
... We want to assess two aspects of our method: (1) relevance: whether our method helps the user find relevant software and, moreover, outperforms the search services provided by existing project hosting sites (SourceForge, OpenHub, GitHub, and Oschina) and general search engines (Google, Bing, and Baidu); and (2) usability: following the research in [3], we use usability to measure whether the relevant software returned by our method is mature and more likely to satisfy the user's intent. ...
Conference Paper
Full-text available
Internet-scale open source software (OSS) production in various communities is generating abundant reusable resources for software developers. However, retrieving and reusing the desired and mature software from huge numbers of candidates is a great challenge: there are usually big gaps between the user application contexts (often used as queries) and the OSS keywords (often used to match the queries). In this paper, we define the scenario-based query problem for OSS retrieval, and then we propose a novel approach to reformulate the raw query by leveraging the crowd wisdom of millions of developers to improve the retrieval results. We build a software-specific domain lexical database based on the knowledge in open source communities, by which we can expand and optimize the input queries. The experiment results show that our approach can reformulate the initial query effectively and significantly outperforms other existing search engines at finding mature software.
Article
Code search is an essential task in software development. Developers often search the internet and other code databases for necessary source code snippets to ease the development efforts. Code search techniques also help learn programming as novice programmers or students can quickly retrieve (hopefully good) examples already used in actual software projects. Given the recurrence of the code search activity in software development, there is an increasing interest in the research community. To improve the code search experience, the research community suggests many code search tools and techniques. These tools and techniques leverage several different ideas and claim a better code search performance. However, it is still challenging to illustrate a comprehensive view of the field considering that existing studies generally explore narrow and limited subsets of used components. This study aims to devise a grounded approach to understanding the procedure for code search and build an operational taxonomy capturing the critical facets of code search techniques. Additionally, we investigate evaluation methods, benchmarks, and datasets used in the field of code search.