Assessing the need – Comparison of search results  

Source publication
Conference Paper
Full-text available
What projects contain more than 10,000 lines of code developed by fewer than 10 people and are still actively maintained with a high bug-fixing rate? To address the challenges of answering such enquiries, we develop an integrated search engine architecture that combines information from different types of software repositories from multiple source...

Similar publications

Article
Full-text available
The article presents the creation of KiwiDSM: a domain-specific language (DSL) tool that, supported by model-driven engineering (MDE), makes it possible to model the modules that make up a learning management system (LMS) in the communications area; this tool is platform-independent. The validation of the proposal...

Citations

... Orion: This system aimed to enable retrieving projects using complex search queries linking different artifacts of software development, such as source code, version control metadata, bug tracker tickets, and developer activities and interactions extracted from the hosting platform [3]. The project scaled to about 185K projects but is no longer maintained. ...
... Finding out whether a project is using a custom
    let wanted: HashMap<SnapshotId> = db
        .snapshots()
        .filter(|snapshot|
            snapshot.contains(
                "#include <memory_resource>"))
        .map(|snapshot| ...
Conference Paper
Full-text available
Analyzing massive code bases is a staple of modern software engineering research – a welcome side-effect of the advent of large-scale software repositories such as GitHub. Selecting which projects one should analyze is a labor-intensive process, and a process that can lead to biased results if the selection is not representative of the population of interest. One issue faced by researchers is that the interface exposed by software repositories only allows the most basic of queries. CodeDJ is an infrastructure for querying repositories composed of a persistent datastore, constantly updated with data acquired from GitHub, and an in-memory database with a Rust query interface. CodeDJ supports reproducibility, historical queries are answered deterministically using past states of the datastore; thus researchers can reproduce published results. To illustrate the benefits of CodeDJ, we identify biases in the data of a published study and, by repeating the analysis with new data, we demonstrate that the study’s conclusions were sensitive to the choice of projects.
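CodeDJ's query interface is a Rust API over an in-memory database. The sketch below illustrates that filter-then-sort query style in a self-contained way; the `Project` struct, its fields, and `select_popular` are toy stand-ins invented for illustration, not the actual CodeDJ/Djanco API:

```rust
// Toy stand-ins for a CodeDJ-style datastore; the real Djanco API differs.
#[derive(Debug, Clone)]
struct Project {
    name: String,
    stars: u32,
    commits: u32,
}

// Select projects the way a CodeDJ query would: filter on a criterion,
// then order the survivors by a key (here, most-starred first).
fn select_popular(projects: &[Project], min_commits: u32) -> Vec<String> {
    let mut picked: Vec<&Project> = projects
        .iter()
        .filter(|p| p.commits >= min_commits)
        .collect();
    picked.sort_by(|a, b| b.stars.cmp(&a.stars));
    picked.into_iter().map(|p| p.name.clone()).collect()
}

fn main() {
    let projects = vec![
        Project { name: "alpha".into(), stars: 120, commits: 500 },
        Project { name: "beta".into(), stars: 40, commits: 90 },
        Project { name: "gamma".into(), stars: 300, commits: 2000 },
    ];
    // Projects with at least 100 commits, most-starred first.
    println!("{:?}", select_popular(&projects, 100)); // ["gamma", "alpha"]
}
```

Because the whole query is ordinary Rust, a researcher can version it alongside the study, which is what makes the deterministic replay against a past datastore state possible.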
... Bissyandé et al. [35] presented Orion, a corpus of software projects collected from GitHub, Google Code [36] and Freecode [37]. To query Orion, a custom-designed DSL must be used. ...
Preprint
Almost every Mining Software Repositories (MSR) study requires, as first step, the selection of the subject software repositories. These repositories are usually collected from hosting services like GitHub using specific selection criteria dictated by the study goal. For example, a study related to licensing might be interested in selecting projects explicitly declaring a license. Once the selection criteria have been defined, utilities such as the GitHub APIs can be used to "query" the hosting service. However, researchers have to deal with usage limitations imposed by these APIs and a lack of required information. For example, the GitHub search APIs allow 30 requests per minute and, when searching repositories, only provide limited information (e.g., the number of commits in a repository is not included). To support researchers in sampling projects from GitHub, we present GHS (GitHub Search), a dataset containing 25 characteristics (e.g., number of commits, license, etc.) of 735,669 repositories written in 10 programming languages. The set of characteristics has been derived by looking for frequently used project selection criteria in MSR studies and the dataset is continuously updated to (i) always provide fresh data about the existing projects, and (ii) increase the number of indexed projects. The GHS dataset can be queried through a web application we built that allows to set many combinations of selection criteria needed for a study and download the information of matching repositories: https://seart-ghs.si.usi.ch.
... Only the ability to search software code makes it usable and valuable in practical life. Therefore, developers should make their code (in the case of open-source development) searchable and accessible using suitable metadata to describe it [49]. Often, such metadata relate to the code file itself, such as its name, size, extension, and attributes. ...
Article
Full-text available
Writing good software is not an easy task; it requires substantial coding experience and skill. Inexperienced software developers therefore struggle with this critical task. In this paper, we provide guidelines to help in this important context. It presents the most important best practices and recommendations for writing good software from a software engineering perspective, regardless of the software domain (whether desktop, mobile, web, or embedded), software size, and software complexity. The best practices provided in this paper are organized in a taxonomy of categories to ease the process of considering them while developing software. Furthermore, many useful, practical, and actionable recommendations are given in most categories for software developers to consider.
... The literature contains a large body of approaches that attempt to solve the vocabulary mismatch problem. They either 1) use a controlled vocabulary (Liu et al. 1999) maintained by experts in specific and restricted domains; or 2) automatically derive a thesaurus (Eckert et al. 2007), e.g., from word co-occurrence statistics in an exhaustive corpus; or 3) interactively expand user queries (Ruthven 2003), e.g., by recommending other terms from previous query logs; or 4) automatically expand queries (Carpineto et al. 2001) by adding words derived from the terms in the original query, e.g., augmenting a query containing integer with int; or 5) completely rewrite the query automatically (Gollapudi et al. 2011). Most of these approaches are not suitable in the setting of a code search engine, since i) the domain is not restricted, ii) the corpus is not finite, iii) query logs are not always available, iv) code terms and query terms may not share any stem words, and v) query terms remain valuable to be matched against identifiers in the code. ...
... There are broadly two ways of reformulating a query: a global approach uses a thesaurus, like WordNet, to enumerate related words and synonyms of the query terms; a more local approach iteratively tries to expand the query by considering extra terms that appear in the initial results obtained with the original query and are marked as relevant by the searcher. Query expansion has been shown to be effective in many natural language processing (NLP) tasks (Carpineto et al. 2001; Xu and Croft 1996). In code search research, query expansion has been used intensively in recent years: Wang et al. (2014) consider human intervention to rank search results. ...
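The global, thesaurus-based strategy described above amounts to a lookup that appends code-vocabulary synonyms to each query term. A minimal sketch follows; the thesaurus entries and the `expand_query` helper are illustrative inventions, not drawn from any cited system:

```rust
use std::collections::HashMap;

// Expand a free-form query by appending code-vocabulary synonyms
// for each term found in a (hand-made, illustrative) thesaurus.
fn expand_query(query: &str, thesaurus: &HashMap<&str, Vec<&str>>) -> Vec<String> {
    let mut expanded: Vec<String> = Vec::new();
    for term in query.split_whitespace() {
        expanded.push(term.to_string());
        if let Some(synonyms) = thesaurus.get(term) {
            for s in synonyms {
                expanded.push((*s).to_string());
            }
        }
    }
    expanded
}

fn main() {
    let mut thesaurus = HashMap::new();
    thesaurus.insert("integer", vec!["int"]);
    thesaurus.insert("string", vec!["str"]);
    // Original terms are kept, since they still match identifiers in code.
    println!("{:?}", expand_query("parse integer value", &thesaurus));
    // prints ["parse", "integer", "int", "value"]
}
```

Note that the original terms are retained rather than replaced, matching point v) above: query terms remain valuable for matching identifiers directly.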
Article
Full-text available
Source code terms such as method names and variable types are often different from conceptual words mentioned in a search query. This vocabulary mismatch problem can make code search inefficient. In this paper, we present COde voCABUlary (CoCaBu), an approach to resolving the vocabulary mismatch problem when dealing with free-form code search queries. Our approach leverages common developer questions and the associated expert answers to augment user queries with the relevant, but missing, structural code entities in order to improve the performance of matching relevant code examples within large code repositories. To instantiate this approach, we build GitSearch, a code search engine, on top of GitHub and Stack Overflow Q&A data. We evaluate GitSearch in several dimensions to demonstrate that (1) its code search results are correct with respect to user-accepted answers; (2) the results are qualitatively better than those of existing Internet-scale code search engines; (3) our engine is competitive against web search engines, such as Google, in helping users solve programming tasks; and (4) GitSearch provides code examples that are acceptable or interesting to the community as answers for Stack Overflow questions.
... Bissyande et al. [30] proposed an integrated search engine that searches for project entities and provides a uniform interface with a declarative query language. Linstead et al. [31] developed Sourcerer to take advantage of the textual aspect of software, its structural aspects, as well as any relevant metadata. ...
Article
Full-text available
Internet-scale open source software (OSS) production in various communities generates abundant reusable resources for software developers. However, finding the desired and mature software with keyword queries from a considerable number of candidates, especially for newcomers, is a significant challenge because current search services often fail to understand the semantics of user queries. In this paper, we construct a software term database (STDB) by analyzing tagging data in Stack Overflow and propose a correlation-based software search (CBSS) approach that performs correlation retrieval based on the term relevance obtained from STDB. In addition, we design a novel ranking method to optimize the initial retrieval result. We explore four research questions in four experiments to evaluate the effectiveness of the STDB and investigate the performance of the CBSS. The experiment results show that the proposed CBSS can effectively respond to keyword-based software searches and significantly outperforms other existing search services at finding mature software.
... Our prediction results highlight the reality that even though 63% of GitHub repositories are used for software development, only a small percentage of those repositories actually contain projects that software engineering researchers may be interested in studying. Dyer et al. (2013) and Bissyandé et al. (2013a) have created domain-specific languages, Boa and Orion respectively, to help researchers mine data about software repositories. Dyer et al. (2013) have used Boa to curate a sizable number of source code repositories from GitHub and SourceForge; however, only Java repositories are currently available. ...
Article
Full-text available
Software forges like GitHub host millions of repositories. Software engineering researchers have been able to take advantage of such a large corpus of potential study subjects with the help of tools like GHTorrent and Boa. However, the simplicity in querying comes with a caveat: there are limited means of separating the signal (e.g. repositories containing engineered software projects) from the noise (e.g. repositories containing homework assignments). The proportion of noise in a random sample of repositories could skew the study and may lead to researchers reaching unrealistic, potentially inaccurate, conclusions. We argue that it is imperative to have the ability to sieve out the noise in such large repository forges. We propose a framework, and present a reference implementation of the framework as a tool called reaper, to enable researchers to select GitHub repositories that contain evidence of an engineered software project. We identify software engineering practices (called dimensions) and propose means for validating their existence in a GitHub repository. We used reaper to measure the dimensions of 1,857,423 GitHub repositories. We then used manually classified data sets of repositories to train classifiers capable of predicting if a given GitHub repository contains an engineered software project. The performance of the classifiers was evaluated using a set of 200 repositories with known ground truth classification. We also compared the performance of the classifiers to other approaches to classification (e.g. number of GitHub Stargazers) and found our classifiers to outperform existing approaches. We found the stargazers-based classifier (with 10 as the threshold for number of stargazers) to exhibit high precision (97%) but a low recall (32%). On the other hand, our best classifier exhibited both high precision (82%) and high recall (86%).
The stargazers-based criterion offers precision but fails to recall a significant portion of the population.
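The precision and recall figures quoted above follow the standard confusion-matrix definitions. A small sketch computing both from raw counts; the counts used here are illustrative, chosen only to reproduce a 97%/32% split, and are not the paper's actual data:

```rust
// precision = TP / (TP + FP): of everything labeled positive, how much is right.
fn precision(tp: u32, fp: u32) -> f64 {
    tp as f64 / (tp + fp) as f64
}

// recall = TP / (TP + FN): of everything actually positive, how much was found.
fn recall(tp: u32, fn_: u32) -> f64 {
    tp as f64 / (tp + fn_) as f64
}

fn main() {
    // Illustrative counts: a classifier that labels very few repositories
    // as engineered can score high precision while missing most of the
    // population, like the stargazers baseline above (97% / 32%).
    println!("precision = {:.2}", precision(31, 1)); // 31/32 ~ 0.97
    println!("recall    = {:.2}", recall(31, 66));   // 31/97 ~ 0.32
}
```

This is why the paper reports both numbers: a threshold-style filter trades recall for precision, and either figure alone overstates the classifier's quality.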
... [Dyer et al. 2013] have curated a large number of Java repositories and provide a domain specific language to help researchers mine data about software repositories. Similarly, [Bissyande et al. 2013] have created Orion, a prototype for enabling unified search to retrieve projects using complex search queries linking different artifacts of software development, such as source code, version control metadata, bug tracker tickets, developer activities and interactions extracted from the hosting platform. Black Duck Open Hub (www.openhub.net) is a public directory of free and open source software, offering analytics and search services for discovering, evaluating, tracking, and comparing open source code and projects. ...
Article
Previous studies have shown that there is a non-trivial amount of duplication in source code. This paper analyzes a corpus of 4.5 million non-fork projects hosted on GitHub representing over 428 million files written in Java, C++, Python, and JavaScript. We found that this corpus has a mere 85 million unique files. In other words, 70% of the code on GitHub consists of clones of previously created files. There is considerable variation between language ecosystems. JavaScript has the highest rate of file duplication, only 6% of the files are distinct. Java, on the other hand, has the least duplication, 60% of files are distinct. Lastly, a project-level analysis shows that between 9% and 31% of the projects contain at least 80% of files that can be found elsewhere. These rates of duplication have implications for systems built on open source software as well as for researchers interested in analyzing large code bases. As a concrete artifact of this study, we have created DéjàVu, a publicly available map of code duplicates in GitHub repositories.
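The file-level duplication rates reported above amount to counting how many files repeat contents already seen in the corpus. A toy sketch of that measurement follows, using exact string contents and a hash set in place of DéjàVu's actual large-scale hashing pipeline:

```rust
use std::collections::HashSet;

// Fraction of files whose exact contents already appeared earlier
// in the corpus (i.e., files that are clones of a previous file).
fn duplication_rate(files: &[&str]) -> f64 {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut clones = 0usize;
    for &contents in files {
        if !seen.insert(contents) {
            clones += 1; // insert returned false: contents seen before
        }
    }
    clones as f64 / files.len() as f64
}

fn main() {
    // Four "files", two of which repeat the first one's contents.
    let corpus = ["fn a(){}", "fn b(){}", "fn a(){}", "fn a(){}"];
    println!("{}", duplication_rate(&corpus)); // 0.5
}
```

In the study's terms, a corpus of 428 million files with only 85 million unique ones corresponds to a rate of roughly 0.8 by this measure, i.e., about 80% clones when counting every repeated occurrence.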
... GHTorrent [5] and GH Archive [1] are designed for large-scale, language-independent repository mining, primarily focusing on GitHub metadata. Other frameworks for mining software artifacts exist, including tools to scalably build web-crawlers to extract diverse artifacts [11], search engines similar to GHTorrent but not restricted to GH [2], or proposals for fact-extraction-based approaches [6]. Other tools include Repograms [8] to visualize high-level commit information, and Chronos [9] to visualize changes over time to specific regions of code. ...
Conference Paper
Full-text available
Sites such as GitHub have created a vast collection of software artifacts that researchers interested in understanding and improving software systems can use. Current tools for processing such GitHub data tend to target project metadata and avoid source code processing, or process source code in a manner that requires significant effort for each language supported. This paper presents GitcProc, a lightweight tool based on regular expressions and source code blocks, which downloads projects and extracts their project history, including fine-grained source code information and development time bug fixes. GitcProc can track changes to both single-line and block source code structures and associate these changes to the surrounding function context with minimal set up required from users. We demonstrate GitcProc's ability to capture changes in multiple languages by evaluating it on C, C++, Java, and Python projects, and show it finds bug fixes and the context of source code changes effectively with few false positives.
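GitcProc's lightweight detection of development-time bug fixes can be approximated by matching fix-related keywords in commit messages. The keyword list and `looks_like_bug_fix` helper below are illustrative stand-ins, not GitcProc's actual regular-expression patterns:

```rust
// Flag a commit message as a likely bug fix if it mentions a fix keyword.
// The keyword list is illustrative; GitcProc's real patterns are richer
// regexes, and keyword matching like this admits some false positives.
fn looks_like_bug_fix(message: &str) -> bool {
    let keywords = ["fix", "bug", "defect", "patch", "fault"];
    let lower = message.to_lowercase();
    keywords.iter().any(|k| lower.contains(k))
}

fn main() {
    println!("{}", looks_like_bug_fix("Fix null pointer in parser")); // true
    println!("{}", looks_like_bug_fix("Add streaming API docs"));     // false
}
```

The appeal of this style, as the abstract notes, is that it is cheap and largely language-independent; the cost is the handful of false positives that pattern matching inevitably brings.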
... After reviewing a great deal of literature, we found that most studies in the area of open source software search focus on code search [12,19,20], while few researchers study the search of software project entities. Tegawendé F. Bissyandé proposed an integrated search engine, Orion [3], which focuses on searching for project entities and provides a uniform interface with a declarative query language. They conclude that their system helps users find relevant projects faster and more accurately than traditional search engines. ...
... We want to assess two aspects of our method: (1) relevance: whether our method helps the user find relevant software and, moreover, outperforms the search services provided by existing project hosting sites (SourceForge, OpenHub, GitHub, and Oschina) and general search engines (Google, Bing, and Baidu); and (2) usability: following the research in [3], we use usability to measure whether the relevant software returned by our method is mature and more likely to satisfy the user's intent. ...
Conference Paper
Full-text available
Internet-scale open source software (OSS) production in various communities is generating abundant reusable resources for software developers. However, retrieving and reusing the desired and mature software from huge numbers of candidates is a great challenge: there are usually big gaps between the user application contexts (often used as queries) and the OSS keywords (often used to match the queries). In this paper, we define the scenario-based query problem for OSS retrieval, and then we propose a novel approach to reformulate the raw query by leveraging the crowd wisdom of millions of developers to improve the retrieval results. We build a software-specific domain lexical database based on the knowledge in open source communities, by which we can expand and optimize the input queries. The experiment results show that our approach can reformulate the initial query effectively and significantly outperforms other existing search engines at finding mature software.
Article
Code search is an essential task in software development. Developers often search the internet and other code databases for necessary source code snippets to ease the development efforts. Code search techniques also help learn programming as novice programmers or students can quickly retrieve (hopefully good) examples already used in actual software projects. Given the recurrence of the code search activity in software development, there is an increasing interest in the research community. To improve the code search experience, the research community suggests many code search tools and techniques. These tools and techniques leverage several different ideas and claim a better code search performance. However, it is still challenging to illustrate a comprehensive view of the field considering that existing studies generally explore narrow and limited subsets of used components. This study aims to devise a grounded approach to understanding the procedure for code search and build an operational taxonomy capturing the critical facets of code search techniques. Additionally, we investigate evaluation methods, benchmarks, and datasets used in the field of code search.