Conference Paper

Collective Knowledge: Towards R&D Sustainability

January 2016

January 2016

DOI:10.3850/9783981537079_1018

Conference: Proceedings of the 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE)

Authors:

Anton Lokhmotov

dividiti

A Collective Knowledge workflow for collaborative research into multi-objective autotuning and machine learning techniques

Article

Jan 2018

Developing efficient software and hardware has never been harder whether it is for a tiny IoT device or an Exascale supercomputer. Apart from the ever growing design and optimization complexity, there exist even more fundamental problems such as lack of interdisciplinary knowledge required for effective software/hardware co-design, and a growing technology transfer gap between academia and industry. We introduce our new educational initiative to tackle these problems by developing Collective Knowledge (CK), a unified experimental framework for computer systems research and development. We use CK to teach the community how to make their research artifacts and experimental workflows portable, reproducible, customizable and reusable while enabling sustainable R&D and facilitating technology transfer. We also demonstrate how to redesign multi-objective autotuning and machine learning as a portable and extensible CK workflow. Such workflows enable researchers to experiment with different applications, data sets and tools; crowdsource experimentation across diverse platforms; share experimental results, models, visualizations; gradually expose more design and optimization choices using a simple JSON API; and ultimately build upon each other's findings. As the first practical step, we have implemented customizable compiler autotuning, crowdsourced optimization of diverse workloads across Raspberry Pi 3 devices, reduced the execution time and code size by up to 40%, and applied machine learning to predict optimizations. We hope such approach will help teach students how to build upon each others' work to enable efficient and self-optimizing software/hardware/model stack for emerging workloads.

Using OpenMP to Detect and Speculate Dynamic DOALL Loops

Chapter

Sep 2020

Production compilers such as GCC, Clang, IBM XL and the Intel C Compiler employ multiple loop parallelization techniques that help in the task of parallel programming. Although very effective, these techniques are only applicable to loops that the compiler can statically determine to have no loop-carried dependences (DOALL). Because of this restriction, a plethora of Dynamic DOALL (D-DOALL) loops are outright ignored, leaving the parallelism potential of many computationally intensive applications unexplored. This paper proposes a new analysis tool based on OpenMP clauses that allow the programmer to generate detailed profiling of any given loop by identifying its loop-carried dependences and producing carefully selected execution time metrics. The paper also proposes a set of heuristics to be used in conjunction with the analysis tool metrics to properly select loops which could be parallelized through speculative execution, even in the presence of loop-carried dependences. A thorough analysis of 180 loops from 45 benchmarks of three different suites (cBench, Parboil, and Rodinia) was realized using the Intel C Compiler and the proposed approach. Experimental results using static analysis from the Intel C Compiler showed that only 7.8% of the loops are DOALL. The proposed analysis tool exposed 39.5% May DOALL (M-DOALL) loops which could be eventually parallelized using speculative execution, as exemplified by loops from the Parboil sad program which produced a speedup of 1.92x.

Using OpenMP to Detect and Speculate Dynamic DOALL Loops

Conference Paper

Jun 2020

Production compilers such as GCC, Clang, IBM XL and the Intel C Compiler employ multiple loop parallelization techniques that help in the task of parallel programming. Although very effective, these techniques are only applicable to loops that the compiler can statically determine to have no loop-carried dependences (DOALL). Because of this restriction, a plethora of Dynamic DOALL (D-DOALL) loops are outright ignored, leaving the parallelism potential of many computationally intensive applications unexplored. This paper proposes a new analysis tool based on OpenMP clauses that allow the programmer to generate detailed profiling of any given loop by identifying its loop-carried dependences and producing carefully selected execution time metrics. The paper also proposes a set of heuristics to be used in conjunction with the analysis tool metrics to properly select loops which could be paral-lelized through speculative execution, even in the presence of loop-carried dependences. A thorough analysis of 180 loops from 45 benchmarks of three different suites (cBench, Parboil, and Rodinia) was realized using the Intel C Compiler and the proposed approach. Experimental results using static analysis from the Intel C Compiler showed that only 7.8% of the loops are DOALL. The proposed analysis tool exposed 39.5% May DOALL (M-DOALL) loops which could be eventually parallelized using speculative execution, as exemplified by loops from the Parboil sad program which produced a speedup of 1.92x.

A model-driven approach for a new generation of adaptive libraries

Preprint

Full-text available

Jun 2018

Efficient high-performance libraries often expose multiple tunable parameters to provide highly optimized routines. These can range from simple loop unroll factors or vector sizes all the way to algorithmic changes, given that some implementations can be more suitable for certain devices by exploiting hardware characteristics such as local memories and vector units. Traditionally, such parameters and algorithmic choices are tuned and then hard-coded for a specific architecture and for certain characteristics of the inputs. However, emerging applications are often data-driven, thus traditional approaches are not effective across the wide range of inputs and architectures used in practice. In this paper, we present a new adaptive framework for data-driven applications which uses a predictive model to select the optimal algorithmic parameters by training with synthetic and real datasets. We demonstrate the effectiveness of a BLAS library and specifically on its matrix multiplication routine. We present experimental results for two GPU architectures and show significant performance gains of up to 3x (on a high-end NVIDIA Pascal GPU) and 2.5x (on an embedded ARM Mali GPU) when compared to a traditionally optimized library.

A Survey on Compiler Autotuning using Machine Learning

Article

Full-text available

Sep 2018

Since the mid-1990s, researchers have been trying to use machine-learning based approaches to solve a number of different compiler optimization problems. These techniques primarily enhance the quality of the obtained results and, more importantly, make it feasible to tackle two main compiler optimization problems: optimization selection (choosing which optimizations to apply) and phase-ordering (choosing the order of applying optimizations). The compiler optimization space continues to grow due to the advancement of applications, increasing number of compiler optimizations, and new target architectures. Generic optimization passes in compilers cannot fully leverage newly introduced optimizations and, therefore, cannot keep up with the pace of increasing options. This survey summarizes and classifies the recent advances in using machine learning for the compiler optimization field, particularly on the two major problems of (1) selecting the best optimizations and (2) the phase-ordering of optimizations. The survey highlights the approaches taken so far, the obtained results, the fine-grain classification among different approaches and finally, the influential papers of the field.

Same, Same, but Dissimilar: Exploring Measurements for Workload Time-series Similarity

Conference Paper

Full-text available

Apr 2022

Iterative Compilation Optimization Based on Metric Learning and Collaborative Filtering

Article

Mar 2022

Pass selection and phase ordering are two critical compiler auto-tuning problems. Traditional heuristic methods cannot effectively address these NP-hard problems especially given the increasing number of compiler passes and diverse hardware architectures. Recent research efforts have attempted to address these problems through machine learning. However, the large search space of candidate pass sequences, the large numbers of redundant and irrelevant features, and the lack of training program instances make it difficult to learn models well. Several methods have tried to use expert knowledge to simplify the problems, such as using only the compiler passes or subsequences in the standard levels (e.g., -O1, -O2, and -O3) provided by compiler designers. However, these methods ignore other useful compiler passes that are not contained in the standard levels. Principal component analysis (PCA) and exploratory factor analysis (EFA) have been utilized to reduce the redundancy of feature data. However, these unsupervised methods retain all the information irrelevant to the performance of compilation optimization, which may mislead the subsequent model learning. To solve these problems, we propose a compiler pass selection and phase ordering approach, called Iterative Compilation based on Metric learning and Collaborative filtering (ICMC) . First, we propose a data-driven method to construct pass subsequences according to the observed collaborative interactions and dependency among passes on a given program set. Therefore, we can make use of all available compiler passes and prune the search space. Then, a supervised metric learning method is utilized to retain useful feature information for compilation optimization while removing both the irrelevant and the redundant information. Based on the learned similarity metric, a neighborhood-based collaborative filtering method is employed to iteratively recommend a few superior compiler passes for each target program. Last, an iterative data enhancement method is designed to alleviate the problem of lacking training program instances and to enhance the performance of iterative pass recommendations. The experimental results using the LLVM compiler on all 32 cBench programs show the following: (1) ICMC significantly outperforms several state-of-the-art compiler phase ordering methods, (2) it performs the same or better than the standard level -O3 on all the test programs, and (3) it can reach an average performance speedup of 1.20 (up to 1.46) compared with the standard level -O3.

Understanding and improving artifact sharing in software engineering research

Article

Full-text available

Jul 2021
EMPIR SOFTW ENG

In recent years, many software engineering researchers have begun to include artifacts alongside their research papers. Ideally, artifacts, including tools, benchmarks, and data, support the dissemination of ideas, provide evidence for research claims, and serve as a starting point for future research. However, in practice, artifacts suffer from a variety of issues that prevent the realization of their full potential. To help the software engineering community realize the full potential of artifacts, we seek to understand the challenges involved in the creation, sharing, and use of artifacts. To that end, we perform a mixed-methods study including a survey of artifacts in software engineering publications, and an online survey of 153 software engineering researchers. By analyzing the perspectives of artifact creators, users, and reviewers, we identify several high-level challenges that affect the quality of artifacts including mismatched expectations between these groups, and a lack of sufficient reward for both creators and reviewers. Using Diffusion of Innovations (DoI) as an analytical framework, we examine how these challenges relate to one another, and build an understanding of the factors that affect the sharing and success of artifacts. Finally, we make recommendations to improve the quality of artifacts based on our results and existing best practices.

Planetary Normal Mode Computation: Parallel Algorithms, Performance, and Reproducibility

Article

Full-text available

Jan 2021

This paper is an extension of work entitled "Computing planetary interior normal modes with a highly parallel polynomial filtering eigensolver." by Shi et al., originally presented at the SC18 conference. A highly parallel polynomial filtered eigensolver was developed and exploited to calculate the planetary normal modes. The proposed method is ideally suited for computing interior eigenpairs for large-scale eigenvalue problems as it greatly enhances memory and computational efficiency. In this work, the second-order finite element method is used to further improve the accuracy as only the first-order finite element method was deployed in the previous work. The parallel algorithm, its parallel performance up to 20k processors, and the great computational accuracy are illustrated. The reproducibility of the previous work was successfully performed on the Student Cluster Competition at the SC19 conference by several participant teams using a completely different Mars-model dataset on different clusters. Both weak and strong scaling performances of the reproducibility by the participant teams were impressive and encouraging. The analysis and reflection of their results are demonstrated and future direction is discussed.

On the Anatomy of Predictive Models for Accelerating GPU Convolution Kernels and Beyond

Article

Full-text available

Jan 2021

Efficient HPC libraries often expose multiple tunable parameters, algorithmic implementations, or a combination of them, to provide optimized routines. The optimal parameters and algorithmic choices may depend on input properties such as the shapes of the matrices involved in the operation. Traditionally, these parameters are manually tuned or set by auto-tuners. In emerging applications such as deep learning, this approach is not effective across the wide range of inputs and architectures used in practice. In this work, we analyze different machine learning techniques and predictive models to accelerate the convolution operator and GEMM. Moreover, we address the problem of dataset generation, and we study the performance, accuracy, and generalization ability of the models. Our insights allow us to improve the performance of computationally expensive deep learning primitives on high-end GPUs as well as low-power embedded GPU architectures on three different libraries. Experimental results show significant improvement in the target applications from 50% up to 300% compared to auto-tuned and high-optimized vendor-based heuristics by using simple decision tree- and MLP-based models.

On the Anatomy of Predictive Models for Accelerating GPU Convolution Kernels and Beyond

Preprint

Full-text available

Nov 2020

Efficient HPC libraries often expose multiple tunable parameters, algorithmic implementations or a combination of them, to provide optimized routines. The optimal parameters and algorithmic choices may depend on input properties such as the shapes of the matrices involved in the operation. Traditionally, these parameters are manually tuned or set by auto-tuners. In emerging applications such as deep learning, this approach is not effective across the wide range of inputs and architectures used in practice. In this work, we analyze different machine learning techniques and predictive models to accelerate the convolution operator and GEMM. Moreover, we address the problem of dataset generation and we study the performance, accuracy and generalization ability of the models. Our insights allow to improve the performance of computationally expensive deep learning primitives on high-end GPUs as well as low-power embedded GPU architectures on three different libraries. Experimental results show significant improvement in the target applications from 50% up to 300% compared to auto-tuned and high-optimized vendor-based heuristics by using simple decision tree-and MLP-based models.

The Hardware Lottery

Preprint

Full-text available

Sep 2020

Sara Hooker

Hardware, systems and algorithms research communities have historically had different incentive structures and fluctuating motivation to engage with each other explicitly. This historical treatment is odd given that hardware and software have frequently determined which research ideas succeed (and fail). This essay introduces the term hardware lottery to describe when a research idea wins because it is suited to the available software and hardware and not because the idea is superior to alternative research directions. Examples from early computer science history illustrate how hardware lotteries can delay research progress by casting successful ideas as failures. These lessons are particularly salient given the advent of domain specialized hardware which makes it increasingly costly to stray off of the beaten path of research ideas.

Understanding and Improving Artifact Sharing in Software Engineering Research

Preprint

Aug 2020

In recent years, many software engineering researchers have begun to include artifacts alongside their research papers. Ideally, artifacts, which include tools, benchmarks, data, and more, support the dissemination of ideas, provide evidence for research claims, and serve as a starting point for future research. This often takes the form of a link in the paper pointing to a website containing these additional materials. However, in practice, artifacts suffer from a variety of issues that prevent them from fully realising that potential. To help the software engineering community realise the potential of artifacts, we seek to understand the challenges involved in the creation, sharing, and use of artifacts. To that end, we perform a mixed-methods study including a publication analysis and online survey of 153 software engineering researchers. We apply the established theory of diffusion of innovation, and draw from the field of implementation science, to make evidence-based recommendations. By analysing the perspectives of artifact creators, users, and reviewers, we identify several high-level challenges that affect the quality of artifacts including mismatched expectations between these groups, and a lack of sufficient reward for both creators and reviewers. Using diffusion of innovation as a framework, we analyse how these challenges relate to one another, and build an understanding of the factors that affect the sharing and success of artifacts. Finally, using principles from implementation science, we make evidence-based recommendations for specific sub-communities (e.g., students and postdocs, artifact evaluation committees, funding bodies, and professional organisations) to improve the quality of artifacts.

Moderating Effect of the Continental Factor on the Business Strategy and M&A Performance in the Pharmaceutical Industry for Sustainable International Business

Article

Full-text available

Jun 2020

This research analyzed the moderating effects of the continental factor on the relation between the business strategies (cost advantage strategy and differentiation strategy) of the pharmaceutical industry and mergers and acquisitions (M&A) performance. A total of 1303 M&A cases were collected from the Bloomberg database between 1995 and 2016 for the sake of empirical analyses. The independent variables were represented by the cost advantage strategy and the differentiation strategy. The dependent variable was for the M&A performance, which was measured for the changes in ROA (return on assets). The results showed that the cost advantage strategy was advantageous when an Asian firm acquired one in either Asia or Europe. In contrast, when a European company received one in either Europe or Asia, M&A performance also was higher, although the cost was higher. On the other hand, the differentiation strategy was valid only when a European firm acquired one in Asia. The moderating effect of the continental factor was beneficial only in the relation between the cost advantage strategy and M&A performance. These results could help companies make decisions that maximize M&A performance based on continental factors from the perspective of the sustainable international business strategy establishment.

The Collective Knowledge project: making ML models more portable and reproducible with open APIs, reusable best practices and MLOps

Preprint

Jun 2020

Grigori Fursin

This article provides an overview of the Collective Knowledge technology (CK or cKnowledge) that attempts to make it easier to reproduce ML&systems research, deploy ML models in production and adapt them to continuously changing data sets, models, research techniques, software and hardware. The CK concept is to decompose complex systems and ad-hoc research projects into reusable sub-components with unified APIs, CLI and JSON meta description. Such components can be connected into portable workflows using DevOps principles combined with reusable automation actions, software detection plugins, meta packages and exposed optimization parameters. CK workflows can automatically plug in different models, data and tools from different vendors while building, running and benchmarking research code in a unified way across diverse platforms and environments, performing whole system optimization, reproducing results and comparing them using public or private scoreboards on the cKnowledge.io platform. The modular CK approach was successfully validated with industrial partners to automatically co-design and optimize software, hardware and machine learning models for reproducible and efficient object detection in terms of speed, accuracy, energy, size and other characteristics. The long-term goal is to simplify and accelerate the development and deployment of ML models and systems by helping researchers and practitioners to share and reuse their knowledge, experience, best practices, artifacts and techniques using open CK APIs.

CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking

Preprint

Full-text available

Jan 2020

We present CodeReef - an open platform to share all the components necessary to enable cross-platform MLOps (MLSysOps), i.e. automating the deployment of ML models across diverse systems in the most efficient way. We also introduce the CodeReef solution - a way to package and share models as non-virtualized, portable, customizable and reproducible archive files. Such ML packages include JSON meta description of models with all dependencies, Python APIs, CLI actions and portable workflows necessary to automatically build, benchmark, test and customize models across diverse platforms, AI frameworks, libraries, compilers and datasets. We demonstrate several CodeReef solutions to automatically build, run and measure object detection based on SSD-Mobilenets, TensorFlow and COCO dataset from the latest MLPerf inference benchmark across a wide range of platforms from Raspberry Pi, Android phones and IoT devices to data centers. Our long-term goal is to help researchers share their new techniques as production-ready packages along with research papers to participate in collaborative and reproducible benchmarking, compare the different ML/software/hardware stacks and select the most efficient ones on a Pareto frontier using online CodeReef dashboards.

MLPerf Inference Benchmark

Preprint

Nov 2019

Machine-learning (ML) hardware and software system demand is burgeoning. Driven by ML applications, the number of different ML inference systems has exploded. Over 100 organizations are building ML inference chips, and the systems that incorporate existing models span at least three orders of magnitude in power consumption and four orders of magnitude in performance; they range from embedded devices to data-center solutions. Fueling the hardware are a dozen or more software frameworks and libraries. The myriad combinations of ML hardware and ML software make assessing ML-system performance in an architecture-neutral, representative, and reproducible manner challenging. There is a clear need for industry-wide standard ML benchmarking and evaluation criteria. MLPerf Inference answers that call. Driven by more than 30 organizations as well as more than 200 ML engineers and practitioners, MLPerf implements a set of rules and practices to ensure comparability across systems with wildly differing architectures. In this paper, we present the method and design principles of the initial MLPerf Inference release. The first call for submissions garnered more than 600 inference-performance measurements from 14 organizations, representing over 30 systems that show a range of capabilities.

Methodological Principles for Reproducible Performance Evaluation in Cloud Computing

Article

Full-text available

Jul 2019

The rapid adoption and the diversification of cloud computing technology exacerbate the importance of a sound experimental methodology for this domain. This work investigates how to measure and report performance in the cloud, and how well the cloud research community is already doing it. We propose a set of eight important methodological principles that combine best-practices from nearby fields with concepts applicable only to clouds, and with new ideas about the time-accuracy trade-off. We show how these principles are applicable using a practical use-case experiment. To this end, we analyze the ability of the newly released SPEC Cloud IaaS benchmark to follow the principles, and showcase real-world experimental studies in common cloud environments that meet the principles. Last, we report on a systematic literature review including top conferences and journals in the field, from 2012 to 2017, analyzing if the practice of reporting cloud performance measurements follows the proposed eight principles. Worryingly, this systematic survey and the subsequent two-round human reviews, reveal that few of the published studies follow the eight experimental principles. We conclude that, although these important principles are simple and basic, the cloud community is yet to adopt them broadly to deliver sound measurement of cloud environments.

Implementation science for software engineering: bridging the gap between research and practice (keynote)

Conference Paper

Full-text available

Sep 2018

Lauren Herckis

Software engineering research rarely impacts practitioners in the field. A desire to facilitate transfer alone is not sufficient to bridge the gap between research and practice. Fields from medicine to education have acknowledged a similar challenge over the past 25 years. An empirical approach to the translation of research into evidence-based practice has emerged from the resulting discussion. Implementation science has fundamentally changed the way that biomedical research is conducted, and has revolutionized the daily practice of doctors, social workers, epidemiologists, early childhood educators, and more. In this talk we will explore the methods, frameworks, and practices of implementation science and their application to novel disciplines, including software engineering research. We will close by proposing some directions for future software engineering research to facilitate transfer.

LIBVERSIONINGCOMPILER: An easy-to-use library for dynamic generation and invocation of multiple code versions

Article

Full-text available

Jan 2018

We present libVersioningCompiler, a C++ library designed to support the dynamic generation of multiple versions of the same compute kernel in a HPC scenario. It can be used to provide continuous optimization, code specialization based on the input data or on workload changes, or otherwise to dynamically adjust the application, without the burden of a full dynamic compiler. The library supports multiple underlying compilers but specifically targets the llvm framework. We also provide examples of use, showing the overhead of the library, and providing guidelines for its efficient use.

Introducing ReQuEST: an Open Platform for Reproducible and Quality-Efficient Systems-ML Tournaments

Article

Jan 2018

Co-designing efficient machine learning based systems across the whole hardware/software stack to trade off speed, accuracy, energy and costs is becoming extremely complex and time consuming. Researchers often struggle to evaluate and compare different published works across rapidly evolving software frameworks, heterogeneous hardware platforms, compilers, libraries, algorithms, data sets, models, and environments. We present our community effort to develop an open co-design tournament platform with an online public scoreboard. It will gradually incorporate best research practices while providing a common way for multidisciplinary researchers to optimize and compare the quality vs. efficiency Pareto optimality of various workloads on diverse and complete hardware/software systems. We want to leverage the open-source Collective Knowledge framework and the ACM artifact evaluation methodology to validate and share the complete machine learning system implementations in a standardized, portable, and reproducible fashion. We plan to hold regular multi-objective optimization and co-design tournaments for emerging workloads such as deep learning, starting with ASPLOS'18 (ACM conference on Architectural Support for Programming Languages and Operating Systems - the premier forum for multidisciplinary systems research spanning computer architecture and hardware, programming languages and compilers, operating systems and networking) to build a public repository of the most efficient machine learning algorithms and systems which can be easily customized, reused and built upon.

Program Autotuning As a Service: Opportunities and Challenges

Conference Paper

Full-text available

Dec 2016

Program autotuning is becoming an increasingly valuable tool for improving performance portability across diverse target architectures, exploring trade-offs between several criteria, or meeting quality of service requirements. Recent work on general autotuning frameworks enabled rapid development of domain-specific autotuners reusing common libraries of parameter types and search techniques. In this work we explore the use of such frameworks to develop general-purpose online services for program autotuning using the Software as a Service model. Beyond the common benefits of this model, the proposed approach opens up a number of unique opportunities, such as collecting performance data and utilizing it to improve further runs, or enabling remote online autotuning. However, the proposed autotuning as a service approach also brings in several challenges, such as accessing target systems, dealing with measurement latency, and supporting execution of user-provided code. This paper presents the first step towards implementing the proposed approach and addressing these challenges. We describe an implementation of generic autotuning service that can be used for tuning arbitrary programs on user-provided computing systems. The service is based on OpenTuner autotuning framework and runs on Everest platform that enables rapid development of computational web services. In contrast to OpenTuner, the service doesn't require installation of the framework, allows users to avoid writing code and supports efficient parallel execution of measurement tasks across multiple machines. The performance of the service is evaluated by using it for tuning synthetic and real programs.

Optimizing convolutional neural networks on embedded platforms with OpenCL

Conference Paper

Apr 2016

We invite the community to collaboratively design and optimize convolutional neural networks to meet the performance, accuracy and cost requirements for deployment on a range of form factors -- from sensors to self-driving cars.

Benchmarking, autotuning and crowdtuning OpenCL programs using the Collective Knowledge framework

Conference Paper

Apr 2016

Autotuning is a popular technique to ensure performance portability for important algorithms such as BLAS, FFT and DNN across the ever evolving software and hardware stack. Unfortunately, when performed on a single machine, autotuning can explore only a tiny fraction of the ever growing and non-linear optimization spaces and thus can easily miss optimal solutions. We propose to practically solve this problem with the help of the community using the open-source Collective Knowledge framework (CK). We have customized the universal multi-objective autotuning engine of CK to optimize the local work size and other parameters of OpenCL workloads across diverse inputs and devices. Optimal solutions (with speed increases of up to 20x and energy savings of up to 30% over the default configurations) are preserved in the open repository of optimization knowledge at http://cknowledge.org/repo.

CTFS: A Configurable Tuning Framework on SHENWEI System

Conference Paper

Jun 2021

Combining Static and Dynamic Analysis to Query Characteristics of HPC Applications

Conference Paper

Jun 2021

Compiler Autotuning Based on Hot Function for SHENWEI Processor

Conference Paper

Apr 2021

FireMarshal: Making HW/SW Co-Design Reproducible and Reliable

Conference Paper

Mar 2021

Dynamic Precision Autotuning with TAFFO

Article

May 2020

Many classes of applications, both in the embedded and high performance domains, can trade off the accuracy of the computed results for computation performance. One way to achieve such a trade-off is precision tuning—that is, to modify the data types used for the computation by reducing the bit width, or by changing the representation from floating point to fixed point. We present a methodology for high-accuracy dynamic precision tuning based on the identification of input classes (i.e., classes of input datasets that benefit from similar optimizations). When a new input region is detected, the application kernels are re-compiled on the fly with the appropriate selection of parameters. In this way, we obtain a continuous optimization approach that enables the exploitation of the reduced precision computation while progressively exploring the solution space, thus reducing the time required by compilation overheads. We provide tools to support the automation of the runtime part of the solution, leaving to the user only the task of identifying the input classes. Our approach provides a significant performance boost (up to 320%) on the typical approximate computing benchmarks, without meaningfully affecting the accuracy of the result, since the error remains always below 3%.

Implementing Computational Reproducibility in the Whole Tale Environment

Conference Paper

Jun 2019

We present and define a structured digital object, called a "Tale," for the dissemination and publication of computational scientific findings in the scholarly record. The Tale emerges from the NSF funded Whole Tale project (wholetale.org) which is developing a computational environment designed to capture the entire computational pipeline associated with a scientific experiment and thereby enable computational reproducibility. A Tale allows researchers to create and package code, data and information about the workflow and computational environment necessary to support, review, and recreate the computational results reported in published research. The Tale then captures the artifacts and information needed to facilitate understanding, transparency, and execution of the Tale for review and reproducibility at the time of publication.

CAASCADE: A System for Static Analysis of HPC Software Application Portfolios

Chapter

Apr 2019

With the increasing complexity of upcoming HPC systems, so-called “co-design” efforts to develop the hardware and applications in concert for these systems also become more challenging. It is currently difficult to gather information about the usage of programming model features, libraries, and data structure considerations in a quantitative way across a variety of applications, and this information is needed to prioritize development efforts in systems software and hardware optimizations. In this paper we propose CAASCADE, a system that can harvest this information in an automatic way in production HPC environments, and we show some early results from a prototype of the system based on GNU compilers and a MySQL database.

Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM

Article

Full-text available

Oct 2014

Real-time dense computer vision and SLAM offer great potential for a new level of scene modelling, tracking and real environmental interaction for many types of robot, but their high computational requirements mean that use on mass market embedded platforms is challenging. Meanwhile, trends in low-cost, low-power processing are towards massive parallelism and heterogeneity, making it difficult for robotics and vision researchers to implement their algorithms in a performance-portable way. In this paper we introduce SLAMBench, a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs in performance, accuracy and energy consumption of a dense RGB-D SLAM system. SLAMBench provides a KinectFusion implementation in C++, OpenMP, OpenCL and CUDA, and harnesses the ICL-NUIM dataset of synthetic RGB-D sequences with trajectory and scene ground truth for reliable accuracy comparison of different implementation and algorithms. We present an analysis and breakdown of the constituent algorithmic elements of KinectFusion, and experimentally investigate their execution time on a variety of multicore and GPUaccelerated platforms. For a popular embedded platform, we also present an analysis of energy efficiency for different configuration alternatives.

Collective Mind: Towards Practical and Collaborative Auto-Tuning

Article

Full-text available

Jul 2014
SCI PROGRAMMING-NETH

Empirical auto-tuning and machine learning techniques have been showing high potential to improve execution time, power consumption, code size, reliability and other important metrics of various applications for more than two decades. However, they are still far from widespread production use due to lack of native support for auto-tuning in an ever changing and complex software and hardware stack, large and multi-dimensional optimization spaces, excessively long exploration times, and lack of unified mechanisms for preserving and sharing of optimization knowledge and research material. We present a possible collaborative approach to solve above problems using Collective Mind knowledge management system. In contrast with previous cTuning framework, this modular infrastructure allows to preserve and share through the Internet the whole auto-tuning setups with all related artifacts and their software and hardware dependencies besides just performance data. It also allows to gradually structure, systematize and describe all available research material including tools, benchmarks, data sets, search strategies and machine learning models. Researchers can take advantage of shared components and data with extensible meta-description to quickly and collaboratively validate and improve existing auto-tuning and benchmarking techniques or prototype new ones. The community can now gradually learn and improve complex behavior of all existing computer systems while exposing behavior anomalies or model mispredictions to an interdisciplinary community in a reproducible way for further analysis. We present several practical, collaborative and model-driven auto-tuning scenarios. We also decided to release all material at http://c-mind.org/repo to set up an example for a collaborative and reproducible research as well as our new publication model in computer engineering where experimental results are continuously shared and validated by the community.

Contention aware execution: Online contention detection and response

Conference Paper

Full-text available

Apr 2010

Cross-core application interference due to contention for shared on-chip and off-chip resources pose a significant challenge to providing application level quality of service (QoS) guarantees on commodity multicore micro-architectures. Unexpected cross-core interference is especially problematic when considering latency-sensitive applications that are present in the web service data center application domains, such as web-search. The commonly used solution is to simply disallow the co-location of latency-sensitive applications and throughput-oriented batch applications on a single chip, leaving much of the processing capabilities of multicore micro-architectures underutilized. In this work we present a Contention Aware Execution Runtime (CAER) environment that provides a lightweight runtime solution that minimizes cross-core interference due to contention, while maximizing utilization. CAER leverages the ubiquitous performance monitoring capabilities present in current multicore processors to infer and respond to contention and requires no added hardware support. We present the design and implementation of the CAER environment, two separate contention detection heuristics, and approaches to respond to contention online. We evaluate our solution using the SPEC2006 benchmark suite. Our experiments show that when allowing co-location with CAER, as opposed to disallowing co-location, we are able to increase the utilization of the multicore CPU by 58% on average. Meanwhile CAER brings the overhead due to allowing co-location from 17% down to just 4% on average.

Histograms of Oriented Gradients for Human Detection

Conference Paper

Jul 2005
IEEE Comput Soc Conf Comput Vis Pattern Recogn

We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

Histograms of Oriented Gradients for Human Detection

Article

Jun 2005

Collective Mind, Part II: Towards Performance- and Cost-Aware Software Engineering as a Natural Science

Article

Jun 2015

Nowadays, engineers have to develop software often without even knowing which hardware it will eventually run on in numerous mobile phones, tablets, desktops, laptops, data centers, supercomputers and cloud services. Unfortunately, optimizing compilers are not keeping pace with ever increasing complexity of computer systems anymore and may produce severely underperforming executable codes while wasting expensive resources and energy. We present our practical and collaborative solution to this problem via light-weight wrappers around any software piece when more than one implementation or optimization choice available. These wrappers are connected with a public Collective Mind autotuning infrastructure and repository of knowledge (c-mind.org/repo) to continuously monitor various important characteristics of these pieces (computational species) across numerous existing hardware configurations together with randomly selected optimizations. Similar to natural sciences, we can now continuously track winning solutions (optimizations for a given hardware) that minimize all costs of a computation (execution time, energy spent, code size, failures, memory and storage footprint, optimization time, faults, contentions, inaccuracy and so on) of a given species on a Pareto frontier along with any unexpected behavior. The community can then collaboratively classify solutions, prune redundant ones, and correlate them with various features of software, its inputs (data sets) and used hardware either manually or using powerful predictive analytics techniques. Our approach can then help create a large, realistic, diverse, representative, and continuously evolving benchmark with related optimization knowledge while gradually covering all possible software and hardware to be able to predict best optimizations and improve compilers and hardware depending on usage scenarios and requirements.

Experience report: Community-driven reviewing and validation of publications

Article

Jun 2014

In this report, we share our practical experience on crowdsourcing evaluation of research artifacts and reviewing of publications since 2008. We also briefly discuss encountered problems including reproducibility of experimental results and possible solutions.

Docker: lightweight Linux containers for consistent development and deployment

Article

Mar 2014

Dirk Merkel

Docker promises the ability to package applications and their dependencies into lightweight containers that move easily between different distros, start up quickly and are isolated from each other.

The Art of Computer Systems Performance Analysis : Techniques for Experimental Design, Measurement, Simulation and Modeling / R. Jain.

Article

Jan 1991

Raj Jain

The architecture of virtual machine

Article

Jun 2005

A virtual machine can support individual processes or a complete system depending on the abstraction level where virtualization occurs. Some VMs support flexible hardware usage and software isolation, while others translate from one instruction set to another. Virtualizing a system or component -such as a processor, memory, or an I/O device - at a given abstraction level maps its interface and visible resources onto the interface and resources of an underlying, possibly different, real system. Consequently, the real system appears as a different virtual system or even as multiple virtual systems. Interjecting virtualizing software between abstraction layers near the HW/SW interface forms a virtual machine that allows otherwise incompatible subsystems to work together. Further, replication by virtualization enables more flexible and efficient and efficient use of hardware resources.

Realeyes image processing benchmark

Jan 2011

E Hajiyev
R Dávid
L Marák
R Baghdadi

E. Hajiyev, R. Dávid, L. Marák, and R. Baghdadi, "Realeyes image processing benchmark." https://github.com/Realeyes/pencil-benchmarksimageproc, 2011-2015.

Live report with shared artifacts and interactive graphs

779-781

G Fursin
A Lokhmotov

G. Fursin and A. Lokhmotov, "Live report with shared artifacts and interactive graphs." http://cknowledge.org/repo/web.php?wcid=report: b0779e2a64c22907.

Collective Mind: Towards practical and collaborative auto-tuning

G Fursin
R Miceli
A Lokhmotov
M Gerndt
M Baboulin
D Malony
Z Chamski
D Novillo
D D Vento

Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM

May 2015

L Nardi
B Bodin
M Z Zia
J Mawer
A Nisbet
P H J Kelly
A J Davison
M Luján
M F P O'boyle
G Riley
N Topham
S Furber

L. Nardi, B. Bodin, M. Z. Zia, J. Mawer, A. Nisbet, P. H. J. Kelly, A. J. Davison, M. Luján, M. F. P. O'Boyle, G. Riley, N. Topham, and S. Furber, "Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM," in IEEE Intl. Conf. on Robotics and Automation (ICRA), May 2015. arXiv:1410.2167.

Collective Knowledge: Towards R&D Sustainability

No full-text available

Recommended publications

The importance of small knowledge intensive firms in European countries

Corporate Governance, Human Capital, and Firm Value

Challenges of managing knowledge across projects

The Strategic Timing of R&D Agreements