DNN models are growing ever larger to achieve unprecedented accuracy, and the accompanying increase in computation and memory requirements necessitates massive clusters and elaborate parallelization strategies to accelerate DNN training. To better optimize performance and analyze cost, it is indispensable to...

Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion

Conference Paper

Feb 2023

AMOS: enabling automatic mapping for tensor computations on spatial accelerators with hardware abstraction

Conference Paper

Jun 2022

Fig. 4: Expression-based representation examples.

NeoFlow: A Flexible Framework for Enabling Efficient Compilation for High Performance DNN Training

Article

Dec 2021

Deep neural networks (DNNs) are increasingly deployed in various image recognition and natural language processing applications. The continuous demand for accuracy and high performance has led to innovations in DNN design and a proliferation of new operators. However, existing DNN training frameworks such as PyTorch and TensorFlow only support a li...

Fig. 1: The computation flow of fast convolution algorithms.

Fig. 8: Pseudo Code for Dependency Resolving

Enabling Efficient Fast Convolution Algorithms on GPUs via MegaKernels

Article

Feb 2020

Modern CNNs require a huge number of convolution operations. To address this overwhelming computational cost, the Winograd and FFT fast algorithms have been used as effective approaches to reduce the number of multiplications. Inputs and filters are transformed to special domains, where element-wise multiplication is performed, which can be transformed into bat...

A coordinated tiling and batching framework for efficient GEMM on GPUs

Conference Paper

Feb 2019

General matrix multiplication (GEMM) plays a paramount role in a broad range of domains such as deep learning, scientific computing, and image processing. The primary optimization method is to partition the matrix into many tiles and exploit the parallelism within and between tiles. The tiling hierarchy closely mirrors the thread hierarchy on GPUs....

Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs

Article

Feb 2019

In recent years, Convolutional Neural Networks (CNNs) have become widely adopted for computer vision tasks. FPGAs have been adequately explored as promising hardware accelerators for CNNs due to their high performance, energy efficiency, and reconfigurability. However, prior FPGA solutions based on the conventional convolutional algorithm are often b...

Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs

Conference Paper

Jun 2017

Convolutional neural networks (CNNs) are used in a variety of computer vision applications, ranging from object recognition and detection to scene understanding, owing to their exceptional accuracy. There exist different algorithms for CNN computation. In this paper, we explore the conventional convolution algorithm with a faster algorithm using W...

Towards Distributed Machine Learning in Shared Clusters: A Dynamically-Partitioned Approach

Conference Paper

May 2017

Towards Distributed Machine Learning in Shared Clusters: A Dynamically-Partitioned Approach

Article

Apr 2017

Many cluster management systems (CMSs) have been proposed to share a single cluster with multiple distributed computing systems. However, none of the existing approaches can handle distributed machine learning (ML) workloads given the following criteria: high resource utilization, fair resource allocation and low sharing overhead. To solve this pro...

Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs

Conference Paper

Apr 2017

Timed Dataflow: Reducing Communication Overhead for Distributed Machine Learning Systems

Conference Paper

Dec 2016

A Cross-Platform SpMV Framework on Many-Core Architectures

Article

Oct 2016

Sparse Matrix-Vector multiplication (SpMV) is a key operation in engineering and scientific computing. Although the previous work has shown impressive progress in optimizing SpMV on many-core architectures, load imbalance and high memory bandwidth remain the critical performance bottlenecks. We present our novel solutions to these problems, for bot...

A fast integral image generation algorithm on GPUs

Article

Apr 2015

The integral image, also known as the summed-area table, is a two-dimensional table generated from an input image. Each entry in the table stores the sum of all pixels located above and to the left of that entry's position in the input image. The integral image is a very popular and important algorithm in computer vision and computer graphics applications. Especially...
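
The definition in this abstract can be illustrated with a minimal sequential sketch. It is not the GPU algorithm from the paper; it assumes an inclusive convention (each entry includes its own pixel) and pads the table with a zero row and column, and the names `integral_image` and `rect_sum` are hypothetical.

```python
def integral_image(img):
    """Build a summed-area table for a 2D list of numbers.

    Assumption: sat[y + 1][x + 1] holds the sum of img[0..y][0..x], inclusive.
    """
    rows = len(img)
    cols = len(img[0]) if rows else 0
    # Extra zero row and column so border entries need no special case.
    sat = [[0] * (cols + 1) for _ in range(rows + 1)]
    for y in range(rows):
        row_sum = 0  # running sum of the current row up to column x
        for x in range(cols):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def rect_sum(sat, top, left, bottom, right):
    """Sum of img[top..bottom][left..right] (inclusive) from four lookups."""
    return (sat[bottom + 1][right + 1] - sat[top][right + 1]
            - sat[bottom + 1][left] + sat[top][left])


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
table = integral_image(image)
total = rect_sum(table, 0, 0, 2, 2)          # whole image
lower_right = rect_sum(table, 1, 1, 2, 2)    # 5 + 6 + 8 + 9
```

Once the table is built, any rectangular sum costs four lookups regardless of the rectangle's size, which is why the structure is attractive for vision workloads.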

Deep Image: Scaling up Image Recognition

Article

Jan 2015

We present a state-of-the-art image recognition system, Deep Image, developed using end-to-end deep learning. The key components are a custom-built supercomputer dedicated to deep learning, a highly optimized parallel algorithm using new strategies for data partitioning and communication, larger deep neural network models, novel data augmentation a...

Deep Image: Scaling up Image Recognition

Book

Jan 2015

Understanding the tradeoffs between software-managed vs. hardware-managed caches in GPUs

Conference Paper

Mar 2014

On-chip caches are commonly used in computer systems to hide long off-chip memory access latencies. To manage on-chip caches, either software-managed or hardware-managed schemes can be employed. State-of-the-art accelerators, such as the NVIDIA Fermi or Kepler GPUs and Intel's forthcoming MIC “Knights Landing” (KNL), support both software-managed cache...

yaSpMV: yet another SpMV framework on GPUs

Conference Paper

Feb 2014

SpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although the previous work has shown impressive progress, load imbalance and high memory bandwidth remain the critical perfo...

yaSpMV

Article

Feb 2014

CLSIFT: An Optimization Study of the Scale Invariance Feature Transform on GPUs

Conference Paper

Nov 2013

Scale Invariance Feature Transform (SIFT) is well suited to image matching because of its invariance to image scaling, rotation, and slight changes in illumination or viewpoint. However, due to its high computational complexity, it is technically challenging to deploy SIFT in real-time applications. To address this problem, we propose CLSIFT,...

StreamScan

Article

Aug 2013

Scan (also known as prefix sum) is a very useful primitive for various important parallel algorithms, such as sort, BFS, SpMV, and compaction. The current state-of-the-art GPU-based scan implementation consists of three consecutive Reduce-Scan-Scan phases. This approach requires at least two global barriers and 3N (N is the problem size) globa...
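
The three-phase Reduce-Scan-Scan structure mentioned in this abstract can be sketched sequentially as follows. This is only an illustration of the baseline scheme, not StreamScan itself and not a GPU implementation: on a GPU each phase would typically be a separate step divided by a global barrier. The function name `reduce_scan_scan` and the `block_size` parameter are hypothetical.

```python
from itertools import accumulate


def reduce_scan_scan(data, block_size=4):
    """Inclusive prefix sum computed in three block-wise phases."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]

    # Phase 1 (reduce): every block computes its own total.
    block_sums = [sum(block) for block in blocks]

    # Phase 2 (scan): an exclusive scan over the block totals yields the
    # offset that each block must add to its local results.
    offsets = [0] + list(accumulate(block_sums))[:-1]

    # Phase 3 (scan): every block scans locally and adds its offset.
    out = []
    for block, offset in zip(blocks, offsets):
        out.extend(offset + value for value in accumulate(block))
    return out


prefix = reduce_scan_scan([1, 2, 3, 4, 5, 6, 7, 8, 9], block_size=4)
```

In this sketch the input is read once in phase 1 and again in phase 3, and the output is written once, which gives a rough sense of where the repeated global traffic in the three-phase scheme comes from.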

StreamScan: Fast Scan Algorithms for GPUs without Global Barrier Synchronization

Conference Paper

Feb 2013

An Insightful Program Performance Tuning Chain for GPU Computing

Conference Paper

Sep 2012

It is challenging to optimize GPU kernels because this process requires deep technical knowledge of the underlying hardware. Modern GPU architectures are becoming more and more diversified, which further exacerbates the already difficult problem of performance optimization. This paper presents an insightful performance tuning chain for GPUs. The g...

GPURoofline: A Model for Guiding Performance Optimizations on GPUs

Conference Paper

Aug 2012

Performance optimization on GPUs requires deep technical knowledge of the underlying hardware. Modern GPU architectures are becoming more and more diversified, which further exacerbates the already difficult problem. This paper presents GPURoofline, an empirical model for guiding optimizations on GPUs. The goal is to help non-expert programmers wit...

Parallelization and performance optimization on face detection algorithm with OpenCL: A case study

Article

Jun 2012

Face detection applications are real-time by nature. Although the Viola-Jones algorithm handles this elegantly, today's ever larger high-quality images and videos pose new challenges for real-time processing. It is a good idea to parallelize the Viola-Jones algorithm with OpenCL to achieve high performance across both AMD and NVIDIA GPU p...

Summed-area table algorithm optimization based on the OpenCL

Conference Paper

May 2012

The summed-area table algorithm is also known as the integral image algorithm. It is often used for quickly and efficiently generating the sum of values in a rectangular subset of a grid. Our work is based on the OpenCL framework. We have studied various optimization methods, mainly on AMD GPUs. In this paper, we first implemented an efficient pref...

Network

John Owens
University of California, Davis
Leonid Oliker
Lawrence Berkeley National Laboratory
Ion Stoica
University of California, Berkeley
Katherine Yelick
University of California at Berkeley and Lawrence Berkeley National Laboratory
Stanimire Tomov
University of Tennessee