Shengen Yan

Shengen Yan
SenseTime Group Limited. · Research Institute of SenseTime

About

26
Publications
6,047
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
1,557
Citations

Publications

Publications (26)
Preprint
DNN models are becoming increasingly larger to achieve unprecedented accuracy, and the accompanying increased computation and memory requirements necessitate the employment of massive clusters and elaborate parallelization strategies to accelerate DNN training. In order to better optimize the performance and analyze the cost, it is indispensable to...
Article
Full-text available
Deep neural networks (DNNs) are increasingly deployed in various image recognition and natural language processing applications. The continuous demand for accuracy and high performance has led to innovations in DNN design and a proliferation of new operators. However, existing DNN training frameworks such as PyTorch and TensorFlow only support a li...
Article
Full-text available
Modern CNNs require huge amount of convolution operations. To address the overwhelming computation problem, Winograd and FFT fast algorithms have been used as effective approaches to reduce the number of multiplications. Inputs and filters are transformed to special domains then perform element-wise multiplication, which can be transformed into bat...
Conference Paper
General matrix multiplication (GEMM) plays a paramount role in a broad range of domains such as deep learning, scientific computing, and image processing. The primary optimization method is to partition the matrix into many tiles and exploit the parallelism within and between tiles. The tiling hierarchy closely mirrors the thread hierarchy on GPUs....
Article
In recent years, Convolutional Neural Networks (CNNs) have become widely adopted for computer vision tasks. FPGAs have been adequately explored as a promising hardware accelerator for CNNs due to its high performance, energy efficiency, and reconfigurability. However, prior FPGA solutions based on the conventional convolutional algorithm is often b...
Conference Paper
Convolutional neural network (CNN) finds applications in a variety of computer vision applications ranging from object recognition and detection to scene understanding owing to its exceptional accuracy. There exist different algorithms for CNNs computation. In this paper, we explore conventional convolution algorithm with a faster algorithm using W...
Article
Many cluster management systems (CMSs) have been proposed to share a single cluster with multiple distributed computing systems. However, none of the existing approaches can handle distributed machine learning (ML) workloads given the following criteria: high resource utilization, fair resource allocation and low sharing overhead. To solve this pro...
Article
Full-text available
Sparse Matrix-Vector multiplication (SpMV) is a key operation in engineering and scientific computing. Although the previous work has shown impressive progress in optimizing SpMV on many-core architectures, load imbalance and high memory bandwidth remain the critical performance bottlenecks. We present our novel solutions to these problems, for bot...
Article
Integral image, also known as summed area table is a two-dimensional table generated from an input image. Each entry in the table stores the sum of all pixels which locate on the top-left corner of the entry in the input image. Integral image is a very popular and important algorithm in computer vision and computer graphics applications. Especially...
Article
We present a state-of-the-art image recognition system, Deep Image, developed using end-to-end deep learning. The key components are a custom-built supercomputer dedicated to deep learning, a highly optimized parallel algorithm using new strategies for data partitioning and communication, larger deep neural network models, novel data augmentation a...
Book
We present a state-of-the-art image recognition system, Deep Image, developed using end-to-end deep learning. The key components are a custom-built supercomputer dedicated to deep learning, a highly optimized parallel algorithm using new strategies for data partitioning and communication, larger deep neural network models, novel data augmentation a...
Conference Paper
On-chip caches are commonly used in computer systems to hide long off-chip memory access latencies. To manage on-chip caches, either software-managed or hardware-managed schemes can be employed. State-of-art accelerators, such as the NVIDIA Fermi or Kepler GPUs and Intel's forthcoming MIC “Knights Landing” (KNL), support both software-managed cache...
Conference Paper
Full-text available
SpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although the previous work has shown impressive progress, load imbalance and high memory bandwidth remain the critical perfo...
Article
SpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although the previous work has shown impressive progress, load imbalance and high memory bandwidth remain the critical perfo...
Conference Paper
Scale Invariance Feature Transform (SIFT) is quite suitable for image matching because of its invariance to image scaling, rotation and slight changes in illumination or viewpoint. However, due to high computation complexity it's technically challenging to deploy SIFT in real time application situations. To address this problem, we propose CLSIFT,...
Article
Scan (also known as prefix sum) is a very useful primitive for various important parallel algorithms, such as sort, BFS, SpMV, compaction and so on. Current state of the art of GPU based scan implementation consists of three consecutive Reduce-Scan-Scan phases. This approach requires at least two global barriers and 3N (N is the problem size) globa...
Conference Paper
Scan (also known as prefix sum) is a very useful primitive for various important parallel algorithms, such as sort, BFS, SpMV, compaction and so on. Current state of the art of GPU based scan implementation consists of three consecutive Reduce-Scan-Scan phases. This approach requires at least two global barriers and 3N (N is the problem size) globa...
Conference Paper
It is challenging to optimize GPU kernels because this progress requires deep technical knowledge of the underlying hardware. Modern GPU architectures are becoming more and more diversified, which further exacerbates the already difficult problem of performance optimization. This paper presents an insightful performance tuning chain for GPUs. The g...
Conference Paper
Performance optimization on GPUs requires deep technical knowledge of the underlying hardware. Modern GPU architectures are becoming more and more diversified, which further exacerbates the already difficult problem. This paper presents GPURoofline, an empirical model for guiding optimizations on GPUs. The goal is to help non-expert programmers wit...
Article
Full-text available
Face detect application has a real time need in nature. Although Viola-Jones algorithm can handle it elegantly, today's bigger and bigger high quality images and videos still bring in the new challenge of real time needs. It is a good idea to parallel the Viola-Jones algorithm with OpenCL to achieve high performance across both AMD and NVidia GPU p...
Conference Paper
Summed-Area table algorithm is also known as image integral algorithm. It is often used for quickly and efficiently generating the sum of values in a rectangular subset of a grid. Our work is based on the OpenCL framework. We have studied various kinds of optimization methods mainly on AMD GPUs. In this paper, we first implemented an efficient pref...

Network

Cited By