Schematic of the NVIDIA Tesla C1060 GPU card: memory and processor organization.

Source publication

CUDA implementation of a block-matching algorithm for Multiple GPU cards

Article

In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the CUDA computing engine. The implemented block matching algorithm (BMA) uses the summed absolute difference (SAD) error criterion and full grid search (FS) for finding optimal block di...
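As an illustration of the technique the abstract names, the following is a minimal CPU-side sketch of full-search block matching with the SAD criterion. It is not the paper's CUDA implementation; the function names, the list-of-lists frame layout, and the toy frames are assumptions made for this example.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Summed absolute difference between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, radius):
    """Test every displacement within +/- radius and return
    (best_dx, best_dy, best_sad). Candidates outside `ref` are skipped."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + n > w or y0 + n > h:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


# Toy 8 x 8 frames (hypothetical data): `cur` is `ref` shifted by (2, 1),
# zero-padded where the shifted position falls outside the frame.
SIZE = 8
ref = [[x * x + 17 * y * y for x in range(SIZE)] for y in range(SIZE)]
cur = [[ref[y + 1][x + 2] if y + 1 < SIZE and x + 2 < SIZE else 0
        for x in range(SIZE)] for y in range(SIZE)]

print(full_search(cur, ref, bx=2, by=2, n=2, radius=2))  # (2, 1, 0)
```

Every candidate displacement is independent of the others, which is why this search maps naturally onto many parallel GPU threads.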

Context 1

... GPUs have specific hardware for floating-point arithmetic and for cached access to 2D and 3D matrices [12]. To a programmer, a CUDA-capable card is a collection of multiprocessors (30 for the Tesla C1060), where each multiprocessor has a number of processors (8 for the Tesla C1060); see Figure 3. Each multiprocessor has its own fast shared memory (16 KB for the C1060) that is common to all the processors within it. ...
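The figures quoted above can be turned into a short back-of-the-envelope calculation. The sketch below uses only the numbers stated in the text; the 16 x 16 block size and one byte per pixel are assumptions chosen for illustration, and the block count is an upper bound, since a real kernel cannot devote all shared memory to pixel data.

```python
# Figures quoted in the text for the Tesla C1060.
MULTIPROCESSORS = 30
PROCESSORS_PER_MULTIPROCESSOR = 8
SHARED_MEMORY_BYTES = 16 * 1024  # per multiprocessor


def total_processors():
    """Number of processors on the whole card."""
    return MULTIPROCESSORS * PROCESSORS_PER_MULTIPROCESSOR


def blocks_per_shared_memory(block_size=16, bytes_per_pixel=1):
    """Upper bound on how many square pixel blocks fit in one
    multiprocessor's shared memory (block size is an assumption)."""
    return SHARED_MEMORY_BYTES // (block_size * block_size * bytes_per_pixel)


print(total_processors())          # 240
print(blocks_per_shared_memory())  # 64
```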

Efficient Motion Field Interpolation Method for Wyner-Ziv Video Coding

Article

Apr 2011

Wyner-Ziv video coding can reduce video encoding complexity by shifting the motion estimation procedure from the encoder to the decoder. Among the many motion estimation methods, the expectation maximization algorithm is the most effective. Unfortunately, the implementation of block-based motion estimation in this algorithm causes motion field...

Temporally Coherent Super-Resolution of Textured Video via Dynamic Texture Synthesis.

Article

Jan 2015

This paper addresses the problem of hallucinating the missing high-resolution (HR) details of a low-resolution (LR) video while maintaining the temporal coherence of the reconstructed HR details by using dynamic texture synthesis (DTS). Most existing multi-frame-based video super-resolution (SR) methods suffer from the problem of limited reconstruc...

A Super-Resolution Image Reconstruction using Natural Neighbor Interpolation

Article

Apr 2015

A super-resolution image reconstruction algorithm using natural neighbor interpolation is proposed and its performance is evaluated. The algorithm is divided into two stages: image registration and the reconstruction of a high-resolution color image. In the first stage, as shifts between images are usually unknown, the algorithm computes an approxi...

Joint Overlapped Block Motion Compensation Using Eight-Neighbor Block Motion Vectors for Frame Rate Up-Conversion

Article

Oct 2013

Traditional block-based motion compensation methods in frame rate up-conversion (FRUC) use only a single motion vector field. However, there will always be some errors in the motion vector field, whether or not advanced motion estimation (ME) and motion vector analysis (MA) algorithms are applied. Once the motion vector field ha...

Spatial correlation-based side information refinement for distributed video coding

Article

Nov 2013

Distributed video coding (DVC) architecture designs, based on distributed source coding principles, have benefitted from significant progress lately, notably in terms of achievable rate-distortion performance. However, a significant performance gap still remains when compared to prediction-based video coding schemes such as H.264/AVC. This is ma...

Real-time CUDA-based stereo matching using Cyclops2 algorithm

Article

Feb 2018
Int J Image Video Process

This paper presents a novel stereo matching algorithm, Cyclops2. Given two rectified grayscale images, the algorithm produces a disparity image. The matching is based on minimising a weight function calculated from the absolute difference of pixel intensities. We present three simple and easily parallelizable weight functions. Each gives a different trade-off between processing time and the accuracy of the reconstructed depth image. A detailed description of the CUDA implementation of the algorithm is provided. The implementation was specifically optimised for the embedded NVIDIA Jetson platform, and NVIDIA Jetson TK1 and TX1 boards were used to evaluate the algorithms. We evaluated seven algorithm variations with different parameter values. Each variation results in a different speed-accuracy trade-off, demonstrating that our algorithm can be used in various situations. The presented algorithm achieves up to 70 FPS on lower-resolution images (750 × 500 pixels) and up to 23 FPS on high-resolution images (1500 × 1000 pixels). The use of an optional post-processing stage (median filter) has also been investigated. We conclude that, despite its limitations, our algorithm is relevant in the field of real-time obstacle avoidance.

Multi‐GPU based Event Detection and Localization using High Definition Videos

Conference Paper

Apr 2014

Video processing algorithms are widely used in computer vision applications such as motion tracking, human behavior understanding, and event detection and localization. However, with the new high-definition video standards (HD: 1280×720, or Full HD: 1920×1080), current implementations, even on modern hardware, cannot meet the requirements of real-time processing. To overcome this constraint, many applications have been developed that exploit the high computing power of graphics processing units (GPUs). However, none is able to process high-definition videos efficiently. In this work, we propose an effective exploitation of single and multiple GPUs in order to achieve real-time detection and localization of abnormal events in HD and Full HD videos. The proposed approach detects portions of video that correspond to sudden changes and variations in motion. It also identifies areas in video frames where motion behavior is surprising compared to the rest of the motion in the same frame. Experiments were conducted on several videos and show efficient detection and localization of abnormal events in multi-user scenarios. The use of multiple GPUs enabled real-time processing of high-definition videos with a global speedup ranging from 5 to 35 compared with CPU implementations.

Evaluation of CUDA GPU architecture as H.264 intra coding acceleration engine

Conference Paper

Nov 2013

Currently, high computational complexity makes it very difficult to produce a complete high-definition real-time H.264 encoder for a conventional personal computer platform based only on a single-threaded software implementation. With that in mind, this paper analyses the potential of using modern general-purpose graphical processing technologies, such as the NVIDIA CUDA® platform, as acceleration engines to improve the overall performance of a computer-based H.264 intra video encoder. The experiments performed made it possible to determine the real gains of replacing a CPU-only solution with a GPU-based one and to identify some practical bottlenecks of that solution. The most efficient proposal was finally compared with the original H.264/AVC reference code and the optimized x264 open source codec library, registering significant performance gains (in some cases higher than 7.6x).

Investigating the performance of motion estimation block-matching algorithms on GPU cards

Conference Paper

Sep 2013

In the field of video compression, motion estimation (ME) is a process with high computational complexity. Implementations of ME block-matching (BM) algorithms on general-purpose Central Processing Units (CPUs) have resulted in poor performance. In this paper we investigate the performance of two BM ME algorithms, Three Step Search (TSS) and Four Step Search (4SS), on an NVIDIA Quadro 400 Graphics Processing Unit (GPU) using the Compute Unified Device Architecture (CUDA) platform. Both algorithms perform motion estimation on a block-by-block basis, which is considered the simplest approach in terms of hardware and software implementation. The focus is on parallelizing the algorithms for real-time execution. We consider two well-known test sequences, "football" and "mad900", with different image resolutions. The results show that the implementation on a GPU card can improve execution time by a factor of 100.
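For readers unfamiliar with Three Step Search, here is a minimal sequential sketch of the classical scheme: evaluate nine candidates around the current centre, move to the best one, halve the step, and repeat. It is not the paper's CUDA code; the function names, the starting step of 4, and the toy frames are assumptions made for this example.

```python
def block_sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n block of `cur` at (bx, by) and the block of
    `ref` displaced by (dx, dy); infinite if the candidate leaves the frame."""
    h, w = len(ref), len(ref[0])
    x0, y0 = bx + dx, by + dy
    if x0 < 0 or y0 < 0 or x0 + n > w or y0 + n > h:
        return float("inf")
    return sum(abs(cur[by + y][bx + x] - ref[y0 + y][x0 + x])
               for y in range(n) for x in range(n))


def three_step_search(cur, ref, bx, by, n, step=4):
    """Return (dx, dy, sad) found by Three Step Search. With step=4 the
    steps are 4, 2, 1, covering displacements of up to +/- 7 pixels."""
    cx, cy = 0, 0
    best = block_sad(cur, ref, bx, by, 0, 0, n)
    while step >= 1:
        best_dx, best_dy = cx, cy
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                cost = block_sad(cur, ref, bx, by, cx + dx, cy + dy, n)
                if cost < best:
                    best, best_dx, best_dy = cost, cx + dx, cy + dy
        cx, cy = best_dx, best_dy  # re-centre on the best candidate
        step //= 2
    return cx, cy, best


# Toy 24 x 24 frames (hypothetical data): `cur` is `ref` shifted by (4, 4).
SIZE = 24
ref = [[x * x + 17 * y * y for x in range(SIZE)] for y in range(SIZE)]
cur = [[ref[y + 4][x + 4] if y + 4 < SIZE and x + 4 < SIZE else 0
        for x in range(SIZE)] for y in range(SIZE)]

print(three_step_search(cur, ref, bx=8, by=8, n=4))  # (4, 4, 0)
```

Unlike full search, this evaluates at most 27 candidates per block (25 distinct ones, since the centre is re-evaluated at each step), at the cost of possibly settling on a local minimum of the SAD surface.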

An efficient dynamic multiple-candidate motion vector approach for GPU-based hierarchical motion estimation

Conference Paper

Dec 2012

Hierarchical, or pyramid, search is a widely used approach in motion estimation, one of the most expensive functions in video encoding, because of its low computational complexity and high efficiency. In this approach, multiple down-sampled resolutions of the video frames are created. An initial motion estimate is quickly made at the lowest resolution. The final motion estimation result is obtained by propagating the initial estimate towards the original resolution. A GPU, or general-purpose GPU, with hundreds of embedded SIMD-based cores is well suited to motion estimation, especially with full-search-based approaches, as the process can be efficiently parallelized. However, a common fundamental drawback of hierarchical search is that an erroneous estimate at a reduced resolution may make the final motion estimate inaccurate. Multiple-candidate motion vector approaches have been proposed; however, they lack a mechanism for selecting the best multiple-candidate scheme given diverse video encoding characteristics. In this paper we analyse and verify the computational complexity of hierarchical search using NVIDIA GPUs with realistic workloads. Based on this analysis, we propose an efficient dynamic multiple-candidate motion vector approach that dynamically selects the best multiple-candidate motion vector scheme at runtime. This approach can achieve the highest possible speedups while satisfying a desired motion estimation efficiency. Experiments on realistic workloads show that dynamic scheme selection outperforms fixed scheme selection based on profiling.
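The basic coarse-to-fine idea described in the abstract can be sketched with a two-level pyramid and a single candidate vector. This is a simplified illustration only, not the paper's dynamic multiple-candidate scheme; the function names, the 2 x 2 averaging filter, the search radii, and the toy frames are all assumptions, and the block position and size are assumed to be even.

```python
def downsample(frame):
    """Halve the resolution by averaging each 2 x 2 group (integer mean)."""
    h, w = len(frame) // 2, len(frame[0]) // 2
    return [[(frame[2 * y][2 * x] + frame[2 * y][2 * x + 1]
              + frame[2 * y + 1][2 * x] + frame[2 * y + 1][2 * x + 1]) // 4
             for x in range(w)] for y in range(h)]


def search(cur, ref, bx, by, n, radius, start=(0, 0)):
    """Full SAD search within +/- radius of `start`; returns (dx, dy, sad)."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(start[1] - radius, start[1] + radius + 1):
        for dx in range(start[0] - radius, start[0] + radius + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + n > w or y0 + n > h:
                continue  # candidate block leaves the reference frame
            cost = sum(abs(cur[by + y][bx + x] - ref[y0 + y][x0 + x])
                       for y in range(n) for x in range(n))
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


def hierarchical_search(cur, ref, bx, by, n, coarse_radius=2, refine_radius=1):
    """Estimate at half resolution, then refine around the scaled-up vector."""
    coarse = search(downsample(cur), downsample(ref),
                    bx // 2, by // 2, n // 2, coarse_radius)
    start = (2 * coarse[0], 2 * coarse[1])  # propagate to full resolution
    return search(cur, ref, bx, by, n, refine_radius, start)


# Toy 32 x 32 frames (hypothetical data): `cur` is `ref` shifted by (2, 2).
SIZE = 32
ref = [[x * x + 17 * y * y for x in range(SIZE)] for y in range(SIZE)]
cur = [[ref[y + 2][x + 2] if y + 2 < SIZE and x + 2 < SIZE else 0
        for x in range(SIZE)] for y in range(SIZE)]

print(hierarchical_search(cur, ref, bx=8, by=8, n=8))  # (2, 2, 0)
```

Because only one coarse vector is propagated, a wrong coarse estimate cannot be recovered by the small refinement window, which is the drawback that multiple-candidate schemes aim to address.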

Experimentation of Motion Estimation Algorithms in GPU

Conference Paper

Oct 2015

Video encoder motion estimation algorithms allow a great level of parallelism exploitation, since the same arithmetic operations are repeated over large amounts of pixel data. This paper analyses the use of modern general-purpose graphical processing units (GPGPUs), such as NVIDIA CUDA® devices, as an effective acceleration engine to improve the overall performance of motion estimation algorithms. The results of our analysis include practical evaluations performed on different ME methods using the CUDA platform. The evaluations show the impact of the method, the search window size, and the mapping of ME threads onto the GPGPU on the speedup that can be achieved on such a parallel platform.

Exploration of motion estimation algorithm in graphics processing environment

Conference Paper

Oct 2012

Currently, even considering the recent advances in microprocessor computing power, high-definition multimedia applications still place very high demands on hardware for real-time video encoding. In particular, modern video encoders (MPEG/ITU H.26x series) depend on complex and computationally exhaustive motion estimation algorithms to identify and remove temporal redundancy among consecutive (or non-consecutive) frames in a video sequence, as a strategy to reduce the final compressed bit rate. In fact, block matching can be considered the most critical encoder algorithm in terms of computational demands, as it is responsible for searching distinct reference frames for pixel blocks similar to each of the input image blocks. The number of block comparisons required for high-definition videos represents a clear and important restriction for real-time implementations. This paper introduces an improved block matching strategy, optimized for multiprocessing execution and focusing mainly on implementation on general-purpose graphical processing unit technologies such as NVIDIA CUDA® GPUs. The improved motion estimation solution was implemented in the JSVM reference code (the scalable version of the H.264 video encoder), where an average speedup gain of more than 350% was recorded for 4CIF videos.

Motion perception in medical imaging, in: SPIE Medical Imaging

Conference Paper

Feb 2011
Proceedings of SPIE

A potential drawback of image noise suppression in medical image sequence processing is a possible loss of apparent motion: objects appear to move more slowly or less than they do in reality. For medical imaging applications this can be of critical importance; for example, myocardium motion in cardiac gated single photon emission computed tomography (SPECT) imaging can differentiate viable muscle from scar tissue. Therefore, in this work we design a set of experiments to measure how human observers perceive apparent motion in the presence of image degradations such as noise and blur. In addition, we try to identify relevant image features, based on a visual attention model and a block matching motion estimation method, that would allow the development of an accurate numerical observer capable of predicting human observer motion perception.
