Figure 3 - uploaded by Jovan G Brankov
Schematic of the NVIDIA Tesla C1060 GPU card: memory and processor organization.


Source publication
Article
Full-text available
In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the CUDA computing engine. The implemented block matching algorithm (BMA) uses the summed absolute difference (SAD) error criterion and a full grid search (FS) for finding optimal block di...
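To make the SAD criterion and full grid search concrete, the following is a minimal CPU-side sketch in plain Python. It is not the paper's CUDA implementation; the function names `sad` and `full_search`, the toy frames, the block size, and the search radius are all assumptions chosen for illustration.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def full_search(cur, ref, bx, by, n, radius):
    """Evaluate every displacement in [-radius, radius]^2 and return
    (best_sad, dx, dy). Candidates that leave the frame are skipped."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            inside = (0 <= bx + dx and bx + dx + n <= w
                      and 0 <= by + dy and by + dy + n <= h)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# Hypothetical toy frames: `cur` shows the content that sits 2 pixels to the
# right and 1 pixel down in `ref`.
ref = [[(7 * x + 13 * y) % 31 for x in range(12)] for y in range(12)]
cur = [[ref[(y + 1) % 12][(x + 2) % 12] for x in range(12)] for y in range(12)]

print(full_search(cur, ref, bx=4, by=4, n=4, radius=3))  # (0, 2, 1)
```

Every candidate displacement is independent of the others, which is why this search maps naturally onto many parallel threads.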

Context in source publication

Context 1
... GPUs have specific hardware for floating point arithmetic, 2D and 3D matrix cached access [12]. To a programmer, a CUDA capable card is a collection of multiprocessors (30 for Tesla C1060) where each multiprocessor has a number of processors (8 for Tesla C1060), see Figure 3. Each multiprocessor has its own fast shared memory (16KB for the C1060) that is common to all the processors within it. ...
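The figures quoted above allow a quick back-of-the-envelope calculation. The sketch below only multiplies out the numbers given in the text; the 16×16 block size and the 8-bit pixel depth are assumptions added for illustration and do not come from the source.

```python
# Values stated in the text for the Tesla C1060.
MULTIPROCESSORS = 30
PROCESSORS_PER_MULTIPROCESSOR = 8
SHARED_MEM_BYTES = 16 * 1024  # 16KB of shared memory per multiprocessor

# Assumptions for illustration only.
BLOCK_SIDE = 16      # 16 x 16 pixel blocks
BYTES_PER_PIXEL = 1  # 8-bit grayscale

total_processors = MULTIPROCESSORS * PROCESSORS_PER_MULTIPROCESSOR
block_bytes = BLOCK_SIDE * BLOCK_SIDE * BYTES_PER_PIXEL
blocks_per_shared_mem = SHARED_MEM_BYTES // block_bytes

print(total_processors)       # 240
print(blocks_per_shared_mem)  # 64
```

Under these assumptions, a multiprocessor's shared memory could hold at most 64 such blocks at once, which bounds how much of a search window can be staged there.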

Similar publications

Article
Full-text available
Wyner-Ziv video coding has the capability to reduce video encoding complexity by shifting the motion estimation procedure from the encoder to the decoder. Amongst many motion estimation methods, the expectation maximization algorithm is the most effective one. Unfortunately, the implementation of block-based motion estimation in this algorithm causes motion field...
Article
Full-text available
This paper addresses the problem of hallucinating the missing high-resolution (HR) details of a low-resolution (LR) video while maintaining the temporal coherence of the reconstructed HR details by using dynamic texture synthesis (DTS). Most existing multi-frame-based video super-resolution (SR) methods suffer from the problem of limited reconstruc...
Article
Full-text available
A super-resolution image reconstruction algorithm using natural neighbor interpolation is proposed and its performance is evaluated. The algorithm is divided into two stages: image registration and the reconstruction of a high-resolution color image. In the first stage, as shifts between images are usually unknown, the algorithm computes an approxi...
Article
The traditional block-based motion compensation methods in frame rate up-conversion (FRUC) use only a single motion vector field. However, there will always be some mistakes in the motion vector field, whether or not advanced motion estimation (ME) and motion vector analysis (MA) algorithms are performed. Once the motion vector field ha...
Article
Full-text available
Distributed video coding (DVC) architecture designs, based on distributed source coding principles, have benefitted from significant progress lately, notably in terms of achievable rate-distortion performance. However, a significant performance gap still remains when compared to prediction-based video coding schemes such as H.264/AVC. This is ma...

Citations

... Even with these measures, results produced by BM are sparse and have a high bad pixel percentage, i.e. 25.27%. BM can be easily parallelised due to its simplicity [5], but results of a CUDA implementation have not been submitted to the Middlebury and KITTI benchmarks. Our algorithm is similar to BM in that we are also performing a full scanline search. ...
Article
Full-text available
This paper presents a novel stereo matching algorithm, Cyclops2. The algorithm produces a disparity image, given two rectified grayscale images. The matching is based on the concept of minimising a weight function calculated using the absolute difference of pixel intensities. We present three simple and easily parallelizable weight functions. Each presented function gives a different trade-off between algorithm processing time and reconstructed depth image accuracy. A detailed description of the algorithm implementation in CUDA is provided. The implementation was specifically optimised for the embedded NVIDIA Jetson platform. NVIDIA Jetson TK1 and TX1 boards have been used to evaluate the algorithms. We evaluated seven algorithm variations with different parameter values. Each variation results in a different speed-accuracy trade-off, demonstrating that our algorithm can be used in various situations. The presented algorithm achieves a processing rate of up to 70 FPS on lower resolution images (750 × 500 pixels) and up to 23 FPS on high-resolution images (1500 × 1000 pixels). The use of an optional post-processing stage (median filter) has also been investigated. We conclude that despite its limitations, our algorithm is relevant in the field of real-time obstacle avoidance.
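As a rough illustration of matching along a scanline by minimising an absolute intensity difference, the sketch below picks, for each pixel of a left-image row, the disparity with the smallest per-pixel cost. This is a deliberately simplified stand-in, not one of the Cyclops2 weight functions; the name `scanline_disparity`, the sample rows, and the disparity limit are assumptions.

```python
def scanline_disparity(left_row, right_row, max_disp):
    """For each pixel x of the left row, return the disparity d in
    [0, max_disp] that minimises |left[x] - right[x - d]|.
    The first (smallest) d wins on ties."""
    disparities = []
    for x, value in enumerate(left_row):
        best_d, best_cost = 0, None
        # A disparity larger than x would read outside the right row.
        for d in range(min(max_disp, x) + 1):
            cost = abs(value - right_row[x - d])
            if best_cost is None or cost < best_cost:
                best_d, best_cost = d, cost
        disparities.append(best_d)
    return disparities


# Hypothetical rectified rows: the left row is the right row shifted by 2.
right = [10, 20, 30, 40, 50, 60, 70, 80]
left = [0, 0] + right[:-2]

print(scanline_disparity(left, right, max_disp=3))
# [0, 1, 2, 2, 2, 2, 2, 2]
```

The first two entries are unreliable because the true match lies outside the row; each output pixel depends only on its own scanline, so rows and pixels can be processed in parallel.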
... Recently, considerable interest has been shown in new computational architectures, such as GPUs, which have turned out to be very efficient in various fields of science, and particularly for image and video processing [7], [10]. Yet, even though several approaches to the problem of motion detection and tracking have been proposed lately, including those taking advantage of GPUs [13] [14], they are either unable to handle high definition video streams or are limited to a single GPU and thus do not scale up well. Therefore, we propose GPU and multi-GPU implementations of background extraction, silhouette detection and optical flow estimation algorithms that are exploited within our proposed approaches to event detection and localization. ...
... This method made it possible to extract about 800 features from 640×480 video at 10 fps, which is approximately 10 times faster than the corresponding CPU implementation. The work in [14] proposes a CUDA implementation of a block matching motion estimation algorithm using multiple Graphics Processing Units (GPUs). This implementation achieved real-time motion estimation for an image sequence of 720×480 pixels, using two NVIDIA C1060 Tesla GPU cards. ...
Conference Paper
Full-text available
Video processing algorithms are widely used in applications related to computer vision such as motion tracking, human behavior understanding, event detection and localization. Nevertheless, with the new high-definition video standards (HD: 1280×720, or Full HD: 1920×1080), current implementations, even running on modern hardware, cannot meet the needs of real-time processing. To overcome this constraint, many applications have been developed that exploit the high power of graphic processing units (GPUs). However, none is able to process high-definition videos efficiently. In this work, we propose an effective exploitation of single and multiple GPUs in order to achieve real-time detection and localization of abnormal events using HD and Full HD videos. The proposed approach detects portions of video that correspond to sudden changes and variations in motion. It also identifies areas in video frames where motion behavior is surprising compared to the rest of the motion in the same frame. Experiments conducted on several videos show efficient detection and localization of abnormal events in multi-user scenarios. The use of multiple GPUs enabled real-time processing of high-definition videos with a global speedup ranging from 5 to 35 compared with CPU implementations.
... In recent years, several authors have proposed the use of GPU technology as an alternative for improving the performance of video encoders [4,5,9]. The work in [4], for example, proposes an optimized parallel execution of the IDCT using the CUDA architecture. ...
... Additionally, [9] proposes a motion estimation algorithm for an H.264/AVC encoder adopting CUDA. In this solution, the pixels of the current block and of the reference frame are first transferred from the central processor to GPU memory. ...
... In order to allow a fair comparison in a real execution environment, the present proposal was implemented in ANSI C and then incorporated into version 12.1 of the reference software [9]. The test scenario was based on the following configuration: Core i7-3770K 3.5GHz, 4C HT, 8MB cache; 16GB DDR3 1600MHz RAM; Geforce PNY GTX 680 2GB GDDR5 6GHz; 1536 CUDA cores. Initial evaluations pointed to a very significant relative delay related to the communication process between CPU and GPU, at times even exceeding the encoding time of the CUDA card. ...
Conference Paper
Currently, the high computational complexity makes it very difficult to produce a complete high definition real-time H.264 encoder solution for a conventional personal computer platform based only on a single-threaded software implementation. Considering that, the current paper analyses the potential of using modern general purpose graphical processing technologies, such as the NVIDIA CUDA® platform, as acceleration engines to improve the overall performance of a computer-based H.264 intra video encoder. The experiments performed made it possible to quantify the real gains obtained when replacing a CPU-only solution with a GPU-based one, and to identify some practical bottlenecks related to that solution. The most efficient proposal was finally compared with the original H.264/AVC reference code and the optimized x264 open source library codec, registering significant performance gains (in some cases higher than 7.6x).
... Motion estimation has been at the center of earlier research. One of the first algorithms to be used for block-based motion estimation, the Full Search (FS) algorithm, has been implemented on CUDA [17]. The idea behind the FS algorithm is to compare all blocks in a given search window to the current block. ...
... The method defines the block matching algorithm. Since the simplest algorithm, FS [17], remains only theoretical, researchers have made several attempts to find more effective algorithms. Among the variety of block-matching algorithms that exist, we will study: ...
... There is the possibility for further studies. One example could be the performance investigation of these algorithms on multiple GPU cards [17]. We would expect a linear acceleration with the growing number of GPUs. ...
Conference Paper
Full-text available
In the field of video compression, motion estimation (ME) is a process that leads to high computational complexity. Implementation of ME block-matching (BM) algorithms on a general purpose Central Processing Unit (CPU) has resulted in poor performance. In this paper we investigate the performance of two BM ME algorithms, Three Step Search (TSS) and Four Step Search (4SS), on an NVIDIA Quadro 400 Graphics Processing Unit (GPU) using the Compute Unified Device Architecture (CUDA) platform. Both algorithms perform motion estimation on a block-by-block basis, which is considered the simplest way in terms of hardware and software implementation. The focus is to achieve parallelization of the algorithms for real-time execution. We consider two well-known test sequences, "football" and "mad900", with different image resolutions. The results show that the implementation on a GPU card can improve the performance in terms of execution time by a factor of 100.
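For readers unfamiliar with Three Step Search, the sketch below follows the commonly described form of the algorithm: test the centre and its eight neighbours at a step of 4, move to the best candidate, then repeat with steps of 2 and 1. It is a plain-Python illustration, not the cited paper's CUDA code; the names `sad` and `three_step_search`, the toy frames, and the block position are assumptions.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n block of `cur` at (bx, by) and the block of
    `ref` displaced by (dx, dy); infinite if the candidate leaves the frame."""
    h, w = len(ref), len(ref[0])
    if not (0 <= bx + dx and bx + dx + n <= w
            and 0 <= by + dy and by + dy + n <= h):
        return float("inf")
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def three_step_search(cur, ref, bx, by, n, step=4):
    """Return (best_sad, dx, dy) after searching with steps 4, 2 and 1."""
    cx = cy = 0
    best = sad(cur, ref, bx, by, 0, 0, n)
    while step >= 1:
        best_dx, best_dy = cx, cy
        for oy in (-step, 0, step):
            for ox in (-step, 0, step):
                cost = sad(cur, ref, bx, by, cx + ox, cy + oy, n)
                if cost < best:
                    best, best_dx, best_dy = cost, cx + ox, cy + oy
        cx, cy = best_dx, best_dy  # recentre on the best candidate so far
        step //= 2
    return best, cx, cy


# Hypothetical toy frames: `cur` shows the content 4 pixels to the right in `ref`.
ref = [[(7 * x + 13 * y) % 31 for x in range(24)] for y in range(24)]
cur = [[ref[y][(x + 4) % 24] for x in range(24)] for y in range(24)]

print(three_step_search(cur, ref, bx=8, by=8, n=4))  # (0, 4, 0)
```

With a step of 4 the search evaluates at most 25 distinct candidates instead of the 225 a full search would test in the same ±7 window, at the cost of possibly settling on a local minimum when the error surface is not smooth.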
... ℎ. . ℎ. ) [10], where and ℎ are width and height of a frame in MBs, respectively . In our proposed pyramid search, the complexity to obtain ME for each MB at level is = ( ...
Conference Paper
Full-text available
Hierarchical or pyramid search is a widely used approach in motion estimation, one of the most expensive functions in video encoding, owing to its low computational complexity and high efficiency. In this approach, multiple down-sampled resolutions of the video frames are created. An initial motion estimate is quickly made at the lowest resolution. The final motion estimation result is achieved by propagating the initial estimate towards the original resolution. A GPU, or general purpose GPU, with hundreds of embedded SIMD-based cores is well suited to motion estimation, especially with full-search-based approaches, as the process can be efficiently parallelized. However, a common fundamental drawback of the hierarchical search is that an erroneous estimate from the reduced resolutions may make the final motion estimate inaccurate. Multiple-candidate motion vector approaches have been proposed; however, they lack a mechanism to select the best multiple-candidate scheme considering diverse video encoding characteristics. In this paper we analyse and verify the computational complexity of the hierarchical search using NVIDIA's GPU with realistic workloads. Based on this analysis, we propose an efficient dynamic multiple-candidate motion vector approach that selects the best multiple-candidate motion vector scheme at runtime. This approach can achieve the highest possible speedups while satisfying a desired motion estimation efficiency. Experiments on realistic workloads show that the dynamic scheme selection outperforms the fixed scheme selection based on profiling.
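The coarse-to-fine idea described above can be sketched with just two pyramid levels: a full search at half resolution, followed by a ±1 refinement at full resolution around the doubled coarse vector. This is a single-candidate illustration, not the paper's dynamic multiple-candidate scheme; all function names, the 2×2 summing filter, and the test frames are assumptions, and the sketch assumes even frame dimensions, block position, and block size.

```python
def sad(a, b, ax, ay, bx, by, n):
    """SAD between the n x n block of `a` at (ax, ay) and of `b` at (bx, by)."""
    return sum(abs(a[ay + y][ax + x] - b[by + y][bx + x])
               for y in range(n) for x in range(n))


def search(cur, ref, bx, by, n, cx, cy, radius):
    """Full search over displacements within `radius` of (cx, cy).
    Ties are broken in favour of the shorter vector."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(cy - radius, cy + radius + 1):
        for dx in range(cx - radius, cx + radius + 1):
            if 0 <= bx + dx <= w - n and 0 <= by + dy <= h - n:
                cost = sad(cur, ref, bx, by, bx + dx, by + dy, n)
                key = (cost, abs(dx) + abs(dy), dx, dy)
                if best is None or key < best:
                    best = key
    return best[0], best[2], best[3]


def downsample(img):
    """Halve the resolution by summing each 2 x 2 neighbourhood."""
    return [[img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]
             for x in range(0, len(img[0]) - 1, 2)]
            for y in range(0, len(img) - 1, 2)]


def two_level_search(cur, ref, bx, by, n, coarse_radius):
    """Coarse full search at half resolution, then a +/-1 refinement."""
    _, cdx, cdy = search(downsample(cur), downsample(ref),
                         bx // 2, by // 2, n // 2, 0, 0, coarse_radius)
    # Propagate the coarse vector to full resolution by doubling it.
    return search(cur, ref, bx, by, n, 2 * cdx, 2 * cdy, 1)


# Hypothetical frames: `cur` shows the content 4 right and 2 down in `ref`.
ref = [[x * x + 3 * y * y for x in range(32)] for y in range(32)]
cur = [[ref[(y + 2) % 32][(x + 4) % 32] for x in range(32)] for y in range(32)]

print(two_level_search(cur, ref, bx=8, by=8, n=8, coarse_radius=3))  # (0, 4, 2)
```

Here the coarse level tests 49 candidates on a block with a quarter of the pixels, and the fine level tests only 9, which is where the reduced complexity comes from. If the coarse estimate is wrong, the ±1 refinement cannot recover, which is the drawback the abstract mentions.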
Conference Paper
Video encoder motion estimation algorithms allow a great level of parallelism exploitation, since the same arithmetic operations are repeated over near amounts of pixel data. This paper analyses the use of modern general purpose graphical processing units (GPGPU), such as the NVIDIA CUDA® as an effective acceleration engine to improve motion estimation algorithms overall performance. The results of our analysis include practical evaluations performed on different ME methods using CUDA platform. The evaluations show the impacts of the method, window search size, and ME thread mapping onto the GPGPU in the speed up that can be achieved in such parallel platform.
Conference Paper
Currently, even considering the recent advances in microprocessor computing power, high definition multimedia applications still impose very demanding requirements for real-time video encoding. In particular, modern video encoders (MPEG/ITU H.26x series) depend on complex and computationally exhaustive motion estimation algorithms to identify and remove temporal redundancy among frames (consecutive or not) inside a video sequence, as a strategy to reduce the final compressed bit rate. In fact, block matching can be considered the most critical encoder algorithm in terms of computational demands, as it is responsible for searching distinct reference frames for pixel blocks similar to each of the input image blocks. The number of block comparisons required for high definition videos represents a clear and important restriction for real-time implementations. This paper introduces an improved block matching strategy optimized for multiprocessing execution, focusing mainly on implementation on general purpose graphical processing unit technologies such as NVidia CUDA® GPUs. The improved motion estimation solution was implemented in the JSVM reference code (the scalable version of the H.264 video encoder), where a speed-up of more than 350% on average was recorded for 4CIF videos.
Conference Paper
Full-text available
A potential drawback of image noise suppression in medical image sequence processing is a possible loss of the apparent motion: making objects appear to move slower or less than they move in reality. For medical imaging applications this can be of critical importance; for example, myocardium motion in cardiac gated single photon emission computed tomography (SPECT) imaging can differentiate viable muscle from scar tissue. Therefore, in this work we design a set of experiments to measure how human observers perceive apparent motion in the presence of image degradations like noise and blur. In addition, we will try to identify relevant image features, based on a visual attention model and a block matching motion estimation method, that would allow development of an accurate numerical observer capable of predicting human observer motion perception.