Conference Paper

A coarse grained and hybrid reconfigurable architecture with flexible NOC router for variable block size motion estimation

Abstract

This paper proposes a novel application-specific hybrid coarse-grained reconfigurable architecture with a flexible network-on-chip (NoC) mechanism. The architecture supports variable block size motion estimation (VBSME) with far fewer resources than ASIC-based and coarse-grained reconfigurable architectures. The intelligent NoC router supports the full-search motion estimation algorithm as well as fast search algorithms such as diamond, hexagon, big hexagon, and spiral. Our model is a hierarchical, hybrid, processing-element-based 2D architecture that supports reuse of reference frame blocks between the processing elements through NoC routers. This reduces the transactions to and from main memory. The proposed architecture is described in Verilog HDL and synthesized with a 90 nm CMOS standard cell library. Results show that our architecture reduces the gate count by 7x compared to its ASIC counterpart, which supports only the full-search method. Moreover, the proposed architecture operates at a frequency comparable to that of the ASIC-based implementation to sustain 30 fps. Our approach is based on a simple design that exploits a high level of parallelism with intensive data reuse. Therefore, the proposed architecture supports run-time reconfiguration for any block size and any search pattern, depending on the application requirement.
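For orientation, the full-search baseline that the abstract compares against can be sketched in software as below. This is a minimal illustration, not the paper's hardware design: the function names `sad` and `full_search`, the list-of-rows frame layout, and the choice of the sum of absolute differences (SAD) as the matching cost are all assumptions made for the example.

```python
def sad(cur, ref, bx, by, rx, ry, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the n x n block of `ref` at (rx, ry).
    Frames are lists of rows of integer pixel values (assumed layout)."""
    total = 0
    for j in range(n):
        for i in range(n):
            total += abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
    return total


def full_search(cur, ref, bx, by, n, search_range):
    """Evaluate every displacement (dx, dy) with |dx|, |dy| <= search_range
    that keeps the candidate block inside `ref`.
    Returns (best_motion_vector, best_sad)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_sad = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue
            cost = sad(cur, ref, bx, by, rx, ry, n)
            if best_sad is None or cost < best_sad:
                best_mv, best_sad = (dx, dy), cost
    return best_mv, best_sad
```

The doubly nested displacement loop is what makes full search expensive: a search range of ±R costs up to (2R+1)² block comparisons per block, which is the work that fast patterns such as diamond or hexagon try to avoid.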

... In [118], [119], the authors Verma et al. proposed a reconfigurable ME architecture that supports full-search and fast search algorithms that use patterns including diamond, hexagon, big-hexagon and spiral. ...
Thesis
Full-text available
Full text can be found at: http://ria.ua.pt/handle/10773/15442 (PDF: http://ria.ua.pt/bitstream/10773/15442/1/thesis_Purnachand_Nalluri_52295.pdf). Abstract: Video coding is used in applications such as video surveillance, video conferencing, video streaming, video broadcasting and video storage. In a typical video coding standard, many algorithms are combined to compress a video. Of those algorithms, motion estimation is the most complex task. Hence, it is necessary to implement this task in real time by using appropriate VLSI architectures. This thesis proposes a new fast motion estimation algorithm and its implementation in real time. The results show that the proposed algorithm and its motion estimation hardware architecture outperform the state of the art. The proposed architecture operates at a maximum frequency of 241.6 MHz and is able to process 1080p@60Hz with all possible variable block sizes specified in the HEVC standard, as well as with a motion vector search range of up to ±64 pixels.
... After that, many VLSI architectures have been proposed for VBS-IME (e.g. [3][4][5][6]). Some works have also tried to map IME algorithms onto CGRAs [7][8][9], but these mappings are very simple and mainly aim at validating the proposed architecture. ...
Article
Full-text available
Variable block size integer motion estimation (VBS-IME) is one of several tools that contribute to H.264/AVC's excellent coding efficiency. However, its high computational complexity and huge memory access bandwidth make it difficult to implement. Therefore, a hardware accelerator is indispensable for full-search VBS-IME in real-time video encoding applications. To overcome some of the limitations of conventional microprocessors and fine-grained reconfigurable devices in the field of multimedia and communication baseband processing, we have proposed a coarse-grained dynamically reconfigurable computing system called REMUS. The paper presents the architecture and compiling flow proposed for the REMUS system, and shows that it is possible to implement a high-complexity application such as the H.264/AVC full-search VBS-IME algorithm with competitive performance on the REMUS platform. Experimental results show that the REMUS system operating at 200 MHz can perform VBS-IME at real-time speed for CIF/SDTV@30fps video sequences with two reference frames and a maximum search range of [-16,15]/[-8,7]. The implementation is therefore applicable to H.264/AVC encoders in mobile multimedia applications. The REMUS system is designed and synthesized using TSMC 65 nm low-power technology. The die size of REMUS is 23.7 mm². REMUS consumes about 194 mW while working at 200 MHz.
Article
Driven by consumer demand, the multimedia market is booming and video coding standards are evolving rapidly. A dynamically reconfigurable coarse-grained architecture, REMUS-II (REconfigurable MUltimedia System 2), is developed as a multi-standard, high-resolution, power-efficient, real-time multimedia decoding processor. A hierarchical pipeline is adopted in REMUS-II for multimedia applications. This paper details the implementation of pipeline optimization techniques for the algorithm and architecture co-design. At each level, the key factors that influence pipeline performance are analyzed and optimized, including the computational components, the hierarchical memory interfaces, the synchronization mechanisms, and the balanced task assignments. The experimental results show that, compared to the original version, the H.264/AVC decoding performance is improved by 2.93 times by the proposed methods. After optimization, REMUS-II can decode 1080p streams of multiple standards in real time, including H.264/AVC High Profile, MPEG-2 Main Profile, and AVS Jizhun Profile.
Conference Paper
This paper introduces the proposal of an expression-grain reconfigurable architecture called BRICK, its functionality and main components. A mapping for three signal processing applications, a 3×3 2-D convolution, a 16-tap FIR filter and an 8-point FFT, is developed inside the 4×4 reconfigurable array. A performance simulation study compares the VHDL implementation of the BRICK reconfigurable array to MIPS and SPARC V8 simulators in order to validate the reconfigurable array proposal. Considerable gains of up to an order of magnitude are obtained, and important design issues and challenges were discovered while developing this work.
Conference Paper
In this work, we explore a new family of coarse grain reconfigurable architecture called BRICK, which is capable of mapping complete expressions and pipelines into one processing element with Multiple-Input, Multiple-Output characteristics while provided with a centralized control unit to synchronize the operation of each Processing Element (PE). Each PE has heterogeneous ALUs specialized in a particular type of operation. These ALUs can be interconnected to implement complex expressions, either sequential or combinational, increasing computational density and utilization rate of the reconfigurable Array. Preliminary synthesis results and application examples show that efficient mappings can be achieved with BRICK.
Article
Scheduling, placement, and routing are important steps in Very Large Scale Integration (VLSI) design. Researchers have developed numerous techniques to solve placement and routing problems. As the complexity of Application Specific Integrated Circuits (ASICs) increased over the past decades, so did the demand for improved place and route techniques. The primary objective of these place and route approaches has typically been wirelength minimization due to its impact on signal delay and design performance. With the advent of Field Programmable Gate Arrays (FPGAs), the same place and route techniques were applied to FPGA-based design. However, traditional place and route techniques may not work for Coarse-Grained Reconfigurable Architectures (CGRAs), which are reconfigurable devices offering wider path widths than FPGAs and more flexibility than ASICs, due to the differences in architecture and routing network. Further, the routing network of several types of CGRAs, including the Field Programmable Object Array (FPOA), has deterministic timing as compared to the routing fabric of most ASICs and FPGAs reported in the literature. This necessitates a fresh look at alternative approaches to place and route designs. This dissertation presents a finite domain constraint-based, delay-aware placement and routing methodology targeting an FPOA. The proposed methodology takes advantage of the deterministic routing network of CGRAs to perform a delay aware placement.
Conference Paper
Full-text available
A reconfigurable architecture optimized for media processing, and based on 4-bit arithmetic logic unit (ALU) and interconnect is described. Together, these allow the area devoted to configuration bits and routing switches to be about 50% of the area of the basic CHESS array, leaving the rest available for user-visible functional units. CHESS flexibility in application mapping is largely due to the ability to feed ALU with instruction streams generated within the array, generous provision of embedded block random access memory, and the ability to trade routing switches for small memories.
Article
Full-text available
The most radical of the architectures that appear in this issue are Raw processors: highly parallel architectures with hundreds of very simple processors coupled to a small portion of the on-chip memory. Each processor, or tile, also contains a small bank of configurable logic, allowing synthesis of complex operations directly in configurable hardware. Unlike the others, this architecture does not use a traditional instruction set architecture. Instead, programs are compiled directly onto the Raw hardware, with all units told explicitly what to do by the compiler. The compiler even schedules most of the intertile communication. The real limitation to this architecture is the efficacy of the compiler. The authors demonstrate impressive speedups for simple algorithms that lend themselves well to this architectural model, but whether this architecture will be effective for future workloads is an open question.
Article
Full-text available
With the advent of new video standards such as MPEG-4 part-10 and H.264/H.26L, demands for advanced video coding, particularly in the area of variable block size video motion estimation (VBSME), are increasing. In this paper, we propose a new one-dimensional (1-D) very large-scale integration architecture for full-search VBSME (FSVBSME). The VBS sum of absolute differences (SAD) computation is performed by re-using the results of smaller sub-block computations. These are distributed and combined by incorporating a shuffling mechanism within each processing element. Whereas a conventional 1-D architecture can process only one motion vector (MV), this new architecture can process up to 41 MV sub-blocks (within a macroblock) in the same number of clock cycles.
Article
Full-text available
In block motion estimation, the shape and size of the search pattern have a very important impact on search speed and distortion performance. A square-shaped search pattern is adopted in many popular fast algorithms. Recently, a diamond-shaped search pattern was introduced in fast block motion estimation and has exhibited a faster search speed. Based on an in-depth examination of the influence of the search pattern on speed performance, we propose a novel algorithm using a hexagon-based search pattern to achieve further improvement. The hexagon-based search pattern is investigated in comparison with the diamond search pattern and demonstrates a significant speedup over the diamond-based search. Analysis shows that the speed improvement rate of the hexagon-based search (HEXBS) algorithm over the diamond search (DS) algorithm can be over 80% for locating some motion vectors in certain scenarios. In short, the proposed HEXBS algorithm can find the same motion vector with fewer search points than the DS algorithm. Generally speaking, the larger the motion vector, the more search points the HEXBS algorithm can save, which is further justified by experimental results.
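A pattern-based search of this kind can be sketched as follows. The abstract does not give the pattern coordinates, so the six-point large hexagon and four-point small refinement pattern used here are assumptions based on the commonly described form of hexagon search; the function name `hexagon_search` and the `cost` callback (for example, a SAD evaluated at a candidate motion vector) are likewise hypothetical.

```python
# Assumed pattern offsets (dx, dy); not taken from the abstract above.
LARGE_HEXAGON = [(-2, 0), (2, 0), (-1, -2), (1, -2), (-1, 2), (1, 2)]
SMALL_PATTERN = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def hexagon_search(cost, start=(0, 0), search_range=16):
    """Coarse hexagon steps until the centre is the best point, then one
    small-pattern refinement. `cost(mv)` returns the matching cost of a
    candidate motion vector. Returns (best_mv, best_cost, points_evaluated)."""
    cache = {}

    def evaluate(mv):
        # Each candidate is costed once, so overlapping hexagons are cheap.
        if mv not in cache:
            cache[mv] = cost(mv)
        return cache[mv]

    def neighbours(center, pattern):
        points = [(center[0] + dx, center[1] + dy) for dx, dy in pattern]
        return [p for p in points
                if abs(p[0]) <= search_range and abs(p[1]) <= search_range]

    center = start
    while True:
        # The centre is listed first, so ties keep the current centre
        # and the loop only moves on a strict improvement.
        best = min([center] + neighbours(center, LARGE_HEXAGON), key=evaluate)
        if best == center:
            break
        center = best

    best = min([center] + neighbours(center, SMALL_PATTERN), key=evaluate)
    return best, evaluate(best), len(cache)
```

With a smooth cost surface the number of evaluated points grows roughly with the distance travelled rather than with the area of the search window, which is the source of the savings over full search. Like any greedy pattern search, it can stop in a local minimum.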
Article
Full-text available
The widespread use of block-based interframe motion estimation for video sequence compression in both MPEG and H.263 standards is due to its effectiveness and simplicity of implementation. Nevertheless, the high computational complexity of the full-search algorithm has motivated a host of suboptimal but faster search strategies. A popular example is the three-step search (TSS) algorithm. However, its uniformly spaced search pattern is not well matched to most real-world video sequences in which the motion vector distribution is nonuniformly biased toward the zero vector. Such an observation inspired the new three-step search (NTSS) which has a center-biased search pattern and supports a halfway-stop technique. It is faster on average, and gives better motion estimation as compared to the well-known TSS. Later, the four-step search (4SS) algorithm was introduced to reduce the average case from 21 to 19 search points, while maintaining a performance similar to NTSS in terms of motion compensation errors. We propose a novel unrestricted center-biased diamond search (UCBDS) algorithm which is more efficient, effective, and robust than the previous techniques. It has a best case scenario of only 13 search points and an average of 15.5 block matches. This makes UCBDS consistently faster than the other suboptimal block-matching techniques. This paper also compares the above methods in which both the processing speed and the accuracy of motion compensation are tested over a wide range of test video sequences
Article
Full-text available
This paper proposes a novel flexible VLSI architecture for the implementation of variable block size motion estimation (VBSME). The architecture is able to perform a full motion search on block sizes that are integral multiples of 4×4. To use the architecture, each 16×16 macroblock of the source frames should be partitioned into sixteen non-overlapping 4×4 subblocks, called primitive subblocks. The architecture contains sixteen modules and one VBSME processor. Each module, realized by cascading 1-D systolic arrays, is responsible for the block-matching operations of a different primitive subblock. The realization has the advantages of high throughput, high flexibility and 100% processing element (PE) utilization. The motion estimation of all the primitive subblocks is performed in parallel. Because these primitive subblocks can be used to form the 41 subblocks of different sizes specified by H.264, the VBSME processor is employed to concurrently compute the sums of absolute differences (SADs) of all 41 subblocks from the SADs of the primitive subblocks. This new architecture has lower latency and higher throughput than other existing VBSME architectures for the hardware implementation of H.264 encoders.
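The merging of primitive SADs into the 41 subblock SADs can be illustrated with a short software sketch. It shows only the arithmetic, not the VBSME processor itself; the function name `combine_sads`, the 4×4 grid layout of the input, and the "WxH" (width by height) key naming are assumptions made for the example.

```python
def combine_sads(p):
    """`p` is a 4x4 grid (list of rows) holding the SADs of the sixteen 4x4
    primitive subblocks of one macroblock for one candidate motion vector.
    Returns a dict mapping 'WxH' to the list of SADs of that partition size."""
    # 8 wide, 4 high: two horizontally adjacent primitives (4 rows x 2 cols).
    s8x4 = [[p[r][c] + p[r][c + 1] for c in (0, 2)] for r in range(4)]
    # 4 wide, 8 high: two vertically adjacent primitives (2 rows x 4 cols).
    s4x8 = [[p[r][c] + p[r + 1][c] for c in range(4)] for r in (0, 2)]
    # 8x8: reuse the 8x4 sums instead of re-adding four primitives.
    s8x8 = [[s8x4[r][c] + s8x4[r + 1][c] for c in range(2)] for r in (0, 2)]
    # 16 wide, 8 high: top and bottom halves of the macroblock.
    s16x8 = [s8x8[r][0] + s8x8[r][1] for r in range(2)]
    # 8 wide, 16 high: left and right halves of the macroblock.
    s8x16 = [s8x8[0][c] + s8x8[1][c] for c in range(2)]
    s16x16 = s16x8[0] + s16x8[1]

    def flatten(grid):
        return [value for row in grid for value in row]

    return {
        "4x4": flatten(p),        # 16 subblocks
        "8x4": flatten(s8x4),     # 8 subblocks
        "4x8": flatten(s4x8),     # 8 subblocks
        "8x8": flatten(s8x8),     # 4 subblocks
        "16x8": s16x8,            # 2 subblocks
        "8x16": s8x16,            # 2 subblocks
        "16x16": [s16x16],        # 1 subblock
    }
```

The partition counts add up to 16 + 8 + 8 + 4 + 2 + 2 + 1 = 41. Every larger SAD is built from sums that were already computed, so pixel differences are evaluated only once per candidate motion vector.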
Article
... this article. This project is funded by US Defense Advanced Research Projects Agency contract DABT63-96-C-0036 and a National Science Foundation Presidential Young Investigator Award. Ikos Systems donated the VirtuaLogic emulation system.
Article
Two optimized implementations of the emerging ITU-T H.26L video encoder are described. The first, medium-optimized version, is implemented in C and the latter, highly optimized version, utilizes both algorithmic and platform-specific optimizations. Comparisons to a correspondingly optimized H.263/H.263+ implementation are given with the spatial and temporal video quality fixed and the bit rate and complexity varied. On a 733 MHz general-purpose processor, an average encoding speed of 17 frames per second for QCIF sequences is achieved with a 29% reduction in bit rate compared to H.263+. The complexity of H.26L is about 3.4 times more than that of H.263+.
Article
MPEG-4 is a new multimedia standard combining interactivity, object-based natural and synthetic digital video, audio and computer graphics. Implementing the video part of the MPEG-4 standard requires a high degree of flexibility, and motion estimation accounts for the largest share of the computational power. Therefore, in this paper fast algorithms for MPEG-4 motion estimation are evaluated in terms of visual quality and computational power requirements for processor-based implementations. Due to the object-based nature of MPEG-4, new VLSI architectures for MPEG-4 motion estimation are also required. Therefore, known motion estimation architectures are evaluated on their capability of being modified for MPEG-4 support. Based on this evaluation, a new dedicated but flexible MPEG-4 motion estimation architecture targeted at low-power handheld applications is presented, which proved advantageous over processor-based implementations by orders of magnitude.
Conference Paper
An optimized implementation of an H.26L video encoder is presented. Compared to H.263, H.26L reduces the output bit rate by about 25% at the expense of increased (3.8X) complexity. However, optimizations enable real-time operation on a 733 MHz general-purpose processor.
Conference Paper
Current implementations of MPEG2 encoders are specially designed in order to perform a huge number of operations, most of which occur during motion estimation. Many fast algorithms have been proposed to reduce the processing power necessary. This paper examines the results achieved by several methods that show promise for reducing computation while sacrificing as little image quality as possible. Methods that achieve these goals are desirable for use in future encoders that will be implemented on generic digital signal processors (DSPs)
Conference Paper
MATRIX is a novel, coarse-grain, reconfigurable computing architecture which supports configurable instruction distribution. Device resources are allocated to controlling and describing the computation on a per task basis. Application-specific regularity allows us to compress the resources allocated to instruction control and distribution, in many situations yielding more resources for datapaths and computations. The adaptability is made possible by a multi-level configuration scheme, a unified configurable network supporting both datapaths and instruction distribution, and a coarse-grained building block which can serve as an instruction store, a memory element, or a computational element. In a 0.5 μ CMOS process, the 8-bit functional unit at the heart of the MATRIX architecture has a footprint of roughly 1.5 mm×1.2 mm, making single dies with over a hundred function units practical today. At this process point, 100 MHz operation is easily achievable, allowing MATRIX components to deliver on the order of 10 Gop/s (8-bit ops)
Article
H.264/AVC is the newest video coding standard of the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group. The main goals of the H.264/AVC standardization effort have been enhanced compression performance and provision of a "network-friendly" video representation addressing "conversational" (video telephony) and "nonconversational" (storage, broadcast, or streaming) applications. H.264/AVC has achieved a significant improvement in rate-distortion efficiency relative to existing standards. This article provides an overview of the technical features of H.264/AVC, describes profiles and applications for the standard, and outlines the history of the standardization process.
Article
A unified approach to the coder control of video coding standards such as MPEG-2, H.263, MPEG-4, and the draft video coding standard H.264/AVC (advanced video coding) is presented. The performance of the various standards is compared by means of PSNR and subjective testing results. The results indicate that H.264/AVC compliant encoders typically achieve essentially the same reproduction quality as encoders that are compliant with the previous standards while typically requiring 60% or less of the bit rate.
Article
In this paper, a low-power full-search block matching (FSBM) motion-estimation design for the ITU-T recommendation H.263+ standard is proposed. New motion-estimation modes in H.263+ can be fully supported by our architecture. Unlike most previously presented motion-estimation chips, this design can deal with 8×8 and 16×16 block size with different searching ranges. Basically, the proposed architecture is composed of an integer pixel unit with 64 processing elements, and a half-pixel unit with interpolation, a control unit, and data registers. In order to minimize power consumption, gated-clock and dual-supply voltages are used. This design has been realized by TSMC 0.6 μm SPTM CMOS technology. The power consumption is 423.8 mW at 60 MHz and the throughput is 36 fps in CIF format
Article
This paper reports two efficient quadtree-based algorithms for variable-size block matching (VSBM) motion estimation. The schemes allow the dimensions of blocks to adapt to local activity within the image, and the total number of blocks in any frame can be varied while still accurately representing true motion. This permits adaptive bit allocation between the representation of displacement and residual data, and also the variation of the overall bit-rate on a frame-by-frame basis. The first algorithm computes the optimal selection of variable-sized blocks to provide the best-achievable prediction error under the fixed number of blocks for a quadtree-based VSBM technique. The algorithm employs an efficient dynamic programming technique utilizing the special structure of a quadtree. Although this algorithm is computationally intensive, it does provide a yardstick by which the performance of other more practical VSBM techniques can be measured. The second algorithm adopts a heuristic way to select variable-sized square blocks. It relies more on local motion information than on global error optimization. Experiments suggest that the effective use of local information contributes to minimizing the overall error. The result is a more computationally efficient VSBM technique than the optimal algorithm, but with a comparable prediction error
Article
This paper describes a data-interlacing architecture with two-dimensional (2-D) data reuse for the full-search block-matching algorithm. Based on a one-dimensional processing element (PE) array and two data-interlacing shift-register arrays, the proposed architecture can efficiently reuse data to decrease external memory accesses and reduce the pin count. It also achieves 100% hardware utilization and a high throughput rate. In addition, the same chips can be cascaded for different block sizes, search ranges, and pixel rates.
Article
A flexible and powerful VLSI architecture for the implementation of a wide spectrum of full search and reduced complexity search block matching algorithms is presented. Optimized efficiency for variable algorithm parameters is obtained by using a quadratic systolic array architecture with global accumulation, combined with a flexible meander-like data flow. Flexibility is further increased by cascadability and/or the possibility of parallel operation. Hardware overhead for particular algorithmic requirements, such as variable pixel resolution, subsampling with offset, and subpixel accuracy, is discussed in detail. A full-custom (CMOS) implementation for the architecture is described
Article
A family of modular VLSI architectures and chip implementations of the motion-compensation full-search block-matching algorithm are described. This set of application-specific integrated circuits is motivated by the intensive computations required to perform motion compensation in real time. The architectures are based on data-flow designs, which allow sequential inputs but perform parallel processing with 100% efficiency. On the basis of these architectures, a programmable chip can be designed for motion vector estimation with different block sizes. The chips can be cascaded for a larger tracking range or for a video source with a higher pixel sampling rate. A chip-pair design is also derived for calculating fractional motion vectors with quarter-pel precision. The chip-pair design has been laid out, and the chip characteristics are given. Test circuitry is also included to increase the testability of the chips
Article
Configurable computers have attracted considerable attention recently because they promise to deliver the performance of application-specific hardware along with the flexibility of general-purpose computers. Unfortunately, configurable computing has had rather limited success to date. We believe that the FPGAs currently used to construct configurable computers are too general to achieve good cost-performance on computationally-intensive applications that demand special-purpose hardware. This paper describes a new architecture called RaPiD (Reconfigurable Pipelined Datapaths), which is optimized for highly repetitive, computationally-intensive tasks. Very deep application-specific computation pipelines can be configured in RaPiD that deliver very high performance for a wide range of applications. RaPiD achieves this using a coarse-grained reconfigurable architecture that mixes the appropriate amount of static configuration with dynamic control.
FPGA Co-Processing Architectures for Video Compression
  • Alex Soohoo
Alex Soohoo, "FPGA Co-Processing Architectures for Video Compression," Altera Corporation.
An Efficient VLSI Architecture for H.264 Variable Block Size Motion Estimation
  • Chien-Min Ou
  • Chian-Feng Le
  • Wen-Jyi Hwang
Chien-Min Ou, Chian-Feng Le and Wen-Jyi Hwang, "An Efficient VLSI Architecture for H.264 Variable Block Size Motion Estimation," IEEE Transactions on Consumer Electronics, vol. 51, no. 4, Nov. 2005, pp. 1291-1299.