Figure 1: H.264 Decoder Block Diagram

Source publication
Conference Paper
Full-text available
H.264, a state-of-the-art video compression standard, is used across a range of products from cellphones to HDTV. These products have vastly different performance, power and cost requirements, necessitating different hardware-software solutions for H.264 decoding. We show that a design methodology and associated tools which support synthesis from h...

Contexts in source publication

Context 1
... a coded frame, slices, or groups of macroblocks, may be intrapredicted, interpredicted from the previous frame, or interpredicted from multiple reference frames. Figure 1 shows a block diagram of our H.264 decoder. ...
Context 2
... implementation closely models the block diagram for the CODEC shown in Figure 1. To keep the design as flexible as possible, each block was organized to support latency-insensitive communications. ...
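To illustrate the latency-insensitive style described in this excerpt, here is a minimal behavioural sketch in C++: blocks communicate only through guarded bounded FIFOs, so a stage fires only when its input has data and its output has room, and no block assumes how many cycles its neighbour takes. The FIFO class and stage names are illustrative, not taken from the decoder itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Bounded FIFO with explicit "can enqueue / can dequeue" guards.
// A producer enqueues only when notFull(), a consumer dequeues only
// when notEmpty(); neither side relies on fixed cycle-by-cycle timing.
template <typename T>
class LatencyInsensitiveFifo {
public:
    explicit LatencyInsensitiveFifo(std::size_t depth) : depth_(depth) {}
    bool notFull()  const { return q_.size() < depth_; }
    bool notEmpty() const { return !q_.empty(); }
    void enq(const T& v) { q_.push_back(v); }          // caller checks notFull()
    T    deq()           { T v = q_.front(); q_.pop_front(); return v; }
private:
    std::size_t depth_;
    std::deque<T> q_;
};

// Hypothetical decoder stage: each "step" it fires only when its guards
// hold, mirroring a latency-insensitive block in the diagram.
struct InverseTransformStage {
    LatencyInsensitiveFifo<int16_t>& in;   // coefficients from entropy decoding
    LatencyInsensitiveFifo<int16_t>& out;  // residuals toward reconstruction
    void step() {
        if (in.notEmpty() && out.notFull()) {
            int16_t coeff = in.deq();
            out.enq(coeff);                // placeholder for the real transform
        }
    }
};
```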

Citations

... However, the choice of an HLS tool depends on the target platform and other criteria, such as source language, available resources, latency, and tool complexity. Prominent examples of HLS tools used in current research include BlueSpec (Ref. 3), Altera OpenCL (Ref. 4), and Vivado HLS (Ref. 5). Various studies adopt HLS as a design method, as in Refs. 6 and 7. Other works focus on testing their methods by implementing them on an FPGA platform in order to improve design productivity in terms of throughput, power consumption, and resource usage. ...
... In addition, an ALLOCATION directive is applied to multiplication operations to further improve FPGA resource usage. The synthesis results reveal a slight reduction in FFs and LUTs compared to solution 1, while the total number of clock cycles is kept the same. ...
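For readers unfamiliar with the directive mentioned above, the sketch below shows the general shape of a resource-sharing constraint in Vivado HLS: an ALLOCATION pragma bounding how many multiplier instances the scheduler may create. The function and loop are illustrative assumptions, and the exact pragma spelling varies between Vivado HLS and Vitis HLS releases, so it should be checked against the tool's documentation.

```cpp
// Illustrative HLS kernel: cap the number of hardware multipliers the
// scheduler may instantiate, trading schedule length for area.
void weighted_sum(const int a[16], const int b[16], int *result) {
    // Classic Vivado HLS form; newer Vitis HLS releases use
    // "#pragma HLS allocation operation instances=mul limit=2".
#pragma HLS ALLOCATION instances=mul limit=2 operation
    int acc = 0;
    for (int i = 0; i < 16; ++i) {
        // Multiplications are time-shared over at most 2 multipliers;
        // the tool may lengthen the schedule to honor the limit.
        acc += a[i] * b[i];
    }
    *result = acc;
}
```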
Article
Full-text available
Due to the increasing need for testing solutions for complex hardware designs, several efforts have been made to improve high-level synthesis (HLS) techniques. These solutions are conceived in such a way that they have to provide reasonable trade-offs in terms of design time, resources involved, and performance. Generally speaking, two main constraints should be satisfied for HLS applications. The first constraint is the ability to process complex systems at a reasonable cost, whereas the second revolves around considering test constraints in the first tasks of the HLS flow. To fulfill these two constraints, we treated a case study using HLS for the intra-prediction, dequantization, and inverse transform decoding blocks of a high efficiency video coding (HEVC) decoder. For this experiment, version 10 of the HEVC test model (HM) reference software was used, containing more than 200 functions and over 8000 lines of code. In addition, the suggested algorithm was implemented in a software/hardware (SW/HW) environment using a Xilinx ZC702-based platform. Finally, taking advantage of HLS optimization methods, the hardware design can process 6, 13, 71, and 285 video frames per second for 1600p, 1080p, 480p, and 240p video resolutions, respectively. In contrast, the SW/HW designs can only decode 0.5, 1.5, 4, and 15.2 frames per second for the same video resolutions, i.e., with a gain of 3% in frame rate and 60% in power consumption compared to the SW implementation.
... The prediction block is added to the previously decoded block to create a reconstructed block. The reconstructed reference picture is created from a series of blocks [5][6]. ...
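A short sketch of the reconstruction step this excerpt describes: the decoded residual is added to the prediction sample by sample and clipped to the valid sample range. The 4x4 block size and 8-bit depth are assumptions for illustration.

```cpp
#include <algorithm>
#include <cstdint>

// Reconstruct a 4x4 block: predicted samples plus decoded residual,
// clipped to the 8-bit sample range, as in standard hybrid video decoding.
void reconstruct4x4(const uint8_t pred[4][4], const int16_t resid[4][4],
                    uint8_t recon[4][4]) {
    for (int y = 0; y < 4; ++y)
        for (int x = 0; x < 4; ++x)
            recon[y][x] = static_cast<uint8_t>(
                std::clamp(pred[y][x] + resid[y][x], 0, 255));
}
```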
Article
Full-text available
As the world around us has expanded and the popularity of the Internet has grown through sending, receiving, uploading, and downloading high-definition videos, it has become necessary to use a good technology to reduce the size of high-quality video. If videos are sent or received, they need a wide bandwidth to carry the amount of information in the video. Based on the above, H.264/AVC is a good technology that gives great results for encoding and decoding videos. This technology was developed jointly by the International Telecommunication Union Telecommunication Standardization Sector (ITU-T) and the International Organization for Standardization (ISO). Our work involves applying the encoding and decoding process of the standard using MATLAB (R2013a). The work focuses on inter-frame prediction using the IBBB frame pattern. The video subjected to encoding and decoding was the Xylophone sequence, with a frame size of 240x320 and a frame rate of 30 frames/sec.
... Previous HLS studies [8,32,33] have mapped H.264 algorithms to hardware. Cadence's C-to-Silicon [33] was used to design individual functional blocks based on a SystemC model, but the individual blocks were later integrated into a system-level decoder using RTL. ...
... The study presented in [32] also used a similar methodology, albeit with different block-level partitioning of the H.264 decoder. Bluespec SystemVerilog (BSV) was used to synthesize a complete H.264 decoder in [8]. Though BSV provides a comparatively higher level of abstraction than traditional RTL, the designer still focuses on hardware-specific details such as data buffering and internal module connection. ...
... A Bluespec implementation [8] for ASIC achieved 30 fps at 720p resolution, but required substantial manual redesign using the Bluespec programming model. Another HLS-based design that quotes performance for full H.264 decoding is a block-based design in SystemC [32], but it only achieves 33 fps at QCIF resolution on a Virtex 4 platform. ...
Conference Paper
Full-text available
High-level synthesis (HLS) is gaining wider acceptance for hardware design due to its higher productivity and better design space exploration features. In recent years, HLS techniques and design flows have also advanced significantly, and as a result, many new FPGA designs are developed with HLS. However, despite many studies using HLS, the size and complexity of such applications remain generally small, and it is not well understood how to design and optimize for HLS with large, complex reference code. Typical HLS benchmark applications contain somewhere between 100 and 1400 lines of code and about 20 sub-functions, but typical input applications may contain many times more code and functions. To study such complex applications, we present a case study using HLS for a full H.264 decoder: an application with over 6000 lines of code and over 100 functions. We share our experience on code conversion for synthesizability, various HLS optimizations, HLS limitations while dealing with complex input code, and general design insights. Through our optimization process, we achieve 34 frames/s at 640x480 resolution (480p). To enable future study and benefit the research community, we open-source our synthesizable H.264 implementation.
... If the I/O command requires hardware acceleration, the main controller sends data to the hardware accelerator so that the data are processed by the accelerator. As depicted in Figure 5, hardware acceleration modules are connected to the main controller through FIFOs in a latency-insensitive style [10], [11]. This approach allows various hardware accelerators to be easily inserted into or removed from the main controller. ...
Conference Paper
Full-text available
As the cell size of NAND flash memory shrinks, its physical characteristics such as performance and lifetime are significantly degraded. As effective solutions for overcoming such poor physical characteristics, more cross-layer system-level approaches (such as compression and deduplication techniques) are expected to be developed. These system-level techniques typically employ intelligent software algorithms supported by specialized hardware accelerators. Using hardware accelerators combined with sophisticated software algorithms greatly increases the design complexity of flash-based storage devices. However, existing storage design environments are not adequate for handling this increased design complexity in a timely and efficient manner. To address this new challenge, we propose a novel storage development environment, called FlashBench, that helps developers build high-complexity storage solutions quickly. FlashBench is designed to provide a generic framework for the rapid development and validation of storage software/hardware algorithms by supporting multi-level design environments, specifically optimized for seamless hardware/software cross-layer integration. Our case study demonstrates that FlashBench enables developers to implement high-complexity flash devices with specialized optimization functions in a shorter development time than with traditional design environments.
... Recently, a number of highly modular research prototypes [15], [6] have been developed using latency-insensitive design [5]. In latency-insensitive design, the goal is to maintain the functional correctness of the design in response to variations in data availability. ...
Conference Paper
Full-text available
Traditionally, hardware designs partitioned across multiple FPGAs have had low performance due to the inefficiency of maintaining cycle-by-cycle timing among discrete FPGAs. In this paper, we present a mechanism by which complex designs may be efficiently and automatically partitioned among multiple FPGAs using explicitly programmed latency-insensitive links. We describe the automatic synthesis of an area efficient, high performance network for routing these inter-FPGA links. By mapping a diverse set of large research prototypes onto a multiple FPGA platform, we demonstrate that our tool obtains significant gains in design feasibility, compilation time, and even wall-clock performance.
... Another name for H.264 is the MPEG-4 Advanced Video Coding (AVC) standard. Since the standard is the result of a collaborative effort of the VCEG and MPEG standards committees, it is also informally referred to as the Joint Video Team (JVT) standard [8]. Applications such as internet multimedia, wireless video, personal video recorders, video-on-demand, and videoconferencing have an inexhaustible demand for much higher compression to enable the best video quality possible [27]. ...
... Second, the data flow in the computation of MVmedian is irregular and requires a large amount of on-chip memory to store the required past MVs. As a result of a microarchitectural change, the deblocking filter implementation in [8] decreases dramatically in area, from 2.74 mm² to 0.69 mm². The optimized deblocking filter yields a 12% increase in throughput of the entire design, thereby reducing the design critical path by 35%. ...
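As background for the MVmedian discussion above, the sketch below shows the component-wise median of three neighbouring motion vectors, which is how H.264 forms its motion-vector predictor; the neighbours are assumed to be the usual left/top/top-right blocks, and keeping those past MVs available is what makes the data flow irregular.

```cpp
#include <algorithm>
#include <cstdint>

struct MotionVector { int16_t x; int16_t y; };

// Median of three values without sorting.
static int16_t median3(int16_t a, int16_t b, int16_t c) {
    return std::max(std::min(a, b), std::min(std::max(a, b), c));
}

// Component-wise median of the left (A), top (B), and top-right (C)
// neighbouring MVs: the usual H.264 motion-vector predictor.
MotionVector mvMedian(MotionVector a, MotionVector b, MotionVector c) {
    return { median3(a.x, b.x, c.x), median3(a.y, b.y, c.y) };
}
```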
Conference Paper
Full-text available
The progress of science and technology demands that multimedia applications be realized on embedded systems, as they involve the transfer of large amounts of data. Compared with standards such as MPEG-2 and MPEG-4 Visual, H.264 can deliver better image quality at the same compressed bit rate or at a lower bit rate. The increase in compression efficiency and flexibility comes at the expense of an increase in complexity, which must be overcome. Therefore, an efficient co-design methodology is required, where the encoder software application is highly optimized and structured in a very modular and efficient manner, so as to allow its most complex and time-consuming operations to be offloaded to dedicated hardware accelerators. This paper provides an overview of the features of H.264 and surveys the emerging studies related to new coding features of the standard. © 2012 ICST Institute for Computer Science, Social Informatics and Telecommunications Engineering.
... Recently, languages like Bluespec [4], which describe designs not as gates and wires but as a set of guarded atomic actions (or rules) on state elements, have been proposed. Over the last six years, it has been established not only that Bluespec programs can produce no-compromise hardware [1], but also that keeping programs at the rule level allows more flexibility in design and refinement [8], [9]. For instance, the addition of a pipeline stage can be implemented in a natural way by splitting the rule corresponding to the appropriate stage into multiple rules and introducing state to hold the intermediate results. ...
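To make the rule-splitting idea concrete, here is a behavioural C++ analogy (not Bluespec itself): a single guarded rule that computes a result in one atomic step is split into two smaller rules communicating through an added register, which corresponds to the pipeline-stage insertion the excerpt describes. All names and the toy computation are illustrative.

```cpp
#include <cstdint>
#include <optional>

// Before refinement: one "rule" that, when its guard holds, multiplies
// and accumulates in a single atomic step.
struct SingleStage {
    std::optional<int32_t> input;
    int64_t acc = 0;
    void rule_mac() {
        if (input) {                       // guard
            acc += int64_t(*input) * 3;    // body: multiply and accumulate
            input.reset();
        }
    }
};

// After refinement: the rule is split into two rules, and a new register
// (stage) holds the intermediate product, mimicking an inserted pipeline
// stage. The verification task is to show that interleavings of the two
// smaller rules introduce no behaviours the original rule could not produce.
struct TwoStage {
    std::optional<int32_t> input;
    std::optional<int64_t> stage;          // added intermediate state
    int64_t acc = 0;
    void rule_multiply() {
        if (input && !stage) {             // guard
            stage = int64_t(*input) * 3;
            input.reset();
        }
    }
    void rule_accumulate() {
        if (stage) {                       // guard
            acc += *stage;
            stage.reset();
        }
    }
};
```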
Article
Full-text available
Microarchitectural refinements are often required to meet performance, area, or timing constraints when designing complex digital systems. While refinements are often straightforward to implement, it is difficult to formally specify the conditions of correctness for those which change cycle-level timing. As a result, in the later stages of design only those changes are considered that do not affect timing and whose verification can be automated using tools for checking FSM equivalence. This excludes an essential class of microarchitectural changes, such as the insertion of a register in a long combinational path to meet timing. A design methodology based on guarded atomic actions, or rules, offers an opportunity to raise the notion of correctness to a more abstract level. In rule-based systems, many useful refinements can be expressed simply by breaking a single rule into smaller rules which execute the original operation in multiple steps. Since the smaller rule executions can be interleaved with other rules, the verification task is to determine that no new behaviors have been introduced. We formalize this notion of correctness and present a tool based on SMT solvers that can automatically prove that a refinement is correct, or provide concrete information as to why it is not correct. With this tool, a larger class of refinements at all stages of the design process can be verified easily. We demonstrate the use of our tool in proving the correctness of the refinement of a processor pipeline from four stages to five.
... To get indications of the applicability of using bit-level statistics, the model of an application was investigated. The H.264 decoder [18] was simulated at register transfer level to extract signal dumps of the global connections of functional blocks such as memories, the entropy decoder, the prediction unit, etc. Those trace dumps were analyzed to extract the bit-level signal statistics. ...
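A small sketch of the kind of bit-level analysis this excerpt describes: given a trace of sampled bus values, count per-bit toggles to estimate activity factors that feed the power model. The 32-bit bus width and trace format are assumptions for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-bit toggle counts for a 32-bit bus trace; the activity factor of
// bit i is toggles[i] divided by (trace.size() - 1).
std::array<uint64_t, 32> bitToggleCounts(const std::vector<uint32_t>& trace) {
    std::array<uint64_t, 32> toggles{};
    for (std::size_t t = 1; t < trace.size(); ++t) {
        uint32_t changed = trace[t] ^ trace[t - 1];   // bits that flipped
        for (int b = 0; b < 32; ++b)
            if ((changed >> b) & 1u) ++toggles[b];
    }
    return toggles;
}
```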
Conference Paper
As technology reaches nanoscale order, interconnection systems account for the largest part of power consumption in Systems-on-Chip. Hence, an early and sufficiently accurate power estimation technique is needed for making the right design decisions. In this paper we present a method for system-level power estimation of interconnection fabrics in Systems-on-Chip. Estimations with simple average assumptions regarding the data stream are compared against estimations considering bit level statistics in order to include low level effects like activity factors and crosstalk capacitances. By examining different data patterns and traces of a video decoding system as a realistic example, we found that the data dependent effects are not negligible influences on power consumption in the interconnection system of nanoscale chips. Due to the use of statistical data there is no degradation of simulation speed in our approach.
... For our memory hierarchy explorations, we use an existing H.264 implementation [7]. This codec originally targeted an ASIC implementation, but performance increases in FPGAs permit us to reuse the codec without significant modifications. ...
Article
Full-text available
Developers accelerating applications on FPGAs or other reconfigurable logic have nothing but raw memory devices in their standard toolkits. Each project typically includes tedious development of single-use memory management. Software developers expect a programming environment to include automatic memory management. Virtual memory provides the illusion of very large arrays, and processor caches reduce access latency without explicit programmer instructions. LEAP scratchpads for reconfigurable logic dynamically allocate and manage multiple, independent memory arrays in a large backing store. Scratchpad accesses are cached automatically in multiple levels, ranging from shared on-board, RAM-based, set-associative caches to private caches stored in FPGA RAM blocks. In the LEAP framework, scratchpads share the same interface as on-die RAM blocks and are plug-in replacements. Additional libraries support heap management within a storage set. Like software developers, accelerator authors using scratchpads may focus more on core algorithms and less on memory management. Two uses of FPGA scratchpads are analyzed: buffer management in an H.264 decoder and memory management within a processor microarchitecture timing model.
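To illustrate the "same interface as on-die RAM blocks" point, here is a hypothetical request/response memory interface in C++; it is not the actual LEAP API, only a sketch of the plug-in-replacement idea: clients are written against one small interface regardless of whether it is backed by on-chip RAM or a cached scratchpad in board memory.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical interface shared by an on-chip RAM model and a scratchpad
// model; client code depends only on this interface, so the backing store
// can be swapped without changing the client (the plug-in-replacement idea).
struct MemoryIfc {
    virtual void     write(uint32_t addr, uint64_t data) = 0;
    virtual uint64_t read(uint32_t addr)                 = 0;
    virtual ~MemoryIfc() = default;
};

// Simplified stand-in for a scratchpad: a map here, standing in for a
// large backing store fronted by multiple levels of caching.
struct ScratchpadModel : MemoryIfc {
    void     write(uint32_t addr, uint64_t data) override { store_[addr] = data; }
    uint64_t read(uint32_t addr)                 override { return store_[addr]; }
private:
    std::unordered_map<uint32_t, uint64_t> store_;
};
```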
... We further simplify the interfaces in BlueSSD by adopting a latency-insensitive design style, which has been used to facilitate modular refinement in several large systems [6], [7]. Our modules are not permitted to make timing assumptions about when their inputs will be ready. ...
Article
Full-text available
In this paper we describe BlueSSD, an open platform for exploring hardware and software for NAND flash-based SSD architectures. We introduce the overall architecture of BlueSSD from a hardware and software perspective and briefly explain our design methodology. Preliminary evaluation shows that BlueSSD delivers performance comparable to commercially available SSDs.