Figure 3 - uploaded by Andrew Wolfe
Functional block diagram of decoder  

Source publication
Conference Paper
Full-text available
This paper presents the architecture and design of a high-performance asynchronous Huffman decoder for compressed-code embedded processors. In such processors, embedded programs are stored in compressed form in instruction ROM, then decompressed on demand during instruction cache refill. The Huffman decoder is used as a code decompression engine...

Contexts in source publication

Context 1
... cache refill logic counts the 32-bit data words produced by the decoder, and it resets the decoder after 8 words have been produced, which correspond to an entire uncompressed cache line. Figure 3 indicates the flow of basic operations. The Input Buffer holds the current input data, which is supplied to the decoder. ...
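The counting behavior described above can be sketched in a few lines. This is an illustrative software model only, not the paper's hardware; the function name and interface are hypothetical.

```python
# Hypothetical model of the cache refill counter described in the text:
# the refill logic collects 32-bit data words from the decoder and resets
# the decoder once 8 words -- one uncompressed cache line -- have arrived.

WORDS_PER_LINE = 8  # 8 x 32 bits = one uncompressed cache line

def refill_cache_lines(decoder_words):
    """Group decoder output words into cache lines, resetting per line."""
    line, lines = [], []
    for word in decoder_words:
        line.append(word & 0xFFFFFFFF)  # 32-bit data word from the decoder
        if len(line) == WORDS_PER_LINE:
            lines.append(line)
            line = []  # conceptually, the decoder is reset at this point
    return lines
```

Feeding sixteen words through this model yields two complete cache lines, matching the one-reset-per-line behavior in the snippet.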
Context 2
... Align through Sum has 5 stages. The Match/Length Unit of Figure 3 is implemented as two stages: Match and Length. Match produces a 1-hot output, indicating the match class to which the current input symbol belongs. ...
Context 3
... Length ROM then translates this class into the actual token length. The Adder of Figure 3 is divided into a Carry stage followed by a Sum generation stage. Once the sum is computed, it is passed as select bits to the Align stage, indicating the new 0-7 bit shift, and is also passed to the Offset stage. ...
Context 4
... is used only for an initial load, at the start of decoding. While the functional diagram in Figure 3 suggests that shift occurs after add is complete, our actual architecture uses two optimizations. First, in some cases, the upper bit from Length ROM immediately indicates that a 1-byte shift is needed (Pre-Shift). ...
Context 5
... this thread, the current input symbol is decoded and written into the Output Buffer. The Code ROM of Figure 3 is now implemented by 3 stages: Decode, ROM, and Merge. These stages implement a compact lookup table, which uses a 2-way decoding process. ...
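The contexts above describe a match/length/shift decode loop: the current prefix of the input buffer is matched to a token class, the token length determines the new shift, and the decoded symbol goes to the output buffer. A toy software analogue of that loop, with a made-up codebook (not the paper's actual code set), might look like:

```python
# Toy model of the decode loop sketched in Contexts 1-5 (illustrative only;
# CODEBOOK is hypothetical). Match finds the token at the current position,
# the code length plays the role of the Length/Adder shift, and the symbol
# is written to the output buffer.

CODEBOOK = {            # prefix-free code -> symbol (hypothetical)
    "0":   "A",
    "10":  "B",
    "110": "C",
    "111": "D",
}

def decode(bits):
    out, pos = [], 0
    while pos < len(bits):
        for code, symbol in CODEBOOK.items():   # "Match": find token class
            if bits.startswith(code, pos):
                out.append(symbol)              # write to Output Buffer
                pos += len(code)                # "Length": advance the shift
                break
        else:
            raise ValueError("no matching token at position %d" % pos)
    return "".join(out)
```

Because the code is prefix-free, the linear scan over the codebook is unambiguous; the hardware replaces it with the 1-hot Match stage and ROM lookups.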

Similar publications

Article
Full-text available
Explicitly software-managed cache systems are postulated as a solution for power considerations in computing devices. The savings expected in a SoftCache lies in the removal of tag storage, associativity logic, comparators, and other hardware dedicated to memory hierarchies. The penalty lies in high cache-miss cost and additional instructions requi...
Conference Paper
Full-text available
High-performance microprocessor units require high-performance adders and other arithmetic units. Modern microprocessors are, however, 32-bit or 64-bit, as that is the minimum required for floating-point arithmetic per the IEEE 754 Standard. 8-bit and 16-bit arithmetic processors are normally found in micro-controller applications for embedded sy...
Conference Paper
Full-text available
Cairo University SPARC “CUSPARC” processor is an IP embedded processor core conforming to SPARC V8 ISA. CUSPARC is fully developed at Cairo University and is the first Egyptian processor. In this paper, the ASIC Implementation and Verification of the CUSPARC processor is described at 130nm technology node. CUSPARC scores a typical clock frequency o...

Citations

... Benes [1] presented the architecture and design of a high-performance asynchronous Huffman decoder for compressed-code embedded processors. ...
Article
Full-text available
This paper shows that by using a code generated from word frequencies instead of character frequencies it is possible to improve the compression ratio.
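The cited claim, that coding over word frequencies rather than character frequencies can improve compression, is easy to illustrate. The helper and sample text below are our own toy example, not from the cited paper; they compute Huffman code lengths with a standard heap-based construction and compare total payload bits.

```python
# Compare Huffman payload size for character-level vs word-level coding.
# Toy illustration of the cited claim; text and helpers are made up.
import heapq
from collections import Counter

def huffman_lengths(freqs):
    """Code length per symbol for a Huffman code built from `freqs`."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # (frequency, tiebreak counter, {symbol: depth}) triples
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

def encoded_bits(symbols):
    """Total payload bits when `symbols` is Huffman-coded."""
    freqs = Counter(symbols)
    lengths = huffman_lengths(freqs)
    return sum(freqs[s] * lengths[s] for s in freqs)

text = "the cat saw the dog and the dog saw the cat"
char_bits = encoded_bits(list(text))   # character-frequency code
word_bits = encoded_bits(text.split()) # word-frequency code
```

For this sample the word-level payload is far smaller than the character-level one; note this comparison ignores the code table itself, which is larger for a word alphabet.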
... The simulations performed show that the average compression ratio was 76% for the set of benchmarks studied, with a relatively small loss in performance compared with executing the applications in their uncompressed form. ...
... Mobile and embedded systems have become popular today. With the possibility of manufacturing processing elements, memories, and rechargeable batteries at lower cost and in smaller sizes, many applications formerly based on (dedicated) mechanical, electrical, and electronic systems have come to use microprocessors. [This work was financially supported by CAPES.] These include handhelds, cell phones, notebooks, and MP3 players, among others. ...
Conference Paper
Full-text available
The efficient use of mobile and embedded systems depends strongly on appropriate strategies for reducing energy consumption. These systems are also characterized by severe resource constraints, among them the amount of memory available to applications. This work presents a code compression scheme for embedded processors compatible with the ARM processor, which aims to address these two specific demands of mobile and embedded systems. Compression is applied directly to the object code, after the source code is compiled by the standard tools, and uses a compression algorithm based on opcode frequency: the Huffman code. Executing the compressed code requires instruction-expansion hardware attached to the processor core. The expansion hardware consists of storage and control structures designed to perform the expansion operations efficiently. Compression performance and the effects of expansion were measured through simulation on a modified version of SimpleScalar, using the MiBench evaluation suite. The simulations show that the average compression ratio was 76% for the set of benchmarks studied, with a relatively small loss in performance compared with executing the applications in their uncompressed form.
... However, asynchronous designs can have high performance, even out-performing synchronous designs [33], [30], [8], [31]. Synchronous designs are limited to the worst-case timing of all clocked paths. ...
... 8 shows the LP design with added datapath. The second output of the mutex (ME) is chosen for the multiplexer select input (mux sel). ...
... to enable d to fire at time t = 2. To make the second firing of d as late as possible, we use maximum delay values on all other edges to make d fire at time t = 9. This results in a maximum separation of ∆dd = 7. Figure 8 shows how the simulation can be thought of as two separate runs of a fixed-delay system, each with a different delay assignment. Suppose now we want to solve the opposite problem of finding the minimum separation between two consecutive firings of d. Using similar reasoning, we would want the first firing of d to happen as late as possible with respect to b, and the second firing to happen as early as possible after. The si ...
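The separation argument in this snippet rests on bounding firing times under extreme delay assignments: pick minimum delays to make an event fire as early as possible, maximum delays to make it fire as late as possible. The sketch below illustrates that idea on a made-up acyclic event graph with interval delays; the graph, numbers, and function names are ours, not the paper's example.

```python
# Illustrative sketch of bounding event firing times in a system with
# bounded delay intervals (graph and delays below are hypothetical).
# An event fires when all its predecessors have fired plus the chosen
# edge delay; picking min vs max on every interval gives the two
# "runs of a fixed-delay system" described in the text.

def firing_times(preds, delays, pick):
    """Firing time of every event under a fixed min/max delay choice.

    preds:  event -> list of predecessor events
    delays: (src, dst) -> (min_delay, max_delay) interval
    pick:   min or max, selecting which bound of each interval to use
    """
    times = {}
    def t(e):
        if e not in times:
            times[e] = max((t(p) + pick(delays[(p, e)]) for p in preds[e]),
                           default=0)  # source events fire at time 0
        return times[e]
    for e in preds:
        t(e)
    return times

preds = {"a": [], "b": ["a"], "d": ["a", "b"]}
delays = {("a", "b"): (1, 3), ("b", "d"): (1, 4), ("a", "d"): (2, 9)}
early = firing_times(preds, delays, min)  # d as early as possible
late = firing_times(preds, delays, max)   # d as late as possible
```

With these invented intervals, d can fire no earlier than t = 2 and no later than t = 9, so any separation involving d is bounded by their difference; the paper's algorithm derives exact bounds for the cyclic, repeated-firing case.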
... The experimental results are shown in Table IV. Four sets of test cases were used: (i) different variations of asynchronous FIFO designs; (ii) an asynchronous Huffman decoder design from [2]; (iii) designs from the published literature [1], [8]; (iv) variants of the examples shown in Section III of this paper. In many cases, since the examples are small and simple, the runtimes of the three tools do not show significant discrepancies, and the results match. ...
Conference Paper
The time separation of events (TSE) problem is that of finding the maximum and minimum separation between the times of occurrence of two events in a concurrent system. It has applications in the performance analysis, optimization and verification of concurrent digital systems. This paper introduces an efficient polynomial-time algorithm to give exact bounds on TSEs for choice-free concurrent systems, whose operational semantics obey the max-causality rule. A choice-free concurrent system is modeled as a strongly-connected marked graph, where delays on operations are modeled as bounded intervals with unspecified distributions. While previous approaches handle acyclic systems only, or else require graph unfolding until a steady-state behavior is reached, the proposed approach directly identifies and evaluates the asymptotic steady-state behavior of a cyclic system via a graph-theoretical approach. As a result, the method has significantly lower computational complexity than previously-proposed solutions. A prototype CAD tool has been developed to demonstrate the feasibility and efficacy of our method. A set of experiments have been performed on the tool as well as two existing tools, with noticeable improvement on runtime and accuracy for several examples.
... Fulcrum Microsystems, an asynchronous startup company, has developed a commercial largely-asynchronous crossbar switch for high-speed networking [5]. In addition, a number of substantial experimental chips have also been produced, both in industry and in academia, including several asynchronous microprocessors [10], [21]; a fast asynchronous Pentium instruction length decoder developed by Intel, exhibiting three times the throughput and only half the power consumption of a corresponding clocked implementation [27]; a fast asynchronous Huffman decoder for compressed-code embedded processors [2]; and a hybrid synchronous-asynchronous finite-impulse response (FIR) filter for disk-drive read channels developed in collaboration with IBM Research [33] that exhibited 50% lower latency and 15% higher throughput than a comparable clocked implementation. The focus of this paper is on a key enabling technology to make high-speed asynchronous systems practical: the design of high-throughput asynchronous pipelines. ...
Article
A new class of asynchronous pipelines is proposed, called lookahead pipelines (LP), which use dynamic logic and are capable of delivering multi-gigahertz throughputs. Since they are asynchronous, these pipelines avoid problems related to high-speed clock distribution, such as clock power, management of clock skew, and inflexibility in handling varied environments. The designs are based on the well-known PS0 style of Williams and Horowitz as a starting point, but achieve significant improvements through novel protocol optimizations: the pipeline communication is structured so that critical events can be detected and exploited earlier. A special focus of this work is to target extremely fine-grain or gate-level pipelines, where the datapath is sectioned into stages, each consisting of logic that is only a single level deep. Both dual-rail and single-rail pipeline implementations are proposed. All the implementations are characterized by low-cost control structures and the avoidance of explicit latches. Post-layout SPICE simulations, in 0.18-μm technology, indicate that the best dual-rail LP design has more than twice the throughput (1.04 giga data items/s) of Williams' PS0 design, while the best single-rail LP design achieves even higher throughput (1.55 giga data items/s).
... Fulcrum Microsystems, an asynchronous startup company, has developed a commercial largely-asynchronous crossbar switch for high-speed networking [5]. In addition, a number of substantial experimental chips have also been produced, both in industry and in academia, including several asynchronous microprocessors [10], [21]; a fast asynchronous Pentium instruction length decoder developed by Intel, exhibiting three times the throughput and only half the power consumption of a corresponding clocked implementation [27]; a fast asynchronous Huffman decoder for compressed-code embedded processors [2]; and a hybrid synchronous-asynchronous finite-impulse response (FIR) filter for disk-drive read channels developed in collaboration with IBM Research [33] that exhibited 50% lower latency and 15% higher throughput than a comparable clocked implementation. The focus of this paper is on a key enabling technology to make high-speed asynchronous systems practical: the design of high-throughput asynchronous pipelines. ...
Article
This paper introduces a high-throughput asynchronous pipeline style, called high-capacity (HC) pipelines, targeted to datapaths that use dynamic logic. This approach includes a novel highly-concurrent handshake protocol, with fewer synchronization points between neighboring pipeline stages than almost all existing asynchronous dynamic pipelining approaches. Furthermore, the dynamic pipelines provide 100% buffering capacity, without explicit latches, by means of separate pullup and pulldown control for each pipeline stage: neighboring stages can store distinct data items, unlike almost all existing latchless dynamic asynchronous pipelines. As a result, very high throughput is obtained. Fabricated first-in first-out (FIFO) designs, in 0.18-μm technology, were fully functional over a wide range of supply voltages (1.2 to over 2.5 V), exhibiting a corresponding range of throughputs from 1.0-2.4 giga items/s. In addition, an experimental finite-impulse response (FIR) filter chip was designed and fabricated with IBM Research, whose speed-critical core used an HC pipeline. The HC pipeline exhibited throughputs up to 1.8 giga items/s, and the overall filter achieved 1.32 giga items/s, thus obtaining 15% higher throughput and 50% lower latency than the fastest previously-reported synchronous FIR filter, also designed at IBM Research.
... Thus, the min-abstract-latch pipeline optimization problem is to find a minimum cardinality abstract latch assignment that yields a cycle time that is well-defined and less than or equal to a given constraint δ. Example: To make this model more concrete, consider the pipeline optimization model for a Huffman decoder [4] depicted in Figure 3 using the PS0 pipeline scheme. The model decomposes the Huffman circuit into 11 units separated by 9 slots and includes the estimated delays for each unit. ...
... Each time a new abstract latch is added to a partial solution, we compute the subset of associated VSSs that are properly decomposed. We do not search the subtree rooted at a slot-assigned child when 1) the number of abstract latches assigned up to that child node plus the derived lower bound for that subtree is larger than or equal to the current best solution, or 2) the child node represents a solution better than the current best, in which case the current best solution is updated, or 3) the cycle metric associated with any loop dependency exceeds δ. We do not search the subtree rooted at a slot-excluded child when we determine there exists no feasible solution for a VSS associated with the slot. ...
Article
Full-text available
This paper addresses the problem of identifying the minimum pipelining needed in an asynchronous circuit (e.g., number/size of pipeline stages/latches required) to satisfy a given performance constraint, thereby implicitly minimizing area and power for a given performance. The paper first shows that the basic pipeline optimization problem for asynchronous circuits is NP-complete. Then, it presents an efficient branch and bound algorithm that finds the optimal pipeline configuration. The experimental results on a few scalable system models demonstrate that this algorithm is computationally feasible for moderately sized models.
... One potential advantage of asynchronous systems over their clocked counterparts, in certain applications, is better average-case performance [1,19,25]. However, the performance analysis of asynchronous systems remains a challenge. ...
... In our approach, an asynchronous system is modeled as a marked graph, a subclass of Petri nets that captures concurrency and data-dependent relationships between interacting components in decision-free systems [7]. The variations in input arrival time and component delays are captured in a probabilistic delay model. The probability distribution of input arrival at a component can provide indication of system bottlenecks. ...
... The first set of test cases are different variants of Sutherland's micropipeline design [20] shown in Figure 1. The second test case is an asynchronous Huffman decoder design proposed in [1]. [Table 1: Experimental Results — Huffman decoder [1]: 160, 0.036; DiffEq [25]: 175, 0.039; DCT [21]: 169, 0.031.] The third test case is a low-control-overhead asynchronous differential equation solver proposed in [25], and the fourth is an asynchronous DCT matrix-vector multiplier proposed in [21]. ...
Conference Paper
Full-text available
This paper presents an efficient method for the performance analysis and optimization of asynchronous systems. An asynchronous system is modeled as a marked graph with probabilistic delay distributions. We show that these systems exhibit inherent periodic behaviors. Based on this property, we derive an algorithm to construct the state space of the system through composition and capture the time evolution of the states into a periodic Markov chain. The system is solved for important performance metrics such as the distribution of input arrival time at a component, which is useful for subsequent system optimization, as well as relative component utilization, system latency and throughput. We also present a tool to demonstrate the feasibility of this method. Initial experimental results are promising, showing over three orders of magnitude improvement in runtime and nearly two orders of magnitude decrease in the size of the state space over previously published results. While the focus of this paper is on asynchronous digital systems, our technique can be applied to other concurrent systems that exhibit global asynchronous behavior, such as GALS and embedded systems.
... Several Huffman decoders have been proposed in recent years which are independent of the circuit and data set [Chia-Hsing and Chein-Wei 1998; Benes et al. 1998; Rudberg and Wanhammar 1997]. The drawback, however, is that these decoders are expensive. ...
Article
Full-text available
This paper mixes two encoding techniques to reduce test data volume, test pattern delivery time and power dissipation in scan test applications. This is achieved by using Run-Length encoding followed by Huffman encoding. This combination is especially effective when the percentage of don't cares in a test set is high which is a common case in today's large SoCs. Our analysis and experimental results confirm that achieving up to 89% compression ratio and 93% scan-in power reduction is possible for scan testable circuits such as ISCAS89 benchmarks.
... It assigns shorter codes to the more frequent symbols and longer codes to the rarer ones, which reduces the overall size of the original information. Previous studies on the implementation of Huffman encoding focused on offline coding [4], fast decoding [5]-[11], adaptive Huffman coding [12]-[14], or the size of the storage memory from a VLSI perspective [13], [14]. ...
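The snippet above describes static Huffman coding: frequent symbols get the shortest codes. A common way to represent such a table compactly, relevant when the code must be stored in hardware, is the canonical form, in which codes of each length are consecutive binary integers. The sketch below is a generic construction, not the cited paper's design.

```python
# Canonical Huffman code assignment (generic sketch, not from the paper):
# given a symbol -> code-length map, assign codes so that codes of each
# length are consecutive integers; only the lengths need to be stored.

def canonical_codes(lengths):
    """Assign canonical Huffman codes from a symbol -> code-length map."""
    code, prev_len, codes = 0, 0, {}
    # process symbols by (length, symbol) so shorter codes come first
    for symbol, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)  # extend the code when length grows
        codes[symbol] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes
```

For lengths {A: 1, B: 2, C: 3, D: 3} this yields the prefix-free table 0, 10, 110, 111, so a decoder can reconstruct every code from the length list alone.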
Article
This work discusses the implementation of static Huffman encoding hardware for real-time lossless compression for the electromagnetic calorimeter in the CMS experiment. The construction of the Huffman encoding hardware illustrates how the logic size is optimized. The number of logic gates required for the parallel shift operation in the hardware was examined. Experiments in a simulated environment and on an FPGA show that the real-time constraint has been fulfilled and that the design of the buffer length is appropriate.