Figure 3 - uploaded by Andrew Wolfe
Functional block diagram of decoder  

Source publication
Conference Paper
Full-text available
This paper presents the architecture and design of a high-performance asynchronous Huffman decoder for compressed-code embedded processors. In such processors, embedded programs are stored in compressed form in instruction ROM, then decompressed on demand during instruction cache refill. The Huffman decoder is used as a code decompression engine...

Contexts in source publication

Context 1
... cache refill logic counts the 32-bit data words produced by the decoder, and it resets the decoder after 8 words have been produced, which correspond to an entire uncompressed cache line. Figure 3 indicates the flow of basic operations. The Input Buffer holds the current input data, which is supplied to the decoder. ...
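The counting behavior described above can be sketched in a few lines. This is an illustrative software model only, not the paper's hardware; the function name and interface are hypothetical.

```python
# Hypothetical model of the cache refill counter described in the text:
# the refill logic collects 32-bit data words from the decoder and resets
# the decoder once 8 words -- one uncompressed cache line -- have arrived.

WORDS_PER_LINE = 8  # 8 x 32 bits = one uncompressed cache line

def refill_cache_lines(decoder_words):
    """Group decoder output words into cache lines, resetting per line."""
    line, lines = [], []
    for word in decoder_words:
        line.append(word & 0xFFFFFFFF)  # 32-bit data word from the decoder
        if len(line) == WORDS_PER_LINE:
            lines.append(line)
            line = []  # conceptually, the decoder is reset at this point
    return lines
```

Feeding sixteen words through this model yields two complete cache lines, matching the one-reset-per-line behavior in the snippet.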
Context 2
... Align through Sum has 5 stages. The Match/Length Unit of Figure 3 is implemented as two stages: Match and Length. Match produces a 1-hot output, indicating the match class to which the current input symbol belongs. ...
Context 3
... Length ROM then translates this class into the actual token length. The Adder of Figure 3 is divided into a Carry stage followed by a Sum generation stage. Once the sum is computed, it is passed as select bits to the Align stage, indicating the new 0-7 bit shift, and is also passed to the Offset stage. ...
Context 4
... is used only for an initial load, at the start of decoding. While the functional diagram in Figure 3 suggests that shift occurs after add is complete, our actual architecture uses two optimizations. First, in some cases, the upper bit from Length ROM immediately indicates that a 1-byte shift is needed (Pre-Shift). ...
Context 5
... this thread, the current input symbol is decoded and written into the Output Buffer. The Code ROM of Figure 3 is now implemented by 3 stages: Decode, ROM, and Merge. These stages implement a compact lookup table, which uses a 2-way decoding process. ...
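The contexts above describe a match/length/shift decode loop: the current prefix of the input buffer is matched to a token class, the token length determines the new shift, and the decoded symbol goes to the output buffer. A toy software analogue of that loop, with a made-up codebook (not the paper's actual code set), might look like:

```python
# Toy model of the decode loop sketched in Contexts 1-5 (illustrative only;
# CODEBOOK is hypothetical). Match finds the token at the current position,
# the code length plays the role of the Length/Adder shift, and the symbol
# is written to the output buffer.

CODEBOOK = {            # prefix-free code -> symbol (hypothetical)
    "0":   "A",
    "10":  "B",
    "110": "C",
    "111": "D",
}

def decode(bits):
    out, pos = [], 0
    while pos < len(bits):
        for code, symbol in CODEBOOK.items():   # "Match": find token class
            if bits.startswith(code, pos):
                out.append(symbol)              # write to Output Buffer
                pos += len(code)                # "Length": advance the shift
                break
        else:
            raise ValueError("no matching token at position %d" % pos)
    return "".join(out)
```

Because the code is prefix-free, the linear scan over the codebook is unambiguous; the hardware replaces it with the 1-hot Match stage and ROM lookups.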

Similar publications

Article
Full-text available
Explicitly software-managed cache systems are postulated as a solution for power considerations in computing devices. The savings expected in a SoftCache lies in the removal of tag storage, associativity logic, comparators, and other hardware dedicated to memory hierarchies. The penalty lies in high cache-miss cost and additional instructions requi...
Conference Paper
Full-text available
High-performance microprocessor units require high-performance adders and other arithmetic units. Modern microprocessors are, however, 32-bit or 64-bit, as that is the minimum required for floating-point arithmetic per the IEEE 754 Standard. 8-bit and 16-bit arithmetic processors are normally found in micro-controller applications for embedded sy...
Conference Paper
Full-text available
Cairo University SPARC “CUSPARC” processor is an IP embedded processor core conforming to SPARC V8 ISA. CUSPARC is fully developed at Cairo University and is the first Egyptian processor. In this paper, the ASIC Implementation and Verification of the CUSPARC processor is described at 130nm technology node. CUSPARC scores a typical clock frequency o...

Citations

... Benes [1] presented the architecture and design of a high-performance asynchronous Huffman decoder for compressed-code embedded processors. ...
Article
Full-text available
This paper shows that by using a code generated from word frequencies instead of character frequencies it is possible to improve the compression ratio.
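The cited claim, that coding over word frequencies rather than character frequencies can improve compression, is easy to illustrate. The helper and sample text below are our own toy example, not from the cited paper; they compute Huffman code lengths with a standard heap-based construction and compare total payload bits.

```python
# Compare Huffman payload size for character-level vs word-level coding.
# Toy illustration of the cited claim; text and helpers are made up.
import heapq
from collections import Counter

def huffman_lengths(freqs):
    """Code length per symbol for a Huffman code built from `freqs`."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # (frequency, tiebreak counter, {symbol: depth}) triples
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

def encoded_bits(symbols):
    """Total payload bits when `symbols` is Huffman-coded."""
    freqs = Counter(symbols)
    lengths = huffman_lengths(freqs)
    return sum(freqs[s] * lengths[s] for s in freqs)

text = "the cat saw the dog and the dog saw the cat"
char_bits = encoded_bits(list(text))   # character-frequency code
word_bits = encoded_bits(text.split()) # word-frequency code
```

For this sample the word-level payload is far smaller than the character-level one; note this comparison ignores the code table itself, which is larger for a word alphabet.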
... The simulations performed show that the average compression ratio was 76% for the set of benchmarks studied, with a relatively small loss in performance compared with executing the applications in their uncompressed form. ...
... Mobile and embedded systems have become popular today. With the possibility of manufacturing processing elements, memories, and rechargeable batteries at lower cost and in smaller sizes, many applications formerly based on (dedicated) mechanical, electrical, and electronic systems have come to use microprocessors. [This work was financially supported by CAPES.] These include handhelds, cell phones, notebooks, and MP3 players, among others. ...
Conference Paper
Full-text available
The efficient use of mobile and embedded systems depends strongly on appropriate strategies for reducing energy consumption. These systems are also characterized by severe resource constraints, among them the amount of memory available to applications. This work presents a code compression scheme for embedded processors compatible with the ARM processor, which aims to address these two specific demands of mobile and embedded systems. Compression is applied directly to the object code, after the source code is compiled by the standard tools, and uses a compression algorithm based on opcode frequency: the Huffman code. Executing the compressed code requires instruction-expansion hardware attached to the processor core. The expansion hardware consists of storage and control structures designed to perform the expansion operations efficiently. Compression performance and the effects of expansion were measured through simulation on a modified version of SimpleScalar, using the MiBench evaluation suite. The simulations show that the average compression ratio was 76% for the set of benchmarks studied, with a relatively small loss in performance compared with executing the applications in their uncompressed form.
... However, asynchronous designs can have high performance, even out-performing synchronous designs [33], [30], [8], [31]. Synchronous designs are limited to the worst-case timing of all clocked paths. ...
... 8 shows the LP design with added datapath. The second output of the mutex (ME) is chosen for the multiplexer select input (mux sel). ...
... to enable d to fire at time t = 2. To make the second firing of d as late as possible, we use maximum delay values on all other edges to make d fire at time t = 9. This results in a maximum separation of ∆dd = 7. Figure 8 shows how the simulation can be thought of as two separate runs of a fixed-delay system, each with a different delay assignment. Suppose now we want to solve the opposite problem of finding the minimum separation between two consecutive firings of d. Using similar reasoning, we would want the first firing of d to happen as late as possible with respect to b, and the second firing to happen as early as possible after. The si ...
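The separation argument in this snippet rests on bounding firing times under extreme delay assignments: pick minimum delays to make an event fire as early as possible, maximum delays to make it fire as late as possible. The sketch below illustrates that idea on a made-up acyclic event graph with interval delays; the graph, numbers, and function names are ours, not the paper's example.

```python
# Illustrative sketch of bounding event firing times in a system with
# bounded delay intervals (graph and delays below are hypothetical).
# An event fires when all its predecessors have fired plus the chosen
# edge delay; picking min vs max on every interval gives the two
# "runs of a fixed-delay system" described in the text.

def firing_times(preds, delays, pick):
    """Firing time of every event under a fixed min/max delay choice.

    preds:  event -> list of predecessor events
    delays: (src, dst) -> (min_delay, max_delay) interval
    pick:   min or max, selecting which bound of each interval to use
    """
    times = {}
    def t(e):
        if e not in times:
            times[e] = max((t(p) + pick(delays[(p, e)]) for p in preds[e]),
                           default=0)  # source events fire at time 0
        return times[e]
    for e in preds:
        t(e)
    return times

preds = {"a": [], "b": ["a"], "d": ["a", "b"]}
delays = {("a", "b"): (1, 3), ("b", "d"): (1, 4), ("a", "d"): (2, 9)}
early = firing_times(preds, delays, min)  # d as early as possible
late = firing_times(preds, delays, max)   # d as late as possible
```

With these invented intervals, d can fire no earlier than t = 2 and no later than t = 9, so any separation involving d is bounded by their difference; the paper's algorithm derives exact bounds for the cyclic, repeated-firing case.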
... The experimental results are shown in Table IV. Four sets of test cases were used: (i) different variations of asynchronous FIFO designs; (ii) an asynchronous Huffman decoder design from [2]; (iii) designs from the published literature [1], [8]; (iv) variants of the examples shown in Section III of this paper. In many cases, since the examples are small and simple, the runtimes of the three tools do not show significant discrepancies, and the results match. ...
Conference Paper
The time separation of events (TSE) problem is that of finding the maximum and minimum separation between the times of occurrence of two events in a concurrent system. It has applications in the performance analysis, optimization and verification of concurrent digital systems. This paper introduces an efficient polynomial-time algorithm to give exact bounds on TSEs for choice-free concurrent systems, whose operational semantics obey the max-causality rule. A choice-free concurrent system is modeled as a strongly-connected marked graph, where delays on operations are modeled as bounded intervals with unspecified distributions. While previous approaches handle acyclic systems only, or else require graph unfolding until a steady-state behavior is reached, the proposed approach directly identifies and evaluates the asymptotic steady-state behavior of a cyclic system via a graph-theoretical approach. As a result, the method has significantly lower computational complexity than previously-proposed solutions. A prototype CAD tool has been developed to demonstrate the feasibility and efficacy of our method. A set of experiments have been performed on the tool as well as two existing tools, with noticeable improvement on runtime and accuracy for several examples.
... Fulcrum Microsystems, an asynchronous startup company, has developed a commercial largely-asynchronous crossbar switch for high-speed networking [5]. In addition, a number of substantial experimental chips have also been produced, both in industry and in academia, including several asynchronous microprocessors [10], [21]; a fast asynchronous Pentium instruction length decoder developed by Intel, exhibiting three times the throughput and only half the power consumption of a corresponding clocked implementation [27]; a fast asynchronous Huffman decoder for compressed-code embedded processors [2]; and a hybrid synchronous-asynchronous finite-impulse response (FIR) filter for disk-drive read channels developed in collaboration with IBM Research [33] that exhibited 50% lower latency and 15% higher throughput than a comparable clocked implementation. The focus of this paper is on a key enabling technology to make high-speed asynchronous systems practical: the design of high-throughput asynchronous pipelines. ...
Article
A new class of asynchronous pipelines is proposed, called lookahead pipelines (LP), which use dynamic logic and are capable of delivering multi-gigahertz throughputs. Since they are asynchronous, these pipelines avoid problems related to high-speed clock distribution, such as clock power, management of clock skew, and inflexibility in handling varied environments. The designs are based on the well-known PS0 style of Williams and Horowitz as a starting point, but achieve significant improvements through novel protocol optimizations: the pipeline communication is structured so that critical events can be detected and exploited earlier. A special focus of this work is to target extremely fine-grain or gate-level pipelines, where the datapath is sectioned into stages, each consisting of logic that is only a single level deep. Both dual-rail and single-rail pipeline implementations are proposed. All the implementations are characterized by low-cost control structures and the avoidance of explicit latches. Post-layout SPICE simulations, in 0.18-μm technology, indicate that the best dual-rail LP design has more than twice the throughput (1.04 giga data items/s) of Williams' PS0 design, while the best single-rail LP design achieves even higher throughput (1.55 giga data items/s).
... Fulcrum Microsystems, an asynchronous startup company, has developed a commercial largely-asynchronous crossbar switch for high-speed networking [5]. In addition, a number of substantial experimental chips have also been produced, both in industry and in academia, including several asynchronous microprocessors [10], [21]; a fast asynchronous Pentium instruction length decoder developed by Intel, exhibiting three times the throughput and only half the power consumption of a corresponding clocked implementation [27]; a fast asynchronous Huffman decoder for compressed-code embedded processors [2]; and a hybrid synchronous-asynchronous finite-impulse response (FIR) filter for disk-drive read channels developed in collaboration with IBM Research [33] that exhibited 50% lower latency and 15% higher throughput than a comparable clocked implementation. The focus of this paper is on a key enabling technology to make high-speed asynchronous systems practical: the design of high-throughput asynchronous pipelines. ...
Article
This paper introduces a high-throughput asynchronous pipeline style, called high-capacity (HC) pipelines, targeted to datapaths that use dynamic logic. This approach includes a novel highly-concurrent handshake protocol, with fewer synchronization points between neighboring pipeline stages than almost all existing asynchronous dynamic pipelining approaches. Furthermore, the dynamic pipelines provide 100% buffering capacity, without explicit latches, by means of separate pullup and pulldown control for each pipeline stage: neighboring stages can store distinct data items, unlike almost all existing latchless dynamic asynchronous pipelines. As a result, very high throughput is obtained. Fabricated first-in first-out (FIFO) designs, in 0.18-μm technology, were fully functional over a wide range of supply voltages (1.2 to over 2.5 V), exhibiting a corresponding range of throughputs from 1.0-2.4 giga items/s. In addition, an experimental finite-impulse response (FIR) filter chip was designed and fabricated with IBM Research, whose speed-critical core used an HC pipeline. The HC pipeline exhibited throughputs up to 1.8 giga items/s, and the overall filter achieved 1.32 giga items/s, thus obtaining 15% higher throughput and 50% lower latency than the fastest previously-reported synchronous FIR filter, also designed at IBM Research.
... Thus, the min-abstract-latch pipeline optimization problem is to find a minimum cardinality abstract latch assignment that yields a cycle time that is well-defined and less than or equal to a given constraint δ. Example: To make this model more concrete, consider the pipeline optimization model for a Huffman decoder [4] depicted in Figure 3 using the PS0 pipeline scheme. The model decomposes the Huffman circuit into 11 units separated by 9 slots and includes the estimated delays for each unit. ...
... Each time a new abstract latch is added to a partial solution, we compute the subset of associated VSSs that are properly decomposed. We do not search the subtree rooted at a slot-assigned child when 1) the number of abstract latches assigned up to that child node plus the derived lower bound for that subtree is larger than or equal to the current best solution, or 2) the child node represents a solution better than the current best, in which case the current best solution is updated, or 3) the cycle metric associated with any loop dependency exceeds δ. We do not search the subtree rooted at a slot-excluded child when we determine there exists no feasible solution for a VSS associated with the slot. ...
Article
Full-text available
This paper addresses the problem of identifying the minimum pipelining needed in an asynchronous circuit (e.g., number/size of pipeline stages/latches required) to satisfy a given performance constraint, thereby implicitly minimizing area and power for a given performance. The paper first shows that the basic pipeline optimization problem for asynchronous circuits is NP-complete. Then, it presents an efficient branch and bound algorithm that finds the optimal pipeline configuration. The experimental results on a few scalable system models demonstrate that this algorithm is computationally feasible for moderately sized models.
... One potential advantage of asynchronous systems over their clocked counterparts, in certain applications, is better average-case performance [1,19,25]. However, the performance analysis of asynchronous systems remains a challenge. ...
... In our approach, an asynchronous system is modeled as a marked graph, a subclass of Petri nets that captures concurrency and data-dependent relationships between interacting components in decision-free systems [7]. The variations in input arrival time and component delays are captured in a probabilistic delay model. The probability distribution of input arrival at a component can provide indication of system bottlenecks. ...
... The first set of test cases are different variants of Sutherland's micropipeline design [20] shown in Figure 1. The second test case is an asynchronous Huffman decoder design proposed in [1]. [Table 1: Experimental Results — Huffman decoder [1]: 160, 0.036; DiffEq [25]: 175, 0.039; DCT [21]: 169, 0.031.] The third test case is a low-control-overhead asynchronous differential equation solver proposed in [25], and the fourth is an asynchronous DCT matrix-vector multiplier proposed in [21]. ...
Conference Paper
Full-text available
This paper presents an efficient method for the performance analysis and optimization of asynchronous systems. An asynchronous system is modeled as a marked graph with probabilistic delay distributions. We show that these systems exhibit inherent periodic behaviors. Based on this property, we derive an algorithm to construct the state space of the system through composition and capture the time evolution of the states into a periodic Markov chain. The system is solved for important performance metrics such as the distribution of input arrival time at a component, which is useful for subsequent system optimization, as well as relative component utilization, system latency and throughput. We also present a tool to demonstrate the feasibility of this method. Initial experimental results are promising, showing over three orders of magnitude improvement in runtime and nearly two orders of magnitude decrease in the size of the state space over previously published results. While the focus of this paper is on asynchronous digital systems, our technique can be applied to other concurrent systems that exhibit global asynchronous behavior, such as GALS and embedded systems.
... Several Huffman decoders have been proposed in recent years which are independent of the circuit and data set [Chia-Hsing and Chein-Wei 1998; Benes et al. 1998; Rudberg and Wanhammar 1997]. The drawback, however, is that these decoders are expensive. ...
Article
Full-text available
This paper mixes two encoding techniques to reduce test data volume, test pattern delivery time and power dissipation in scan test applications. This is achieved by using Run-Length encoding followed by Huffman encoding. This combination is especially effective when the percentage of don't cares in a test set is high which is a common case in today's large SoCs. Our analysis and experimental results confirm that achieving up to 89% compression ratio and 93% scan-in power reduction is possible for scan testable circuits such as ISCAS89 benchmarks.
... It assigns shorter codes to the more frequent symbols and longer codes to the rarer ones, which reduces the overall size of the original information. Previous studies on the implementation of Huffman encoding focused on offline coding [4], fast decoding [5]-[11], adaptive Huffman coding [12]-[14], or the size of the storage memory from a VLSI perspective [13], [14]. ...
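The snippet above describes static Huffman coding: frequent symbols get the shortest codes. A common way to represent such a table compactly, relevant when the code must be stored in hardware, is the canonical form, in which codes of each length are consecutive binary integers. The sketch below is a generic construction, not the cited paper's design.

```python
# Canonical Huffman code assignment (generic sketch, not from the paper):
# given a symbol -> code-length map, assign codes so that codes of each
# length are consecutive integers; only the lengths need to be stored.

def canonical_codes(lengths):
    """Assign canonical Huffman codes from a symbol -> code-length map."""
    code, prev_len, codes = 0, 0, {}
    # process symbols by (length, symbol) so shorter codes come first
    for symbol, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)  # extend the code when length grows
        codes[symbol] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes
```

For lengths {A: 1, B: 2, C: 3, D: 3} this yields the prefix-free table 0, 10, 110, 111, so a decoder can reconstruct every code from the length list alone.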
Article
This work discusses the implementation of static Huffman encoding hardware for real-time lossless compression for the electromagnetic calorimeter in the CMS experiment. The construction of the Huffman encoding hardware illustrates how the logic size is optimized. The number of logic gates required for the parallel shift operation in the hardware was examined. Experiments in a simulated environment and on an FPGA show that the real-time constraint has been fulfilled and that the design of the buffer length is appropriate.