Fig 3 - uploaded by Bertil Schmidt
Similar publications
In this paper we provide new examples of hyperbolic but nonsystolic groups by showing that the triangle groups (2, 4, 5) and (2, 5, 5) are not systolic. Along the way we prove some results about subsets of systolic complexes stable under involutions.
Citations
... Modern VLSI technology allows massively parallel systems to be integrated on a single chip. The area limitations for the processors of such systems require small but efficient computational units [11,14,20]. In particular, this may motivate the choice of a bit-serial data organization for the individual processors. ...
... Motivated by the design of an Instruction Systolic Array (ISA) with 1024 processors on one chip [14], an FPU has been designed that meets these requirements. Its bit-serial structure allows for a fine-grained pipelined implementation on a minimal area of silicon. ...
... The processor architecture is part of a chip design containing an ISA of size 32 × 32 [14]. The main components of the processor architecture are a set of 64 data registers, the communication register, a unit for integer addition, logical operations and conditional instructions, a 16-bit multiplier, and a shifter/adder for floating-point arithmetic (see Figure 2). ...
This paper presents the design of a new bit-serial floating-point unit (FPU). It has been developed for the processors of the Instruction Systolic Array parallel computer model. In contrast to conventional bit-parallel FPUs, the bit-serial approach requires a different data format. Our FPU uses an IEEE-compliant internal floating-point format that allows fast least significant bit (LSB)-first arithmetic and can be efficiently implemented in hardware. Key Words: bit-serial floating-point units, massively parallel processors, systolic
... Thus, the integration of efficient floating-point arithmetic is essential to these processors. Motivated by the design of an Instruction Systolic Array (ISA) with 1024 processors on one chip [6], an FPU has been designed that meets these requirements. Its bit-serial structure allows for a fine-grained pipelined implementation on a minimal area of silicon. ...
... The processor architecture is part of a chip design containing an ISA of size 32 × 32 [6]. The main components of the processor architecture are a set of 64 data registers, the communication register, a unit for integer addition, logical operations and conditional instructions, a 16-bit multiplier, and a shifter/adder for floating-point arithmetic (see Fig. 2). ...
This paper presents the design of a new bit-serial floating-point unit (FPU). It has been developed for the processors of the Instruction Systolic Array parallel computer model. In contrast to conventional bit-parallel FPUs, the bit-serial approach requires different data formats. Our FPU uses an IEEE-compliant internal floating-point format that allows fast least significant bit (LSB)-first arithmetic and can be efficiently implemented in hardware.
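The LSB-first property highlighted in both abstracts is what makes bit-serial arithmetic pipeline-friendly: for addition, carries propagate in the same direction the bits arrive, so each result bit can be emitted as soon as its operand bits are available, with only one flip-flop of carry state per adder. A minimal software sketch of this idea (illustrative only; the function names are ours, and the paper's actual FPU datapath and floating-point format are not reproduced here):

```python
def bit_serial_add(a_bits, b_bits):
    """Add two unsigned integers presented LSB-first as bit streams.

    Because addition generates carries from the LSB upward, an LSB-first
    serial adder needs only a single bit of carry state between cycles,
    mimicking a hardware full adder with one carry flip-flop.
    """
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):  # one pair of bits per "clock cycle"
        out.append(a ^ b ^ carry)                      # sum bit
        carry = (a & b) | (a & carry) | (b & carry)    # carry to next cycle
    out.append(carry)  # final carry-out bit
    return out

def to_lsb_first(x, width):
    """Serialize an unsigned integer as a list of bits, LSB first."""
    return [(x >> i) & 1 for i in range(width)]

def from_lsb_first(bits):
    """Reassemble an integer from an LSB-first bit list."""
    return sum(b << i for i, b in enumerate(bits))

# Stream 13 + 11 over 5-bit operands, one bit pair per cycle.
result = from_lsb_first(bit_serial_add(to_lsb_first(13, 5), to_lsb_first(11, 5)))
# result == 24
```

The same direction-of-propagation argument is why LSB-first streams suit serial multiplication and accumulation, whereas operations that need the most significant bits first (such as comparison) require different handling.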
In this paper, we present a modular and pipelined architecture for lifting-based multilevel 2-D DWT without using a line-buffer or frame-buffer. The overall area-delay product is reduced in the proposed design by appropriate partitioning and scheduling of the computation of the individual decomposition levels. The processing for different levels is performed by a cascaded pipeline structure to maximize the hardware utilization efficiency (HUE). Moreover, the proposed structure is scalable for high-throughput and area-constrained implementation. We have removed all the redundancies resulting from decimated wavelet filtering to maximize the HUE. The proposed design involves L pyramid algorithm (PA) units and one recursive pyramid algorithm (RPA) unit, where R = N/P, L = ⌈log₄ P⌉, P is the input block size, and M and N are, respectively, the height and width of the image. The entire multilevel DWT is computed by the proposed structure in MR cycles. The proposed structure has O(8R × 2^L) cycles of output latency, which is very small compared to the latency of the existing structures. Interestingly, the proposed structure does not require any line-buffer or frame-buffer, unlike the existing folded structures, which otherwise require a line-buffer of size O(N) and a frame-buffer of size O(M/2 × N/2) for multilevel 2-D computation. Instead of those buffers, the proposed structure involves only local registers and RAM of size O(N). The saving of line-buffer and frame-buffer achieved by the proposed design is an important advantage, since the image size could very often be as large as 512 × 512. From the simulation results we find that the proposed scalable structure offers a better slice-delay product (SDP) for higher-throughput implementation, since the on-chip memory of this structure remains almost unchanged with input block size.
It has, on average, 17% less SDP than the best of the corresponding existing structures for different input-block sizes and image sizes. It involves 1.92 times more transistors, but offers 12.2 times higher throughput and consumes 52% less power per output (PPO) compared to that structure, on average over different input sizes.
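For reference, the lifting scheme that this architecture builds on replaces convolution-based wavelet filtering with a pair of cheap in-place update steps (predict and update), which is what makes it amenable to the modular, buffer-free pipelining described above. A minimal single-level 1-D sketch using the integer 5/3 (LeGall) lifting steps; the choice of the 5/3 filter and the function names are our illustrative assumptions, not taken from the paper, whose structure cascades such stages row- and column-wise across decomposition levels:

```python
def dwt53_forward(x):
    """One level of the 5/3 lifting DWT on an even-length integer signal.

    Predict step: each odd sample is replaced by a highpass (detail)
    value; update step: each even sample becomes a lowpass value.
    Symmetric extension at the boundaries is emulated by index clamping.
    """
    assert len(x) % 2 == 0 and len(x) >= 2
    even, odd = x[0::2], x[1::2]
    n = len(odd)
    # Predict: d[i] = odd[i] - floor((even[i] + even[i+1]) / 2)
    d = [odd[i] - ((even[i] + even[min(i + 1, n - 1)]) >> 1) for i in range(n)]
    # Update: s[i] = even[i] + floor((d[i-1] + d[i] + 2) / 4)
    s = [even[i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(n)]
    return s, d

def dwt53_inverse(s, d):
    """Undo dwt53_forward by running the lifting steps in reverse order."""
    n = len(d)
    even = [s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(n)]
    odd = [d[i] + ((even[i] + even[min(i + 1, n - 1)]) >> 1) for i in range(n)]
    return [v for pair in zip(even, odd) for v in pair]  # interleave
```

Because each lifting step is individually invertible in integer arithmetic, the transform is exactly reversible, and each step reads only a short window of neighboring samples, which is the locality the paper exploits to replace line-buffers and frame-buffers with small local registers.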