Fig 3 - uploaded by Bertil Schmidt
2. Bit-level pipelined execution of an instruction 

Citations

... Modern VLSI technology allows massively parallel systems to be integrated on a single chip. The area limitations for the processors of such systems require small but efficient computational units [11,14,20]. In particular, this may motivate the choice of a bit-serial data organization for the individual processors. ...
... 3 Motivated by the design of an Instruction Systolic Array (ISA) with 1024 processors on one chip [14], an FPU has been designed that meets these requirements. Its bit-serial structure allows for a fine-grained pipelined implementation on a minimal area of silicon. ...
... The processor architecture is part of a chip design containing an ISA of size 32 × 32 [14]. The main components of the processor architecture are a set of 64 data registers, the communication register, a unit for integer addition, logical operations and conditional instructions, a 16-bit multiplier, and a shifter/adder for floating-point arithmetic (see Figure 2). ...
Article
Full-text available
This paper presents the design of a new bit-serial floating-point unit (FPU). It has been developed for the processors of the Instruction Systolic Array parallel computer model. In contrast to conventional bit-parallel FPUs, the bit-serial approach requires a different data format. Our FPU uses an IEEE compliant internal floating-point format that allows a fast least significant bit (LSB)-first arithmetic and can be efficiently implemented in hardware. Key Words: bit-serial Floating Point Units, massively parallel processors, systolic
... Thus, the integration of efficient floating-point arithmetic is essential to these processors. Motivated by the design of an Instruction Systolic Array (ISA) with 1024 processors on one chip [6], an FPU has been designed that meets these requirements. Its bit-serial structure allows for a fine-grained pipelined implementation on a minimal area of silicon. ...
... The processor architecture is part of a chip design containing an ISA of size 32 × 32 [6]. The main components of the processor architecture are a set of 64 data registers, the communication register, a unit for integer addition, logical operations and conditional instructions, a 16-bit multiplier, and a shifter/adder for floating-point arithmetic (see Fig. 2). ...
Conference Paper
Full-text available
This paper presents the design of a new bit-serial floating-point unit (FPU). It has been developed for the processors of the Instruction Systolic Array parallel computer model. In contrast to conventional bit-parallel FPUs the bit-serial approach requires different data formats. Our FPU uses an IEEE compliant internal floating point format that allows a fast least significant bit (LSB)-first arithmetic and can be efficiently implemented in hardware.
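The LSB-first property emphasized in these abstracts is what makes bit-serial pipelining attractive: since carries propagate from the least significant bit upward, an addition can begin as soon as the first bits of the operands arrive, one bit per cycle. The following Python sketch illustrates the principle only; it is not the authors' hardware design, and the helper names (`bit_serial_add`, `to_lsb_bits`, `from_lsb_bits`) are illustrative.

```python
def bit_serial_add(a_bits, b_bits):
    """Add two numbers given as LSB-first bit streams, emitting the
    sum LSB-first, one bit per 'cycle' (a software model of a
    bit-serial full adder with a one-bit carry register)."""
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)                 # full-adder sum bit
        carry = (a & b) | (carry & (a ^ b))       # full-adder carry out
    out.append(carry)                             # final carry completes the sum
    return out

def to_lsb_bits(x, width):
    """Serialize a non-negative integer as an LSB-first bit list."""
    return [(x >> i) & 1 for i in range(width)]

def from_lsb_bits(bits):
    """Reassemble an integer from an LSB-first bit list."""
    return sum(b << i for i, b in enumerate(bits))

# 13 + 11 = 24, computed one bit position per cycle
result = from_lsb_bits(bit_serial_add(to_lsb_bits(13, 8), to_lsb_bits(11, 8)))
```

The one-bit carry register is the only state carried between cycles, which is why such a unit occupies minimal silicon area and pipelines naturally at the bit level.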
Article
Full-text available
In this paper, we present a modular and pipelined architecture for lifting-based multilevel 2-D DWT that uses neither a line-buffer nor a frame-buffer. The overall area-delay product is reduced in the proposed design by appropriate partitioning and scheduling of the computation of the individual decomposition levels. The processing for the different levels is performed by a cascaded pipeline structure to maximize the hardware utilization efficiency (HUE). Moreover, the proposed structure is scalable for high-throughput and area-constrained implementation. We have removed all the redundancies resulting from decimated wavelet filtering to maximize the HUE. The proposed design involves L pyramid algorithm (PA) units and one recursive pyramid algorithm (RPA) unit, where R = N/P, L = ⌈log₄ P⌉, and P is the input block size, M and N being, respectively, the height and width of the image. The entire multilevel DWT is computed by the proposed structure in MR cycles. The proposed structure has O(8R × 2^L) cycles of output latency, which is very small compared to the latency of the existing structures. Interestingly, the proposed structure does not require any line-buffer or frame-buffer, unlike the existing folded structures, which otherwise require a line-buffer of size O(N) and a frame-buffer of size O(M/2 × N/2) for multilevel 2-D computation. Instead of those buffers, the proposed structure involves only local registers and RAM of size O(N). The saving of the line-buffer and frame-buffer achieved by the proposed design is an important advantage, since the image size can often be as large as 512 × 512. From the simulation results we find that the proposed scalable structure offers a better slice-delay product (SDP) for higher-throughput implementation, since the on-chip memory of this structure remains almost unchanged with input block size.
On average, over different input-block sizes and image sizes, it has 17% less SDP than the best of the corresponding existing structures. It involves 1.92 times more transistors, but, on average over different input sizes, offers 12.2 times higher throughput and consumes 52% less power per output (PPO) than that structure.
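The lifting scheme underlying this architecture factors each wavelet filtering level into a predict step and an update step that operate in place, which is what allows a hardware design to avoid large buffers. A minimal sketch of one level of integer Haar lifting in Python (chosen here for simplicity; the paper targets general lifting-based 2-D DWT in hardware, and the function names are illustrative):

```python
def haar_lift_forward(x):
    """One level of the reversible integer Haar DWT via lifting.
    Predict: detail d[i] = odd sample minus even sample.
    Update:  approximation s[i] = even sample plus half the detail.
    Returns the approximation (s) and detail (d) subbands."""
    assert len(x) % 2 == 0
    half = len(x) // 2
    d = [x[2*i + 1] - x[2*i] for i in range(half)]    # predict step
    s = [x[2*i] + (d[i] >> 1) for i in range(half)]   # update step
    return s, d

def haar_lift_inverse(s, d):
    """Exact inverse: undo the update step, then the predict step."""
    x = []
    for si, di in zip(s, d):
        even = si - (di >> 1)
        x += [even, even + di]
    return x

# A multilevel (pyramid) decomposition simply recurses on the
# approximation band, as in the PA/RPA cascade described above:
sig = [5, 7, 3, 1, 2, 6, 4, 8]
s1, d1 = haar_lift_forward(sig)    # level 1
s2, d2 = haar_lift_forward(s1)     # level 2 operates on the s-band only
```

Because each lifting step reads and writes a small, fixed neighborhood, only local registers are needed per step, which mirrors the buffer-saving argument made in the abstract.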