Fig 3 - uploaded by Bertil Schmidt
2. Bit-level pipelined execution of an instruction 

Citations

... Modern VLSI technology allows massively parallel systems to be integrated on a single chip. The area limitations for the processors of such systems require small but efficient computational units [11,14,20]. In particular, this may motivate the choice of a bit-serial data organization for the individual processors. ...
... 3 Motivated by the design of an Instruction Systolic Array (ISA) with 1024 processors on one chip [14], an FPU has been designed that meets these requirements. Its bit-serial structure allows for a fine-grained pipelined implementation on a minimal area of silicon. ...
... The processor architecture is part of a chip design containing an ISA of size 32 × 32 [14]. The main components of the processor architecture are a set of 64 data registers, the communication register, a unit for integer addition, logical operations and conditional instructions, a 16-bit multiplier, and a shifter/adder for floating-point arithmetic (see Figure 2). ...
Article
Full-text available
This paper presents the design of a new bit-serial floating-point unit (FPU). It has been developed for the processors of the Instruction Systolic Array parallel computer model. In contrast to conventional bit-parallel FPUs, the bit-serial approach requires a different data format. Our FPU uses an IEEE compliant internal floating-point format that allows a fast least significant bit (LSB)-first arithmetic and can be efficiently implemented in hardware. Key Words: bit-serial Floating Point Units, massively parallel processors, systolic
... Thus, the integration of efficient floating-point arithmetic is essential to these processors. Motivated by the design of an Instruction Systolic Array (ISA) with 1024 processors on one chip [6], an FPU has been designed that meets these requirements. Its bit-serial structure allows for a fine-grained pipelined implementation on a minimal area of silicon. ...
... The processor architecture is part of a chip design containing an ISA of size 32 × 32 [6]. The main components of the processor architecture are a set of 64 data registers, the communication register, a unit for integer addition, logical operations and conditional instructions, a 16-bit multiplier, and a shifter/adder for floating-point arithmetic (see Fig. 2). ...
Conference Paper
Full-text available
This paper presents the design of a new bit-serial floating-point unit (FPU). It has been developed for the processors of the Instruction Systolic Array parallel computer model. In contrast to conventional bit-parallel FPUs the bit-serial approach requires different data formats. Our FPU uses an IEEE compliant internal floating point format that allows a fast least significant bit (LSB)-first arithmetic and can be efficiently implemented in hardware.
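The LSB-first property emphasized in these abstracts is what makes bit-serial pipelining attractive: since carries propagate from the least significant bit upward, an addition can begin as soon as the first bits of the operands arrive, one bit per cycle. The following Python sketch illustrates the principle only; it is not the authors' hardware design, and the helper names (`bit_serial_add`, `to_lsb_bits`, `from_lsb_bits`) are illustrative.

```python
def bit_serial_add(a_bits, b_bits):
    """Add two numbers given as LSB-first bit streams, emitting the
    sum LSB-first, one bit per 'cycle' (a software model of a
    bit-serial full adder with a one-bit carry register)."""
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)                 # full-adder sum bit
        carry = (a & b) | (carry & (a ^ b))       # full-adder carry out
    out.append(carry)                             # final carry completes the sum
    return out

def to_lsb_bits(x, width):
    """Serialize a non-negative integer as an LSB-first bit list."""
    return [(x >> i) & 1 for i in range(width)]

def from_lsb_bits(bits):
    """Reassemble an integer from an LSB-first bit list."""
    return sum(b << i for i, b in enumerate(bits))

# 13 + 11 = 24, computed one bit position per cycle
result = from_lsb_bits(bit_serial_add(to_lsb_bits(13, 8), to_lsb_bits(11, 8)))
```

The one-bit carry register is the only state carried between cycles, which is why such a unit occupies minimal silicon area and pipelines naturally at the bit level.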
Article
Full-text available
In this paper, we present a modular and pipelined architecture for lifting-based multilevel 2-D DWT that uses neither a line-buffer nor a frame-buffer. The overall area-delay product is reduced in the proposed design by appropriate partitioning and scheduling of the computation of the individual decomposition levels. The processing for the different levels is performed by a cascaded pipeline structure to maximize the hardware utilization efficiency (HUE). Moreover, the proposed structure is scalable for high-throughput and area-constrained implementation. We have removed all the redundancies resulting from decimated wavelet filtering to maximize the HUE. The proposed design involves L pyramid algorithm (PA) units and one recursive pyramid algorithm (RPA) unit, where R = N/P, L = ⌈log₄ P⌉, and P is the input block size, M and N being, respectively, the height and width of the image. The entire multilevel DWT is computed by the proposed structure in MR cycles. The proposed structure has O(8R × 2^L) cycles of output latency, which is very small compared to the latency of the existing structures. Interestingly, the proposed structure does not require any line-buffer or frame-buffer, unlike the existing folded structures, which otherwise require a line-buffer of size O(N) and a frame-buffer of size O(M/2 × N/2) for multilevel 2-D computation. Instead of those buffers, the proposed structure involves only local registers and RAM of size O(N). The saving of the line-buffer and frame-buffer achieved by the proposed design is an important advantage, since the image size can often be as large as 512 × 512. From the simulation results we find that the proposed scalable structure offers a better slice-delay product (SDP) for higher-throughput implementation, since the on-chip memory of this structure remains almost unchanged with input block size.
On average, over different input-block sizes and image sizes, it has 17% less SDP than the best of the corresponding existing structures. It involves 1.92 times more transistors, but, on average over different input sizes, offers 12.2 times higher throughput and consumes 52% less power per output (PPO) than that structure.
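The lifting scheme underlying this architecture factors each wavelet filtering level into a predict step and an update step that operate in place, which is what allows a hardware design to avoid large buffers. A minimal sketch of one level of integer Haar lifting in Python (chosen here for simplicity; the paper targets general lifting-based 2-D DWT in hardware, and the function names are illustrative):

```python
def haar_lift_forward(x):
    """One level of the reversible integer Haar DWT via lifting.
    Predict: detail d[i] = odd sample minus even sample.
    Update:  approximation s[i] = even sample plus half the detail.
    Returns the approximation (s) and detail (d) subbands."""
    assert len(x) % 2 == 0
    half = len(x) // 2
    d = [x[2*i + 1] - x[2*i] for i in range(half)]    # predict step
    s = [x[2*i] + (d[i] >> 1) for i in range(half)]   # update step
    return s, d

def haar_lift_inverse(s, d):
    """Exact inverse: undo the update step, then the predict step."""
    x = []
    for si, di in zip(s, d):
        even = si - (di >> 1)
        x += [even, even + di]
    return x

# A multilevel (pyramid) decomposition simply recurses on the
# approximation band, as in the PA/RPA cascade described above:
sig = [5, 7, 3, 1, 2, 6, 4, 8]
s1, d1 = haar_lift_forward(sig)    # level 1
s2, d2 = haar_lift_forward(s1)     # level 2 operates on the s-band only
```

Because each lifting step reads and writes a small, fixed neighborhood, only local registers are needed per step, which mirrors the buffer-saving argument made in the abstract.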