2. Two 256x256 8-bit gray-scale images used as input for the correspondence matcher.

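The matcher itself is not described on this page; purely as an illustration of the kind of computation such an image pair feeds, here is a minimal block-matching sketch that scores candidate displacements of a window by sum of squared differences (SSD). The window and search sizes, and the synthetic test images, are our own choices, not the thesis's.

```python
# Illustrative block-matching correspondence (not the thesis's matcher).
import numpy as np

def best_match(left, right, y, x, win=8, search=16):
    """Displacement (dy, dx) of the win x win block at (y, x) in `left`
    that best matches `right`, searched over +/- `search` pixels."""
    block = left[y:y+win, x:x+win].astype(np.int32)
    best_ssd, best_d = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy and 0 <= xx and yy + win <= right.shape[0] \
                    and xx + win <= right.shape[1]:
                cand = right[yy:yy+win, xx:xx+win].astype(np.int32)
                ssd = int(((block - cand) ** 2).sum())
                if best_ssd is None or ssd < best_ssd:
                    best_ssd, best_d = ssd, (dy, dx)
    return best_d

np.random.seed(0)
left = np.random.randint(0, 256, (256, 256), dtype=np.uint8)  # 8-bit gray scale
right = np.roll(left, (2, 3), axis=(0, 1))  # second image = shifted copy
print(best_match(left, right, 120, 120))    # expect (2, 3)
```

On a SIMD mesh the inner SSD loop maps naturally to the array, since one image can be shifted across the other with all pixel positions scored in lockstep, which is part of what makes correspondence matching a natural SIMD-array workload.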

Source publication
Article
Full-text available
THE EVALUATION OF MASSIVELY PARALLEL ARRAY ARCHITECTURES
September 1994
Martin C. Herbordt, B.A., University of Pennsylvania; M.S., University of Massachusetts Amherst; Ph.D., University of Massachusetts Amherst
Directed by: Professor Charles C. Weems
Although massively parallel arrays have been proposed since the 1950s and built since the 1960s, ...

Citations

... A sample of these results is shown in Table 6 ("Sample timings of various image understanding applications run on the CAAPP instruction-level simulator and the generic SIMD array virtual machine emulator"). The applications are described in [14]. The time needed to generate a complete set of ...
Article
SIMD arrays are likely to become increasingly important as coprocessors in domain-specific systems as architects continue to leverage RAM technology in their design. The problem this work addresses is the efficient evaluation of SIMD arrays with respect to complex applications while accounting for operating frequency and chip area. The underlying issues include the size of the architecture space, the lack of portability of the test programs, and the inherent complexity of simulating up to hundreds of thousands of processing elements. The overall method we use is to combine architecture-level and Electronic Design Automation (EDA) level modeling by using an EDA-based tool to calibrate architectural simulations. The resulting system retains much of the high throughput of the architecture-level simulator but also has accuracy similar to that of an early-pass EDA synthesis and circuit simulation. The particular problem of computational cost of the architectural level simulation is addressed ...
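As a reading aid, the calibration idea reduces to a few lines: the architectural simulator supplies cycle counts, the EDA pass supplies the achievable clock period and chip area for each design point, and combining them turns cycles into seconds and enables performance-per-area comparison. All numbers below are invented.

```python
# Sketch of EDA-calibrated architectural simulation (hypothetical data).
designs = {
    # name: (simulated cycles, synthesized clock period in ns, area in mm^2)
    "narrow_datapath": (1_200_000, 4.0, 40.0),
    "wide_datapath":   (  700_000, 5.5, 65.0),
}

for name, (cycles, period_ns, area_mm2) in designs.items():
    runtime_s = cycles * period_ns * 1e-9     # calibrated execution time
    perf_per_area = 1.0 / (runtime_s * area_mm2)
    print(f"{name}: {runtime_s * 1e3:.2f} ms, "
          f"{perf_per_area:.1f} (1/s)/mm^2")
```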
... Herbordt provides a very good analysis of the design tradeoffs for 2D SIMD mesh arrays in [14]. He analyzed meshes for performing image processing tasks. ...
Article
We present a parallel 2D mesh-connected architecture with SIMD processing elements. The design allows for real-time volume rendering as well as interactive 3D segmentation and 3D feature extraction. This is possible because the SIMD processing elements are programmable, a feature which also allows the use of many different rendering algorithms. We present an algorithm which, with the addition of hardware resources, provides conflict-free access to volume slices along any of the three major axes. The volume access conflict has been the main reason why previous similar architectures could not perform real-time volume rendering. We present the performance of preliminary algorithms on a software simulator of the architecture design. CR Categories: C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors)---Single-instruction-stream, multiple-data-stream processors (SIMD); I.3.1 [Computer Graphics]: Hardware Architecture---Graphics processors, Parallel processing...
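The abstract does not reproduce the access scheme itself; a classic way to obtain conflict-free access along all three major axes, shown here purely as an illustration and not claimed to be this paper's design, is linear skewing: voxel (x, y, z) of an N^3 volume is assigned to memory module (x + y + z) mod N, so the N voxels of any axis-aligned beam land in N distinct modules.

```python
# Linear skewing demo: conflict-free beams along x, y, and z.
N = 8

def module(x, y, z):
    return (x + y + z) % N

# Every axis-aligned beam must touch each of the N modules exactly once.
for a in range(N):
    for b in range(N):
        assert len({module(x, a, b) for x in range(N)}) == N  # x-beams
        assert len({module(a, y, b) for y in range(N)}) == N  # y-beams
        assert len({module(a, b, z) for z in range(N)}) == N  # z-beams
print("linear skewing is conflict-free along all three major axes")
```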
... As an example of the last point, some are dominated by gray-scale computation (which is mostly 8-bit integer), others by floating point. They are described in detail in [3]. ...
... The first three comprise the input constructor. (Note that trace compilation is not related to the work by Ellis and Fisher involving trace-scheduling compilers.) Figure 3 shows the major components of ENPASSANT. ...
Article
Though massively parallel SIMD arrays continue to be promising for many computer vision applications, they have undergone few systematic empirical studies. The problems include the size of the architecture space, the lack of portability of the test programs, and the inherent complexity of simulating up to hundreds of thousands of processing elements. The latter two issues have been addressed previously; here we describe how spreadsheets and Tk/Tcl are used to endow our simulator with the flexibility to model a large variety of designs. The utility of this approach is shown in the second half of the paper, where results are presented as to the performance of a large number of array size, datapath, register file, and application code combinations. The conclusions derived include the utility of multiplier and floating point support, the cost of virtual PE emulation, likely datapath/memory combinations, and overall designs with the most promising performance/chip-area ratios.
... A simulation approach to SIMD performance evaluation was taken by Herbordt (Herbordt 1994). His trace compilation technique compiled traces generated on an abstract virtual machine for a specific target architecture. ...
... It was supported by the results of two related research works. Holman (Holman & Snyder 1989) explicitly listed f_A for a variety of algorithms, while Herbordt's (Herbordt 1994) evaluation of parallel architectures included a set of simulation results relating datapath width to execution time. The value of f_A was extracted by fitting Amdahl's function to the published performance curves. ...
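The fitting step mentioned here is straightforward to make concrete: Amdahl's function gives speedup S(p) = 1 / ((1 - f_A) + f_A / p) on p processors, so f_A falls out of a one-parameter curve fit to measured speedup points. The data in this sketch is hypothetical.

```python
# Extracting f_A by fitting Amdahl's function to a performance curve.
import numpy as np
from scipy.optimize import curve_fit

def amdahl(p, f):
    # Amdahl's law: speedup on p processors with parallel fraction f.
    return 1.0 / ((1.0 - f) + f / p)

p = np.array([1, 2, 4, 8, 16, 32], dtype=float)
speedup = np.array([1.0, 1.9, 3.5, 5.9, 9.1, 12.4])  # hypothetical

(f_A,), _ = curve_fit(amdahl, p, speedup, p0=[0.9], bounds=(0.0, 1.0))
print(f"estimated f_A = {f_A:.3f}")
```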
Article
Full-text available
Many important computational problems, including those of computer vision, are characterized by data-parallel, low-precision integer operations on large volumes of data. For such highly structured problems, this thesis develops Abacus, a high-speed reconfigurable SIMD (single-instruction, multiple-data) architecture that outperforms conventional microprocessors by over an order of magnitude using the same silicon resources. Earlier SIMD systems computed at relatively slow clock rates compared to their uniprocessor counterparts. The thesis discusses the problems involved in operating a large SIMD system at high clock rates, including instruction distribution and chip-to-chip communication, and presents the solutions adopted by the Abacus design. Although the chip was implemented in a 1989-era VLSI technology, it was designed to contain 1024 processing elements (PEs), operate at 125 MHz, and deliver 2 billion 16-bit arithmetic operations per second (GOPS). The PE and chip architecture are ...
Article
Pixel-per-processing-element (PPE) ratio, the amount of image data directly mapped to each processing element, has a significant impact on the area and energy efficiency of embedded SIMD architectures for image processing applications. This paper quantitatively evaluates the impact of PPE ratio on system performance and efficiency for focal-plane SIMD image processing architectures by comparing throughput, area efficiency, and energy efficiency for a range of common application kernels using architectural and workload simulation. While ...
Article
Although arrays of SIMD PEs can be built with very high operating frequencies, problems exist in keeping the array busy. The inherent mismatch between host and array makes it difficult to maintain high array utilization: either the rate of instruction issue is very low or PE data locality is compromised, having the same effect. Our solution is based on an array control unit (ACU) design that expands macroinstructions in two stages, first by data tile and then into microinstructions. The expansion itself solves the issue problem; decoupling the expansion modalities maintains data locality. Several issues involving host/ACU interaction need to be resolved to effect this solution. We present experimental results showing that our approach delivers substantial improvement in memory hierarchy performance: a cache of only one fourth the size is sufficient to achieve the same performance as previous approaches. We also describe our implementations, which demonstrate that achieving gigahertz operating frequencies with current technologies is plausible.
Article
Although massively parallel arrays for spatially mapped applications have been proposed since the 1950s [42] and built since the 1960s [12], there have been very few systematic empirical studies that cover more than a small fraction of the design space. The problems have included the lack of a test suite of non-trivial application codes; inadequate language support; the difficulties of balancing evaluation performance with flexibility; and balancing test suite portability with accuracy of evaluation. We describe an environment that addresses these problems. A realistic workload including a series of applications currently being used as building blocks in vision research has been constructed. Both flexibility in architectural parameter selection and simulation efficiency are maintained with a novel technique that combines virtual machine emulation with trace-driven simulation. The trade-off between fairness to diverse target architectures and programmability of the test suite is addressed through the use of operator and application libraries for a small set of critical functions. We also present examples of the type of results we are obtaining, including the effects of changing ALU designs and datapath widths, finding critical points in register set and cache sizes, the benefits of various types of router networks, and the performance cost of processor virtualization.
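A schematic of the emulation-plus-trace-driven-simulation combination, as we read it from this abstract: one emulator run captures a trace of abstract operations, and each candidate architecture is then evaluated by pricing that trace with its own per-operation costs, so the expensive application run happens once rather than once per design point. Operation names and cycle costs below are invented.

```python
# Trace compilation sketch (hypothetical ops and costs).
trace = [("add8", 4096), ("mul8", 1024), ("shift_news", 2048)]  # (op, count)

target_cycles = {
    "wide_alu":   {"add8": 1, "mul8": 4,  "shift_news": 2},
    "bit_serial": {"add8": 8, "mul8": 72, "shift_news": 8},
}

for target, cost in target_cycles.items():
    total = sum(cost[op] * n for op, n in trace)  # price the same trace
    print(f"{target}: {total} cycles")
```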
Conference Paper
Although arrays of SIMD PEs can be built with very high operating frequencies, problems exist in keeping the array busy. The inherent mismatch between host and array makes it difficult to maintain high array utilization: either the rate of instruction issue is very low or PE data locality is compromised, having the same effect. Our solution is based on an array control unit (ACU) design that expands macroinstructions in two stages, first by data tile and then into microinstructions. The expansion itself solves the issue problem; decoupling the expansion modalities maintains data locality. Several issues involving host/ACU interaction need to be resolved to effect this solution. We present experimental results showing that our approach delivers substantial improvement in memory hierarchy performance: a cache of only one fourth the size is sufficient to achieve the same performance as previous approaches.
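The two-stage expansion can be sketched in a few lines; this is our paraphrase of the abstract, not the authors' implementation. The outer stage replicates a macroinstruction over data tiles, the inner stage expands each per-tile operation into microinstructions, and keeping a tile's operands resident across its whole microinstruction sequence is what preserves data locality.

```python
# Two-stage macroinstruction expansion in an ACU (illustrative only).
def expand_by_tile(macro, tiles):
    # Stage 1: replicate the macroinstruction across data tiles.
    for tile in tiles:
        yield macro, tile

def expand_to_micro(macro, tile):
    # Stage 2: expand one per-tile macro into microinstructions.
    # Hypothetical decomposition: load operands, compute, store result.
    op, dst, srcs = macro
    for s in srcs:
        yield ("LOAD", tile, s)
    yield (op, tile, dst)
    yield ("STORE", tile, dst)

tiles = [(r, c) for r in range(2) for c in range(2)]  # a 2x2 tiling
macro = ("ADD", "C", ("A", "B"))
for m, t in expand_by_tile(macro, tiles):
    for micro in expand_to_micro(m, t):
        print(micro)
```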
Conference Paper
A key design parameter for embedded SIMD architectures is the amount of image data directly mapped to each processing element, defined as the pixel-per-processing-element (PPE) ratio. This paper presents a study to determine the effect of different PPE mappings on the performance and efficiency figures of an embedded SIMD architecture. The correlation between problem size, PPE ratio, and processing element architecture is illustrated for a target implementation in 100 nm technology. A case study is illustrated to derive quantitative measures of performance, energy, and area efficiency. For fixed image size, power consumption, and silicon area, a constrained optimization is performed which indicates that a PPE value of 4 yields the most efficient system configuration. Results indicate that this system is capable of delivering performance in excess of 1 Tops/s at 2.4 W, operating at 200 MHz, with 16,384 PEs integrated in about 850 mm².
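The quoted figures are easy to sanity-check: 16,384 PEs at a PPE ratio of 4 cover 65,536 pixels, i.e. a 256x256 image, and at 200 MHz the peak rate, assuming one PE operation per cycle (an assumption the paper may not share), comfortably brackets the reported 1 Tops/s.

```python
# Back-of-envelope check of the reported configuration.
pes, ppe, f_clk = 16384, 4, 200e6

pixels = pes * ppe
side = int(pixels ** 0.5)
print(f"{pixels} pixels mapped, i.e. a {side} x {side} image")  # 256 x 256

peak_ops = pes * f_clk  # assuming one PE op per cycle
print(f"peak = {peak_ops / 1e12:.2f} Tops/s, consistent with the "
      f"reported 'in excess of 1 Tops/s' at partial utilization")
```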
Conference Paper
A key goal in language design is to simultaneously achieve portability and efficiency. Achieving a general solution to this problem is quite difficult: virtually all attempts have emphasized one or the other requirement by restricting either the architecture domain, the application domain, or both. In this study we present (i) a framework that explains why meeting these requirements simultaneously is so difficult, and (ii) our approach, which, though it may not be the final word on this subject, implements a new set of trade-offs that may come closer to a balanced solution than has been previously achieved. Our solution includes an easy-to-use language based on the data-parallel programmer's model, a compiler that hides as many machine variations as possible, a library with emulations of constructs that map directly to hardware on some but not all machines, and a library with different versions of those critical application functions for which a single algorithm is not optimal across all hardware configurations. We have found the programmer cost for the application and architecture domains considered here to be quite reasonable.
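The last element, a library holding several versions of each critical function, amounts to dispatching on a machine description with a portable fallback. A toy sketch with invented names:

```python
# Per-machine function versions with a portable fallback (illustrative).
def convolve_mesh(img):
    return "convolution via NEWS mesh shifts"   # maps to mesh hardware

def convolve_generic(img):
    return "portable emulated convolution"      # works everywhere

VERSIONS = {"convolve": {"mesh": convolve_mesh, None: convolve_generic}}

def dispatch(fn, machine, *args):
    impls = VERSIONS[fn]
    impl = impls.get(machine.get("network"), impls[None])
    return impl(*args)

print(dispatch("convolve", {"network": "mesh"}, None))       # native
print(dispatch("convolve", {"network": "hypercube"}, None))  # fallback
```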