2. Two 256x256 8-bit gray-scale images used as input for the correspondence matcher.

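The matcher itself is not described on this page; purely as an illustration of the kind of computation such an image pair feeds, here is a minimal block-matching sketch that scores candidate displacements of a window by sum of squared differences (SSD). The window and search sizes, and the synthetic test images, are our own choices, not the thesis's.

```python
# Illustrative block-matching correspondence (not the thesis's matcher).
import numpy as np

def best_match(left, right, y, x, win=8, search=16):
    """Displacement (dy, dx) of the win x win block at (y, x) in `left`
    that best matches `right`, searched over +/- `search` pixels."""
    block = left[y:y+win, x:x+win].astype(np.int32)
    best_ssd, best_d = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy and 0 <= xx and yy + win <= right.shape[0] \
                    and xx + win <= right.shape[1]:
                cand = right[yy:yy+win, xx:xx+win].astype(np.int32)
                ssd = int(((block - cand) ** 2).sum())
                if best_ssd is None or ssd < best_ssd:
                    best_ssd, best_d = ssd, (dy, dx)
    return best_d

np.random.seed(0)
left = np.random.randint(0, 256, (256, 256), dtype=np.uint8)  # 8-bit gray scale
right = np.roll(left, (2, 3), axis=(0, 1))  # second image = shifted copy
print(best_match(left, right, 120, 120))    # expect (2, 3)
```

On a SIMD mesh the inner SSD loop maps naturally to the array, since one image can be shifted across the other with all pixel positions scored in lockstep, which is part of what makes correspondence matching a natural SIMD-array workload.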

Source publication
Article
Full-text available
THE EVALUATION OF MASSIVELY PARALLEL ARRAY ARCHITECTURES
September 1994
Martin C. Herbordt, B.A., University of Pennsylvania; M.S., University of Massachusetts Amherst; Ph.D., University of Massachusetts Amherst
Directed by: Professor Charles C. Weems
Although massively parallel arrays have been proposed since the 1950s and built since the 1960s, ...

Citations

... A sample of these results is shown in Table 6 ("Sample timings of various image understanding applications run on the CAAPP instruction-level simulator and the generic SIMD array virtual machine emulator"). The applications are described in [14]. The time needed to generate a complete set of ...
Article
SIMD arrays are likely to become increasingly important as coprocessors in domain-specific systems as architects continue to leverage RAM technology in their design. The problem this work addresses is the efficient evaluation of SIMD arrays with respect to complex applications while accounting for operating frequency and chip area. The underlying issues include the size of the architecture space, the lack of portability of the test programs, and the inherent complexity of simulating up to hundreds of thousands of processing elements. The overall method we use is to combine architecture-level and Electronic Design Automation (EDA) level modeling by using an EDA-based tool to calibrate architectural simulations. The resulting system retains much of the high throughput of the architecture-level simulator but also has accuracy similar to that of an early-pass EDA synthesis and circuit simulation. The particular problem of computational cost of the architectural level simulation is addressed ...
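As a reading aid, the calibration idea reduces to a few lines: the architectural simulator supplies cycle counts, the EDA pass supplies the achievable clock period and chip area for each design point, and combining them turns cycles into seconds and enables performance-per-area comparison. All numbers below are invented.

```python
# Sketch of EDA-calibrated architectural simulation (hypothetical data).
designs = {
    # name: (simulated cycles, synthesized clock period in ns, area in mm^2)
    "narrow_datapath": (1_200_000, 4.0, 40.0),
    "wide_datapath":   (  700_000, 5.5, 65.0),
}

for name, (cycles, period_ns, area_mm2) in designs.items():
    runtime_s = cycles * period_ns * 1e-9     # calibrated execution time
    perf_per_area = 1.0 / (runtime_s * area_mm2)
    print(f"{name}: {runtime_s * 1e3:.2f} ms, "
          f"{perf_per_area:.1f} (1/s)/mm^2")
```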
... Herbordt provides a very good analysis of the design tradeoffs for 2D SIMD mesh arrays in [14]. He analyzed meshes for performing image processing tasks. ...
Article
We present a parallel 2D mesh-connected architecture with SIMD processing elements. The design allows for real-time volume rendering as well as interactive 3D segmentation and 3D feature extraction. This is possible because the SIMD processing elements are programmable, a feature which also allows the use of many different rendering algorithms. We present an algorithm which, with the addition of hardware resources, provides conflict-free access to volume slices along any of the three major axes. The volume access conflict has been the main reason why previous similar architectures could not perform real-time volume rendering. We present the performance of preliminary algorithms on a software simulator of the architecture design. CR Categories: C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors)---Single-instruction-stream, multiple-data-stream processors (SIMD); I.3.1 [Computer Graphics]: Hardware Architecture---Graphics processors, Parallel processing...
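The abstract does not reproduce the access scheme itself; a classic way to obtain conflict-free access along all three major axes, shown here purely as an illustration and not claimed to be this paper's design, is linear skewing: voxel (x, y, z) of an N^3 volume is assigned to memory module (x + y + z) mod N, so the N voxels of any axis-aligned beam land in N distinct modules.

```python
# Linear skewing demo: conflict-free beams along x, y, and z.
N = 8

def module(x, y, z):
    return (x + y + z) % N

# Every axis-aligned beam must touch each of the N modules exactly once.
for a in range(N):
    for b in range(N):
        assert len({module(x, a, b) for x in range(N)}) == N  # x-beams
        assert len({module(a, y, b) for y in range(N)}) == N  # y-beams
        assert len({module(a, b, z) for z in range(N)}) == N  # z-beams
print("linear skewing is conflict-free along all three major axes")
```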
... As an example of the last point, some are dominated by gray-scale computation (which is mostly 8-bit integer), others by floating point. They are described in detail in [3]. ...
... The first three comprise the input constructor. (Note that trace compilation is not related to the work by Ellis and Fisher involving trace-scheduling compilers.) Figure 3 shows the major components of ENPASSANT. ...
Article
Though massively parallel SIMD arrays continue to be promising for many computer vision applications, they have undergone few systematic empirical studies. The problems include the size of the architecture space, the lack of portability of the test programs, and the inherent complexity of simulating up to hundreds of thousands of processing elements. The latter two issues have been addressed previously; here we describe how spreadsheets and Tk/Tcl are used to endow our simulator with the flexibility to model a large variety of designs. The utility of this approach is shown in the second half of the paper, where results are presented as to the performance of a large number of array size, datapath, register file, and application code combinations. The conclusions derived include the utility of multiplier and floating point support, the cost of virtual PE emulation, likely datapath/memory combinations, and overall designs with the most promising performance/chip-area ratios.
... A simulation approach to SIMD performance evaluation was taken by Herbordt (Herbordt 1994). His trace compilation technique compiled traces generated on an abstract virtual machine for a specific target architecture. ...
... It was supported by the results of two related research works. Holman (Holman & Snyder 1989) explicitly listed f_A for a variety of algorithms, while Herbordt's (Herbordt 1994) evaluation of parallel architectures included a set of simulation results relating datapath width to execution time. The value of f_A was extracted by fitting Amdahl's function to the published performance curves. ...
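The fitting step mentioned here is straightforward to make concrete: Amdahl's function gives speedup S(p) = 1 / ((1 - f_A) + f_A / p) on p processors, so f_A falls out of a one-parameter curve fit to measured speedup points. The data in this sketch is hypothetical.

```python
# Extracting f_A by fitting Amdahl's function to a performance curve.
import numpy as np
from scipy.optimize import curve_fit

def amdahl(p, f):
    # Amdahl's law: speedup on p processors with parallel fraction f.
    return 1.0 / ((1.0 - f) + f / p)

p = np.array([1, 2, 4, 8, 16, 32], dtype=float)
speedup = np.array([1.0, 1.9, 3.5, 5.9, 9.1, 12.4])  # hypothetical

(f_A,), _ = curve_fit(amdahl, p, speedup, p0=[0.9], bounds=(0.0, 1.0))
print(f"estimated f_A = {f_A:.3f}")
```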
Article
Full-text available
Many important computational problems, including those of computer vision, are characterized by data-parallel, low-precision integer operations on large volumes of data. For such highly structured problems, this thesis develops Abacus, a high-speed reconfigurable SIMD (single-instruction, multiple-data) architecture that outperforms conventional microprocessors by over an order of magnitude using the same silicon resources. Earlier SIMD systems computed at relatively slow clock rates compared to their uniprocessor counterparts. The thesis discusses the problems involved in operating a large SIMD system at high clock rates, including instruction distribution and chip-to-chip communication, and presents the solutions adopted by the Abacus design. Although the chip was implemented in a 1989-era VLSI technology, it was designed to contain 1024 processing elements (PEs), operate at 125 MHz, and deliver 2 billion 16-bit arithmetic operations per second (GOPS). The PE and chip architecture are ...
Article
Pixel-per-processing-element (PPE) ratio, the amount of image data directly mapped to each processing element, has a significant impact on the area and energy efficiency of embedded SIMD architectures for image processing applications. This paper quantitatively evaluates the impact of PPE ratio on system performance and efficiency for focal-plane SIMD image processing architectures by comparing throughput, area efficiency, and energy efficiency for a range of common application kernels using architectural and workload simulation. While ...
Article
Although arrays of SIMD PEs can be built with very high operating frequencies, problems exist in keeping the array busy. The inherent mismatch between host and array makes it difficult to maintain high array utilization: either the rate of instruction issue is very low or PE data locality is compromised, having the same effect. Our solution is based on an array control unit (ACU) design that expands macroinstructions in two stages, first by data tile and then into microinstructions. The expansion itself solves the issue problem; decoupling the expansion modalities maintains data locality. Several issues involving host/ACU interaction need to be resolved to effect this solution. We present experimental results showing that our approach delivers substantial improvement in memory hierarchy performance: a cache of only one fourth the size is sufficient to achieve the same performance as previous approaches. We also describe our implementations, which demonstrate that achieving gigahertz operating frequencies with current technologies is plausible.
Article
Although massively parallel arrays for spatially mapped applications have been proposed since the 1950s [42] and built since the 1960s [12], there have been very few systematic empirical studies that cover more than a small fraction of the design space. The problems have included the lack of a test suite of non-trivial application codes; inadequate language support; the difficulties of balancing evaluation performance with flexibility; and balancing test suite portability with accuracy of evaluation. We describe an environment that addresses these problems. A realistic workload including a series of applications currently being used as building blocks in vision research has been constructed. Both flexibility in architectural parameter selection and simulation efficiency are maintained with a novel technique that combines virtual machine emulation with trace-driven simulation. The trade-off between fairness to diverse target architectures and programmability of the test suite is addressed through the use of operator and application libraries for a small set of critical functions. We also present examples of the type of results we are obtaining, including the effects of changing ALU designs and datapath widths, finding critical points in register set and cache sizes, the benefits of various types of router networks, and the performance cost of processor virtualization.
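A schematic of the emulation-plus-trace-driven-simulation combination, as we read it from this abstract: one emulator run captures a trace of abstract operations, and each candidate architecture is then evaluated by pricing that trace with its own per-operation costs, so the expensive application run happens once rather than once per design point. Operation names and cycle costs below are invented.

```python
# Trace compilation sketch (hypothetical ops and costs).
trace = [("add8", 4096), ("mul8", 1024), ("shift_news", 2048)]  # (op, count)

target_cycles = {
    "wide_alu":   {"add8": 1, "mul8": 4,  "shift_news": 2},
    "bit_serial": {"add8": 8, "mul8": 72, "shift_news": 8},
}

for target, cost in target_cycles.items():
    total = sum(cost[op] * n for op, n in trace)  # price the same trace
    print(f"{target}: {total} cycles")
```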
Conference Paper
Although arrays of SIMD PEs can be built with very high operating frequencies, problems exist in keeping the array busy. The inherent mismatch between host and array makes it difficult to maintain high array utilization: either the rate of instruction issue is very low or PE data locality is compromised, having the same effect. Our solution is based on an array control unit (ACU) design that expands macroinstructions in two stages, first by data tile and then into microinstructions. The expansion itself solves the issue problem; decoupling the expansion modalities maintains data locality. Several issues involving host/ACU interaction need to be resolved to effect this solution. We present experimental results showing that our approach delivers substantial improvement in memory hierarchy performance: a cache of only one fourth the size is sufficient to achieve the same performance as previous approaches.
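The two-stage expansion can be sketched in a few lines; this is our paraphrase of the abstract, not the authors' implementation. The outer stage replicates a macroinstruction over data tiles, the inner stage expands each per-tile operation into microinstructions, and keeping a tile's operands resident across its whole microinstruction sequence is what preserves data locality.

```python
# Two-stage macroinstruction expansion in an ACU (illustrative only).
def expand_by_tile(macro, tiles):
    # Stage 1: replicate the macroinstruction across data tiles.
    for tile in tiles:
        yield macro, tile

def expand_to_micro(macro, tile):
    # Stage 2: expand one per-tile macro into microinstructions.
    # Hypothetical decomposition: load operands, compute, store result.
    op, dst, srcs = macro
    for s in srcs:
        yield ("LOAD", tile, s)
    yield (op, tile, dst)
    yield ("STORE", tile, dst)

tiles = [(r, c) for r in range(2) for c in range(2)]  # a 2x2 tiling
macro = ("ADD", "C", ("A", "B"))
for m, t in expand_by_tile(macro, tiles):
    for micro in expand_to_micro(m, t):
        print(micro)
```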
Conference Paper
A key design parameter for embedded SIMD architectures is the amount of image data directly mapped to each processing element, defined as the pixel-per-processing-element (PPE) ratio. This paper presents a study to determine the effect of different PPE mappings on the performance and efficiency figures of an embedded SIMD architecture. The correlation between problem size, PPE ratio, and processing element architecture is illustrated for a target implementation in 100 nm technology. A case study is illustrated to derive quantitative measures of performance, energy, and area efficiency. For fixed image size, power consumption, and silicon area, a constrained optimization is performed which indicates that a PPE value of 4 yields the most efficient system configuration. Results indicate that this system is capable of delivering performance in excess of 1 Tops/s at 2.4 W, operating at 200 MHz, with 16,384 PEs integrated in about 850 mm².
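The quoted figures are easy to sanity-check: 16,384 PEs at a PPE ratio of 4 cover 65,536 pixels, i.e. a 256x256 image, and at 200 MHz the peak rate, assuming one PE operation per cycle (an assumption the paper may not share), comfortably brackets the reported 1 Tops/s.

```python
# Back-of-envelope check of the reported configuration.
pes, ppe, f_clk = 16384, 4, 200e6

pixels = pes * ppe
side = int(pixels ** 0.5)
print(f"{pixels} pixels mapped, i.e. a {side} x {side} image")  # 256 x 256

peak_ops = pes * f_clk  # assuming one PE op per cycle
print(f"peak = {peak_ops / 1e12:.2f} Tops/s, consistent with the "
      f"reported 'in excess of 1 Tops/s' at partial utilization")
```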
Conference Paper
A key goal in language design is to simultaneously achieve portability and efficiency. Achieving a general solution to this problem is quite difficult: virtually all attempts have emphasized one or the other requirement by restricting either the architecture domain, the application domain, or both. In this study we present (i) a framework that explains why meeting these requirements simultaneously is so difficult, and (ii) our approach, which, though it may not be the final word on this subject, implements a new set of trade-offs that may come closer to a balanced solution than has been previously achieved. Our solution includes an easy-to-use language based on the data-parallel programmer's model, a compiler that hides as many machine variations as possible, a library with emulations of constructs that map directly to hardware on some but not all machines, and a library with different versions of those critical application functions for which a single algorithm is not optimal across all hardware configurations. We have found the programmer cost for the application and architecture domains considered here to be quite reasonable.
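The last element, a library holding several versions of each critical function, amounts to dispatching on a machine description with a portable fallback. A toy sketch with invented names:

```python
# Per-machine function versions with a portable fallback (illustrative).
def convolve_mesh(img):
    return "convolution via NEWS mesh shifts"   # maps to mesh hardware

def convolve_generic(img):
    return "portable emulated convolution"      # works everywhere

VERSIONS = {"convolve": {"mesh": convolve_mesh, None: convolve_generic}}

def dispatch(fn, machine, *args):
    impls = VERSIONS[fn]
    impl = impls.get(machine.get("network"), impls[None])
    return impl(*args)

print(dispatch("convolve", {"network": "mesh"}, None))       # native
print(dispatch("convolve", {"network": "hypercube"}, None))  # fallback
```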