Figure 2
A StarT-Voyager node. The application processor (aP), its level two cache controller, DRAM, and memory controller are all standard. In place of the second application processor is our custom network interface unit (NIU). The NIU contains 3 FPGAs (aBIU, sBIU, TxU/RxU), 1 ASIC (CTRL), two dual-ported banks of SRAM (aSRAM, sSRAM), a single-ported SRAM (clsSRAM), and an embedded processor (sP). The NIU connects the memory bus of the SMP to a high-performance interconnection network (Arctic).

Source publication
Article
Full-text available
This paper describes StarT-Voyager, a machine designed as an experimental platform for research in cluster system communication. The heart of StarT-Voyager is a network interface unit (NIU) that connects the memory bus of a PowerPC-based SMP to the MIT Arctic network. The NIU is highly flexible, with its set of functions easily modified by firmware...

Context in source publication

Context 1
... StarT-Voyager system consists of an interconnection network and a set of nodes, with one NIU card per node. Each node is an unmodified IBM PowerPC 604e-based SMP with two processor card slots. One slot holds a 166MHz 604e processor with a 512KB in-line L2 cache card; the other holds a StarT-Voyager network interface unit (NIU), as shown in Figure 2. The 604e is referred to as the application processor (aP). The NIU consists of custom hardware, SRAMs, and a 604 microprocessor used as an embedded service processor (sP) to execute firmware. The NIU connects to the MIT Arctic network [1], a 160MB/sec/direction/link fat-tree network designed and implemented within our research group. It is important to note that the aP uses all of the original SMP's infrastructure, including the memory controller, DRAM, PCI bridge, and so ...
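As a reference-only sketch, the node composition described above can be written down as a C struct. The struct and its field names are illustrative shorthand (no such structure exists in the system); the values are simply those quoted in the caption and context.

/* Descriptive sketch: StarT-Voyager node composition as reported above.
 * All names are invented for this summary; values come from the caption
 * and the quoted context. */
struct start_voyager_node {
    /* Commodity side: an unmodified PowerPC 604e-based SMP. */
    struct {
        unsigned clock_mhz;            /* 166 MHz 604e application processor (aP) */
        unsigned l2_cache_kb;          /* 512 KB in-line L2 cache card             */
        /* plus the stock memory controller, DRAM, and PCI bridge */
    } application_processor;

    /* Custom side: the NIU occupying the second processor card slot. */
    struct {
        unsigned num_fpgas;            /* 3: aBIU, sBIU, TxU/RxU                   */
        unsigned num_asics;            /* 1: CTRL                                  */
        unsigned dual_ported_srams;    /* 2: aSRAM, sSRAM                          */
        unsigned single_ported_srams;  /* 1: clsSRAM                               */
        unsigned sp_is_604;            /* embedded 604 service processor (sP) runs firmware */
    } niu;

    unsigned arctic_mb_per_s_per_dir;  /* Arctic fat-tree link: 160 MB/s per direction */
};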

Similar publications

Conference Paper
Full-text available
Algorithms can be accelerated by offloading compute-intensive operations to application accelerators comprising reconfigurable hardware devices known as Field Programmable Gate Arrays (FPGAs). We examine three types of accelerator programming model – master-worker, message passing and shared memory – and a typical FPGA system configuration that uti...
Article
Full-text available
Algorithms can be accelerated by offloading compute-intensive operations to application accelerators comprising reconfigurable hardware devices known as Field Programmable Gate Arrays (FPGAs). We examine three types of accelerator programming model – master-worker, message passing and shared memory – and a typical FPGA system configuration that ut...

Citations

... Other projects have also realized the potential benefits of combining features of message passing and shared memory programming. These include hardware projects with support for shared address spaces and high speed networks [26, 27, 2, 3, 31, 28, 35, 5], as well as software projects that provide run-time support for shared memory on hardware built for message passing [32, 16, 10, 25] and those which explicitly combine shared memory and message passing interfaces at the application level [24, 41, 8]. Other approaches have integrated these two mechanisms more tightly and have developed a distinct programming paradigm for distributed shared memory (DSM), which is implemented in the SHMEM library [38] and the Unified Parallel C language [12]. ...
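As a minimal illustration of the library-level DSM interface mentioned above, the sketch below uses the OpenSHMEM descendant of the cited SHMEM library to store a value directly into another processing element's memory. It is an assumed, generic example, not code from any of the cited systems.

/* Put-style remote access in OpenSHMEM: each PE writes its rank directly
 * into the symmetric variable `dest` on the next PE. */
#include <shmem.h>
#include <stdio.h>

static int dest;                      /* symmetric, i.e. remotely accessible */

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    int src = me;
    shmem_int_put(&dest, &src, 1, (me + 1) % npes);  /* remote store to neighbor */

    shmem_barrier_all();              /* ensure all puts have completed */
    printf("PE %d received %d\n", me, dest);

    shmem_finalize();
    return 0;
}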
... As the number of processors in a parallel computer grew, researchers discovered that one way to provide a scalable hardware architecture was to provide the abstraction of a single global shared memory on hardware with physically distributed memory. This distributed shared memory approach includes software-based systems such as IVY [32], Mirage [16], Munin [10], and Treadmarks [25], as well as hardware systems such as the Alewife processor [26], the DASH processor [31], the FLASH processor [28], the J-Machine [35], StarT-Voyager [5], the IBM SP [2], and the Cray T3E [3], which support shared memory through architectural mechanisms. These hardware approaches combine support for shared address spaces with high speed networks and are thus capable of supporting both shared memory and message passing at the same time. ...
Article
This paper proposes the remote store programming paradigm for parallel computing. This novel model avoids locks and combines the communication mechanism of shared memory programming with the synchronization mechanism of message passing programming. The result is a new paradigm which is particularly suited for multicore architectures with non-uniform memory access (NUMA) shared memory. In this paper, we describe the remote store programming model, present several detailed examples of remote store programs, and present results of implementing remote store programs on a multicore. Our results show that remote store programs can achieve close to linear speedup for large numbers of cores and can provide a sizable performance advantage over shared memory programs.
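A minimal sketch of the remote-store idea as the abstract describes it: the producer writes directly into memory that is local to the consumer, and a single flag provides message-passing-style synchronization with no locks. The thread and flag structure below is an assumption for illustration, not the paper's API.

/* Remote-store illustration with POSIX threads and C11 atomics. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 8

static int consumer_local[N];   /* data that would live near the consumer on a NUMA machine */
static atomic_int ready;        /* the flag plays the role of a message */

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < N; i++)
        consumer_local[i] = i * i;            /* "remote" stores into the consumer's data */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                     /* wait for the notification */
    for (int i = 0; i < N; i++)               /* every read is now local */
        printf("%d\n", consumer_local[i]);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}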
... The architecture is intended to retain the latency-hiding feature of the Monsoon split-phase global memory operations. Later development of the Start-T project can be found in [3,25]. Several other interesting projects have been carried out elsewhere in the US [64–66,95,96,109] and in the world, such as in Japan [75,100]. ...
Article
The dataflow program graph execution model, or dataflow for short, is an alternative to the stored-program (von Neumann) execution model. Because it relies on a graph representation of programs, the strengths of the dataflow model are very much the complements of those of the stored-program one. In the last thirty or so years since it was proposed, the dataflow model of computation has been used and developed in very many areas of computing research: from programming languages to processor design, and from signal processing to reconfigurable computing. This paper is a review of the current state-of-the-art in the applications of the dataflow model of computation. It focuses on three areas: multithreaded computing, signal processing and reconfigurable computing.
... How should the embedded processor interface with the host, and with its message passing hardware? This paper sheds light on these questions by studying two specific designs, both of which are implemented in the StarT-Voyager research machine [5, 4]. This paper makes several contributions. ...
Conference Paper
Full-text available
This paper examines two Network Interface Card micro-architectures that support low-latency, high-bandwidth, user-level message passing in multi-user environments. The two are at different ends of a design spectrum: the Resident queues design relies completely on hardware, while the Non-resident queues design is heavily firmware driven. Through actual implementation of these designs and simulation-based micro-benchmark studies, we identify issues critical to the performance and functionality of the firmware-based approach. The firmware-based approach offers much flexibility at a moderate performance penalty, while the Resident design has superior performance for the functions it implements. This leads us to conclude that a hybrid design combining complete hardware support for common operations and a firmware implementation of less common functions achieves both high performance and flexibility.
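For orientation, the sketch below shows what a user-level send into a hardware-resident transmit queue might look like: the payload is copied into the next free slot of a memory-mapped ring and the producer index is advanced to hand the message to the network hardware. The register layout, names, and sizes are assumptions for illustration only; this is not the StarT-Voyager NES interface.

/* Illustrative user-level send into an assumed memory-mapped transmit queue. */
#include <stdint.h>
#include <string.h>

#define TXQ_SLOTS     16              /* assumed queue depth        */
#define TXQ_SLOT_SIZE 64              /* assumed slot size in bytes */

struct txq {
    volatile uint8_t  slot[TXQ_SLOTS][TXQ_SLOT_SIZE]; /* message slots     */
    volatile uint32_t producer;       /* advanced by user code             */
    volatile uint32_t consumer;       /* advanced by the NIU as it drains  */
};

/* Returns 0 on success, -1 if the message is too large or the queue is full. */
int txq_send(struct txq *q, const void *msg, size_t len)
{
    uint32_t head = q->producer;
    if (len > TXQ_SLOT_SIZE || head - q->consumer >= TXQ_SLOTS)
        return -1;
    memcpy((void *)q->slot[head % TXQ_SLOTS], msg, len);
    /* A real implementation would issue a write barrier here so the payload
     * is globally visible before the index update that hands it to hardware. */
    q->producer = head + 1;
    return 0;
}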
... Examples of extensions include coherent shared memory and non-resident message queues implementation (see Section 4). Interested readers are referred to Ang et al. [2] for discussions of this flexibility and Ang et al. [3] for NES micro-architecture details. ...
Conference Paper
Full-text available
No single message passing mechanism can efficiently support all types of communication that commonly occur in most parallel or distributed programs. MIT's StarT-Voyager, a hybrid message passing/shared memory parallel machine, provides four message passing mechanisms to achieve high performance over a wide spectrum of communication types and sizes. Hardware- and address-translation-enforced protection allows direct user-level access to the message passing facilities in a multiuser environment. StarT-Voyager's protection scheme improves upon past designs by not requiring strictly synchronized gang-scheduling, and by supporting non-monolithic protection domains. To minimize development effort and cost, the machine is designed to use unmodified commercial PowerPC 604-based SMP systems as the building block. A Network End-point Subsystem (NES) card, which plugs into one of each SMP's processor card slots, provides the interface to Arctic, a low-latency, high-bandwidth network developed at MIT. This paper describes StarT-Voyager's message passing mechanisms and their predicted performance.
Article
Current high-end microprocessors achieve high performance as a result of adding more features and therefore increasing complexity. This paper makes the case for a Chip-Multiprocessor based on the Data-Driven Multithreading (DDM-CMP) execution model in order to overcome the limitations of current design trends. Data-Driven Multithreading (DDM) is a multithreading model that effectively hides communication delays and synchronization overheads. DDM-CMP avoids the complexity of other designs by combining simple commodity microprocessors with a small hardware overhead for thread scheduling and an interconnection network. Preliminary experimental results show that a DDM-CMP chip with the same hardware budget as a high-end commercial microprocessor, clocked at the same frequency, achieves a speedup of up to 18.5 while consuming 78–81% of the commercial chip's power. Overall, the estimated results for the proposed DDM-CMP architecture show a significant benefit in terms of both speedup and power consumption, making it an attractive architecture for future processors.
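Taking the reported figures at face value, the combined benefit can be expressed as performance per watt: a speedup of up to 18.5 at 78–81% of the commercial chip's power corresponds to roughly 18.5 / 0.81 ≈ 22.8 up to 18.5 / 0.78 ≈ 23.7 times the performance per unit power.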
Conference Paper
Full-text available
We describe and analyze the performance of a cluster of personal computers dedicated to coupled climate simulations. This climate modeling system performs comparably to state-of-the-art supercomputers and yet is affordable by individual research groups, thus enabling more spontaneous application of high-end numerical models to climate science. The cluster's novelty centers around the Arctic Switch Fabric and the StarT-X network interface, a system-area interconnect substrate developed at MIT. A significant fraction of the interconnect's hardware performance is made available to our climate model through an application-specific communication library. In addition to reporting the overall application performance of our cluster, we develop an analytical performance model of our application. Based on this model, we define a metric, Potential Floating-Point Performance, which we use to quantify the role of high-speed interconnects in determining application performance. Our results show that a high-performance interconnect, in conjunction with a light-weight application-specific library, provides efficient support for our fine-grain parallel application on an otherwise general-purpose commodity system.
Article
Full-text available
Distributed shared memory is an architectural approach that allows multiprocessors to support a single shared address space that is implemented with physically distributed memories. Hardware-supported distributed shared memory is becoming the dominant approach for building multiprocessors with moderate to large numbers of processors. Cache coherence allows such architectures to use caching to take advantage of locality in applications without changing the programmer's model of memory. We review the key developments that led to the creation of cache-coherent distributed shared memory and describe the Stanford DASH Multiprocessor, the first working implementation of hardware-supported scalable cache coherence. We then provide a perspective on such architectures and discuss important remaining technical challenges.

1. Motivations

In the 1980s, multiprocessors were designed with two major architectural approaches. For small numbers of processors (typically less than 16 or 32), the domina...