Figure 2
A StarT-Voyager node. The application processor (aP), its level two cache controller, DRAM, and memory controller are all standard. In place of the second application processor is our custom network interface unit (NIU). The NIU contains 3 FPGAs (aBIU, sBIU, TxU/RxU), 1 ASIC (CTRL), two dual-ported banks of SRAM (aSRAM, sSRAM), a single-ported SRAM (clsSRAM), and an embedded processor (sP). The NIU connects the memory bus of the SMP to a high-performance interconnection network (Arctic).

Source publication
Article
Full-text available
This paper describes StarT-Voyager, a machine designed as an experimental platform for research in cluster system communication. The heart of StarT-Voyager is a network interface unit (NIU) that connects the memory bus of a PowerPC-based SMP to the MIT Arctic network. The NIU is highly flexible, with its set of functions easily modified by firmware...

Context in source publication

Context 1
... StarT-Voyager system consists of an interconnection network and a set of nodes, with one NIU card per node. Each node is an unmodified IBM PowerPC 604e-based SMP with two processor card slots. One slot holds a 166MHz 604e processor with a 512KB in-line L2 cache card; the other holds a StarT-Voyager network interface unit (NIU), as shown in Figure 2. The 604e is referred to as the application processor (aP). The NIU consists of custom hardware, SRAMs, and a 604 microprocessor used as an embedded service processor (sP) to execute firmware. The NIU connects to the MIT Arctic network [1], a 160MB/sec/direction/link fat-tree network designed and implemented within our research group. It is important to note that the aP uses all of the original SMP's infrastructure, including the memory controller, DRAM, PCI bridge, and so ...
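As a reference-only sketch, the node composition described above can be written down as a C struct. The struct and its field names are illustrative shorthand (no such structure exists in the system); the values are simply those quoted in the caption and context.

/* Descriptive sketch: StarT-Voyager node composition as reported above.
 * All names are invented for this summary; values come from the caption
 * and the quoted context. */
struct start_voyager_node {
    /* Commodity side: an unmodified PowerPC 604e-based SMP. */
    struct {
        unsigned clock_mhz;            /* 166 MHz 604e application processor (aP) */
        unsigned l2_cache_kb;          /* 512 KB in-line L2 cache card             */
        /* plus the stock memory controller, DRAM, and PCI bridge */
    } application_processor;

    /* Custom side: the NIU occupying the second processor card slot. */
    struct {
        unsigned num_fpgas;            /* 3: aBIU, sBIU, TxU/RxU                   */
        unsigned num_asics;            /* 1: CTRL                                  */
        unsigned dual_ported_srams;    /* 2: aSRAM, sSRAM                          */
        unsigned single_ported_srams;  /* 1: clsSRAM                               */
        unsigned sp_is_604;            /* embedded 604 service processor (sP) runs firmware */
    } niu;

    unsigned arctic_mb_per_s_per_dir;  /* Arctic fat-tree link: 160 MB/s per direction */
};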

Similar publications

Conference Paper
Full-text available
Algorithms can be accelerated by offloading compute-intensive operations to application accelerators comprising reconfigurable hardware devices known as Field Programmable Gate Arrays (FPGAs). We examine three types of accelerator programming model – master-worker, message passing and shared memory – and a typical FPGA system configuration that uti...
Article
Full-text available
Algorithms can be accelerated by offloading compute-intensive operations to application accelerators comprising reconfigurable hardware devices known as Field Programmable Gate Arrays (FPGAs). We examine three types of accelerator programming model – master-worker, message passing and shared memory – and a typical FPGA system configuration that ut...

Citations

... Other projects have also realized the potential benefits of combining features of message passing and shared memory programming. These include hardware projects with support for shared address spaces and high speed networks [26, 27, 2, 3, 31, 28, 35, 5], as well as software projects that provide run-time support for shared memory on hardware built for message passing [32, 16, 10, 25] and those which explicitly combine shared memory and message passing interfaces at the application level [24, 41, 8]. Other approaches have integrated these two mechanisms more tightly and have developed a distinct programming paradigm for distributed shared memory (DSM), which is implemented in the SHMEM library [38] and the Unified Parallel C language [12]. ...
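As a minimal illustration of the library-level DSM interface mentioned above, the sketch below uses the OpenSHMEM descendant of the cited SHMEM library to store a value directly into another processing element's memory. It is an assumed, generic example, not code from any of the cited systems.

/* Put-style remote access in OpenSHMEM: each PE writes its rank directly
 * into the symmetric variable `dest` on the next PE. */
#include <shmem.h>
#include <stdio.h>

static int dest;                      /* symmetric, i.e. remotely accessible */

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    int src = me;
    shmem_int_put(&dest, &src, 1, (me + 1) % npes);  /* remote store to neighbor */

    shmem_barrier_all();              /* ensure all puts have completed */
    printf("PE %d received %d\n", me, dest);

    shmem_finalize();
    return 0;
}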
... As the number of processors in a parallel computer grew, researchers discovered that one way to provide a scalable hardware architecture was to provide the abstraction of a single global shared memory on hardware with physically distributed memory. This distributed shared memory approach includes software-based systems such as IVY [32], Mirage [16], Munin [10], and Treadmarks [25], as well as hardware systems such as the Alewife processor [26], the DASH processor [31], the FLASH processor [28], the J-Machine [35], StarT-Voyager [5], the IBM SP [2], and the Cray T3E [3], which support shared memory through architectural mechanisms. These hardware approaches combine support for shared address spaces with high speed networks and are thus capable of supporting both shared memory and message passing at the same time. ...
Article
This paper proposes the remote store programming paradigm for parallel computing. This novel model avoids locks and combines the communication mechanism of shared memory programming with the synchronization mechanism of message passing programming. The result is a new paradigm which is particularly suited for multicore architectures with non-uniform memory access (NUMA) shared memory. In this paper, we describe the remote store programming model, present several detailed examples of remote store programs, and present results of implementing remote store programs on a multicore. Our results show that remote store programs can achieve close to linear speedup for large numbers of cores and can provide a sizable performance advantage over shared memory programs.
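A minimal sketch of the remote-store idea as the abstract describes it: the producer writes directly into memory that is local to the consumer, and a single flag provides message-passing-style synchronization with no locks. The thread and flag structure below is an assumption for illustration, not the paper's API.

/* Remote-store illustration with POSIX threads and C11 atomics. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 8

static int consumer_local[N];   /* data that would live near the consumer on a NUMA machine */
static atomic_int ready;        /* the flag plays the role of a message */

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < N; i++)
        consumer_local[i] = i * i;            /* "remote" stores into the consumer's data */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                     /* wait for the notification */
    for (int i = 0; i < N; i++)               /* every read is now local */
        printf("%d\n", consumer_local[i]);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}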
... The architecture is intended to retain the latency-hiding feature of the Monsoon split-phase global memory operations. Later development of the Start-T project can be found in [3,25]. Several other interesting projects have been carried out elsewhere in the US [64–66,95,96,109] and in the world, such as in Japan [75,100]. ...
Article
The dataflow program graph execution model, or dataflow for short, is an alternative to the stored-program (von Neumann) execution model. Because it relies on a graph representation of programs, the strengths of the dataflow model are very much the complements of those of the stored-program one. In the last thirty or so years since it was proposed, the dataflow model of computation has been used and developed in very many areas of computing research: from programming languages to processor design, and from signal processing to reconfigurable computing. This paper is a review of the current state-of-the-art in the applications of the dataflow model of computation. It focuses on three areas: multithreaded computing, signal processing and reconfigurable computing.
... How should the embedded processor interface with the host, and with its message passing hardware? This paper sheds light on these questions by studying two specific designs, both of which are implemented in the StarT-Voyager research machine [5, 4]. This paper makes several contributions. ...
Conference Paper
Full-text available
This paper examines two Network Interface Card micro-architectures that support low-latency, high-bandwidth, user-level message passing in multi-user environments. The two are at different ends of a design spectrum: the Resident queues design relies completely on hardware, while the Non-resident queues design is heavily firmware driven. Through actual implementation of these designs and simulation-based micro-benchmark studies, we identify issues critical to the performance and functionality of the firmware-based approach. The firmware-based approach offers much flexibility at a moderate performance penalty, while the Resident design has superior performance for the functions it implements. This leads us to conclude that a hybrid design combining complete hardware support for common operations and a firmware implementation of less common functions achieves both high performance and flexibility.
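For orientation, the sketch below shows what a user-level send into a hardware-resident transmit queue might look like: the payload is copied into the next free slot of a memory-mapped ring and the producer index is advanced to hand the message to the network hardware. The register layout, names, and sizes are assumptions for illustration only; this is not the StarT-Voyager NES interface.

/* Illustrative user-level send into an assumed memory-mapped transmit queue. */
#include <stdint.h>
#include <string.h>

#define TXQ_SLOTS     16              /* assumed queue depth        */
#define TXQ_SLOT_SIZE 64              /* assumed slot size in bytes */

struct txq {
    volatile uint8_t  slot[TXQ_SLOTS][TXQ_SLOT_SIZE]; /* message slots     */
    volatile uint32_t producer;       /* advanced by user code             */
    volatile uint32_t consumer;       /* advanced by the NIU as it drains  */
};

/* Returns 0 on success, -1 if the message is too large or the queue is full. */
int txq_send(struct txq *q, const void *msg, size_t len)
{
    uint32_t head = q->producer;
    if (len > TXQ_SLOT_SIZE || head - q->consumer >= TXQ_SLOTS)
        return -1;
    memcpy((void *)q->slot[head % TXQ_SLOTS], msg, len);
    /* A real implementation would issue a write barrier here so the payload
     * is globally visible before the index update that hands it to hardware. */
    q->producer = head + 1;
    return 0;
}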
... Examples of extensions include coherent shared memory and non-resident message queues implementation (see Section 4). Interested readers are referred to Ang et al. [2] for discussions of this flexibility and Ang et al. [3] for NES micro-architecture details. ...
Conference Paper
Full-text available
No single message passing mechanism can efficiently support all types of communication that commonly occur in most parallel or distributed programs. MIT's StarT-Voyager, a hybrid message passing/shared memory parallel machine, provides four message passing mechanisms to achieve high performance over a wide spectrum of communication types and sizes. Hardware- and address-translation-enforced protection allows direct user-level access to the message passing facilities in a multiuser environment. StarT-Voyager's protection scheme improves upon past designs by not requiring strictly synchronized gang-scheduling, and by supporting non-monolithic protection domains. To minimize development effort and cost, the machine is designed to use unmodified commercial PowerPC 604-based SMP systems as the building block. A Network End-point Subsystem (NES) card, which plugs into one of each SMP's processor card slots, provides the interface to Arctic, a low-latency, high-bandwidth network developed at MIT. This paper describes StarT-Voyager's message passing mechanisms and their predicted performance.
Article
Current high-end microprocessors achieve high performance as a result of adding more features and therefore increasing complexity. This paper makes the case for a Chip-Multiprocessor based on the Data-Driven Multithreading (DDM-CMP) execution model in order to overcome the limitations of current design trends. Data-Driven Multithreading (DDM) is a multithreading model that effectively hides communication delays and synchronization overheads. DDM-CMP avoids the complexity of other designs by combining simple commodity microprocessors with a small hardware overhead for thread scheduling and an interconnection network. Preliminary experimental results show that a DDM-CMP chip with the same hardware budget as a high-end commercial microprocessor, clocked at the same frequency, achieves a speedup of up to 18.5 while consuming 78–81% of the commercial chip's power. Overall, the estimated results for the proposed DDM-CMP architecture show a significant benefit in terms of both speedup and power consumption, making it an attractive architecture for future processors.
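Taking the reported figures at face value, the combined benefit can be expressed as performance per watt: a speedup of up to 18.5 at 78–81% of the commercial chip's power corresponds to roughly 18.5 / 0.81 ≈ 22.8 up to 18.5 / 0.78 ≈ 23.7 times the performance per unit power.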
Conference Paper
Full-text available
We describe and analyze the performance of a cluster of personal computers dedicated to coupled climate simulations. This climate modeling system performs comparably to state-of-the-art supercomputers and yet is affordable by individual research groups, thus enabling more spontaneous application of high-end numerical models to climate science. The cluster's novelty centers around the Arctic Switch Fabric and the StarT-X network interface, a system-area interconnect substrate developed at MIT. A significant fraction of the interconnect's hardware performance is made available to our climate model through an application-specific communication library. In addition to reporting the overall application performance of our cluster, we develop an analytical performance model of our application. Based on this model, we define a metric, Potential Floating-Point Performance, which we use to quantify the role of high-speed interconnects in determining application performance. Our results show that a high-performance interconnect, in conjunction with a light-weight application-specific library, provides efficient support for our fine-grain parallel application on an otherwise general-purpose commodity system.
Article
Full-text available
Distributed shared memory is an architectural approach that allows multiprocessors to support a single shared address space that is implemented with physically distributed memories. Hardware-supported distributed shared memory is becoming the dominant approach for building multiprocessors with moderate to large numbers of processors. Cache coherence allows such architectures to use caching to take advantage of locality in applications without changing the programmer's model of memory. We review the key developments that led to the creation of cache-coherent distributed shared memory and describe the Stanford DASH Multiprocessor, the first working implementation of hardware-supported scalable cache coherence. We then provide a perspective on such architectures and discuss important remaining technical challenges.

1. Motivations

In the 1980s, multiprocessors were designed with two major architectural approaches. For small numbers of processors (typically less than 16 or 32), the domina...