Figure 1. The MPC computer architecture.

Source publication
Article
This paper presents an efficient MPI implementation on a cluster of PCs using a remote DMA communication primitive. For experimental purposes, the MultiPC (MPC) parallel computer was used. It consists of standard PCs interconnected through a gigabit High Speed Link (HSL) network. This paper focuses on communication software layers over the HSL netw...

Context in source publication

Context 1
... The MultiPC (MPC) parallel computer is a low-cost, high-performance cluster of PCs. The differentiating and original element of this cluster is its High-Speed Link (HSL) network, designed at the University P. & M. Curie, Paris. This packet-switched network uses 1 Gbit/s, point-to-point serial links. From the application software point of view, the MPC computer provides an optimized Message Passing Interface (MPI) library [3]: efficient software layers and a specific high-performance implementation of MPICH [11] have been developed on top of the HSL network. The MPC parallel computer is a high-performance cluster [8] with the following components:
- standard PC main-boards,
- a standard Unix-based operating system (Linux or FreeBSD),
- an Ethernet control network,
- a high-speed communication network: the gigabit HSL network,
- a FastHSL network controller on each node implementing the communication protocol,
- efficient software layers for using the HSL network,
- an efficient implementation of MPICH over the HSL network.

Communication performance is a critical aspect of cluster computing. From the hardware point of view, gigabit high-speed networks (such as Myrinet [7], SCI [12] or HSL [6]) are now very common. Nevertheless, efficient hardware alone is not sufficient to reach good performance: the communication software overhead represents the main part of the communication time [13]. The goal of projects like Active Messages [17] or Fast Messages [14] is to reduce this software overhead. The critical factors are well identified: intermediate data copies, the crossing of several communication layers, physical/virtual address translations, and the use of system calls and interrupt signaling during communications.

This paper deals with an efficient implementation of MPICH over a high-speed network providing a remote DMA communication primitive, and with the impact of using system calls and hardware interrupts during communications. Two implementations of MPICH over a basic "remote write" primitive have been realized: MPI-MPC 1 and MPI-MPC 2. Section 2 gives a brief description of the MPC parallel computer and its remote write primitive. Section 3 describes MPI-MPC 1, the first implementation of MPICH over MPC, which uses system calls and interrupts during communication phases. Section 4 explains how user-level communication is achieved by the MPI-MPC 2 implementation. Section 5 compares the two implementations. Section 6 presents computation times for the Gauss elimination method using a parallel implementation of a library that takes floating-point round-off error propagation into account: Control of Accuracy and Debugging for Numerical Applications (CADNA) [2]. Finally, Section 7 concludes and discusses future work.

The MPC computer (MultiPC) is the result of a six-year research project at LIP6. The goal was to design a cluster of PCs using a truly scalable, high-speed interconnection network that provides an efficient remote DMA primitive to the software. This section gives a brief description of the hardware and low-level software of the MPC computer. It falls clearly within the class of low-cost, Beowulf-like [4] high-performance computers. A recent overview of high-performance cluster computing is given in [8]. From the hardware point of view, MPC consists of several processing nodes interconnected by a custom high-speed communication network (HSL) which complies with the IEEE-1355 standard [6].
Processing nodes are standard PCs. Figure 1 presents the architecture of the MPC computer. Three nodes are represented; they are connected via an Ethernet control network to a front-end computer. Each node contains an HSL network controller board designed by the LIP6 laboratory. The FastHSL board carries two ASICs. The PCI-DDC chip [18] is a PCI controller that implements a "remote write" protocol. The RCUBE router [5] is a single-chip 8x8 dynamic crossbar that offers 8 bi-directional HSL ports. Thanks to the highly integrated RCUBE router, there is no centralized switch in this architecture: each node contains its own routing capability. The HSL links are 1 Gbit/s bi-directional, full-duplex serial links. The HSL connections between CPU1, CPU2 and CPU3 are not represented in Figure 1. More information on the MPC hardware can be found in [10].

Each processing node is connected to an RCUBE router through a dedicated PCI-to-HSL network controller named PCI-DDC. This chip implements the Direct Deposit State Less Receiver Protocol (DDSLRP), developed at LIP6 to reduce the processor overhead. Classical data transfer protocols usually require several copies of data into intermediate buffers before and after transmission through the network; PCI-DDC, instead, accesses the host memory directly. To enhance performance, PCI-DDC implements the "remote write" primitive described in Figure 2. It can be seen as a DMA request where the local PCI-DDC directly fetches data from the local host memory and the remote PCI-DDC writes data directly into the remote memory. The descriptors of a message are pushed by the software into the LME, the "List of Messages to Send", located in host memory. This list contains the descriptors of the buffers to be transmitted (local and remote physical addresses, length, destination node, etc.). A remote write communication proceeds as follows (step numbers refer to Figure 2):
- (1) NIC A reads the message descriptor through a DMA access on the PCI bus;
- (2) it starts the data transmission, again using DMA accesses to host memory, thus relieving the sender processor of the data transfer;
- (4) on the receiver side, as soon as PCI-DDC receives a packet, it starts writing the incoming data at the corresponding memory location;
- (5) when the last packet has been written, NIC B writes an entry into the LMR, the "List of Received Messages" of the destination node;
- (3), (6) on both the sending and the receiving sides, notification is achieved by a hardware interrupt signal to the host processor.

Our goal is to provide efficient software layers to access the HSL network from the application level. The lowest software layer, called PUT, offers basic kernel communication services using the remote write primitive described in Section 2.2. This layer provides a kernel API that writes page descriptors into the LME and handles event signaling using hardware interrupts. A zero-copy strategy is implemented to take advantage of the performance offered by the HSL network. To let multiple users call PUT simultaneously and to handle interrupts, this layer is implemented inside the OS kernel. The communication primitive supplied by the PUT API can only transfer a contiguous buffer located in physical memory. It needs the following parameters: the receiver node identifier, the physical address of the local buffer, the physical address of the remote buffer, the data length and a set of flags. The latency of this layer is about 5 µs and the maximum bandwidth is 494 Mbit/s on a 350 MHz Pentium II, without using interrupts.
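As a rough, declaration-level illustration of the data structures this description implies, the C sketch below models one LME entry and a PUT-style send call. The structure layout, the field names and the function name put_add_entry are assumptions made for illustration only; the actual descriptor format is fixed by the PCI-DDC hardware, and the real PUT kernel API is documented in [1].

```c
/* Hypothetical, declaration-level sketch of a "remote write" descriptor and
 * of a PUT-style kernel call. Names and layout are illustrative; the real
 * LME format is defined by the PCI-DDC hardware and the MPC PUT API. */
#include <stdint.h>

struct lme_entry {            /* one entry of the LME ("List of Messages to Send") */
    uint16_t dest_node;       /* receiver node identifier                          */
    uint64_t local_paddr;     /* physical address of the local, contiguous buffer  */
    uint64_t remote_paddr;    /* physical address of the destination buffer        */
    uint32_t length;          /* data length in bytes                              */
    uint32_t flags;           /* e.g. request sender/receiver notification         */
};

/* Push one descriptor into the LME so that the local PCI-DDC can fetch the
 * buffer by DMA and the remote PCI-DDC can deposit it directly into remote
 * memory. Returns 0 on success, a negative value on error. */
int put_add_entry(const struct lme_entry *desc);
```

The key point the sketch tries to capture is that the sender only queues a descriptor: all data movement is done by the two PCI-DDC controllers through DMA, without intermediate copies by the host processors.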
For more details about the MPC low-level software layers, please consult [1]. The MPC parallel computer runs MPI [3] applications. We have designed an efficient MPICH [11] implementation for PC clusters using a remote DMA communication primitive. This implementation is built on top of the PUT layer. The specification of PUT raises two main problems. First, data transmitted by PUT must be physically located in a contiguous memory space. Second, the sender has to know where to write the data in the remote physical memory. There are two types of messages in MPICH: control messages and data messages. In our MPI implementation, control messages are used to transfer control information or limited-size user data rapidly over the HSL network. The maximum size of a control message is set to 16 Kbytes in the current MPI-MPC implementation. There are four sub-types of control messages:
- SHORT: user data (encapsulated in a control message),
- REQ: request for transmission of a large data message,
- RSP: reply to a request,
- CRDT: credits, used for flow control.

The MPC software allocates at boot time, on each node, an array of contiguous physical memory that is used to implement the destination buffers for the control messages. Each node gets the physical addresses of all the remote slots allocated to it through the control network (all nodes are connected by an Ethernet network for configuration). The transmission of control messages is done through an intermediate copy into pre-allocated buffers ...
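To make the eager/rendezvous distinction concrete, here is a hedged sketch of how an MPI-MPC-like sender might dispatch a message: small payloads travel in a SHORT control message (one intermediate copy into a pre-allocated slot), while larger ones go through a REQ/RSP exchange before the remote write. Only the four sub-types and the 16 Kbyte limit come from the text above; the constant, type and helper names are hypothetical, and the helpers are left as prototypes.

```c
/* Illustrative sketch of the control-message dispatch described above.
 * MAX_CTRL_SIZE mirrors the 16 Kbyte MPI-MPC limit; everything else
 * (names, helpers) is hypothetical, not the actual MPI-MPC code. */
#include <stddef.h>

#define MAX_CTRL_SIZE (16 * 1024)   /* current MPI-MPC control-message limit */

enum ctrl_msg_type {
    CTRL_SHORT,   /* user data encapsulated in a control message      */
    CTRL_REQ,     /* request to transmit a large data message         */
    CTRL_RSP,     /* reply to a request                               */
    CTRL_CRDT     /* credits, used for flow control                   */
};

/* Hypothetical helpers standing in for the real MPI-MPC internals. */
int send_ctrl(int dest, enum ctrl_msg_type type, const void *buf, size_t len);
int send_rendezvous(int dest, const void *buf, size_t len); /* REQ, wait for RSP, remote write */

/* Small payloads go eagerly in a SHORT control message (copied into a
 * pre-allocated slot on the destination node); larger payloads use the
 * REQ/RSP exchange so the remote write can target the final buffer. */
int mpi_mpc_send(int dest, const void *buf, size_t len)
{
    if (len <= MAX_CTRL_SIZE)
        return send_ctrl(dest, CTRL_SHORT, buf, len);
    return send_rendezvous(dest, buf, len);
}
```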


Citations

Conference Paper
Distributed systems must provide some kind of inter-process communication (IPC) mechanism to enable communication between local and, especially, geographically dispersed and physically distributed processes. These mechanisms may be implemented at different levels of a distributed system, namely at the application level, library level, operating system interface level, or kernel level. Upper-level implementations are intuitively simpler to develop but less efficient. This paper provides hard evidence for this intuition. It considers two well-known IPC mechanisms, one implemented at library level, called MPI, and the other implemented at kernel level, called DIPC. It shows that a distributed system that uses MPI to program and run a parallel calculation of Pi is on average 35% slower than the same distributed system using DIPC for the same calculation. It is concluded that if distributed systems are to become an appropriate platform for high-performance scientific computing of all kinds, it is necessary to implement IPC mechanisms at kernel level, even ignoring the many other factors that favor kernel-level implementations, such as safety, privilege, reliability, and primitiveness.
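For reference, the workload measured in this citing study is the classic parallel estimation of Pi by numerical integration. The minimal MPI version below (in C) is a generic textbook sketch, not the code used in the cited paper; the interval count and the cyclic work decomposition are arbitrary choices.

```c
/* Minimal sketch of a parallel Pi estimation with MPI: midpoint
 * integration of 4/(1+x^2) over [0,1]. Compile with: mpicc pi.c -o pi */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    const long N = 100000000L;          /* number of sub-intervals (arbitrary) */
    double h, local_sum = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    h = 1.0 / (double)N;
    /* Each rank sums every size-th midpoint, starting at its own offset. */
    for (long i = rank; i < N; i += size) {
        double x = h * ((double)i + 0.5);
        local_sum += 4.0 / (1.0 + x * x);
    }
    local_sum *= h;

    /* Combine the partial sums on rank 0. */
    MPI_Reduce(&local_sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi ~= %.12f\n", pi);

    MPI_Finalize();
    return 0;
}
```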