Figure 5: Xeon Phi 64kB page table entry format.

Source publication
Article
The increasing prevalence of co-processors such as the Intel Xeon Phi has been reshaping the high performance computing (HPC) landscape. The Xeon Phi comes with a large number of power-efficient CPU cores, but at the same time it is a highly memory-constrained environment, leaving the task of memory management entirely to application developers....

Contexts in source publication

Context 1
... The 64kB page extension was originally added as an intermediate step between 4kB and 2MB pages, so that TLB misses can be reduced while fine granularity is preserved. Figure 5 illustrates the page table format. Support for 64kB pages is enabled via a hint bit in page table entries; the hardware mechanism relies on the operating system manipulating the page tables and address maps correctly. ...
Context 2
... Support for 64kB pages is enabled via a hint bit in page table entries; the hardware mechanism relies on the operating system manipulating the page tables and address maps correctly. As indicated by PageFrame_k in Figure 5, to set up a 64kB mapping the operating system must initialize 16 regular 4kB page table entries (PTEs) that map consecutive 4kB pages of a contiguous 64kB memory region. Furthermore, the first of the 16 entries must correspond to a 64kB-aligned virtual address, which in turn must map to a 64kB-aligned physical frame. ...
Context 3
... the first of the 16 entries must correspond to a 64kB-aligned virtual address, which in turn must map to a 64kB-aligned physical frame. The OS then sets a special bit in the PTEs to indicate that CPU cores should cache the translation as a single 64kB entry rather than a series of separate 4kB entries, denoted by the 64 flag in the header of the PTEs in Figure 5. On a TLB miss, the hardware performs the page table walk as usual, and the INVLPG instruction also works as expected. ...
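To make the mechanism concrete, the sketch below shows how an OS might populate such a mapping: 16 consecutive 4kB PTEs, each tagged with the hint bit, starting from 64kB-aligned virtual and physical addresses. This is a minimal illustration under stated assumptions: the flag positions (in particular PTE_HINT_64K) and the helper name map_64k are hypothetical and do not reflect the actual Xeon Phi PTE encoding.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

#define PAGE_SHIFT_4K  12
#define PAGE_SIZE_4K   (1UL << PAGE_SHIFT_4K)
#define PAGE_SIZE_64K  (16 * PAGE_SIZE_4K)

/* Hypothetical flag positions; the real Xeon Phi encoding may differ. */
#define PTE_PRESENT    (1UL << 0)
#define PTE_WRITABLE   (1UL << 1)
#define PTE_HINT_64K   (1UL << 11)   /* the "64" hint bit from Figure 5 */

/*
 * Install a 64kB mapping at virt -> phys by filling 16 consecutive
 * 4kB PTEs in a 512-entry page table, each carrying the 64kB hint bit.
 * Both addresses must be 64kB aligned, as the hardware requires.
 */
static void map_64k(uint64_t *pte_base, uint64_t virt, uint64_t phys)
{
    assert(virt % PAGE_SIZE_64K == 0);
    assert(phys % PAGE_SIZE_64K == 0);

    /* Index of the first of the 16 PTEs within the page table;
     * 64kB alignment guarantees it is a multiple of 16. */
    size_t idx = (virt >> PAGE_SHIFT_4K) & 0x1ff;

    for (size_t i = 0; i < 16; i++) {
        uint64_t frame = phys + i * PAGE_SIZE_4K;   /* PageFrame_k + i */
        pte_base[idx + i] = frame | PTE_PRESENT | PTE_WRITABLE
                                  | PTE_HINT_64K;
    }
}
```

On a TLB miss the page walker would then observe the hint bit and cache the translation as one 64kB TLB entry instead of 16 separate 4kB entries.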

Similar publications

Conference Paper
Power-aware parallel job scheduling has been recognized as a demanding issue in the high-performance computing (HPC) community. The goal is to efficiently allocate and utilize power and energy in machine rooms. In practice, the power for machine rooms is well over-provisioned, specified by high-energy LINPACK runs or nameplate power estimates. This...
Article
Although high-performance computing has always been about efficient application execution, both energy and power consumption have become critical concerns owing to their effect on operating costs and failure rates of large-scale computing platforms. Modern processors provide techniques, such as dynamic voltage and frequency scaling (DVFS) and CPU c...
Article
Objective: The advent of High-Performance Computing (HPC) in recent years has led to its increasing use in brain study through computational models. The scale and complexity of such models are constantly increasing, leading to challenging computational requirements. Even though modern HPC platforms can often deal with such challenges, the vast div...
Article
In this paper we take a look at what the Intel Xeon Processor 7500 family, code named Nehalem-EX, brings to high performance computing. We compare two families of Intel Xeon-based systems (Intel Xeon 7500 and Intel Xeon 5600) and present a performance evolution of 16-node clusters based on these CPUs. We compare CPU generations utilizing dual socket pl...

Citations

Chapter
IHK/McKernel is a lightweight multi-kernel operating system that is designed for extreme-scale HPC systems. The basic idea of IHK/McKernel is to run Linux and a lightweight kernel (LWK) side by side on each compute node to provide both LWK scalability and full Linux compatibility. IHK/McKernel is one of the first multi-kernels that has been evaluated at large scale and that has demonstrated the advantages of the multi-kernel approach. This chapter describes the architecture of IHK/McKernel, provides insights into some of its unique features, and demonstrates through experiments its ability to outperform Linux. We also discuss our experiences and lessons learned so far.
Article
Lock-protected critical sections greatly limit the concurrency of multi-threaded applications. Prior lock-elision-based techniques exploit the parallelism between critical sections accessing disjoint shared data, but still fail to notice and expose the high degree of concurrency between critical sections that contend for the same shared data, i.e., conflicting critical sections (CCS). This paper focuses on exploiting CCS parallelism. The key insight of this work is that, for each running CCS, a large proportion (>73.4%) of the parallelism between CCSs can be exploited by allowing parallel execution of their respective code fragments, ranging from the beginning to the first conflict point at runtime. We develop this insight into a new microarchitecture, called BSOptimizer, which performs partial reversion integrated with a series of sophisticated hardware and software strategies for CCS parallelization. We complement the off-the-shelf cache coherency protocol to perceive the conflict location of a CCS, present a predictive checkpoint mechanism to register and predict the concerned conflict point in a lightweight and accurate fashion, and redefine the traditional mutual exclusion semantics with a binary relationship, on the basis of which each CCS can be scheduled in parallel as expected. Our experimental results on a wide variety of real programs and PARSEC benchmarks show that, compared to native execution and two state-of-the-art strategies (SLE and SLR), BSOptimizer dramatically improves the performance of programs with slight extra energy consumption (<0.8%) and runtime overhead (<3.9%). Our verified evaluation on a micro-benchmark with software-based optimization also demonstrates the accuracy of BSOptimizer in exploiting CCS parallelism.
Conference Paper
Lightweight multi-kernel architectures, where HPC specialized lightweight kernels (LWKs) run side-by-side with Linux on compute nodes, have received a great deal of attention recently due to their potential for addressing many of the challenges system software faces as we move towards exascale and beyond. LWKs in multi-kernels implement only a limited set of kernel functionality and the rest is supported by Linux, for example, device drivers for high-performance interconnects. While most of the operations of modern high-performance interconnects are driven entirely by user-space, memory registration for remote direct memory access (RDMA) usually involves interaction with the Linux device driver and thus comes at the price of service offloading. In this paper we introduce various optimizations for multi-kernel LWKs to eliminate the memory registration cost. In particular, we propose a safe RDMA pre-registration mechanism combined with lazy memory unmapping in the LWK. We demonstrate up to two orders of magnitude improvement in RDMA registration latency and up to 15% improvement on MPI_Allreduce() for large message sizes.
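The registration cost this work targets stems from each memory registration crossing into the Linux driver. For context only, the sketch below shows a conventional user-space registration cache built on the standard verbs calls ibv_reg_mr()/ibv_dereg_mr(); this is a generic, commonly used technique, not the paper's LWK-side pre-registration, and the reg_cache structure and get_mr() helper are illustrative assumptions.

```c
#include <infiniband/verbs.h>
#include <stddef.h>

/* A minimal single-entry registration cache (illustrative only).
 * Zero-initialize before first use. */
struct reg_cache {
    void          *addr;
    size_t         length;
    struct ibv_mr *mr;
};

/*
 * Return a memory region covering [addr, addr+length), calling into
 * the driver only when the cached entry does not already cover it.
 * Returns NULL if registration fails.
 */
static struct ibv_mr *get_mr(struct reg_cache *c, struct ibv_pd *pd,
                             void *addr, size_t length)
{
    if (c->mr &&
        (char *)addr >= (char *)c->addr &&
        (char *)addr + length <= (char *)c->addr + c->length)
        return c->mr;                    /* cache hit: no driver call */

    if (c->mr)
        ibv_dereg_mr(c->mr);             /* evict the stale entry */

    c->mr = ibv_reg_mr(pd, addr, length,
                       IBV_ACCESS_LOCAL_WRITE |
                       IBV_ACCESS_REMOTE_READ |
                       IBV_ACCESS_REMOTE_WRITE);
    c->addr = addr;
    c->length = length;
    return c->mr;
}
```

Such caches become unsafe when the application unmaps and reuses pages behind the cache's back, which is precisely the hazard the proposed safe pre-registration combined with lazy memory unmapping is designed to avoid.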
Conference Paper
In HPC, two trends have led to the emergence and popularity of an operating-system approach in which multiple kernels are run simultaneously on each compute node. The first trend has been the increase in complexity of the HPC software environment, which has placed the traditional HPC kernel approaches under stress. Meanwhile, microprocessors with more and more cores are being produced, allowing specialization within a node. As is typical in an emerging field, different groups are considering many different approaches to deploying multi-kernels. In this paper we identify and describe a number of ongoing HPC multi-kernel efforts. Given the increasing number of choices for implementing and providing compute node kernel functionality, users and system designers will find value in understanding the differences among the kernels (and among the perspectives) of the different multi-kernel efforts. To that end, we provide a survey of approaches and qualitatively compare and contrast the alternatives. We identify a series of criteria that characterize the salient differences among the approaches, providing users and system designers with a common language for discussing the features of a design that are relevant for them. In addition to the set of criteria for characterizing multi-kernel architectures, the paper contributes a classification of current multi-kernel projects according to those criteria.
Conference Paper
Lightweight kernels (LWK) have been in use on the compute nodes of supercomputers for decades. Although many high-end systems now run Linux, interest in options and alternatives has increased in the last couple of years. Future extreme-scale systems require rethinking of the operating system, and modern LWKs may well play a role in the final solution. In the course of our research, it has become clear that no single definition for a lightweight kernel exists. This paper describes what we mean by the term and what makes LWKs different from other operating system kernels.
Conference Paper
As system sizes increase to exascale and beyond, there is a need to enhance the system software to meet the needs and challenges of applications. The evolutionary versus revolutionary debate can be set aside by providing system software that simultaneously supports existing and new programming models. The seemingly contradictory requirements of scalable performance and traditional rich programming APIs (POSIX, and Linux in particular) suggest that approach, and have led to a new class of research. Traditionally, operating systems for extreme-scale computing have followed two approaches: they have either started with a full-weight kernel (FWK), typically Linux, and removed features that were impeding performance and scalability, or they started with a light-weight kernel (LWK) and added capability to provide Linux compatibility. Neither of these approaches succeeds in retaining full Linux compatibility and achieving high scalability. To overcome this problem, we have been exploring the design space of providing LWK performance while retaining the Linux APIs and Linux environment. Our hybrid solution is to run Linux and an LWK side-by-side on the same node. HPC applications execute on top of the LWK, but the system selectively provides OS features by leveraging the Linux kernel. In this paper, we discuss two possible methods of achieving the symbiosis between the two kernels and the trade-offs between them. Specifically, we detail and contrast two particular approaches, Intel's mOS project and IHK/McKernel, an effort led by RIKEN Advanced Institute for Computational Science.
Conference Paper
Bloom filters are widely used in databases and networking. These filters facilitate efficient membership checking with a low false positive ratio, and parallel processing is one way to improve their throughput. Common many-core processors such as the Xeon Phi can provide high parallelism. We therefore build an iterative model to analyze memory access performance; the model suggests that the bottleneck in the traditional design is mainly caused by synchronization cost and memory latency on many-core platforms. To address this, we propose a parallel bloom filter (PBF), a lockless method involving input data preprocessing, which reduces synchronization overhead and improves cache locality. We implement and evaluate PBF on a Xeon Phi processor. Results show that its memory access performance is three times better than that of a counting bloom filter. PBF provides improved scalability, and its speedup can reach a maximum of 80.7x.
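For readers unfamiliar with the data structure, the sketch below shows the basic add and membership-check operations that such parallel designs build on. It is a generic, minimal bloom filter, not the PBF design described above; the filter size, hash count, and the FNV-1a hash are arbitrary choices for illustration.

```c
#include <stdint.h>
#include <stddef.h>

#define FILTER_BITS (1u << 20)   /* 1 Mi-bit filter (arbitrary size) */
#define NUM_HASHES  4            /* number of hash functions (arbitrary) */

static uint8_t filter[FILTER_BITS / 8];

/* FNV-1a hash, salted per hash-function index, reduced to a bit index. */
static uint32_t hash(const void *key, size_t len, uint32_t seed)
{
    const uint8_t *p = key;
    uint32_t h = 2166136261u ^ seed;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h % FILTER_BITS;
}

/* Set the NUM_HASHES bits corresponding to the key. */
static void bloom_add(const void *key, size_t len)
{
    for (uint32_t i = 0; i < NUM_HASHES; i++) {
        uint32_t bit = hash(key, len, i);
        filter[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* Returns 0 if definitely absent; 1 if possibly present (false positives). */
static int bloom_query(const void *key, size_t len)
{
    for (uint32_t i = 0; i < NUM_HASHES; i++) {
        uint32_t bit = hash(key, len, i);
        if (!(filter[bit / 8] & (1u << (bit % 8))))
            return 0;
    }
    return 1;
}
```

The synchronization cost the paper analyzes arises when many threads set and test bits of one shared filter concurrently; PBF reduces it by preprocessing the input so that threads can proceed locklessly with better cache locality.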
Conference Paper
Modern CPU architectures provide a large number of processing cores and application programmers are increasingly looking at hybrid programming models, where multiple threads of a single process interact with the MPI library simultaneously. Moreover, recent high-speed interconnection networks are being designed with capabilities targeting communication explicitly from multiple processor cores. As a result, scalability of the MPI library, so that multithreaded applications can efficiently drive independent network communication, has become a major concern. In this work, we propose a novel operating system level concept called the thread private shared library (TPSL), which enables threads of a multithreaded application to see specific shared libraries in a private fashion. In contrast to address spaces in traditional operating systems, where threads of a single process refer to the exact same set of virtual-to-physical mappings, our technique relies on separate per-thread page tables. Mapping the MPI library in a thread private fashion results in per-thread MPI ranks, eliminating resource contention in the MPI library without the need for redesigning it. To demonstrate the benefits of our mechanism, we provide preliminary evaluation of various aspects of multithreaded MPI processing through micro-benchmarks on two widely used MPI implementations, MPICH and MVAPICH, with only minor modifications to the libraries.
Article
Historically, GPU-based HPC applications have had a substantial memory bandwidth advantage over CPU-based workloads due to using GDDR rather than DDR memory. However, past GPUs required a restricted programming model where application data was allocated up front and explicitly copied into GPU memory by the programmer before launching a GPU kernel. Recently, GPUs have eased this requirement and can now employ on-demand software page migration between CPU and GPU memory to obviate explicit copying. In the near future, CC-NUMA GPU-CPU systems will appear where software page migration is an optional choice and hardware cache-coherence can also support the GPU accessing CPU memory directly. In this work, we describe the trade-offs and considerations in relying on hardware cache-coherence mechanisms versus using software page migration to optimize the performance of memory-intensive GPU workloads. We show that page migration decisions based on page access frequency alone are a poor solution and that a broader solution using virtual address-based program locality to enable aggressive memory prefetching combined with bandwidth balancing is required to maximize performance. We present a software runtime system requiring minimal hardware support that, on average, outperforms CC-NUMA-based accesses by 1.95x, performs 6% better than the legacy CPU to GPU memcpy regime by intelligently using both CPU and GPU memory bandwidth, and comes within 28% of oracular page placement, all while maintaining the relaxed memory semantics of modern GPUs.