Home
Université Paris-Sud 11
Laboratoire de Recherche en Informatique
Khaled Hamidouche

Khaled Hamidouche
Université Paris-Sud 11 | Paris 11 · Laboratoire de Recherche en Informatique

Post-Doc, The Ohio State Unive

About

Publications

9,043

Reads

904

Citations

Publications

GPU initiated OpenSHMEM: correct and efficient intra-kernel networking for dGPUs

Conference Paper

Feb 2020

ComP-net: command processor networking for efficient intra-kernel communications on GPUs

Conference Paper

Nov 2018

Current state-of-the-art in GPU networking advocates a host-centric model that reduces performance and increases code complexity. Recently, researchers have explored several techniques for networking within a GPU kernel itself. These approaches, however, suffer from high latency, waste energy on the host, and are not scalable with larger/more GPUs...

Kernel-Assisted Communication Engine for MPI on Emerging Manycore Processors

Conference Paper

Dec 2017

GPU triggered networking for intra-kernel communications

Conference Paper

Nov 2017

GPUs are widespread across clusters of compute nodes due to their attractive performance for data parallel codes. However, communicating between GPUs across the cluster is cumbersome when compared to CPU networking implementations. A number of recent works have enabled GPUs to more naturally access the network, but suffer from performance problems,...

MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling

Conference Paper

Aug 2017

S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters

Conference Paper

Jan 2017

Availability of large data sets like ImageNet and massively parallel computation support in modern HPC devices like NVIDIA GPUs have fueled a renewed interest in Deep Learning (DL) algorithms. This has triggered the development of DL frameworks like Caffe, Torch, TensorFlow, and CNTK. However, most DL frameworks have been limited to a single node....

Enabling Performance Efficient Runtime Support for Hybrid MPI+UPC++ Programming Models

Conference Paper

Dec 2016

Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters

Conference Paper

Dec 2016

Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA

Conference Paper

Dec 2016

CUDA M3: Designing Efficient CUDA Managed Memory-Aware MPI by Exploiting GDR and IPC

Conference Paper

Dec 2016

Efficient Reliability Support for Hardware Multicast-Based Broadcast in GPU-enabled Streaming Applications

Conference Paper

Nov 2016

OpenSHMEM Non-blocking Data Movement Operations with MVAPICH2-X: Early Experiences

Conference Paper

Nov 2016

Designing MPI Library with On-Demand Paging (ODP) of InfiniBand: Challenges and Benefits

Conference Paper

Nov 2016

Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters

Conference Paper

Oct 2016

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning

Conference Paper

Full-text available

Sep 2016

Emerging paradigms like High Performance Data Analytics (HPDA) and Deep Learning (DL) pose at least two new design challenges for existing MPI runtimes. First, these paradigms require an efficient support for communicating unusually large messages across processes. And second, the communication buffers used by HPDA applications and DL frameworks ge...

INAM2: InfiniBand Network Analysis and Monitoring with MPI

Conference Paper

Jun 2016

Modern high-end computing is being driven by the tight integration of several hardware and software components. On the hardware front, there are the multi-/many-core architectures (including accelerators and co-processors) and high-end interconnects like InfiniBand that are continually pushing the envelope of raw performance. On the software side,...

CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters

Article

May 2016

GPUDirect RDMA (GDR) brings the high-performance communication capabilities of RDMA networks like InfiniBand (IB) to GPUs. It enables IB network adapters to directly write/read data to/from GPU memory. Partitioned Global Address Space (PGAS) programming models, such as OpenSHMEM, provide an attractive approach for developing scientific applications...

CUDA Kernel Based Collective Reduction Operations on Large-scale GPU Clusters

Conference Paper

May 2016

Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-Enabled Systems

Conference Paper

May 2016

Designing high performance communication runtime for GPU managed memory: early experiences

Conference Paper

Mar 2016

Graphics Processing Units (GPUs) have gained the position of a main stream accelerator due to its low power footprint and massive parallelism. CUDA 6.0 onward, NVIDIA has introduced the Managed Memory capability which unifies the host and device memory allocations into a single allocation and removes the requirement for explicit memory transfers be...

Accelerating k-NN Algorithm with Hybrid MPI and OpenSHMEM

Conference Paper

Full-text available

Dec 2015

Machine learning algorithms are benefiting from the continuous improvement of programming models, including MPI, MapReduce and PGAS. k-Nearest Neighbors (k-NN) algorithm is a widely used machine learning algorithm, applied to supervised learning tasks such as classification. Several parallel implementations of k-NN have been proposed in the literat...

Scalable Out-of-core OpenSHMEM Library for HPC

Conference Paper

Full-text available

Dec 2015

Many HPC applications have memory requirements that exceed the typical memory available on the compute nodes. While many HPC installations have resources with very large memory installed, a more portable solution for those applications is to implement an out-of-core method. This out-of-core mechanism offloads part of the data typically onto disk wh...

A Case for Non-blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X

Conference Paper

Dec 2015

An ever increased push for performance in the HPC arena has led to a multitude of hybrid architectures in both software and hardware for HPC systems. Partitioned Global Address Space (PGAS) programming model has gained a lot of attention over the last couple of years. The main advantage of PGAS model is the ease of programming provided by the abstr...

High Performance OpenSHMEM Strided Communication Support with InfiniBand UMR

Conference Paper

Full-text available

Dec 2015

Offloaded GPU Collectives Using CORE-Direct and CUDA Capabilities on InfiniBand Clusters

Conference Paper

Dec 2015

A case for application-oblivious energy-efficient MPI runtime

Conference Paper

Nov 2015

Power has become a major impediment in designing large scale high-end systems. Message Passing Interface (MPI) is the de facto communication interface used as the back-end for designing applications, programming models and runtime for these systems. Slack --- the time spent by an MPI process in a single MPI call---provides a potential for energy an...

GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks

Conference Paper

Sep 2015

As we move towards efficient exascale systems, heterogeneous accelerators like NVIDIA GPUs are becoming a significant compute component of modern HPC clusters. It has become important to utilize every single cycle of every compute device available in the system. From NICs to GPUs to Co-processors, heterogeneous compute resources are the way to move...

Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters

Conference Paper

Sep 2015

High Performance MPI Datatype Support with User-Mode Memory Registration: Challenges, Designs, and Benefits

Conference Paper

Sep 2015

High-Performance and Scalable Design of MPI-3 RMA on Xeon Phi Clusters

Conference Paper

Full-text available

Aug 2015

Intel Many Integrated Core (MIC) architectures have been playing a key role in modern supercomputing systems due to the features of high performance and low power consumption. This makes them become an attractive choice to accelerate HPC applications. MPI-3 RMA is an important part of the MPI-3 standard. It provides one-sided semantics that reduce...

Impact of InfiniBand DC Transport Protocol on Energy Consumption of All-to-All Collective Algorithms

Conference Paper

Aug 2015

Designing Non-blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters

Conference Paper

Jul 2015

Several techniques have been proposed in the past for designing non-blocking collective operations on high-performance clusters. While some of them required a dedicated process/thread or periodic probing to progress the collective others needed specialized hardware solutions. The former technique, while applicable to any generic HPC cluster, had th...

Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters

Article

Jul 2015

Checkpoint-restart is a predominantly used reactive fault-tolerance mechanism for applications running on HPC systems. While there are innumerable studies in literature that have analyzed, and optimized for, the performance and scalability of a variety of check pointing protocols, not much research has been done from an energy or power perspective....

A high performance broadcast design with hardware multicast and GPUDirect RDMA for streaming applications on Infiniband clusters

Article

Jun 2015

Several streaming applications in the field of high performance computing are obtaining significant speedups in execution time by leveraging the raw compute power offered by modern GPGPUs. This raw compute power, coupled with the high network throughput offered by high performance interconnects such as InfiniBand (IB) are allowing streaming applica...

Designing efficient small message transfer mechanism for inter-node MPI communication on InfiniBand GPU clusters

Article

Jun 2015

Increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement on GPU clusters continues to be the major bottleneck that keeps scientific applications from fully harnessing the potential of GPUs. Earlier, GPU-GPU inter-node communication has to move data from GPU memory to host memory be...

High-Performance Coarray Fortran Support with MVAPICH2-X: Initial Experience and Evaluation

Conference Paper

Full-text available

May 2015

Scalable Graph500 design with MPI-3 RMA

Article

Nov 2014

The MPI two-sided programming model has been widely used for scientific applications. However, the benefits of MPI one-sided communication are still not well exploited. Recently, MPI-3 Remote Memory Access (RMA) was introduced with several advanced features which provide better performance, programmability, and flexibility over MPI-2 RMA. However,...

High performance OpenSHMEM for Xeon Phi clusters: Extensions, runtime designs and application co-design

Article

Nov 2014

Intel Many Integrated Core (MIC) architectures are becoming an integral part of modern supercomputer architectures due to their high compute density and performance per watt. Partitioned Global Address Space (PGAS) programming models, such as OpenSHMEM, provide an attractive approach for developing scientific applications with irregular communicati...

Scalable MiniMD design with hybrid MPI and OpenSHMEM

Article

Full-text available

Oct 2014

The MPI programming model has been widely used for scientific applications. The emergence of Partitioned Global Address Space (PGAS) programming models presents an alternative approach to improve programmability. With the global data view and lightweight communication operations, PGAS has the potential to increase the performance of scientific appl...

Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models

Article

Oct 2014

While Hadoop holds the current Sort Benchmark record, previous research has shown that MPI-based solutions can deliver similar performance. However, most existing MPI-based designs rely on two-sided communication semantics. The emerging Partitioned Global Address Space (PGAS) programming model presents a flexible way to express parallelism for data...

Understanding the Memory-Utilization of MPI Libraries

Conference Paper

Sep 2014

The MPI Tools information interface (MPI_T), introduced as part of MPI 3.0 standard, has been gaining momentum in both the MPI and performance tools communities. In this paper, we investigate the challenges involved in profiling the memory utilization characteristics of MPI libraries that can be exposed to tools and libraries leveraging the MPI_T i...

HAND: A hybrid approach to accelerate non-contiguous data movement using MPI datatypes on GPU clusters

Conference Paper

Sep 2014

An increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement continues to be the major bottleneck on GPU clusters, more so when data is non-contiguous, which is common in scientific applications. The existing techniques of optimizing MPI data type processing, to improve performance...

MIC-Check

Conference Paper

Jun 2014

The advent of many-core architectures like Intel MIC is enabling the design of increasingly capable supercomputers within reasonable power budgets. Fault-tolerance is becoming more important with the increased number of components and the complexity in these heterogeneous clusters. Checkpoint-restart mechanisms have been traditionally used to enhan...

Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences

Article

Jun 2014

The Dynamic Connected (DC) InfiniBand transport protocol has recently been introduced by Mellanox to address several shortcomings of the older Reliable Connection (RC), eXtended Reliable Connection (XRC), and Unreliable Datagram (UD) transport protocols. DC aims to support all of the features provided by RC - such as RDMA, atomics, and hardware rel...

Optimizing Collective Communication in UPC

Conference Paper

May 2014

Message Passing Interface (MPI) has been the defacto programming model for scientific parallel applications. However, data driven applications with irregular communication patterns are harder to implement using MPI. The Partitioned Global Address Space (PGAS) programming models present an alternative approach to improve programmability. PGAS langua...

High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters

Conference Paper

May 2014

Intel's Many-Integrated-Core (MIC) architecture aims to provide Teraflop throughput (through high degrees of parallelism) with a high FLOP/Watt ratio and x86 compatibility. However, this two-fold approach to solving power and programmability challenges for Exascale computing is constrained by certain architectural idiosyncrasies. MIC coprocessors h...

Initial study of multi-endpoint runtime for MPI+OpenMP hybrid programming model on multi-core systems

Conference Paper

Feb 2014

State-of-the-art MPI libraries rely on locks to guarantee thread-safety. This discourages application developers from using multiple threads to perform MPI operations. In this paper, we propose a high performance, lock-free multi-endpoint MPI runtime, which can achieve up to 40\% improvement for point-to-point operation and one representative colle...

MVAPICH-PRISM

Conference Paper

Nov 2013

Xeon Phi, based on the Intel Many Integrated Core (MIC) architecture, packs up to 1TFLOPs of performance on a single chip while providing x86__64 compatibility. On the other hand, InfiniBand is one of the most popular choices of interconnect for supercomputing systems. The software stack on Xeon Phi allows processes to directly access an InfiniBand...

Efficient inter-Node MPI communication using GPUDirect RDMA for InfiniBand clusters with NVIDIA GPUs

Conference Paper

Oct 2013

GPUs and accelerators have become ubiquitous in modern supercomputing systems. Scientific applications from a wide range of fields are being modified to take advantage of their compute power. However, data movement continues to be a critical bottleneck in harnessing the full potential of a GPU. Data in the GPU memory has to be moved into the host m...

Efficient and truly passive MPI-3 RMA using InfiniBand atomics

Conference Paper

Sep 2013

Multi/many-core architectures offer high compute density on modern supercomputing clusters. It is critical for applications to minimize communication and synchronization overheads to achieve peak performance. MPI offers one-sided communication semantics that are aimed at enabling this. In this paper, we propose a novel design for implementing truly...

A scalable and portable approach to accelerate hybrid HPL on heterogeneous CPU-GPU clusters

Conference Paper

Sep 2013

Accelerating High-Performance Linkpack (HPL) on heterogeneous clusters with multi-core CPUs and GPUs has attracted a lot of attention from the High Performance Computing community. It is becoming common for large scale clusters to have GPUs on only a subset of nodes in order to limit system costs. The major challenge for HPL in this case is to effi...

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters

Conference Paper

Aug 2013

The emergence of co-processors such as Intel Many Integrated Cores (MICs) is changing the landscape of supercomputing. The MIC is a memory constrained environment and its processors also operate at slower clock rates. Furthermore, the communication characteristics between MIC processes are also different compared to communication between host proce...

MVAPICH2-MIC: A high performance MPI library for Xeon Phi clusters with infiniband

Conference Paper

Aug 2013

Intel's Xeon Phi coprocessor, based on Many Integrated Core architecture, packs more than 1TFLOP of performance on a single chip and offers x86 compatibility. While MPI libraries can run out-of-the-box on the Xeon Phi coprocessors, it is critical to tune them for the new architecture and to redesign them using any new system level features offered...

MIC-RO: Enabling efficient remote offload on heterogeneous many integrated core (MIC) clusters with InfiniBand

Conference Paper

Jun 2013

Xeon Phi, the latestMany Integrated Core (MIC) co-processor from Intel, packs up to 1 TFLOP of double precision performance in a single chip while providing x86 compatibility and supporting popular programming models like MPI and OpenMP. One of the easiest way to take advantage of the MIC is to use compiler directives to offoad appropriate compute...

Parallel Smith-Waterman Comparison on Multicore and Manycore Computing Platforms with BSP++

Article

Feb 2013

Biological Sequence Comparison is an important operation in Bioinformatics that is often used to relate organisms. Smith and Waterman proposed an exact algorithm that compares two sequences in quadratic time and space. Due to high computing power and memory requirements, SW is usually executed on High Performance Computing (HPC) platforms such as m...

Parallel Biological Sequence Comparison on Heterogeneous High Performance Computing Platforms with BSP++

Article

Full-text available

Oct 2011

Biological Sequence Comparison is an important operation in Bioinformatics that is often used to relate organisms. Smith and Waterman proposed an exact algorithm (SW) that compares two sequences in quadratic time and space. Due to high computing and memory requirements, SW is usually executed on HPC platforms such as multicore clusters and CellBEs....

A framework for an automatic hybrid MPI+OpenMP code generation.

Conference Paper

Full-text available

Jan 2011

Three High Performance Architectures in the Parallel APMC Boat

Conference Paper

Full-text available

Nov 2010

Approximate probabilistic model checking, and more generally sampling based model checking methods, proceed by drawing independent executions of a given model and by checking a temporal formula on these executions. In theory, these methods can be easily massively parallelized, but in practice one has to consider, for this purpose, important aspects...

Hybrid bulk synchronous parallelism library for clustered SMP architectures

Article

Full-text available

Sep 2010

This paper presents the design and implementation of BSP++, a C++ parallel programming library based on the Bulk Synchronous Parallelism model to perform high performance computing on both SMP and SPMD architectures using OpenMPI and MPI. We show how C++ support for genericity provides a functional and intuitive user interface which still delivers...

The Harris algorithm revisited on the CELL processor

Article

Full-text available

Jan 2010

Highly responsive implementations of corner detection is re-ally expected as it is a key ingredient for other image pro-cessing kernels like the motion detection. Indeed, motion detection requires the analysis of a continuous flow of im-ages, thus a real-time processing implies the use of highly optimized subroutines. We consider a tiled implementa...

Comparaison de MPI, OpenMP et MPI+OpenMP sur un noeud multiprocesseur multicoeurs AMD à mémoire partagée

Article

Full-text available

Résumé La majorité des architectures parallèles sont organisées comme un cluster de noeuds de multi-coeurs à mémoire partagée (mcSMN). Les statistiques montrent que majorité des tâches exécutées sur ces pla-teformes utilisent un seul noeud. Si on restreint l'investigation aux environnements et modèles de pro-grammation parallèles les plus connus, i...