Figure 3 - uploaded by Didier El Baz
Data exchange between CPU and GPU 

Source publication
Article
Full-text available
A hybrid implementation via CUDA of a branch and bound method for knapsack problems is proposed. Branch and bound computations can be carried out either on the CPU or on the GPU according to the size of the branch and bound list, i.e. the number of nodes. Tests are carried out on a Tesla C2050 GPU. A first series of computational results showing a su...

Contexts in source publication

Context 1
... a separate memory transaction is issued for each thread. This significantly degrades the overall processing time. Reference is made to [17] for further details on NVIDIA GPU and computing system architectures and how to optimize the code. In this paper, we study the parallel implementation of branch and bound methods on a Tesla C2050 GPU. Bound computation is particularly time consuming in a branch and bound algorithm. Nevertheless, this task can be efficiently parallelized on the GPU. The main tasks of our parallel algorithm are presented below (see also Figure ...
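The bound phase described above is embarrassingly data-parallel: each node's upper bound depends only on that node's own state. A minimal CPU-side sketch in Python, where the names (`Node`, `dantzig_upper_bound`, `bound_all`) are illustrative and the classic fractional Dantzig bound stands in for whatever bound the paper actually computes:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Node:
    level: int    # index of the next item to branch on
    profit: int   # profit accumulated along the path to this node
    weight: int   # weight accumulated along the path to this node

def dantzig_upper_bound(node: Node, profits: List[int],
                        weights: List[int], capacity: int) -> float:
    """Fractional (Dantzig) bound; items assumed sorted by
    decreasing profit/weight ratio."""
    remaining = capacity - node.weight
    bound = float(node.profit)
    for p, w in zip(profits[node.level:], weights[node.level:]):
        if w <= remaining:
            remaining -= w
            bound += p
        else:
            bound += p * remaining / w  # take a fraction of the last item
            break
    return bound

def bound_all(nodes, profits, weights, capacity):
    # On a GPU this loop is the kernel: one thread per node, no
    # dependencies between iterations.
    return [dantzig_upper_bound(n, profits, weights, capacity) for n in nodes]
```

Because the iterations are independent, mapping one thread to one node is straightforward; the coalescing concern above is about how the node arrays are laid out in global memory.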
Context 2
... best lower bound L̄ is obtained on the GPU via a reduction method making use of the atomic instruction atomicMax applied to the table of lower bounds (see [8]). If the size of the list is small, then it is not efficient to launch the branch and bound computation kernels on the GPU, since GPU occupancy would be very low and computations on the GPU would not cover communications between CPU and GPU. This is the reason why the branch and bound algorithm is implemented on the CPU in this particular context. We note that for a given problem, the branch and bound computation phases can be carried out several times on the CPU according to the result of the pruning procedure. In this study, GPU kernels are launched only when the size of the list, denoted by q, is greater than 5000 nodes (see Figure 3). We have noticed that this condition generally ensures 100% occupancy of the GPU. This procedure starts after the list of nodes and the best lower bound L̄ have been transferred from the GPU to the CPU. The size q of the list is updated to twice its original value, then the list is processed from e = 1 to q. A node e is considered non-promising (NP) if U_e ≤ L̄, otherwise it is promising (P) (see Figure 4). Non-promising states are replaced via an iterative procedure that starts from the beginning of the list. The different steps of the procedure that replaces a non-promising node with index l by a promising node are presented ...
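The pruning step described in this context can be sketched as a two-pointer, in-place compaction; the names, and the exact order in which promising nodes refill non-promising slots, are assumptions for illustration, not the authors' code:

```python
GPU_THRESHOLD = 5000  # kernels are launched only above this list size

def compact(nodes, upper_bounds, l_bar):
    """In-place removal of non-promising nodes (U_e <= l_bar): a
    non-promising slot found while scanning from the front is refilled
    with a promising node taken from the back of the list.
    Returns the new list size q."""
    front, back = 0, len(nodes) - 1
    while front <= back:
        if upper_bounds[front] > l_bar:        # promising: keep in place
            front += 1
        elif upper_bounds[back] <= l_bar:      # non-promising tail: drop
            back -= 1
        else:                                  # refill front slot from back
            nodes[front], upper_bounds[front] = nodes[back], upper_bounds[back]
            front += 1
            back -= 1
    return front

def launch_on_gpu(q: int) -> bool:
    # the size test described in the text: GPU kernels only for large lists
    return q > GPU_THRESHOLD
```

After compaction the surviving q nodes sit contiguously at the head of the list, which is what keeps subsequent GPU memory accesses coalesced.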

Similar publications

Preprint
Full-text available
We consider the summation of large sets of floating-point numbers on hybrid CPU-GPU platforms using MPRES, a new software library for multiple-precision computations on CPUs and CUDA compatible GPUs. This library uses an RNS-based floating-point representation, in accordance with which the multiple-precision significands are represented in a residu...
Conference Paper
Full-text available
We consider the summation of large sets of floating-point numbers on hybrid CPU-GPU platforms using MPRES, a new software library for multiple-precision computations on CPUs and CUDA compatible GPUs. This library uses an RNS-based floating-point representation, in accordance with which the multiple-precision significands are represented in a residu...
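For readers unfamiliar with residue number systems, the RNS idea mentioned in the two abstracts above can be illustrated in a few lines of Python; the moduli below are arbitrary examples, not those used by MPRES:

```python
from math import prod

MODULI = (13, 17, 19, 23)  # pairwise coprime; dynamic range M = 13*17*19*23

def to_rns(x: int):
    return tuple(x % m for m in MODULI)

def rns_add(a, b):
    # channel-wise and carry-free, hence easy to parallelize
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, MODULI))

def rns_mul(a, b):
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))

def from_rns(r) -> int:
    """Chinese Remainder Theorem reconstruction."""
    M = prod(MODULI)
    return sum(ri * (M // mi) * pow(M // mi, -1, mi)
               for ri, mi in zip(r, MODULI)) % M
```

The appeal for GPUs is that each residue channel is independent, so the significand arithmetic decomposes into small, uniform, carry-free operations.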
Article
Full-text available
Nowadays, data has been playing an indispensable role in almost all industrial areas. Data integrity and security over the Internet, other types of media, and applications have become major concerns in the computing world. If confidential or sensitive data is forged, tampered with, or wiretapped by an attacker, capital losses might occur. Encryption is one of...
Article
Full-text available
Thinning is one of the most important techniques in the field of image processing. It is applied to erode the image of an object layer by layer until a skeleton is left. Several thinning algorithms for obtaining the skeleton of a binary image have already been proposed in the literature. This paper investigates several well-known parallel thinning algori...
Article
Full-text available
Based on the finite element method (FEM) in the frequency domain and the particle-in-cell approach in the time domain, a hybrid-domain multipactor threshold prediction algorithm is proposed in this paper. The proposed algorithm combines the advantages of frequency-domain and time-domain algorithms in terms of high computational accu...

Citations

... Many heuristic algorithms based on search optimization have been proposed to find near-optimal solutions for the knapsack problem in polynomial time, such as the firefly algorithm [7] and the genetic algorithm [25]. In particular, many attempts have been made to implement and optimize branch-and-bound (BB) algorithms [2,4,13,27,28] on GPUs. BB algorithms are well suited to GPUs because the traversal paths can be broken into independent subproblems and processed in parallel. ...
Article
Full-text available
This work aims to improve GPU performance for solving the 0/1 knapsack problem, a well-known combinatorial optimization problem found in many practical applications, including cryptography, financial decision making, electronic design automation, computing resource management, etc. The knapsack problem is NP-hard, but it can be solved efficiently by dynamic programming (DP) algorithms in pseudo-polynomial runtime. DP knapsack algorithms on GPUs have been presented before. However, as the modern GPU architecture provides much higher computing throughput than memory bandwidth, previous work is bounded by data access time on GPU memory: its CGMA (Compute to Global Memory Access) ratio is 1, which means every computing operation involves one memory access on average. To address this problem, this paper proposes an approach based on the Multi-Class 0/1 Knapsack Problem (MCKP), in which items can be classified into groups with equal values or weights. By reconstructing the DP equations for solving MCKP, it is possible to exploit data parallelism and data reusability across threads. This makes it possible to optimize the computation across iterations (i.e., items) and to improve the CGMA ratio 5-fold by using GPU shared memory and registers for reused data. We extensively analyze the performance of our approach on two modern GPU models, the NVIDIA Tesla V100 and RTX 3070. Compared to the runtime of previous work, our approach achieves up to 8x and 18x speedup on the V100 and RTX 3070, respectively, the latter being a GPU with lower memory bandwidth. In addition, by comparing the two speedups, we find that computing resources are used more efficiently when memory bandwidth is limited, as on the RTX 3070.
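For context, the pseudo-polynomial DP that this line of work builds on is the textbook recurrence sketched below (a generic illustration, not the MCKP reformulation proposed in the cited paper):

```python
def knapsack_dp(profits, weights, capacity):
    """1-D dynamic program: dp[c] = best profit achievable within capacity c."""
    dp = [0] * (capacity + 1)
    for p, w in zip(profits, weights):
        # scan capacities downwards so each item is used at most once
        for c in range(capacity, w - 1, -1):
            dp[c] = max(dp[c], dp[c - w] + p)
    return dp[capacity]
```

In a straightforward GPU port of this loop nest, each `dp[c]` update performs roughly one arithmetic operation per global-memory access, which is the CGMA ≈ 1 behavior the abstract identifies as the bottleneck.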
... The first also includes data compression techniques, and the second addresses the use of a multi-GPU system. In [El-Baz 2012, Boukedjar et al. 2012] the branch and bound method with GPU support is described. In [Hajarian et al. 2016] the use of the Firefly algorithm, parallelized on graphics processing cards, is presented. ...
Conference Paper
Full-text available
The multidimensional knapsack problem (MKP) is a classic problem in combinatorial optimization. Although it has many practical applications, no polynomial-time algorithm is known for solving it, i.e., it belongs to the NP-hard class. This situation has led to the search for more efficient techniques for its resolution. However, even the most promising approaches cannot solve larger instances within an acceptable computational time. This has motivated the use of parallelism in its resolution and, in particular, the adoption of GPUs due to the possibility of processing large volumes of data in parallel. In this context, the present work aims to identify, through a systematic literature review, the state of the art of techniques that use GPUs to solve the MKP.
... Chakroun et al., 2013b; Herrera et al., 2017; Taoka et al., 2008; Ponz-Tienda et al., 2017; Ismail et al., 2014; Paulavicius et al., 2011; Christou and Vassilaras, 2013; McCreesh and Prosser, 2015; Eckstein et al., 2015; Carvajal et al., 2014; Borisenko et al., 2017; Gmys et al., 2017; Liu and Kao, 2013; Bak et al., 2011; Gmys et al., 2016; Silva et al., 2015; Barreto and Bauer, 2010; Vu and Derbel, 2016; Chakroun and Melab, 2015; Paulavičius and Žilinskas, 2009; Posypkin and Sigal, 2008; Chakroun et al., 2013a; Aitzai and Boudhar, 2013; Ozden et al., 2017; Cauley et al., 2011; Xu et al., 2009; Aldasoro et al., 2017; Pages-Bernaus et al., 2015; Lubin et al., 2013; Adel et al., 2016; Borisenko et al., 2011; Boukedjar et al., 2012; Carneiro et al., 2011; Galea and Le Cun, 2011; Herrera et al., 2013; Sanjuan-Estrada et al., 2011). Dynamic programming (Dias et al., 2013; Aldasoro et al., 2015; Maleki et al., 2016; Tan et al., 2009; Stivala et al., 2010; Boyer et al., 2012; Boschetti et al., 2016; Kumar et al., 2011; Rashid et al., 2010; Tran, 2010). Interior point method (Huebner et al., 2017; Hong et al., 2010; Lubin et al., 2012; Lucka et al., 2008). Problem-specific exact algorithms (Li et al., 2015; Rossbory and Reisner, 2013; Kollias et al., 2014; Bozdag et al., 2008). ...
Article
Solving optimization problems with parallel algorithms has a long tradition in OR. Its future relevance for solving hard optimization problems in many fields, including finance, logistics, production and design, is leveraged through the increasing availability of powerful computing capabilities. Acknowledging the existence of several literature reviews on parallel optimization, we did not find reviews that cover the most recent literature on the parallelization of both exact and (meta)heuristic methods. However, in the past decade substantial advancements in parallel computing capabilities have been achieved and used by OR scholars, so that an overview of modern parallel optimization in OR that accounts for these advancements is beneficial. Another issue with previous reviews results from their adoption of different foci, so that the concepts used to describe and structure prior literature differ. This heterogeneity is accompanied by a lack of unifying frameworks for parallel optimization across methodologies, application fields and problems, and it has finally led to an overall fragmented picture of what has been achieved and still needs to be done in parallel optimization in OR. This review addresses the aforementioned issues with three contributions: First, we suggest a new integrative framework of parallel computational optimization across optimization problems, algorithms and application domains. The framework integrates the perspectives of algorithmic design and computational implementation of parallel optimization. Second, we apply the framework to synthesize prior research on parallel optimization in OR, focusing on computational studies published in the period 2008-2017. Finally, we suggest research directions for parallel optimization in OR.
... The main challenges are B&B's irregular data structures that are not well suited for GPU computing and the fact that the computation/communication ratio is low. In [24], a hybrid implementation of B&B for the knapsack problem demonstrates that for small problem sizes it is not efficient to launch the B&B computation kernels on GPU. A parallel CUDA implementation in [25] makes use of data compression. ...
Conference Paper
Full-text available
We report a comparative study on the development of high-performance software for solving the problem of optimal design of multiproduct batch plants on modern parallel systems. We analyze two main algorithmic approaches to optimization - branch-and-bound and metaheuristic-based - and we develop and compare their parallel implementations on a variety of parallel architectures: multi-core CPU, GPU, and clusters. Our experiments on a real-world case study - optimization of chemical-engineering systems - demonstrate the trade-offs between the run time performance and the quality of solutions achieved by different algorithms on various parallel architectures.
... Moreover, each thread has its own registers and private local memory. (2013) and Boukedjar et al. (2012) addressed the irregularity of BnB. However, the exploration in both approaches takes longer. ...
... We believe that proposing a parallel branch-and-bound algorithm dedicated to solving the GED problem is of great interest, since the computational time will be improved. The search tree of GED is irregular (i.e., the number of tree nodes varies depending on the ability of the lower and upper bounds to prune the search tree), and thus regular parallel approaches (e.g., Boukedjar et al., 2012; Chakroun & Melab, 2013; Dorta et al., 2003) are not suitable for such a problem. ...
Article
Graph edit distance (GED) has emerged as a powerful and flexible graph matching paradigm that can be used to address different tasks in pattern recognition, machine learning, and data mining. GED is an error-tolerant graph matching problem which consists in minimizing the cost of the sequence of edit operations that transforms one graph into another. Edit operations are deletion, insertion and substitution of vertices and edges. Each vertex/edge operation has its associated cost defined in the vertex/edge cost function. Unfortunately, the GED problem is NP-hard. The question of elaborating fast and precise algorithms is of primary interest. In this paper, a parallel algorithm for exact GED computation is proposed. Our proposal is based on a branch-and-bound algorithm coupled with a load balancing strategy. Parallel threads run a branch-and-bound algorithm to explore the solution space and to discard misleading partial solutions. Meanwhile, the load balancing scheme ensures that no thread remains idle. Experiments on 4 publicly available datasets empirically demonstrated that under time constraints our proposal can drastically improve on a sequential approach and a naive parallel approach. Our proposal was compared to 6 other methods and provided more precise solutions while requiring low memory usage.
... The outline of the B&B approach is shown in Fig. 1. Many B&B approaches have successfully solved knapsack problems by exploiting data parallelism on various parallel machines such as single-instruction multiple-data (SIMD) machines [3], cluster systems [4], computational grids [5,6], and graphics processing units (GPUs) [7][8][9][10]. Among these parallel machines, the GPU [11] is a powerful accelerator device not only for graphics applications but also for compute- and memory-intensive applications [12,13]. ...
... Boukedjar et al. [8] presented a B&B approach that solves the 0-1 knapsack problem on a GPU. Their method managed subproblems in the GPU memory with arrays standing for subproblems. ...
... Therefore, CPU-GPU data transfer remains a performance bottleneck. To reduce the amount of data transferred between the CPU and GPU, Lalami et al. [9] presented an extension of [8] that restricts the transferred data to a single attribute: label data (i.e., the label array), where a label shows whether an element is active (label = 1) or passive (label = 0). They focused on the data access pattern required for compaction. ...
Conference Paper
Full-text available
In this paper, we propose an out-of-core branch and bound (B&B) method for solving the 0–1 knapsack problem on a graphics processing unit (GPU). Given a large problem that produces many subproblems, the proposed method dynamically swaps subproblems out to CPU memory. We adopt two strategies to realize this swapping-out procedure with a minimum amount of CPU-GPU data transfer. The first strategy is a GPU-based stream compaction strategy that reduces the sparseness of arrays. The second strategy is a double buffering strategy that hides the data transfer overhead by overlapping data transfer with GPU-based B&B operations. Experimental results show that the proposed method can store 33.7 times more subproblems than the previous method, solving twice as many instances on the GPU. As for the stream compaction strategy, an input-output separated scheme runs 13.1% faster than an input-output unified scheme.
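Prefix-sum-based stream compaction, the building block this abstract refers to, can be sketched as follows; the sequential Python stands in for the parallel scan, and the function name is assumed:

```python
from itertools import accumulate

def compact_stream(items, flags):
    """flags[i] == 1 keeps items[i]. An exclusive prefix sum of flags
    gives each survivor its output index, so every write is independent
    and could be issued in parallel on a GPU. Writing survivors to a
    fresh output array is the 'input-output separated' scheme."""
    positions = [0] + list(accumulate(flags))[:-1]  # exclusive scan
    out = [None] * sum(flags)
    for i, keep in enumerate(flags):
        if keep:
            out[positions[i]] = items[i]
    return out
```

The exclusive scan is the only step with cross-element dependencies, and it parallelizes in O(log n) depth, which is why compaction maps well to GPUs.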
... The main difficulties in B&B are irregular data structures that are not well suited for GPU computing and the fact that the computation/communication ratio is low. In [3], a hybrid implementation of B&B for the knapsack problem demonstrates that for small problem sizes it is not efficient to launch the B&B computation kernels on GPU. A parallel CUDA implementation in [4] makes use of data compression. ...
Article
Full-text available
Branch-and-bound (B&B) is a popular approach to accelerate the solution of optimization problems, but its parallelization on graphics processing units (GPUs) is challenging because of B&B's irregular data structures and poor computation/communication ratio. The contributions of this paper are as follows: (1) we develop two CUDA-based implementations (iterative and recursive) of B&B on systems with GPUs for a practical application scenario, the optimal design of multi-product batch plants, with a particular example of a chemical-engineering system (CES); (2) we propose and implement several optimizations of our CUDA code by reducing branch divergence and by exploiting the properties of the GPU memory hierarchy; and (3) we evaluate our implementations and their optimizations on a modern GPU-based system and we report our experimental results.
... Parallelism has already been successfully applied to complete and approximation methods, and the parallelization of complete algorithms has been widely studied [121, 33, 24]. In this thesis we are interested in parallel metaheuristic methods that take advantage of parallel resources to speed up the search process. ...
Thesis
Full-text available
Combinatorial Optimization Problems (COP) are widely used to model and solve real-life problems in many different application domains. These problems represent a real challenge for the research community due to their inherent difficulty, as many of them are NP-hard. COPs are difficult to solve with exact methods due to the exponential growth of the problem's search space with respect to the size of the problem. Metaheuristics are often the most efficient methods to make the hardest problems tractable. However, some hard and large real-life problems are still out of the scope of even the best metaheuristic algorithms. Parallelism is a straightforward way to improve metaheuristics performance. The basic idea is to perform concurrent explorations of the search space in order to speed up the search process. Currently, the most advanced techniques implement some communication mechanism to exchange information between metaheuristic instances in order to increase the probability of finding a solution. However, designing an efficient cooperative parallel method is a very complex task, and many issues about communication must be solved. Furthermore, it is known that no unique cooperative configuration may efficiently tackle all problems. This is why there are currently efficient cooperative solutions dedicated to some specific problems, or more general cooperative methods with limited performance in practice. In this thesis we propose a general framework for Cooperative Parallel Metaheuristics (CPMH). This framework includes several parameters to control the cooperation. CPMH organizes the explorers into teams; each team aims at intensifying the search in a particular region of the search space and uses intra-team communication. In addition, inter-team communication is used to ensure search diversification. CPMH allows the user to tune the trade-off between intensification and diversification.
Moreover, our framework supports different metaheuristics and metaheuristic hybridization. We also provide X10CPMH, an implementation of our CPMH framework developed in the X10 parallel language. To assess the soundness of our approach we tackle two hard real-life COPs: hard variants of the Stable Matching Problem (SMP) and the Quadratic Assignment Problem (QAP). For both problems we propose new sequential and parallel metaheuristics, including a new Extremal Optimization-based method and a new hybrid cooperative parallel algorithm for QAP. All algorithms are implemented with X10CPMH. A complete experimental evaluation shows that the cooperative parallel versions of our methods scale very well, providing high-quality solutions within a limited timeout. On hard and large variants of SMP, our cooperative parallel method reaches super-linear speedups. Regarding QAP, the cooperative parallel hybrid algorithm performs very well on the hardest instances, and improves the best known solutions of several instances.
... – or GPUs are used to accelerate only the most time consuming activities or parts of codes: Chakroun et al. (2013) and Melab et al. (2012) for branch and bound, Carneiro et al. (2011) for the Traveling Salesman Problem (see Table 4, Branch and Bound on GPU). Boukedjar et al. (2012), Lalami and El Baz (2012), and Lalami (2012) studied the GPU implementation of the B&B algorithm for KPs. The nodes are first generated sequentially on the host. ...
Technical Report
Full-text available
Thanks to CUDA and OpenCL, Graphics Processing Units (GPUs) have recently gained considerable attention in science and engineering as accelerators for High Performance Computing (HPC). In this chapter, we show how the Operations Research (OR) community can benefit greatly from GPUs. In particular, we present a survey of the main contributions to the field of GPU computing applied to linear and mixed-integer programming. The OR field is rich in complex problems and sophisticated algorithms that can take advantage of parallelization. However, not all algorithms in the literature fit the SIMT paradigm. Therefore, we highlight the main issues tackled by different authors to overcome the difficulties of implementation, and the results obtained with their optimization algorithms via GPU computing.
... It is noteworthy that the branch and bound method is applied in other areas of optimization, in particular combinatorial optimization and integer programming. Using GPUs shows good results in these areas [13,14]. This is explained by the following factor. ...
... Problems of this class (such as the Travelling Salesman Problem or the Knapsack Problem) are characterized by a short time to compute the objective function value at one point and, at the same time, a relatively long time required to process the results. Therefore, most of the labor-intensive operations here consist in branching and bound computations, which are implemented on the GPU (see [13,14]). This paper considers problems of Lipschitzian global optimization that are frequently encountered in applications. ...
... Let us make a similar experiment using the GPU version of the algorithm. Let us also fix the average number of iterations and the average time required to solve one problem (see Tables 9, 10, 11, 12, 13). Let us measure the speedup and redundancy of the GPU algorithm relative to the CPU algorithm started with 32 cores (see column p = 32 in Tables 4, 5). The number of threads p used on the GPU varied; all other parameters of the method did not change. ...
Article
Full-text available
This work considers a parallel algorithm for solving multidimensional multiextremal optimization problems. This algorithm uses Peano-type space filling curves for dimension reduction. Conditions of non-redundant parallelization of the algorithm are considered. Efficiency of the algorithm on modern computing systems with the use of graphics processing units (GPUs) is investigated. Speedup of the algorithm using GPU as compared with the same algorithm implemented on CPU only is demonstrated experimentally. Computational experiments are carried out on a series of several hundred multidimensional multiextremal problems.