Figure 3 - uploaded by Didier El Baz
Data exchange between CPU and GPU 

Source publication
Article
Full-text available
A hybrid implementation via CUDA of a branch and bound method for knapsack problems is proposed. Branch and bound computations can be carried out either on the CPU or on the GPU according to the size of the branch and bound list, i.e. the number of nodes. Tests are carried out on a Tesla C2050 GPU. A first series of computational results showing a su...

Contexts in source publication

Context 1
... a separate memory transaction is issued for each thread. This significantly degrades the overall processing time. Reference is made to [17] for further details on NVIDIA GPU and computing system architectures and how to optimize the code. In this paper, we study the parallel implementation of branch and bound methods on a Tesla C2050 GPU. Bound computation is particularly time consuming in a branch and bound algorithm. Nevertheless, this task can be efficiently parallelized on the GPU. The main tasks of our parallel algorithm are presented below (see also Figure ...
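The bound phase described above is embarrassingly data-parallel: each node's upper bound depends only on that node's own state. A minimal CPU-side sketch in Python, where the names (`Node`, `dantzig_upper_bound`, `bound_all`) are illustrative and the classic fractional Dantzig bound stands in for whatever bound the paper actually computes:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Node:
    level: int    # index of the next item to branch on
    profit: int   # profit accumulated along the path to this node
    weight: int   # weight accumulated along the path to this node

def dantzig_upper_bound(node: Node, profits: List[int],
                        weights: List[int], capacity: int) -> float:
    """Fractional (Dantzig) bound; items assumed sorted by
    decreasing profit/weight ratio."""
    remaining = capacity - node.weight
    bound = float(node.profit)
    for p, w in zip(profits[node.level:], weights[node.level:]):
        if w <= remaining:
            remaining -= w
            bound += p
        else:
            bound += p * remaining / w  # take a fraction of the last item
            break
    return bound

def bound_all(nodes, profits, weights, capacity):
    # On a GPU this loop is the kernel: one thread per node, no
    # dependencies between iterations.
    return [dantzig_upper_bound(n, profits, weights, capacity) for n in nodes]
```

Because the iterations are independent, mapping one thread to one node is straightforward; the coalescing concern above is about how the node arrays are laid out in global memory.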
Context 2
... best lower bound L̄ is obtained on the GPU via a reduction method making use of the atomic instruction atomicMax applied to the table of lower bounds (see [8]). If the size of the list is small, then it is not efficient to launch the branch and bound computation kernels on the GPU, since GPU occupancy would be very low and computations on the GPU would not cover communications between CPU and GPU. This is the reason why the branch and bound algorithm is implemented on the CPU in this particular context. We note that for a given problem, the branch and bound computation phases can be carried out several times on the CPU according to the result of the pruning procedure. In this study, GPU kernels are launched only when the size of the list, denoted by q, is greater than 5000 nodes (see Figure 3). We have noticed that this condition generally ensures 100% occupancy of the GPU. This procedure starts after the list of nodes and the best lower bound L̄ have been transferred from the GPU to the CPU. The size q of the list is updated to twice its original value, then the list is processed from e = 1 to q. A node e is considered non-promising (NP) if U_e ≤ L̄, otherwise it is promising (P) (see Figure 4). Non-promising states are replaced via an iterative procedure that starts from the beginning of the list. The different steps of the procedure that replaces a non-promising node with index l by a promising node are presented ...
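The pruning step described in this context can be sketched as a two-pointer, in-place compaction; the names, and the exact order in which promising nodes refill non-promising slots, are assumptions for illustration, not the authors' code:

```python
GPU_THRESHOLD = 5000  # kernels are launched only above this list size

def compact(nodes, upper_bounds, l_bar):
    """In-place removal of non-promising nodes (U_e <= l_bar): a
    non-promising slot found while scanning from the front is refilled
    with a promising node taken from the back of the list.
    Returns the new list size q."""
    front, back = 0, len(nodes) - 1
    while front <= back:
        if upper_bounds[front] > l_bar:        # promising: keep in place
            front += 1
        elif upper_bounds[back] <= l_bar:      # non-promising tail: drop
            back -= 1
        else:                                  # refill front slot from back
            nodes[front], upper_bounds[front] = nodes[back], upper_bounds[back]
            front += 1
            back -= 1
    return front

def launch_on_gpu(q: int) -> bool:
    # the size test described in the text: GPU kernels only for large lists
    return q > GPU_THRESHOLD
```

After compaction the surviving q nodes sit contiguously at the head of the list, which is what keeps subsequent GPU memory accesses coalesced.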

Similar publications

Preprint
Full-text available
We consider the summation of large sets of floating-point numbers on hybrid CPU-GPU platforms using MPRES, a new software library for multiple-precision computations on CPUs and CUDA compatible GPUs. This library uses an RNS-based floating-point representation, in accordance with which the multiple-precision significands are represented in a residu...
Conference Paper
Full-text available
We consider the summation of large sets of floating-point numbers on hybrid CPU-GPU platforms using MPRES, a new software library for multiple-precision computations on CPUs and CUDA compatible GPUs. This library uses an RNS-based floating-point representation, in accordance with which the multiple-precision significands are represented in a residu...
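For readers unfamiliar with residue number systems, the RNS idea mentioned in the two abstracts above can be illustrated in a few lines of Python; the moduli below are arbitrary examples, not those used by MPRES:

```python
from math import prod

MODULI = (13, 17, 19, 23)  # pairwise coprime; dynamic range M = 13*17*19*23

def to_rns(x: int):
    return tuple(x % m for m in MODULI)

def rns_add(a, b):
    # channel-wise and carry-free, hence easy to parallelize
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, MODULI))

def rns_mul(a, b):
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))

def from_rns(r) -> int:
    """Chinese Remainder Theorem reconstruction."""
    M = prod(MODULI)
    return sum(ri * (M // mi) * pow(M // mi, -1, mi)
               for ri, mi in zip(r, MODULI)) % M
```

The appeal for GPUs is that each residue channel is independent, so the significand arithmetic decomposes into small, uniform, carry-free operations.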
Article
Full-text available
Nowadays, data has been playing an indispensable role in almost all industrial areas. Data integrity and security over the Internet, other types of media, and applications have become major concerns in the computing world. If confidential or sensitive data is forged, tampered with, or wiretapped by an attacker, capital losses might occur. Encryption is one of...
Article
Full-text available
Thinning is one of the most important techniques in the field of image processing. It is applied to erode the image of an object layer by layer until a skeleton is left. Several thinning algorithms for obtaining the skeleton of a binary image have already been proposed in the literature. This paper investigates several well-known parallel thinning algori...
Article
Full-text available
Based on the finite element method (FEM) in the frequency domain and the particle-in-cell approach in the time domain, a hybrid-domain multipactor threshold prediction algorithm is proposed in this paper. The proposed algorithm combines the advantages of frequency-domain and time-domain algorithms in terms of high computational accu...

Citations

... Many heuristic algorithms based on search optimization have been proposed to find near-optimal solutions for the knapsack problem in polynomial time, such as the firefly algorithm [7] and the genetic algorithm [25]. In particular, many attempts have been made to implement and optimize branch-and-bound (BB) algorithms [2,4,13,27,28] on GPUs. BB algorithms are well suited to GPUs because the traversal paths can be broken into independent subproblems and processed in parallel. ...
Article
Full-text available
This work aims to improve GPU performance for solving the 0/1 knapsack problem, a well-known combinatorial optimization problem found in many practical applications, including cryptography, financial decision making, electronic design automation, computing resource management, etc. The knapsack problem is NP-hard, but it can be solved efficiently by dynamic programming (DP) algorithms in pseudo-polynomial runtime. DP knapsack algorithms on GPUs have been presented before. However, as the modern GPU architecture provides much higher computing throughput than memory bandwidth, previous work is bounded by data access time on GPU memory: its CGMA (Compute to Global Memory Access) ratio is 1, which means every computing operation involves one memory access on average. To address this problem, this paper proposes an approach based on the Multi-Class 0/1 Knapsack Problem (MCKP), in which items can be classified into groups with equal values or weights. By reconstructing the DP equations for solving MCKP, it is possible to exploit data parallelism and data reusability across threads. This makes it possible to optimize the computation across iterations (i.e., items) and to improve the CGMA ratio 5-fold by using GPU shared memory and registers for reused data. We extensively analyze the performance of our approach on two modern GPU models, the NVIDIA Tesla V100 and RTX 3070. Compared to the runtime of previous work, our approach achieves up to 8x and 18x speedup on the V100 and RTX 3070, respectively, the latter being a GPU with lower memory bandwidth. In addition, by comparing the two speedups, we find that computing resources are used more efficiently when memory bandwidth is limited, as on the RTX 3070.
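For context, the pseudo-polynomial DP that this line of work builds on is the textbook recurrence sketched below (a generic illustration, not the MCKP reformulation proposed in the cited paper):

```python
def knapsack_dp(profits, weights, capacity):
    """1-D dynamic program: dp[c] = best profit achievable within capacity c."""
    dp = [0] * (capacity + 1)
    for p, w in zip(profits, weights):
        # scan capacities downwards so each item is used at most once
        for c in range(capacity, w - 1, -1):
            dp[c] = max(dp[c], dp[c - w] + p)
    return dp[capacity]
```

In a straightforward GPU port of this loop nest, each `dp[c]` update performs roughly one arithmetic operation per global-memory access, which is the CGMA ≈ 1 behavior the abstract identifies as the bottleneck.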
... The first also includes data compression techniques, and the second addresses the use of a multi-GPU system. In [El-Baz 2012, Boukedjar et al. 2012] the branch and bound method with GPU support is described. In [Hajarian et al. 2016] the use of the Firefly algorithm, parallelized on graphics processing cards, is presented. ...
Conference Paper
Full-text available
The multidimensional knapsack problem (MKP) is a classic problem in combinatorial optimization. Although it has many practical applications, no polynomial-time algorithm is known for solving it, i.e., it belongs to the NP-hard class. This situation has led to the search for more efficient techniques for its resolution. However, even the most promising approaches cannot solve larger instances within an acceptable computational time. This has motivated the use of parallelism in its resolution and, in particular, the adoption of GPUs due to the possibility of processing large volumes of data in parallel. In this context, the present work aims to identify, through a systematic literature review, the state of the art of techniques that use GPUs to solve the MKP.
... Chakroun et al., 2013b; Herrera et al., 2017; Taoka et al., 2008; Ponz-Tienda et al., 2017; Ismail et al., 2014; Paulavicius et al., 2011; Christou and Vassilaras, 2013; McCreesh and Prosser, 2015; Eckstein et al., 2015; Carvajal et al., 2014; Borisenko et al., 2017; Gmys et al., 2017; Liu and Kao, 2013; Bak et al., 2011; Gmys et al., 2016; Silva et al., 2015; Barreto and Bauer, 2010; Vu and Derbel, 2016; Chakroun and Melab, 2015; Paulavičius and Žilinskas, 2009; Posypkin and Sigal, 2008; Chakroun et al., 2013a; Aitzai and Boudhar, 2013; Ozden et al., 2017; Cauley et al., 2011; Xu et al., 2009; Aldasoro et al., 2017; Pages-Bernaus et al., 2015; Lubin et al., 2013; Adel et al., 2016; Borisenko et al., 2011; Boukedjar et al., 2012; Carneiro et al., 2011; Galea and Le Cun, 2011; Herrera et al., 2013; Sanjuan-Estrada et al., 2011). Dynamic programming (Dias et al., 2013; Aldasoro et al., 2015; Maleki et al., 2016; Tan et al., 2009; Stivala et al., 2010; Boyer et al., 2012; Boschetti et al., 2016; Kumar et al., 2011; Rashid et al., 2010; Tran, 2010). Interior point method (Huebner et al., 2017; Hong et al., 2010; Lubin et al., 2012; Lucka et al., 2008). Problem-specific exact algorithms (Li et al., 2015; Rossbory and Reisner, 2013; Kollias et al., 2014; Bozdag et al., 2008). ...
Article
Solving optimization problems with parallel algorithms has a long tradition in OR. Its future relevance for solving hard optimization problems in many fields, including finance, logistics, production and design, is leveraged through the increasing availability of powerful computing capabilities. Acknowledging the existence of several literature reviews on parallel optimization, we did not find reviews that cover the most recent literature on the parallelization of both exact and (meta)heuristic methods. However, in the past decade substantial advancements in parallel computing capabilities have been achieved and used by OR scholars, so that an overview of modern parallel optimization in OR that accounts for these advancements is beneficial. Another issue with previous reviews results from their adoption of different foci, so that the concepts used to describe and structure prior literature differ. This heterogeneity is accompanied by a lack of unifying frameworks for parallel optimization across methodologies, application fields and problems, and it has finally led to an overall fragmented picture of what has been achieved and still needs to be done in parallel optimization in OR. This review addresses the aforementioned issues with three contributions: First, we suggest a new integrative framework of parallel computational optimization across optimization problems, algorithms and application domains. The framework integrates the perspectives of algorithmic design and computational implementation of parallel optimization. Second, we apply the framework to synthesize prior research on parallel optimization in OR, focusing on computational studies published in the period 2008-2017. Finally, we suggest research directions for parallel optimization in OR.
... The main challenges are B&B's irregular data structures that are not well suited for GPU computing and the fact that the computation/communication ratio is low. In [24], a hybrid implementation of B&B for the knapsack problem demonstrates that for small problem sizes it is not efficient to launch the B&B computation kernels on GPU. A parallel CUDA implementation in [25] makes use of data compression. ...
Conference Paper
Full-text available
We report a comparative study on the development of high-performance software for solving the problem of optimal design of multiproduct batch plants on modern parallel systems. We analyze two main algorithmic approaches to optimization - branch-and-bound and metaheuristic-based - and we develop and compare their parallel implementations on a variety of parallel architectures: multi-core CPU, GPU, and clusters. Our experiments on a real-world case study - optimization of chemical-engineering systems - demonstrate the trade-offs between the run time performance and the quality of solutions achieved by different algorithms on various parallel architectures.
... Moreover, each thread has its own registers and private local memory. (2013) and Boukedjar et al. (2012) addressed the irregularity of BnB. However, the exploration in both approaches takes longer. ...
... We believe that proposing a parallel branch-and-bound algorithm dedicated to solving the GED problem is of great interest, since the computational time will be improved. The search tree of GED is irregular (i.e., the number of tree nodes varies depending on the ability of the lower and upper bounds to prune the search tree), and thus regular parallel approaches (e.g., Boukedjar et al., 2012; Chakroun & Melab, 2013; Dorta et al., 2003) are not suitable for such a problem. ...
Article
Graph edit distance (GED) has emerged as a powerful and flexible graph matching paradigm that can be used to address different tasks in pattern recognition, machine learning, and data mining. GED is an error-tolerant graph matching problem which consists in minimizing the cost of the sequence of edit operations that transforms one graph into another. Edit operations are deletion, insertion and substitution of vertices and edges. Each vertex/edge operation has its associated cost defined in the vertex/edge cost function. Unfortunately, the GED problem is NP-hard. The question of elaborating fast and precise algorithms is of primary interest. In this paper, a parallel algorithm for exact GED computation is proposed. Our proposal is based on a branch-and-bound algorithm coupled with a load balancing strategy. Parallel threads run a branch-and-bound algorithm to explore the solution space and to discard misleading partial solutions. Meanwhile, the load balancing scheme ensures that no thread remains idle. Experiments on 4 publicly available datasets empirically demonstrated that under time constraints our proposal can drastically improve on a sequential approach and a naive parallel approach. Our proposal was compared to 6 other methods and provided more precise solutions while requiring low memory usage.
... The outline of the B&B approach is shown in Fig. 1. Many B&B approaches have successfully solved knapsack problems by exploiting data parallelism on various parallel machines such as single-instruction multiple-data (SIMD) machines [3], cluster systems [4], computational grids [5,6], and graphics processing units (GPUs) [7][8][9][10]. Among these parallel machines, the GPU [11] is a powerful accelerator device not only for graphics applications but also for compute- and memory-intensive applications [12,13]. ...
... Boukedjar et al. [8] presented a B&B approach that solves the 0-1 knapsack problem on a GPU. Their method managed subproblems in the GPU memory with arrays standing for subproblems. ...
... Therefore, CPU-GPU data transfer remains a performance bottleneck. To reduce the amount of data transferred between the CPU and GPU, Lalami et al. [9] presented an extension of [8] that restricts the transferred data to a single attribute: label data (i.e., the label array), where a label shows whether an element is active (label = 1) or passive (label = 0). They focused on the data access pattern required for compaction. ...
Conference Paper
Full-text available
In this paper, we propose an out-of-core branch and bound (B&B) method for solving the 0–1 knapsack problem on a graphics processing unit (GPU). Given a large problem that produces many subproblems, the proposed method dynamically swaps subproblems out to CPU memory. We adopt two strategies to realize this swapping-out procedure with a minimum amount of CPU-GPU data transfer. The first strategy is a GPU-based stream compaction strategy that reduces the sparseness of arrays. The second strategy is a double buffering strategy that hides the data transfer overhead by overlapping data transfer with GPU-based B&B operations. Experimental results show that the proposed method can store 33.7 times more subproblems than the previous method, solving twice as many instances on the GPU. As for the stream compaction strategy, an input-output separated scheme runs 13.1% faster than an input-output unified scheme.
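Prefix-sum-based stream compaction, the building block this abstract refers to, can be sketched as follows; the sequential Python stands in for the parallel scan, and the function name is assumed:

```python
from itertools import accumulate

def compact_stream(items, flags):
    """flags[i] == 1 keeps items[i]. An exclusive prefix sum of flags
    gives each survivor its output index, so every write is independent
    and could be issued in parallel on a GPU. Writing survivors to a
    fresh output array is the 'input-output separated' scheme."""
    positions = [0] + list(accumulate(flags))[:-1]  # exclusive scan
    out = [None] * sum(flags)
    for i, keep in enumerate(flags):
        if keep:
            out[positions[i]] = items[i]
    return out
```

The exclusive scan is the only step with cross-element dependencies, and it parallelizes in O(log n) depth, which is why compaction maps well to GPUs.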
... The main difficulties in B&B are irregular data structures that are not well suited for GPU computing and the fact that the computation/communication ratio is low. In [3], a hybrid implementation of B&B for the knapsack problem demonstrates that for small problem sizes it is not efficient to launch the B&B computation kernels on GPU. A parallel CUDA implementation in [4] makes use of data compression. ...
Article
Full-text available
Branch-and-bound (B&B) is a popular approach to accelerate the solution of optimization problems, but its parallelization on graphics processing units (GPUs) is challenging because of B&B's irregular data structures and poor computation/communication ratio. The contributions of this paper are as follows: (1) we develop two CUDA-based implementations (iterative and recursive) of B&B on systems with GPUs for a practical application scenario, the optimal design of multi-product batch plants, with a particular example of a chemical-engineering system (CES); (2) we propose and implement several optimizations of our CUDA code by reducing branch divergence and by exploiting the properties of the GPU memory hierarchy; and (3) we evaluate our implementations and their optimizations on a modern GPU-based system and we report our experimental results.
... Parallelism has already been successfully applied to complete and approximation methods, and the parallelization of complete algorithms has been widely studied [121, 33, 24]. In this thesis we are interested in parallel metaheuristic methods that take advantage of parallel resources to speed up the search process. ...
Thesis
Full-text available
Combinatorial Optimization Problems (COP) are widely used to model and solve real-life problems in many different application domains. These problems represent a real challenge for the research community due to their inherent difficulty, as many of them are NP-hard. COPs are difficult to solve with exact methods due to the exponential growth of the problem's search space with respect to the size of the problem. Metaheuristics are often the most efficient methods to make the hardest problems tractable. However, some hard and large real-life problems are still out of the scope of even the best metaheuristic algorithms. Parallelism is a straightforward way to improve metaheuristics performance. The basic idea is to perform concurrent explorations of the search space in order to speed up the search process. Currently, the most advanced techniques implement some communication mechanism to exchange information between metaheuristic instances in order to increase the probability of finding a solution. However, designing an efficient cooperative parallel method is a very complex task, and many issues about communication must be solved. Furthermore, it is known that no unique cooperative configuration may efficiently tackle all problems. This is why there are currently efficient cooperative solutions dedicated to some specific problems, or more general cooperative methods with limited performance in practice. In this thesis we propose a general framework for Cooperative Parallel Metaheuristics (CPMH). This framework includes several parameters to control the cooperation. CPMH organizes the explorers into teams; each team aims at intensifying the search in a particular region of the search space and uses intra-team communication. In addition, inter-team communication is used to ensure search diversification. CPMH allows the user to tune the trade-off between intensification and diversification.
Moreover, our framework supports different metaheuristics and metaheuristic hybridization. We also provide X10CPMH, an implementation of our CPMH framework developed in the X10 parallel language. To assess the soundness of our approach we tackle two hard real-life COPs: hard variants of the Stable Matching Problem (SMP) and the Quadratic Assignment Problem (QAP). For both problems we propose new sequential and parallel metaheuristics, including a new Extremal Optimization-based method and a new hybrid cooperative parallel algorithm for QAP. All algorithms are implemented with X10CPMH. A complete experimental evaluation shows that the cooperative parallel versions of our methods scale very well, providing high-quality solutions within a limited timeout. On hard and large variants of SMP, our cooperative parallel method reaches super-linear speedups. Regarding QAP, the cooperative parallel hybrid algorithm performs very well on the hardest instances, and improves the best known solutions of several instances.
... – or GPUs are used to accelerate only the most time consuming activities or parts of codes: Chakroun et al. (2013) and Melab et al. (2012) for branch and bound, Carneiro et al. (2011) for the Traveling Salesman Problem (see Table 4, Branch and Bound on GPU). Boukedjar et al. (2012), Lalami and El Baz (2012), and Lalami (2012) studied the GPU implementation of the B&B algorithm for KPs. The nodes are first generated sequentially on the host. ...
Technical Report
Full-text available
Thanks to CUDA and OpenCL, Graphics Processing Units (GPUs) have recently gained considerable attention in science and engineering as accelerators for High Performance Computing (HPC). In this chapter, we show how the Operations Research (OR) community can benefit greatly from GPUs. In particular, we present a survey of the main contributions to the field of GPU computing applied to linear and mixed-integer programming. The OR field is rich in complex problems and sophisticated algorithms that can take advantage of parallelization. However, not all algorithms in the literature fit the SIMT paradigm. Therefore, we highlight the main issues tackled by different authors to overcome the difficulties of implementation, and the results obtained with their optimization algorithms via GPU computing.
... It is noteworthy that the branch and bound method is applied in other areas of optimization, in particular combinatorial optimization and integer programming. Using GPUs shows good results in these areas [13,14]. This is explained by the following factor. ...
... Problems of this class (such as the Travelling Salesman Problem or the Knapsack Problem) are characterized by a short time to compute the objective function value at one point and, at the same time, a relatively long time required to process the results. Therefore, most of the labor-intensive operations here consist in branching and bound computations, which are implemented on the GPU (see [13,14]). This paper considers problems of Lipschitzian global optimization that are frequently encountered in applications. ...
... Let us make a similar experiment using the GPU version of the algorithm. Let us also fix the average number of iterations and the average time required to solve one problem (see Tables 9, 10, 11, 12, 13). Let us measure the speedup and redundancy of the GPU algorithm relative to the CPU algorithm started with 32 cores (see column p = 32 in Tables 4, 5). The number of threads p used on the GPU varied; all other parameters of the method did not change. ...
Article
Full-text available
This work considers a parallel algorithm for solving multidimensional multiextremal optimization problems. This algorithm uses Peano-type space filling curves for dimension reduction. Conditions of non-redundant parallelization of the algorithm are considered. Efficiency of the algorithm on modern computing systems with the use of graphics processing units (GPUs) is investigated. Speedup of the algorithm using GPU as compared with the same algorithm implemented on CPU only is demonstrated experimentally. Computational experiments are carried out on a series of several hundred multidimensional multiextremal problems.