Figure 3 - uploaded by Jean-Louis Roch
Old motherboard architecture. 

Source publication
Conference Paper
Full-text available
In this work we study the feasibility of high-bandwidth, secure communications on generic machines equipped with the latest CPUs and General-Purpose Graphical Processing Units (GPGPU). We first analyze the suitability of current Nehalem CPU architectures. We show in particular that high performance CPUs are not sufficient by themselves to reach our...

Context in source publication

Context 1
... (L3) that is shared by the four cores, and two physical buses (RAM and QuickPath) (figure 1). To each CPU chip is associated a privileged memory bank (connected to that chip's RAM bus) that is accessed directly by the associated CPU and that can still be accessed by remote CPU chips through a CPU-to-CPU interconnect (figure 2). This new memory hierarchy enhances parallelism, since each CPU can now access its own memory bank independently, without creating any contention with other CPUs. As far as I/O communications are concerned, Intel also removed the classic bottleneck by replacing the old shared-bus architecture (e.g. the Front Side Bus, FSB, of figure 3) with point-to-point connections between each CPU and an I/O hub (figure 2). Network Interface Card (NIC). In traditional NIC architectures, the NIC driver is the only entity capable of accessing the NIC (figure 4). In contrast, recent NICs also include parallelism mechanisms at the hardware level; more precisely, newer NICs define multiple reception and transmission queues (figure 5). Thanks to this change, the traditional bottleneck at the hardware/software boundary is removed. Even though it was primarily designed for virtualized servers [5], the new NIC architecture is useful in our use case because it can help spread the network load over different cores, over different OS ...
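The multi-queue idea described in this context can be sketched in plain Python (an illustrative toy, not code from the paper): each worker thread drains its own receive queue, the way an RSS-capable NIC hashes incoming packets to per-core queues, so no single shared queue becomes a point of contention.

```python
from queue import Queue
from threading import Thread

NUM_QUEUES = 4  # stands in for the NIC's hardware receive queues

def worker(rx_queue, results, idx):
    # Each worker owns one queue, mirroring one core per NIC queue.
    total = 0
    while True:
        pkt = rx_queue.get()
        if pkt is None:          # sentinel: this queue is drained
            break
        total += len(pkt)
    results[idx] = total

queues = [Queue() for _ in range(NUM_QUEUES)]
results = [0] * NUM_QUEUES

# "Hardware" side: dispatch each packet to a queue (RSS-style hashing,
# reduced here to round-robin for clarity).
packets = [b"x" * (10 + i) for i in range(40)]
for i, pkt in enumerate(packets):
    queues[i % NUM_QUEUES].put(pkt)
for q in queues:
    q.put(None)

threads = [Thread(target=worker, args=(q, results, i))
           for i, q in enumerate(queues)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(results))  # total bytes received across all queues
```

Because every worker touches only its own queue, removing the shared-queue lock is exactly the contention the multi-queue NIC design eliminates at the hardware/software boundary.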

Similar publications

Article
Full-text available
With the advent of IoT and Cloud computing service technology, the size of user data to be managed and file data to be transmitted has increased significantly. To protect users' personal information, it is necessary to encrypt it in a secure and efficient way. Since servers handling a number of clients or IoT devices have to encrypt a large amou...

Citations

... Such analyses deal primarily with the definition of the optimal granularity of parallelism, block size, and management of the memory hierarchy (what to allocate to shared memory and how to avoid bank conflicts) to improve the performance of AES. Other works deal with implementations of AES for specialized applications [42,43] or with automated CUDA kernel generation from C-code [39]. However, the introduction of the AES-NI [44] instruction set extension in the Intel Nehalem processor family provides a fast way to implement AES. ...
Article
Full-text available
The modern trend toward heterogeneous many-core architectures has led to high architectural diversity in both high performance and high-end embedded systems. To effectively exploit the computational resources of such a wide range of architectures, programming languages and APIs such as OpenCL have become increasingly popular. Although OpenCL provides functional code portability and the ability to fine-tune the application to the target hardware, providing performance portability is still an open problem. Thus, many research works have investigated the optimization of specific combinations of application and target platform. In this paper, we aim at leveraging the experience obtained in the implementation of algorithms from the cryptography domain to provide a set of guidelines for performance portability on modern many-core heterogeneous architectures, and to establish a base on which domain-specific languages and compiler transformations could be built in the near future. We study algorithmic choices and the effect of compiler transformations on three representative applications in the chosen domain on a set of seven target platforms. To estimate how well an application fits the architecture, we define a metric of computational intensity both for the architecture and for the application implementation. Besides being useful to compare either different implementations or algorithmic choices and their fitness to a specific architecture, it can also be useful to the compiler to guide the code optimization process. Copyright © 2014 John Wiley & Sons, Ltd.
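The computational-intensity metric this abstract mentions can be illustrated with a minimal roofline-style sketch. All function names and numbers below are invented placeholders for illustration, not the paper's actual metric or measurements:

```python
# Hypothetical sketch of a computational-intensity comparison:
# an implementation's ops-per-byte ratio is compared against the ratio
# the hardware can sustain (peak op throughput over memory bandwidth).

def intensity(ops, bytes_moved):
    """Operations performed per byte of memory traffic."""
    return ops / bytes_moved

def machine_balance(peak_ops_per_s, mem_bw_bytes_per_s):
    """Ops per byte the architecture can sustain at peak."""
    return peak_ops_per_s / mem_bw_bytes_per_s

def is_compute_bound(app_intensity, balance):
    # Above the machine balance point, the kernel is limited by compute
    # throughput rather than memory bandwidth (roofline model).
    return app_intensity >= balance

# Made-up example: a cipher kernel doing ~160 table lookups/xors per
# 16-byte block while streaming 32 bytes (in + out) per block.
app = intensity(160, 32)                 # 5 ops/byte
bal = machine_balance(1e12, 100e9)       # 1 Top/s over 100 GB/s = 10 ops/byte
print(app, is_compute_bound(app, bal))
```

Comparing the two ratios is what lets such a metric rank implementations against a target platform, or guide a compiler's optimization choices, before any code is run.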
... Likewise, GPGPUs offer a better performance/power and performance/cost ratio [2],[48],[49], and such solutions are beginning to appear in the literature. Among the most important are: the solution of simultaneous equations [51], different kinds of transforms [52],[53],[54], solutions of the 2D and 3D wave equation [12],[55], communications [56], and molecular dynamics simulations [57], among others. ...
Article
Full-text available
This article reviews the efforts currently being carried out to reduce the computation time of the MS. We introduce the methods used to perform the migration process, as well as the two computer architectures that currently offer the best processing times. We review the most representative implementations of this process on these two technologies and summarize the contributions of each of these investigations. The article ends with our analysis of the direction that future research in this area should take.
... There is a large number of studies on the development of GPGPU-based computational solutions, and many of these are beginning to appear in the literature. Among the most important are: the solution of simultaneous equations [51], different kinds of transforms [52],[53],[54], solutions of the 2D and 3D wave equation [12],[55], communications [56], and molecular dynamics simulations [57], among others. ...
Article
Full-text available
This article reviews the efforts currently being carried out to reduce the computation time of the MS. We introduce the methods used to perform the migration process, as well as the two computer architectures that currently offer the best processing times. We review the most representative implementations of this process on these two technologies and summarize the contributions of each of these investigations. The article ends with our analysis of the direction that future research in this area should take. PACS: 93.85.Rt MSC: 68M20
... [2] or the application section in the Wikipedia page for GPGPU [1], where many references and links are provided. Symbolic computation is also entering the area of many-core computing, but few reports have been published so far [4] [8] [10] [17]. ...
Article
We present a CUDA implementation of dense multivariate polynomial arithmetic based on Fast Fourier Transforms over finite fields. Our core routine computes on the device (GPU) the subresultant chain of two polynomials with respect to a given variable. This subresultant chain is encoded by values on a FFT grid and is manipulated from the host (CPU) in higher-level procedures. We have realized a bivariate polynomial system solver supported by our GPU code. Our experimental results (including detailed profiling information and benchmarks against a serial polynomial system solver implementing the same algorithm) demonstrate that our strategy is well suited for GPU implementation and provides large speedup factors with respect to pure CPU code.
... [1] or the application section in the Wikipedia page for GPGPU [2], where many references and links are provided. Symbolic computation is also entering the area of many-core computing, but few reports have been published so far [3,4,5,6]. ...
Conference Paper
We present a CUDA implementation of dense multivariate polynomial arithmetic based on Fast Fourier Transforms (FFT) over finite fields. Our core routine computes on the device (GPU) the subresultant chain of two polynomials with respect to a given variable. This subresultant chain is encoded by values on a FFT grid and is manipulated from the host (CPU) in higher‐level procedures, for instance for polynomial GCDs modulo regular chains. We have realized a bivariate polynomial system solver supported by our GPU code. Our experimental results demonstrate that our strategy is well suited for GPU implementation and provides large speedup factors with respect to pure CPU code.
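The values-on-an-FFT-grid encoding this abstract describes can be illustrated with a minimal CPU-side sketch (not the authors' CUDA code): multiplying two polynomials over a finite field GF(p) by a number-theoretic transform (NTT), i.e. evaluating on a root-of-unity grid, multiplying pointwise, and interpolating back.

```python
# Minimal NTT-based polynomial multiplication over GF(P).
P = 998244353          # NTT-friendly prime: P - 1 = 2^23 * 119
G = 3                  # primitive root modulo P

def ntt(a, invert=False):
    n = len(a)
    a = list(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Iterative Cooley-Tukey butterflies.
    length = 2
    while length <= n:
        w = pow(G, (P - 1) // length, P)
        if invert:
            w = pow(w, P - 2, P)       # modular inverse of the root
        for start in range(0, n, length):
            wn = 1
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * wn % P
                a[k] = (u + v) % P
                a[k + length // 2] = (u - v) % P
                wn = wn * w % P
        length <<= 1
    if invert:
        n_inv = pow(n, P - 2, P)
        a = [x * n_inv % P for x in a]
    return a

def poly_mul(f, g):
    # Evaluate on the grid, multiply pointwise, interpolate back.
    n = 1
    while n < len(f) + len(g) - 1:
        n <<= 1
    fa = ntt(f + [0] * (n - len(f)))
    gb = ntt(g + [0] * (n - len(g)))
    prod = ntt([x * y % P for x, y in zip(fa, gb)], invert=True)
    return prod[:len(f) + len(g) - 1]

# (1 + 2x) * (3 + 4x) = 3 + 10x + 8x^2
print(poly_mul([1, 2], [3, 4]))
```

Keeping intermediate objects as values on the grid, as the paper does for subresultant chains, means pointwise operations parallelize trivially, which is what makes the encoding a good fit for a GPU.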
... [3] presents an AES implementation on CUDA and reports throughput above 10 Gbps. [15] performs a comparison similar to the one proposed in this paper, comparing the CPU and the GPGPU. The conclusion drawn is that commodity processors alone cannot meet the stated objectives, and that finding a solution to accelerate AES on a GPU proves extremely difficult. ...
Article
Full-text available
The main goal of this work is to analyze the possibility of using a graphics processing unit for non-graphical computations. Graphics processing units are nowadays used not only for game engines and movie encoding/decoding but also for a vast range of applications, such as cryptography. We used the graphics processing unit as a cryptographic coprocessor in order to accelerate the AES algorithm. Our implementation of AES runs on a GPU using the CUDA architecture. The performance obtained shows that the CUDA implementation can offer a throughput of 11.95 Gbps. The tests are conducted in two directions: running on small data sets held in memory and on large data stored in files on hard drives.
Thesis
In this thesis we address the problem of designing a very high-throughput IPsec gateway to secure communications between local networks. We propose two architectures: a purely software gateway running on a single server, called integrated, and a gateway using several servers and a cryptographic hardware module, called split. The first part of our work studies the performance of the two proposed architectures. We begin by showing that an off-the-shelf server is limited by its computing power in reaching the goal of encrypting and communicating at 10 Gb/s. Moreover, recent graphics cards, although promising in terms of raw power, are not well suited to the problem of encrypting network packets (because of the packets' small size). We therefore build a network stack distributed over several machines and parallelize it within the split architecture. In a second phase, we analyze the integration of a gateway into a network, in particular the interaction of the ICMP control protocol with IPsec. ICMP is particularly important for reaching high throughput because of its role in the packet-size optimization mechanism. To this end, we developed IBTrack, a tool for studying the behavior of routers along a path with respect to ICMP. We then show that ICMP is an attack vector against IPsec, exploiting a fundamental flaw in the IP and IPsec standards: the IP packet overhead created by tunnel mode conflicts with the minimum maximum packet size guaranteed by IP.
Article
Full-text available
The introduction of various cryptographic modes of operation was motivated by known imperfections of symmetric block algorithms. The design of some cryptographic modes of operation has already been exploited as an idea for parallelizing the execution of certain algorithms. To the best of our knowledge, there is no evidence in the available literature that output feedback (OFB) mode, which is used in satellite communications, has ever been parallelized. In this paper, we consider the performance of a convenient mode of operation that performs tweakable parallel encryption using xor-encrypt-xor (XEX) and xor-encrypt (XE) constructions in an OFB-like mode. We use an idea similar to XTS-AES in order to create two parallel tweakable block ciphers. The first is designed using the XEX construction, while the second is based on the XE construction. Each cipher uses two threads to produce the corresponding keystreams. The keystreams are first merged with each other and then used in a modified tweakable parallel OFB mode of operation. As a proof of concept, we have implemented a Java application in which these parallel solutions are applied to collect empirical data. The results obtained show that under certain conditions the tweakable parallel OFB modes using the XEX and XE constructions can achieve performance accelerations of up to 10% and 20%, respectively.
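The OFB feedback loop that this article sets out to parallelize can be sketched as follows. This is a toy illustration only: a keyed SHA-256 call stands in for the block cipher (it is not a real cipher and not the article's XEX/XE construction).

```python
import hashlib

BLOCK = 16  # block size in bytes

def toy_block_cipher(key: bytes, block: bytes) -> bytes:
    # Stand-in keyed permutation; NOT a real block cipher.
    return hashlib.sha256(key + block).digest()[:BLOCK]

def ofb(key: bytes, iv: bytes, data: bytes) -> bytes:
    # OFB: the keystream is produced by repeatedly applying the cipher
    # to its own previous output; data is only XORed in afterwards.
    out = bytearray()
    state = iv
    for i in range(0, len(data), BLOCK):
        state = toy_block_cipher(key, state)   # the feedback chain
        chunk = data[i:i + BLOCK]
        out += bytes(c ^ k for c, k in zip(chunk, state))
    return bytes(out)

key, iv = b"k" * BLOCK, b"\x00" * BLOCK
msg = b"attack at dawn / OFB demo payload"
ct = ofb(key, iv, msg)
# OFB is its own inverse: applying it to the ciphertext restores the plaintext.
print(ofb(key, iv, ct) == msg)
```

Because each keystream block depends on the previous one, the chain is inherently sequential; this is precisely the obstacle that the article's XEX/XE tweakable constructions are designed to work around.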
Article
OpenCL applications may present tight constraints on work-group size due to algorithm design or the chosen implementation strategy. This may hamper functional or performance portability across different platforms due to a lack of resources. The current solution is to re-design the implementation, optimizing it for the new platform. However, this can become a showstopper for new platforms, for which a large manual optimization effort is needed to port benchmark suites and applications. In this work, we aim at tackling such issues by applying work-item coalescing techniques to optimize the mapping of work-items to processing elements. However, this is generally not sufficient to achieve good performance, as different design patterns may be applied to exploit the specific features of the target architecture. We show how additional target-specific transformations can improve performance with respect to the work-item coalescing baseline. We employ a Matrix Multiply case study to show how work-item coalescing transformations can impact functional portability, together with providing an opportunity to automatically insert the use of asynchronous copies on embedded many-core platforms endowed with such a feature.