Figure 3 - uploaded by Jean-Louis Roch
Old motherboard architecture. 

Source publication
Conference Paper
Full-text available
In this work we study the feasibility of high-bandwidth, secure communications on generic machines equipped with the latest CPUs and General-Purpose Graphical Processing Units (GPGPU). We first analyze the suitability of current Nehalem CPU architectures. We show in particular that high performance CPUs are not sufficient by themselves to reach our...

Context in source publication

Context 1
... (L3) that is shared by the four cores, and two physical buses (RAM and QuickPath) (figure 1). To each CPU chip is associated a privileged memory bank (connected to that chip's RAM bus) that is accessed directly by the associated CPU and that can still be accessed by remote CPU chips through a CPU-to-CPU interconnect (figure 2). This new memory hierarchy enhances parallelism, since each CPU can now access its own memory bank independently, without creating any contention with other CPUs. As far as I/O communications are concerned, Intel also removed the classic bottleneck by replacing the old shared-bus architecture (e.g. the Front Side Bus, FSB, of figure 3) with point-to-point connections between each CPU and an I/O hub (figure 2). Network Interface Card (NIC). In traditional NIC architectures, the NIC driver is the only entity capable of accessing the NIC (figure 4). In contrast, recent NICs also include parallelism mechanisms at the hardware level; more precisely, newer NICs define multiple reception and transmission queues (figure 5). Thanks to this change, the traditional bottleneck at the hardware/software boundary is removed. Even though it was primarily designed for virtualized servers [5], the new NIC architecture is useful in our use case because it can help spread the network load over different cores, over different OS ...
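The multi-queue idea described in this context can be sketched in plain Python (an illustrative toy, not code from the paper): each worker thread drains its own receive queue, the way an RSS-capable NIC hashes incoming packets to per-core queues, so no single shared queue becomes a point of contention.

```python
from queue import Queue
from threading import Thread

NUM_QUEUES = 4  # stands in for the NIC's hardware receive queues

def worker(rx_queue, results, idx):
    # Each worker owns one queue, mirroring one core per NIC queue.
    total = 0
    while True:
        pkt = rx_queue.get()
        if pkt is None:          # sentinel: this queue is drained
            break
        total += len(pkt)
    results[idx] = total

queues = [Queue() for _ in range(NUM_QUEUES)]
results = [0] * NUM_QUEUES

# "Hardware" side: dispatch each packet to a queue (RSS-style hashing,
# reduced here to round-robin for clarity).
packets = [b"x" * (10 + i) for i in range(40)]
for i, pkt in enumerate(packets):
    queues[i % NUM_QUEUES].put(pkt)
for q in queues:
    q.put(None)

threads = [Thread(target=worker, args=(q, results, i))
           for i, q in enumerate(queues)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(results))  # total bytes received across all queues
```

Because every worker touches only its own queue, removing the shared-queue lock is exactly the contention the multi-queue NIC design eliminates at the hardware/software boundary.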

Similar publications

Article
Full-text available
With the advent of IoT and Cloud computing service technology, the size of user data to be managed and file data to be transmitted has increased significantly. To protect users' personal information, it is necessary to encrypt it in a secure and efficient way. Since servers handling a number of clients or IoT devices have to encrypt a large amou...

Citations

... Such analyses deal primarily with the definition of the optimal granularity of parallelism, block size, and management of the memory hierarchy (what to allocate to shared memory and how to avoid bank conflicts) to improve the performance of AES. Other works deal with implementations of AES for specialized applications [42,43] or with automated CUDA kernel generation from C-code [39]. However, the introduction of the AES-NI [44] instruction set extension in the Intel Nehalem processor family provides a fast way to implement AES. ...
Article
Full-text available
The modern trend toward heterogeneous many-core architectures has led to high architectural diversity in both high performance and high-end embedded systems. To effectively exploit the computational resources of such a wide range of architectures, programming languages and APIs such as OpenCL have become increasingly popular. Although OpenCL provides functional code portability and the ability to fine-tune the application to the target hardware, providing performance portability is still an open problem. Thus, many research works have investigated the optimization of specific combinations of application and target platform. In this paper, we aim at leveraging the experience obtained in the implementation of algorithms from the cryptography domain to provide a set of guidelines for performance portability on modern many-core heterogeneous architectures, and to establish a base on which domain-specific languages and compiler transformations could be built in the near future. We study algorithmic choices and the effect of compiler transformations on three representative applications in the chosen domain on a set of seven target platforms. To estimate how well an application fits the architecture, we define a metric of computational intensity both for the architecture and for the application implementation. Besides being useful to compare either different implementations or algorithmic choices and their fitness to a specific architecture, it can also be useful to the compiler to guide the code optimization process. Copyright © 2014 John Wiley & Sons, Ltd.
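The computational-intensity metric this abstract mentions can be illustrated with a minimal roofline-style sketch. All function names and numbers below are invented placeholders for illustration, not the paper's actual metric or measurements:

```python
# Hypothetical sketch of a computational-intensity comparison:
# an implementation's ops-per-byte ratio is compared against the ratio
# the hardware can sustain (peak op throughput over memory bandwidth).

def intensity(ops, bytes_moved):
    """Operations performed per byte of memory traffic."""
    return ops / bytes_moved

def machine_balance(peak_ops_per_s, mem_bw_bytes_per_s):
    """Ops per byte the architecture can sustain at peak."""
    return peak_ops_per_s / mem_bw_bytes_per_s

def is_compute_bound(app_intensity, balance):
    # Above the machine balance point, the kernel is limited by compute
    # throughput rather than memory bandwidth (roofline model).
    return app_intensity >= balance

# Made-up example: a cipher kernel doing ~160 table lookups/xors per
# 16-byte block while streaming 32 bytes (in + out) per block.
app = intensity(160, 32)                 # 5 ops/byte
bal = machine_balance(1e12, 100e9)       # 1 Top/s over 100 GB/s = 10 ops/byte
print(app, is_compute_bound(app, bal))
```

Comparing the two ratios is what lets such a metric rank implementations against a target platform, or guide a compiler's optimization choices, before any code is run.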
... Likewise, GPGPUs offer a better performance/power and performance/cost ratio [2],[48],[49], and such solutions are beginning to appear in the literature. Among the most important are: the solution of simultaneous equations [51], different kinds of transforms [52],[53],[54], solutions of the 2D and 3D wave equation [12],[55], communications [56], and molecular dynamics simulations [57], among others. ...
Article
Full-text available
This article reviews the efforts currently being carried out to reduce the computation time of the MS. We introduce the methods used to perform the migration process, as well as the two computer architectures that currently offer the best processing times. We review the most representative implementations of this process on these two technologies and summarize the contributions of each of these investigations. The article ends with our analysis of the direction that future research in this area should take.
... There is a large number of studies on the development of GPGPU-based computational solutions, and many of these are beginning to appear in the literature. Among the most important are: the solution of simultaneous equations [51], different kinds of transforms [52],[53],[54], solutions of the 2D and 3D wave equation [12],[55], communications [56], and molecular dynamics simulations [57], among others. ...
Article
Full-text available
This article reviews the efforts currently being carried out to reduce the computation time of the MS. We introduce the methods used to perform the migration process, as well as the two computer architectures that currently offer the best processing times. We review the most representative implementations of this process on these two technologies and summarize the contributions of each of these investigations. The article ends with our analysis of the direction that future research in this area should take. PACS: 93.85.Rt MSC: 68M20
... [2] or the application section in the Wikipedia page for GPGPU [1], where many references and links are provided. Symbolic computation is also entering the area of many-core computing, but few reports have been published so far [4] [8] [10] [17]. ...
Article
We present a CUDA implementation of dense multivariate polynomial arithmetic based on Fast Fourier Transforms over finite fields. Our core routine computes on the device (GPU) the subresultant chain of two polynomials with respect to a given variable. This subresultant chain is encoded by values on a FFT grid and is manipulated from the host (CPU) in higher-level procedures. We have realized a bivariate polynomial system solver supported by our GPU code. Our experimental results (including detailed profiling information and benchmarks against a serial polynomial system solver implementing the same algorithm) demonstrate that our strategy is well suited for GPU implementation and provides large speedup factors with respect to pure CPU code.
... [1] or the application section in the Wikipedia page for GPGPU [2], where many references and links are provided. Symbolic computation is also entering the area of many-core computing, but few reports have been published so far [3,4,5,6]. ...
Conference Paper
We present a CUDA implementation of dense multivariate polynomial arithmetic based on Fast Fourier Transforms (FFT) over finite fields. Our core routine computes on the device (GPU) the subresultant chain of two polynomials with respect to a given variable. This subresultant chain is encoded by values on a FFT grid and is manipulated from the host (CPU) in higher‐level procedures, for instance for polynomial GCDs modulo regular chains. We have realized a bivariate polynomial system solver supported by our GPU code. Our experimental results demonstrate that our strategy is well suited for GPU implementation and provides large speedup factors with respect to pure CPU code.
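The values-on-an-FFT-grid encoding this abstract describes can be illustrated with a minimal CPU-side sketch (not the authors' CUDA code): multiplying two polynomials over a finite field GF(p) by a number-theoretic transform (NTT), i.e. evaluating on a root-of-unity grid, multiplying pointwise, and interpolating back.

```python
# Minimal NTT-based polynomial multiplication over GF(P).
P = 998244353          # NTT-friendly prime: P - 1 = 2^23 * 119
G = 3                  # primitive root modulo P

def ntt(a, invert=False):
    n = len(a)
    a = list(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Iterative Cooley-Tukey butterflies.
    length = 2
    while length <= n:
        w = pow(G, (P - 1) // length, P)
        if invert:
            w = pow(w, P - 2, P)       # modular inverse of the root
        for start in range(0, n, length):
            wn = 1
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * wn % P
                a[k] = (u + v) % P
                a[k + length // 2] = (u - v) % P
                wn = wn * w % P
        length <<= 1
    if invert:
        n_inv = pow(n, P - 2, P)
        a = [x * n_inv % P for x in a]
    return a

def poly_mul(f, g):
    # Evaluate on the grid, multiply pointwise, interpolate back.
    n = 1
    while n < len(f) + len(g) - 1:
        n <<= 1
    fa = ntt(f + [0] * (n - len(f)))
    gb = ntt(g + [0] * (n - len(g)))
    prod = ntt([x * y % P for x, y in zip(fa, gb)], invert=True)
    return prod[:len(f) + len(g) - 1]

# (1 + 2x) * (3 + 4x) = 3 + 10x + 8x^2
print(poly_mul([1, 2], [3, 4]))
```

Keeping intermediate objects as values on the grid, as the paper does for subresultant chains, means pointwise operations parallelize trivially, which is what makes the encoding a good fit for a GPU.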
... [3] presents an AES implementation on CUDA and reports throughput above 10 Gbps. [15] performs a comparison similar to the one proposed in this paper, comparing the CPU and the GPGPU. The conclusion drawn is that commodity processors alone cannot meet the stated objectives, and that finding a solution to accelerate AES on a GPU proves extremely difficult. ...
Article
Full-text available
The main goal of this work is to analyze the possibility of using a graphics processing unit for non-graphical computations. Graphics processing units are nowadays used not only for game engines and movie encoding/decoding but also for a vast range of applications, such as cryptography. We used the graphics processing unit as a cryptographic coprocessor in order to accelerate the AES algorithm. Our implementation of AES runs on a GPU using the CUDA architecture. The performance obtained shows that the CUDA implementation can offer a throughput of 11.95 Gbps. The tests are conducted in two directions: running on small data sets held in memory and on large data stored in files on hard drives.
Thesis
In this thesis we address the problem of designing a very high-throughput IPsec gateway to secure communications between local networks. We propose two architectures: a purely software gateway running on a single server, called integrated, and a gateway using several servers and a cryptographic hardware module, called split. The first part of our work studies the performance of the two proposed architectures. We begin by showing that an off-the-shelf server is limited by its computing power in reaching the goal of encrypting and communicating at 10 Gb/s. Moreover, recent graphics cards, although promising in terms of raw power, are not well suited to the problem of encrypting network packets (because of the packets' small size). We therefore build a network stack distributed over several machines and parallelize it within the split architecture. In a second phase, we analyze the integration of a gateway into a network, in particular the interaction of the ICMP control protocol with IPsec. ICMP is particularly important for reaching high throughput because of its role in the packet-size optimization mechanism. To this end, we developed IBTrack, a tool for studying the behavior of routers along a path with respect to ICMP. We then show that ICMP is an attack vector against IPsec, exploiting a fundamental flaw in the IP and IPsec standards: the IP packet overhead created by tunnel mode conflicts with the minimum maximum packet size guaranteed by IP.
Article
Full-text available
The introduction of various cryptographic modes of operation was motivated by known imperfections of symmetric block algorithms. The design of some cryptographic modes of operation has already been exploited as an idea for parallelizing the execution of certain algorithms. To the best of our knowledge, there is no evidence in the available literature that output feedback (OFB) mode, which is used in satellite communications, has ever been parallelized. In this paper, we consider the performance of a convenient mode of operation that performs tweakable parallel encryption using xor-encrypt-xor (XEX) and xor-encrypt (XE) constructions in an OFB-like mode. We use an idea similar to XTS-AES in order to create two parallel tweakable block ciphers. The first is designed using the XEX construction, while the second is based on the XE construction. Each cipher uses two threads to produce the corresponding keystreams. The keystreams are first merged with each other and then used in a modified tweakable parallel OFB mode of operation. As a proof of concept, we have implemented a Java application in which these parallel solutions are applied to collect empirical data. The results obtained show that under certain conditions the tweakable parallel OFB modes using the XEX and XE constructions can achieve performance accelerations of up to 10% and 20%, respectively.
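The OFB feedback loop that this article sets out to parallelize can be sketched as follows. This is a toy illustration only: a keyed SHA-256 call stands in for the block cipher (it is not a real cipher and not the article's XEX/XE construction).

```python
import hashlib

BLOCK = 16  # block size in bytes

def toy_block_cipher(key: bytes, block: bytes) -> bytes:
    # Stand-in keyed permutation; NOT a real block cipher.
    return hashlib.sha256(key + block).digest()[:BLOCK]

def ofb(key: bytes, iv: bytes, data: bytes) -> bytes:
    # OFB: the keystream is produced by repeatedly applying the cipher
    # to its own previous output; data is only XORed in afterwards.
    out = bytearray()
    state = iv
    for i in range(0, len(data), BLOCK):
        state = toy_block_cipher(key, state)   # the feedback chain
        chunk = data[i:i + BLOCK]
        out += bytes(c ^ k for c, k in zip(chunk, state))
    return bytes(out)

key, iv = b"k" * BLOCK, b"\x00" * BLOCK
msg = b"attack at dawn / OFB demo payload"
ct = ofb(key, iv, msg)
# OFB is its own inverse: applying it to the ciphertext restores the plaintext.
print(ofb(key, iv, ct) == msg)
```

Because each keystream block depends on the previous one, the chain is inherently sequential; this is precisely the obstacle that the article's XEX/XE tweakable constructions are designed to work around.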
Article
OpenCL applications may present tight constraints on work-group size due to algorithm design or the chosen implementation strategy. This may hamper functional or performance portability across different platforms due to a lack of resources. The current solution is to re-design the implementation, optimizing it for the new platform. However, this can become a showstopper for new platforms, for which a large manual optimization effort is needed to port benchmark suites and applications. In this work, we aim at tackling such issues by applying work-item coalescing techniques to optimize the mapping of work-items to processing elements. However, this is generally not sufficient to achieve good performance, as different design patterns may be applied to exploit the specific features of the target architecture. We show how additional target-specific transformations can improve performance with respect to the work-item coalescing baseline. We employ a Matrix Multiply case study to show how work-item coalescing transformations can impact functional portability, together with providing an opportunity to automatically insert the use of asynchronous copies on embedded many-core platforms endowed with such a feature.