BOOK OF ABSTRACTS of PARNUM 2017 - 11th International Workshop on Parallel Numerics

Abstract
This is the book of abstracts of talks presented at PARNUM 2017, the 11th workshop on Parallel Numerics held in Waischenfeld, Germany, April 19-21, 2017. The objective of this workshop was the exchange of research results in the area of parallel scientific computing, parallel algorithms, and high performance computing. For further details we refer to our workshop website www.parnum2017.fau.de .
PARNUM 2017
International Workshop on Parallel Numerics
April 19 – 21, 2017
Fraunhofer Research Campus
Waischenfeld, Germany
BOOK OF ABSTRACTS
Program Committee
Hans-Joachim Bungartz, Technical University of Munich, Germany
Dietmar Fey, University of Erlangen-Nürnberg, Germany
Gundolf Haase, Karl-Franzens-University, Graz, Austria
Axel Klawonn, University of Cologne, Germany
Günter Leugering, University of Erlangen-Nürnberg, Germany
Miriam Mehl, University of Stuttgart, Germany
Gabriel Okša, Slovak Academy of Sciences, Bratislava, Slovakia
Roman Trobec, Jožef Stefan Institute, Ljubljana, Slovenia
Pavel Tvrdík, Czech Technical University, Prague, Czech Republic
Roman Wyrzykowski, Czestochowa University of Technology, Poland
Program Chairmen
Ulrich Rüde, University of Erlangen-Nürnberg, Germany
Marián Vajteršič, University of Salzburg, Austria &
Slovak Academy of Sciences, Bratislava, Slovakia
Organizing Committee
Dominik Bartuschat, University of Erlangen-Nürnberg, Germany
Julia Deserno, University of Erlangen-Nürnberg, Germany
Editors: Dominik Bartuschat, Ulrich Rüde, and Marián Vajteršič
April 2017, Erlangen, Germany
Speaker Index

Iain Duff: Direct solution of sparse linear equations on parallel computers
Jean-Matthieu Gallard: Code generation for a high order ADER-DG solver in a hyperbolic PDE engine
Selime Gürol: Parallelization in the time dimension of geophysical data assimilation problems
Kamil Halbiniak: Exploring programming models for accelerating scientific applications on hybrid CPU-MIC platforms
Thomas Heller: Asynchronous integration of CUDA/OpenCL within HPX for utilizing full cluster capabilities
Michael Hofmann: Transparent execution of numerical libraries on distributed HPC platforms
Thomas Huckle: Parallel solution of tridiagonal matrices
Imad Kissami: HPC as a service for computational fluid dynamics problems
Ivan Kotenkov: Design of cache-efficient multithreaded sparse matrix format for modern era
Rade Kutil, Markus Flatz: Convergence and parallelization of nonnegative matrix factorization with Newton iteration
Martin Lanser: A framework for nonlinear FETI-DP and BDDC methods
Manfred Liebmann: Explicit vectorization as a design tool for parallel algorithms on modern hardware architectures
Mária Lucká: Parallel multi-density based clustering
Alban Lumi: Energy aware computations on manycore systems
Lois Curfman McInnes: Community software ecosystems for high-performance computational science: Opportunities and challenges
Michael Obersteiner: A highly scalable MPI parallelization of the Fast Multipole Method
Gabriel Okša: Convergence of the parallel Block-Jacobi EVD algorithm for Hermitian matrices
Michael Rippl: Efficient transformation of the general eigenproblem with symmetric banded matrices to a banded standard eigenproblem
Stefan Rosenberger: OpenACC parallelization for the solution of the Bidomain equations
Miroslav Rozložník: The factors in the SR decomposition and their conditioning
Louise Spellacy: Partial inverses of block tridiagonal non-Hermitian matrices
Robert Spir: Workflow for parallel processing of biomedical images
Linda Stals: Use of domain decomposition for the solution of the Thin Plate Spline saddle point problem
Zdeněk Strakoš: On the numerical stability analysis of pipelined Krylov subspace methods
Jonas Thies: Employing HPC for analyzing nonlinear PDE systems beyond simulation
Roman Trobec: Impact of interconnection network topology on parallel performance – a survey
Miroslav Tůma: Mixed sparse-dense linear least squares and preconditioned iterative methods
Tim Werner: Efficient GPU-based Smoothed Particle Hydrodynamics
Elias Wimmer: Is Gossip-inspired reduction competitive in high performance computing?
Yusaku Yamamoto: Roundoff error analysis of the CholeskyQR2 and related algorithms
DIRECT SOLUTION OF SPARSE LINEAR EQUATIONS
ON PARALLEL COMPUTERS
Iain S. Duff
STFC Rutherford Appleton Laboratory,
Harwell Campus, Didcot, Oxfordshire, UK
and
CERFACS, Météo-France, Toulouse, France
iain.duff@stfc.ac.uk
As part of the H2020 FET-HPC Project NLAFET, we are studying the scalability of algorithms and software that use direct methods for solving large sparse systems of equations. We briefly discuss the structure of NLAFET and the scope of the Project. We then discuss the algorithmic approaches for solving sparse systems: positive definite, symmetric indefinite, and unsymmetric. An important aspect of most of our algorithms is that, although we are solving sparse equations, most of the kernels are for dense linear algebra. We show why this is the case with a simple example before illustrating the various levels of parallelism available in the sparse case. We examine the benefits of using standard run-time systems to assist us in developing codes for extreme-scale computers.
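To give a flavor of why dense kernels dominate (this is our illustration, not necessarily the example used in the talk): eliminating the first variable of an arrowhead matrix fills the remaining trailing block completely, so all subsequent work is dense.

```latex
% Sketch (our illustration): Gaussian elimination on an arrowhead matrix.
% Eliminating x_1 updates a_{ij} <- a_{ij} - a_{i1} a_{1j} / a_{11} for
% i, j > 1, so the zero trailing block fills in completely and the
% remaining factorization is dense linear algebra.
\[
A =
\begin{pmatrix}
a_{11} & a_{12} & a_{13} & a_{14}\\
a_{21} & a_{22} & 0 & 0\\
a_{31} & 0 & a_{33} & 0\\
a_{41} & 0 & 0 & a_{44}
\end{pmatrix}
\;\longrightarrow\;
\begin{pmatrix}
a_{11} & a_{12} & a_{13} & a_{14}\\
0 & \ast & \ast & \ast\\
0 & \ast & \ast & \ast\\
0 & \ast & \ast & \ast
\end{pmatrix}.
\]
```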
For sparse matrices that are very unsymmetric, in the sense that their structure is quite different from the structure of $|A| + |A^T|$, we use sparse data structures. We discuss the design, coding, and performance of software for this case, including the development of a parallel threshold-Markowitz algorithm.
We illustrate our talk with runs of prototype codes that will be developed for inclusion in
a library being developed in the context of the NLAFET Project.
The work described in this talk has been conducted by the STFC NLAFET Team, which comprises Florent Lopez, Stojce Nakov, and Vedran Novakovic.
CODE GENERATION FOR A HIGH ORDER ADER-DG
SOLVER IN A HYPERBOLIC PDE ENGINE
Jean-Matthieu Gallard
Department of Informatics, Technical University of Munich,
Garching, Germany
jean-matthieu.gallard@tum.de
In this talk the use of code generation to improve the performance and the energy efficiency of the solver engine ExaHyPE is discussed. ExaHyPE is a Horizon 2020 EU project to develop a high-performance engine for solving hyperbolic systems of PDEs using the high-order discontinuous Galerkin finite element method [1]. The engine will be flexible in order to support various applications and will be tailored towards expected exascale architectures. One of the main goals of the project is therefore to provide the end-user with an abstraction of the complicated algorithms that implement the ADER-DG numerical scheme and of the issues related to scalability and parallel adaptive mesh refinement, which are handled internally by the Peano framework [2].
Code generation within the engine produces optimized code that is tailored to the specific
PDE problem, to the chosen polynomial order for the ADER-DG scheme, and especially
to the compute architecture used. Compute kernels for the ADER-DG scheme exploit the
high performance LIBXSMM library [3] for small matrix multiplications occurring in the
element-local kernels, use tailored data layouts, and support compiler auto-vectorization. The generated optimized kernels offer a speedup of a factor of 2.5 compared to a generic C++ implementation and are currently being benchmarked and improved with regard to performance and energy consumption. First results will be presented for benchmark scenarios in seismology and astrophysics.
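The following sketch illustrates the principle behind such generated kernels (our illustration; the function names and the instantiated size are hypothetical, not ExaHyPE's actual generated code): fixing the loop bounds at code-generation time lets the compiler fully unroll and auto-vectorize.

```cpp
#include <cstddef>

// Generic kernel: sizes are runtime values, so the compiler must emit
// conservative code with runtime loop bounds.
void matmul_generic(const double* A, const double* B, double* C,
                    std::size_t M, std::size_t N, std::size_t K) {
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t j = 0; j < N; ++j) {
      double s = 0.0;
      for (std::size_t k = 0; k < K; ++k) s += A[i * K + k] * B[k * N + j];
      C[i * N + j] = s;
    }
}

// "Generated" kernel: the code generator fixes M, N, K at compile time
// (modeled here via templates) for the chosen polynomial order, so loop
// bounds are constants and the compiler can unroll and auto-vectorize.
template <int M, int N, int K>
void matmul_fixed(const double* A, const double* B, double* C) {
  for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j) {
      double s = 0.0;
      for (int k = 0; k < K; ++k) s += A[i * K + k] * B[k * N + j];
      C[i * N + j] = s;
    }
}

// One hypothetical instantiation for a given ADER-DG order:
template void matmul_fixed<6, 6, 6>(const double*, const double*, double*);
```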
References
[1] O. Zanotti, F. Fambri, M. Dumbser, and A. Hidalgo. Space–time adaptive ADER discon-
tinuous Galerkin finite element schemes with a posteriori sub-cell finite volume limiting.
Comput. Fluids, 118:204–224, 2015. doi:10.1016/j.compfluid.2015.06.020.
[2] T. Weinzierl. The Peano software - parallel, automaton-based, dynamically adaptive grid traversals. CoRR, abs/1506.04496, 2015. url: http://arxiv.org/abs/1506.04496.
[3] A. Heinecke, G. Henry, M. Hutchinson, and H. Pabst. LIBXSMM: Accelerating Small Matrix Multiplications by Runtime Code Generation. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '16. IEEE Press, 2016, pages 84:1–84:11. url: http://dl.acm.org/citation.cfm?id=3014904.3015017.
PARALLELIZATION IN THE TIME DIMENSION OF
GEOPHYSICAL DATA ASSIMILATION PROBLEMS
Mike Fisher
European Centre for Medium-Range Weather Forecasts, Reading, UK
Serge Gratton
INP-ENSEEIHT, Toulouse, France
serge.gratton@enseeiht.fr
Selime Gürol
CERFACS, Météo-France, Toulouse, France
gurol@cerfacs.fr
In this talk we will address the numerical solution of the saddle point system arising from
four dimensional variational (4D-Var) data assimilation, including a study of preconditioning.
This new saddle point formulation [1] of 4D-Var allows parallelization in the time dimension. It therefore represents a crucial step towards higher computational efficiency, since 4D-Var approaches otherwise require many sequential computations.
In recent years, there has been increasing interest in saddle point problems which arise
in many other applications such as constrained optimization, computational fluid dynamics,
optimal control and so forth. The key issue of solving saddle point systems with Krylov
subspace methods is to find efficient preconditioners. This talk will focus on the new low-
rank limited memory preconditioners [2] exploiting the particular structure of the problem.
Numerical experiments performed within the Object Oriented Prediction System (OOPS) are
presented.
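For orientation, saddle point systems have the generic 2x2-block form below (our generic notation, not the specific blocks of [1]); the key issue mentioned above is preconditioning such systems for Krylov subspace methods.

```latex
% Generic saddle point system (illustrative notation): A is symmetric
% positive (semi)definite and B has full row rank.
\[
\begin{pmatrix} A & B^{T} \\ B & 0 \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
=
\begin{pmatrix} f \\ g \end{pmatrix}.
\]
% In the 4D-Var setting the blocks couple all time slots at once, which is
% what opens the door to parallelization in the time dimension.
```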
References
[1] M. Fisher and S. Gürol. Parallelization in the time dimension of four-dimensional variational data assimilation. Q. J. R. Meteorol. Soc., 143(703):1136–1147, 2017. doi:10.1002/qj.2997.
[2] M. Fisher, S. Gratton, S. Gürol, Y. Trémolet, and X. Vasseur. Low rank updates in preconditioning the saddle point systems arising from data assimilation problems. Optim. Meth. Softw.:1–25, 2016. doi:10.1080/10556788.2016.1264398.
EXPLORING PROGRAMMING MODELS FOR
ACCELERATING SCIENTIFIC APPLICATIONS ON
HYBRID CPU-MIC PLATFORMS
Roman Wyrzykowski, Lukasz Szustak, Kamil Halbiniak
Faculty of Mechanical Engineering and Computer Science,
Czestochowa University of Technology, Poland
roman@icis.pcz.pl
Modern heterogeneous computing platforms have become powerful HPC solutions, which
could be applied to a wide range of real-life applications. In particular, the hybrid platforms
equipped with Intel Xeon Phi coprocessors offer the advantages of massively parallel com-
puting, while supporting practically the same (or similar) parallel programming model as
conventional homogeneous solutions. However, there is still an open issue as to how scientific
applications can efficiently utilize hybrid platforms with Intel MIC coprocessors.
In [1], we proposed a method for porting a real-life scientific application to computing platforms with Intel MICs. We focused on the parallel implementation of a numerical
model of alloy solidification, which is based on the generalized finite difference method. We
developed a sequence of steps that are necessary for porting this application to platforms with
accelerators, assuming no significant modifications of the code. The proposed method consid-
ers not only overlapping computations with data movements, but also takes into account an
adequate utilization of cores/threads and vector units. Using parallel resources of one Intel
Xeon Phi coprocessor (KNC architecture), the developed approach allowed us to execute the
whole application 3.45 times faster than the original parallel version running on two CPUs.
In this work, we focus on studying various heterogeneous programming models for accelerating the solidification application on hybrid CPU-MIC platforms, considering two models: the OpenMP 4.0 Accelerator Model and the Hetero Streams Library (hStreams for short) [2]. The main challenge for achieving the desired high performance is now to take advantage of CPUs and coprocessors working together, with all the available threads of the CPUs and Intel MICs utilized coherently to solve the modelling problem.
In the paper, we present a performance comparison of the above-mentioned models for various configurations of computing resources. In particular, using the hStreams library, our approach allows us to parallelize the solidification application efficiently on hybrid platforms with two CPUs and two MICs, and to accelerate computations by a factor of about 10.5 in comparison with the basic version for two CPUs. We also conclude that while OpenMP provides a unified directive-based programming model, the current stable version of this standard is not efficient on multi-device heterogeneous platforms. That is why we plan to investigate new features available in version 4.5 of OpenMP, such as the asynchronous offload mechanism.
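As a point of reference, the OpenMP 4.0 Accelerator Model expresses offloading with target and map directives; the sketch below is a minimal illustration (our example, not the authors' solidification code).

```cpp
#include <vector>

// Minimal OpenMP 4.0 accelerator-model sketch (illustrative only). The
// target region is offloaded to a coprocessor, with explicit data mapping
// between host and device memory.
void offload_step(const std::vector<double>& u, std::vector<double>& u_new,
                  const std::vector<double>& f) {
  const int n = static_cast<int>(u.size());
  const double* up = u.data();
  const double* fp = f.data();
  double* vp = u_new.data();

  // map(to:) copies inputs to the device; map(tofrom:) also copies the
  // result array back after the region finishes.
  #pragma omp target map(to: up[0:n], fp[0:n]) map(tofrom: vp[0:n])
  #pragma omp parallel for simd
  for (int i = 1; i < n - 1; ++i)
    vp[i] = 0.5 * (up[i - 1] + up[i + 1]) + fp[i];
}
```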
References
[1] L. Szustak et al. Toward parallel modeling of solidification based on the generalized finite
difference method using Intel Xeon Phi. PPAM 2015, Part I. LNCS, 9573:411–412, 2016.
[2] Ch. J. Newburn et al. Heterogeneous streaming. IPDPSW, AsHES, 2016.
ASYNCHRONOUS INTEGRATION OF CUDA/OPENCL
WITHIN HPX FOR UTILIZING FULL CLUSTER
CAPABILITIES
P. Diehl (1,5), T. Heller (2,5), L. Troska (1), H. Kaiser (3,5), D. Fey (2), M. A. Schweitzer (1,4)
(1) Institute for Numerical Simulation, University of Bonn, Germany
(2) Department of Computer Science, University of Erlangen-Nürnberg, Germany
(3) Center for Computation and Technology, Louisiana State University, USA
(4) Meshfree Multiscale Methods, Fraunhofer SCAI, Schloss Birlinghoven, Germany
(5) The STE||AR Group (http://stellar-group.org)
Experience shows that on today’s high performance systems the utilization of different
acceleration cards in conjunction with a high utilization of all other parts of the system is
difficult. Future architectures, like exascale style clusters, are expected to aggravate this is-
sue as the number of cores are expected to increase and memory hierarchies are expected to
become deeper. One big aspect for distributed applications is to guarantee high utilization
of all available resources, including local or remote acceleration cards on a cluster while fully
using all the available CPU resources and the integration of the GPU work into the overall
programming model. For the integration of CUDA and OpenCL code we extended HPX [1, 2],
a general purpose C++ run time system for parallel and distributed applications of any scale,
and enabled asynchronous data transfers from and to the GPU device and the asynchronous
invocation of CUDA- and OpenCL kernels on this data. Both operations are well integrated
into the general programming model of HPX, which allows any GPU operation to be seamlessly overlapped with work on the main cores. Any user-defined CUDA or OpenCL kernel can be
launched on any (local or remote) GPU device available to the distributed application. We
present asynchronous implementations for the data transfers and kernel launches for CUDA
and OpenCL code as part of an HPX asynchronous execution graph. Using this approach we can combine all remotely and locally available acceleration cards on a cluster to utilize its full performance capabilities. Benchmarks show that the integration of the asynchronous opera-
tions (data transfer + launches of the kernels) as part of the HPX execution graph imposes
no additional computational overhead and significantly eases orchestrating coordinated and
concurrent work on the main cores and the used GPU devices.
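The composition idea can be sketched as follows (our illustration of the pattern; `gpu_saxpy_async` is a hypothetical wrapper around an asynchronous kernel launch plus device-to-host transfer, not part of the HPX API).

```cpp
#include <hpx/include/lcos.hpp>  // hpx::future, hpx::async
#include <vector>

// Hypothetical wrapper (assumption, for illustration): launches a GPU
// kernel asynchronously and returns the result as an hpx::future.
hpx::future<std::vector<float>> gpu_saxpy_async(float a,
                                                std::vector<float> x,
                                                std::vector<float> y);

hpx::future<double> pipeline(std::vector<float> x, std::vector<float> y) {
  // Launch the (local or remote) GPU work; no CPU thread blocks here.
  hpx::future<std::vector<float>> gpu =
      gpu_saxpy_async(2.0f, std::move(x), std::move(y));

  // Continue on a CPU worker as soon as the GPU result arrives; meanwhile
  // the runtime is free to schedule unrelated CPU tasks, overlapping GPU
  // operations with work on the main cores.
  return gpu.then([](hpx::future<std::vector<float>> r) {
    double s = 0.0;
    for (float v : r.get()) s += v;
    return s;
  });
}
```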
References
[1] H. Kaiser, T. Heller, D. Bourgeois, and D. Fey. Higher-level Parallelization for Local
and Distributed Asynchronous Task-based Programming. In Proceedings of the First
International Workshop on Extreme Scale Programming Models and Middleware, ESPM
’15. ACM, 2015, pages 29–37. doi:10.1145/2832241.2832244.
[2] H. Kaiser, B. Adelstein-Lelbach, T. Heller, A. Berg´e, and J. Biddiscombe. hpx: HPX
V0.9.99: A general purpose C++ runtime system for parallel and distributed applications
of any scale. 2016. doi:10.5281/zenodo.58027.
TRANSPARENT EXECUTION OF NUMERICAL
LIBRARIES ON DISTRIBUTED HPC PLATFORMS
Michael Hofmann, Florian Polster, Riko Streller, Daniel Walther
Department of Computer Science, Chemnitz University of Technology, Germany
mhofma@cs.tu-chemnitz.de
The usage of numerical software libraries is well established in scientific computing as they
can provide advanced methods and efficient implementations for solving common problems
of scientific applications. One major goal of using these libraries is to improve the appli-
cation performance by fully exploiting the available computational hardware. For example,
BLAS libraries such as OpenBLAS or cuBLAS provide efficient implementations of linear
algebra operations that exploit modern multicore processors or graphics processing units. In
this contribution, we propose a method for redirecting the execution of an existing numerical
software library to a distributed HPC platform. The redirection is transparent in the sense
that the application does not have to distinguish whether the utilized library functions are
executed locally or in a distributed manner. Thus, the method allows the computational power of HPC platforms to be exploited even by non-parallel application codes. Our proposed solution provides replacements for the utilized library functions that can be used without additional programming effort for adapting the application code. Furthermore, by providing the replacement
functions as a shared library, the redirection can also be applied to applications that are only
available as a binary executable. We demonstrate the approach for several numerical soft-
ware libraries, such as BLAS and LAPACK libraries for linear algebra operations, the FFTW
library for fast Fourier transforms, and the ScaFaCoS library for fast Coulomb interactions
in particle systems. This includes sequential libraries as well as parallel libraries based on
multi-threading or MPI. The implementation utilizes the Simulation Component and Data
Coupling library [1] for performing the program interactions and the data transfers between
the locally executed application and the numerical software library executed on a distributed
HPC platform. Experimental results are presented to investigate the overhead of the required
data transfers and the achieved performance improvements.
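The binary-level redirection can be pictured as symbol interposition (our sketch; `remote_dgemm` is a hypothetical name standing in for the actual transport through the coupling library [1]). Built as a shared library and preloaded (e.g. via LD_PRELOAD), it replaces the BLAS symbol even for applications that exist only as executables.

```cpp
// Provided by the redirection runtime (hypothetical name and signature):
// ships the operands to the remote HPC platform, executes the distributed
// GEMM there, and receives C back.
extern "C" void remote_dgemm(const char*, const char*, const int*, const int*,
                             const int*, const double*, const double*,
                             const int*, const double*, const int*,
                             const double*, double*, const int*);

// Replacement for the standard Fortran BLAS entry point dgemm_. When this
// shared library is preloaded, the dynamic linker resolves the application's
// BLAS calls here instead of in the local BLAS implementation.
extern "C" void dgemm_(const char* transa, const char* transb,
                       const int* m, const int* n, const int* k,
                       const double* alpha, const double* a, const int* lda,
                       const double* b, const int* ldb,
                       const double* beta, double* c, const int* ldc) {
  remote_dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
}
```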
Acknowledgments This work was performed within the Federal Cluster of Excellence EXC
1075 “MERGE Technologies for Multifunctional Lightweight Structures” and supported by
the German Research Foundation (DFG). Financial support is gratefully acknowledged.
References
[1] M. Hofmann and G. R¨unger. Sustainability through flexibility: Building complex simula-
tion programs for distributed computing systems. Simul. Model. Pract. Theory, 58(1):65–
78, 2015. Special Issue on Techniques And Applications For Sustainable Ultrascale Com-
puting Systems. doi:10.1016/j.simpat.2015.05.007.
PARALLEL SOLUTION OF TRIDIAGONAL MATRICES
Thomas Huckle
Department of Informatics, Technical University of Munich,
Garching, Germany
huckle@in.tum.de
For solving sparse linear systems of equations iteratively, an efficient preconditioner is necessary that also allows fast application in parallel. ILU is such a preconditioner, and it can be computed efficiently in parallel [1]. For eigenvalue algorithms like MRRR, twisted factorizations of tridiagonal matrices are also used, but there the factorizations have to be computed to high accuracy. In these applications, the fixed point iteration of Chow for computing the ILU takes too many iterations to converge. Therefore, in this talk we will discuss fast iterative methods for computing (twisted) factorizations and for efficiently solving linear systems with (twisted) bidiagonal matrices in parallel.
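For reference, the fixed point iteration of [1] sweeps all positions of the prescribed sparsity pattern concurrently, using the following equations (standard form; our transcription from [1]):

```latex
% Fixed point equations of fine-grained parallel ILU [1]: all pattern
% positions (i, j) are updated concurrently, with l_{ii} = 1 fixed and
% the sums restricted to entries inside the sparsity pattern.
\[
l_{ij} = \frac{1}{u_{jj}}\Bigl(a_{ij} - \sum_{k=1}^{j-1} l_{ik}\,u_{kj}\Bigr)
\quad (i > j),
\qquad
u_{ij} = a_{ij} - \sum_{k=1}^{i-1} l_{ik}\,u_{kj} \quad (i \le j).
\]
```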
References
[1] E. Chow and A. Patel. Fine-grained parallel incomplete LU factorization. SIAM J. Sci.
Comput., 37(2):C169–C193, 2015. doi:10.1137/140968896.
HPC AS A SERVICE FOR COMPUTATIONAL FLUID
DYNAMICS PROBLEMS
Fayssal Benkhaldoun
LAGA - Institut Galilée, Université Paris 13, Villetaneuse, France
fayssal@math.univ-paris13.fr
Christophe C´erin
LIPN - Institut Galilée, Université Paris 13, Villetaneuse, France
christophe.cerin@lipn.univ-paris13.fr
Imad Kissami
LAGA & LIPN - Institut Galilée, Université Paris 13, Villetaneuse, France
imad@lipn.univ-paris13.fr
Walid Saad
LIPN & University of Tunis, LATICE, ENSIT, Tunis, Tunisia
walid.saad@lipn.univ-paris13.fr
Abstract
In this paper we describe the full cycle of transforming a parallel code for a Computational Fluid Dynamics (CFD) problem into a parallel version for the RedisDG workflow engine. This system is able to capture heterogeneous and highly dynamic environments, thanks to opportunistic scheduling strategies. It also captures multi-criteria approaches to decide the allocation of tasks to machines. We show how to move to the field of 'HPC as a Service' in order to use heterogeneous platforms and to also investigate performance metrics other than the makespan (the minimum completion time). We also provide an experimental evaluation of the implemented solution, where we discuss the accuracy of the multi-criteria approach. This paper argues that new models for High Performance Computing are possible, provided we rethink HPC in light of the potential of new paradigms such as cloud and edge computing. New challenges are to aggregate resources, from anywhere, at any time, under Service Level Agreement (SLA) constraints.
Introduction
Our research focuses on the design of systems for heterogeneous and highly dynamic environments, notably clouds, desktop grids and volunteer computing projects. The overall objective is to execute computational codes in such environments, progressively moving from a traditional view of High Performance Computing (HPC) to service-oriented and workflow-oriented views. The context we consider is also of particular interest for the development of the extreme edge and the edge cloud. A hard question caused by this dynamicity is that, given a workflow to schedule, we do not have any a priori knowledge of the resources that are available. To address it, we propose to implement a Publish-Subscribe based mechanism for resource discovery and allocation. The mechanism is implemented in a prior system we developed to mimic desktop grid environments: the RedisDG system.
Recall that the Publish-Subscribe paradigm is an asynchronous mode of communication between entities. Some users, namely the subscribers (also called clients or consumers), express and record their interests in the form of subscriptions, and are later notified of events produced by other users, namely the producers.
This communication mode is multi-point, anonymous and implicit. Thus, it allows spa-
tial decoupling (the interacting entities do not know each other), and time decoupling (the
interacting entities do not need to participate at the same time). This total decoupling be-
tween the production and the consumption of services increases the scalability by eliminating
many sorts of explicit dependencies between participating entities. Eliminating dependen-
cies reduces the coordination needs and consequently the synchronizations between entities.
These advantages make the communicating infrastructure well suited to the management of
distributed systems and simplify the development of a middleware for the coordination of
components in our workflow engine context or for applications running in different domains
and communicating through middleware solutions deployed on clouds.
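A minimal broker makes this decoupling concrete (our sketch; RedisDG realizes the same pattern on top of Redis, and the time decoupling would additionally require persisted message queues, omitted here).

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Minimal publish-subscribe broker (illustrative sketch). Producers and
// consumers never reference each other, only topics -- the spatial
// decoupling described above.
class Broker {
 public:
  using Handler = std::function<void(const std::string&)>;

  // Subscribers record their interest in a topic.
  void subscribe(const std::string& topic, Handler h) {
    subscribers_[topic].push_back(std::move(h));
  }

  // Publishing notifies every current subscriber of the topic; the
  // producer does not know how many there are, or who they are.
  void publish(const std::string& topic, const std::string& message) {
    for (auto& h : subscribers_[topic]) h(message);
  }

 private:
  std::map<std::string, std::vector<Handler>> subscribers_;
};
```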
Indeed, we support the thesis that for building systems for heterogeneous and highly
dynamic environments we need to be compliant with:
1. a publish-subscribe layer for the orchestration of the components of the system;
2. a set of opportunistic strategies for allocating work/tasks that are also based on the
publish-subscribe layer;
3. a small number of software dependencies for the system and the ability to deploy the system and its applications on demand. This point is of particular interest in this paper: we promote ease of use and systems that can be deployed without a system administrator.
In this paper we propose a solution for item 2 that is implemented into the RedisDG
system. We consider a CFD problem and we execute our CFD solution, obtained from
transforming a parallel code into a workflow, on top of the RedisDG system. We conduct
experiments to validate our approach.
The organisation of the presentation for the workshop is as follows. First, we introduce the numerical problem we are faced with and summarize some related work. Second, we introduce a parallel solution of the problem in the spirit of MPI programming. Third, we explain how to provide a workflow-oriented view for solving the numerical problem. Then, we explain our new strategy for allocating tasks of the workflow and show experimental results demonstrating the potential of this strategy. Experiments are conducted on six geographically distributed clusters in the Grid'5000 testbed. Finally, we conclude the presentation.
DESIGN OF CACHE-EFFICIENT MULTITHREADED
SPARSE MATRIX FORMAT FOR MODERN ERA
Ivan Kotenkov, Ivan Šimeček, Daniel Langr
Faculty of Information Technology, Czech Technical University in Prague,
Czech Republic
koteniv1@fit.cvut.cz, xsimecek@fit.cvut.cz, daniel.langr@fit.cvut.cz
The most common routines in numerical linear algebra are sparse matrix-vector multi-
plication and transposed sparse matrix-vector multiplication. In the further text, we denote
these operations as sparse multiplication. These operations are crucial (and the most time-
consuming) in iterative solvers of sparse systems of linear equations. In these applications, a large number of sparse multiplications with the same matrix A is executed.
Matrices emerging in HPC applications need to be mapped to compute nodes such that each node holds some portion of the matrix nonzero elements in its memory. The overall performance and scalability depend strongly on the matrix partitioning, the matrix-to-node mapping, and the matrix storage format used. Many storage formats for sparse matrices have been developed. Since the commonly used storage formats (like COO or CSR, introduced more than 45 years ago) are not sufficient for high-performance computations, extensive research has been conducted on maximizing the computational efficiency of these routines [1].
We present a new storage format and related algorithms for sparse multiplication that is
designed for efficient parallel (multithreaded) execution and has the following features:
- It is a hierarchical format (inspired by the ELL idea) that has good spatial and temporal locality.
- For highly parallel architectures, some of the challenges of sparse multiplication are thread divergence and load imbalance when operating on matrices with a high standard deviation in the number of non-zero elements per row [2]. We address these problems by grouping rows into bins in a modification of the ELL format.
- Using target-specific versions of the algorithms, we achieve very good overall performance.
References
[1] I. Šimeček and D. Langr. Space and execution efficient formats for modern processor architectures. In 2015 17th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2015, pages 98–105. doi:10.1109/SYNASC.2015.24.
[2] A. Ashari, N. Sedaghati, J. Eisenlohr, S. Parthasarathy, and P. Sadayappan. Fast Sparse Matrix-Vector Multiplication on GPUs for Graph Applications. SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, 2014. doi:10.1109/sc.2014.69.
CONVERGENCE AND PARALLELIZATION OF
NONNEGATIVE MATRIX FACTORIZATION WITH
NEWTON ITERATION
Rade Kutil (1), Markus Flatz (1), Marián Vajteršič (1,2)
(1) Department of Computer Sciences, University of Salzburg, Austria
(2) Department of Informatics, Slovak Academy of Sciences, Bratislava, Slovakia
rkutil@cosy.sbg.ac.at
The goal of Nonnegative Matrix Factorization (NMF) is to represent a large nonnegative
matrix A in an approximate way as a product WH of two significantly smaller nonnegative
matrices. Applications of the NMF include text mining, document classification, clustering,
spectral data analysis, face recognition, and computational biology.
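In the usual formulation, with $A$ of size $m \times n$, $W$ of size $m \times r$, $H$ of size $r \times n$, and an inner dimension $r$ much smaller than $m$ and $n$, the NMF is the constrained minimization problem:

```latex
% Standard NMF objective: Frobenius-norm approximation error, minimized
% subject to elementwise nonnegativity of both factors.
\[
\min_{W \ge 0,\; H \ge 0} \; \tfrac{1}{2}\,\|A - WH\|_F^2 .
\]
```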
Among several algorithms to calculate the NMF, such as the multiplicative update algo-
rithm (MU) [1], there are Newton-type methods [2]. In [3], we proved that Newton methods
can be parallelized very well because Newton iterations can be performed in parallel without
exchanging data between processes. Therefore, they have an advantage on parallel archi-
tectures over other methods. However, these methods can show problematic convergence
behavior, limiting their efficiency.
We present a modified algorithm that has stable convergence. Like all such algorithms, it minimizes the approximation error by alternately improving W while holding H fixed, and vice versa. Newton iteration is applied in each step. There are several differences to
existing algorithms. While [2] uses unconstrained optimization and active set methods, our
method uses Karush-Kuhn-Tucker (KKT) conditions and a reflective technique. While [3]
uses backtracking line search in order not to violate KKT conditions, we use this technique
only to guarantee global convergence, i.e. that the approximation error decreases in each
iteration. The KKT conditions are enforced by using a modified target function with the
same zeros.
Our method allows for an inexact approach, where only a few Newton iterations are performed per outer iteration. Experiments show that this leads to faster convergence. Although this increases the communication overhead in the parallel implementation, a single Newton iteration is still the best choice, and parallel efficiency is satisfactory.
References
[1] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factor-
ization. Nature, 401:788–791, 1999.
[2] D. Kim, S. Sra, and I. S. Dhillon. Fast Newton-type methods for the least squares non-
negative matrix approximation problem. In Proc. of the 2007 SIAM Int. Conf. on Data
Min. (SDM07), 2007, pages 343–354.
[3] M. Flatz and M. Vajteršič. Parallel Nonnegative Matrix Factorization via Newton Iteration. Parallel Process. Lett., 26(03):1650014, 2016. doi:10.1142/S0129626416500146.
A FRAMEWORK FOR NONLINEAR FETI-DP AND
BDDC METHODS
Axel Klawonn, Martin Lanser, Matthias Uran
Mathematisches Institut,
Universität zu Köln, Germany
{axel.klawonn,martin.lanser,m.uran}@uni-koeln.de
Oliver Rheinbach
Institut für Numerische Mathematik und Optimierung,
Technische Universität Bergakademie Freiberg, Germany
oliver.rheinbach@math.tu-freiberg.de
Highly scalable and robust Newton-Krylov domain decomposition approaches are widely
used for the solution of nonlinear implicit problems, e.g., in structural mechanics. In gen-
eral, in those methods, the nonlinear problem is first linearized and afterwards decomposed
into subdomains. By changing this order, i.e., by first decomposing the nonlinear problem,
new parallel and nonlinear domain decomposition methods can be designed which can reduce
communication by increasing local work. These methods often show higher robustness than classical Newton-Krylov variants and can be interpreted as nonlinear globalization strategies. In this talk, we discuss different Nonlinear FETI-DP and BDDC approaches [1, 2, 3], which can be formulated in a common framework and also be interpreted as nonlinear right-preconditioners [4]. We also present weak scaling results for more than 200K BlueGene/Q cores on JUQUEEN at FZ Jülich.
References
[1] A. Klawonn, M. Lanser, and O. Rheinbach. Nonlinear FETI-DP and BDDC Methods.
SIAM J. Sci. Comput., 36(2):A737–A765, 2014. doi:10.1137/130920563.
[2] A. Klawonn, M. Lanser, and O. Rheinbach. Toward Extremely Scalable Nonlinear Do-
main Decomposition Methods for Elliptic Partial Differential Equations. SIAM J. Sci.
Comput., 37(6):C667–C696, 2015. doi:10.1137/140997907.
[3] A. Klawonn, M. Lanser, O. Rheinbach, and M. Uran. New Nonlinear FETI-DP Methods
Based on a Partial Nonlinear Elimination of Variables. In Proceedings of the 23rd Inter-
national Conference on Domain Decomposition. Accepted for Publication. Lect. Notes
Comput. Sci. Eng., 2016.
[4] A. Klawonn, M. Lanser, O. Rheinbach, and M. Uran. Nonlinear FETI-DP and BDDC
Methods: A Unified Framework and Parallel Results. SIAM J. Sci. Comput. Submitted
for publication, 2016.
EXPLICIT VECTORIZATION AS A DESIGN TOOL FOR
PARALLEL ALGORITHMS ON MODERN HARDWARE
ARCHITECTURES
Manfred Liebmann
Institute for Mathematics and Scientific Computing, University of Graz, Austria
manfred.liebmann@uni-graz.at
Modern hardware architectures provide a formidable challenge to the design of algorithms
with portable performance across different flavors of multicore CPUs, manycore accelerators,
and graphics processors. Three different case studies: small eigenvalue problems in magnetic
resonance imaging [1], algorithms for semiclassical quantum dynamics [2, 3], and algebraic
multigrid methods for uncertainty quantification [4], show the applicability of explicit vectorization techniques as a general design tool for massively parallel software.
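As a small taste of the technique (our illustration, not code from the cited case studies), an explicitly vectorized daxpy on AVX2/FMA hardware reads:

```cpp
#include <immintrin.h>

// Explicitly vectorized daxpy: y <- a*x + y using AVX2/FMA intrinsics.
// For brevity, n is assumed to be a multiple of 4 (one 256-bit register
// holds 4 doubles); production code would add a scalar remainder loop.
void daxpy_avx2(double a, const double* x, double* y, int n) {
  const __m256d va = _mm256_set1_pd(a);     // broadcast the scalar a
  for (int i = 0; i < n; i += 4) {
    __m256d vx = _mm256_loadu_pd(x + i);    // load 4 doubles of x
    __m256d vy = _mm256_loadu_pd(y + i);    // load 4 doubles of y
    vy = _mm256_fmadd_pd(va, vx, vy);       // fused multiply-add: a*x + y
    _mm256_storeu_pd(y + i, vy);            // store 4 results back
  }
}
```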
References
[1] M. Presenhuber. Numerische Methoden zur Nullstellenbestimmung für Anwendungen in der quantitativen Magnetresonanztomographie. Diplomarbeit. Karl-Franzens-Universität Graz, 2015. url: http://resolver.obvsg.at/urn:nbn:at:at-ubg:1-86584.
[2] D. Sattlegger and M. Liebmann. Parallel algorithms for semiclassical quantum dynamics. Preprint No. IGDK-2015-25. 2015. url: http://igdk.eu/foswiki/pub/IGDK1754/Preprints/LiebmannSattlegger_2015.pdf.
[3] D. Sattlegger. Efficient Algorithms for Semiclassical Quantum Dynamics. Dissertation. München: Technische Universität München, 2015. url: http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bvb:91-diss-20151221-1277913-1-7.
[4] D. Schaden. Efficient Parallel PDE-Solvers for Uncertainty Quantification. Master's thesis. Technische Universität München, 2016.
PARALLEL MULTI-DENSITY BASED CLUSTERING
Lukáš Csóka, Peter Laurinec, Mária Lucká
Faculty of Informatics and Information Technologies,
Slovak University of Technology in Bratislava, Slovakia
xcsokal@stuba.sk, peter.laurinec@stuba.sk, maria.lucka@stuba.sk
Data clustering is a process of joining similar objects into groups. Although many clustering algorithms are known, it is still a challenging research area because of the increasing amount of data, which calls for a parallel instead of a sequential approach. In this work, we modify the density-based algorithm DBSCAN [1], which achieves excellent clustering results for datasets with equal density but, in general, very poor results when applied to data with various densities. Another problem of DBSCAN is parallelization, because of its strongly sequential character. The modification DBSCAN-DLP [2] solves the problem of various densities, but it is still strongly sequential and unusable for clustering larger datasets. A successful parallel version of DBSCAN based on the disjoint-set data structure is presented in [3].
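The disjoint-set structure underlying [3] can be sketched as standard union-find with path compression (our illustration): merging two clusters becomes a near-constant-time union of two sets.

```cpp
#include <numeric>
#include <vector>

// Standard disjoint-set (union-find) with path halving (illustrative
// sketch of the structure used by the parallel DBSCAN of [3]).
class DisjointSet {
 public:
  explicit DisjointSet(int n) : parent_(n) {
    std::iota(parent_.begin(), parent_.end(), 0);  // each point is its own set
  }
  // Find the representative of x's cluster, flattening the tree as we go.
  int find(int x) {
    while (parent_[x] != x) {
      parent_[x] = parent_[parent_[x]];  // path halving
      x = parent_[x];
    }
    return x;
  }
  // Merge the clusters containing a and b.
  void unite(int a, int b) { parent_[find(a)] = find(b); }

 private:
  std::vector<int> parent_;
};
```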
In our work, we combine and modify these two approaches for clustering large data sets with various densities in parallel. The proposed algorithm aims to find multiple Density Level Sets (DLS) based on a statistical analysis of the data. A few data points from the dataset are assigned to each DLS, but in general not all points belong to a DLS. The standard parameter of DBSCAN, characterizing the neighborhood distance of points, is then computed for every DLS. The algorithm continues with DBSCAN clustering on the data points that were already assigned to some DLS. Afterwards, the clusters are expanded to the unassigned points.
The method is compared with the well-known K-Means, standard DBSCAN and DBSCAN-DLP methods on artificial datasets with various densities, where it achieves better results concerning quality of classification and performance. On datasets with equal densities, it behaves similarly to DBSCAN, but with better identification of outliers. We have used both OpenMP and Message Passing Interface (MPI) approaches and showed that solving this big-data problem is almost impossible without a proper parallelization approach.
References
[1] M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discover-
ing clusters in large spatial databases with noise. In Second Int. Conf. on Knowledge
Discovery and Data Mining, 1996, pages 226–231.
[2] Z. Xiong, R. Chen, Y. Zhang, and X. Zhang. Multi-density dbscan algorithm based on
density levels partitioning. Int. J. Comput. Inf. Sci., 9(10):2739–2749, 2012.
[3] M. M. A. Patwary, D. Palsetia, A. Agrawal, W. K. Liao, F. Manne, and A. Choudhary.
A new scalable parallel DBSCAN algorithm using the disjoint-set data structure. In High
Perform. Computing, Networking, Storage and Analysis (SC), 2012 Int. Conf. for, 2012,
pages 1–11.
ENERGY AWARE COMPUTATIONS ON MANYCORE
SYSTEMS
Alban Lumi and Gundolf Haase
Institute for Mathematics and Scientific Computing, University of Graz, Austria
alban.lumi@uni-graz.at, gundolf.haase@uni-graz.at
Besides the accuracy of the results, the overall solution time is the main quantity programmers focus on. On the other hand, the compute nodes transform electrical energy into heat, which has to be dissipated afterwards. Extrapolating the recent hardware developments by ARM and Intel, as well as by NVIDIA and AMD, we have to cope with many cores on one chip that all have to transfer data to/from the main memory through a hierarchy of caches and/or faster memory.
We will present available tools [1] for determining the power consumption when executing various application codes on different hardware, such as a conventional CPU cluster, Intel's Knights Landing, and the ARM-based ThunderX. The application codes range from the simple Jacobi iteration to fully coupled cardiovascular simulations.
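One such tool path is PAPI's RAPL component [1]; a minimal sketch follows (our illustration; error checks are omitted and the event name is a platform-dependent assumption).

```cpp
#include <papi.h>

// Measure package energy around a kernel via PAPI's RAPL component [1]
// (illustrative sketch; return codes should be checked in real code, and
// the event name below is an assumption that varies per platform).
long long measure_energy_nj() {
  int evset = PAPI_NULL;
  PAPI_library_init(PAPI_VER_CURRENT);
  PAPI_create_eventset(&evset);
  PAPI_add_named_event(evset, "rapl:::PACKAGE_ENERGY:PACKAGE0");

  long long nj = 0;
  PAPI_start(evset);
  // ... run the kernel under test here ...
  PAPI_stop(evset, &nj);  // package energy consumed, in nanojoules
  return nj;
}
```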
Choosing the Eikonal solver [2] as one special application, we will demonstrate how algorithmic changes and different memory access patterns reduce the overall energy consumption, although the overall runtime might not be reduced.
Supported by the Horizon 2020 project MontBlanc3.
References
[1] V. M. Weaver, M. Johnson, K. Kasichayanula, J. Ralph, P. Luszczek, D. Terpstra, and
S. Moore. Measuring Energy and Power with PAPI. In PASA Workshop, 2012.
[2] D. Ganellari and G. Haase. Fast many-core solvers for the Eikonal equations in cardio-
vascular simulations. In 2016 International Conference on High Performance Computing
Simulation (HPCS). peer-reviewed. IEEE, 2016, pages 278–285. doi:10.1109/HPCSim.
2016.7568347.
COMMUNITY SOFTWARE ECOSYSTEMS FOR
HIGH-PERFORMANCE COMPUTATIONAL SCIENCE:
OPPORTUNITIES AND CHALLENGES
Lois Curfman McInnes
Mathematics and Computer Science Division, Argonne National Laboratory, USA
mcinnes@mcs.anl.gov
Numerical libraries have proven effective in providing widely reusable software that is
robust, efficient, and scalable—delivering advanced algorithms and data structures that en-
able scientific discovery for a broad range of applications. However, as we exploit emerging
extreme-scale architectures to address more advanced modeling, simulation, and analysis,
daunting challenges arise in software productivity and sustainability. Difficulties include in-
creasing complexity of algorithms and computer science techniques required in multiscale and
multiphysics applications, the imperative of portable performance in the midst of dramatic
and disruptive architectural changes, the realities of large legacy code bases, and human
factors arising in distributed multidisciplinary research teams. New architectures require fun-
damental algorithm and software refactoring, while at the same time the demand is increasing
for greater reproducibility of simulation and analysis results for predictive science. This situ-
ation brings with it the unique opportunity to fundamentally change how scientific software
is designed, developed, and sustained.
This presentation will introduce the Extreme-scale Scientific Software Development Kit
(xSDK) [1], which defines community policies (https://xsdk.info/policies) to improve
code quality and compatibility across independently developed packages. The vision of the
xSDK is to provide infrastructure for and interoperability of a collection of related and
complementary software elements—developed by diverse, independent teams throughout the
community—that provide the building blocks, tools, models, processes, and related artifacts
for rapid and efficient development of high-quality extreme-scale applications. The xSDK
currently includes four major open-source numerical libraries (hypre, SuperLU, PETSc, and
Trilinos) and two domain components (Alquimia and PFLOTRAN). The xSDK approach
provides turnkey installation of member software packages and seamless combination of ag-
gregate capabilities. We will discuss experiences in creating xSDK foundations—first steps
toward realizing an extreme-scale scientific software ecosystem. We welcome contributions
to the xSDK, feedback on draft xSDK community policies, and dialogue about work toward
broader community ecosystems for scientific software.
References
[1] R. Bartlett, I. Demeshko, T. Gamblin, G. Hammond, M. Heroux, J. Johnson, A. Klinvex,
X. Li, L. Curfman McInnes, J. D. Moulton, D. Osei-Kuffuor, J. Sarich, B. Smith, J.
Willenbring, and U. Meier Yang. xSDK Foundations: Toward an Extreme-scale Scientific
Software Development Kit. available via https: / / arxiv . org / abs / 1702 . 08425, to
appear in Supercomputing Frontiers and Innovations. 2017.
A HIGHLY SCALABLE MPI PARALLELIZATION OF THE
FAST MULTIPOLE METHOD
Michael Obersteiner, Nikola Tchipev, Philipp Neumann, Hans-Joachim
Bungartz
Chair of Scientific Computing, Technical University of Munich,
Garching, Germany
oberstei@in.tum.de, tchipev@in.tum.de, neumanph@in.tum.de, bungartz@in.tum.de
In this talk an MPI parallelization strategy for the Fast Multipole Method (FMM) is discussed that scales up to 32k processors [1]. The implementation is based on a well-known parallelization scheme of the list-based FMM [2], which splits the octree into a local and a global part. This scheme uses local reduce-type operations to obtain the results in the global tree part and therefore avoids reduce operations involving all processors, which are shown to be critical for large-scale simulations. By utilizing an adaptation of the Neutral Territory Method [3], only 6 processors in the local tree and 31 in the global tree are involved in the send and receive operations on each level, compared to 26 and 189, respectively, for a full-shell approach. Furthermore, import loads are reduced significantly for the global tree part and for up to three levels in the local tree part. Additional optimizations are an auto-tuning scheme and a method for reducing the communication by fusing neighboring domains in the global tree, which can in some cases improve the performance. In this way, relative speedups of 2.3 for a small scenario with 64 local cells on the finest level and 5.6 for a larger scenario with 512 local cells on the finest level were obtained in the range of 4096 to 32768 processors on the Shaheen cluster of the King Abdullah University of Science and Technology [4].
References
[1] M. Obersteiner. Parallel Implementation of the Fast Multipole Method. Master’s thesis.
Technical University of Munich, 2016.
[2] R. Yokota, G. Turkiyyah, and D. Keyes. Communication Complexity of the Fast Mul-
tipole Method and its Algebraic Variants. ArXiv e-prints, 2014. arXiv:1406.1974 [cs.DC].
[3] D. Shaw. A fast, scalable method for the parallel evaluation of distance-limited pairwise
particle interactions. J. Comput. Chem., 26(13):1318–1328, 2005.
[4] Shaheen II. url:https://www.hpc.kaust.edu.sa/content/shaheen-ii.
CONVERGENCE OF THE PARALLEL BLOCK-JACOBI
EVD ALGORITHM FOR HERMITIAN MATRICES
Gabriel Okša
Institute of Mathematics, Slovak Academy of Sciences, Bratislava, Slovakia
Gabriel.Oksa@savba.sk
Yusaku Yamamoto
Department of Communication Engineering and Informatics,
The University of Electro-Communications, Tokyo, Japan
yusaku.yamamoto@uec.ac.jp
Martin Bečka
Institute of Mathematics, Slovak Academy of Sciences, Bratislava, Slovakia
Martin.Becka@savba.sk
Marián Vajteršič
Department of Computer Sciences, University of Salzburg, Austria
and
Institute of Mathematics, Slovak Academy of Sciences, Bratislava, Slovakia
marian.vajtersic@sbg.ac.at
Let a Hermitian matrix $A$ of order $n$ be divided into a $w \times w$ block structure with $w = 2p$, where $p$ is the number of processors (cores). The aim is to compute the eigenvalue decomposition (EVD) of $A$ in parallel using the two-sided block-Jacobi method with the dynamic ordering defined as follows. At parallel iteration step $k$, the $2p$ off-diagonal blocks of $A^{(k)}$ with block indices $(X_{k1}, Y_{k1}), (Y_{k1}, X_{k1}), \dots, (X_{kp}, Y_{kp}), (Y_{kp}, X_{kp})$, with $X_{ki} < Y_{ki}$ for all $i$, are eliminated using the greedy implementation of parallel dynamic ordering (GIPDO):

1. At iteration step $k$, all pairs of off-diagonal blocks are ordered decreasingly with respect to their weights
\[
w^{(k)}_{IJ} = \|A^{(k)}_{IJ}\|_F^2 + \|A^{(k)}_{JI}\|_F^2, \qquad I \neq J.
\]

2. After choosing the first pair, with $\|A^{(k)}_{X_{k1}Y_{k1}}\|_F^2 = \|A^{(k)}_{Y_{k1}X_{k1}}\|_F^2 = \max_{I \neq J} \|A^{(k)}_{IJ}\|_F^2$, an additional $p - 1$ pairs are chosen for annihilation in order of decreasing weight, where each new pair must have its block-row and block-column indices different from those of all already chosen blocks.

Processor $i$, $1 \le i \le p$, solves the $2 \times 2$-block EVD subproblem
\[
\begin{pmatrix}
P^{(k)}_{X_{ki}X_{ki}} & P^{(k)}_{X_{ki}Y_{ki}} \\
P^{(k)}_{Y_{ki}X_{ki}} & P^{(k)}_{Y_{ki}Y_{ki}}
\end{pmatrix}^{H}
\begin{pmatrix}
A^{(k)}_{X_{ki}X_{ki}} & A^{(k)}_{X_{ki}Y_{ki}} \\
A^{(k)}_{Y_{ki}X_{ki}} & A^{(k)}_{Y_{ki}Y_{ki}}
\end{pmatrix}
\begin{pmatrix}
P^{(k)}_{X_{ki}X_{ki}} & P^{(k)}_{X_{ki}Y_{ki}} \\
P^{(k)}_{Y_{ki}X_{ki}} & P^{(k)}_{Y_{ki}Y_{ki}}
\end{pmatrix}
=
\begin{pmatrix}
\hat{A}^{(k+1)}_{X_{ki}X_{ki}} & 0 \\
0 & \hat{A}^{(k+1)}_{Y_{ki}Y_{ki}}
\end{pmatrix},
\]
where the diagonal blocks $\hat{A}^{(k+1)}_{X_{ki}X_{ki}}$ and $\hat{A}^{(k+1)}_{Y_{ki}Y_{ki}}$ are square, diagonal matrices of order $\ell = n/w$. Subsequently, the orthogonal matrix of local eigenvectors is used in the update of the block columns and rows $(X_{ki}, Y_{ki})$ in parallel.

For such an algorithm and under reasonable assumptions, we prove its asymptotic quadratic convergence (AQC) to a diagonal matrix for all possible distributions of eigenvalues (simple, multiple, clusters). Numerical examples confirm the developed theory.
EFFICIENT TRANSFORMATION OF THE GENERAL
EIGENPROBLEM WITH SYMMETRIC BANDED
MATRICES TO A BANDED STANDARD
EIGENPROBLEM
Michael Rippl, Thomas Huckle
Chair of Scientific Computing, Technical University of Munich,
Garching, Germany
ripplm@in.tum.de
Bruno Lang
Applied Computer Science Group, Bergische Universität Wuppertal, Germany
The solution of symmetric eigenproblems plays a key role in many computational simulations. Generalized eigenproblems are transformed to a standard problem and solved with a common approach for that problem. This transformation has the drawback that for banded matrices in the generalized eigenproblem the banded structure is not preserved: the matrix of the standard eigenproblem will generally be a full matrix.
We followed the ideas of the group of Lang (University of Wuppertal), who modified Crawford's algorithm [1]. Crawford's algorithm proposes a way to immediately remove the fill-in when applying the factorization of B to A. This algorithm requires a common bandwidth for both matrices. The new approach only requires that the bandwidth of matrix A is not bigger than the bandwidth of matrix B.
We implemented this procedure in the ELPA project [2]. ELPA offers a two-step approach for solving the standard eigenvalue problem. The first step transforms the matrix of the standard problem to a banded matrix, and the second step transforms the banded matrix to tridiagonal form, from which the eigenvalues can be determined easily.
By using Lang's twisted Crawford algorithm, the transformation to banded form and also the corresponding back transformation of the eigenvectors can be skipped. Furthermore, it provides some interesting blocking and parallelization possibilities, which allow a good speedup to be achieved compared to Crawford's method or the Cholesky factorization.
References
[1] C. R. Crawford. Reduction of a Band-Symmetric Generalized Eigenvalue Problem. Comm. ACM, 16:41–44, 1973.
[2] T. Auckenthaler. Highly scalable eigensolvers for petaflop applications. PhD thesis. Universität München, 2012.
OPENACC PARALLELIZATION FOR THE SOLUTION OF
THE BIDOMAIN EQUATIONS
Stefan Rosenberger
Institute for Mathematics and Scientific Computing, University of Graz, Austria
and
SFB Research Center Mathematical Optimization and Applications in Biomedical
Sciences
stefan.rosenberger@uni-graz.at
Cardiovascular simulations include coupled PDEs (partial differential equations) for electrical potentials, non-linear deformations, and systems of ODEs (ordinary differential equations), all of which are contained in the simulation software CARP (Cardiac Arrhythmia Research Package). In this talk we focus on the solvers for the elliptic part of the bidomain equations, which describes the electric stimulation of the heart for an anisotropic tissue. The existing conjugate gradient solver with an algebraic multigrid preconditioner is already parallelized with MPI+OpenMP/CUDA.
We investigate the OpenACC parallelization of this solver on one GPU, especially its competitiveness with respect to the highly optimized CUDA implementation on recent GPUs. The OpenACC performance turns out to be quite close to the CUDA performance when typical traps are avoided. We will additionally present first results of these solver parts on Intel's KNL (Knights Landing).
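For illustration, a typical CG building block in OpenACC might look as follows (our sketch, not the CARP solver itself). One of the typical traps alluded to above is per-call data movement; in practice the copy clauses would be hoisted into an enclosing `#pragma acc data` region spanning the whole iteration.

```cpp
// OpenACC sketch (illustrative): a vector update and a norm accumulation
// offloaded to the GPU in one parallel loop with a reduction.
double axpy_norm(int n, double alpha, const double* p, double* x,
                 const double* r) {
  double s = 0.0;
  #pragma acc parallel loop reduction(+:s) copyin(p[0:n], r[0:n]) copy(x[0:n])
  for (int i = 0; i < n; ++i) {
    x[i] += alpha * p[i];  // x <- x + alpha * p
    s += r[i] * r[i];      // accumulate ||r||^2
  }
  return s;
}
```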
THE FACTORS IN THE SR DECOMPOSITION AND
THEIR CONDITIONING
Miroslav Rozloˇzn´ık
Institute of Mathematics, Czech Academy of Sciences, Prague, Czech Republic
miro@cs.cas.cz
Heike Faßbender
Institut Computational Mathematics, AG Numerik, Technische Universität Braunschweig, Germany
h.fassbender@tu-braunschweig.de
Almost every nonsingular matrix $A \in \mathbb{R}^{2m \times 2m}$ can be decomposed into the product of a symplectic matrix $S$ and an upper $J$-triangular matrix $R$. This decomposition is not unique. In this contribution we analyze the freedom of choice in the symplectic and the upper $J$-triangular factors and review several existing suggestions on how to choose the free parameters in the SR decomposition. In particular, we consider two choices leading to the minimization of the condition number of the diagonal blocks in the upper $J$-triangular factor and to the minimization of the conditioning of the corresponding blocks in the symplectic factor. We develop bounds for the extremal singular values of the whole upper $J$-triangular factor and the whole symplectic factor in terms of the spectral properties of even-dimensioned principal submatrices of the skew-symmetric matrix associated with the SR decomposition. The theoretical results are illustrated on two small examples.
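For reference, in standard notation the setting is:

```latex
% SR decomposition (standard notation): S is symplectic with respect to
% the skew-symmetric matrix J, and R is upper J-triangular, i.e.
% triangular up to the permutation structure induced by J.
\[
A = SR, \qquad
J = \begin{pmatrix} 0 & I_m \\ -I_m & 0 \end{pmatrix}, \qquad
S^{T} J S = J .
\]
```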
This research is supported by the Grant Agency of the Czech Republic under the project
GA17-12925S.
PARTIAL INVERSES OF BLOCK TRIDIAGONAL
NON-HERMITIAN MATRICES
Louise Spellacy
School of Mathematics, Trinity College, Dublin, Ireland
spellal@tcd.ie
Darach Golden
Research I.T., Trinity College, Dublin, Ireland
dgolden@tcd.ie
The SMEAGOL electronic code uses a combination of density functional theory (DFT) and
Non-Equilibrium Green’s Functions (NEGF) to study nanoscale electronic transport under
the effect of an applied bias potential [1]. Inversion of a block tridiagonal non-Hermitian
matrix is required to obtain the Green’s function used by the SMEAGOL code. In many
cases, only the block tridiagonal part of the inverse is needed. Currently the SMEAGOL code
is limited by single node, multicore matrix inverses. The addition of parallel sparse matrix
inverse functionality will allow significantly larger systems to be addressed.
The algorithm presented here is an extension of a previous work in [2] and [3], where a
method for parallel inversion of Hermitian block tridiagonal matrices is detailed. This method
extends [2] and [3] to the non-Hermitian case and allows for the case of varying block sizes.
The tridiagonal blocks of the matrix are evenly distributed across p processes. The local blocks
are used to form a “super matrix” on each process. These matrices are inverted locally and
the local inverses are combined in a pairwise manner. There are log(p) combination steps. At
each combination step, the updates to the global inverse are represented by updating “matrix
maps” on each process. The matrix maps are finally applied to the original local blocks to
retrieve the block tridiagonal elements of the inverse. This extended algorithm requires the
computation and communication of a greater number of matrix maps than the algorithm
detailed in [3]. This “pairwise” algorithm has been implemented as a standalone program,
written in Fortran and MPI. It has been tested on local clusters in the Trinity Centre for High
Performance Computing. The algorithm is discussed in detail in the presentation. Inverses
calculated using the “pairwise” implementation are compared with inverses calculated using
well known parallel matrix libraries such as ScaLAPACK and MUMPS. Results are obtained
for random test matrices and for matrices arising from DFT calculations.
References
[1] SMEAGOL: Non-equilibrium Electronic Transport. www.smeagol.tcd.ie.
[2] S. Cauley et al. A Scalable Distributed Method for Quantum-Scale Device Simulation.
J. Appl. Phys., 101(12):123715, 2007. doi:10.1063/1.2748621.
[3] S. Cauley et al. Distributed Non-Equilibrium Green’s Function Algorithms for the Sim-
ulation of Nanoelectronic Devices with Scattering. J. Appl. Phys., 110(4):043713, 2011.
doi:10.1063/1.3624612.
WORKFLOW FOR PARALLEL PROCESSING OF
BIOMEDICAL IMAGES
Robert Spir, Karol Mikula
Faculty of Civil Engineering, Slovak University of Technology in Bratislava, Slovakia
spir@math.sk, mikula@math.sk
We present an integrated workflow for processing biomedical images of the early stages
of embryo development of various organisms, obtained from two-photon microscopy. We
first apply geodesic mean curvature flow filtering to the raw data to remove noise and
improve image quality [1]. We then use level-set center detection to obtain the cell
identifiers [2], and proceed with the segmentation of cells, membranes, or the whole embryo
using the generalized subjective surface method [3]. Finally, automated cell tracking and
cell lineage tree reconstruction are performed by extracting the cell trajectories forming
the lineage tree from a potential field computed from a combination of distance functions
inside the 4D segmentations of the processed data [4]. Each step of the workflow is
parallelized using an appropriate technique: MPI for distributed computing, OpenMP for
local parallelization, GNU Parallel for launching parallel tasks, and the Task Parallel
Library for parallelizing .NET applications. In addition, the workflow is optimized to run
on computer clusters with NUMA architecture.
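The outermost level of such a workflow is plain task parallelism over independent datasets. The Python sketch below illustrates only this orchestration pattern with a process pool, in the spirit of the GNU Parallel and Task Parallel Library usage mentioned above; the four stage functions are hypothetical placeholders, since in the actual workflow each stage is a separate program that is itself MPI- or OpenMP-parallel.

    from multiprocessing import Pool

    # Hypothetical placeholders for the four stages described above.
    def filter_gmcf(data):     return data   # geodesic mean curvature flow filtering
    def detect_centers(data):  return data   # level-set cell center detection
    def segment(data):         return data   # generalized subjective surface method
    def track_cells(data):     return data   # trajectory / lineage tree extraction

    def process_one(dataset):
        # The four stages run in sequence for a single dataset.
        return track_cells(segment(detect_centers(filter_gmcf(dataset))))

    if __name__ == "__main__":
        datasets = ["embryo_t%03d" % t for t in range(8)]    # stand-in inputs
        with Pool() as pool:                  # task-parallel launch over datasets,
            results = pool.map(process_one, datasets)        # akin to GNU Parallel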
References
[1] Z. Kriva, K. Mikula, N. Peyrieras, B. Rizzi, A. Sarti, and O. Stasova. 3D Early Embryo-
genesis Image Filtering by Nonlinear Partial Differential Equations. Med. Image Anal.,
14(4):510–526, 2010. doi:10.1016/j.media.2010.03.003.
[2] P. Frolkovic, K. Mikula, N. Peyrieras, and A. Sarti. A counting number of cells and cell
segmentation using advection-diffusion equations. Kybernetika, 43(6):817–829, 2007.
[3] K. Mikula, N. Peyrieras, M. Remesikova, and A. Sarti. 3D embryogenesis image segmen-
tation by the generalized subjective surface method using the finite volume technique.
Finite Volumes for Complex Applications V, Problems & Perspectives:585–592, 2008.
[4] K. Mikula, R. Spir, M. Smisek, E. Faure, and N. Peyrieras. Nonlinear PDE based numer-
ical methods for cell tracking in zebrafish embryogenesis. Appl. Numer. Math., 95:250–
266, 2015. doi:10.1016/j.apnum.2014.09.002.
DOMAIN DECOMPOSITION APPLIED TO THE
THIN-PLATE SPLINE SADDLE POINT PROBLEM
Linda Stals
Department of Mathematics, Australian National University, Canberra, Australia
linda.stals@anu.edu.au
Data fitting is an integral part of a number of applications including data mining, 3D
reconstruction of geometric models, image warping and medical image analysis. A commonly
used method for fitting functions to data is the thin-plate spline method. This method is
popular because it is not sensitive to noise in the data.
We have developed a discrete thin-plate spline approximation technique that uses local
basis functions [1]. With this approach the system of equations is sparse and its size depends
only on the number of points in the discrete grid, not on the number of data points. Nevertheless,
the resulting system is a saddle point problem that can be ill-conditioned for certain choices
of parameters. In this talk I will present a domain decomposition based preconditioner that
results in a well-conditioned system for a wider range of parameters.
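For readers unfamiliar with the structure, the SciPy sketch below assembles a generic symmetric saddle point matrix $K$ with an SPD (1,1) block $A$, a constraint block $B$, and a zero (2,2) block, and solves it with MINRES using a standard block-diagonal preconditioner built from $A$ and an approximate Schur complement. It illustrates saddle point preconditioning in general; it is not the thin-plate spline discretization or the domain decomposition preconditioner of this talk, and all sizes and densities are arbitrary.

    import numpy as np
    from scipy.sparse import random as sprandom, eye, bmat, diags
    from scipy.sparse.linalg import minres, splu, LinearOperator

    n, m = 200, 50
    rng = np.random.default_rng(1)
    M0 = sprandom(n, n, density=0.02, random_state=rng)
    A = (M0 @ M0.T + eye(n)).tocsc()              # SPD (1,1) block
    B = sprandom(m, n, density=0.05, random_state=rng).tocsc()
    K = bmat([[A, B.T], [B, None]]).tocsc()       # symmetric, indefinite
    b = np.ones(n + m)

    # Block-diagonal preconditioner diag(A, S), with S approximating the
    # (negated) Schur complement via the diagonal of A.
    S = (B @ diags(1.0 / A.diagonal()) @ B.T + 1e-8 * eye(m)).tocsc()
    luA, luS = splu(A), splu(S)
    P = LinearOperator((n + m, n + m),
                       matvec=lambda v: np.concatenate([luA.solve(v[:n]),
                                                        luS.solve(v[n:])]))
    x, info = minres(K, b, M=P)                   # info == 0 on convergence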
References
[1] L. Stals. Efficient Solution Techniques for a Finite Element Thin Plate Spline Formula-
tion. J. Sci. Comput., 63(2):374–409, 2015. doi:10.1007/s10915-014-9898-x.
ON THE NUMERICAL STABILITY ANALYSIS OF
PIPELINED KRYLOV SUBSPACE METHODS
Erin C. Carson
Courant Institute of Mathematical Sciences, New York University, USA
erinc@cims.nyu.edu
Miroslav Rozložník
Institute of Computer Science and Institute of Mathematics,
Czech Academy of Sciences, Prague, Czech Republic
miro@cs.cas.cz
Zdeněk Strakoš, Petr Tichý, Miroslav Tůma
Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
strakos@karlin.mff.cuni.cz, ptichy@karlin.mff.cuni.cz, mirektuma@karlin.mff.cuni.cz
Inexact computations in Krylov subspace methods, either due to floating point roundoff
error or intentional action motivated by savings in computing time or energy consumption,
have two basic effects, namely, slowing down convergence and limiting attainable accuracy.
Although the methodologies for their investigation are different, these phenomena are closely
related and cannot be separated from one another.
As the name suggests, Krylov subspace methods can be viewed as a sequence of pro-
jections onto nested subspaces of increasing dimension. They are therefore by their nature
implemented as synchronized recurrences. This is the fundamental obstacle to efficient parallel
implementation. Standard approaches to overcoming this obstacle described in the literature
involve reducing the number of global synchronization points and increasing parallelism in
performing arithmetic operations within individual iterations. One such approach, employed
by the so-called pipelined Krylov subspace methods, involves overlapping the global commu-
nication needed for computing inner products with local arithmetic computations.
Recently, the issues of attainable accuracy and delayed convergence caused by inexact com-
putations became of interest in relation to pipelined Krylov subspace methods. In this con-
tribution based on [1] we recall the related early results and developments in synchronization-
reducing Krylov subspace methods, identify the main factors determining possible numerical
instabilities, and outline approaches needed for the analysis and understanding of pipelined
Krylov subspace methods. We demonstrate the discussed issues numerically using several
algorithmic variants of the conjugate gradient method. We conclude with a brief perspective
on Krylov subspace methods in the forthcoming exascale era.
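As a concrete reference point, the textbook conjugate gradient method is sketched below in NumPy with its synchronization points marked; this is the classical algorithm, not one of the variants analyzed in the talk. Each inner product is a global reduction in a distributed implementation, and pipelined CG variants restructure the recurrences so that these reductions overlap with the matrix-vector product, which is exactly what alters the propagation of rounding errors.

    import numpy as np

    def cg(A, b, tol=1e-10, maxiter=500):
        # Textbook CG. Each inner product marked below is a global
        # reduction (synchronization point) on a distributed machine.
        x = np.zeros_like(b)
        r = b.copy()
        p = r.copy()
        rr = r @ r                        # global reduction
        for _ in range(maxiter):
            Ap = A @ p                    # local (sparse) matrix-vector product
            alpha = rr / (p @ Ap)         # global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r                # global reduction
            if np.sqrt(rr_new) <= tol * np.linalg.norm(b):
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x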
References
[1] E. Carson, M. Rozložník, Z. Strakoš, P. Tichý, and M. Tůma. On the numerical stability
analysis of pipelined Krylov subspace methods. Submitted, 2016.
EMPLOYING HPC FOR ANALYZING NONLINEAR PDE
SYSTEMS BEYOND SIMULATION
Jonas Thies, Weiyan Song
German Aerospace Center (DLR),
Simulation and Software Technology,
Cologne, Germany
Jonas.Thies@DLR.de, Weiyan.Song@DLR.de
Fred W. Wubs, Sven Baars
Institute for Mathematics and Computer Science,
University of Groningen, Netherlands
F.W.Wubs@RuG.nl, S.Baars@RuG.nl
We review techniques of numerical bifurcation and stability analysis with examples from
computational fluid dynamics and biology. The methodology allows insight into the complete
dynamics of nonlinear PDE systems, where standard simulation tool chains leave the question
of existence, proximity and stability of multiple solutions open.
The main bottlenecks in the method are the large sparse linear systems of equations and
eigenvalue problems arising from the discretized steady-state PDE. The use of HPC is therefore
attractive to increase the achievable resolution, but remains challenging because nonsymmetric
and indefinite systems need to be solved. The ‘hybrid multi-level solver’ HYMLS [1, 2]
is a robust multi-level incomplete factorization technique that was designed for this particular
class of problems. HYMLS has an intuitive geometric interpretation and good parallelization
properties. We present some performance results of a prototypical implementation based on
MPI and the Trilinos software. The eigenvalue problems that arise are solved using the Jacobi-
Davidson method as implemented in the SPPEXA ESSEX [3] project’s phist library [4].
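What "beyond simulation" means can be illustrated with a toy problem: instead of time-stepping, the steady-state equation is solved directly by Newton's method while a parameter is varied, and linear stability is read off from the eigenvalues of the Jacobian. The sketch below does this for the cubic model f(u, lam) = lam*u - u^3; it conveys only the idea of continuation and stability analysis, and shares nothing of the scale, the HYMLS solver, or the Jacobi-Davidson eigensolver of the actual work.

    import numpy as np

    def f(u, lam):
        return lam * u - u**3          # steady states of u' = f(u, lam)

    def jac(u, lam):
        return np.diag(lam - 3 * u**2)

    # Natural-parameter continuation: follow the nontrivial solution branch
    # in lam, solving f(u, lam) = 0 by Newton at each step and checking
    # stability via the spectrum of the Jacobian.
    u = np.full(4, 0.1)                # initial guess off the trivial branch
    for lam in np.linspace(0.1, 1.0, 10):
        for _ in range(20):            # Newton iteration
            du = np.linalg.solve(jac(u, lam), -f(u, lam))
            u += du
            if np.linalg.norm(du) < 1e-12:
                break
        stable = np.all(np.linalg.eigvals(jac(u, lam)).real < 0)
        print(f"lam={lam:.2f}  u[0]={u[0]:+.4f}  stable={stable}")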
References
[1] F. W. Wubs and J. Thies. A Robust Two-Level Incomplete Factorization for (Navier-)Stokes
Saddle Point Matrices. SIAM J. Matrix Anal. Appl., 32(4):1475–1499, 2011. doi:10.1137/100789439.
[2] HYMLS, a HYbrid Multi-Level Solver. url:https://bitbucket.org/hymls/hymls/.
[3] The ESSEX project website. url:http://blogs.fau.de/essex/.
[4] PHIST, a Pipelined Hybrid-parallel Iterative Solver Toolkit. url:https://bitbucket.org/essex/phist/.
IMPACT OF INTERCONNECTION NETWORK
TOPOLOGY ON PARALLEL PERFORMANCE - A
SURVEY
Roman Trobec and Janez Ugovšek
Department of Communication Systems, Jožef Stefan Institute, Ljubljana, Slovenia
roman.trobec@ijs.si, janez@ugovsek.info
Interconnection networks (ICNs) play an important role in the execution time of cooperating
computers. In parallel systems with a large number of cooperating computers, the
performance of the data communication system is becoming more important than the
performance of the processors. The technological barrier to further increases in processor
speed is evident in contemporary high performance computers (HPC) in the form of an ever
increasing number of cooperating processors [1]. However, the exchange of temporary data
between processors can disturb the balance between calculation and communication time.
The ICN largely determines the efficiency and scalability of an HPC system on most
real-world parallel applications. It can shorten execution times by allowing the computers
to be exploited more efficiently, even as their number grows.
The performance of an ICN depends on technological and topological factors, e.g., net-
work topology, message routing, and flow-control algorithms. The routing and flow-control
algorithms have advanced to a state where efficient techniques are known and used. However,
further sophistication is possible in the development of network topologies [2], which is the
main focus of our work. We present the state-of-the-art technology and topology of several
common ICNs used in petascale computers. Their analysis indicates that ICNs with higher
performance are needed for future exascale computers [3]. They should be based on high-
radix topologies with optical connections [4] for the longer links. It can also be expected
that future ICNs will be able to adapt dynamically to the current application in some
optimal way.
References
[1] R. Trobec, R. Vasiljević, M. Tomašević, V. Milutinović, R. Beivide, and M. Valero. Inter-
connection Networks in Petascale Computer Systems: A Survey. ACM Comput. Surv.,
49(3):44:1–44:24, 2016. doi:10.1145/2983387.
[2] W. J. Dally and B. P. Towles. Principles and Practices of Interconnection Networks.
Morgan Kaufmann Publishers Inc., 2004.
[3] P. W. Coteus, J. U. Knickerbocker, C. H. Lam, and Y. A. Vlasov. Technologies for
Exascale Systems. IBM J. Res. Dev., 55(5):581–592, 2011. doi:10.1147/JRD.2011.2163967.
[4] A. F. Benner, M. Ignatowski, J. A. Kash, D. M. Kuchta, and M. B. Ritter. Exploitation
of Optical Interconnects in Future Server Architectures. IBM J. Res. Dev., 49(4/5):755–
775, 2005. url:http://dl.acm.org/citation.cfm?id=1148882.1148902.
MIXED SPARSE-DENSE LINEAR LEAST SQUARES AND
PRECONDITIONED ITERATIVE METHODS
Jennifer Scott
STFC Rutherford Appleton Laboratory, Harwell Campus, Didcot, Oxfordshire, UK
jennifer.scott@stfc.ac.uk
Miroslav Tůma
Department of Numerical Mathematics, Faculty of Mathematics and Physics,
Charles University, Prague, Czech Republic
mirektuma@karlin.mff.cuni.cz
The efficient solution of large linear least squares problems in which the system matrix
$A$ contains rows with very different densities is challenging. There have been many classical
contributions to solving this problem that focus on direct methods; they can be found in the
monograph [1]. Such solvers typically perform a splitting of the rows of $A$ into two row blocks,
$A_s$ and $A_d$. The block $A_s$ is such that the sparse factorization of the normal matrix $A_s^T A_s$
is feasible, while the rows in the block $A_d$ have a relatively large number of nonzero entries.
These dense rows are initially ignored, a factorization of the sparse part is computed using a
sparse direct solver, and the solution is then updated to take account of the omitted dense rows.
There are two potential weaknesses of this approach. First, in practical applications the
number of rows that contain a significant number of entries may not be small, and processing
some of the denser rows separately may improve performance. Furthermore, large-scale problems
require the use of preconditioned iterative solvers, and the straightforward proposal of
preconditioning the iterative solver with an incomplete factorization of the sparse block alone,
discarding the dense block, may not succeed. In this presentation, we propose processing $A_s$
separately within a conjugate gradient method, using an incomplete factorization precondi-
tioner combined with the factorization of a dense matrix of size equal to the number of rows
in $A_d$. Problems arising from practical applications are used to demonstrate the potential of
the new approach; see also [2].
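The classical dense-row update sketched above can be expressed with the Sherman-Morrison-Woodbury identity: since the normal matrix is $A^T A = A_s^T A_s + A_d^T A_d$, a solve with it reduces to solves with $A_s^T A_s$ plus one dense system whose size equals the number of dense rows. The NumPy sketch below verifies this identity with dense stand-ins; the real setting uses a sparse factorization of $A_s^T A_s$, the small diagonal shift is only there to keep the toy Gram matrix nonsingular, and the incomplete-factorization preconditioned CG method proposed in the talk is not shown.

    import numpy as np

    rng = np.random.default_rng(2)
    m, n, k = 300, 80, 4
    As = rng.standard_normal((m, n)) * (rng.random((m, n)) < 0.05)  # sparse-ish rows
    Ad = rng.standard_normal((k, n))                                # k dense rows
    b = rng.standard_normal(m + k)
    bs, bd = b[:m], b[m:]

    Cs = As.T @ As + 1e-8 * np.eye(n)      # factorized once (sparse in practice)
    rhs = As.T @ bs + Ad.T @ bd
    y = np.linalg.solve(Cs, rhs)
    Z = np.linalg.solve(Cs, Ad.T)          # k extra solves with the sparse part
    S = np.eye(k) + Ad @ Z                 # small dense k-by-k system
    x = y - Z @ np.linalg.solve(S, Ad @ y) # Woodbury update for the dense rows

    # Agrees with solving the full (shifted) normal equations directly.
    A = np.vstack([As, Ad])
    x_ref = np.linalg.solve(A.T @ A + 1e-8 * np.eye(n), A.T @ b)
    print(np.max(np.abs(x - x_ref)))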
References
[1] Å. Björck. Numerical methods for least squares problems. Society for Industrial and
Applied Mathematics (SIAM), 1996, pages xviii+408. doi:10.1137/1.9781611971484.
[2] J. A. Scott and M. Tůma. Solving mixed sparse-dense linear least squares by precondi-
tioned iterative methods. Technical Report RAL-P-2017-001. RAL, 2017.
EFFICIENT GPU-BASED SMOOTHED PARTICLE
HYDRODYNAMICS
Matthias Korch
Department of Computer Science,
University of Bayreuth, Germany
korch@uni-bayreuth.de
Tim Werner
Department of Computer Science,
University of Bayreuth, Germany
werner@uni-bayreuth.de
Smoothed particle hydrodynamics (SPH) is a numerical method that simulates a fluid
by dividing it into particles interacting with each other. To reduce the computational
complexity, SPH simulations typically limit the interactions between particles to a short
range. Still, computing those short-ranged interactions is the most computationally intensive
task in SPH simulations, typically requiring more than 90 % of the total runtime. Because
these interactions can be computed in parallel, SPH is well suited for parallel processors
such as GPUs. However, the performance can be enhanced further by using GPU-specific
optimization techniques.
In this paper, we investigate how to efficiently compute those short-ranged interactions of
particles in SPH. For this purpose, starting from a basic linked cell approach, we iteratively
evaluate several different optimization techniques for the kernels, namely removal of the x-
loop, decreasing the cell size, simplification of the cell-sphere interaction test, fast forwarding
of particles, and temporary Verlet lists. Our main goals are the improvement of the data
parallelism and the reduction of the overhead of the grid traversal. The final implementation
achieves both goals and yields a significant speedup compared to the basic approach.
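As a reference for the baseline, the following serial Python sketch implements the basic linked cell approach: particles are binned into cells whose edge length is at least the cutoff radius, so interaction partners are found by searching only the 27 neighboring cells rather than all particle pairs. A non-periodic box is assumed for brevity, sizes are illustrative, and none of the GPU kernels or optimizations evaluated in the talk appear here.

    import numpy as np

    def linked_cell_pairs(pos, box, rc):
        # Bin particles into cells of edge >= rc, then search only the
        # neighboring cells for interaction partners within the cutoff.
        ncell = max(1, int(box / rc))
        h = box / ncell
        cells = {}
        for i, p in enumerate(pos):
            c = tuple(int(v) for v in np.minimum(p // h, ncell - 1))
            cells.setdefault(c, []).append(i)
        offsets = [(dx, dy, dz) for dx in (-1, 0, 1)
                   for dy in (-1, 0, 1) for dz in (-1, 0, 1)]
        pairs = []
        for c, members in cells.items():
            for off in offsets:
                nb = (c[0] + off[0], c[1] + off[1], c[2] + off[2])
                if not all(0 <= nb[d] < ncell for d in range(3)):
                    continue              # non-periodic box for simplicity
                for i in members:
                    for j in cells.get(nb, ()):
                        if i < j and np.linalg.norm(pos[i] - pos[j]) < rc:
                            pairs.append((i, j))   # each pair reported once
        return pairs

    rng = np.random.default_rng(3)
    pos = rng.random((500, 3)) * 10.0     # 500 particles in a 10^3 box
    pairs = linked_cell_pairs(pos, box=10.0, rc=1.0)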
IS GOSSIP-INSPIRED REDUCTION COMPETITIVE IN
HIGH PERFORMANCE COMPUTING?
Elias Wimmer
Faculty of Informatics, Research Group Parallel Computing, TU Wien, Austria
ewimmer@par.tuwien.ac.at
Marc Casas
Barcelona Supercomputing Center (BSC), Spain
marc.casas@bsc.es
Wilfried N. Gansterer
Faculty of Computer Science, University of Vienna, Austria
wilfried.gansterer@univie.ac.at
The utility of gossip-based reduction algorithms in a High Performance Computing (HPC)
context is investigated. They are compared to state-of-the-art deterministic parallel reduction
algorithms in terms of fault tolerance and resilience against silent data corruption (SDC)
as well as in terms of runtime performance and scalability. New gossip-based reduction
algorithms are proposed which significantly improve the state-of-the-art in terms of resilience
against SDC. A new gossip-inspired reduction algorithm is proposed which, for low-accuracy
computations in an HPC context, promises more competitive runtime performance than
gossip-based algorithms. It is shown that for very large systems the new gossip-inspired
reduction algorithm has the potential to outperform classical reduction algorithms for
low-accuracy problems.
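For readers unfamiliar with the gossip family, the sketch below simulates push-sum, a classical gossip-based reduction that serves as a natural baseline for such comparisons: every node repeatedly keeps half of a (sum, weight) pair and sends the other half to a randomly chosen peer, and the ratio sum/weight converges to the global average without any global synchronization point. This is a generic textbook protocol simulated on a single process, not one of the new algorithms proposed in the talk, and the node and round counts are arbitrary.

    import numpy as np

    rng = np.random.default_rng(4)
    p = 64
    x = rng.standard_normal(p)        # local values whose average is sought
    s, w = x.copy(), np.ones(p)       # per-node (sum, weight) pairs

    for _ in range(40):               # synchronous gossip rounds for simplicity
        targets = rng.integers(0, p, size=p)   # each node picks a random peer
        s_new, w_new = 0.5 * s, 0.5 * w        # keep half ...
        np.add.at(s_new, targets, 0.5 * s)     # ... and push half to the peer
        np.add.at(w_new, targets, 0.5 * w)
        s, w = s_new, w_new

    print(np.max(np.abs(s / w - x.mean())))    # estimates approach the mean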
ROUNDOFF ERROR ANALYSIS OF THE CHOLESKYQR2
AND RELATED ALGORITHMS
Yusaku Yamamoto
Department of Communication Engineering and Informatics,
The University of Electro-Communications, Tokyo, Japan
yusaku.yamamoto@uec.ac.jp
Cholesky QR is an ideal QR factorization algorithm from the viewpoint of high perfor-
mance computing [1], but it has rarely been used in practice due to numerical instability.
Recently, we showed that by repeating Cholesky QR twice, we can greatly improve the sta-
bility [2]. In this talk, we present a detailed error analysis of the algorithm, which we call
CholeskyQR2. Numerical stability of related algorithms, such as the CholeskyQR2 algorithm
in an oblique inner product [3], is also discussed.
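The algorithm itself is short enough to state as code. The NumPy sketch below follows the description above: a single Cholesky QR step needs only one reduction to form the Gram matrix, which explains its appeal in high performance computing, but it squares the condition number of the input; repeating the step and multiplying the triangular factors restores the orthogonality of Q to working precision for moderately ill-conditioned matrices. Sizes and conditioning of the test matrix are illustrative.

    import numpy as np

    def cholesky_qr(A):
        # One Cholesky QR step: Q = A R^{-1} with A^T A = R^T R.
        R = np.linalg.cholesky(A.T @ A).T      # upper triangular factor
        Q = np.linalg.solve(R.T, A.T).T        # triangular solve for Q
        return Q, R

    def cholesky_qr2(A):
        # CholeskyQR2: repeat the step and combine the triangular factors.
        Q1, R1 = cholesky_qr(A)
        Q2, R2 = cholesky_qr(Q1)
        return Q2, R2 @ R1

    rng = np.random.default_rng(5)
    A = rng.standard_normal((10000, 50)) * np.logspace(0, -6, 50)  # tall-skinny
    Q, R = cholesky_qr2(A)
    print(np.linalg.norm(Q.T @ Q - np.eye(50)))        # ~ machine precision
    print(np.linalg.norm(Q @ R - A) / np.linalg.norm(A))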
References
[1] T. Fukaya, Y. Nakatsukasa, Y. Yanagisawa, and Y. Yamamoto. CholeskyQR2: A Simple
and Communication-avoiding Algorithm for Computing a Tall-skinny QR Factorization
on a Large-scale Parallel System. In ScalA’14 Proceedings of the 5th Workshop on Latest
Advances in Scalable Algorithms for Large-Scale Systems. IEEE, 2014, pages 31–38. doi:10.1109/ScalA.2014.11.
[2] Y. Yamamoto, Y. Nakatsukasa, Y. Yanagisawa, and T. Fukaya. Roundoff Error Analysis
of the CholeskyQR2 Algorithm. Electron. Trans. Numer. Anal., 44:306–326, 2015.
[3] Y. Yamamoto, Y. Nakatsukasa, Y. Yanagisawa, and T. Fukaya. Roundoff Error Analysis
of the CholeskyQR2 Algorithm in an Oblique Inner Product. JSIAM Lett., 8:5–8, 2015.