Porting a Neuro-Imaging Application to a
CPU-GPU cluster
Reza Sina Nakhjavani, Sahel Sharify, Ali B. Hashemi, Alan W. Lu, Cristiana Amza, and Stephen Strother
Electrical and Computer Engineering Department, University of Toronto
Rotman Research Institute, Baycrest, Toronto, Ontario
Department of Medical Biophysics, University of Toronto
Email: {sina,sahel,hashemi,luwen,amza}@eecg.toronto.edu, sstrother@research.baycrest.org
Abstract—The ever-increasing complexity of scientific applications has led to the adoption of new HPC paradigms such as Graphical Processing Units (GPUs). However, modifying existing applications so that they can execute on GPUs can be challenging. Furthermore, the considerable speedup achieved by executing linear algebra operations on GPUs has introduced substantial heterogeneity into HPC clusters. In this work, we enabled NPAIRS, a neuro-imaging application, to execute on a GPU with only slight modifications to its original code. This important feature of our implementation lets current users of NPAIRS, i.e. non-expert bio-medical scientists, benefit from GPUs without having to apply fundamental changes to their existing application. As the second part of our research, we investigated the efficiency of several scheduling algorithms for a heterogeneous cluster that contains GPU nodes. Experimental results show that we achieved a 7X speedup for NPAIRS. Moreover, although scheduling does not play an important role when there is no GPU node in the cluster, it can greatly improve the makespan for a CPU-GPU cluster. We compared our scheduling results with Torque and MCT, two of the most commonly used schedulers in current HPC platforms. Our results show that Sufferage scheduling improves the makespan of Torque and MCT by 47% and 4%, respectively.
I. INTRODUCTION
The demand for high performance computing is increasing every day. Processing medical images is one example of a CPU-intensive application class that runs too slowly even on today's multi-core architectures. Although most such applications need not be strictly real-time, being able to execute them in minutes instead of hours or days helps researchers test and evaluate their new ideas and algorithms quickly. Moreover, some degree of interactivity with the application is important for biomedical researchers using the neuroscience workloads that we work with in this paper.
This drastic demand for higher performance has led the
computer industry to incorporate multi-core and many-core
processors in today’s HPC platforms. NVIDIA’s [1] GPUs
and Intel’s Xeon Phi are today’s most common many-core
architectures that are used as co-processors in computationally
intensive applications. On the other hand, the emergence of
General Purpose computing on GPUs (GPGPU) and their
programming languages such as CUDA [2] and OpenCL
[3], along with the integration of GPUs into existing multi-
core machines has made them a viable solution to accelerate
embarrassingly parallel applications. This paradigm has also added heterogeneity to today's desktops and laptops, as well as to cloud environments targeting HPC workloads.
Scheduling jobs for supercomputers has been extensively studied. However, the heterogeneous nature of modern supercomputers, as well as of CPU-GPU clusters, demands that we revisit the portability and scheduling problems for such
systems. GPUs have shown the ability to provide higher peak
throughput for a wide range of massively parallel applications.
Consequently, compute-intensive applications would prefer to
be scheduled on a GPU-enabled server when running on a
heterogeneous cluster. This may however lead to GPU device
contention and decrease the overall throughput since there
is usually a smaller number of GPUs than CPUs in typical
clusters used in biomedical settings. Therefore, tasks may
finish faster if run on a CPU rather than waiting for a GPU
resource to become available. In such cases, scheduling some
jobs on GPUs while running others on CPUs results in a better
utilization of resources as well as higher peak throughput.
In this paper, we make the following contributions: i) we propose a technique for porting and scheduling biomedical applications on CPU-GPU clusters with minimal application changes, and ii) we implement and evaluate scheduling techniques for heterogeneous clusters that shorten execution time and optimize resource utilization.
Our design for portability is particularly important for biomedical applications, because it allows a separation of concerns between the source code modifications or extensions normally performed by biomedical researchers and the libraries used to provide platform-dependent support. This isolates biomedical researchers from such lower-level concerns.
The scheduling algorithms we investigate in this paper range from relatively simple ones, such as shortest (estimated) job first, to more sophisticated ones, such as the Sufferage scheduling algorithm [4], which tries to minimize the penalty that a task suffers if it cannot be scheduled on its preferred resource.
As our case study we selected NPAIRS, a biomedical
application for processing functional Magnetic Resonance
Imaging (fMRI) brain images. NPAIRS is used to determine
the correlation between brain images of several patients (subjects) while they perform a specific task. It is a good example of an application that is both data- and CPU-intensive. Indeed, NPAIRS performs quite complicated operations (eigenvalue decomposition) on a large set of input data. These operations are highly parallelizable and execute considerably faster on a GPU. In this paper, we provide a method to schedule the tasks of our NPAIRS application on a heterogeneous CPU-GPU cluster. However, our scheduling methods are generic and could be used for other applications as well.

Fig. 1. An example of detected activation areas when the subjects were performing a simple reaction time task [7].
II. BRAIN IMAGE PROCESSING
Functional Magnetic Resonance Imaging (fMRI) is a non-invasive neuroimaging technique commonly used to study the function of the human brain by measuring the blood-oxygenation-level-dependent (BOLD) signal [5], [6]. Simply put, activation
of a region of the brain causes oxygenated blood to flow to that
region. Oxygenated and de-oxygenated blood have different
magnetic properties, which can be captured in fMRI images
[5], [6]. The imaging data gathered by fMRI are analysed to
detect correlations among brain activations in response to a
stimulus, e.g. a particular motor task.
Typically, neuroscientists and physicians start by designing
fMRI experiments to answer different questions they have in
mind, such as trying to determine which part of the brain is
responsible for a specific task. The actual experiment varies
depending on the question being investigated. But, in general,
it involves choosing subjects, a group of patients or healthy
individuals, and choosing the type of stimuli or a task to
be performed by the subjects, e.g. a finger tapping task.
During the experiment, an MRI scanner collects fMRI data
of subjects’ brains while the subjects are given the stimulus
or perform the requested task.
Generally, it is common to collect data from multiple individuals (subjects) in order to draw more general conclusions about the brain's function with more statistical power and less noise [6]. After performing the experiment, the collected fMRI
data needs to be cleansed by applying different preprocessing
steps such as motion correction, spatial smoothing, detrending
and whitening, and registration to template brains. Finally,
statistical analysis is performed on the preprocessed fMRI data
to detect correlation between regions of the brain and the task
subjects performed during the experiment [6]. The output of the analysis can be presented as a color-coded image of the brain. For example, Figure 1 shows the active regions of the brain during a simple reaction time task [7].
A. NPAIRS
NPAIRS (Nonparametric, Prediction, Activation, Influence, Reproducibility, re-Sampling) is a neuroimaging software package for analyzing fMRI data [8], [9].
Figure 2 shows the workflow of the NPAIRS application.
The NPAIRS program is based on a split-half resampling framework that randomly splits the data into two halves. Each half of the data is then analyzed individually using a statistical analysis method; the current implementation of NPAIRS uses the principal component analysis (PCA) and canonical variate analysis (CVA) algorithms for this purpose. The results of the analysis on the two split datasets are used to generate prediction accuracy (p) and reproducibility (r) metrics. Prediction accuracy determines how accurately the values of experimental design parameters, e.g. performance measures, can be predicted in an independent test dataset. Reproducibility determines how reliably the parameters in the same test dataset can be reproduced [8]. This resampling loop, which comprises splitting the data into halves, analysing each half, and computing the evaluation metrics, is repeated 100 times by default, or until all possible disjoint pairs have been tested.
In NPAIRS, principal component analysis (PCA) is used to control model complexity and reduce data dimensionality. PCA determines the principal components of the input dataset; then, only the first Q principal components are used to produce linear, multivariate discriminant functions for the analysis of the images. The value of Q, i.e. the number of selected principal components, significantly affects the prediction accuracy and reproducibility of the analysis. To study the effect of the number of principal components on the quality of the analysis, an exhaustive search is performed over this hyperparameter.
On each iteration of this exhaustive search, NPAIRS executes the same algorithm with a different value of Q. But since NPAIRS is a computationally expensive application, each iteration of this search can take hours, depending on the size of the dataset and the platform on which NPAIRS is running. Although NPAIRS is written to execute on a single node, it can either run the algorithm for a single value of Q or evaluate a set of different values of Q sequentially. This exhaustive search is embarrassingly parallel, because the evaluations of different values of Q are totally independent. Therefore, it is possible to run several instances of NPAIRS as separate processes (either on a single node or on multiple nodes); each instance is then sequential and completely independent of all other instances. In this paper, we first improve the performance of NPAIRS on a single node with a GPU, with minimal change to the source code. Then, we provide a scheduling framework for parallel execution of NPAIRS on a heterogeneous cluster.
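As an illustration of this fan-out, the following Java sketch evaluates the Q values concurrently using a thread pool on a single node. It is a sketch only: runNpairs is a hypothetical stand-in for launching one NPAIRS instance (in practice, each instance runs as a separate process, possibly on another node).

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class QSearch {
    // Hypothetical stand-in: runs one NPAIRS instance with q principal
    // components and returns its (prediction, reproducibility) metrics.
    static double[] runNpairs(int q) {
        return new double[] {0.0, 0.0};
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<double[]>> results = new ArrayList<>();
        for (int q = 2; q <= 100; q++) {
            final int qVal = q;
            // Each value of Q is evaluated independently of all others.
            results.add(pool.submit(() -> runNpairs(qVal)));
        }
        for (Future<double[]> f : results) {
            f.get(); // wait for all instances to finish
        }
        pool.shutdown();
    }
}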
III. ACCELERATING NPAIRS
The first step towards accelerating NPAIRS is understanding
its execution profile. We instrumented the NPAIRS source code¹ [10].

¹NPAIRS is an open-source software package under the GNU GPL v2 license.

Fig. 2. High-level workflow of the NPAIRS application: after the parameters are set up and the dataset is loaded, an initial PCA is performed; the resampling loop (by default 100 iterations) then splits the data into two halves, runs PCA and CVA on each split, and computes the evaluation metrics (r and p); finally, the results are aggregated.
Based on our instrumentation, we created an execution profile of NPAIRS on a dataset of 25 subjects for different numbers of principal components. Figure 3 illustrates the proportion of execution time spent in different parts of NPAIRS (as depicted in Figure 2) on a machine with two Intel Xeon E5-2650 CPUs.
The execution profile of NPAIRS reveals that, on average, the computation of the PCA algorithm in the resampling loop accounts for 70% of the total execution time. This is expected, since PCA is the most complex part of the application and is executed in the resampling loop 200 times (100 times for each split). Therefore, we expect that accelerating PCA will significantly reduce NPAIRS execution time.
In NPAIRS, the number of principal components (#PCs) can be defined as an input to the application. After the PCA function computes the eigenvalues and eigenvectors of its input matrix, the #PCs eigenvectors with the highest eigenvalues are selected and passed to the CVA step (see Figure 2). The most computationally expensive operations in the PCA algorithm can be translated into basic linear algebra operations, which are intrinsically parallel. This characteristic makes PCA a suitable candidate for execution on GPUs, which consist of a large number of weak cores that can efficiently execute small operations in parallel. Thus, to improve the performance of NPAIRS with minimal change to the source code, we move the computation of principal component analysis from the CPU to the GPU.
A. Implementing PCA on GPU
NPAIRS uses the covariance method to compute the principal components of preprocessed fMRI images [11]. To remain consistent with the existing implementation, we implemented the PCA algorithm on the GPU using the same method.

Fig. 3. Execution profile of NPAIRS for different numbers of principal components on a dataset of 25 subjects. The PCA computation in the resampling loop accounts for 70% of the total computation time on average.

The covariance method for computing principal components works as follows.
First, the input matrix $M$ (of size $m \times n$), which contains the preprocessed fMRI images, is normalized by subtracting the average of each column from the elements of that column, as in eq. 1:

$\hat{M}_{ij} = M_{ij} - \frac{1}{m}\sum_{k=0}^{m-1} M_{kj}, \quad \forall\, i,j:\ 0 \le i < m,\ 0 \le j < n$   (1)
Then, the sum of squares and products (SSP) matrix is computed by multiplying the transpose of the normalized matrix with itself (eq. 2):

$SSP = \hat{M}^{T} \times \hat{M}$   (2)
Next, the eigenvalues and eigenvectors of the symmetric SSP matrix are computed. Finally, the normalized input matrix is multiplied by the matrix whose columns are the eigenvectors computed in the previous step (eq. 3):

$PC_{score} = \hat{M} \times \mathrm{EigenVectors}(SSP)$   (3)

The outputs of PCA used in NPAIRS are $PC_{score}$, the eigenvalues, and the eigenvectors.
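To make these steps concrete, the following is a minimal Java sketch of the covariance method; it is our illustration, not the NPAIRS or GPU code. The eigendecomposition of the symmetric SSP matrix is assumed to come from an external solver (CULA on the GPU, as described below), so it enters only as a parameter.

class CovariancePca {
    // Eq. 1: column-center the input matrix M (rows = scans, cols = voxels).
    static double[][] normalize(double[][] m) {
        int rows = m.length, cols = m[0].length;
        double[][] mHat = new double[rows][cols];
        for (int j = 0; j < cols; j++) {
            double mean = 0.0;
            for (int i = 0; i < rows; i++) mean += m[i][j];
            mean /= rows;
            for (int i = 0; i < rows; i++) mHat[i][j] = m[i][j] - mean;
        }
        return mHat;
    }

    // Eq. 2: SSP = Mhat^T * Mhat, a symmetric cols-by-cols matrix.
    static double[][] ssp(double[][] mHat) {
        int rows = mHat.length, cols = mHat[0].length;
        double[][] s = new double[cols][cols];
        for (int i = 0; i < cols; i++)
            for (int j = 0; j < cols; j++)
                for (int k = 0; k < rows; k++)
                    s[i][j] += mHat[k][i] * mHat[k][j];
        return s;
    }

    // Eq. 3: PCscore = Mhat * V, where column j of V is the j-th
    // eigenvector of SSP (obtained from an external eigensolver).
    static double[][] pcScore(double[][] mHat, double[][] eigenvectors) {
        int rows = mHat.length, cols = mHat[0].length;
        double[][] score = new double[rows][cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                for (int k = 0; k < cols; k++)
                    score[i][j] += mHat[i][k] * eigenvectors[k][j];
        return score;
    }
}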
There are two main frameworks for programming on GPUs: Compute Unified Device Architecture (CUDA) [2] and OpenCL [3]. The former was introduced by NVIDIA, one of the pioneering manufacturers of GPUs. The latter is implemented by the Khronos Group, an industry consortium that creates open standards for the authoring and acceleration of parallel computing, graphics, and more. Some studies show that CUDA outperforms OpenCL in many applications [12]. Moreover, our target GPU is an NVIDIA Titan, and CUDA is well suited for NVIDIA GPUs. Additionally, CUDA is more mature in terms of available matrix operation libraries, e.g. cuBLAS. Considering all these advantages, we decided to use CUDA as our framework for programming on GPUs. We used the CUDA Basic Linear Algebra Subroutines (cuBLAS) library [13] for matrix computations and the CULA library [14] for eigenvalue decomposition.
B. Invoking CUDA from Java
Since NPAIRS is implemented in Java and we used CUDA
to implement PCA on GPU, integration of these two pieces
of code could be challenging. Moreover, because developers
and users of NPAIRS application are biomedical researchers,
it is necessary to do the integration with minimal changes
to the NPAIRS original source code. Generally, there are
two methods for interfacing a non-Java code with a Java
application: in-process and inter-process. The former in which
the interfacing is done in a single process has less performance
overhead. However, the latter in which the interfacing is
done in multiple processes is more portable. Java Native
Interface (JNI) and sharing data for example by writing on
disk are the best candidates in terms of minimal changes
for in-process and inter-process communication, respectively.
However, transferring data through writing on disk or ramdisk
is inefficient due to its high I/O overhead. Since the goal is to
improve performance of NPAIRS, in-process communication
is a better choice for integrating PCA code on GPU with the
NPAIRS source code. Java Native Interface (JNI) [15] is the
most commonly used method for in-process communication
in Java. JNI enables a Java application, running in a Java
Virtual Machine (JVM), to call an external library function
implemented in other languages such as C and CUDA. We
used JNI to connect NPAIRS with the PCA implementation
on GPU.
We created a C library from our CUDA implementation of the PCA algorithm. This library contains a function named PCA_on_GPU that performs PCA on the GPU. When an instance of NPAIRS calls the PCA_on_GPU function, the CPU first initializes the GPU. Then, our library transfers the CUDA implementation of PCA and the input matrix to the GPU. Finally, after the GPU finishes computing the PCA function, our library retrieves the results from the GPU and passes them to the NPAIRS instance.
In order to execute PCA on the GPU, we only need to add a few lines to the original NPAIRS source code to: 1) check the availability of a GPU; 2) set up temporary variables to receive the results from the GPU and store them in the corresponding NPAIRS variables; and 3) call the PCA function on the GPU. A snapshot of the NPAIRS source code that supports execution of PCA on the GPU is depicted in Figure 4. In this figure, lines 4 to 16 reflect the only necessary modifications to the original NPAIRS source code. Since the NPAIRS application consists of more than 100,000 lines of code, this modification is negligible.
Transferring data between the JVM and the native library can affect the performance of the application. This issue is exacerbated in NPAIRS because the PCA_on_GPU function in the native library is executed multiple times (by default 200 times, in the resampling loop). To alleviate the overhead of data transfer between the JVM and the native library, we keep the data in the JVM's memory space and only pass pointers to those memory locations to the native library.
 1   class PCA {
 2     native void PCA_on_GPU(double[] M, long nRows,
         long nCols, double[] PC_score, double[] evec_tmp,
         double[] eigenvalues);
       ...
 3     computePCA(Matrix M, boolean normalizeBySD) {
       ...
 4       if (isGPUAvailable()) {
 5         System.loadLibrary("PCA_GPU_lib");
 6         double[] pca_tmp = new double[nRows*nCols];
 7         double[] evec_tmp = new double[nCols*nCols];
 8         PCA_on_GPU(M, nRows, nCols, pca_tmp, evec_tmp,
             eigenvalues);
 9         for (int i = 0; i < nRows; i++)
10           for (int j = 0; j < nCols; j++)
11             PCscore.set(i, j, pca_tmp[j*nCols+i]);
12         for (int i = 0; i < nCols; i++)
13           for (int j = 0; j < nCols; j++)
14             eigenvectors.set(i, j, evec_tmp[j*nCols+i]);
15       }
16       else {
           ... /* original CPU implementation of PCA */
17       }
18     }
       ...
19   }

Fig. 4. To enable NPAIRS to compute PCA on a GPU, we only need to add a few lines to the PCA class in the NPAIRS source code, which totals more than 100,000 lines of code.
C. Experimental Results
To evaluate the efficiency of our proposed GPU implementation, we executed the GPU-assisted NPAIRS on three different resources: two CPU nodes, Fat (32 cores) and Light (16 cores), and one GPU node, all described in Section V. In this experiment we varied the number of principal components from 2 to 500 and evaluated NPAIRS on two datasets, with 25 and 31 subjects. The results of this experiment are depicted in Figure 5. As expected, the larger dataset needs more time to be processed. In addition, the execution time increases as the number of principal components increases; this is due to the nature of the NPAIRS application, in which the number of principal components directly affects the size of the input to CVA and thus the execution time of NPAIRS. The results also show that the difference between the execution time of the original NPAIRS running on a CPU and that of the GPU-assisted implementation is larger for the dataset with 31 subjects than for the dataset with 25 subjects. This confirms the suitability of PCA for execution on a GPU: although a larger dataset imposes more data-movement overhead on the GPU, the results show that the achieved speedup more than compensates for it.
The execution profile of our GPU implementation of NPAIRS is depicted in Figure 6. The figure shows that in the GPU-assisted implementation, the PCA computation accounts for about 20% of the execution time of NPAIRS, whereas when running NPAIRS on a CPU node this proportion is close to 70% on average (Figure 3). It should be noted that the execution of PCA as a stand-alone
Fig. 5. Execution time of NPAIRS (in seconds) for different numbers of principal components on two datasets (25 and 31 subjects), on three different nodes: a GPU node, a Light node, and a Fat node.
Fig. 6. Execution profile of the GPU-assisted NPAIRS for different numbers of principal components on a dataset of 25 subjects running on a GPU node.
application on a GPU can be up to 12 times faster than
the CPU implementation of PCA. However, since the PCA
computation accounts for about 70% of the execution time of
NPAIRS, our GPU-assisted implementation speeds up NPAIRS by 3 to 7 times overall.
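As a rough sanity check (our own illustration using Amdahl's law, not a derivation from the measurements): if a fraction $p$ of the runtime is accelerated by a factor $s$, the overall speedup is

$S = \frac{1}{(1 - p) + p/s}$

With $p = 0.7$ and $s = 12$, this gives $S \approx 2.8$, matching the low end of the observed range; the larger observed speedups correspond to configurations in Figure 3 where PCA consumes well over 70% of the runtime (e.g., $p = 0.9$ gives $S \approx 5.7$).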
IV. SCHEDULING ON A HETEROGENEOUS CLUSTER
We showed that the GPU-assisted implementation of NPAIRS can speed up its execution by 3 to 7 times. However, in a real-world research cluster, not all nodes are equipped with a GPU. Those CPU-only nodes, though much slower than GPU nodes, can still contribute to an exhaustive search over the number of principal components for NPAIRS: instances of NPAIRS, each of which evaluates a different value for the number of principal components, can be executed on a CPU node or a GPU node in parallel. However, since the performance of NPAIRS on GPU and CPU nodes differs significantly, executing NPAIRS on a heterogeneous cluster requires carefully scheduling the instances of NPAIRS. In this section, we study the performance of well-known scheduling algorithms, which can be used regardless of the cluster management platform. We show that the execution time of the exhaustive search over the number of principal components on a cluster varies significantly across scheduling algorithms.
There are several performance metrics for evaluating scheduling algorithms, such as latency, throughput, and makespan. In this application, the goal is to minimize the execution time of multiple instances of NPAIRS, each evaluating a different number of principal components, running in parallel on a cluster. This goal is achieved when all instances of NPAIRS have successfully finished their execution. It should be noted that instances of NPAIRS are independent, have no deadline, and can be executed in parallel on any resource in the cluster. Considering these characteristics, we use the makespan of the NPAIRS jobs, i.e. the time taken to execute all instances of NPAIRS, to evaluate the performance of the different scheduling algorithms.
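Formally, if $c_j$ denotes the completion time of job $j$ in a schedule of the batch $J$, with all jobs submitted at time zero, the makespan is $\max_{j \in J} c_j$.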
A. Overview of the Scheduling Algorithms
Although finding an optimal schedule with the minimum makespan is an NP-hard problem, there are two different approaches to finding a near-optimal schedule.
The first approach uses machine learning techniques to explore the whole solution space, i.e. all possible schedules. In this approach, the exploration stops when some predefined constraints on the solution quality are satisfied or the algorithm's execution time exceeds some threshold. Genetic Algorithms [16], [17] and Bee Colony Optimization [18], [19] are examples of this approach. Although it may result in a better solution, this approach is compute-intensive and suffers from poor scalability.
The second approach uses greedy algorithms that iteratively optimize a partial solution, aiming to find a near-optimal final solution. Scheduling algorithms in this approach have polynomial execution time and are more scalable than the machine-learning-based ones. Moreover, if implemented efficiently, their results can be competitive with those of machine learning techniques. We implemented and evaluated the following well-known algorithms.
1) Shortest Job First (SJF) and Longest Job First (LJF):
In SJF [20], the submitted jobs are first sorted in ascending order of their estimated execution time on the GPU. Then, the shortest unscheduled job is assigned to the fastest available node; if all nodes of the cluster are busy, the shortest unscheduled job waits until a node becomes available. LJF [20] is identical except that jobs are sorted in descending order of their estimated execution time and scheduled in that order.
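A minimal Java sketch of this policy follows (our illustrative model, not the paper's simulator). The matrix exec[n][j], holding the runtime of job j on node n, stands in for the measured execution profiles; jobs are assumed pre-sorted by estimated GPU execution time, ascending for SJF and descending for LJF.

class SjfScheduler {
    // Simulate SJF/LJF on a batch: each job in the given order goes to the
    // node that becomes free earliest (ties broken by speed on that job).
    static double makespan(double[][] exec, int[] orderedJobs) {
        double[] freeAt = new double[exec.length]; // next idle time per node
        double makespan = 0.0;
        for (int j : orderedJobs) {
            int best = 0;
            for (int n = 1; n < exec.length; n++) {
                if (freeAt[n] < freeAt[best]
                        || (freeAt[n] == freeAt[best] && exec[n][j] < exec[best][j])) {
                    best = n;
                }
            }
            freeAt[best] += exec[best][j];
            makespan = Math.max(makespan, freeAt[best]);
        }
        return makespan;
    }
}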
2) Min-Min and Max-Min: The Min-Min algorithm [21],
[22] schedules the submitted jobs iteratively. At each iteration,
the finish time of each unscheduled job on each resource, i.e. node, is computed using eq. 4:

$f_{jr} = e_{jr} + avl_r$   (4)

where $f_{jr}$ and $e_{jr}$ are the finish time and the estimated execution time of job $j$ on resource $r$, respectively, and $avl_r$ is the earliest time at which resource $r$ becomes available. Then, for each job, the resource with the earliest finish time is chosen as its selected resource (the first "min"). Finally, the job with the earliest finish time is scheduled on its selected resource (the second "min"). The algorithm iterates until all jobs are scheduled.
The Max-Min algorithm [21], [22] is similar to Min-Min in its first "min" step, but in its second step it selects the job with the maximum finish time.
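A compact Java sketch of Min-Min based on eq. 4 follows, again with exec[n][j] as an illustrative stand-in for the execution profiles; Max-Min is obtained by selecting, among the per-job earliest finish times, the largest rather than the smallest in the outer step.

class MinMinScheduler {
    static double makespan(double[][] exec) {
        int nodes = exec.length, jobs = exec[0].length;
        double[] avail = new double[nodes];      // avl_r in eq. 4
        boolean[] scheduled = new boolean[jobs];
        double makespan = 0.0;
        for (int round = 0; round < jobs; round++) {
            int bestJob = -1, bestNode = -1;
            double bestFinish = Double.POSITIVE_INFINITY;
            for (int j = 0; j < jobs; j++) {
                if (scheduled[j]) continue;
                for (int n = 0; n < nodes; n++) {
                    double finish = exec[n][j] + avail[n]; // eq. 4
                    if (finish < bestFinish) {             // first and second "min"
                        bestFinish = finish;
                        bestJob = j;
                        bestNode = n;
                    }
                }
            }
            scheduled[bestJob] = true;
            avail[bestNode] = bestFinish;
            makespan = Math.max(makespan, bestFinish);
        }
        return makespan;
    }
}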
3) Sufferage: The Sufferage algorithm [4], [22] is an extension of Min-Min. It defines the sufferage of a job as the difference between the job's Earliest Finish Time (EFT) and its Second Earliest Finish Time (SEFT). The goal of the algorithm is to minimize the sufferage incurred by the submitted jobs, which it achieves by prioritizing jobs that compete for the same resource. Like Min-Min, Sufferage is an iterative algorithm; but unlike Min-Min, in which each iteration assigns only one job to one resource, Sufferage may assign multiple jobs to their preferred resources in a single iteration. The Sufferage algorithm works as follows.
At the beginning of each iteration, the finish times of all unscheduled jobs on all resources are computed. The unscheduled jobs are then sorted in ascending order of their EFTs across all resources. Starting with the unscheduled job with the smallest EFT, each job $i$ selects its preferred resource $r^*_i$, i.e. the resource that finishes the job earlier than any other resource in the cluster. If the preferred resource of job $i$ has not been assigned to any job in the current iteration, job $i$ is scheduled on it. But if, in the current iteration, the preferred resource of job $i$ has already been assigned to another job $j$, then job $i$ must compete with job $j$ for this resource: of the two, the job with the greater sufferage value is assigned to resource $r^*_i$, and the other is returned to the unscheduled job queue to be scheduled in a later iteration. An iteration of the Sufferage algorithm ends when all unscheduled jobs have been traversed as described above. The details of the Sufferage algorithm are presented in Algorithm 1.
V. EVALUATION OF SCHEDULING ALGORITHMS
To evaluate the scheduling algorithms introduced in the previous section, we first create execution profiles of NPAIRS with different values for the number of principal components. Then, we use our in-house simulator to evaluate the scheduling algorithms. The inputs to the simulator are: i) execution profiles of the NPAIRS application with different values for the number of principal components on all available resources; ii) a list of the available resources in the cluster; and iii) a list of the jobs submitted to the cluster with their input parameter, the number of principal components for NPAIRS.
The execution profiles for the NPAIRS jobs were obtained by individually executing each job on each node of our in-house heterogeneous cluster, which consists of three types of resources:
Fat node: Four Intel Xeon E5-4620 CPUs with 512GB
of memory.
Light node: Two Intel Xeon E5-2650 CPUs with 32GB
of memory.
GPU node: One Intel Core i7-3770K CPU with 16GB of
memory, equipped with a Nvidia GeForce GTX TITAN
GPU.
Algorithm 1: Sufferage Algorithm

while there is an unscheduled job do
    foreach unscheduled job j do
        foreach resource r in the cluster do
            f_j^r ← finish time of job j on resource r;
            if f_j^r < EFT of job j then
                SEFT of job j ← EFT of job j;
                EFT of job j ← f_j^r;
                r*_j ← r;
            else if f_j^r < SEFT of job j then
                SEFT of job j ← f_j^r;
        sufferage_j ← SEFT of job j - EFT of job j;
    sort jobs in ascending order of EFT;
    mark all resources as unassigned;
    foreach unscheduled job j (in EFT order) do
        if r*_j is unassigned then
            schedule job j on its preferred resource r*_j;
            set the status of r*_j to assigned;
        else if sufferage_j > sufferage of the job currently scheduled on r*_j then
            return the job currently scheduled on r*_j to the unscheduled job queue;
            schedule job j on r*_j;
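For concreteness, the following is a simplified Java rendering of Algorithm 1 (our sketch, not the simulator code). In each round, every unscheduled job bids for its earliest-finish resource; when two jobs contend for the same resource, the one with the larger sufferage (SEFT - EFT) wins, and the loser waits for a later round.

import java.util.Arrays;

class SufferageScheduler {
    static double makespan(double[][] exec) {
        int nodes = exec.length, jobs = exec[0].length;
        double[] avail = new double[nodes];
        boolean[] scheduled = new boolean[jobs];
        int remaining = jobs;
        double makespan = 0.0;
        while (remaining > 0) {
            int[] winner = new int[nodes];        // winning bidder per resource
            double[] winnerSuff = new double[nodes];
            Arrays.fill(winner, -1);
            for (int j = 0; j < jobs; j++) {
                if (scheduled[j]) continue;
                int preferred = -1;
                double eft = Double.POSITIVE_INFINITY;
                double seft = Double.POSITIVE_INFINITY;
                for (int n = 0; n < nodes; n++) {
                    double finish = exec[n][j] + avail[n];
                    if (finish < eft) {
                        seft = eft;
                        eft = finish;
                        preferred = n;
                    } else if (finish < seft) {
                        seft = finish;
                    }
                }
                double sufferage = seft - eft;
                // The job that suffers more displaces a weaker bidder.
                if (winner[preferred] < 0 || sufferage > winnerSuff[preferred]) {
                    winner[preferred] = j;
                    winnerSuff[preferred] = sufferage;
                }
            }
            for (int n = 0; n < nodes; n++) {
                if (winner[n] < 0) continue;
                scheduled[winner[n]] = true;
                remaining--;
                avail[n] += exec[n][winner[n]];
                makespan = Math.max(makespan, avail[n]);
            }
        }
        return makespan;
    }
}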
In the following experiments, unless otherwise mentioned,
we use a cluster of 3 Fat nodes, 3 Light nodes, and 2 GPU
nodes. We use the same fMRI dataset of 25 subjects for all
experiments. Also, for all experiments, submitted workload to
the cluster is a batch of 99 independent NPAIRS jobs each
with a unique value for the number of principal components
ranging from 2 to 100. We assume all NPAIRS job are part
of an exhaustive search process on the number of principal
components, as described in section II, and are submitted to
the cluster at the same time. All the scheduling algorithms
that we evaluate, schedule the batch of NPAIRS jobs on
their submission. We use makespan of the submitted batch of
NPAIRS jobs to evaluate efficiency of different clusters and
scheduling algorithms.
As a baseline for the scheduling algorithms, we evaluate a basic First Come First Serve (FCFS) scheduling algorithm. Since the makespan of a schedule produced by FCFS depends on the arrival order of the jobs, and we assume that all jobs are submitted in a batch at the same time, we repeat the FCFS algorithm 200 times on the same batch of jobs, each time with a different order, and report the average, best, and worst makespans.
In addition, we evaluate two basic scheduling algorithms, Torque [23] and Minimum Completion Time (MCT) [24]. Torque schedules all jobs on their fastest resource(s) in the cluster. The MCT algorithm [24] follows a greedy strategy and assigns each job to the node that can finish it soonest, taking into account the jobs already scheduled on the nodes.

Fig. 7. Execution time of the batch of NPAIRS jobs with different scheduling algorithms on a CPU cluster of 3 Fat nodes and 5 Light nodes.

Fig. 8. Execution time of the batch of NPAIRS jobs with different scheduling algorithms on a heterogeneous cluster of 3 Fat nodes, 3 Light nodes, and 2 GPU nodes.
To ensure a fair comparison between our CPU-GPU framework for NPAIRS and its original CPU implementation, we evaluate the different scheduling algorithms on a CPU cluster with 5 Light nodes and 3 Fat nodes, and use the best results obtained on this cluster for comparison with the results on the heterogeneous CPU-GPU cluster. Figure 7 depicts the makespan of the NPAIRS batch for all tested scheduling algorithms on the CPU cluster. The results indicate that in a CPU cluster, where the execution times of NPAIRS tasks do not differ significantly across nodes, the difference between the makespans of the tested scheduling algorithms is less than 0.25%. The only exception is the Torque scheduling algorithm [23], which schedules all jobs only on their fastest resources, the Light nodes in this cluster (Figure 5).
As we demonstrated in Section III, the NPAIRS application can get a significant performance boost by utilizing a GPU. To show the effect of a few GPU nodes on the performance of the NPAIRS workload, we build a heterogeneous cluster by replacing two Light nodes in the above-mentioned CPU cluster with two GPU nodes. In the heterogeneous cluster, since the execution time of NPAIRS varies significantly across nodes, the makespan of a batch of NPAIRS jobs depends on the scheduling algorithm, in contrast to the homogeneous CPU cluster.
The makespan of the batch of NPAIRS jobs on the heterogeneous cluster under different scheduling algorithms is illustrated in Figure 8. The results show that all algorithms except Torque perform better than the average FCFS makespan. This confirms our hypothesis about the importance of having a scheduling algorithm for a heterogeneous cluster.

Fig. 9. Resource utilization of different scheduling algorithms, broken down by GPU, Fat, and Light nodes, on a heterogeneous cluster of 3 Fat nodes, 3 Light nodes, and 2 GPU nodes.
Torque schedules jobs only on the fastest resources in the cluster, the two GPU nodes here, and does not utilize the other nodes. Therefore, the makespan of Torque is higher than that of the other scheduling algorithms.
SJF and LJF have slightly higher makespans (by less than 0.5%) than the best FCFS case. Min-Min and Max-Min, which can be considered more intelligent versions of SJF and LJF, outperform them by 3.5% and 2.5%, respectively. The difference between the performance of Min-Min and Max-Min is due to their different approaches to incrementally building the final schedule. Min-Min, at each iteration, tries to keep the load on the resources balanced by assigning a job to the resource that leads to the minimum possible increase in the current overall finish time. This strategy postpones the allocation of longer jobs and fails to create a good-quality schedule when a few very long jobs remain to be scheduled at the end of the algorithm: all resources except those running the remaining long jobs then have approximately the same finish time and sit idle while a few resources are busy executing the long jobs. Max-Min, on the other hand, gives higher priority to longer jobs. It fails when a long job scheduled on a powerful resource steals the opportunity from many shorter jobs. The NPAIRS workload does not contain any job with an extremely long execution time; for this reason, Min-Min outperforms Max-Min by 1.5%.
Sufferage is an extension of the Min-Min algorithm and reduces Min-Min's makespan by a further 3%. This is because all the NPAIRS jobs have the same preferred resources, the GPU nodes, and the Sufferage algorithm assigns the GPU nodes to the jobs that benefit most from running on them, i.e. the jobs that would suffer most from running on other nodes. Overall, Sufferage yields the minimum makespan among all evaluated scheduling algorithms. Compared to the Torque, FCFS, and MCT algorithms, Sufferage reduces the makespan of the NPAIRS jobs by 47%, 9%, and 4%, respectively. This is a considerable improvement, achieved with very low scheduling overhead.
To provide a more complete analysis of the evaluated scheduling algorithms, the resource utilization for each type of node in the heterogeneous cluster is presented in Figure 9. As expected, Sufferage has the most balanced utilization among all algorithms. The least utilized resources under the Sufferage algorithm are the Fat nodes. This low utilization of the Fat nodes is a key factor in Sufferage's success, because the Fat nodes execute NPAIRS jobs less efficiently than the GPU nodes and the Light nodes do (see Figure 5). In contrast to Sufferage, FCFS, SJF, and LJF utilize the Fat nodes the most, which explains their high makespans (Figure 8). Although Min-Min and Max-Min try to increase the utilization of the fastest nodes, i.e. the GPU nodes, and decrease that of the slowest nodes, i.e. the Fat nodes, they are not as successful as Sufferage at utilizing the Light nodes more than the Fat nodes.

Fig. 10. Execution time of the batch of NPAIRS jobs on three different clusters: a CPU cluster of 3 Fat nodes and 5 Light nodes, a GPU cluster of 2 GPU nodes, and a heterogeneous cluster of 3 Fat nodes, 3 Light nodes, and 2 GPU nodes.
To show that in the presence of powerful GPU nodes we can still benefit from the available CPU nodes, we compared the performance of a CPU-only cluster of 8 nodes, a GPU-only cluster of 2 nodes, and a heterogeneous cluster of 6 CPU nodes and 2 GPU nodes (Figure 10). The shortest makespan for the CPU cluster belongs to a single execution of the FCFS algorithm, which is less than 1% shorter than the makespan produced by the Sufferage algorithm. The shortest makespan for the heterogeneous cluster is produced by the Sufferage algorithm. The results support our approach of augmenting already available CPU nodes with a few powerful GPU nodes in a heterogeneous cluster.
VI. RELATED WORK
Since the advent of GPGPU, several research efforts have sought to efficiently use the computational power of GPUs for scientific applications. Using GPUs in medical image processing has been widely investigated [25]–[27]; one example is BROCCOLI [25], an OpenCL implementation of an fMRI analysis software package. However, the method we present in this paper benefits from GPUs while applying minimal changes to the existing CPU code of the fMRI application.
GPUs have also been used in a wide range of other scientific applications [28]–[30]. The authors of [28] implemented a numerical weather prediction algorithm on a GPU and integrated it into a weather forecasting application. They achieved a 7x speedup for the GPU version of the algorithm, but only a 2x speedup after integrating it into the whole application. In contrast, we obtained a 14x speedup for PCA and a 7x speedup after integrating it into NPAIRS. This implies that our implementation is efficient enough that, even with data transfer costs, we retain a substantial speedup.
In [31], a chemistry application is scheduled on an HPC platform. However, the benefit of using GPUs is not investigated in that work, since porting to a GPU would require a fundamental rewrite of the application. In contrast, our method applies minimal changes to NPAIRS while achieving an almost 7x speedup.
There are multiple works studying job scheduling for heterogeneous clusters. StarPU [32] is a scheduling framework for heterogeneous multi-core architectures. Several basic strategies have been implemented in StarPU; however, its main goal is to provide load balancing among all available resources. The authors of [33] extended Hadoop to perform task scheduling for GPU-based heterogeneous clusters, with the main goal of minimizing execution time. However, the Hadoop framework is not appropriate for small jobs with execution times under a minute; in cases where jobs execute very quickly on a GPU, using Hadoop is not reasonable due to its significant overhead. Ravi et al. [24] schedule a set of well-known applications on a cluster of CPU-GPU nodes. They developed a set of simple scheduling schemes and tested their method for both single-node and multi-node applications; like ours, their scheduling scheme targets independent jobs. Torque [23] is a resource manager that is widely used to manage heterogeneous clusters. The scheduling strategy of Torque is based on OpenPBS and is fairly simple: the user specifies the job to be executed, and once a resource is selected for that job, Torque does not consider the possibility of running it on another resource type. Given that users will always ask for the fastest resource to execute their jobs, this scheme imposes a high load imbalance on the system.
VII. CONCLUSION
The computational power of GPUs has recently been used to accelerate scientific applications. Neuroimaging applications mostly consist of complex linear algebra operations, which are intrinsically parallel; they are therefore among the best candidates for GPU acceleration. In this paper, we efficiently ported NPAIRS, a neuroimaging application, to the GPU by adding only a few lines of code to the original NPAIRS code. This minimal code change matters greatly to NPAIRS' non-expert users and developers, i.e. bio-medical scientists, because they do not have to cope with any fundamental change to the application.
As the second part of our research, we investigated the efficiency of different scheduling algorithms for running NPAIRS on a heterogeneous cluster. Experimental results show that when running the original NPAIRS code on a homogeneous cluster of CPU nodes, the scheduler does not have a considerable impact on overall execution time. However, by replacing a quarter of the CPU nodes with GPU nodes and utilizing the Sufferage scheduling algorithm, we improved the application performance by 44%. We also compared our results with Torque's basic scheduler and MCT, two commonly used schedulers in current HPC platforms; the Sufferage algorithm improves on Torque and MCT by 47% and 4%, respectively.
Our results show that by applying minimal changes to the original code and adding a few GPUs, each costing only 25% as much as a CPU node, a significant performance improvement is achieved.
REFERENCES
[1] Nvidia corporation. [Online]. Available: http://www.nvidia.com/
[2] (2007) Compute unified device architecture programming guide.
[Online]. Available: docs.nvidia.com/cuda
[3] Khronos group. [Online]. Available: https://www.khronos.org/opencl/
[4] M. Maheswaran, S. Ali, H. J. Siegel, D. Hensgen, and R. F. Freund,
“Dynamic mapping of a class of independent tasks onto heterogeneous
computing systems,” Journal of Parallel and Distributed Computing,
vol. 59, no. 2, pp. 107–131, 1999.
[5] N. K. Logothetis, “What we can do and what we cannot do with fMRI,” Nature, vol. 453, no. 7197, pp. 869–878, 2008.
[6] A. Eklund, “Computational medical image analysis: With a focus on
real-time fMRI and non-parametric statistics,” 2012.
[7] A. H. Andersen, D. M. Gash, and M. J. Avison, “Principal component
analysis of the dynamic response measured by fMRI: a generalized linear
systems framework,” Magnetic Resonance Imaging, vol. 17, no. 6, pp.
795–815, 1999.
[8] S. C. Strother, J. Anderson, L. K. Hansen, U. Kjems, R. Kustra, J. Sidtis,
S. Frutiger, S. Muley, S. LaConte, and D. Rottenberg, “The quantitative
evaluation of functional neuroimaging experiments: The NPAIRS data
analysis framework,” NeuroImage, vol. 15, no. 4, pp. 747–771, 2002.
[9] S. Strother, A. Oder, R. Spring, and C. Grady, “The NPAIRS com-
putational statistics framework for data analysis in neuroimaging,” in
Proceedings of COMPSTAT’2010, 2010, pp. 111–120.
[10] The PLS (partial least squares) and NPAIRS (nonparametric, prediction,
activation, influence, reproducibility, re-sampling) neuroimaging soft-
ware package. [Online]. Available: https://code.google.com/p/plsnpairs/
[11] S. Strother, S. L. Conte, L. K. Hansen, J. Anderson, J. Zhang,
S. Pulapura, and D. Rottenberg, “Optimizing the fMRI data-processing
pipeline using prediction and reproducibility performance metrics: I. a
preliminary group analysis,” NeuroImage, vol. 23, Supplement 1, no. 0,
pp. S196–S207, 2004 (Mathematics in Brain Imaging).
[12] K. Karimi, N. G. Dickson, and F. Hamze, “A Performance Comparison
of CUDA and OpenCL,” ArXiv e-prints, 2010.
[13] (2008) cuBLAS library. [Online]. Available: https://developer.nvidia.
com/cublas
[14] J. R. Humphrey, D. K. Price, K. E. Spagnoli, A. L. Paolini, and E. J.
Kelmelis, “CULA: hybrid GPU accelerated linear algebra routines,” in
SPIE Defense, Security, and Sensing, 2010.
[15] Java Native Interface. Accessed: 2014-03-01. [Online]. Available:
http://docs.oracle.com/javase/7/docs/technotes/guides/jni/
[16] L. Wang, H. J. Siegel, V. P. Roychowdhury, and A. A. Maciejewski,
“Task matching and scheduling in heterogeneous computing environ-
ments using a genetic-algorithm-based approach,” Journal of Parallel
and Distributed Computing, vol. 47, no. 1, pp. 8–22, 1997.
[17] S. Song, K. Hwang, and Y.-K. Kwok, “Risk-resilient heuristics and
genetic algorithms for security-assured grid job scheduling,” IEEE
Transactions on Computers, vol. 55, no. 6, pp. 703–719, 2006.
[18] M. Arsuaga-Rios, M. Vega-Rodriguez, and F. Prieto-Castrillo, “Multi-
objective artificial bee colony for scheduling in grid environments,” in
Swarm Intelligence (SIS), 2011 IEEE Symposium on, 2011, pp. 1–7.
[19] T. Davidovic, M. Selmic, and D. Teodorovic, “Scheduling independent
tasks: Bee colony optimization approach,” in Control and Automation,
2009. MED ’09. 17th Mediterranean Conference on, 2009, pp. 1020–
1025.
[20] A. Streit, “On job scheduling for HPC clusters and the dynP scheduler,” in High Performance Computing (HiPC 2001), 2001, pp. 58–67.
[21] O. H. Ibarra and C. E. Kim, “Heuristic algorithms for scheduling
independent tasks on nonidentical processors,” Journal of the ACM,
vol. 24, no. 2, pp. 280–289, 1977.
[22] T. D. Braun, H. J. Siegel, N. Beck, L. L. Bölöni, M. Maheswaran, A. I.
Reuther, J. P. Robertson, M. D. Theys, B. Yao, D. Hensgen, and R. F.
Freund, “A comparison of eleven static heuristics for mapping a class of
independent tasks onto heterogeneous distributed computing systems,”
Journal of Parallel and Distributed Computing, vol. 61, no. 6, pp. 810–
837, 2001.
[23] G. Staples, “Torque resource manager,” in Proceedings of the 2006
ACM/IEEE conference on Supercomputing, 2006, p. 8.
[24] V. T. Ravi, M. Becchi, W. Jiang, G. Agrawal, and S. Chakradhar,
“Scheduling concurrent applications on a cluster of CPU-GPU nodes,”
Future Generation Computer Systems, vol. 29, no. 8, pp. 2262–2271,
2013.
[25] A. Eklund, P. Dufort, M. Villani, and S. LaConte, “BROCCOLI: Soft-
ware for fast fMRI analysis on many-core CPUs and GPUs,” Frontiers in Neuroinformatics, vol. 8, p. 24, 2014.
[26] L. Shi, W. Liu, H. Zhang, Y. Xie, and D. Wang, “A survey of GPU-
based medical image computing techniques,” Quantitative imaging in
medicine and surgery, vol. 2, no. 3, pp. 188–206, 2012.
[27] A. R. F. da Silva, “cudaBayesreg: parallel implementation of a Bayesian multilevel model for fMRI data analysis,” Journal of Statistical Software,
vol. 44, no. 4, pp. 1–24, 2011.
[28] W. Vanderbauwhede and T. Takemi, “An investigation into the feasibility
and benefits of GPU/multicore acceleration of the weather research and
forecasting model,” in High Performance Computing and Simulation
(HPCS), 2013 International Conference on, 2013, pp. 482–489.
[29] F. Angiulli, S. Basta, S. Lodi, and C. Sartori, “Fast outlier detection using
a GPU,” in High Performance Computing and Simulation (HPCS), 2013
International Conference on, 2013, pp. 143–150.
[30] W. Liu, B. Schmidt, and W. Muller-Wittig, “CUDA-BLASTP: accel-
erating BLASTP on CUDA-enabled graphics hardware,” IEEE/ACM
Transactions on Computational Biology and Bioinformatics (TCBB),
vol. 8, no. 6, pp. 1678–1684, 2011.
[31] R. Warrender, J. Tindle, and D. Nelson, “Job scheduling in a high
performance computing environment,” in High Performance Computing
and Simulation (HPCS), 2013 International Conference on, 2013, pp.
592–598.
[32] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier, “StarPU: a unified platform for task scheduling on heterogeneous multicore
architectures,” Concurrency and Computation: Practice and Experience,
vol. 23, no. 2, pp. 187–198, 2011.
[33] K. Shirahata, H. Sato, and S. Matsuoka, “Hybrid map task scheduling
for GPU-based heterogeneous clusters,” in Cloud Computing Technology
and Science (CloudCom), 2010 IEEE Second International Conference
on, 2010, pp. 733–740.