Fig 2 - uploaded by Jarno Mielikainen
Schematic visualization of a GPU device.


Source publication
Article
Full-text available
The Weather Research and Forecasting (WRF) model is an atmospheric simulation system designed for both operational and research use. WRF is currently in operational use at the National Oceanic and Atmospheric Administration (NOAA)'s National Weather Service, as well as at the Air Force Weather Agency and meteorological services worldwide. G...

Similar publications

Article
Full-text available
Surface observations are the main conventional observations for weather forecasts. However, in modern numerical weather prediction, the use of surface observations, especially data over complex terrain, remains a unique challenge. There are fundamental difficulties in assimilating surface observations with three-dimensional variational data a...
Article
Full-text available
High-resolution numerical simulations are regularly used for severe weather forecasts. To improve model initial conditions, a single short localization is commonly applied in the ensemble Kalman filter when assimilating observations. This approach prevents large-scale corrections from appearing in a high-resolution analysis. To improve heavy rainfa...
Article
Full-text available
An innovative seamless probabilistic forecasting system has been developed within the EU project PROFORCE (Bridging of PRObabilistic FORecasts and Civil protEction). The system merges four different ensemble prediction systems and provides weather forecasts and the corresponding forecast uncertainties in a seamless way from several days ahead to th...

Citations

... The WRF Single-Moment 5-class (WSM5) scheme [8] additionally considers the composed classes water/ice and rain/snow as individual categories. Different approaches for parallel C and GPU implementations of WSM5 are provided by Mielikainen et al. [9] and Ridwan et al. [10], reaching speedups of up to 206 (considering memory transfers) and 403, respectively. The inclusion of hail as a sixth category of hydrometeors by Hong and Lim [11] yields the WRF Single-Moment 6-class (WSM6) scheme. ...
Article
Full-text available
This article provides an enhanced parallelization of the WSM7 microphysics scheme for the Weather Research and Forecasting (WRF) model. The parallelization is designed to maximize the utilization of a heterogeneous computing system consisting of CPUs, GPUs, or both. Therefore, the reference implementation of the WSM7 scheme is re-implemented for the heterogeneous execution model. For each time step, a dynamic load distribution is introduced which balances the computational load between the two components, aiming for an overall minimum execution time. The parallelized implementation is evaluated for a specific weather situation: the precipitation of the low-pressure zone “Bernd” from July 2021 is simulated using an Intel Core i7-7700 CPU and an NVIDIA GTX 1070 GPU. The results show a speedup of up to 28.51 for the GPU version in comparison with the reference implementation. The heterogeneous dynamic load balancing increases the achieved speedup even further by introducing a distribution factor that is updated at each time step.
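The per-time-step load redistribution described in this abstract can be sketched in miniature. This is a hedged illustration of the general idea only: the damped update rule, the function names, and the synthetic timings are invented here and are not the paper's actual implementation.

```python
def update_split(split, t_cpu, t_gpu, damping=0.5):
    """Rebalance the fraction of work assigned to the CPU.

    split        -- fraction of grid columns the CPU handled last step (0..1)
    t_cpu, t_gpu -- measured execution times of the last time step
    """
    # Cost per unit of work observed in the last step.
    cost_cpu = t_cpu / max(split, 1e-9)
    cost_gpu = t_gpu / max(1.0 - split, 1e-9)
    # Split that would equalize CPU and GPU finish times at these costs.
    target = cost_gpu / (cost_cpu + cost_gpu)
    # Damped step toward the target, to tolerate noisy timings.
    new_split = split + damping * (target - split)
    return min(max(new_split, 0.0), 1.0)

# Synthetic example: the CPU is 10x slower per column, so the split
# drifts toward giving the CPU 1/11 of the columns.
split = 0.5
for _ in range(20):
    t_cpu = split * 10.0          # pretend timings proportional to work
    t_gpu = (1.0 - split) * 1.0
    split = update_split(split, t_cpu, t_gpu)
```

The damping term is the design point: without it, noisy per-step timings would make the distribution factor oscillate instead of settle.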
... For the parameterization of cloud microphysical processes in various models and on various platforms, many scholars have carried out research and experiments on GPU-based acceleration algorithms and achieved good acceleration results. Mielikainen et al. [14] used a single GPU to accelerate the Weather Research and Forecasting (WRF) Single-Moment 5-class (WSM5) scheme, achieving a 389× speedup without I/O (input/output) transfer. When using 4 GPUs, the WRF WSM5 scheme achieved 357× and 1556× speedups with and without I/O transfer, respectively. ...
... Increasing the resolution to do more in-depth experiments is one of our next steps. Without considering I/O transfer, Mielikainen et al.'s [14] single-GPU parallel method for the Weather Research and Forecasting (WRF) Single-Moment 5-class (WSM5) scheme achieves a 389× speedup, while our GPU-CMS parallel work achieves a speedup of 507.18×. At the same time, we compared our GPU-CMS work with Kim J Y, et al.'s ...
Article
Full-text available
The National Center for Atmospheric Research released a global atmosphere model named Community Atmosphere Model version 5.0 (CAM5), which aims to provide global climate simulations for meteorological research. Its cloud microphysics scheme is extremely time-consuming, so developing efficient parallel algorithms is essential for large-scale, long-term simulations. Due to the wide application of GPUs in science and engineering and NVIDIA's mature and stable CUDA platform, we ported the code to the GPU to accelerate computing. In this paper, by analyzing the parallelism of the CAM5 cloud microphysics scheme (CAM5 CMS) in different dimensions, corresponding GPU-based one-dimensional (1D) and two-dimensional (2D) parallel acceleration algorithms are proposed. Among them, the 2D parallel algorithm exploits finer-grained parallelism. In addition, we present a data transfer optimization method between the CPU and GPU to further improve overall performance. Finally, a GPU version of the CAM5 CMS (GPU-CMS) was implemented. The GPU-CMS obtains a speedup of 141.69× on a single NVIDIA A100 GPU with I/O transfer. In the case without I/O transfer, compared to the baseline performance on a single Intel Xeon E5-2680 CPU core, the 2D acceleration algorithm obtained speedups of 48.75×, 280.11×, and 507.18× on a single NVIDIA K20, P100, and A100 GPU, respectively.
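The difference between the 1D and 2D decompositions described above can be illustrated with a toy index mapping. This is a sketch of the general idea under stated assumptions: the grid sizes are invented, the second dimension is assumed independent, and a real implementation maps these indices to CUDA threads rather than Python lists.

```python
# Toy grid: 4 horizontal columns x 3 vertical levels (invented sizes).
N_COLUMNS, N_LEVELS = 4, 3

# 1D scheme: one worker per column; the loop over levels stays serial
# inside each worker, so only N_COLUMNS units of parallelism exist.
workers_1d = list(range(N_COLUMNS))

# 2D scheme: one worker per (column, level) cell, assuming the cells
# are independent -> N_COLUMNS * N_LEVELS units of parallelism.
workers_2d = [(c, k) for c in range(N_COLUMNS) for k in range(N_LEVELS)]

def unflatten(tid):
    """Recover (column, level) from a flat worker id, as a 2D GPU
    kernel would compute its coordinates from its thread index."""
    return tid // N_LEVELS, tid % N_LEVELS
```

The finer 2D granularity is what lets a wide GPU stay busy when the number of columns alone is smaller than the number of available threads.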
... Summing the GPU memory transfer and compute for the 10-timestep performance test, the GPUs were 229 to 386 times faster than the single CPU (Table 2). This compares to published studies of ocean models that show a speed-up from CPU to GPU ranging from 5 to 50 (Bleichrodt et al., 2012; Zhao et al., 2017; Xu et al., 2014), and a speed-up of up to 1556x for a GPU/CUDA-based parallel Weather Research and Forecasting (WRF) model (Mielikainen et al., 2012). Note that our speed-up factor could be increased substantially by transferring data from the GPU to the CPU less frequently. ...
Preprint
Full-text available
Some programming languages are easy to develop at the cost of slow execution, while others are fast at run time but much more difficult to write. Julia is a programming language that aims to be the best of both worlds – a development and production language at the same time. To test Julia’s utility in scientific high-performance computing (HPC), we built an unstructured-mesh shallow water model in Julia and compared it against an established Fortran-MPI ocean model, MPAS-Ocean, as well as a Python shallow water code. Three versions of the Julia shallow water code were created, for: single-core CPU; graphics processing unit (GPU); and Message Passing Interface (MPI) CPU clusters. Comparing identical simulations revealed that our first version of the Julia model was 13 times faster than Python using NumPy, where both used an unthreaded single-core CPU. Further Julia optimizations, including static typing and removing implicit memory allocations, provided an additional 10–20x speed-up of the single-core CPU Julia model. The GPU-accelerated Julia code attained a speed-up of 230–380x compared to the single-core CPU Julia code. Parallelized Julia-MPI performance was identical to Fortran-MPI MPAS-Ocean for low processor counts, and ranged from 2x faster to 2x slower for higher processor counts. Our experience is that Julia development is fast and convenient for prototyping, but that Julia requires further investment and expertise to be competitive with compiled codes. We provide advice on Julia code optimization for HPC systems.
... Early examples of GPU use for atmospheric modeling applications explored porting computationally expensive elements of the Weather Research and Forecasting (WRF) model [20] to run on a GPU. This work, commonly referred to as GPU acceleration, includes work by Michalakes and Vachharajani [21], Mielikainen et al. [22], Silva et al. [23], and Wahib and Maruyama [24]. An alternative approach that atmospheric modelers have taken more recently is to port the entire atmospheric model to run resident on the GPU. ...
Article
Full-text available
Recent advances in the development of large eddy simulation (LES) atmospheric models with corresponding atmospheric transport and dispersion (AT&D) modeling capabilities have made it possible to simulate short, time-averaged, single realizations of pollutant dispersion at the spatial and temporal resolution necessary for common atmospheric dispersion needs, such as designing air sampling networks, assessing pollutant sensor system performance, and characterizing the impact of airborne materials on human health. The high computational burden required to form an ensemble of single-realization dispersion solutions using an LES and coupled AT&D model has, until recently, limited its use to a few proof-of-concept studies. An example of an LES model that can meet the temporal and spatial resolution and computational requirements of these applications is the joint outdoor-indoor urban large eddy simulation (JOULES). A key enabling element within JOULES is the computationally efficient graphics processing unit (GPU)-based LES, which is on the order of 150 times faster than if the LES contaminant dispersion simulations were executed on a central processing unit (CPU) computing platform. JOULES is capable of resolving the turbulence components at a suitable scale for both open terrain and urban landscapes, e.g., owing to varying environmental conditions and a diverse building topology. In this paper, we describe the JOULES modeling system, prior efforts to validate the accuracy of its meteorological simulations, and current results from an evaluation that uses ensembles of dispersion solutions for unstable, neutral, and stable static stability conditions in an open terrain environment.
... A large amount of research has been done on accelerating WRF microphysics codes, and remarkable speedups have been demonstrated [12,17,19,20]. Each of these efforts accelerates one standalone WRF microphysics subroutine on a given hardware platform, and no universal considerations are provided for different microphysics schemes and computer architectures. ...
Article
Full-text available
In large-scale atmospheric simulations, microphysics parameterization often takes a large portion of the simulation time and usually consists of dozens of parameterization schemes. Optimizing the performance of these schemes one by one on different hardware platforms is tedious and error-prone even for skilled programmers. In this work, we propose AutoWM, a novel domain-specific tool for universal performance acceleration of Weather Research and Forecasting (WRF) model microphysics on multi-/many-core systems. The main idea of AutoWM is to reconstruct various schemes into compositions of common building blocks and to optimize these building blocks, instead of the schemes, on target platforms for reuse. To achieve this goal, a lightweight domain-specific language, WML, is provided to describe different microphysics schemes so that the workflow information can be parsed and extracted easily. Experiments on the popular WRF single/double-moment microphysics schemes show that AutoWM can automatically generate well-optimized microphysics kernels on three multi- and many-core platforms, including Intel Ivy Bridge, Intel Xeon Phi, and the Chinese homegrown SW26010, with the average floating-point efficiency reaching 47%, 20%, and 10% of the theoretical peak performance, respectively.
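The building-block idea behind AutoWM can be sketched in miniature: a scheme description (as a WML parser might produce) is compiled into a composition of registered blocks, so only the blocks need per-platform optimization. Everything below — the block names, the toy moisture state, the registry — is invented for illustration and is not AutoWM's actual API.

```python
BLOCKS = {}

def block(name):
    """Register a building block under a name a parser could refer to."""
    def register(fn):
        BLOCKS[name] = fn
        return fn
    return register

@block("saturation_adjust")
def saturation_adjust(state):
    # Toy stand-in: condense a fixed amount of water vapor.
    state["qv"] = max(state["qv"] - 0.1, 0.0)
    return state

@block("autoconversion")
def autoconversion(state):
    # Toy stand-in: convert cloud water to rain.
    state["qr"] = state.get("qr", 0.0) + 0.05
    return state

def compile_scheme(description):
    """Turn a list of block names into one callable scheme."""
    steps = [BLOCKS[name] for name in description]
    def scheme(state):
        for step in steps:
            state = step(state)
        return state
    return scheme

# A WSM-like scheme described purely by block names.
wsm_like = compile_scheme(["saturation_adjust", "autoconversion"])
result = wsm_like({"qv": 0.5})
```

The payoff of this structure is reuse: tuning one block for a platform benefits every scheme whose description mentions it.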
... The WRF Kessler cloud microphysics scheme obtained a 132× speedup on 4 GPUs compared to its single-threaded CPU version [36]. The WRF WSM5 microphysics scheme was accelerated by 357× on four GPUs [37]. The horizontal diffusion method in the WRF was accelerated approximately 3.5 times using two Tesla K40m GPUs compared with the single-GPU version [38]. ...
Article
Full-text available
The atmospheric radiation physical process plays an important role in climate simulations. As a radiative transfer scheme, the rapid radiative transfer model for general circulation models (RRTMG) is widely used in weather forecasting and climate simulation systems. However, its expensive computational overhead poses a severe challenge to system performance, so improving the radiative transfer model's computational performance has significant research and practical value. Numerous radiative transfer models have benefited from widely used and powerful GPUs. Nevertheless, few of them have exploited CPU/GPU cluster resources within heterogeneous high-performance computing platforms. In this paper, we demonstrate an approach that runs a large-scale, computationally intensive, longwave radiative transfer model on a GPU cluster. First, a CUDA-based acceleration algorithm for the RRTMG longwave radiation scheme (RRTMG_LW) on multiple GPUs is proposed. Then, a heterogeneous hybrid programming paradigm (MPI+CUDA) is presented and applied to the RRTMG_LW on a GPU cluster. After implementing the algorithm in CUDA Fortran, a multi-GPU version of the RRTMG_LW, namely GPUs-RRTMG_LW, was developed. The experimental results demonstrate that the multi-GPU acceleration algorithm is valid, scalable, and highly efficient compared to a single GPU or CPU. Running the GPUs-RRTMG_LW on a K20 cluster achieved a 77.78× speedup when compared to a single Intel Xeon E5-2680 CPU core.
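One detail every MPI+CUDA hybrid must settle is which GPU each MPI rank drives. A common round-robin convention is sketched below; this is an assumption about typical practice, not the paper's actual mapping, and the node and GPU counts are invented.

```python
def assign_gpu(rank, ranks_per_node, gpus_per_node):
    """Map a global MPI rank to a local GPU id on its node.

    In a real code this id would be passed to cudaSetDevice (or the
    CUDA Fortran equivalent) before any kernel launches.
    """
    local_rank = rank % ranks_per_node
    return local_rank % gpus_per_node

# 8 ranks across 2 nodes with 4 GPUs per node: one GPU per rank.
mapping = [assign_gpu(r, ranks_per_node=4, gpus_per_node=4) for r in range(8)]
```

When there are more ranks than GPUs per node, the modulo wraps around and several ranks share a device, which is also a common (if slower) configuration.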
... In fact, exemplary activities of GPU acceleration specifically for LES targeting weather prediction include (Schalkwijk et al., 2012; Schalkwijk, Jonker, Siebesma, & Bosveld, 2015; Schalkwijk, Jonker, Siebesma, & Van Meijgaard, 2015), through focused efforts porting the Dutch Atmospheric Large-Eddy Simulation (DALES) model (Heus et al., 2010) to GPU. Notable early work porting a series of model components within the Weather Research and Forecast model (WRF) to GPU includes Michalakes and Vachharajani (2008), Mielikainen et al. (2012), Silva et al. (2014), and Wahib and Maruyama (2013). Moreover, several recent model formulation publications at a minimum make additional mention of GPU capability, for example, the Parallelized Large Eddy Simulation Model (PALM) (Maronga et al., 2015), or place a primary emphasis on GPU acceleration itself, for example, MicroHH (van Heerwaarden et al., 2017). ...
Article
Full-text available
This paper introduces a new large‐eddy simulation model, FastEddy®, purpose‐built for leveraging the accelerated and more power‐efficient computing capacity of graphics processing units (GPUs) toward adopting microscale turbulence‐resolving atmospheric boundary layer simulations into future numerical weather prediction activities. Here a basis for future endeavors with the FastEddy® model is provided by describing the model's dry dynamics formulation and investigating several validation scenarios that establish a baseline of model predictive skill for canonical neutral, convective, and stable boundary layer regimes, along with boundary layer flow over heterogeneous terrain. The current FastEddy® GPU performance and efficiency gains versus similarly formulated, state‐of‐the‐art CPU‐based models are determined through scaling tests comparing 1 GPU to 256 CPU cores. At this ratio of GPUs to CPU cores, FastEddy® achieves a prediction rate 6 times faster than commensurate CPU models under equivalent power consumption; alternatively, it uses 8 times less power at an equivalent prediction rate. The accelerated performance and efficiency gains of the FastEddy® model permit broader application of large‐eddy simulation to emerging atmospheric boundary layer research topics through a substantial reduction of computational resource requirements and an increase in model prediction rate.
... The Earth System Research Laboratory (ESRL) NIM (Non-hydrostatic Icosahedral Model), a global weather prediction model developed by the National Oceanic and Atmospheric Administration (NOAA), has also been parallelized onto both a GPU and a Many Integrated Core (MIC) architecture, using OpenMP and OpenACC directives (Govett et al., 2017). Following this, several WRF physics schemes have been parallelized onto a GPU using CUDA-C, showing encouraging performance compared to CPU codes (Mielikainen et al., 2012, 2013, 2016; Huang et al., 2015). In addition, physics and dynamics in the Non-hydrostatic Unified Model of the Atmosphere (NUMA) were accelerated using the Open Concurrent Compute Abstraction (OCCA), one of the thread languages (Abdi et al., 2019). ...
Article
Full-text available
In this study, we accelerated a microphysics scheme embedded within the Model for Prediction Across Scales (MPAS), using OpenACC directives. As one of the most time-consuming physics parameterization schemes, the Weather Research and Forecasting (WRF) single-moment 6-class microphysics scheme (WSM6) was parallelized onto a graphics processing unit (GPU). We applied several essential methodologies to optimize the performance of the WSM6 computation on the GPU, to minimize data transfer between the central processing unit (CPU) and GPU, and to reduce the waste of GPU threads during computation. As a result, we achieved GPU runs on a single Tesla V100 that were, on average, 2.38 times faster than runs with 48 message passing interface (MPI) processes. When porting the whole model onto the GPU, we achieved a 5.71× speed-up in the WSM6 computation, excluding I/O communication. In addition, a precise verification method distinguished nonlinear chaotic error growth from differences introduced by GPU computation, considering the characteristics of the major output variables from WSM6. We then compared the difference between the CPU and GPU runs to the difference between CPU runs with different compilers. Moreover, we examined bias in these differences, which can distort the climatology of the model simulation. Our approach successfully passed the verification process, representing a successful application of GPU acceleration to realistic full-model integration of MPAS.
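One way to reduce the waste of GPU threads mentioned in this abstract is to compact the grid columns that actually need microphysics before launching work, so no worker exits immediately. The sketch below shows that idea only; the activity predicate and the data are invented, and the actual WSM6 port may use a different strategy.

```python
def compact_active(columns, is_active):
    """Return indices of columns that need the expensive computation."""
    return [i for i, col in enumerate(columns) if is_active(col)]

# Toy per-column maximum cloud-water values (invented numbers): columns
# with no condensate would leave their GPU threads idle, so workers are
# launched only for the compacted active set.
columns = [0.0, 0.3, 0.0, 0.0, 0.7, 0.1]
active = compact_active(columns, lambda qc: qc > 0.0)
# 3 workers instead of 6; the results are scattered back via `active`.
```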
... Currently, increasing numbers of atmospheric applications are being accelerated by GPUs [20,21]. For example, the WSM5 microphysics scheme from the Weather Research and Forecasting (WRF) model obtained a 206× speedup on a GPU [22]. ...
Article
Full-text available
Graphics processing unit (GPU)-based computing for climate system models is a longstanding research area of interest. The rapid radiative transfer model for general circulation models (RRTMG), a popular atmospheric radiative transfer model, can calculate atmospheric radiative fluxes and heating rates. However, the RRTMG is computationally expensive, so an efficient GPU-based acceleration algorithm is needed to enable large-scale, long-term climate simulations. To improve the computational efficiency of radiative transfer, this paper proposes a GPU-based acceleration algorithm for the RRTMG longwave radiation scheme (RRTMG_LW). The algorithm's concept is to accelerate the RRTMG_LW in the g-point dimension. After implementing the algorithm in CUDA Fortran, the G-RRTMG_LW was developed. The experimental results indicated that the algorithm was effective. In the case without I/O transfer, the G-RRTMG_LW on one K40 GPU obtained a speedup of 30.98× over the baseline performance on a single Intel Xeon E5-2680 CPU core. When compared to its counterpart running on 10 CPU cores of an Intel Xeon E5-2680 v2, the G-RRTMG_LW on one K20 GPU in the case without I/O transfer achieved a speedup of 2.35×.
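The "accelerate in the g-point dimension" concept can be illustrated with a toy reduction: the per-column flux is a sum of independent per-g-point contributions, so each (column, g-point) pair can map to its own GPU thread, followed by a reduction. The contribution function and the g-point count below are invented stand-ins, not RRTMG_LW's actual physics.

```python
N_GPOINTS = 16  # invented; RRTMG_LW uses its own fixed g-point counts

def gpoint_contribution(col_state, g):
    # Stand-in for one monochromatic radiative-transfer calculation.
    return col_state * (g + 1)

def flux_serial(col_state):
    """CPU-style version: one serial loop over g-points per column."""
    return sum(gpoint_contribution(col_state, g) for g in range(N_GPOINTS))

def flux_parallel(col_state):
    """GPU-style version: every term is independent, so each could be
    computed by its own thread; the sum becomes a parallel reduction."""
    terms = [gpoint_contribution(col_state, g) for g in range(N_GPOINTS)]
    return sum(terms)
```

The two versions are numerically identical here; on real hardware the parallel form wins because the g-point loop multiplies the available parallelism by the g-point count.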
... GPUs have also been utilized for the Weather Research and Forecasting (WRF) model, and the effects of diverse optimization strategies on performance were discussed in detail. The processing time was reduced to 43.5 ms on the GPU, compared with 16,928 ms on the CPU [24]. ...
Article
Full-text available
An efficient parallel computation using graphics processing units (GPUs) is developed for studying the electromagnetic (EM) backscattering characteristics of a large three-dimensional sea surface. A slope-deterministic composite scattering model (SDCSM), which combines the quasi-specular scattering of the Kirchhoff Approximation (KA) with the Bragg scattering of the two-scale model (TSM), is utilized to calculate the normalized radar cross section (NRCS, in dB) of the sea surface. However, as radar resolution improves, a large sea surface can contain millions of triangular facets, making the computation of the NRCS time-consuming and inefficient. In this paper, the feasibility of using an NVIDIA Tesla K80 GPU with four compute unified device architecture (CUDA) optimization strategies to improve the calculation efficiency of EM backscattering from a large sea surface is verified. The GPU-accelerated SDCSM calculation takes full advantage of coalesced memory access, constant memory, fast-math compiler options, and asynchronous data transfer. The impact of the block size and the number of registers per thread is analyzed to further improve the computation speed. A significant speedup of 748.26x can be obtained utilizing a single GPU for the GPU-based SDCSM implementation compared with the CPU-based counterpart running on an Intel(R) Core(TM) i5-3450.
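The interplay of block size and registers per thread analyzed in this abstract comes down to streaming-multiprocessor (SM) occupancy: whichever resource runs out first caps how many blocks can be resident at once. The sketch below computes this with illustrative round numbers, which are assumptions and not the exact Tesla K80 limits.

```python
# Illustrative per-SM resource limits (round numbers, not the K80 spec).
MAX_THREADS_PER_SM = 2048
MAX_BLOCKS_PER_SM = 16
REGISTERS_PER_SM = 65536

def blocks_resident(block_size, regs_per_thread):
    """Blocks that fit on one SM given thread, block, and register caps."""
    by_threads = MAX_THREADS_PER_SM // block_size
    by_regs = REGISTERS_PER_SM // (block_size * regs_per_thread)
    return min(by_threads, by_regs, MAX_BLOCKS_PER_SM)

def occupancy(block_size, regs_per_thread):
    """Fraction of the SM's thread slots actually occupied."""
    return blocks_resident(block_size, regs_per_thread) * block_size / MAX_THREADS_PER_SM

# With 256-thread blocks: 32 registers/thread fills the SM, while
# 64 registers/thread makes the register file the bottleneck.
```

This is why compilers expose per-thread register caps (e.g. via launch bounds): trading a few registers per thread can double the number of resident warps available to hide memory latency.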