Fig 2 - uploaded by Jarno Mielikainen
Schematic visualization of a GPU device.


Source publication
Article
Full-text available
The Weather Research and Forecasting (WRF) model is an atmospheric simulation system designed for both operational and research use. WRF is currently in operational use at the National Oceanic and Atmospheric Administration (NOAA)'s National Weather Service, as well as at the Air Force Weather Agency and meteorological services worldwide. G...

Similar publications

Article
Full-text available
Surface observations are the main conventional observations for weather forecasts. However, in modern numerical weather prediction, the use of surface observations, especially data over complex terrain, remains a unique challenge. There are fundamental difficulties in assimilating surface observations with three-dimensional variational data a...
Article
Full-text available
High-resolution numerical simulations are regularly used for severe weather forecasts. To improve model initial conditions, a single short localization is commonly applied in the ensemble Kalman filter when assimilating observations. This approach prevents large-scale corrections from appearing in a high-resolution analysis. To improve heavy rainfa...
Article
Full-text available
An innovative seamless probabilistic forecasting system has been developed within the EU project PROFORCE (Bridging of PRObabilistic FORecasts and Civil protEction). The system merges four different ensemble prediction systems and provides weather forecasts and the corresponding forecast uncertainties in a seamless way from several days ahead to th...

Citations

... The WRF Single-Moment 5-class (WSM5) scheme [8] additionally considers the composed classes water/ice and rain/snow as individual categories. Different approaches for parallel C and GPU implementations of WSM5 are provided by Mielikainen et al. [9] and Ridwan et al. [10], reaching speedups of up to 206 (considering memory transfers) and 403, respectively. The inclusion of hail as a sixth category of hydrometeors by Hong and Lim [11] yields the WRF Single-Moment 6-class (WSM6) scheme. ...
Article
Full-text available
This article provides an enhanced parallelization of the WSM7 microphysics scheme for the Weather Research and Forecasting (WRF) model. The parallelization is designed to maximize the utilization of a heterogeneous computing system consisting of CPUs, GPUs, or both. Therefore, the reference implementation of the WSM7 scheme is re-implemented for the heterogeneous execution model. For each time step, a dynamic load distribution is introduced which balances the computational load between the two components, aiming for an overall minimum execution time. The parallelized implementation is evaluated for a specific weather situation: the precipitation of the low-pressure zone “Bernd” from July 2021 is simulated using an Intel Core i7-7700 CPU and an NVIDIA GTX 1070 GPU. The results show a speedup of up to 28.51 for the GPU version in comparison with the reference implementation. The heterogeneous dynamic load balancing increases the achieved speedup even further by introducing a distribution factor that is updated at each time step.
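The per-time-step load redistribution described in this abstract can be sketched in miniature. This is a hedged illustration of the general idea only: the damped update rule, the function names, and the synthetic timings are invented here and are not the paper's actual implementation.

```python
def update_split(split, t_cpu, t_gpu, damping=0.5):
    """Rebalance the fraction of work assigned to the CPU.

    split        -- fraction of grid columns the CPU handled last step (0..1)
    t_cpu, t_gpu -- measured execution times of the last time step
    """
    # Cost per unit of work observed in the last step.
    cost_cpu = t_cpu / max(split, 1e-9)
    cost_gpu = t_gpu / max(1.0 - split, 1e-9)
    # Split that would equalize CPU and GPU finish times at these costs.
    target = cost_gpu / (cost_cpu + cost_gpu)
    # Damped step toward the target, to tolerate noisy timings.
    new_split = split + damping * (target - split)
    return min(max(new_split, 0.0), 1.0)

# Synthetic example: the CPU is 10x slower per column, so the split
# drifts toward giving the CPU 1/11 of the columns.
split = 0.5
for _ in range(20):
    t_cpu = split * 10.0          # pretend timings proportional to work
    t_gpu = (1.0 - split) * 1.0
    split = update_split(split, t_cpu, t_gpu)
```

The damping term is the design point: without it, noisy per-step timings would make the distribution factor oscillate instead of settle.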
... For the parameterization of cloud microphysical processes in various models and on various platforms, many scholars have carried out research and experiments on GPU-based acceleration algorithms and achieved good acceleration results. Mielikainen et al. [14] used a single GPU to accelerate the Weather Research and Forecasting (WRF) Single-Moment 5-class (WSM5) scheme, achieving a 389× speedup without I/O (input/output) transfer. When using 4 GPUs, the WRF WSM5 scheme achieved 357× and 1556× speedups with and without I/O transfer, respectively. ...
... Increasing the resolution to do more in-depth experiments is one of our next steps. Without considering I/O transfer, Mielikainen et al.'s [14] single-GPU parallel method for the Weather Research and Forecasting (WRF) Single-Moment 5-class (WSM5) scheme achieves a 389× speedup, while our GPU-CMS parallel work achieves a speedup of 507.18×. At the same time, we compared our GPU-CMS work with Kim J Y, et al.'s ...
Article
Full-text available
The National Center for Atmospheric Research released a global atmosphere model named Community Atmosphere Model version 5.0 (CAM5), which aims to provide global climate simulations for meteorological research. Its cloud microphysics scheme is extremely time-consuming, so developing efficient parallel algorithms is essential for large-scale, long-term simulations. Due to the wide application of GPUs in science and engineering and NVIDIA's mature and stable CUDA platform, we ported the code to the GPU to accelerate computing. In this paper, by analyzing the parallelism of the CAM5 cloud microphysics scheme (CAM5 CMS) in different dimensions, corresponding GPU-based one-dimensional (1D) and two-dimensional (2D) parallel acceleration algorithms are proposed. Among them, the 2D parallel algorithm exploits finer-grained parallelism. In addition, we present a data transfer optimization method between the CPU and GPU to further improve overall performance. Finally, a GPU version of the CAM5 CMS (GPU-CMS) was implemented. The GPU-CMS obtains a speedup of 141.69× on a single NVIDIA A100 GPU with I/O transfer. In the case without I/O transfer, compared to the baseline performance on a single Intel Xeon E5-2680 CPU core, the 2D acceleration algorithm obtained speedups of 48.75×, 280.11×, and 507.18× on a single NVIDIA K20, P100, and A100 GPU, respectively.
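The difference between the 1D and 2D decompositions described above can be illustrated with a toy index mapping. This is a sketch of the general idea under stated assumptions: the grid sizes are invented, the second dimension is assumed independent, and a real implementation maps these indices to CUDA threads rather than Python lists.

```python
# Toy grid: 4 horizontal columns x 3 vertical levels (invented sizes).
N_COLUMNS, N_LEVELS = 4, 3

# 1D scheme: one worker per column; the loop over levels stays serial
# inside each worker, so only N_COLUMNS units of parallelism exist.
workers_1d = list(range(N_COLUMNS))

# 2D scheme: one worker per (column, level) cell, assuming the cells
# are independent -> N_COLUMNS * N_LEVELS units of parallelism.
workers_2d = [(c, k) for c in range(N_COLUMNS) for k in range(N_LEVELS)]

def unflatten(tid):
    """Recover (column, level) from a flat worker id, as a 2D GPU
    kernel would compute its coordinates from its thread index."""
    return tid // N_LEVELS, tid % N_LEVELS
```

The finer 2D granularity is what lets a wide GPU stay busy when the number of columns alone is smaller than the number of available threads.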
... Summing the GPU memory transfer and compute for the 10-timestep performance test, the GPUs were 229 to 386 times faster than the single CPU (Table 2). This compares to published studies of ocean models that show a speed-up from CPU to GPU ranging from 5 to 50 (Bleichrodt et al., 2012; Zhao et al., 2017; Xu et al., 2014), and a speed-up of up to 1556x for a GPU/CUDA-based parallel Weather Research and Forecasting (WRF) model (Mielikainen et al., 2012). Note that our speed-up factor could be increased substantially by transferring data from the GPU to the CPU less frequently. ...
Preprint
Full-text available
Some programming languages are easy to develop at the cost of slow execution, while others are fast at run time but much more difficult to write. Julia is a programming language that aims to be the best of both worlds – a development and production language at the same time. To test Julia’s utility in scientific high-performance computing (HPC), we built an unstructured-mesh shallow water model in Julia and compared it against an established Fortran-MPI ocean model, MPAS-Ocean, as well as a Python shallow water code. Three versions of the Julia shallow water code were created, for: single-core CPU; graphics processing unit (GPU); and Message Passing Interface (MPI) CPU clusters. Comparing identical simulations revealed that our first version of the Julia model was 13 times faster than Python using NumPy, where both used an unthreaded single-core CPU. Further Julia optimizations, including static typing and removing implicit memory allocations, provided an additional 10–20x speed-up of the single-core CPU Julia model. The GPU-accelerated Julia code attained a speed-up of 230–380x compared to the single-core CPU Julia code. Parallelized Julia-MPI performance was identical to Fortran-MPI MPAS-Ocean for low processor counts, and ranged from 2x faster to 2x slower for higher processor counts. Our experience is that Julia development is fast and convenient for prototyping, but that Julia requires further investment and expertise to be competitive with compiled codes. We provide advice on Julia code optimization for HPC systems.
... Early examples of GPU use for atmospheric modeling applications explored porting computationally expensive elements of the Weather Research and Forecasting (WRF) model [20] to run on a GPU. This work, commonly referred to as GPU acceleration, includes work by Michalakes and Vachharajani [21], Mielikainen et al. [22], Silva et al. [23], and Wahib and Maruyama [24]. An alternative approach that atmospheric modelers have taken more recently is to port the entire atmospheric model to run resident on the GPU. ...
Article
Full-text available
Recent advances in the development of large eddy simulation (LES) atmospheric models with corresponding atmospheric transport and dispersion (AT&D) modeling capabilities have made it possible to simulate short, time-averaged, single realizations of pollutant dispersion at the spatial and temporal resolution necessary for common atmospheric dispersion needs, such as designing air sampling networks, assessing pollutant sensor system performance, and characterizing the impact of airborne materials on human health. The high computational burden required to form an ensemble of single-realization dispersion solutions using an LES and coupled AT&D model has, until recently, limited its use to a few proof-of-concept studies. An example of an LES model that can meet the temporal and spatial resolution and computational requirements of these applications is the joint outdoor-indoor urban large eddy simulation (JOULES). A key enabling element within JOULES is the computationally efficient graphics processing unit (GPU)-based LES, which is on the order of 150 times faster than if the LES contaminant dispersion simulations were executed on a central processing unit (CPU) computing platform. JOULES is capable of resolving the turbulence components at a suitable scale for both open terrain and urban landscapes, e.g., owing to varying environmental conditions and a diverse building topology. In this paper, we describe the JOULES modeling system, prior efforts to validate the accuracy of its meteorological simulations, and current results from an evaluation that uses ensembles of dispersion solutions for unstable, neutral, and stable static stability conditions in an open terrain environment.
... A large amount of research has been done on accelerating WRF microphysics codes, and remarkable speedups have been demonstrated [12,17,19,20]. Each of these efforts accelerates one standalone WRF microphysics subroutine on a given hardware platform, and no universal considerations are provided for different microphysics schemes and computer architectures. ...
Article
Full-text available
In large-scale atmospheric simulations, microphysics parameterization often takes a large portion of the simulation time and usually consists of dozens of parameterization schemes. Optimizing the performance of these schemes one by one on different hardware platforms is tedious and error-prone even for skilled programmers. In this work, we propose AutoWM, a novel domain-specific tool for universal performance acceleration of Weather Research and Forecasting (WRF) model microphysics on multi-/many-core systems. The main idea of AutoWM is to reconstruct various schemes into compositions of common building blocks and to optimize these building blocks, instead of the schemes, on target platforms for reuse. To achieve this goal, a lightweight domain-specific language, WML, is provided to describe different microphysics schemes so that the workflow information can be parsed and extracted easily. Experiments on the popular WRF single/double-moment microphysics schemes show that AutoWM can automatically generate well-optimized microphysics kernels on three multi- and many-core platforms, including Intel Ivy Bridge, Intel Xeon Phi, and the Chinese homegrown SW26010, with the average floating-point efficiency reaching 47%, 20%, and 10% of the theoretical peak performance, respectively.
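The building-block idea behind AutoWM can be sketched in miniature: a scheme description (as a WML parser might produce) is compiled into a composition of registered blocks, so only the blocks need per-platform optimization. Everything below — the block names, the toy moisture state, the registry — is invented for illustration and is not AutoWM's actual API.

```python
BLOCKS = {}

def block(name):
    """Register a building block under a name a parser could refer to."""
    def register(fn):
        BLOCKS[name] = fn
        return fn
    return register

@block("saturation_adjust")
def saturation_adjust(state):
    # Toy stand-in: condense a fixed amount of water vapor.
    state["qv"] = max(state["qv"] - 0.1, 0.0)
    return state

@block("autoconversion")
def autoconversion(state):
    # Toy stand-in: convert cloud water to rain.
    state["qr"] = state.get("qr", 0.0) + 0.05
    return state

def compile_scheme(description):
    """Turn a list of block names into one callable scheme."""
    steps = [BLOCKS[name] for name in description]
    def scheme(state):
        for step in steps:
            state = step(state)
        return state
    return scheme

# A WSM-like scheme described purely by block names.
wsm_like = compile_scheme(["saturation_adjust", "autoconversion"])
result = wsm_like({"qv": 0.5})
```

The payoff of this structure is reuse: tuning one block for a platform benefits every scheme whose description mentions it.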
... The WRF Kessler cloud microphysics scheme obtained a 132× speedup on 4 GPUs compared to its single-threaded CPU version [36]. The WRF WSM5 microphysics scheme was accelerated by 357× on four GPUs [37]. The horizontal diffusion method in the WRF was accelerated approximately 3.5 times using two Tesla K40m GPUs compared with the single-GPU version [38]. ...
Article
Full-text available
The atmospheric radiation physical process plays an important role in climate simulations. As a radiative transfer scheme, the rapid radiative transfer model for general circulation models (RRTMG) is widely used in weather forecasting and climate simulation systems. However, its expensive computational overhead poses a severe challenge to system performance, so improving the radiative transfer model's computational performance has significant research and practical value. Numerous radiative transfer models have benefited from widely used and powerful GPUs. Nevertheless, few of them have exploited CPU/GPU cluster resources within heterogeneous high-performance computing platforms. In this paper, we demonstrate an approach that runs a large-scale, computationally intensive, longwave radiative transfer model on a GPU cluster. First, a CUDA-based acceleration algorithm for the RRTMG longwave radiation scheme (RRTMG_LW) on multiple GPUs is proposed. Then, a heterogeneous hybrid programming paradigm (MPI+CUDA) is presented and applied to the RRTMG_LW on a GPU cluster. After implementing the algorithm in CUDA Fortran, a multi-GPU version of the RRTMG_LW, namely GPUs-RRTMG_LW, was developed. The experimental results demonstrate that the multi-GPU acceleration algorithm is valid, scalable, and highly efficient compared to a single GPU or CPU. Running the GPUs-RRTMG_LW on a K20 cluster achieved a 77.78× speedup when compared to a single Intel Xeon E5-2680 CPU core.
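One detail every MPI+CUDA hybrid must settle is which GPU each MPI rank drives. A common round-robin convention is sketched below; this is an assumption about typical practice, not the paper's actual mapping, and the node and GPU counts are invented.

```python
def assign_gpu(rank, ranks_per_node, gpus_per_node):
    """Map a global MPI rank to a local GPU id on its node.

    In a real code this id would be passed to cudaSetDevice (or the
    CUDA Fortran equivalent) before any kernel launches.
    """
    local_rank = rank % ranks_per_node
    return local_rank % gpus_per_node

# 8 ranks across 2 nodes with 4 GPUs per node: one GPU per rank.
mapping = [assign_gpu(r, ranks_per_node=4, gpus_per_node=4) for r in range(8)]
```

When there are more ranks than GPUs per node, the modulo wraps around and several ranks share a device, which is also a common (if slower) configuration.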
... In fact, exemplary activities of GPU acceleration specifically for LES targeting weather prediction include (Schalkwijk et al., 2012; Schalkwijk, Jonker, Siebesma, & Bosveld, 2015; Schalkwijk, Jonker, Siebesma, & Van Meijgaard, 2015), through focused efforts porting the Dutch Atmospheric Large-Eddy Simulation (DALES) model (Heus et al., 2010) to GPU. Notable early work porting a series of model components within the Weather Research and Forecast model (WRF) to GPU includes Michalakes and Vachharajani (2008), Mielikainen et al. (2012), Silva et al. (2014), and Wahib and Maruyama (2013). Moreover, several recent model formulation publications at a minimum make additional mention of GPU capability, for example, the Parallelized Large Eddy Simulation Model (PALM) (Maronga et al., 2015), or place a primary emphasis on GPU acceleration itself, for example, MicroHH (van Heerwaarden et al., 2017). ...
Article
Full-text available
This paper introduces a new large‐eddy simulation model, FastEddy®, purpose‐built for leveraging the accelerated and more power‐efficient computing capacity of graphics processing units (GPUs) toward adopting microscale turbulence‐resolving atmospheric boundary layer simulations into future numerical weather prediction activities. Here a basis for future endeavors with the FastEddy® model is provided by describing the model's dry dynamics formulation and investigating several validation scenarios that establish a baseline of model predictive skill for canonical neutral, convective, and stable boundary layer regimes, along with boundary layer flow over heterogeneous terrain. The current FastEddy® GPU performance and efficiency gains versus similarly formulated, state‐of‐the‐art CPU‐based models are determined through scaling tests comparing 1 GPU to 256 CPU cores. At this ratio of GPUs to CPU cores, FastEddy® achieves a prediction rate 6 times faster than commensurate CPU models under equivalent power consumption; alternatively, it uses 8 times less power at an equivalent prediction rate. The accelerated performance and efficiency gains of the FastEddy® model permit broader application of large‐eddy simulation to emerging atmospheric boundary layer research topics through a substantial reduction of computational resource requirements and an increase in model prediction rate.
... The Earth System Research Laboratory (ESRL) NIM (Non-hydrostatic Icosahedral Model), a global weather prediction model developed by the National Oceanic and Atmospheric Administration (NOAA), has also been parallelized onto both a GPU and a Many Integrated Core (MIC) architecture, using OpenMP and OpenACC directives (Govett et al., 2017). Following this, several WRF physics schemes have been parallelized onto a GPU using CUDA-C, showing encouraging performance compared to CPU codes (Mielikainen et al., 2012, 2013, 2016; Huang et al., 2015). In addition, physics and dynamics in the Non-hydrostatic Unified Model of the Atmosphere (NUMA) were accelerated using the Open Concurrent Compute Abstraction (OCCA), one of the thread languages (Abdi et al., 2019). ...
Article
Full-text available
In this study, we accelerated a microphysics scheme embedded within the Model for Prediction Across Scales (MPAS), using OpenACC directives. As one of the most time-consuming physics parameterization schemes, the Weather Research and Forecasting (WRF) single-moment 6-class microphysics scheme (WSM6) was parallelized onto a graphics processing unit (GPU). We applied several essential methodologies to optimize the performance of the WSM6 computation on the GPU, to minimize data transfer between the central processing unit (CPU) and GPU, and to reduce the waste of GPU threads during computation. As a result, we achieved GPU runs on a single Tesla V100 that were, on average, 2.38 times faster than runs with 48 message passing interface (MPI) processes. When porting the whole model onto the GPU, we achieved a 5.71× speed-up in the WSM6 computation, excluding I/O communication. In addition, a precise verification method distinguished nonlinear chaotic error growth from differences introduced by GPU computation, considering the characteristics of the major output variables from WSM6. We then compared the difference between the CPU and GPU runs to the difference between CPU runs with different compilers. Moreover, we examined bias in these differences, which can distort the climatology of the model simulation. Our approach successfully passed the verification process, representing a successful application of GPU acceleration to realistic full-model integration of MPAS.
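One way to reduce the waste of GPU threads mentioned in this abstract is to compact the grid columns that actually need microphysics before launching work, so no worker exits immediately. The sketch below shows that idea only; the activity predicate and the data are invented, and the actual WSM6 port may use a different strategy.

```python
def compact_active(columns, is_active):
    """Return indices of columns that need the expensive computation."""
    return [i for i, col in enumerate(columns) if is_active(col)]

# Toy per-column maximum cloud-water values (invented numbers): columns
# with no condensate would leave their GPU threads idle, so workers are
# launched only for the compacted active set.
columns = [0.0, 0.3, 0.0, 0.0, 0.7, 0.1]
active = compact_active(columns, lambda qc: qc > 0.0)
# 3 workers instead of 6; the results are scattered back via `active`.
```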
... Currently, increasing numbers of atmospheric applications are being accelerated by GPUs [20,21]. For example, the WSM5 microphysics scheme from the Weather Research and Forecasting (WRF) model obtained a 206× speedup on a GPU [22]. ...
Article
Full-text available
Graphics processing unit (GPU)-based computing for climate system models is a longstanding research area of interest. The rapid radiative transfer model for general circulation models (RRTMG), a popular atmospheric radiative transfer model, can calculate atmospheric radiative fluxes and heating rates. However, the RRTMG is computationally expensive, so an efficient GPU-based acceleration algorithm is needed to enable large-scale, long-term climate simulations. To improve the computational efficiency of radiative transfer, this paper proposes a GPU-based acceleration algorithm for the RRTMG longwave radiation scheme (RRTMG_LW). The algorithm's concept is to accelerate the RRTMG_LW in the g-point dimension. After implementing the algorithm in CUDA Fortran, the G-RRTMG_LW was developed. The experimental results indicated that the algorithm was effective. In the case without I/O transfer, the G-RRTMG_LW on one K40 GPU obtained a speedup of 30.98× over the baseline performance on a single Intel Xeon E5-2680 CPU core. When compared to its counterpart running on 10 CPU cores of an Intel Xeon E5-2680 v2, the G-RRTMG_LW on one K20 GPU in the case without I/O transfer achieved a speedup of 2.35×.
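The "accelerate in the g-point dimension" concept can be illustrated with a toy reduction: the per-column flux is a sum of independent per-g-point contributions, so each (column, g-point) pair can map to its own GPU thread, followed by a reduction. The contribution function and the g-point count below are invented stand-ins, not RRTMG_LW's actual physics.

```python
N_GPOINTS = 16  # invented; RRTMG_LW uses its own fixed g-point counts

def gpoint_contribution(col_state, g):
    # Stand-in for one monochromatic radiative-transfer calculation.
    return col_state * (g + 1)

def flux_serial(col_state):
    """CPU-style version: one serial loop over g-points per column."""
    return sum(gpoint_contribution(col_state, g) for g in range(N_GPOINTS))

def flux_parallel(col_state):
    """GPU-style version: every term is independent, so each could be
    computed by its own thread; the sum becomes a parallel reduction."""
    terms = [gpoint_contribution(col_state, g) for g in range(N_GPOINTS)]
    return sum(terms)
```

The two versions are numerically identical here; on real hardware the parallel form wins because the g-point loop multiplies the available parallelism by the g-point count.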
... GPUs have also been utilized for the Weather Research and Forecasting (WRF) model, and the effects of diverse optimization strategies on performance were discussed in detail. The processing time was reduced to 43.5 ms on the GPU, compared with 16,928 ms on the CPU [24]. ...
Article
Full-text available
An efficient parallel computation using graphics processing units (GPUs) is developed for studying the electromagnetic (EM) backscattering characteristics of a large three-dimensional sea surface. A slope-deterministic composite scattering model (SDCSM), which combines the quasi-specular scattering of the Kirchhoff Approximation (KA) with the Bragg scattering of the two-scale model (TSM), is utilized to calculate the normalized radar cross section (NRCS, in dB) of the sea surface. However, as radar resolution improves, a large sea surface can contain millions of triangular facets, making the computation of the NRCS time-consuming and inefficient. In this paper, the feasibility of using an NVIDIA Tesla K80 GPU with four compute unified device architecture (CUDA) optimization strategies to improve the calculation efficiency of EM backscattering from a large sea surface is verified. The GPU-accelerated SDCSM calculation takes full advantage of coalesced memory access, constant memory, fast-math compiler options, and asynchronous data transfer. The impact of the block size and the number of registers per thread is analyzed to further improve the computation speed. A significant speedup of 748.26x can be obtained utilizing a single GPU for the GPU-based SDCSM implementation compared with the CPU-based counterpart running on an Intel(R) Core(TM) i5-3450.
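The interplay of block size and registers per thread analyzed in this abstract comes down to streaming-multiprocessor (SM) occupancy: whichever resource runs out first caps how many blocks can be resident at once. The sketch below computes this with illustrative round numbers, which are assumptions and not the exact Tesla K80 limits.

```python
# Illustrative per-SM resource limits (round numbers, not the K80 spec).
MAX_THREADS_PER_SM = 2048
MAX_BLOCKS_PER_SM = 16
REGISTERS_PER_SM = 65536

def blocks_resident(block_size, regs_per_thread):
    """Blocks that fit on one SM given thread, block, and register caps."""
    by_threads = MAX_THREADS_PER_SM // block_size
    by_regs = REGISTERS_PER_SM // (block_size * regs_per_thread)
    return min(by_threads, by_regs, MAX_BLOCKS_PER_SM)

def occupancy(block_size, regs_per_thread):
    """Fraction of the SM's thread slots actually occupied."""
    return blocks_resident(block_size, regs_per_thread) * block_size / MAX_THREADS_PER_SM

# With 256-thread blocks: 32 registers/thread fills the SM, while
# 64 registers/thread makes the register file the bottleneck.
```

This is why compilers expose per-thread register caps (e.g. via launch bounds): trading a few registers per thread can double the number of resident warps available to hide memory latency.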