Fig. 20
(a) Pressure contours around the full SSLV configuration, including the orbiter, external tank, solid rocket boosters, and fore and aft attach hardware, for the benchmarking case described in the text. (b) Parallel scalability of the Cart3D solver module on Columbia using the SSLV example on a 25-million-cell mesh. Runs were conducted on a single 512-CPU node of the Columbia system.


Source publication
Article
This paper focuses on the parallel performance of two high-performance aerodynamic simulation packages on the newly installed NASA Columbia supercomputer. These packages include both a high-fidelity, unstructured, Reynolds-averaged Navier–Stokes solver, and a fully-automated inviscid flow package for cut-cell Cartesian grids. The complementary com...


Citations

... Domain decomposition via space-filling curves permits parallel computation, and the solver has demonstrated excellent scalability on several thousand processing cores [16]. This solver has been extensively validated on scores of internal and external aerodynamic flows over a wide range of speeds, including against blast damage and shock-arrival times for the 2013 Chelyabinsk meteor [12]. ...
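The space-filling-curve decomposition this excerpt cites can be illustrated in a few lines: key each Cartesian cell with a Morton (Z-order) index, sort, and hand each rank a contiguous chunk of the sorted list. The sketch below is ours, not Cart3D's code; the toy mesh size, rank count, and bit-interleaving routine are our own choices.

    /* Sketch: partitioning Cartesian cells across ranks with a Morton
     * (Z-order) space-filling curve. Illustrative only. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Spread the low 21 bits of x so they occupy every third bit. */
    static uint64_t spread3(uint64_t x) {
        x &= 0x1fffff;
        x = (x | x << 32) & 0x1f00000000ffffULL;
        x = (x | x << 16) & 0x1f0000ff0000ffULL;
        x = (x | x <<  8) & 0x100f00f00f00f00fULL;
        x = (x | x <<  4) & 0x10c30c30c30c30c3ULL;
        x = (x | x <<  2) & 0x1249249249249249ULL;
        return x;
    }

    /* 63-bit Morton key interleaving the bits of (i, j, k). */
    static uint64_t morton3(uint32_t i, uint32_t j, uint32_t k) {
        return spread3(i) | spread3(j) << 1 | spread3(k) << 2;
    }

    static int cmp_u64(const void *a, const void *b) {
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
        return (x > y) - (x < y);
    }

    int main(void) {
        enum { N = 4, NRANKS = 4 };            /* 4x4x4 toy mesh, 4 ranks */
        uint64_t keys[N * N * N];
        int n = 0;
        for (uint32_t i = 0; i < N; i++)
            for (uint32_t j = 0; j < N; j++)
                for (uint32_t k = 0; k < N; k++)
                    keys[n++] = morton3(i, j, k);

        /* Sorting by Morton key orders cells along the space-filling
         * curve; equal contiguous chunks then give compact,
         * load-balanced subdomains. */
        qsort(keys, n, sizeof keys[0], cmp_u64);
        int chunk = n / NRANKS;
        for (int r = 0; r < NRANKS; r++)
            printf("rank %d: cells [%d, %d)\n", r, r * chunk, (r + 1) * chunk);
        return 0;
    }

Because the curve preserves spatial locality, neighboring cells tend to land on the same rank, which keeps communication volume low.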
Article
Entry and breakup models predict that an airburst in the Earth's atmosphere is likely for non-metallic asteroids with diameters up to approximately 200 meters (Collins, 2005; Collins, 2017; Wheeler, 2017). Objects of this size can deposit over 250 megatons of energy into the atmosphere. Fast-running ground damage prediction codes for such events rely heavily upon methods developed from nuclear weapons research to estimate the damage potential for an airburst at altitude (Collins, 2005; Mathias, 2017; Rumpf, 2017; Rumpf, 2016; Hills, 1993). In particular, these tools rely upon the powerful yield scaling laws developed for point-source blasts, which are used in conjunction with a Height of Burst (HOB) map to predict ground damage for an airburst of a specific energy at a given altitude. While this approach works extremely well for yields as large as tens of megatons, it becomes less accurate as asteroid size and effective yields increase to the hundreds of megatons potentially released in larger airburst events. Accordingly, this study revisits the assumptions underlying this approach and shows how atmospheric buoyancy becomes important as yield increases beyond a few megatons. We then use large-scale three-dimensional simulations to construct numerically generated Height of Burst maps that are appropriate at the higher energy levels associated with the entry of asteroids with diameters of hundreds of meters. These numerically generated HOB maps can then be incorporated into engineering methods for damage prediction, significantly improving their accuracy for asteroids with diameters greater than 80-100 m.
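For context on the scaling laws mentioned here: point-source HOB maps exploit the classical cube-root (Hopkinson-Cranz) similarity, under which the range at which a given overpressure occurs scales with the cube root of yield W, so a single unit-yield map covers all yields. A hedged sketch of that relation (notation ours):

    \[
      Z = \frac{R}{W^{1/3}}, \qquad
      R_{\Delta p}(W,\,h) \approx W^{1/3}\, R_{\Delta p}\!\left(1,\ \frac{h}{W^{1/3}}\right)
    \]

The abstract's argument is that this similarity degrades once buoyancy matters, beyond a few megatons, which is why numerically generated HOB maps are needed at the energies of larger asteroids.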
... The dataset used is a wing-body-nacelle-pylon geometry (DLR-F6) with 23 zones and 36 million grid points. The input dataset is 1.6 GB in size, and the solution file is 2 GB. 2) CART3D is a high-fidelity, inviscid CFD application that solves the Euler equations of fluid dynamics [23]. It includes a solver called Flowcart, which uses a second-order, cell-centered, finite-volume upwind spatial discretization scheme, in conjunction with a multi-grid accelerated Runge-Kutta method for steady-state cases. ...
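The steady-state driver described in this excerpt, multistage Runge-Kutta pseudo-time marching, can be demonstrated on a toy problem. The sketch below is ours, not Flowcart's source: it uses first-order upwind differences on 1D linear advection, one common three-stage coefficient set, and omits the multigrid acceleration entirely.

    /* Sketch: multistage Runge-Kutta pseudo-time marching to steady
     * state on a 1D upwind-discretized advection problem. */
    #include <stdio.h>
    #include <math.h>

    #define N 64

    static void residual(const double *u, double *r, double dx) {
        r[0] = 0.0;                          /* fixed inflow boundary */
        for (int i = 1; i < N; i++)
            r[i] = -(u[i] - u[i - 1]) / dx;  /* upwind flux balance */
    }

    int main(void) {
        /* One common 3-stage coefficient choice (an assumption here). */
        const double alpha[3] = { 0.25, 0.5, 1.0 };
        double u[N], u0[N], r[N], dx = 1.0 / N, dt = 0.8 * dx;

        for (int i = 0; i < N; i++) u[i] = (i == 0) ? 1.0 : 0.0;

        for (int iter = 0; iter < 2000; iter++) {
            for (int i = 0; i < N; i++) u0[i] = u[i];
            for (int s = 0; s < 3; s++) {        /* RK stages */
                residual(u, r, dx);
                for (int i = 0; i < N; i++)
                    u[i] = u0[i] + alpha[s] * dt * r[i];
            }
            residual(u, r, dx);
            double rnorm = 0.0;
            for (int i = 0; i < N; i++) rnorm += r[i] * r[i];
            if (sqrt(rnorm / N) < 1e-12) {       /* converged to steady state */
                printf("steady after %d iterations\n", iter + 1);
                break;
            }
        }
        return 0;
    }

In the real solver the same stage loop acts as a smoother on each level of the multigrid cycle, which is what makes the approach fast for steady cases.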
Conference Paper
The high performance computing (HPC) community has shown tremendous interest in exploring cloud computing as it promises high potential. In this paper, we examine the feasibility, performance, and scalability of production-quality scientific and engineering applications of interest to NASA on NASA's cloud computing platform, called Nebula, hosted at Ames Research Center. This work presents a comprehensive evaluation of Nebula using NUTTCP, HPCC, NPB, I/O, and MPI function benchmarks as well as four applications representative of the NASA HPC workload. Specifically, we compare Nebula performance on some of these benchmarks and applications to that of NASA's Pleiades supercomputer, a traditional HPC system. We also investigate the impact of virtIO and jumbo frames on interconnect performance. Overall results indicate that on Nebula (i) virtIO and jumbo frames improve network bandwidth by a factor of 5x, (ii) there is a significant virtualization layer overhead of about 10% to 25%, (iii) write performance is lower by a factor of 25x, (iv) latency for short MPI messages is very high, and (v) overall performance is 15% to 48% lower than that on Pleiades for NASA HPC applications. We also comment on the usability of the cloud platform.
... floating-point operations, instruction counts, clock cycles, cache misses/hits, etc. [18]. The tool relies on the Performance Application Programming Interface (PAPI) [13] to access hardware performance counters. In the present study, op_scope was built with PAPI version 4.1.0. ...
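The op_scope tool itself is not public, but the PAPI calls it sits on can be shown directly. This is a minimal sketch of the underlying counter interface only; the event choices and the measured loop are our own, and the low-level calls shown have had this shape since well before the PAPI 4.1.0 release the study used.

    /* Sketch: reading hardware performance counters through PAPI. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    int main(void) {
        int evset = PAPI_NULL;
        long long counts[2];
        double x = 1.0;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            exit(1);
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_TOT_INS);   /* instructions retired */
        PAPI_add_event(evset, PAPI_TOT_CYC);   /* total clock cycles   */

        PAPI_start(evset);
        for (int i = 0; i < 1000000; i++)      /* region being measured */
            x = x * 1.0000001 + 0.5;
        PAPI_stop(evset, counts);

        printf("ins=%lld cyc=%lld ipc=%.2f (x=%g)\n",
               counts[0], counts[1], (double)counts[0] / counts[1], x);
        return 0;
    }

Compile and link with -lpapi; which preset events are available on a given core is hardware-dependent and can be checked with the papi_avail utility.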
Conference Paper
Intel provides Hyper-Threading (HT) in processors based on its Pentium and Nehalem micro-architectures, such as the Westmere-EP. HT enables two threads to execute on each core in order to hide latencies related to data access. These two threads can execute simultaneously, filling unused stages in the functional unit pipelines. To aid better understanding of HT-related issues, we collect Performance Monitoring Unit (PMU) data (instructions retired; unhalted core cycles; L2 and L3 cache hits and misses; vector and scalar floating-point operations, etc.). We then use the PMU data to calculate a new metric of efficiency in order to quantify processor resource utilization and make comparisons of that utilization between single-threading (ST) and HT modes. We also study performance gain using unhalted core cycles, code efficiency of using vector units of the processor, and the impact of HT mode on various shared resources like L2 and L3 cache. Results using four full-scale, production-quality scientific applications from computational fluid dynamics (CFD) used by NASA scientists indicate that HT generally improves processor resource utilization efficiency, but does not necessarily translate into overall application performance gain.
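The paper's exact efficiency metric is not reproduced in this excerpt, so the formula below is only an illustration of the general shape such a PMU-derived utilization ratio can take (notation ours, not the authors'):

    \[
      E = \frac{\text{instructions retired, summed over the hardware threads on a core}}
               {\text{unhalted core cycles}}
    \]

computed once in ST mode and once in HT mode; HT pays off when the two-thread ratio exceeds the single-thread one by more than the accompanying loss from shared-cache contention.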
... NSU3D has been shown to scale well on massively parallel computer architectures using up to 4000 cores [21]. The results reported in this paper have been run on the NASA Pleiades machine, using MPI exclusively and using from 32 to 256 cores. Typical wall-clock requirements were approximately 1.5 hours for 1000 solver iterations on the medium grid using 64 cores. ...
Article
Simulation results for the first AIAA CFD High Lift Prediction Workshop using the unstructured computational fluid dynamics code NSU3D are presented. The solution algorithms employed in NSU3D for this study are described along with examples of convergence history and computational cost. The geometry used for the simulation is the NASA three-element swept-wing (Trap Wing) model with experimental data taken in the 14x22-foot wind tunnel at NASA Langley. Computational grids for the study were prepared by the authors with the VGRIDns package using CAD geometry provided by the workshop committee. A grid convergence study was performed using a family of three grids to assess sensitivity to grid resolution. A range of angle-of-attack values from 6° to 37° was completed on the medium grid and the resulting lift, drag and longitudinal moments are compared to the experimental results. The effect of changing the flap angle is investigated with a second model using an equivalent medium grid. Comparisons of surface pressure to experimental data are presented for both configurations. Flow features are also presented using surface-constrained and volume streamlines for selected cases.
... The problem is about 50M points and requires about 32GB of system memory. Cart3D is a simulation package targeted at conceptual and preliminary design of aerospace vehicles with complex geometries [14]. The code uses a topologically unstructured set of Cartesian meshes, and the resultant indirect addressing makes it CPU- and memory-bound, especially for large datasets. ...
Conference Paper
Resource sharing in commodity multicore processors can have a significant impact on the performance of production applications. In this paper we use a differential performance analysis methodology to quantify the costs of contention for resources in the memory hierarchy of several multicore processors used in high-end computers. In particular, by comparing runs that bind MPI processes to cores in different patterns, we can isolate the effects of resource sharing. We use this methodology to measure how such sharing affects the performance of four applications of interest to NASA: OVERFLOW, MITgcm, Cart3D, and NCC. We also use a subset of the HPCC benchmarks and hardware counter data to help interpret and validate our findings. We conduct our study on high-end computing platforms that use four different quad-core microprocessors: Intel Clovertown, Intel Harpertown, AMD Barcelona, and Intel Nehalem-EP. The results help further our understanding of the requirements these codes place on their production environments and also of each computer's ability to deliver performance.
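A minimal placement probe makes the binding-pattern methodology concrete: each rank reports the core it lands on, and re-running under different binding options (for example, Open MPI's --bind-to core versus --map-by socket) changes which ranks share caches and memory controllers. The program is our sketch, not the paper's tooling, and sched_getcpu() is Linux/glibc-specific.

    /* Sketch: report which core each MPI rank is running on. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>
    #include <unistd.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;
        char host[64];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        gethostname(host, sizeof host);

        /* Under pinned launches this mapping is stable run to run. */
        printf("rank %d of %d on %s, core %d\n",
               rank, size, host, sched_getcpu());

        MPI_Finalize();
        return 0;
    }

Comparing application timings between two such placements, with everything else held fixed, is the differential step that isolates the cost of the shared resource.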
... NSU3D has been extensively validated in stand-alone mode, both for steady-state fixed-wing cases as a regular participant in the AIAA Drag Prediction Workshop series [34, 35], as well as for unsteady aerodynamic and aeroelastic problems [36], and has been extensively benchmarked on large parallel computer systems [37]. For operation within an overset environment, an "iblanking" capability has been added. The iblank variable specifies at which nodes the solution variables are to be updated by the near-body solver (iblank = 1) and which nodes are not updated by the solver, i.e., fringes and holes (iblank = 0). ...
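A minimal sketch of the iblanked update loop described here (names are illustrative, not NSU3D's):

    /* Sketch: advance only field nodes; fringe and hole nodes keep
     * values supplied by the overset interpolation. */
    #include <stdio.h>

    void update_nodes(int n, const int *iblank,
                      double *q, const double *dq) {
        for (int i = 0; i < n; i++)
            if (iblank[i] == 1)   /* field node: solver owns it */
                q[i] += dq[i];
            /* iblank == 0: fringe/hole, filled by interpolation elsewhere */
    }

    int main(void) {
        int iblank[4] = { 1, 0, 0, 1 };   /* field, fringe, hole, field */
        double q[4]  = { 1.0, 1.0, 1.0, 1.0 };
        double dq[4] = { 0.5, 0.5, 0.5, 0.5 };
        update_nodes(4, iblank, q, dq);
        for (int i = 0; i < 4; i++) printf("q[%d] = %.1f\n", i, q[i]);
        return 0;
    }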
Article
This article describes the architecture, components, capabilities, and validation of the first version of the Helios platform, targeted towards rotorcraft aerodynamics. Capabilities delivered in the first version include fuselage aerodynamics with and without momentum-disk rotor models, and isolated rotor dynamics for ideal hover and forward flight coupled with aeroelasticity and trim. Helios is based on an overset framework that employs unstructured mixed-element meshes in the near-body domain combined with high-order Cartesian meshes in the off-body domain. In addition, the aerodynamics solution is coupled with structural dynamics and trim using a delta-coupling algorithm. The near-body CFD, off-body CFD, CSD and trim modules are coupled using a Python infrastructure that controls the execution sequence of the solution procedure. Specific validation studies presented include the Slowed Rotor Compound fuselage, Georgia Tech rotor body, TRAM rotor in hover and UH-60A rotor in forward flight. In all cases, Helios predictions are compared with experimental data and other state-of-the-art codes to demonstrate the accuracy, efficiency and scalability of the code.
... Moreover, its computational efficiency diminishes in the case of variable-coefficient operators. Additional examples of scalable approaches for unstructured meshes include [1] and [40]. In those works, multigrid approaches for general elliptic operators were proposed. ...
Article
In this article, we present a parallel geometric multigrid algorithm for solving variable-coefficient elliptic partial differential equations on the unit box (with Dirichlet or Neumann boundary conditions) using highly nonuniform, octree-based, conforming finite element discretizations. Our octrees are 2:1 balanced, that is, we allow no more than one octree-level difference between octants that share a face, edge, or vertex. We describe a parallel algorithm whose input is an arbitrary 2:1 balanced fine-grid octree and whose output is a set of coarser 2:1 balanced octrees that are used in the multigrid scheme. Also, we derive matrix-free schemes for the discretized finite element operators and the intergrid transfer operations. The overall scheme is second-order accurate for sufficiently smooth right-hand sides and material properties; its complexity for nearly uniform trees is O((N/np) log(N/np)) + O(np log np), where N is the number of octree nodes and np is the number of processors. Our implementation uses the Message Passing Interface standard. We present numerical experiments for the Laplace and Navier (linear elasticity) operators that demonstrate the scalability of our method. Our largest run was a highly nonuniform, 8-billion-unknown, elasticity calculation using 32,000 processors on the Teragrid system, "Ranger," at the Texas Advanced Computing Center. Our implementation is publicly available in the Dendro library, which is built on top of the PETSc library from Argonne National Laboratory.
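With linear octrees stored as Morton (locational) codes, the basic coarsening step behind such a multigrid hierarchy reduces to truncating three bits per level and deduplicating, as in this sketch. It is illustrative only: production codes such as Dendro also carry each octant's level explicitly and must re-enforce the 2:1 balance after coarsening.

    /* Sketch: coarsening a sorted list of leaf octants by mapping
     * each Morton code to its parent and emitting parents once. */
    #include <stdio.h>
    #include <stdint.h>

    /* Parent of an octant, given its Morton code at a fixed depth. */
    static uint64_t parent(uint64_t code) { return code >> 3; }

    int main(void) {
        /* Eight sibling leaves share one parent. */
        uint64_t leaves[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
        uint64_t last = UINT64_MAX;
        for (int i = 0; i < 8; i++) {
            uint64_t p = parent(leaves[i]);
            if (p != last)        /* emit each coarse octant once */
                printf("coarse octant %llu\n", (unsigned long long)p);
            last = p;
        }
        return 0;
    }

Because the leaf list is globally Morton-sorted, this pass parallelizes naturally: each processor coarsens its own contiguous chunk and only chunk boundaries need reconciliation.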
... CART3D is a high-fidelity, inviscid CFD application that solves the Euler equations of fluid dynamics [15]. It includes a solver called Flowcart, which uses a second-order, cell-centered, finite-volume upwind spatial discretization scheme, in conjunction with a multi-grid accelerated Runge-Kutta method for steady-state cases. ...
Article
In this paper, we present an early performance evaluation of a 624-core cluster based on the Intel® Xeon® Processor 5560 (code named "Nehalem-EP", and referred to as Xeon 5560 in this paper), the third-generation quad-core architecture from Intel. This is the first processor from Intel with a non-uniform memory access (NUMA) architecture managed by an on-chip integrated memory controller. It employs a point-to-point interconnect called the Intel® QuickPath Interconnect (QPI) between processors and to the input/output (I/O) hub. It also brings to a quad-core architecture both Intel's hyper-threading technology (or simultaneous multi-threading, "SMT") and Intel® Turbo Boost Technology ("Turbo mode"), which automatically allows processor cores to run faster than the base operating frequency if the processor is operating below rated power, temperature, and current specification limits. It can be engaged with any number of cores or logical processors enabled and active. We critically evaluate these features using the High Performance Computing Challenge (HPCC) benchmarks, NAS Parallel Benchmarks (NPB), and four full-scale scientific applications. We compare and contrast the results of a cluster based on the Xeon 5560 with an SGI® Altix® ICE 8200EX cluster of quad-core Intel® Xeon® 5472 Processors ("Xeon 5472" from here on) and another cluster of Intel® Xeon® 5462 Processors ("Xeon 5462"; the Xeon 5400 Series Processors are previous-generation quad-core Intel processors and were code named Harpertown).
... Other methods are also possible, such as multilevel methods, whether used as stand-alone methods substituting for projection methods, as proposed in [24], coupled with projection methods as in [25], or coupled with Krylov methods as preconditioners as proposed in [26]. These methods have been successfully used in a parallel context in compressible flows in [27,28]. Apart from the fact that multilevel methods may be cumbersome in dealing with adaptive or moving meshes, they involve a much higher implementation effort. ...
Article
This paper presents a parallel implementation of fractional solvers for the incompressible Navier–Stokes equations using an algebraic approach. Under this framework, predictor–corrector and incremental projection schemes are seen as sub-classes of the same class, making their differences and similarities apparent. An additional advantage of this approach is to set a common basis for a parallelization strategy, which can be extended to other split techniques or to compressible flows. The predictor–corrector scheme consists in solving the momentum equation and a modified "continuity" equation (namely a simple iteration for the pressure Schur complement) consecutively in order to converge to the monolithic solution, thus avoiding fractional errors. On the other hand, the incremental projection scheme solves only one iteration of the predictor–corrector per time step and adds a correction equation to fulfill the mass conservation. As shown in the paper, these two schemes are very well suited for massively parallel implementation. In fact, when compared with monolithic schemes, simpler solvers and preconditioners can be used to solve the non-symmetric momentum equations (GMRES, Bi-CGSTAB) and to solve the symmetric continuity equation (CG, Deflated CG). This gives good speedup properties of the algorithm. The implementation of the mesh partitioning technique is presented, as well as the parallel performances and speedups for thousands of processors.
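A hedged sketch of the algebraic splitting the abstract describes, with discrete momentum matrix A, gradient G, divergence D, and notation ours: each predictor-corrector pass solves

    \[
      A\,\hat{u}^{k+1} = f - G\,p^{k}, \qquad
      (D A^{-1} G)\,\delta p = D\,\hat{u}^{k+1}, \qquad
      p^{k+1} = p^{k} + \delta p
    \]

iterating to the monolithic solution, with the pressure Schur complement D A^{-1} G replaced by a cheaper symmetric approximation so that CG-type solvers apply; the incremental projection variant performs one such pass per time step plus an end-of-step velocity correction to restore mass conservation.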
... This marked increase in the level of high-performance computing now offers both unprecedented capability and capacity to the aerodynamic simulation community [2]. These systems are capable of aerodynamic simulations with 10^8-10^10 degrees of freedom, offering ever increasing physical fidelity [3,4]. While such extremely large "capability" simulations are becoming commonplace, the engineering community has focused on the enormous capacity of these systems through an increasing reliance on parametric, trade and optimization studies. ...
... These investigations use the parallel multi-level Cartesian Euler solver developed in references [5] and [6] to produce the aerodynamic data. This simulation package has been used extensively on large shared and distributed memory systems and has very good parallel scalability [2,3,7]. The robustness and automation of this simulation package has led to its wide adoption for producing aerodynamic databases in support of engineering analysis and design [8,9,10,11]. ...
Conference Paper
This work examines the level of discretization error in simulation-based aerodynamic databases and introduces strategies for error control. Simulations are performed using a parallel, multi-level Euler solver on embedded-boundary Cartesian meshes. Discretization errors in user-selected outputs are estimated using the method of adjoint-weighted residuals, and we use adaptive mesh refinement to reduce these errors to specified tolerances. Using this framework, we examine the behavior of discretization error throughout a token database computed for a NACA 0012 airfoil consisting of 120 cases. We compare the cost and accuracy of two approaches for aerodynamic database generation. In the first approach, mesh adaptation is used to compute all cases in the database to a prescribed level of accuracy. The second approach conducts all simulations using the same computational mesh without adaptation. We quantitatively assess the error landscape and computational costs in both databases. This investigation highlights sensitivities of the database under a variety of conditions. The presence of transonic shocks or the stiffness in the governing equations near the incompressible limit are shown to dramatically increase discretization error, requiring additional mesh resolution to control. Results show that such pathologies lead to error levels that vary by over a factor of 40 when using a fixed mesh throughout the database. Alternatively, controlling this sensitivity through mesh adaptation leads to mesh sizes which span two orders of magnitude. We propose strategies to minimize simulation cost in sensitive regions and discuss the role of error estimation in database quality.
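The adjoint-weighted residual estimate named above has a standard form (notation ours): with the coarse solution u_H injected into an embedded finer space h, the fine-space residual R_h, and a discrete adjoint psi_h for the output J,

    \[
      \left(\frac{\partial R_h}{\partial u_h}\right)^{T}\!\psi_h
          = \left(\frac{\partial J_h}{\partial u_h}\right)^{T},
      \qquad
      J_h(u_h) - J_h(u_H^h) \approx -\,\psi_h^{T} R_h(u_H^h)
    \]

The cell-wise contributions to this inner product both correct the computed output and flag the cells whose refinement most reduces the remaining error, which is the mechanism the adaptive approach in this abstract uses to hit its prescribed tolerances.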