Example of the RESTful API: (left) input query for the memory usage of compute-5-11; (right) API result

Source publication

The visual interface of Nagios Core where hosts are simply listed on a...

The situation on March 21, 2019, for the 467-node Quanah cluster, HPC...

Main interface of our HiperView visualization: a top panel, b summary...

Customized color scales and critical thresholds for different HPC...

HiperView: real-time monitoring of dynamic behaviors of high-performance computing centers

Article

Oct 2021

This paper presents HiperView, a visual analytics framework monitoring and characterizing the health status of high-performance computing systems through a RESTful interface in real time. The primary objectives of this visual analytical system are: (1) to provide a graphical interface for tracking the health status of a large number of data center...

A Scalable, Distributed Monitoring Framework for HPC Clusters Using Redfish-Nagios Integration

Preprint

Mar 2023

Current monitoring tools for high-performance computing (HPC) systems are often inefficient in terms of scalability and interfacing with modern data center management APIs. This inefficiency leads to a lack of effective management of the infrastructure of modern data centers. Nagios is a widely used industry-standard tool for data center infrastructure monitoring, which mainly includes monitoring of nodes and associated hardware and software components. However, current Nagios monitoring has special requirements that introduce several limitations. First, significant human effort is needed to configure monitored nodes in the Nagios server. Second, the Nagios Remote Plugin Executor and the Nagios Service Check Acceptor are required on the Nagios server and each monitored node for active and passive monitoring, respectively. Third, Nagios monitoring also requires monitoring-specific agents on each monitored node. These shortcomings stem from the in-band nature of Nagios’ implementation. To overcome these limitations, we introduced Redfish-Nagios, a scalable out-of-band monitoring tool for modern HPC systems. It integrates the Nagios server with the out-of-band Distributed Management Task Force’s Redfish telemetry model, which is implemented in the baseboard management controller of the nodes. This integration eliminates the need for any agent, plugin, hardware component, or configuration on the monitored nodes. It is potentially a paradigm shift in Nagios-based monitoring for two reasons. First, it simplifies communication between the Nagios server and monitored nodes. Second, it saves computational costs by removing the need to run complex Nagios-native protocols and agents on the monitored nodes. The Redfish-Nagios integration methodology enables the monitoring of next-generation HPC systems using the scalable and modern Redfish telemetry model and interface.
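The out-of-band idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the monitoring server has already fetched a node's Redfish Thermal resource from the baseboard management controller as JSON, and it maps the readings to the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The property names follow the DMTF Redfish Thermal schema; the function name `check_thermal` and the sample values are hypothetical.

```python
import json

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def check_thermal(thermal):
    """Map a parsed Redfish Thermal payload to (nagios_code, message).

    Hypothetical helper: it only interprets JSON that was already fetched
    from the BMC, so nothing has to run on the monitored node itself.
    """
    worst = OK
    notes = []
    for sensor in thermal.get("Temperatures", []):
        reading = sensor.get("ReadingCelsius")
        if reading is None:
            continue  # sensor absent or currently unreadable
        crit = sensor.get("UpperThresholdCritical")
        warn = sensor.get("UpperThresholdNonCritical")
        if crit is not None and reading >= crit:
            state = CRITICAL
        elif warn is not None and reading >= warn:
            state = WARNING
        else:
            state = OK
        worst = max(worst, state)
        notes.append(f"{sensor.get('Name', 'sensor')}={reading}C")
    if not notes:
        return UNKNOWN, "UNKNOWN - no temperature readings reported"
    return worst, f"{LABELS[worst]} - " + ", ".join(notes)


# Hypothetical payload with made-up readings, shaped like a Thermal resource.
sample = json.loads("""
{
  "Temperatures": [
    {"Name": "CPU1 Temp", "ReadingCelsius": 88,
     "UpperThresholdNonCritical": 85, "UpperThresholdCritical": 95},
    {"Name": "Inlet Temp", "ReadingCelsius": 24,
     "UpperThresholdNonCritical": 42, "UpperThresholdCritical": 47}
  ]
}
""")

code, message = check_thermal(sample)
print(code, message)
```

Keeping the check as a pure function over the parsed payload separates the threshold logic from the HTTP transport and authentication, which differ between BMC vendors and are left out here.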

HPCClusterScape: Increasing Transparency and Efficiency of Shared High-Performance Computing Clusters for Large-scale AI Models

Conference Paper

Oct 2023

JobViewer: Graph-based Visualization for Monitoring High-Performance Computing System

Conference Paper

Dec 2022

Analysing Supercomputer Nodes Behaviour with the Latent Representation of Deep Learning Models

Chapter

Aug 2022
