Unleashing the Full Power of NVIDIA GPUs with Gridware Cluster Scheduler: Transforming HPC and AI Workflows

March 30, 2025
Maximize your NVIDIA GPU resources effortlessly with Gridware Cluster Scheduler, designed for intelligent optimization, seamless monitoring, and simplified workload management. As a proud partner of NVIDIA’s Inception Program, Gridware ensures your GPUs achieve peak performance, reducing complexity and accelerating innovation. Empower your HPC and AI environments with unmatched efficiency and ease.

In the rapidly evolving world of high-performance computing (HPC) and artificial intelligence (AI), the efficient utilization of resources has never been more critical. GPUs, in particular, have become the backbone of compute-intensive tasks, from complex simulations to deep learning. However, maximizing GPU usage while simplifying management remains a substantial challenge for many organizations.

The Gridware Cluster Scheduler is an innovative solution designed to keep your GPUs 100% busy with compute jobs, streamline NVIDIA GPU management, and provide comprehensive monitoring and accounting. With support for ARM architectures, including NVIDIA’s cutting-edge Grace Hopper and Grace Blackwell Superchips, Gridware is poised to revolutionize your HPC and AI environments.

HPC Gridware is excited to announce its membership in the NVIDIA Inception Program.


Bridging the Gap Between HPC and AI with Optimal GPU Utilization

As AI becomes an integral part of HPC workloads, the demand for GPU resources has skyrocketed. Deep learning models, neural networks, and data analytics require immense computational power. The Gridware Cluster Scheduler ensures that every GPU in your cluster is utilized to its fullest potential, driving significant performance improvements and cost efficiencies for both HPC and AI applications.

Gridware’s sophisticated batch queueing system intelligently manages job assignments to GPUs, ensuring that these valuable resources are never idle. By analyzing job requirements and resource availability, the scheduler:

Optimizes GPU Utilization: Allocates jobs to keep GPUs constantly engaged.

Reduces Wait Times: Minimizes queue times for GPU-intensive tasks by setting resource-based priorities (see the sketch below).

Enhances Performance: Delivers faster computation times for complex workloads through sophisticated job distribution policies, keeping workloads closely grouped to reduce network latency.

This intelligent scheduling is crucial for AI workloads, where training and inference tasks can be time-sensitive and resource-intensive.
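
As a concrete illustration of resource-based priorities, the following sketch assumes a Grid Engine style complex configuration (edited with qconf -mc) in which the NVIDIA_GPUS consumable carries an urgency value, so that jobs requesting GPUs gain dispatch priority. The column layout is the standard complex format; the shortcut and urgency value are illustrative assumptions, not a verbatim Gridware configuration.

# Illustrative complex entry: the urgency column raises the dispatch
# priority of jobs requesting the NVIDIA_GPUS consumable (qconf -mc).
#name         shortcut  type   relop  requestable  consumable  default  urgency
NVIDIA_GPUS   gpu       RSMAP  <=     YES          YES         0        1000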

Streamlining Setup with One-Line Prolog and Epilog Scripts

Configuring GPUs for distributed computing can be a daunting task. Gridware simplifies this process dramatically:

Easy Configuration: Integrate GPU support by adding a single-line prolog and epilog script to the cluster queues associated with GPUs. Environment variable setup, per-job GPU accounting, and GPU testing are all handled by qgpu; the setup is as simple as calling qgpu prolog and qgpu epilog (see the sketch after this list).

One-Line Load Sensor: Configure resource monitoring of all GPUs on a host with an optimized one-line load-sensor configuration (qgpu loadsensor), which manages its installation and resource configuration automatically.

Automatic Environment Setup: The scheduler sets the correct NVIDIA environment variables for the selected GPUs, such as NVIDIA_VISIBLE_DEVICES and CUDA_VISIBLE_DEVICES. No additional configuration changes are needed; just request the number of GPUs you need.
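
A minimal sketch of what this looks like in practice, assuming Grid Engine style queue and host configuration attributes (the qgpu path shown matches the example host later in this article and will differ per installation):

# GPU queue configuration (qconf -mq gpu.q): prolog and epilog are each
# a single call to qgpu.
prolog       /opt/gcs/bin/lx-arm64/qgpu prolog
epilog       /opt/gcs/bin/lx-arm64/qgpu epilog

# Host or global configuration (qconf -mconf gpu01): one load-sensor line
# reports all GPU metrics of the host.
load_sensor  /opt/gcs/bin/lx-arm64/qgpu loadsensor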

Enhancing Productivity and Reducing Errors

By automating the configuration process, Gridware:

Reduces Manual Effort: Eliminates the need for complex scripts and manual setup.

Minimizes Errors: Prevents misconfigurations that can lead to performance issues.

Accelerates Deployment: Gets your HPC and AI workloads running faster than ever.

Visibility into GPU Usage, Performance, and Health

Gridware provides a robust monitoring solution that offers deep insights into your GPU resources:

GPU Types and Specifications: Easily identify the exact GPU models in your cluster and their hardware resources.

Temperature Monitoring: Keep an eye on GPU temperatures to prevent overheating and ensure longevity.

CPU Binding Information: Understand how GPUs and CPUs are paired to optimize performance (see the query example after this list).
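
For example, individual GPU load values such as the model, temperature, and CPU affinity reported for the Grace Hopper host later in this article can be queried directly; the sketch below assumes the standard qhost -F option for selecting load values by name:

# Display selected per-GPU load values for host gpu01 (resource names as
# reported by the qgpu load sensor).
qhost -h gpu01 -F nvidia_gpu_0_model,nvidia_gpu_0_temperature_c,nvidia_gpu_0_cpu_affinity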

Understanding how resources are consumed is essential for optimization and cost management. Gridware’s new per-job GPU accounting provides:

Detailed Metrics: Gain insights into power usage, memory consumption, error counts, potential slowdowns and performance for each job.

Accounting Integration: View each job’s individual GPU consumption directly in the qacct accounting output, or post-process the raw JSON-based accounting records in your favorite tool chain (see the sketch after this list).

Resource Optimization: Use data to fine-tune workloads for better efficiency.
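
As a sketch of such post-processing, assuming only that each record in the JSONL accounting file is one JSON object per line (the file path and the field name used here are illustrative assumptions, not documented names), a jq one-liner could extract the record of a single job:

# Illustrative only: pull the accounting record of job 2111 from a JSONL
# accounting file (path and field name are assumptions).
jq -c 'select(.job_number == 2111)' $SGE_ROOT/default/common/accounting.jsonl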

Facilitating Chargeback and Reporting

For organizations that require billing or resource chargebacks, transparent accounting is invaluable:

Accurate Cost Allocation: Assign costs based on actual GPU usage.

Reporting and Compliance: Generate reports for stakeholders with precise resource utilization data. 

Furthermore, to enable advanced workload management and automation (such as setting alarms based on real-time GPU load), qtelemetry can automatically export key metrics to Grafana. A preview of this integrated feature is available in the upcoming GCS 9.0.5 release. By having GPU load and other critical data readily accessible in Grafana, you can more efficiently diagnose performance bottlenecks, proactively address potential issues, and optimize your environment for maximum operational efficiency.

Automated Health Checks for Reliability

Reliability is paramount in HPC and AI environments. Gridware enhances system stability by:

Running Automated Tests: Optionally perform health checks on GPUs before job execution.

Preventing Downtime: Identify and address GPU issues proactively.

Ensuring Data Integrity: Keep critical computations running smoothly without unexpected failures.

Versatile Support for Diverse Workloads

Gridware is designed to support a wide range of job configurations:

Single-Node GPU Jobs: Ideal for tasks that require dedicated GPU access.

Multi-Job per Node: Efficiently run multiple jobs on a single node, exploiting the full set of GPUs installed there. Each job can request any number of GPUs, and NVIDIA MIG (Multi-Instance GPU) can be configured to share a single GPU among different workloads and users at the same time.

Multi-Node GPU Jobs (MPI): Scale complex HPC and AI applications across multiple nodes with multiple GPUs (see the submission sketches after this list).
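
The submission commands below sketch these three cases, assuming the gpu.q cluster queue and NVIDIA_GPUS consumable used in the GROMACS example later in this article, plus a parallel environment named mpi; queue, PE, and script names are illustrative and site specific.

# Single-node job with one dedicated GPU
qsub -q gpu.q -l NVIDIA_GPUS=1 ./train.sh

# Several independent jobs sharing the GPUs of one node (two GPUs each)
qsub -q gpu.q -l NVIDIA_GPUS=2 ./inference_a.sh
qsub -q gpu.q -l NVIDIA_GPUS=2 ./inference_b.sh

# Multi-node MPI job: 4 slots in an assumed "mpi" parallel environment,
# requesting GPUs alongside (per-host GPU counts depend on the consumable
# and parallel environment configuration)
qsub -q gpu.q -pe mpi 4 -l NVIDIA_GPUS=1 ./mpi_job.sh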

Embracing the Future with ARM and NVIDIA Grace Hopper and Grace Blackwell Support

Gridware Cluster Scheduler extends its capabilities to support ARM architectures, including:

NVIDIA Grace Hopper Superchip: Leverage the combined power of NVIDIA GPUs and ARM-based CPUs. Gridware Cluster Scheduler fully supports NVIDIA’s new ARM architecture for CPU, GPU, and combined CPU/GPU applications.

Enhanced AI and HPC Performance: Benefit from the superchip’s high memory bandwidth and energy efficiency.

Future-Proofing: Stay ahead with support for the latest computing innovations.

Gridware Cluster Scheduler supports any application running on the new NVIDIA ARM platform, including well-known, demanding HPC applications such as the molecular dynamics simulation tool GROMACS.

We are immensely grateful to NVIDIA for their continuous support of HPC Gridware as part of the NVIDIA Inception Program.

Example: Running GROMACS on NVIDIA Grace Hopper

Installing the Gridware Cluster Scheduler on the Grace Hopper platform is as straightforward as its deployment on AMD64 architectures. The key distinction lies in selecting our ARM-based packages. It’s important to note that the Gridware Cluster Scheduler is specifically engineered to support heterogeneous architectures within a unified control plane. This capability enables seamless integration of new ARM-based compute nodes alongside traditional AMD EPYC or Intel Xeon compute nodes, ensuring streamlined management under one cohesive system.

The qhost output for a system configured in this way appears as follows:

HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
TRY-YYYYY-gpu01         lx-arm64       72    1   72   72  0.04  573.3G   23.7G     0.0     0.0
   hl:arch=lx-arm64
   hl:num_proc=72.000000
   hl:mem_total=573.304G
   hl:swap_total=0.000
   hl:virtual_total=573.304G
   hl:load_avg=0.040000
   hl:load_short=0.060000
   hl:load_medium=0.040000
   hl:load_long=0.070000
   hl:mem_free=549.594G
   hl:swap_free=0.000
   hl:virtual_free=549.594G
   hl:mem_used=23.710G
   hl:swap_used=0.000
   hl:virtual_used=23.710G
   hl:cpu=0.200000
   hl:m_topology=SCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
   hl:m_topology_inuse=SCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
   hl:m_socket=1.000000
   hl:m_core=72.000000
   hl:m_thread=72.000000
   hl:np_load_avg=0.000556
   hl:np_load_short=0.000833
   hl:np_load_medium=0.000556
   hl:np_load_long=0.000972
   hl:nvidia_gpu_count=1
   hl:nvidia_gpu_0_dcgm_supported=1.000000
   hl:nvidia_gpu_0_uuid=GPU-0673bf0c-bc6a-0a75-4445-2f71d9c423d0
   hl:nvidia_gpu_0_brand=NVIDIA
   hl:nvidia_gpu_0_model=NVIDIA GH200 480GB
   hl:nvidia_gpu_0_serial_number=1654223016010
   hl:nvidia_gpu_0_vbios=96.00.7E.00.02
   hl:nvidia_gpu_0_inforom_image_version=G530.0206.00.02
   hl:nvidia_gpu_0_bus_id=00000009:01:00.0
   hl:nvidia_gpu_0_bar1=3.999K
   hl:nvidia_gpu_0_framebuffer_memory=0.000
   hl:nvidia_gpu_0_bandwidth_mbs=1969.000000
   hl:nvidia_gpu_0_power_w=900.000000
   hl:nvidia_gpu_0_cpu_affinity={0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71}
   hl:nvidia_gpu_0_p2p_available=None
   hl:nvidia_gpu_0_power_draw_w=81.677000
   hl:nvidia_gpu_0_temperature_c=26.000000
   hl:nvidia_gpu_0_utilization_gpu=0.000000
   hl:nvidia_gpu_0_utilization_memory=0.000000
   hl:nvidia_gpu_0_utilization_encoder=0.000000
   hl:nvidia_gpu_0_clocks_sm_mhz=345.000000
   hl:nvidia_gpu_0_clocks_memory_mhz=2619.000000
   hl:nvidia_gpu_0_pci_bus_id=00000009:01:00.0
   hl:nvidia_gpu_0_performance_state=P0
   hl:nvidia_gpu_0_mig_uuids=None
   hc:NVIDIA_GPUS=1.000000

This system automatically detects the hardware topology and current usage, including sockets, cores, and various memory types. After an effortless setup of the cluster queue and GPU resources based on RSMAP, whether automatic (qgpu install) or manual, the user only needs to request the GPU cluster queue, specify the number of GPUs, and provide the application start script to submit batch jobs to the system.
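
For the manual path, the sketch below shows what the per-host GPU assignment could look like, assuming NVIDIA_GPUS is defined as an RSMAP consumable; qgpu install performs an equivalent configuration automatically, so the exact entries it writes may differ:

# Illustrative manual step: assign GPU id 0 to the execution host via an
# RSMAP entry (qconf -me gpu01); this matches the resource_map value shown
# in the qstat output further below.
complex_values   NVIDIA_GPUS=1(0)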

For instance, to submit a single GROMACS job utilizing one Hopper GPU, you can execute:

qsub -l NVIDIA_GPUS=1 -q gpu.q ./gromacs.sh
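
The gromacs.sh start script itself is not part of this article; a minimal sketch of what it could look like (input file name and mdrun options are illustrative assumptions) is:

#!/bin/bash
# Illustrative GROMACS start script. The scheduler has already exported
# NVIDIA_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES for the granted GPU, so
# mdrun simply uses the visible device; NSLOTS is set by the scheduler.
gmx mdrun -s benchmark.tpr -nb gpu -ntomp ${NSLOTS:-1}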

With this method, numerous jobs can be submitted simultaneously. These are then queued and scheduled to the compute nodes as GPUs become available.

Users can conveniently track job statuses using the qstat command line, which provides detailed information about the selected compute node and GPU during runtime.

qstat -j 2111
==============================================================
job_number:                 2111
exec_file:                  job_scripts/2111
submission_time:            2025-01-12 15:20:50.775056
submit_cmd_line:            qsub -q gpu.q -l NVIDIA_GPUS=1 ./gromacs.sh
effective_submit_cmd_line:  qsub -A sge -binding NONE -M nvidia@gpu01.nvidialaunchpad.com -N gromacs.sh -pty no -r yes -scope global -hard -l NVIDIA_GPUS=1 -hard -q gpu.q ./gromacs.sh
owner:                      nvidia
uid:                        1000
group:                      nvidia
gid:                        1000
groups:                     4(adm), 24(cdrom), 27(sudo), 30(dip), 46(plugdev), 110(lxd), 997(docker), 1000(nvidia)
sge_o_home:                 /home/nvidia
sge_o_log_name:             nvidia
sge_o_path:                 /opt/gcs/bin/lx-arm64:/usr/local/gromacs/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /home/nvidia
sge_o_host:                 gpu01
account:                    sge
hard_resource_list:         NVIDIA_GPUS=1
hard_queue_list:            gpu.q
mail_list:                  nvidia@gpu01.nvidialaunchpad.com
notify:                     FALSE
job_name:                   gromacs.sh
jobshare:                   0
env_list:                   
script_file:                ./gromacs.sh
department:                 defaultdepartment
binding:                    NONE
usage                  1:   wallclock=00:00:20.033818, cpu=00:00:20.019999, mem=285.83958 GBs, io=0.11187, vmem=19.117G, maxvmem=19.117G, rss=280.188M, maxrss=280.188M
binding                1:   NONE
resource_map           1:   NVIDIA_GPUS=gpu01.nvidialaunchpad.com=(0)

Once a job is completed, all accounting information is recorded for further analysis. Gridware Cluster Scheduler supports a new, extensible JSONL-based accounting format for GPU jobs, while also maintaining backward compatibility with the traditional “SGE” format.

In the GROMACS example, the qacct -j <jobID> output displays crucial metrics such as:

Job submission details, such as the exact qsub command line

Job runtime information: Cluster queue selection, job submit, start, and stop times

Detailed resource usage: Exact CPU, memory, and IO usage

NVIDIA GPU per-job accounting information: energy consumption, average/max/min power usage, GPU memory usage, clock speeds, memory utilization, PCIe utilization, ECC counts, and slowdown occurrences, thanks to our new extensible accounting infrastructure

These insights facilitate better job efficiency evaluations and enable potential cluster optimizations, such as reallocating less GPU-intensive jobs to more cost-effective GPUs or exporting raw JSON data for further processing in other analytical frameworks.

On the provided NVIDIA LaunchPad platform, hundreds of GROMACS jobs were submitted simultaneously, ensuring sustained system utilization without the need for manual intervention. The NVIDIA-provided Grafana dashboards were instrumental in monitoring host utilization over time, maximizing the system’s performance and resource usage.

To further improve hardware efficiency on the nodes, NVIDIA’s MIG (Multi-Instance GPU) can be set up to partition a GPU into multiple instances. On MIG-enabled hosts, the MIG UUIDs are automatically reported.
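
As a sketch of that setup, assuming a MIG-capable GPU administered with the standard nvidia-smi tooling (the instance profile is a placeholder that depends on the GPU model):

# Enable MIG mode on GPU 0 (takes effect after a GPU reset or reboot)
sudo nvidia-smi -i 0 -mig 1

# Create GPU instances plus matching compute instances from a profile
# supported by the GPU model (<profile> is a placeholder, e.g. a 1g slice)
sudo nvidia-smi mig -i 0 -cgi <profile>,<profile> -C

# List the resulting MIG devices; their UUIDs then show up in the
# nvidia_gpu_0_mig_uuids load value reported by the qgpu load sensor
nvidia-smi -L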

Example: Running NVIDIA Containers with Enroot on Gridware Cluster Scheduler

NVIDIA’s enroot container runtime is renowned for its simplicity in HPC cluster environments, providing seamless, out-of-the-box GPU support without the need for complex configuration. Integrating enroot into the Gridware Cluster Scheduler is straightforward and fully supported by us, enabling you to effortlessly run containers such as NVIDIA’s Clara Parabricks. This streamlined setup ensures GPU-accelerated workloads can be deployed quickly and efficiently, maximizing performance and resource utilization within your HPC environment.
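
As a sketch of such an integration, assuming the container image has already been imported on a shared file system with enroot import (image reference, tag, and script names are illustrative assumptions), a batch job could start the container like this:

# One-time preparation (illustrative image reference from the NGC catalog;
# the .sqsh file name follows from the import)
enroot import docker://nvcr.io#nvidia/clara/clara-parabricks:<tag>
enroot create --name parabricks nvidia+clara+clara-parabricks+<tag>.sqsh

# Inside a batch script submitted with, e.g., qsub -q gpu.q -l NVIDIA_GPUS=1:
# the granted GPU is already exposed via NVIDIA_VISIBLE_DEVICES
enroot start --rw parabricks ./run_parabricks_workflow.sh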

Conclusion: Upgrading Your HPC and AI Environments with Gridware

The Gridware Cluster Scheduler offers a comprehensive solution to the challenges of GPU workload management in modern HPC and AI infrastructures. By keeping GPUs fully utilized, simplifying configurations using our new qgpu tool, and providing deep insights into resource consumption, Gridware empowers organizations to:

Increase Efficiency: Maximize the return on investment in GPU hardware.

Accelerate Innovation: Speed up computational tasks and AI model training.

Simplify Management: Reduce complexity and overhead in resource administration.

Whether you’re running complex simulations, training deep learning models, or managing large-scale data analytics, Gridware provides the tools and support you need to succeed.

Discover More

To learn more about how Gridware can revolutionize your HPC and AI workloads, contact us for a personalized consultation at dgruber@hpc-gridware.com.


© 2023 NVIDIA. NVIDIA, the NVIDIA logo, NVIDIA Grace Hopper, NVIDIA Grace Blackwell, and NVIDIA Clara are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries.