Open Cluster Scheduler has introduced a powerful and flexible resource type known as Resource Map (RSMAP), designed to manage and assign specific instances of resources like GPUs. Initially integrated into Univa Grid Engine Open Core, RSMAP is now also available in the latest version of Cluster Scheduler.
Introduction
Unlike conventional resource types such as `int`, `double`, and `memory` that merely allocate an amount of a particular resource, RSMAP allocates specific instances of resources (e.g., distinct GPU numbers). This novel approach has several key advantages:
- Collision Prevention: Ensures exclusive usage of resources by assigning specific instances, thus avoiding conflicts with other jobs.
- Enhanced Monitoring & Accounting: Facilitates precise tracking and reporting of actual resource usage.
RSMAPs can be employed to manage both host-level and global resources.
Host-Level Resources
- Port numbers
- GPUs
- NUMA resources (for core and memory binding)
- Network devices
- Storage devices
Global Resources
- IP addresses and DNS names
- License Servers
- Port numbers
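A global RSMAP is initialized on the pseudo host "global" rather than on an individual execution host. The following sketch assumes a port-pool resource named `PORT` has already been added to the complex configuration (the name and id values here are hypothetical):

```
# Open the configuration of the pseudo host "global" for editing
qconf -me global
# then set, inside the editor, e.g.:
# complex_values PORT=2(10000 10001)
```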
Example Configuration for GPU Management at the Host Level
This section illustrates how to define and use an RSMAP resource for GPU management.
Resource Definition in the Resource Configuration ("complexes")
To create a new GPU resource type based on RSMAP, open the resource complex configuration with the following command:
qconf -mc
This opens an editor containing the current resource definitions:
#name shortcut type relop requestable consumable default urgency
#-------------------------------------------------------------------
arch a RESTRING == YES NO NONE 0
...
Add the following line:
GPU gpu RSMAP <= YES YES NONE 0
This defines a resource named `GPU` with the shortcut `gpu` and type `RSMAP`. The relational operator `<=` means a job is only scheduled to a host when the requested amount is less than or equal to the amount available there. The resource is requestable and consumable, with the default and urgency values set to `NONE` and `0`, respectively. Save and close the editor.
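To confirm the entry was saved, the complex configuration can be listed again as a read-only check (the grep pattern assumes the resource name `GPU` chosen above):

```
qconf -sc | grep GPU
```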
Resource Initialization in the Host Configuration
To assign values to the resources on a specific host, modify the host configuration.
qconf -sel
...
qconf -me <hostname>
The first command lists the available execution hosts; the second opens the configuration of the chosen host for editing.
Assuming the host has 4 GPUs, update the `complex_values` entry as follows:
complex_values GPU=4(0 1 2 3)
This indicates the host has 4 GPU instances with IDs 0, 1, 2, and 3. Verify resource availability with:
qhost -F GPU
...
Host Resource(s): hc:GPU=4.000000
Submitting a Job Using a GPU Resource
With the administrative setup done, users can now request GPU resources for their jobs.
Job Script Example
The following job script demonstrates the GPU request:
#!/bin/bash
env | grep SGE_HGR
Job Submission
Submit the job while requesting 2 GPUs:
qsub -l GPU=2 ./job.sh
Job Output
The job output should display the granted GPU IDs:
SGE_HGR_GPU=0 1
For NVIDIA jobs, convert the GPU IDs to a comma-separated format and set the `CUDA_VISIBLE_DEVICES` environment variable:
export CUDA_VISIBLE_DEVICES=$(echo $SGE_HGR_GPU | tr ' ' ',')
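The conversion itself can be sanity-checked in a plain shell; the value of `SGE_HGR_GPU` below is simulated (inside a real job the scheduler sets it for you):

```shell
#!/bin/sh
# Simulate the value the scheduler would grant for a 2-GPU request
SGE_HGR_GPU="0 1"
# tr turns the space-separated id list into CUDA's comma-separated form
export CUDA_VISIBLE_DEVICES=$(echo "$SGE_HGR_GPU" | tr ' ' ',')
echo "$CUDA_VISIBLE_DEVICES"   # prints 0,1
```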
That’s it!
Conclusion
Utilizing the RSMAP resource type for GPU management in Open Cluster Scheduler ensures efficient resource allocation and minimizes conflicts, enhancing both performance and resource tracking. Additionally, HPC Gridware is set to release a new GPU package with streamlined configuration, improved GPU accounting, and automated environment variable management, significantly easing GPU cluster management.
Stay tuned for further updates and advanced features to make your computing experience more powerful and user-friendly.