Preparations for the first version of Cluster Scheduler

April 1, 2024
We are excited to announce that HPC-Gridware will be continuing the development of the renowned “Grid Engine” software originally released by Sun Microsystems. Rebranded as “Open Cluster Scheduler,” this project will remain open-source under the SISSL v2 license, with the source code available on GitHub soon.

In addition to maintaining the open-source version, we will introduce enhanced functionality under the name “Gridware Cluster Scheduler.” This iteration will come with commercial support and consulting services. The new features will be released under the Apache License v2.0.

Choosing the Code Base

After careful consideration, we have selected the Univa code base as our foundation. Alternatives such as Open Grid Scheduler and Son of Grid Engine are available, but the Univa version aligns best with our goal of making significant improvements to the code base. Our inaugural release, version 9.0.0, will mark a modernization effort driven by the ongoing growth in HPC and AI workloads, which demand faster, more flexible computing clusters.

Advances in CPU and GPU design call for more sophisticated decision-making algorithms in schedulers. We aim to tackle these challenges, ensuring the Cluster Scheduler code base can effectively support modern multi-core CPUs, GPUs, NPUs, and FPGAs.

Preparing for the Future

Our immediate focus is on laying a robust foundation for the future of cluster schedulers. We’ll be:

  • Converting the code base to C++ and CMake.
  • Supporting modern development environments (e.g. CLion).
  • Updating internal data stores and threading mechanisms.
  • Enhancing concurrent execution within the master service for better thread parallelism.

These updates will be part of the initial “Open Cluster Scheduler” release. Key enhancements planned for version 9.0.0 include:

  • RSMAPs for host-specific resource management, such as GPUs and other accelerators.
  • Integration with the hwloc library for hardware topology and architecture analysis.
  • Support for diverse computer architectures such as OpenPower (lx-ppc64le), RISC-V (lx-riscv64), Apple’s ARM-based CPUs (darwin-arm64), and FreeBSD for Intel/AMD64 (fbsd-amd64).
  • Enhanced online usage reporting and customizable accounting values.
  • Implementation of request limits to guard against denial-of-service attacks.
  • Container-based builds.
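
Of the enhancements listed above, the resource map (RSMAP) idea is perhaps the easiest to picture in code: instead of counting GPUs on a host, the scheduler tracks individual device identifiers and grants specific ones to each job. The following Python sketch is only a toy model of that idea. The class name, method names, device IDs, and job numbers are all hypothetical, and it does not reflect the scheduler's actual implementation or configuration syntax.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HostResourceMap:
    """Toy per-host resource map: identified units that jobs hold exclusively."""

    free_ids: list[str]
    granted: dict[int, list[str]] = field(default_factory=dict)  # job id -> unit ids

    def grant(self, job_id: int, amount: int) -> Optional[list[str]]:
        """Give `amount` specific unit ids to a job, or None if that is not possible."""
        if amount <= 0 or job_id in self.granted or amount > len(self.free_ids):
            return None
        ids = self.free_ids[:amount]
        self.free_ids = self.free_ids[amount:]
        self.granted[job_id] = ids
        return ids

    def release(self, job_id: int) -> None:
        """Return a finished job's units to the free pool."""
        self.free_ids.extend(self.granted.pop(job_id, []))


# Hypothetical host with four GPUs identified as "0" through "3".
gpus = HostResourceMap(free_ids=["0", "1", "2", "3"])
first = gpus.grant(job_id=101, amount=2)
second = gpus.grant(job_id=102, amount=3)  # refused: only two units are still free
gpus.release(101)
third = gpus.grant(job_id=102, amount=3)   # succeeds once job 101 has finished
```

The point of the sketch is that each job learns exactly which devices it owns, so two jobs on the same host cannot be pointed at the same GPU. A plain counter would only say how many devices are in use, not which ones.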

Streamlining the System

To further improve usability and maintainability, we plan to remove outdated or rarely used components:

  • Discontinuing the old Motif-based GUI.
  • Removing qtcsh, in line with other commercial schedulers.
  • Phasing out complex components like the JGDI interface and its associated services.
  • Temporarily suspending CSP mode due to low user adoption.

While these changes might seem disruptive, they are essential for modernizing the system. We recognize the need for alternatives to these functionalities and commit to providing replacements in the future.

Join the Conversation

We invite you to share your questions, suggestions, or interest in contributing to this project. Feel free to reach out to us.

Stay tuned for more updates as we embark on this exciting journey to revolutionize cluster scheduling.