In today’s digital landscape, characterized by rapid growth in data and increasing complexity, an efficient workload management system (WLM) is more than just a tool — it’s the backbone of modern IT infrastructures. The core task of the Gridware Cluster Scheduler is to manage workloads efficiently, and scalability is one of the most critical aspects of the system. With the increasing number of cores in today’s CPUs, it is essential to have a solution that can leverage this growth to meet the ever-evolving demands of businesses.
In this blog post, we will explore the recent architectural changes made to our WLM. We’ll discuss the challenges that prompted these adjustments and delve into the new technologies and approaches that have been implemented to enhance performance, scalability, and user experience. Join us on this exciting journey through the world of system architecture and discover how innovations in software development are revolutionizing our workflows.
Toolbox for the Future
To address the evolving challenges in our industry, we at HPC-Gridware have embarked on a comprehensive overhaul of our core architecture, which is built on Univa Grid Engine. Our primary focus has been on further standardizing internal, non-visible product components. This approach will not only enhance the maintenance and development of our Gridware Cluster Scheduler but also empower us to introduce innovative new features.
Self-Sustaining Data Stores
A data store, in the context of the Gridware Cluster Scheduler, is a specialized in-memory component designed to hold object information and data structures that are frequently accessed by the system.
The previous architecture’s monolithic data store, which relied on a single central repository, has proven to be outdated. It limited parallel processing to only a few concurrent threads, and large, data-heavy requests frequently stalled overall performance.
The foundation of our new architecture comprises small, autonomous data stores that are capable of self-management and updates. These stores can be dynamically created to handle numerous requests in parallel. A prime example of this is the processing of authentication requests.
Each external client component generates authentication requests that can now be managed independently, without needing to involve the core system, the scheduler, or other components tied to the scheduler. To achieve this decoupling, we have extracted the data necessary for authentication from the main data repository and placed it into a dedicated data store. This store can continuously update and manage itself through its own thread.
All external authentication requests are now directed exclusively to this specialized data store, which processes them in parallel using a pool of threads. Importantly, this occurs without requiring the main data store, other secondary data stores, or additional system components to be involved.

The primary benefit of this architecture is that the number of requests processed concurrently is constrained only by the number of threads assigned to the data store. This thread count can be dynamically adjusted to accommodate the volume of incoming requests, effectively leveraging the high core count of modern CPUs.
In addition to authentication requests, any request that involves reading data can be processed using this architecture. This includes user queries about job, queue, or host statuses (qstat, qhost), available resources (qstat -F, qstat -j, etc.), or cluster settings (qconf -s…). Furthermore, requests from other system components, such as services running on thousands of compute nodes (sge_execd) that require the same information, can be efficiently managed using this new architecture.
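To make the idea more concrete, here is a minimal, illustrative C++ sketch — not the Gridware Cluster Scheduler source; names such as ReadOnlyStore and Snapshot are ours for this example only. A self-sustaining store keeps a snapshot of the data it needs, a dedicated updater thread refreshes that snapshot, and a small pool of reader threads answers requests in parallel without touching the main data store:
// Illustrative sketch only, not the Gridware Cluster Scheduler source.
// A self-sustaining, read-only data store: one updater thread keeps a
// snapshot fresh, while a configurable pool of reader threads serves
// requests in parallel, independently of any "main" data store.
#include <atomic>
#include <chrono>
#include <iostream>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical snapshot of the data needed for one request type,
// e.g. the user data required to answer authentication requests.
struct Snapshot {
    std::unordered_map<std::string, std::string> users;  // user -> role
};

class ReadOnlyStore {
public:
    explicit ReadOnlyStore(std::shared_ptr<const Snapshot> initial)
        : current_(std::move(initial)) {}

    // Readers take a reference to the current snapshot and then work on
    // it without holding any lock, so they never block each other.
    std::shared_ptr<const Snapshot> snapshot() const {
        std::lock_guard<std::mutex> lk(m_);
        return current_;
    }

    // The updater thread publishes a fresh snapshot with a single swap.
    void publish(std::shared_ptr<const Snapshot> next) {
        std::lock_guard<std::mutex> lk(m_);
        current_ = std::move(next);
    }

private:
    mutable std::mutex m_;
    std::shared_ptr<const Snapshot> current_;
};

int main() {
    ReadOnlyStore store(std::make_shared<const Snapshot>(
        Snapshot{{{"alice", "admin"}}}));
    std::atomic<bool> running{true};

    // Dedicated updater thread: the store keeps itself up to date.
    std::thread updater([&] {
        while (running) {
            auto next = std::make_shared<Snapshot>(*store.snapshot());
            next->users["bob"] = "user";  // pretend an update event arrived
            store.publish(std::move(next));
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
        }
    });

    // Reader pool: its size can be adjusted to the expected request volume.
    std::vector<std::thread> readers;
    for (int i = 0; i < 4; ++i) {
        readers.emplace_back([&, i] {
            auto snap = store.snapshot();  // e.g. answer "is bob known?"
            std::cout << "reader " << i << " sees "
                      << snap->users.size() << " users\n";
        });
    }

    for (auto& t : readers) t.join();
    running = false;
    updater.join();
}
Because each reader only takes a reference to the current snapshot and then works on its own copy, the number of requests served in parallel is limited only by the size of the reader pool.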
Cascaded Processing by Groups of Thread Pools
The repeated application of self-sustaining data stores, combined with specialized thread pools, significantly increases the number of requests processed in parallel. This processing can be visualized as multiple pipelines, where each external and internal task is divided into multiple sub-tasks, which are efficiently handled by dedicated thread pools. Each thread pool has immediate access to the necessary data, allowing it to focus solely on processing its specific sub-tasks.

While this approach may introduce a slight delay for individual requests due to the additional handover points, overall it ensures:
- Maximization of total request capacity.
- Equal processing of requests, regardless of their type.
- Maximization of the number of requests processed in parallel.
- Reduction of system blockages under high load caused by overwhelming requests.
- Improved utilization of contemporary hardware resources.
By leveraging this cascaded processing model, we are enhancing our system’s efficiency and responsiveness, ensuring that we can better meet the demands of our users while optimizing resource allocation.
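The cascading itself can be sketched in the same spirit — again an illustrative C++ example under simplified assumptions, not the scheduler’s internals. Two thread pools are connected by a handover queue, so the first pool only classifies incoming requests while the second pool only answers them:
// Illustrative sketch of cascaded processing: two thread pools connected
// by a handover queue. Stage 1 "parses" requests, stage 2 "answers" them.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// A minimal blocking queue acting as the handover point between stages.
template <typename T>
class HandoverQueue {
public:
    void push(T item) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(item)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return closed_ || !q_.empty(); });
        if (q_.empty()) return std::nullopt;  // closed and fully drained
        T item = std::move(q_.front());
        q_.pop();
        return item;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
    bool closed_ = false;
};

int main() {
    HandoverQueue<std::string> raw;      // incoming requests
    HandoverQueue<std::string> parsed;   // handover point between the pools

    // Stage 1 pool: parse/classify incoming requests.
    std::vector<std::thread> stage1;
    for (int i = 0; i < 2; ++i)
        stage1.emplace_back([&] {
            while (auto req = raw.pop())
                parsed.push("parsed(" + *req + ")");
        });

    // Stage 2 pool: answer requests using its own data store.
    std::vector<std::thread> stage2;
    for (int i = 0; i < 4; ++i)
        stage2.emplace_back([&] {
            while (auto req = parsed.pop())
                std::cout << "answered " << *req << "\n";
        });

    for (int j = 0; j < 8; ++j) raw.push("request-" + std::to_string(j));
    raw.close();
    for (auto& t : stage1) t.join();
    parsed.close();
    for (auto& t : stage2) t.join();
}
Each additional handover point adds a small per-request delay, but it lets every pool run at full speed on its own sub-task with its own data — exactly the trade-off described above.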
In our test system, we compared various cluster performance metrics across different Gridware Cluster Scheduler and Grid Engine versions, and we were thrilled to measure a 25% reduction in overall runtime (with the same number of compute nodes and jobs). Even better, we achieved this while accepting 50% more job submissions per second and processing more than 2.5 times as many user requests per second!
Automated Session Management
To ensure a consistent view of the overall system, we require a session concept that is as transparent as possible for the user.
Example:
job_id=$(qsub -terse -b y sleep 60)
qstat -j $job_id
...
In this example, a job is submitted (line 1) and the job’s status is queried (line 2). Although the two commands are executed sequentially, a WLM does not necessarily guarantee that the job is already visible to the status query once the submit command has returned. Providing that guarantee requires a session concept.
Other commercial WLMs require sessions to be manually created, which must then be closed after use. Such sessions are typically restricted to the script that created them.
The Gridware Cluster Scheduler takes a different approach by automatically creating cross-host sessions for each user — when needed. This ensures that users can interact with the system consistently and seamlessly, without the need to manage sessions themselves. This automation enhances user-friendliness and significantly optimizes workflow.
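Conceptually — and this is only a simplified C++ illustration with hypothetical names, not the actual session implementation — such a session can be thought of as remembering the sequence number of the user’s last write, so that a later read from the same session waits until the data store answering it has caught up:
// Illustrative sketch of write-then-read consistency via sessions.
// A write records a sequence number in the user's session; a later read
// from the same session blocks until the serving data store has caught up.
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <thread>

class ReplicatedStore {
public:
    // Called by the store's updater thread whenever it applies an event.
    void apply_event(std::uint64_t seq) {
        { std::lock_guard<std::mutex> lk(m_); applied_ = seq; }
        cv_.notify_all();
    }
    // Called on behalf of a read request: wait until the store is at
    // least as new as the session's last write.
    void wait_for(std::uint64_t seq) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return applied_ >= seq; });
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::uint64_t applied_ = 0;
};

struct Session {                  // created automatically per user
    std::uint64_t last_write = 0;
};

int main() {
    ReplicatedStore store;
    Session session;

    // "qsub": the write is accepted as event #1 and noted in the session.
    session.last_write = 1;

    // The updater thread delivers that event to the read-only store later.
    std::thread updater([&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        store.apply_event(1);
    });

    // "qstat -j": the read waits until the session's write is visible.
    store.wait_for(session.last_write);
    std::cout << "job is now guaranteed to be visible\n";

    updater.join();
}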
Potential Future Extensions
The space created in the core of our system through recent changes opens up numerous opportunities for future enhancements. Here are some promising areas we are considering:
Predictive Analytics: Improved access to runtime information allows not only for better planning but also for forecasting future demands within the system. By analyzing historical data, particularly with the aid of machine learning techniques, we can generate more accurate predictions that optimize resource allocation and prevent bottlenecks.
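As a deliberately trivial illustration of this direction — our own toy example, not a planned feature — even a simple moving average over historical samples yields a first forecast that more sophisticated machine-learning models would refine:
// Illustrative only: a trivial moving-average forecast of pending jobs,
// standing in for the far richer models mentioned above.
// The history data and window size are made up for this example.
#include <cstddef>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // Hypothetical history: pending jobs sampled once per hour.
    std::vector<double> pending = {120, 135, 150, 160, 158, 170, 180, 175};
    const std::size_t window = 4;

    // Forecast the next sample as the mean of the last `window` samples.
    double sum = std::accumulate(pending.end() - window, pending.end(), 0.0);
    std::cout << "forecast for next hour: " << sum / window
              << " pending jobs\n";
}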
Further Performance Improvements: Continuous optimization of the thread pool structure and data processing can further enhance the performance of our system. This includes increasing the number of requests processed simultaneously, reducing latency, and improving responsiveness to keep pace with the growing number of compute nodes and CPU cores in the cluster.
Heterogeneous CPU Support: Supporting heterogeneous CPU architectures is a crucial step in enhancing the capabilities of our Gridware Cluster Scheduler. Scheduling involves not just traditional resources like memory and I/O load, but also the types of CPU cores assigned. Powerful GPUs used for specific tasks can only reach their full potential when the appropriate CPU cores are allocated alongside them. Additionally, significant energy-efficiency gains can be achieved by scheduling onto heterogeneous CPU architectures.
These enhancement opportunities will help us future-proof our system and meet the growing demands of our users.
Try Out and Get in Touch
We invite you to try out the new features and improvements in our Cluster Scheduler. The source code for Open Cluster Scheduler is available on GitHub.
Prebuilt packages for Open Cluster Scheduler are available for Linux on lx-amd64, lx-arm64, and lx-riscv64, and for FreeBSD on fbsd-amd64.
Additionally, the Gridware Cluster Scheduler is also available for Linux on lx-ppc64le and lx-s390x, for Solaris on sol-amd64, and for older Linux distributions with older glibc versions as ulx-amd64 and xlx-amd64.
Download of Open Cluster Scheduler and Gridware Cluster Scheduler
Your feedback is invaluable to us as we continue to enhance our system. To learn more about how Gridware can revolutionize your HPC and AI workloads, please contact us for a personal consultation at ebablick@hpc-gridware.com.