Capacity Computing

Introduction

In many cases, it is useful to submit a large number of computational jobs into the Slurm queue system. Running many (small) jobs is one of the most effective ways to execute embarrassingly parallel calculations, achieving the best runtime, throughput, and system utilization. This approach is called Capacity Computing.

However, submitting a huge number of individual jobs via the Slurm queue may strain the system: commands respond slowly, scheduling becomes inefficient, and performance and user experience degrade for all users.
We therefore recommend using Job arrays or HyperQueue to execute many jobs.

There are two primary scenarios:

  1. Number of jobs < 1500, and each job can utilize one or more full nodes:
    Use Job arrays.
    A job array allows you to submit and control up to 1500 jobs (tasks) in one packet. Several job arrays may be submitted; see the job array sketch after this list.

  2. Number of jobs >> 1500, or the jobs utilize only a few cores/accelerators each:
    Use HyperQueue.
    HyperQueue efficiently load-balances a very large number of jobs (tasks) across the available computing nodes. It may also be used if you have dependencies among the jobs; see the HyperQueue sketch after this list.
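For illustration, a minimal job array sketch. The job name, node count, task count, program name (`./myprog`), and input file naming scheme are placeholders you would replace with your own:

```bash
#!/bin/bash
#SBATCH --job-name=myarray
#SBATCH --nodes=1
#SBATCH --array=1-100        # run 100 tasks, indexed 1..100

# Each task selects its own input using the array index that
# Slurm exports in SLURM_ARRAY_TASK_ID.
# "./myprog" and "input.N" are hypothetical names.
./myprog input.$SLURM_ARRAY_TASK_ID
```

Submit the script with `sbatch jobarray.sh`. To limit how many tasks run at once, Slurm accepts a `%` suffix, e.g. `--array=1-100%10` runs at most 10 tasks concurrently.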
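And a minimal HyperQueue sketch, assuming the `hq` binary is available on the cluster; the node count, task count, and program name are placeholders:

```bash
# Start the HyperQueue server (typically on a login node, once per user).
hq server start &

# Start a worker on a compute node through Slurm so tasks get resources;
# more workers may be started the same way to scale out.
srun --nodes=1 hq worker start &

# Submit 1000 tasks; HyperQueue load-balances them across the workers.
# "./myprog" is a hypothetical program; each task can read its index
# from the HQ_TASK_ID environment variable.
hq submit --array 1-1000 ./myprog

# Watch progress.
hq job list
```

Because HyperQueue schedules tasks itself, only the worker allocations pass through the Slurm queue, so even very large task counts do not strain the Slurm scheduler.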