Launches

Clusterfudge was designed with the belief that complex machine-learning workloads and the processes they launch are not well represented by static configuration languages such as YAML.

To truly support complex process relationships and nuanced parameter configuration, a fully featured programming language is required.

This is why every launch in Clusterfudge is defined in Python and initiated by running a launch script from your developer machine.

Initiating a launch

python launch.py

The launch script itself leverages the Clusterfudge Python API to define the shape of the workload to launch, the resources required, and the nodes that should be used. You can include custom code both before and after initiating the launch to help with dynamic configuration and setup.

Launches, Jobs, Processes and Replicas

  • A Launch encapsulates a single run of a launch script.
    • One launch can have multiple jobs defined inside of it.
  • A Job represents a set of processes that are logically related and therefore grouped together.
    • Each job can have multiple processes defined inside of it.
  • A Process maps to a Linux process as it is executed on the node.
    • A process may spawn additional child processes (which are not explicitly tracked by Clusterfudge).
  • A Replica is a single instance of a job; you can scale up the number of replicas to request more copies of your job.
    • Replicas are bin-packed onto nodes. For example, if each replica requests 2 GPUs and a node has 8 GPUs, that node can run up to four replicas.
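
The bin-packing rule above comes down to integer division. The sketch below is illustrative only; replicas_per_node is a hypothetical helper, not part of the Clusterfudge API, and it assumes the scheduler packs purely by GPU count:

```python
def replicas_per_node(gpus_per_node: int, gpus_per_replica: int) -> int:
    """Maximum replicas of a job that fit on a single node."""
    if gpus_per_replica <= 0:
        raise ValueError("each replica must request at least one GPU")
    return gpus_per_node // gpus_per_replica

# A replica requesting 2 GPUs on an 8-GPU node: up to four replicas fit.
print(replicas_per_node(8, 2))
```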

Resources

To specify which GPUs your processes need access to, you include a resources request for each process. Clusterfudge supports most NVIDIA data-center devices, with some support for consumer-grade cards as well.

If you do not specify any resource requirements, your replicas will all be placed onto a single node, and the scheduler will not mark that node as in use, so other workloads will continue to be scheduled onto it. Explicitly specify your resource requirements to ensure workloads are allocated efficiently across your accelerators.

Specifying a process should use two H100s

clusterfudge.CreateLaunchRequest(
    ...
    jobs=[
        clusterfudge.Job(
            short_name="experiment",
            replicas=1,
            processes=[
                clusterfudge.Process(
                    command=["python3", "experiment.py"],
                    resource_requirements=clusterfudge.Resources(
                        h100=2,
                    ),
                ),
            ],
        ),
    ],
)

Node Selection

By default, Clusterfudge considers all of your nodes when making scheduling decisions. Organizations often want to compartmentalize their compute infrastructure across teams and projects.

We use the concepts of clusters and shards to represent these allocations.

You can optionally specify the cluster and/or shard your workload should launch on, or narrow the node selection down even further by using the hostnames filter.

Specifying a cluster and a shard

clusterfudge.CreateLaunchRequest(
    ...
    cluster='primary',
    shard='rl',
    jobs=[...],
)

Specifying a specific host

clusterfudge.CreateLaunchRequest(
    ...
    hostnames=['my-pet-node'],
    jobs=[...],
)

Replica Failure Behaviour

A key aspect of high utilization on your compute clusters is quick and graceful termination of failed launches.

As soon as resources are no longer being utilized, they should be returned to the available pool immediately.

Clusterfudge achieves this with the concept of Replica Failure Behaviour, which lets you configure how your replicas respond to failure events.

For example, if you are doing a distributed training run where progress is impossible if even a single process on any node fails, you can specify that for your job.

Specifying that the launch should be killed if a single replica fails

clusterfudge.CreateLaunchRequest(
    ...
    jobs=[
        clusterfudge.Job(
            short_name="python",
            replicas=3,
            replica_failure_behaviour=clusterfudge.OnReplicaFailureOtherReplicasAreStopped(),
            processes=[
                clusterfudge.Process(
                    command=["python3", "this_will_fail.py"],
                ),
                clusterfudge.Process(
                    command=["python3", "this_will_run_without_issue.py"],
                ),
            ],
        ),
    ],
)

Deployment

Before we can launch your workloads, we first need to get your code onto each of the nodes. You can do this using Git or zip files. We use the term Deployment for this process of taking code from your local machine, or from a centralized repository, onto your cluster.

Cloning from a Git Repository

Inside your launch script, you can specify the git repo, the branch, and, optionally, a specific commit to check out using the GitRepo deployment option.

Specifying a git repo w/ branch in a launch script

clusterfudge.CreateLaunchRequest(
    deployment=clusterfudge.GitRepo(
        repo="https://github.com/clusterfudgeai/examples.git",
        branch="main",
        commit="abcd123",  # long or short commit digest
    ),
    jobs=[
        ...
    ],
)

You can specify either an HTTPS or SSH repo URL, in either of the following two formats:

git repo formats

https://github.com/clusterfudgeai/examples.git
git@github.com:clusterfudgeai/examples.git
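
The two formats map onto one another mechanically. As a sketch, a hypothetical helper (not part of the Clusterfudge API) converting the HTTPS form into the SSH form:

```python
def https_to_ssh(url: str) -> str:
    """Convert an HTTPS git URL to the equivalent SSH form."""
    prefix = "https://"
    if not url.startswith(prefix):
        raise ValueError(f"not an HTTPS git URL: {url}")
    host, _, path = url[len(prefix):].partition("/")
    return f"git@{host}:{path}"

print(https_to_ssh("https://github.com/clusterfudgeai/examples.git"))
```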

To pull a private repository, use the SSH format (git@...). The node needs an SSH deploy key, which can be referenced in the fudgelet.toml config file. In addition, a known_hosts file is also required. The known_hosts file must contain the fingerprints of the servers you explicitly trust to clone code from (e.g. if you are cloning from GitHub, you will want to add GitHub's fingerprints).

fudgelet.toml configuration items for cloning private SSH repos

...
git_known_hosts_file_path = "..."
git_private_ssh_key_file_path = "..."
...

Zip files

Specifying that your local directory should be zipped

clusterfudge.CreateLaunchRequest(
    description="Example showing how to deploy via zip",
    deployment=clusterfudge.LocalDir(),
    jobs=[...]
)

Working Directory

Clusterfudge runs your process from inside a temporary directory on the node. If you have included a Git repo or Zip file, that will be extracted into this temporary directory as well.

The command specified in your launch script is invoked from the root of this temporary directory. This means you should use relative file paths whenever you reference source files in your launch script.

This temporary working directory persists after the job has completed, so any artifacts you produce can be retrieved from it. Do note that the operating system will eventually clean up these temporary directories (often on reboot).

Environment Variables

Clusterfudge injects a series of environment variables into your processes to enable the coordination of large-scale distributed training runs. It also automatically sets the CUDA_VISIBLE_DEVICES environment variable, which references the devices the scheduler determined the process should use.

  • CUDA_VISIBLE_DEVICES: The index (or comma-separated indices) of the accelerator(s) the process has been assigned.
  • CLUSTERFUDGE_LAUNCH_ID: UUID that uniquely identifies the launch.
  • CLUSTERFUDGE_JOB_NAME: The name of the job this process is part of.
  • CLUSTERFUDGE_REPLICA_INDEX: The global replica index of the job this process is part of.
  • CLUSTERFUDGE_REPLICA_COUNT: The total number of replicas specified for this job.
  • CLUSTERFUDGE_PROCESS_COUNT_ACROSS_ALL_REPLICAS: The total number of processes launched across all replicas of this job.
  • CLUSTERFUDGE_PROCESS_INDEX: The local index of this process within its job, on this node.
  • CLUSTERFUDGE_PROCESS_INDEX_ACROSS_ALL_REPLICAS: The global index of this process across all replicas of this job.
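
A minimal sketch of how a process might read these variables, with fallback defaults so the script also runs standalone outside Clusterfudge (the variable names come from the list above; the defaults are an assumption for local testing):

```python
import os

# Read the injected coordination variables, defaulting to a
# single-process configuration when run outside Clusterfudge.
replica_index = int(os.environ.get("CLUSTERFUDGE_REPLICA_INDEX", "0"))
replica_count = int(os.environ.get("CLUSTERFUDGE_REPLICA_COUNT", "1"))
process_index = int(os.environ.get("CLUSTERFUDGE_PROCESS_INDEX_ACROSS_ALL_REPLICAS", "0"))
world_size = int(os.environ.get("CLUSTERFUDGE_PROCESS_COUNT_ACROSS_ALL_REPLICAS", "1"))

# A common convention: the globally first process acts as the leader
# (e.g. it writes checkpoints or initializes the rendezvous).
is_leader = process_index == 0

print(f"replica {replica_index}/{replica_count}, process {process_index}/{world_size}")
```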