Launches
Clusterfudge was designed with the belief that complex machine-learning workloads and the processes they launch are not well represented by static configuration languages such as YAML.
To truly support complex process relationships and nuanced parameter configuration, a fully featured programming language is required.
This is why every launch in Clusterfudge is defined in Python and initiated by running a launch script from your developer machine.
Initiating a launch
python launch.py
The launch script itself leverages the Clusterfudge Python API to define the shape of the workload to launch, the resources required, and the nodes that should be utilized. You can include custom code both before and after initiating the launch, for example to handle dynamic configuration and setup.
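As a sketch of what that can look like, the launch.py below derives its replica count dynamically before launching and inspects the result afterwards. The Client() constructor shown here is an assumption (consult your client setup); create_launch and CreateLaunchRequest are used as in the examples later on this page.
import os

import clusterfudge

# Custom code before the launch: derive configuration dynamically.
replicas = int(os.environ.get("NUM_REPLICAS", "1"))

client = clusterfudge.Client()  # assumed constructor, for illustration only
launch = client.create_launch(
    clusterfudge.CreateLaunchRequest(
        jobs=[
            clusterfudge.Job(
                short_name="experiment",
                replicas=replicas,
                processes=[
                    clusterfudge.Process(command=["python3", "experiment.py"]),
                ],
            ),
        ],
    )
)

# Custom code after the launch: record or log the result.
print(f"Created launch: {launch}")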
Launches, Jobs, Processes and Replicas
- A Launch encapsulates a single run of a launch script.
- One launch can have multiple jobs defined inside of it.
- A Job represents a set of processes that are logically related and therefore grouped together.
- Each job can have multiple processes defined inside of it.
- A Process maps to a Linux process as it is executed on the node.
- A process may spawn additional child processes (which are not explicitly tracked by Clusterfudge).
- A Replica is a single instance of a job; you can scale up the number of replicas to request more copies of your job.
- Replicas are bin-packed onto nodes. For example, if each replica requests two GPUs and a node has eight GPUs, that node can run up to four replicas (see the sketch below).
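A sketch of that bin-packing example, borrowing the Resources request introduced in the next section (the short_name and command here are illustrative): four replicas, each needing two H100s, can all be scheduled onto a single 8-GPU node.
clusterfudge.CreateLaunchRequest(
    ...
    jobs=[
        clusterfudge.Job(
            short_name="bin-packing-example",
            replicas=4,
            processes=[
                clusterfudge.Process(
                    command=["python3", "train.py"],
                    resource_requirements=clusterfudge.Resources(h100=2),
                ),
            ],
        ),
    ],
)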
Resources
To specify which GPUs your processes need access to, you include a resources request for each process.
Clusterfudge supports most NVIDIA data-center devices, with some support for consumer-grade cards as well.
If you do not specify any resource requirements, your replicas will be placed onto a single node, and the scheduler will not consider that node to be in use, so other workloads will continue to be scheduled onto that same node. Explicitly specify your resource requirements to ensure efficient allocation of workloads across your accelerators.
Specifying that a process should use two H100s
clusterfudge.CreateLaunchRequest(
    ...
    jobs=[
        clusterfudge.Job(
            short_name="experiment",
            replicas=1,
            processes=[
                clusterfudge.Process(
                    command=["python3", "experiment.py"],
                    resource_requirements=clusterfudge.Resources(
                        h100=2,
                    ),
                ),
            ],
        ),
    ],
)
Node Selection
By default, Clusterfudge will consider all of your nodes when making scheduling decisions. Organizations often like to logically compartmentalize their compute infrastructure across teams/projects.
We use the concepts of clusters and shards to represent these allocations.
You can optionally specify the cluster/shard you wish your workload to launch on, or narrow the node selection even further by using the hostnames filter.
Specifying a cluster and a shard
clusterfudge.CreateLaunchRequest(
    ...
    cluster='primary',
    shard='rl',
    jobs=[...],
)
Specifying a specific host
clusterfudge.CreateLaunchRequest(
    ...
    hostnames=['my-pet-node'],
    jobs=[...],
)
Replica Failure Behaviour
A key aspect of high utilization on your compute clusters is quick and graceful termination of failed launches.
As soon as resources are no longer being utilized, they should be returned to the available pool immediately.
Clusterfudge achieves this with the concept of Replica Failure Behaviour, which allows you to configure how your replicas should respond to failure events.
For example, if you are doing a distributed training run where progress is impossible once even a single process on any node fails, you can configure the job so that the failure of one replica stops the others.
Specifying that the launch should be killed if a single replica fails
clusterfudge.CreateLaunchRequest(
    ...
    jobs=[
        clusterfudge.Job(
            short_name="python",
            replicas=3,
            replica_failure_behaviour=clusterfudge.OnReplicaFailureOtherReplicasAreStopped(),
            processes=[
                clusterfudge.Process(
                    command=["python3", "this_will_fail.py"],
                ),
                clusterfudge.Process(
                    command=["python3", "this_will_run_without_issue.py"],
                ),
            ],
        ),
    ],
)
Deployment
Before we can launch your workloads, we first need to get your code onto each of the nodes. You can do this using git or zip files.
We use the term Deployment to describe this process of taking code from your local machine or from a centralized repository onto your cluster.
Cloning from a Git Repository
Inside your launch script, you can specify the git repo, the branch, and optionally a specific commit to check out using the GitRepo deployment option.
Specifying a git repo w/ branch in a launch script
clusterfudge.CreateLaunchRequest(
    deployment=clusterfudge.GitRepo(
        repo="https://github.com/clusterfudgeai/examples.git",
        branch="main",
        commit="abcd123",  # long or short commit hash
    ),
    jobs=[
        ...
    ],
)
You can specify an HTTPS or SSH repo URL in either of the following two formats:
git repo formats
https://github.com/clusterfudgeai/examples.git
git@github.com:clusterfudgeai/examples.git
To pull a private repository, use the SSH format (git@github.com:…). The node needs an SSH deploy key, which can be referenced in the fudgelet.toml config file.
In addition, a known_hosts file is also required. The known_hosts file must contain the fingerprints of the servers you explicitly trust to clone code from (e.g. if you are cloning from GitHub, you will want to add GitHub's fingerprints).
fudgelet.toml configuration items for cloning private SSH repos
...
git_known_hosts_file_path = "..."
git_private_ssh_key_file_path = "..."
...
Zip files
Specifying that your local directory should be zipped
clusterfudge.CreateLaunchRequest(
    description="Example showing how to deploy via zip",
    deployment=clusterfudge.LocalDir(),
    jobs=[...],
)
Working Directory
Clusterfudge runs your process from inside a temporary directory on the node. If you have included a Git repo or Zip file, that will be extracted into this temporary directory as well.
The command specified in your launch script is invoked from the root of this temporary directory. This means you should use relative file paths whenever you reference source files in your launch script.
This temporary working directory will persist after the job has completed, so any artifacts you produce can be stored there. Do note that your system will eventually clean up these temporary directories (often on reboot).
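For example, if your deployed repo contains a train.py and a configs/base.yaml (hypothetical file names), both can be referenced relative to the repo root:
clusterfudge.Process(
    # Paths are resolved from the temporary working directory, i.e. the
    # root of the extracted repo or zip file.
    command=["python3", "train.py", "--config", "configs/base.yaml"],
)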
Environment Variables
Clusterfudge injects a series of environment variables into your processes to enable the coordination of large-scale distributed training runs.
It also automatically sets the CUDA_VISIBLE_DEVICES environment variable, which references the devices the scheduler determined the process should use.
- CUDA_VISIBLE_DEVICES: The indices of the accelerators the process has been assigned.
- CLUSTERFUDGE_LAUNCH_ID: UUID that uniquely identifies the launch.
- CLUSTERFUDGE_JOB_NAME: The name of the job this process is part of.
- CLUSTERFUDGE_REPLICA_INDEX: The global replica index of the job this process is part of.
- CLUSTERFUDGE_REPLICA_COUNT: The total number of replicas specified for this job.
- CLUSTERFUDGE_PROCESS_COUNT_ACROSS_ALL_REPLICAS: The total number of processes launched across all replicas of this job.
- CLUSTERFUDGE_PROCESS_INDEX: The local index of this process within its job, on this node.
- CLUSTERFUDGE_PROCESS_INDEX_ACROSS_ALL_REPLICAS: The global index of this process across all replicas of this job.
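Inside your own code you can read these variables to coordinate work across replicas. A minimal sketch (the dataset sharding shown is just one illustrative use, not something Clusterfudge prescribes):
import os

# Read the injected variables (e.g. inside experiment.py).
replica_index = int(os.environ["CLUSTERFUDGE_REPLICA_INDEX"])
replica_count = int(os.environ["CLUSTERFUDGE_REPLICA_COUNT"])
process_index = int(os.environ["CLUSTERFUDGE_PROCESS_INDEX_ACROSS_ALL_REPLICAS"])
process_count = int(os.environ["CLUSTERFUDGE_PROCESS_COUNT_ACROSS_ALL_REPLICAS"])

# Illustrative: shard 1,000,000 examples across replicas, with replica i
# taking every replica_count-th example starting at index i.
my_examples = range(replica_index, 1_000_000, replica_count)

print(f"replica {replica_index}/{replica_count}, "
      f"process {process_index}/{process_count}")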
Queueing Launches
By default, if your cluster does not have capacity to run your workload immediately, your launch request will fail.
If you'd rather this workload were placed onto a queue, to be executed as soon as resources become available, you can request this behaviour using the QueueingBehaviour field in our API.
Specifying that a job should be queued if it cannot be scheduled immediately
launch = client.create_launch(
    clusterfudge.CreateLaunchRequest(
        name="queueing-example",
        description="Run this immediately, or pop it on the queue",
        queueing_behaviour=clusterfudge.QueueingBehaviour(
            enqueue_if_cluster_busy=True,
        ),
        jobs=[
            ...
        ],
    )
)