Nodes

Clusterfudge gives you full visibility into your nodes and the workloads running on them.

Clusters and Shards

By default, Clusterfudge will consider all of your nodes when making scheduling decisions. Organizations often like to logically compartmentalize their compute infrastructure across teams/projects.

We use the concepts of clusters and shards to represent these allocations.

By default, all nodes are placed in the 'Unassigned' cluster, in an 'Unassigned' shard. If you do not wish to use the cluster/shard functionality of Clusterfudge, you don't need to do anything else; the scheduler will freely choose from your entire node pool when launching workloads.

You can assign a cluster and/or shard to a node on the Nodes page of the Clusterfudge dashboard.
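As a rough illustration of how clusters and shards narrow the node pool, here is a minimal sketch. It is not Clusterfudge's API: the `Node` class, the `candidate_nodes` function, and the idea that a workload can be restricted to a given cluster or shard are all assumptions made for this example. Only the 'Unassigned' defaults come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical node record; the field names are illustrative only."""
    name: str
    cluster: str = "Unassigned"  # default cluster, as described above
    shard: str = "Unassigned"    # default shard, as described above


def candidate_nodes(
    nodes: List[Node],
    cluster: Optional[str] = None,
    shard: Optional[str] = None,
) -> List[Node]:
    """Return the nodes matching the requested cluster and/or shard.

    With no cluster or shard given, every node is a candidate, which
    mirrors the default behaviour of choosing from the entire pool.
    """
    return [
        n for n in nodes
        if (cluster is None or n.cluster == cluster)
        and (shard is None or n.shard == shard)
    ]


nodes = [
    Node("gpu-01"),  # left in Unassigned/Unassigned
    Node("gpu-02", cluster="research", shard="vision"),
    Node("gpu-03", cluster="research", shard="nlp"),
]

whole_pool = candidate_nodes(nodes)
research_only = candidate_nodes(nodes, cluster="research")
vision_only = candidate_nodes(nodes, cluster="research", shard="vision")
```

In this sketch, nodes that were never assigned simply keep the 'Unassigned' labels, so they remain candidates whenever no cluster or shard is requested.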

Cordoning

You can cordon a node using the Clusterfudge dashboard to immediately exclude it from scheduling decisions. This is our recommended approach for dealing with hardware issues or failures on your nodes, because cordoned nodes stay visible and you can see which nodes need replacing within a given cluster/shard.

By default, nodes appear as 'Uncordoned' and are therefore available for scheduling.
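The effect of cordoning can be pictured with a small sketch. Cordoning itself is done in the dashboard; the `Node` class, its `cordoned` flag, and both helper functions below are hypothetical names used only to illustrate the behaviour, not part of Clusterfudge.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical node record; the field names are illustrative only."""
    name: str
    cluster: str = "Unassigned"
    shard: str = "Unassigned"
    cordoned: bool = False  # nodes are uncordoned by default


def schedulable(nodes: List[Node]) -> List[Node]:
    """Cordoned nodes are excluded from scheduling decisions."""
    return [n for n in nodes if not n.cordoned]


def cordoned_by_shard(nodes: List[Node]) -> Dict[Tuple[str, str], List[str]]:
    """Group cordoned node names by (cluster, shard).

    This is the kind of overview that shows which nodes need
    replacing within a given cluster/shard.
    """
    report: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for n in nodes:
        if n.cordoned:
            report[(n.cluster, n.shard)].append(n.name)
    return dict(report)


nodes = [
    Node("gpu-01", cluster="research", shard="vision"),
    Node("gpu-02", cluster="research", shard="vision", cordoned=True),
    Node("gpu-03", cluster="research", shard="nlp", cordoned=True),
    Node("gpu-04"),
]

available = schedulable(nodes)
needs_attention = cordoned_by_shard(nodes)
```

Keeping the cordoned nodes in the list, rather than deleting them, is what preserves the per-shard overview.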

Offline Nodes

As soon as you install a fudgelet onto a node, it begins communicating with the Clusterfudge server. If this communication stops for at least 6 minutes, we consider that node to be offline. Offline nodes are still visible in the dashboard, but will not be considered by the scheduler. As soon as the fudgelet re-establishes communication with the server, the node will no longer be marked as offline.

If a node has been offline for over 24 hours, it will be removed from the dashboard. If it later re-establishes communication with the server, it will reappear automatically; no action is required.
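The timing rules above can be summarised in a short sketch. The function name, the status strings, and the choice to measure the 24-hour window from the last contact (rather than from the moment the node was marked offline) are simplifying assumptions for illustration; only the 6-minute and 24-hour thresholds come from the text.

```python
from datetime import datetime, timedelta, timezone

OFFLINE_AFTER = timedelta(minutes=6)   # silence of at least 6 minutes => offline
REMOVED_AFTER = timedelta(hours=24)    # over 24 hours => removed from the dashboard


def node_status(last_seen: datetime, now: datetime) -> str:
    """Classify a node by how long ago its fudgelet last contacted the server.

    Assumption: the 24-hour window is measured from the last contact.
    """
    silence = now - last_seen
    if silence > REMOVED_AFTER:
        return "removed"   # hidden from the dashboard until it reconnects
    if silence >= OFFLINE_AFTER:
        return "offline"   # still visible, but ignored by the scheduler
    return "online"


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
recent = node_status(now - timedelta(minutes=2), now)
stale = node_status(now - timedelta(minutes=6), now)
gone = node_status(now - timedelta(hours=25), now)
```

Because the status is derived purely from the last contact time, a node that reconnects is classified as online again on the next check, which matches the "no action is required" behaviour.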

GPU Processes (Used/Used Non-Clusterfudge)

The fudgelet keeps track of all processes on the node that are using GPUs. Some of these processes may have been launched via Clusterfudge, while others may be workloads that are managed outside of Clusterfudge.

The scheduler takes these processes into consideration when making scheduling decisions; you can safely schedule work onto a cluster even if other members of your team are not launching their workloads via Clusterfudge.
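To make the "Used" and "Used Non-Clusterfudge" distinction concrete, here is a minimal sketch of how GPU occupancy on one node could be tallied. The `GpuProcess` class, the `gpu_usage` function, and the assumption that a GPU counts as occupied as soon as any process is using it are hypothetical; the actual bookkeeping done by the fudgelet and scheduler may differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GpuProcess:
    """Hypothetical record of one process observed using a GPU."""
    pid: int
    gpu_index: int
    launched_by_clusterfudge: bool


def gpu_usage(total_gpus: int, processes: List[GpuProcess]) -> Dict[str, object]:
    """Split a node's GPUs into used, used non-Clusterfudge, and free.

    Assumption: a GPU is treated as occupied if any process is using it,
    regardless of who launched that process.
    """
    used = {p.gpu_index for p in processes if p.launched_by_clusterfudge}
    used_non_cf = {p.gpu_index for p in processes if not p.launched_by_clusterfudge}
    free = set(range(total_gpus)) - used - used_non_cf
    return {
        "used": len(used),
        "used_non_clusterfudge": len(used_non_cf),
        "free": sorted(free),  # indices a scheduler could still hand out
    }


processes = [
    GpuProcess(pid=1001, gpu_index=0, launched_by_clusterfudge=True),
    GpuProcess(pid=1002, gpu_index=1, launched_by_clusterfudge=True),
    GpuProcess(pid=2001, gpu_index=5, launched_by_clusterfudge=False),
]

usage = gpu_usage(total_gpus=8, processes=processes)
```

In this sketch, GPU 5 is never offered to a new workload even though Clusterfudge did not launch the process running on it, which is the property that makes mixed usage of a cluster safe.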