Cluster Workload Manager

October 11, 2025

A lightweight, multithreaded workload manager for LAN clusters written in Go. This manager runs on each node in the cluster and distributes jobs based on available thread capacity.

Originally created to support running distributed workloads for the Boltzmannomics project (https://github.com/ianfr/economic-simulation).

Note that this tool is a work in progress and should only be run in trusted environments: authentication and encryption have not been added yet. The only access control is IP whitelisting, which is not secure on its own.

Features

Thread-aware scheduling: Jobs are queued and executed based on thread availability
Distributed job management: Head node delegates jobs to worker nodes
REST API: Submit and monitor jobs via HTTP endpoints
Standalone mode: Can run on a single node without clustering
Web-Based Monitoring GUI: Simple UI to monitor cluster health and job status

A screenshot of the monitoring UI with the workload manager running in standalone mode

Architecture

The workload manager operates in three modes:

Head node: Accepts job submissions, runs jobs locally, and delegates to worker nodes
Worker node: Receives jobs from the head node and executes them locally
Standalone: Runs independently without clustering

Installation

Build the executable

cd golang-scheduler

# Build with default architecture
# (the subshell keeps the working directory at the repository root)
go build -o workload-manager
(cd monitor && go build -o monitor)

# Alternatively, compile for ARM64
GOARCH=arm64 go build -o workload-manager
(cd monitor && GOARCH=arm64 go build -o monitor)

Deploy to cluster nodes

Note: While not strictly necessary, a shared filesystem is strongly recommended to facilitate centralized job logging.

Copy both executables to the cluster's shared filesystem:

# Example using scp
scp workload-manager user@node1:/mnt/md0/cluster
scp monitor/monitor user@node1:/mnt/md0/cluster

Configuration

Create a configuration file for each node:

Head Node Configuration (`config-head.json`)

{
  "role": "head",
  "listen_port": 8080,
  "max_threads": 16,
  "worker_nodes": [
    "192.168.1.101:8080",
    "192.168.1.102:8080",
    "192.168.1.103:8080"
  ]
}

Worker Node Configuration (`config-worker.json`)

{
  "role": "worker",
  "listen_port": 8080,
  "max_threads": 16,
  "head_node_address": "192.168.1.100:8080"
}

Standalone Configuration (`config-standalone.json`)

{
  "role": "standalone",
  "listen_port": 8080,
  "max_threads": 8
}

Configuration Options

role: "head", "worker", or "standalone"
listen_port: Port for the HTTP API (default: 8080)
max_threads: Maximum threads this node can use (default: number of CPU cores)
head_node_address: Address of the head node (required for workers)
worker_nodes: List of worker node addresses (required for head node)
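
To illustrate how these options fit together, here is a minimal Go sketch that decodes a config file, applies the documented defaults, and checks the per-role required fields. The `Config` struct and the `parseConfig` helper are hypothetical names chosen for this example; only the JSON keys and defaults come from the list above, and the project's own loader may differ.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"runtime"
)

// Config mirrors the documented JSON keys. The struct itself is an
// illustration, not the project's internal type.
type Config struct {
	Role            string   `json:"role"`
	ListenPort      int      `json:"listen_port"`
	MaxThreads      int      `json:"max_threads"`
	HeadNodeAddress string   `json:"head_node_address"`
	WorkerNodes     []string `json:"worker_nodes"`
}

// parseConfig decodes a config document, fills in the documented defaults,
// and rejects configs that miss a field required for their role.
func parseConfig(data []byte) (Config, error) {
	var cfg Config
	if err := json.Unmarshal(data, &cfg); err != nil {
		return cfg, fmt.Errorf("decode config: %w", err)
	}
	if cfg.ListenPort == 0 {
		cfg.ListenPort = 8080 // documented default
	}
	if cfg.MaxThreads == 0 {
		// Documented default is the number of CPU cores; NumCPU reports
		// logical CPUs, which is assumed to be what is meant.
		cfg.MaxThreads = runtime.NumCPU()
	}
	switch cfg.Role {
	case "head":
		if len(cfg.WorkerNodes) == 0 {
			return cfg, errors.New("head role requires worker_nodes")
		}
	case "worker":
		if cfg.HeadNodeAddress == "" {
			return cfg, errors.New("worker role requires head_node_address")
		}
	case "standalone":
		// no additional required fields
	default:
		return cfg, fmt.Errorf("unknown role %q", cfg.Role)
	}
	return cfg, nil
}

func main() {
	cfg, err := parseConfig([]byte(`{"role": "standalone", "max_threads": 8}`))
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	// prints "role=standalone port=8080 threads=8"
	fmt.Printf("role=%s port=%d threads=%d\n", cfg.Role, cfg.ListenPort, cfg.MaxThreads)
}
```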

Usage

Starting the Manager

On the head node:

./workload-manager -config config-head.json

On worker nodes:

./workload-manager -config config-worker.json

Standalone:

./workload-manager -config config-standalone.json

Submitting a Job

Submit a job via the REST API using curl or any HTTP client:

curl -X POST http://head-node:8080/api/v1/jobs/submit \
  -H "Content-Type: application/json" \
  -d '{
    "command": "sleep 10 && echo Hello World",
    "threads": 2,
    "stdout_path": "/mnt/md0/cluster/jobs/job1.stdout",
    "stderr_path": "/mnt/md0/cluster/jobs/job1.stderr"
  }'

Response:

{
  "job_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "message": "Job submitted successfully",
  "status": "queued"
}

Job Submission Fields

command: Shell command to execute (required)
threads: Number of threads the job will use (required)
stdout_path: Path where stdout will be written (required)
stderr_path: Path where stderr will be written (required)
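
The same submission can be made from Go. The sketch below is one possible client, not code from the project: `JobRequest`, `SubmitResponse`, `validate`, and `submit` are hypothetical names, the "at least 1 thread" rule is an assumption, and any 2xx status is treated as success because the exact status code is not documented here.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// JobRequest carries the four required submission fields.
type JobRequest struct {
	Command    string `json:"command"`
	Threads    int    `json:"threads"`
	StdoutPath string `json:"stdout_path"`
	StderrPath string `json:"stderr_path"`
}

// SubmitResponse matches the example response shown above.
type SubmitResponse struct {
	JobID   string `json:"job_id"`
	Message string `json:"message"`
	Status  string `json:"status"`
}

// validate checks the required fields client-side before any network call.
func validate(req JobRequest) error {
	switch {
	case req.Command == "":
		return errors.New("command is required")
	case req.Threads < 1: // assumption: a job needs at least one thread
		return errors.New("threads must be at least 1")
	case req.StdoutPath == "" || req.StderrPath == "":
		return errors.New("stdout_path and stderr_path are required")
	}
	return nil
}

// submit posts the job to baseURL and decodes the JSON reply.
func submit(client *http.Client, baseURL string, req JobRequest) (SubmitResponse, error) {
	var out SubmitResponse
	if err := validate(req); err != nil {
		return out, err
	}
	body, err := json.Marshal(req)
	if err != nil {
		return out, err
	}
	resp, err := client.Post(baseURL+"/api/v1/jobs/submit", "application/json", bytes.NewReader(body))
	if err != nil {
		return out, err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return out, fmt.Errorf("unexpected HTTP status %s", resp.Status)
	}
	err = json.NewDecoder(resp.Body).Decode(&out)
	return out, err
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	job := JobRequest{
		Command:    "sleep 10 && echo Hello World",
		Threads:    2,
		StdoutPath: "/mnt/md0/cluster/jobs/job1.stdout",
		StderrPath: "/mnt/md0/cluster/jobs/job1.stderr",
	}
	res, err := submit(client, "http://head-node:8080", job)
	if err != nil {
		fmt.Println("submit failed:", err)
		return
	}
	fmt.Println(res.JobID, res.Status)
}
```

Validating before sending keeps obviously malformed requests off the network; the server remains the authority on what it accepts.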

Checking Job Status

Get information about a specific job:

curl http://head-node:8080/api/v1/jobs/a1b2c3d4-e5f6-7890-abcd-ef1234567890

Response:

{
  "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "command": "sleep 10 && echo Hello World",
  "threads": 2,
  "stdout_path": "/mnt/md0/cluster/jobs/job1.stdout",
  "stderr_path": "/mnt/md0/cluster/jobs/job1.stderr",
  "status": "running",
  "node_address": "192.168.1.101:8080",
  "submitted_at": "2025-10-02T10:30:00Z",
  "started_at": "2025-10-02T10:30:05Z"
}
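
A client can decode this response to work out how long a job waited in the queue. The following Go sketch assumes the timestamps are RFC 3339 strings, as in the example, and that `started_at` is missing or zero while a job is still queued; `JobInfo`, `parseJob`, and `queueWait` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// JobInfo holds the fields of the job status response used in this example.
type JobInfo struct {
	ID          string     `json:"id"`
	Status      string     `json:"status"`
	NodeAddress string     `json:"node_address"`
	SubmittedAt time.Time  `json:"submitted_at"`
	StartedAt   *time.Time `json:"started_at"` // pointer: assumed absent while queued
}

// parseJob decodes a job status document.
func parseJob(data []byte) (JobInfo, error) {
	var j JobInfo
	err := json.Unmarshal(data, &j)
	return j, err
}

// queueWait returns the time between submission and start. The second
// result is false when the job has not started yet.
func queueWait(j JobInfo) (time.Duration, bool) {
	if j.StartedAt == nil || j.StartedAt.IsZero() {
		return 0, false
	}
	return j.StartedAt.Sub(j.SubmittedAt), true
}

func main() {
	raw := []byte(`{
	  "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
	  "status": "running",
	  "node_address": "192.168.1.101:8080",
	  "submitted_at": "2025-10-02T10:30:00Z",
	  "started_at": "2025-10-02T10:30:05Z"
	}`)
	job, err := parseJob(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	if wait, ok := queueWait(job); ok {
		// prints "running on 192.168.1.101:8080 after waiting 5s"
		fmt.Printf("%s on %s after waiting %s\n", job.Status, job.NodeAddress, wait)
	}
}
```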

Checking Node Status

Get the status of a node:

curl http://node:8080/api/v1/status

Response:

{
  "address": "",
  "max_threads": 16,
  "used_threads": 4,
  "available_threads": 12,
  "queued_jobs": 2,
  "running_jobs": 2,
  "completed_jobs": 15,
  "last_heartbeat": "2025-10-02T10:35:00Z",
  "is_healthy": true
}

Checking Cluster Status (Head Node Only)

Get the status of the entire cluster:

curl http://head-node:8080/api/v1/cluster/status

Response:

{
  "head_node": {
    "address": "localhost:8080",
    "max_threads": 16,
    "used_threads": 2,
    "available_threads": 14,
    "queued_jobs": 0,
    "running_jobs": 1,
    "completed_jobs": 10,
    "last_heartbeat": "2025-10-02T10:35:00Z",
    "is_healthy": true
  },
  "worker_nodes": [
    {
      "address": "192.168.1.101:8080",
      "max_threads": 16,
      "used_threads": 8,
      "available_threads": 8,
      "queued_jobs": 3,
      "running_jobs": 4,
      "completed_jobs": 20,
      "last_heartbeat": "2025-10-02T10:35:00Z",
      "is_healthy": true
    }
  ],
  "total_jobs": 45
}
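
As a sketch of how a client might consume this response, the Go example below decodes the cluster status and sums the free threads across healthy nodes. The struct and function names are hypothetical; the JSON keys are taken from the response above, and skipping unhealthy nodes in the total is a choice made for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// NodeStatus matches the per-node object in the status responses.
type NodeStatus struct {
	Address          string    `json:"address"`
	MaxThreads       int       `json:"max_threads"`
	UsedThreads      int       `json:"used_threads"`
	AvailableThreads int       `json:"available_threads"`
	QueuedJobs       int       `json:"queued_jobs"`
	RunningJobs      int       `json:"running_jobs"`
	CompletedJobs    int       `json:"completed_jobs"`
	LastHeartbeat    time.Time `json:"last_heartbeat"`
	IsHealthy        bool      `json:"is_healthy"`
}

// ClusterStatus matches the top-level cluster status object.
type ClusterStatus struct {
	HeadNode    NodeStatus   `json:"head_node"`
	WorkerNodes []NodeStatus `json:"worker_nodes"`
	TotalJobs   int          `json:"total_jobs"`
}

// freeThreads sums available_threads over the head node and all workers,
// counting only nodes reported as healthy.
func freeThreads(cs ClusterStatus) int {
	total := 0
	for _, n := range append([]NodeStatus{cs.HeadNode}, cs.WorkerNodes...) {
		if n.IsHealthy {
			total += n.AvailableThreads
		}
	}
	return total
}

func main() {
	// Trimmed version of the example response above.
	raw := []byte(`{
	  "head_node": {"address": "localhost:8080", "available_threads": 14, "is_healthy": true},
	  "worker_nodes": [
	    {"address": "192.168.1.101:8080", "available_threads": 8, "is_healthy": true}
	  ],
	  "total_jobs": 45
	}`)
	var cs ClusterStatus
	if err := json.Unmarshal(raw, &cs); err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println("free threads:", freeThreads(cs)) // prints "free threads: 22"
}
```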

Health Check

curl http://node:8080/api/v1/health

API Endpoints

Endpoint	Method	Description
`/api/v1/jobs/submit`	POST	Submit a new job
`/api/v1/jobs/{id}`	GET	Get job information
`/api/v1/jobs`	GET	List all jobs on this node
`/api/v1/status`	GET	Get node status
`/api/v1/health`	GET	Health check
`/api/v1/cluster/status`	GET	Get cluster status (head only)

How It Works

Job Scheduling

Job Submission: Jobs are submitted to the head node via the REST API
Node Selection: The head node selects the best worker based on:
- Available thread capacity
- Current queue length
- Node health status
Execution: Jobs run immediately if threads are available, otherwise they're queued
Completion: Job outputs are written to the specified paths on the shared filesystem
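
The node selection step could be sketched as follows. This README lists the three criteria but not how they are weighted, so the ranking used here is an assumption: skip unhealthy nodes, prefer nodes that can start the job immediately, then shorter queues, then more free threads. `Node`, `pickNode`, and `better` are hypothetical names.

```go
package main

import "fmt"

// Node is the subset of node status that the selection needs.
type Node struct {
	Address          string
	AvailableThreads int
	QueuedJobs       int
	Healthy          bool
}

// better reports whether a is a better host than b for a job that needs
// the given number of threads, under the assumed ranking.
func better(a, b Node, threads int) bool {
	aFits, bFits := a.AvailableThreads >= threads, b.AvailableThreads >= threads
	if aFits != bFits {
		return aFits // a node that can start the job now wins
	}
	if a.QueuedJobs != b.QueuedJobs {
		return a.QueuedJobs < b.QueuedJobs // then the shorter queue
	}
	return a.AvailableThreads > b.AvailableThreads // then more headroom
}

// pickNode returns the best healthy node, or false if none is healthy.
func pickNode(nodes []Node, threads int) (Node, bool) {
	var best Node
	found := false
	for _, n := range nodes {
		if !n.Healthy {
			continue
		}
		if !found || better(n, best, threads) {
			best, found = n, true
		}
	}
	return best, found
}

func main() {
	nodes := []Node{
		{Address: "192.168.1.101:8080", AvailableThreads: 8, QueuedJobs: 3, Healthy: true},
		{Address: "192.168.1.102:8080", AvailableThreads: 12, QueuedJobs: 0, Healthy: true},
		{Address: "192.168.1.103:8080", AvailableThreads: 16, QueuedJobs: 0, Healthy: false},
	}
	if n, ok := pickNode(nodes, 4); ok {
		fmt.Println("selected:", n.Address) // prints "selected: 192.168.1.102:8080"
	}
}
```

The third node has the most free threads but is never considered because it is unhealthy; the first loses to the second on queue length.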

Thread Management

Each job specifies the number of threads it will use
Jobs only execute when sufficient thread capacity is available
The scheduler uses a FIFO queue for pending jobs
Thread capacity is released when jobs complete
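
A minimal sketch of this thread accounting in Go follows. It assumes strict FIFO, meaning a large job at the front of the queue blocks smaller jobs behind it; whether the real scheduler lets smaller jobs skip ahead is not stated here. The `Scheduler` type and its methods are hypothetical, and a real implementation would also need a mutex and would have to reject jobs that request more than `maxThreads`.

```go
package main

import "fmt"

// Job is a unit of work with a declared thread requirement.
type Job struct {
	ID      string
	Threads int
}

// Scheduler tracks used thread capacity and a FIFO queue of pending jobs.
// It is not safe for concurrent use.
type Scheduler struct {
	maxThreads  int
	usedThreads int
	queue       []Job // pending jobs, oldest first
}

// Submit enqueues a job and returns every job that can start right now.
func (s *Scheduler) Submit(j Job) []Job {
	s.queue = append(s.queue, j)
	return s.dispatch()
}

// Release gives back the threads of a finished job and returns the jobs
// that the freed capacity allows to start.
func (s *Scheduler) Release(j Job) []Job {
	s.usedThreads -= j.Threads
	return s.dispatch()
}

// dispatch starts jobs from the front of the queue for as long as the
// oldest pending job fits into the remaining capacity.
func (s *Scheduler) dispatch() []Job {
	var started []Job
	for len(s.queue) > 0 && s.usedThreads+s.queue[0].Threads <= s.maxThreads {
		j := s.queue[0]
		s.queue = s.queue[1:]
		s.usedThreads += j.Threads
		started = append(started, j)
	}
	return started
}

func main() {
	s := &Scheduler{maxThreads: 8}
	a := Job{ID: "a", Threads: 6}
	b := Job{ID: "b", Threads: 4}
	c := Job{ID: "c", Threads: 2}

	fmt.Println(len(s.Submit(a)))  // 1: a starts, 6 of 8 threads in use
	fmt.Println(len(s.Submit(b)))  // 0: b needs 4, only 2 are free
	fmt.Println(len(s.Submit(c)))  // 0: c would fit, but b is ahead of it
	fmt.Println(len(s.Release(a))) // 2: b and c both start once a finishes
}
```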

Worker Health Monitoring

Head node performs health checks every 5 seconds
Unhealthy workers are excluded from job delegation
Jobs fail over to other nodes if submission fails
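
The monitoring loop and failover behavior can be sketched in Go as below. All function names are hypothetical, treating only HTTP 200 from `/api/v1/health` as healthy is an assumption, and the demo in `main` uses a stubbed check with a short interval so that it finishes quickly; a deployment matching this README would pass `5*time.Second` and a check built on `probe`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// probe reports whether a node's health endpoint answers with HTTP 200.
func probe(client *http.Client, addr string) bool {
	resp, err := client.Get("http://" + addr + "/api/v1/health")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// healthyWorkers keeps the workers for which check returns true.
func healthyWorkers(workers []string, check func(string) bool) []string {
	var ok []string
	for _, w := range workers {
		if check(w) {
			ok = append(ok, w)
		}
	}
	return ok
}

// watch re-evaluates worker health once per interval until ctx is done.
func watch(ctx context.Context, interval time.Duration, workers []string,
	check func(string) bool, report func([]string)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		report(healthyWorkers(workers, check))
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// submitWithFailover tries each candidate in order and returns the first
// address whose send call succeeds.
func submitWithFailover(candidates []string, send func(addr string) error) (string, error) {
	lastErr := errors.New("no candidate nodes")
	for _, addr := range candidates {
		if err := send(addr); err != nil {
			lastErr = fmt.Errorf("%s: %w", addr, err)
			continue
		}
		return addr, nil
	}
	return "", fmt.Errorf("job not accepted: %w", lastErr)
}

func main() {
	workers := []string{"192.168.1.101:8080", "192.168.1.102:8080"}
	// Stubbed health state for the demo. A real check would be
	// func(a string) bool { return probe(client, a) }.
	down := map[string]bool{"192.168.1.101:8080": true}
	check := func(addr string) bool { return !down[addr] }

	ctx, cancel := context.WithTimeout(context.Background(), 120*time.Millisecond)
	defer cancel()
	watch(ctx, 50*time.Millisecond, workers, check, func(h []string) {
		fmt.Println("healthy:", h)
	})

	addr, err := submitWithFailover(workers, func(addr string) error {
		if down[addr] {
			return errors.New("connection refused")
		}
		return nil
	})
	fmt.Println("accepted by:", addr, "error:", err)
}
```

Passing the check and send steps in as functions keeps the loop and the failover logic testable without a running cluster.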

Example: Running a Batch of Jobs

Create a script to submit multiple jobs:

#!/bin/bash

for i in {1..10}; do
  curl -X POST http://192.168.1.100:8080/api/v1/jobs/submit \
    -H "Content-Type: application/json" \
    -d "{
      \"command\": \"python3 process_data.py --input data_$i.txt\",
      \"threads\": 4,
      \"stdout_path\": \"/mnt/md0/cluster/jobs/job_$i.stdout\",
      \"stderr_path\": \"/mnt/md0/cluster/jobs/job_$i.stderr\"
    }"
  echo "Submitted job $i"
done

Monitoring

Monitor the cluster status continuously:

watch -n 5 'curl -s http://192.168.1.100:8080/api/v1/cluster/status | jq .'

Troubleshooting

Jobs not being delegated to workers

Verify worker nodes are accessible from the head node
Check worker node health: curl http://worker:8080/api/v1/health
Ensure worker addresses in head config match actual IPs/hostnames

Jobs stuck in queue

Check thread availability: curl http://node:8080/api/v1/status
Verify no jobs are consuming all threads
Check if max_threads configuration is appropriate

Cannot access shared filesystem paths

Ensure all nodes have the same mount point for shared storage
Verify write permissions on stdout/stderr paths
Test file creation: touch /mnt/md0/cluster/jobs/test.txt