Cluster Workload Manager
October 11, 2025
A lightweight, multithreaded workload manager for LAN clusters written in Go. This manager runs on each node in the cluster and distributes jobs based on available thread capacity.
Originally created to support running distributed workloads for the Boltzmannomics project (https://github.com/ianfr/economic-simulation).
Note that this tool is a work in progress and should only be run in trusted environments: authentication and encryption have not been added yet, and the only access control is IP whitelisting, which is not secure.
Features
- Thread-aware scheduling: Jobs are queued and executed based on thread availability
- Distributed job management: Head node delegates jobs to worker nodes
- REST API: Submit and monitor jobs via HTTP endpoints
- Standalone mode: Can run on a single node without clustering
- Web-Based Monitoring GUI: Simple UI to monitor cluster health and job status

A screenshot of the monitoring UI with the workload manager running in standalone mode
Architecture
The workload manager operates in three modes:
- Head node: Accepts job submissions, runs jobs locally, and delegates to worker nodes
- Worker node: Receives jobs from the head node and executes them locally
- Standalone: Runs independently without clustering
Installation
Build the executable
cd golang-scheduler
# Build for the default (host) architecture
go build -o workload-manager
(cd monitor && go build -o monitor)
# Or cross-compile for ARM64
GOARCH=arm64 go build -o workload-manager
(cd monitor && GOARCH=arm64 go build -o monitor)
Deploy to cluster nodes
Note: While not strictly necessary, a shared filesystem is strongly recommended to facilitate centralized job logging.
Copy the executables to the shared filesystem for the cluster:
# Example using scp
scp workload-manager user@node1:/mnt/md0/cluster
scp monitor/monitor user@node1:/mnt/md0/cluster
Configuration
Create a configuration file for each node:
Head Node Configuration (config-head.json)
{
"role": "head",
"listen_port": 8080,
"max_threads": 16,
"worker_nodes": [
"192.168.1.101:8080",
"192.168.1.102:8080",
"192.168.1.103:8080"
]
}
Worker Node Configuration (config-worker.json)
{
"role": "worker",
"listen_port": 8080,
"max_threads": 16,
"head_node_address": "192.168.1.100:8080"
}
Standalone Configuration (config-standalone.json)
{
"role": "standalone",
"listen_port": 8080,
"max_threads": 8
}
Configuration Options
- role: "head", "worker", or "standalone"
- listen_port: Port for the HTTP API (default: 8080)
- max_threads: Maximum threads this node can use (default: number of CPU cores)
- head_node_address: Address of the head node (required for workers)
- worker_nodes: List of worker node addresses (required for head node)
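The options above map naturally onto a small Go struct. The following is a minimal sketch of how such a file could be parsed and validated, not the project's actual loader: the `Config` type and `ParseConfig` function are hypothetical names, the use of `runtime.NumCPU()` for the "number of CPU cores" default is an assumption, and so is rejecting a head node with an empty worker list.

```go
package main

import (
	"encoding/json"
	"fmt"
	"runtime"
	"strings"
)

// Config mirrors the JSON keys documented above. The type itself is a
// hypothetical sketch, not taken from the project source.
type Config struct {
	Role            string   `json:"role"`
	ListenPort      int      `json:"listen_port"`
	MaxThreads      int      `json:"max_threads"`
	HeadNodeAddress string   `json:"head_node_address"`
	WorkerNodes     []string `json:"worker_nodes"`
}

// ParseConfig decodes a JSON document, fills in the documented defaults,
// and checks the fields each role requires.
func ParseConfig(data []byte) (Config, error) {
	var cfg Config
	if err := json.Unmarshal(data, &cfg); err != nil {
		return cfg, fmt.Errorf("parse config: %w", err)
	}
	if cfg.ListenPort == 0 {
		cfg.ListenPort = 8080 // documented default
	}
	if cfg.MaxThreads == 0 {
		cfg.MaxThreads = runtime.NumCPU() // assumed meaning of "CPU cores"
	}
	switch cfg.Role {
	case "head":
		if len(cfg.WorkerNodes) == 0 {
			return cfg, fmt.Errorf("role %q requires worker_nodes", cfg.Role)
		}
	case "worker":
		if strings.TrimSpace(cfg.HeadNodeAddress) == "" {
			return cfg, fmt.Errorf("role %q requires head_node_address", cfg.Role)
		}
	case "standalone":
		// No cluster fields are needed.
	default:
		return cfg, fmt.Errorf("unknown role %q", cfg.Role)
	}
	return cfg, nil
}

func main() {
	cfg, err := ParseConfig([]byte(`{"role": "standalone", "max_threads": 8}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s on port %d with %d threads\n", cfg.Role, cfg.ListenPort, cfg.MaxThreads)
}
```

Validating at load time means a misconfigured worker (for example, one missing head_node_address) fails at startup rather than when the first job arrives.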
Usage
Starting the Manager
On the head node:
./workload-manager -config config-head.json
On worker nodes:
./workload-manager -config config-worker.json
Standalone:
./workload-manager -config config-standalone.json
Submitting a Job
Submit a job via the REST API using curl or any HTTP client:
curl -X POST http://head-node:8080/api/v1/jobs/submit \
-H "Content-Type: application/json" \
-d '{
"command": "sleep 10 && echo Hello World",
"threads": 2,
"stdout_path": "/mnt/md0/cluster/jobs/job1.stdout",
"stderr_path": "/mnt/md0/cluster/jobs/job1.stderr"
}'
Response:
{
"job_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"message": "Job submitted successfully",
"status": "queued"
}
Job Submission Fields
- command: Shell command to execute (required)
- threads: Number of threads the job will use (required)
- stdout_path: Path where stdout will be written (required)
- stderr_path: Path where stderr will be written (required)
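For callers that prefer Go over curl, the same request can be built with the standard library. This is a rough sketch under stated assumptions: `JobRequest`, `SubmitResponse`, `Validate`, and `SubmitJob` are hypothetical names, the client-side checks only restate the "required" notes above, and treating any 2xx status as success is a guess about the server's behavior.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
)

// JobRequest carries the four documented submission fields.
type JobRequest struct {
	Command    string `json:"command"`
	Threads    int    `json:"threads"`
	StdoutPath string `json:"stdout_path"`
	StderrPath string `json:"stderr_path"`
}

// SubmitResponse matches the example response shown above.
type SubmitResponse struct {
	JobID   string `json:"job_id"`
	Message string `json:"message"`
	Status  string `json:"status"`
}

// Validate checks that every required field is present before sending.
func (r JobRequest) Validate() error {
	switch {
	case r.Command == "":
		return errors.New("command is required")
	case r.Threads < 1:
		return errors.New("threads must be at least 1")
	case r.StdoutPath == "":
		return errors.New("stdout_path is required")
	case r.StderrPath == "":
		return errors.New("stderr_path is required")
	}
	return nil
}

// SubmitJob posts the request to baseURL (e.g. "http://head-node:8080").
func SubmitJob(baseURL string, req JobRequest) (SubmitResponse, error) {
	var out SubmitResponse
	if err := req.Validate(); err != nil {
		return out, err
	}
	body, err := json.Marshal(req)
	if err != nil {
		return out, err
	}
	resp, err := http.Post(baseURL+"/api/v1/jobs/submit", "application/json", bytes.NewReader(body))
	if err != nil {
		return out, err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return out, fmt.Errorf("unexpected status %s", resp.Status)
	}
	err = json.NewDecoder(resp.Body).Decode(&out)
	return out, err
}

func main() {
	// Only the payload is printed here so the example runs without a cluster.
	req := JobRequest{
		Command:    "echo hello",
		Threads:    2,
		StdoutPath: "/mnt/md0/cluster/jobs/job1.stdout",
		StderrPath: "/mnt/md0/cluster/jobs/job1.stderr",
	}
	if err := req.Validate(); err != nil {
		fmt.Println("invalid request:", err)
		return
	}
	body, _ := json.MarshalIndent(req, "", "  ")
	fmt.Println(string(body))
}
```

Letting `encoding/json` build the body avoids the hand-escaped quotes that shell-based submissions need.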
Checking Job Status
Get information about a specific job:
curl http://head-node:8080/api/v1/jobs/a1b2c3d4-e5f6-7890-abcd-ef1234567890
Response:
{
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"command": "sleep 10 && echo Hello World",
"threads": 2,
"stdout_path": "/mnt/md0/cluster/jobs/job1.stdout",
"stderr_path": "/mnt/md0/cluster/jobs/job1.stderr",
"status": "running",
"node_address": "192.168.1.101:8080",
"submitted_at": "2025-10-02T10:30:00Z",
"started_at": "2025-10-02T10:30:05Z"
}
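The timestamps in this response make it easy to see how long a job sat in the queue. A small sketch, assuming the timestamps are always RFC 3339 as in the example and that started_at is simply absent while a job is still queued (the source does not say); `JobInfo`, `ParseJobInfo`, and `QueueWait` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// JobInfo holds the subset of the job record needed for timing.
type JobInfo struct {
	ID          string     `json:"id"`
	Status      string     `json:"status"`
	NodeAddress string     `json:"node_address"`
	SubmittedAt time.Time  `json:"submitted_at"`
	StartedAt   *time.Time `json:"started_at"` // nil if the key is missing
}

// ParseJobInfo decodes a job record as returned by GET /api/v1/jobs/{id}.
func ParseJobInfo(data []byte) (JobInfo, error) {
	var j JobInfo
	err := json.Unmarshal(data, &j)
	return j, err
}

// QueueWait returns the time between submission and start. The second
// result is false when the job has not started yet.
func QueueWait(j JobInfo) (time.Duration, bool) {
	if j.StartedAt == nil {
		return 0, false
	}
	return j.StartedAt.Sub(j.SubmittedAt), true
}

func main() {
	raw := []byte(`{
		"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
		"status": "running",
		"node_address": "192.168.1.101:8080",
		"submitted_at": "2025-10-02T10:30:00Z",
		"started_at": "2025-10-02T10:30:05Z"
	}`)
	job, err := ParseJobInfo(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	if wait, ok := QueueWait(job); ok {
		fmt.Printf("job %s waited %s before starting on %s\n", job.ID, wait, job.NodeAddress)
	}
}
```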
Checking Node Status
Get the status of a node:
curl http://node:8080/api/v1/status
Response:
{
"address": "",
"max_threads": 16,
"used_threads": 4,
"available_threads": 12,
"queued_jobs": 2,
"running_jobs": 2,
"completed_jobs": 15,
"last_heartbeat": "2025-10-02T10:35:00Z",
"is_healthy": true
}
Checking Cluster Status (Head Node Only)
Get the status of the entire cluster:
curl http://head-node:8080/api/v1/cluster/status
Response:
{
"head_node": {
"address": "localhost:8080",
"max_threads": 16,
"used_threads": 2,
"available_threads": 14,
"queued_jobs": 0,
"running_jobs": 1,
"completed_jobs": 10,
"last_heartbeat": "2025-10-02T10:35:00Z",
"is_healthy": true
},
"worker_nodes": [
{
"address": "192.168.1.101:8080",
"max_threads": 16,
"used_threads": 8,
"available_threads": 8,
"queued_jobs": 3,
"running_jobs": 4,
"completed_jobs": 20,
"last_heartbeat": "2025-10-02T10:35:00Z",
"is_healthy": true
}
],
"total_jobs": 45
}
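A client can reduce this response to a single capacity figure. The sketch below decodes the documented fields and sums available_threads across healthy nodes; the type and function names are hypothetical, and skipping unhealthy nodes is a choice made for this example rather than documented behavior.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NodeStatus holds the per-node fields used in this example.
type NodeStatus struct {
	Address          string `json:"address"`
	MaxThreads       int    `json:"max_threads"`
	UsedThreads      int    `json:"used_threads"`
	AvailableThreads int    `json:"available_threads"`
	IsHealthy        bool   `json:"is_healthy"`
}

// ClusterStatus matches the top-level shape of the response above.
type ClusterStatus struct {
	HeadNode    NodeStatus   `json:"head_node"`
	WorkerNodes []NodeStatus `json:"worker_nodes"`
	TotalJobs   int          `json:"total_jobs"`
}

// ParseClusterStatus decodes the body of GET /api/v1/cluster/status.
func ParseClusterStatus(data []byte) (ClusterStatus, error) {
	var cs ClusterStatus
	err := json.Unmarshal(data, &cs)
	return cs, err
}

// FreeThreads sums available threads over the head node and all workers,
// counting only nodes that report themselves healthy.
func FreeThreads(cs ClusterStatus) int {
	total := 0
	nodes := append([]NodeStatus{cs.HeadNode}, cs.WorkerNodes...)
	for _, n := range nodes {
		if n.IsHealthy {
			total += n.AvailableThreads
		}
	}
	return total
}

func main() {
	raw := []byte(`{
		"head_node": {"address": "localhost:8080", "max_threads": 16,
			"used_threads": 2, "available_threads": 14, "is_healthy": true},
		"worker_nodes": [{"address": "192.168.1.101:8080", "max_threads": 16,
			"used_threads": 8, "available_threads": 8, "is_healthy": true}],
		"total_jobs": 45
	}`)
	cs, err := ParseClusterStatus(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("%d threads free across %d nodes\n", FreeThreads(cs), 1+len(cs.WorkerNodes))
}
```

For the example response above, the head node contributes 14 threads and the worker 8, for 22 in total.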
Health Check
curl http://node:8080/api/v1/health
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/jobs/submit | POST | Submit a new job |
| /api/v1/jobs/{id} | GET | Get job information |
| /api/v1/jobs | GET | List all jobs on this node |
| /api/v1/status | GET | Get node status |
| /api/v1/health | GET | Health check |
| /api/v1/cluster/status | GET | Get cluster status (head only) |
How It Works
Job Scheduling
- Job Submission: Jobs are submitted to the head node via the REST API
- Node Selection: The head node selects the best worker based on:
  - Available thread capacity
  - Current queue length
  - Node health status
- Execution: Jobs run immediately if threads are available, otherwise they're queued
- Completion: Job outputs are written to the specified paths on the shared filesystem
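The node selection step can be sketched as a simple ranking over the three listed criteria. How the real scheduler weighs them is not stated here, so the ordering below is an assumption: skip unhealthy nodes, prefer nodes that can run the job immediately, then the shortest queue, then the most free threads. `NodeStatus`, `better`, and `SelectNode` are hypothetical names.

```go
package main

import "fmt"

// NodeStatus is the subset of node state used for ranking.
type NodeStatus struct {
	Address          string
	AvailableThreads int
	QueuedJobs       int
	IsHealthy        bool
}

// better reports whether node a is preferable to node b for a job that
// needs the given number of threads.
func better(a, b NodeStatus, threads int) bool {
	aFits := a.AvailableThreads >= threads
	bFits := b.AvailableThreads >= threads
	if aFits != bFits {
		return aFits // a node that can start the job now wins
	}
	if a.QueuedJobs != b.QueuedJobs {
		return a.QueuedJobs < b.QueuedJobs // then the shorter queue
	}
	return a.AvailableThreads > b.AvailableThreads // then more headroom
}

// SelectNode returns the preferred healthy node, or false if none is healthy.
func SelectNode(nodes []NodeStatus, threads int) (NodeStatus, bool) {
	var best NodeStatus
	found := false
	for _, n := range nodes {
		if !n.IsHealthy {
			continue
		}
		if !found || better(n, best, threads) {
			best, found = n, true
		}
	}
	return best, found
}

func main() {
	nodes := []NodeStatus{
		{Address: "192.168.1.101:8080", AvailableThreads: 2, QueuedJobs: 0, IsHealthy: true},
		{Address: "192.168.1.102:8080", AvailableThreads: 8, QueuedJobs: 3, IsHealthy: true},
		{Address: "192.168.1.103:8080", AvailableThreads: 16, QueuedJobs: 0, IsHealthy: false},
	}
	if n, ok := SelectNode(nodes, 4); ok {
		fmt.Println("4-thread job goes to", n.Address)
	}
}
```

In this example the third node has the most free threads but is unhealthy, and the first cannot fit a 4-thread job right away, so the second node is chosen despite its longer queue.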
Thread Management
- Each job specifies the number of threads it will use
- Jobs only execute when sufficient thread capacity is available
- The scheduler uses a FIFO queue for pending jobs
- Thread capacity is released when jobs complete
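The rules above amount to a counter plus a queue. The following sketch shows one way they could fit together. It is not the project's code: `Scheduler` and its methods are hypothetical names, and it assumes strict FIFO, meaning a large job at the head of the queue blocks smaller jobs behind it. The source only says the queue is FIFO, so the real behavior may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// Job is a unit of work that occupies a fixed number of threads.
type Job struct {
	ID      string
	Threads int
}

// Scheduler tracks thread usage and a FIFO queue of pending jobs.
type Scheduler struct {
	mu          sync.Mutex
	maxThreads  int
	usedThreads int
	queue       []Job
}

func NewScheduler(maxThreads int) *Scheduler {
	return &Scheduler{maxThreads: maxThreads}
}

// Submit enqueues a job and returns any jobs that can start right now.
// Jobs that could never fit on this node are rejected up front.
func (s *Scheduler) Submit(j Job) ([]Job, error) {
	if j.Threads < 1 || j.Threads > s.maxThreads {
		return nil, fmt.Errorf("job %s: threads must be between 1 and %d", j.ID, s.maxThreads)
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.queue = append(s.queue, j)
	return s.dispatch(), nil
}

// Complete releases a finished job's threads and returns newly started jobs.
func (s *Scheduler) Complete(j Job) []Job {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.usedThreads -= j.Threads
	return s.dispatch()
}

// dispatch starts queued jobs in order until the head no longer fits.
// The caller must hold s.mu.
func (s *Scheduler) dispatch() []Job {
	var started []Job
	for len(s.queue) > 0 {
		next := s.queue[0]
		if next.Threads > s.maxThreads-s.usedThreads {
			break // strict FIFO: do not skip ahead of the head job
		}
		s.queue = s.queue[1:]
		s.usedThreads += next.Threads
		started = append(started, next)
	}
	return started
}

func main() {
	s := NewScheduler(4)
	a := Job{ID: "a", Threads: 3}
	started, _ := s.Submit(a)
	fmt.Println("after submitting a:", started)
	started, _ = s.Submit(Job{ID: "b", Threads: 2})
	fmt.Println("after submitting b:", started)
	fmt.Println("after a completes:", s.Complete(a))
}
```

With 4 threads in total, job a starts immediately, job b waits because only one thread is free, and b starts once a completes.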
Worker Health Monitoring
- Head node performs health checks every 5 seconds
- Unhealthy workers are excluded from job delegation
- Jobs fail over to other nodes if submission fails
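A periodic health check like the one described can be built from a ticker and a probe function. This is a sketch under assumptions, not the actual implementation: `Monitor`, `HTTPProbe`, and the other names are hypothetical, treating an HTTP 200 from /api/v1/health as healthy is a guess at the endpoint's contract, and the 2-second request timeout is arbitrary.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"sort"
	"sync"
	"time"
)

// Monitor records which workers passed their most recent probe.
type Monitor struct {
	mu      sync.RWMutex
	healthy map[string]bool
	probe   func(addr string) bool
}

func NewMonitor(workers []string, probe func(addr string) bool) *Monitor {
	m := &Monitor{healthy: make(map[string]bool), probe: probe}
	for _, w := range workers {
		m.healthy[w] = false // not eligible until a probe succeeds
	}
	return m
}

// HTTPProbe calls the documented health endpoint on a worker.
func HTTPProbe(addr string) bool {
	client := http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://" + addr + "/api/v1/health")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// CheckOnce probes every worker and stores the results.
func (m *Monitor) CheckOnce() {
	m.mu.RLock()
	addrs := make([]string, 0, len(m.healthy))
	for a := range m.healthy {
		addrs = append(addrs, a)
	}
	m.mu.RUnlock()
	for _, a := range addrs {
		ok := m.probe(a) // probe without holding the lock; it may block
		m.mu.Lock()
		m.healthy[a] = ok
		m.mu.Unlock()
	}
}

// Run checks immediately, then once per interval until ctx is cancelled.
func (m *Monitor) Run(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		m.CheckOnce()
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// Healthy lists the workers currently eligible for delegation, sorted.
func (m *Monitor) Healthy() []string {
	m.mu.RLock()
	defer m.mu.RUnlock()
	var out []string
	for a, ok := range m.healthy {
		if ok {
			out = append(out, a)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// A fake probe keeps the example runnable without a cluster. Pass
	// HTTPProbe and 5*time.Second to check real workers.
	fake := func(addr string) bool { return addr != "192.168.1.103:8080" }
	workers := []string{"192.168.1.101:8080", "192.168.1.102:8080", "192.168.1.103:8080"}
	m := NewMonitor(workers, fake)
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	m.Run(ctx, 20*time.Millisecond)
	fmt.Println("eligible workers:", m.Healthy())
}
```

Injecting the probe as a function keeps the loop testable without network access.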
Example: Running a Batch of Jobs
Create a script to submit multiple jobs:
#!/bin/bash
for i in {1..10}; do
curl -X POST http://192.168.1.100:8080/api/v1/jobs/submit \
-H "Content-Type: application/json" \
-d "{
\"command\": \"python3 process_data.py --input data_$i.txt\",
\"threads\": 4,
\"stdout_path\": \"/mnt/md0/cluster/jobs/job_$i.stdout\",
\"stderr_path\": \"/mnt/md0/cluster/jobs/job_$i.stderr\"
}"
echo "Submitted job $i"
done
Monitoring
Monitor the cluster status continuously:
watch -n 5 'curl -s http://192.168.1.100:8080/api/v1/cluster/status | jq .'
Troubleshooting
Jobs not being delegated to workers
- Verify worker nodes are accessible from the head node
- Check worker node health: curl http://worker:8080/api/v1/health
- Ensure worker addresses in head config match actual IPs/hostnames
Jobs stuck in queue
- Check thread availability: curl http://node:8080/api/v1/status
- Verify no jobs are consuming all threads
- Check if the max_threads configuration is appropriate
Cannot access shared filesystem paths
- Ensure all nodes have the same mount point for shared storage
- Verify write permissions on stdout/stderr paths
- Test file creation: touch /mnt/md0/cluster/jobs/test.txt