OneComp Web App

June 9, 2026

LLM quantization dashboard built on top of OneCompression. Pick a Hugging Face model and quantization parameters from the browser, run quantization on the GPU, and verify the result with an inference chat.


This repository targets SLURM-managed HPC clusters (GPU nodes) running without Docker. PostgreSQL / MinIO are replaced with SQLite and local file storage, and Redis is built from source in user space.


Architecture (HPC)

Local PC                       Login node                 GPU node (salloc target)
┌─────────────┐   SSH tunnel   ┌──────────┐              ┌──────────────────────────┐
│ Vite (yarn) │ ──────────────►│ login    │ ── srun ──►  │ FastAPI  (:8001)         │
│  :5173      │   LocalForward │          │              │ Celery worker (solo)     │
└─────────────┘   → gpuXX:8001 └──────────┘              │ Redis     (127.0.0.1)    │
                                                         │ vLLM      (:8090)        │
                                                         │ SQLite + tmp/quantized/  │
                                                         └──────────────────────────┘

| Component | Stack | Notes (HPC) |
| --- | --- | --- |
| Frontend | React, TypeScript, Vite, TanStack Query | Runs on the local PC; API accessed through an SSH tunnel |
| Backend API | FastAPI, SQLAlchemy | start_backend.py on the GPU node |
| Task Queue | Celery + Redis | Worker also on the GPU node; Redis built from source |
| Database | SQLite (backend/onecomp.db) | The file can live on Lustre |
| Quantization output | Local directory (backend/tmp/quantized/) | Saved per job ID |
| Quantization | OneCompression (onecomp) | ONECOMP_DEVICE=cuda |
| Inference (CUDA) | vLLM (separate process) | Spawned from the Python in the main .venv |

Quantization and vLLM share the same Python environment (backend/.venv). onecomp and vllm are resolved together by pyproject.toml (PyTorch for CUDA 13.0 via pytorch-cu130).


Prerequisites

| Item | Example |
| --- | --- |
| Job scheduler | SLURM (salloc / srun) |
| Python | 3.12 |
| Package manager | uv |
| GPU | NVIDIA (CUDA 12.x / 13.x driver) |
| Frontend (local) | Node.js + Yarn |

Quick start

See docs/setup-hpc.md for details and troubleshooting.

1. First-time setup (login node or GPU node)

When the node has network access, run uv sync on the GPU node you will use for jobs (recommended). Prebuilt wheels often work from the login node, but native builds, glibc, and the CUDA driver/runtime can differ between login and compute nodes.

cd backend/
uv sync

If you ran uv sync on the login node, confirm on the GPU node (after step 2) that PyTorch can see CUDA before quantizing:

cd backend/ && . .venv/bin/activate
python -c "import torch; print('cuda:', torch.cuda.is_available())"
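
The one-liner above can be extended into a slightly broader check. The following is a minimal sketch, not part of the repository: the helper name check_environment is hypothetical, and the module names (torch, vllm, onecomp) are the import names this README mentions. It reports which packages are importable without importing the heavy ones, and only touches torch when it is installed.

```python
import importlib.util
import sys


def check_environment(modules=("torch", "vllm", "onecomp")):
    """Report the Python version, importable packages, and CUDA visibility."""
    report = {
        "python": "{}.{}".format(*sys.version_info[:2]),
        # find_spec locates a package without importing it
        "modules": {name: importlib.util.find_spec(name) is not None for name in modules},
        "cuda": None,  # stays None when torch is not installed
    }
    if report["modules"].get("torch"):
        try:
            import torch  # imported lazily, only when it is present

            report["cuda"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken install (e.g. mismatched native libraries) counts as "no CUDA"
            report["cuda"] = False
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Run it inside the activated backend/.venv on the GPU node. A cuda value of False after a login-node uv sync suggests re-running uv sync on the GPU node.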

Redis (apt install is often unavailable on HPC, so build it from source in user space):

cd backend/
wget https://github.com/redis/redis/archive/refs/tags/7.2.7.tar.gz
tar xzf 7.2.7.tar.gz
cd redis-7.2.7 && make -j$(nproc) && cd ..

2. Allocate a GPU node

Pick a GPU partition with sinfo (name varies by site; e.g. interactive, gpu):

salloc -p <partition> --time=04:00:00 --gres=gpu:1

All commands below run on the GPU node.

3. Start the services (GPU node)

Terminal A — Redis + Worker:

cd backend/
export LC_ALL=C
mkdir -p tmp/redis
./redis-7.2.7/src/redis-server \
  --daemonize yes \
  --bind 127.0.0.1 \
  --dir "$(pwd)/tmp/redis" \
  --pidfile "$(pwd)/tmp/redis/redis.pid" \
  --logfile "$(pwd)/tmp/redis/redis.log"
./redis-7.2.7/src/redis-cli -h 127.0.0.1 ping   # → PONG

. .venv/bin/activate
export ONECOMP_DEVICE=cuda
# Local model root — required on worker **and** API when using short local names
# instead of a Hugging Face repo id. See Environment variables.
export LOCAL_MODEL_ROOT="$(pwd)/models"
# Required on nodes without nvcc (CUDA toolkit); see Environment variables below
export VLLM_USE_FLASHINFER_SAMPLER=0
nohup uv run python start_worker.py > /tmp/worker.log 2>&1 &
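
The redis-cli ping in the block above can also be done from Python, which is handy inside a startup script. This is a minimal sketch that assumes only the standard library and Redis's default port 6379; the function name redis_ping is hypothetical. It sends the inline form of the PING command over a plain socket and checks for the +PONG reply.

```python
import socket


def redis_ping(host="127.0.0.1", port=6379, timeout=2.0):
    """Return True if a Redis server at host:port answers PING with +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")  # inline command form of the Redis protocol
            return conn.recv(64).startswith(b"+PONG")
    except OSError:
        # Connection refused, timeout, or unreachable host
        return False


if __name__ == "__main__":
    print("redis reachable:", redis_ping())
```

The default host is 127.0.0.1 rather than localhost, matching the bind address used above.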

Terminal B — API (second shell on the same GPU node; from the login node, setup-hpc.md §2.5):

squeue -u $USER                    # find the JOBID (e.g. 41229)
srun --jobid=41229 --pty bash      # replace 41229 with your JOBID
cd backend/
. .venv/bin/activate
export ONECOMP_DEVICE=cuda
export LOCAL_MODEL_ROOT="$(pwd)/models"
uv run python start_backend.py --reload --port 8001

Connectivity check on the same node:

curl http://127.0.0.1:8001/api/health
# {"status":"ok"}
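
Because the API takes a moment to start, a script may want to poll the health endpoint rather than call curl once. The sketch below assumes only what is shown above (the /api/health path and the {"status":"ok"} body); the helper names are hypothetical. It bypasses any configured HTTP proxy, since a site proxy answering for 127.0.0.1 would give a misleading result.

```python
import json
import time
import urllib.error
import urllib.request


def is_healthy(body):
    """True when the response body is JSON with status == "ok"."""
    try:
        data = json.loads(body)
    except ValueError:
        return False
    return isinstance(data, dict) and data.get("status") == "ok"


def wait_for_api(url="http://127.0.0.1:8001/api/health", attempts=30, delay=2.0):
    """Poll the health endpoint until it reports ok or the attempts run out."""
    # An empty ProxyHandler disables proxies picked up from the environment
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    for attempt in range(attempts):
        try:
            with opener.open(url, timeout=5) as resp:
                if is_healthy(resp.read()):
                    return True
        except (urllib.error.URLError, OSError):
            pass  # API not up yet (connection refused, timeout, HTTP error)
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    print("api healthy:", wait_for_api())
```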

4. Use the UI from your local PC

Steps 4a–4d all run on your local PC. Step 3 left the API on the GPU node; the SSH tunnel in 4c forwards localhost:8001 on your PC to that API. See setup-hpc.md §2.6 / §2.7.

4a. Prepare the frontend (one-time per clone; local PC):

# clone the repo on your PC, then:
cd frontend/
npm install    # or: yarn (Node.js required; see setup-hpc.md §2.7)

4b. SSH config (local PC) — point LocalForward at the GPU node name (it changes every salloc):

Host my-hpc-login
    HostName <login-node>
    LocalForward 8001 <gpu-node>:8001
    ServerAliveInterval 60

LocalForward alone does not open a tunnel; 4c is still required.
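
Since the GPU node name changes with every salloc, regenerating the Host stanza can be scripted. The following is a minimal sketch with hypothetical helper names. It assumes you paste in the output of `squeue -u $USER -h -o "%i %N"` (job ID and node list, no header) and that the job runs on a single node, so the node list is a plain hostname.

```python
def running_nodes(squeue_output):
    """Parse lines of "<jobid> <nodelist>"; pending jobs have no node and are skipped."""
    jobs = {}
    for line in squeue_output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            jobs[parts[0]] = parts[1]
    return jobs


def render_ssh_host(alias, login_node, gpu_node, port=8001):
    """Build the Host stanza from step 4b for the current allocation."""
    return "\n".join([
        f"Host {alias}",
        f"    HostName {login_node}",
        f"    LocalForward {port} {gpu_node}:{port}",
        "    ServerAliveInterval 60",
    ]) + "\n"


if __name__ == "__main__":
    sample = "41229 gpu07\n"  # example squeue output; job ID and node name are made up
    node = running_nodes(sample)["41229"]
    print(render_ssh_host("my-hpc-login", "<login-node>", node))
```

If you prefer not to edit the SSH config on every allocation, the same forward can be given on the command line: ssh -N -L 8001:<gpu-node>:8001 <login-node>.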

4c. Open the tunnel (local PC; leave this terminal open):

ssh my-hpc-login -N

With step 3 already running on the GPU node, check from the local PC: curl http://localhost:8001/api/health should return {"status":"ok"}.

4d. Start Vite (local PC; another terminal):

cd frontend/
VITE_API_TARGET=http://localhost:8001 yarn dev

Browser: http://localhost:5173

5. Shut down

On your local PC (steps 4d / 4c):

  • Stop the frontend: Ctrl+C in the terminal running yarn dev / npm run dev
  • Close the SSH tunnel: Ctrl+C in the terminal running ssh my-hpc-login -N

On the cluster — end the SLURM allocation: exit from the salloc shell, or scancel <jobid>. That stops Redis, the worker, the API, and vLLM on the GPU node.

Keep the allocation, restart services only — e.g. after changing env vars or config.py, stop processes on the GPU node and run step 3 again. See setup-hpc.md §2.9:

pkill -f start_worker.py
pkill -f vllm.entrypoints

Typical workflow

  1. New Job — pick a Hugging Face model and a method (gptq, autobit, jointq, or auto_run). QEP is optional (default on; not supported with JointQ). Fractional bit widths are allowed for autobit / auto_run; auto_run sets bits and group size from VRAM automatically.
  2. Quantize — wait until the job completes. Output is saved under tmp/quantized/<job-id>/.
  3. Deploy — start vLLM (ONECOMP_DEVICE=cuda required).
  4. Chat — test inference in the browser. On failure, check error_message on the job detail page.

| State | Chat behavior |
| --- | --- |
| vLLM deploy succeeded (inference_url set) | POST /chat returns immediately |
| vLLM not deployed / failed | Falls back to polling chat-result (slow) |

Deploy / chat logs:

tail -f /tmp/worker.log
tail -80 /tmp/vllm-<job-id>.log    # when vLLM fails

Environment variables

Defaults for the ONECOMP_* settings are defined in backend/app/core/config.py. The other variables below are read directly from the process environment.

LOCAL_MODEL_ROOT (local model directory)

Set this before starting both the Celery worker and the API when jobs use short local directory names (e.g. gemma-2-2b-it) rather than a Hugging Face repo id (org/model). The server maps model_name to {LOCAL_MODEL_ROOT}/<model_name> for job validation and quantization.

| Variable | Default | Description |
| --- | --- | --- |
| LOCAL_MODEL_ROOT | /models | Root directory of pre-downloaded models on shared storage |

Example (run from backend/; models live in backend/models/):

export LOCAL_MODEL_ROOT="$(pwd)/models"

Use the same value in Terminal A (worker) and Terminal B (API). If only one process has it, jobs may pass validation but fail during quantization with “not a local folder” / Hugging Face Hub errors.

After changing LOCAL_MODEL_ROOT, restart both processes (step 3).
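
The mapping described in this section can be illustrated as follows. This is a sketch of the rule as stated here (a name containing "/" is treated as a Hugging Face repo id, anything else as a directory under LOCAL_MODEL_ROOT), not the server's actual code; both function names are hypothetical.

```python
import os
from pathlib import Path


def resolve_model(model_name, local_root=None):
    """Return ("hub", repo_id) for org/model names, else ("local", path)."""
    if "/" in model_name:
        return ("hub", model_name)
    # Fall back to the documented default when the variable is unset
    root = Path(local_root or os.environ.get("LOCAL_MODEL_ROOT", "/models"))
    return ("local", root / model_name)


def require_local_dir(path):
    """Fail early, with a clear message, when the model directory is missing."""
    path = Path(path)
    if not path.is_dir():
        raise FileNotFoundError(f"{path} is not a local folder; check LOCAL_MODEL_ROOT")
    return path


if __name__ == "__main__":
    print(resolve_model("org/model"))
    print(resolve_model("gemma-2-2b-it"))
```

Running the same check in both the worker and the API environment is a quick way to spot a LOCAL_MODEL_ROOT that is set in only one of them.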

ONECOMP_* (application settings)

| Variable | Default (HPC) | Description |
| --- | --- | --- |
| ONECOMP_DATABASE_URL | sqlite:///./onecomp.db | DB (path relative to backend/) |
| ONECOMP_REDIS_URL | redis://127.0.0.1:6379/0 | Redis (127.0.0.1 required; localhost may fail over IPv6) |
| ONECOMP_DEVICE | cpu | Set to cuda (both worker and API) |
| ONECOMP_QUANTIZED_DIR | tmp/quantized | Where quantized models are stored |
| ONECOMP_VLLM_PYTHON | .venv/bin/python | Python used to launch vLLM. Use .venv-vllm/bin/python when separated (§1.2.1) |
| ONECOMP_VLLM_PORT | 8090 | vLLM port |
| ONECOMP_MOCK_QUANTIZATION | false | Set to true to skip quantization (pipeline smoke test) |
| ONECOMP_CHAT_TIMEOUT | 900 | Chat HTTP timeout in seconds |
| VLLM_USE_FLASHINFER_SAMPLER | (vLLM default) | Set to 0 on the Celery worker when nvcc is missing (see below) |
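
To make the override behaviour concrete, here is a standard-library sketch of how ONECOMP_* variables could map onto settings with the defaults listed above. It is an illustration only: the real config.py may be built differently, and every field name except vllm_python is an assumption derived from the variable names.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    # Defaults copied from the table; field names are assumed, not confirmed
    database_url: str = "sqlite:///./onecomp.db"
    redis_url: str = "redis://127.0.0.1:6379/0"
    device: str = "cpu"
    quantized_dir: str = "tmp/quantized"
    vllm_python: str = ".venv/bin/python"
    vllm_port: int = 8090
    mock_quantization: bool = False
    chat_timeout: int = 900


def load_settings(environ=None, prefix="ONECOMP_"):
    """Build Settings, letting PREFIX + FIELD_NAME in the environment override defaults."""
    environ = os.environ if environ is None else environ
    overrides = {}
    for field in fields(Settings):
        raw = environ.get(prefix + field.name.upper())
        if raw is None:
            continue
        if isinstance(field.default, bool):  # check bool first: bool is an int subclass
            overrides[field.name] = raw.strip().lower() in {"1", "true", "yes", "on"}
        elif isinstance(field.default, int):
            overrides[field.name] = int(raw)
        else:
            overrides[field.name] = raw
    return Settings(**overrides)


if __name__ == "__main__":
    print(load_settings())
```

Because values are read when the process starts, a changed variable only takes effect after the worker and the API are restarted.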

vLLM deploy without nvcc (HPC)

Quantization (ONECOMP_DEVICE=cuda) only needs the CUDA driver/runtime. Chat deploy starts vLLM, which may enable the FlashInfer sampler and trigger a JIT build that requires nvcc. On many HPC GPU nodes the driver works but the CUDA toolkit is not installed.

  • Symptom: error_message contains Could not find nvcc
  • Fix: export VLLM_USE_FLASHINFER_SAMPLER=0 before starting the worker
  • Where: worker only. The API does not spawn vLLM, so setting this on the API alone has no effect
  • After change: pkill -f start_worker.py and restart the worker, then Stop → Deploy in the UI

pkill -f start_worker.py
cd backend && . .venv/bin/activate
export ONECOMP_DEVICE=cuda
export LOCAL_MODEL_ROOT="$(pwd)/models"
export VLLM_USE_FLASHINFER_SAMPLER=0
nohup uv run python start_worker.py > /tmp/worker.log 2>&1 &

See setup-hpc.md #10.
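
The decision above (disable the FlashInfer sampler only when nvcc is missing) can be automated in a launcher script. This is a minimal sketch with a hypothetical helper name; it only prepares the environment and leaves an explicit user setting untouched.

```python
import os
import shutil


def worker_env(base_env=None, nvcc_path=None):
    """Copy the environment and set VLLM_USE_FLASHINFER_SAMPLER=0 when nvcc is absent."""
    env = dict(os.environ if base_env is None else base_env)
    if nvcc_path is None:
        # Look for nvcc on the PATH the worker will actually use
        nvcc_path = shutil.which("nvcc", path=env.get("PATH"))
    if not nvcc_path:
        # setdefault keeps a value the user exported deliberately
        env.setdefault("VLLM_USE_FLASHINFER_SAMPLER", "0")
    return env


if __name__ == "__main__":
    env = worker_env()
    print("VLLM_USE_FLASHINFER_SAMPLER =", env.get("VLLM_USE_FLASHINFER_SAMPLER", "(unset)"))
```

The returned mapping can be passed as env= to subprocess.Popen when starting start_worker.py from a script.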

After editing config.py, restart the Celery worker. The API's --reload does not propagate to the worker.

pkill -f start_worker.py
cd backend && . .venv/bin/activate
export ONECOMP_DEVICE=cuda
export LOCAL_MODEL_ROOT="$(pwd)/models"
nohup uv run python start_worker.py > /tmp/worker.log 2>&1 &

Directory layout

dashboard/
├── backend/
│   ├── app/              # FastAPI, Celery tasks, inference
│   ├── start_backend.py
│   ├── start_worker.py
│   ├── pyproject.toml    # onecomp + vllm + torch (cu130)
│   ├── onecomp.db        # SQLite (created at runtime)
│   └── tmp/quantized/    # Quantization output (gitignored)
├── frontend/             # React SPA
└── docs/
    └── setup-hpc.md      # HPC procedure, architecture, troubleshooting

Troubleshooting

See troubleshooting in docs/setup-hpc.md for details.

| Symptom | Fix |
| --- | --- |
| Redis won't start (Failed to configure LOCALE) | export LC_ALL=C before redis-server (#1b) |
| Redis Error 97 / Celery reconnect failure | Use 127.0.0.1 in the Redis URL and --bind |
| vLLM /health returns Squid 403 | no_proxy (set automatically in code) |
| Deploy fails because vllm_python path is missing | Align config with the venv layout and restart the worker (#7) |
| onecomp / vllm dependency conflict | Separated venv + vllm_python in config.py |
| SSH tunnel established but API unreachable | Point LocalForward at the GPU node name |
| Chat is slow / keeps polling | ONECOMP_DEVICE=cuda, then Stop → Deploy |
| Deploy fails: Could not find nvcc | VLLM_USE_FLASHINFER_SAMPLER=0 on the worker, restart worker, Stop → Deploy (#10) |
| Local model name fails / “not a local folder” on HF | Set the same LOCAL_MODEL_ROOT on worker and API, restart both; model dir must be {LOCAL_MODEL_ROOT}/<name>/ (#11) |

Separating the vLLM venv

Background (why a separated venv used to exist / why it usually is not needed now)

The original HPC layout used two Python environments, one for quantization (onecomp) and one for inference (vLLM): backend/.venv and backend/.venv-vllm. Reasons at the time:

  • onecomp required transformers 5.x
  • vLLM 0.19.x still required transformers 4.x, so uv sync could not resolve both in the same venv
  • The worker therefore quantized in the main venv and launched vLLM as a separate process via settings.vllm_python (previously defaulting to .venv-vllm/bin/python)

Since OneCompression v1.1.1, vLLM 0.21.x and later are supported together, so the dependency tension is gone. The pyproject.toml in this repo resolves onecomp and vllm>=0.21.0 in the same venv.

For normal operation the main backend/.venv is enough, and the default in config.py is:

vllm_python: str = ".venv/bin/python"

(vLLM is still launched as a subprocess from Celery. What is unified is the interpreter and packages; the processes remain separate.)
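
As an illustration of "same interpreter, separate process", the sketch below builds a vLLM launch command from a configurable interpreter path. It is not the dashboard's actual code: the function names are hypothetical, and the use of the vllm.entrypoints.openai.api_server module with --model, --host, and --port is an assumption about how the worker invokes vLLM.

```python
import subprocess


def vllm_command(vllm_python, model_dir, port=8090, host="127.0.0.1"):
    """Build the argv for an OpenAI-compatible vLLM server run by `vllm_python`."""
    return [
        str(vllm_python), "-m", "vllm.entrypoints.openai.api_server",
        "--model", str(model_dir),
        "--host", host,
        "--port", str(port),
    ]


def launch_vllm(vllm_python, model_dir, log_path, port=8090):
    """Start vLLM as a child process, appending stdout and stderr to log_path."""
    with open(log_path, "ab") as log:
        # The child keeps its own copy of the log file descriptor
        return subprocess.Popen(
            vllm_command(vllm_python, model_dir, port=port),
            stdout=log,
            stderr=subprocess.STDOUT,
        )


if __name__ == "__main__":
    # Prints the command only; nothing is started here
    print(" ".join(vllm_command(".venv/bin/python", "tmp/quantized/<job-id>")))
```

Switching to a dedicated venv then amounts to changing the first argument, which is what the vllm_python setting controls.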

If you do need a dedicated vLLM venv (dependency conflict, different CUDA build, etc.):

  1. Create .venv-vllm/ with cd backend && bash setup_vllm.sh
  2. Either edit backend/app/core/config.py or set the env var when starting the worker:

# config.py (backend/app/core/config.py)
vllm_python: str = ".venv-vllm/bin/python"

export ONECOMP_VLLM_PYTHON="$(pwd)/.venv-vllm/bin/python"   # if you do not want to edit config.py

  3. Always pkill -f start_worker.py, restart the worker, then Stop → Deploy from the UI

See setup-hpc.md §1.2.1 / #9 for the full procedure.


License

See FujitsuResearch/OneCompression for the OneCompression license.