
Local GPU Backend

The Local backend runs SDPO training and vLLM inference on your own hardware. It requires a GPU with >= 24 GB VRAM.

Requirements

  • NVIDIA GPU with >= 24 GB VRAM (e.g. RTX 3090, RTX 4090, A5000, L40S)
  • NVIDIA Container Toolkit (for Docker)
  • Docker and Docker Compose
  • Python 3.11+ and uv

Installation

1. Clone and install

git clone https://github.com/kfallah/CLaaS.git
cd CLaaS
uv sync --extra local
2. Configure environment

cd docker
cp .env.local.example .env
Edit .env and set TELEGRAM_BOT_TOKEN (required). Optionally set HF_TOKEN for gated models.
3. Start the stack

docker compose --profile local up --build
The first run downloads Qwen3-8B (~16 GB). The vLLM health check takes 10-20 minutes on first start.
4. Verify

curl http://localhost:8000/v1/models -H "Authorization: Bearer sk-local"
curl http://localhost:8080/
curl http://localhost:8080/v1/lora

Services

| Service | Port | Description |
|---|---|---|
| vllm | 8000 | Qwen3-8B with LoRA serving and sleep/wake support |
| claas-api | 8080 | CLaaS feedback API and distill worker |
| openclaw-local | 18789 | OpenClaw gateway with Telegram bot |
| init-local | — | One-shot: creates LoRA adapter + writes OpenClaw config |

Configuration

These variables are set in the .env file.
| Variable | Required | Default | Description |
|---|---|---|---|
| TELEGRAM_BOT_TOKEN | Yes | — | Bot token from @BotFather |
| HF_TOKEN | No | — | HuggingFace token (gated models only) |
| MODEL | No | Qwen/Qwen3-8B | Base model ID |
| GPU_MEMORY_UTILIZATION | No | 0.70 | VRAM fraction for vLLM |
| MAX_MODEL_LEN | No | 32768 | Max sequence length |
For the full Hydra config and all environment variables, see the Configuration Reference.

Verification

# Check vLLM models
curl http://localhost:8000/v1/models -H "Authorization: Bearer sk-local"

# Check CLaaS API
curl http://localhost:8080/

# List LoRA adapters
curl http://localhost:8080/v1/lora

# Test feedback loop
curl -X POST http://localhost:8080/v1/feedback \
  -H "Content-Type: application/json" \
  -d '{
    "lora_id": "openclaw/assistant-latest",
    "prompt": "hi",
    "response": "hello",
    "feedback": "good",
    "training": {"teacher_mode": "self"}
  }'
Send a DM to your Telegram bot. It should respond using the openclaw-assistant-latest LoRA model.
Running Without Docker

If you prefer not to use Docker, you can run each service manually:
# 1. Start vLLM with LoRA support
vllm serve Qwen/Qwen3-8B --host 0.0.0.0 --port 8000 \
  --enable-lora --lora-modules my-lora=/loras/user/my-lora-init

# 2. Start the CLaaS API
uv run uvicorn claas.api:web_app --host 0.0.0.0 --port 8080

# 3. Initialize a LoRA adapter
curl -X POST http://localhost:8080/v1/lora/init \
  -H "Content-Type: application/json" \
  -d '{"lora_id": "user/my-lora"}'

# 4. Send feedback
curl -X POST http://localhost:8080/v1/feedback \
  -H "Content-Type: application/json" \
  -d '{
    "lora_id": "user/my-lora-init",
    "prompt": "Write a function to calculate factorial",
    "response": "def factorial(n): ...",
    "feedback": "Good recursive solution"
  }'
When running uvicorn directly, use claas.api:web_app, not claas.api:app. The app object is a Modal App and is not ASGI-compatible.