
Local GPU Backend

The Local backend runs SDPO training and vLLM inference on your own hardware. It requires a GPU with >= 24 GB VRAM.

Requirements

  • NVIDIA GPU with >= 24 GB VRAM (e.g. RTX 3090, RTX 4090, A5000, L40S)
  • NVIDIA Container Toolkit (for Docker)
  • Docker and Docker Compose
  • Python 3.11+ and uv

Installation

1. Clone and install

git clone https://github.com/kfallah/CLaaS.git
cd CLaaS
uv sync --extra local
2. Configure environment

cd docker
cp .env.local.example .env
Edit .env and set TELEGRAM_BOT_TOKEN (required). Optionally set HF_TOKEN for gated models.
3. Start the stack

docker compose --profile local up --build
The first run downloads Qwen3-8B (~16 GB). The vLLM health check takes 10-20 minutes on first start.
4. Verify

curl http://localhost:8000/v1/models -H "Authorization: Bearer sk-local"
curl http://localhost:8080/
curl http://localhost:8080/v1/lora

Services

| Service | Port | Description |
|---|---|---|
| vllm | 8000 | Qwen3-8B with LoRA serving and sleep/wake support |
| claas-api | 8080 | CLaaS feedback API and distill worker |
| openclaw-local | 18789 | OpenClaw gateway with Telegram bot |
| init-local | — | One-shot: creates LoRA adapter + writes OpenClaw config |

Configuration

These variables are set in the .env file.
| Variable | Required | Default | Description |
|---|---|---|---|
| TELEGRAM_BOT_TOKEN | Yes | — | Bot token from @BotFather |
| HF_TOKEN | No | — | HuggingFace token (gated models only) |
| MODEL | No | Qwen/Qwen3-8B | Base model ID |
| GPU_MEMORY_UTILIZATION | No | 0.70 | VRAM fraction for vLLM |
| MAX_MODEL_LEN | No | 32768 | Max sequence length |
For the full Hydra config and all environment variables, see the Configuration Reference.

Verification

# Check vLLM models
curl http://localhost:8000/v1/models -H "Authorization: Bearer sk-local"

# Check CLaaS API
curl http://localhost:8080/

# List LoRA adapters
curl http://localhost:8080/v1/lora

# Test feedback loop
curl -X POST http://localhost:8080/v1/feedback \
  -H "Content-Type: application/json" \
  -d '{
    "lora_id": "openclaw/assistant-latest",
    "prompt": "hi",
    "response": "hello",
    "feedback": "good",
    "training": {"teacher_mode": "self"}
  }'
Send a DM to your Telegram bot. It should respond using the openclaw-assistant-latest LoRA model.
Running Without Docker

If you prefer not to use Docker, you can run each service manually:
# 1. Start vLLM with LoRA support
vllm serve Qwen/Qwen3-8B --host 0.0.0.0 --port 8000 \
  --enable-lora --lora-modules my-lora=/loras/user/my-lora-init

# 2. Start the CLaaS API
uv run uvicorn claas.api:web_app --host 0.0.0.0 --port 8080

# 3. Initialize a LoRA adapter
curl -X POST http://localhost:8080/v1/lora/init \
  -H "Content-Type: application/json" \
  -d '{"lora_id": "user/my-lora"}'

# 4. Send feedback
curl -X POST http://localhost:8080/v1/feedback \
  -H "Content-Type: application/json" \
  -d '{
    "lora_id": "user/my-lora-init",
    "prompt": "Write a function to calculate factorial",
    "response": "def factorial(n): ...",
    "feedback": "Good recursive solution"
  }'
When running uvicorn directly, use claas.api:web_app, not claas.api:app. The app object is a Modal App and is not ASGI-compatible.