Configuration & Running

The eval harness uses Hydra for configuration. This page covers the key settings and a step-by-step guide for running evals in Tinker mode (no GPU required). For Local or Modal mode, swap the environment variables and dependencies accordingly.

Key Config Fields

| Field | Default | Description |
| --- | --- | --- |
| `preferences` | `[no_emoji, concise, identity]` | Preferences to train and evaluate |
| `num_steps` | `20` | Feedback steps per preference |
| `batch_size` | `4` | Samples per feedback batch |
| `mode` | `tinker` | Execution backend: `local`, `tinker`, or `modal` |
| `base_model` | `Qwen/Qwen3-30B-A3B` | Base model for LoRA init (use the Tinker name in Tinker mode) |
| `plots` | `true` | Generate matplotlib plots |

For the full config field reference, see the Configuration Reference.

Secrets

| Variable | Required for | Purpose |
| --- | --- | --- |
| `CLAAS_TINKER_API_KEY` | Tinker mode | Tinker SDK authentication |
| `GEMINI_API_KEY` | `general` metric | Gemini-based capability evaluation |
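
A preflight check can catch a missing secret before a run starts. The sketch below is a hypothetical helper, not part of `claas`. It encodes only the two rules in the "Required for" column, and the `metrics` argument is an assumed list of enabled metric names.

```python
import os
from typing import Mapping, Sequence


def missing_secrets(
    mode: str,
    metrics: Sequence[str],
    env: Mapping[str, str] = os.environ,
) -> list[str]:
    """Return the names of required secrets that are unset or empty.

    Hypothetical helper: the rules mirror the Secrets table and nothing more.
    """
    required = []
    if mode == "tinker":
        required.append("CLAAS_TINKER_API_KEY")
    if "general" in metrics:
        required.append("GEMINI_API_KEY")
    # An empty string counts as missing.
    return [name for name in required if not env.get(name)]
```

Calling `missing_secrets("tinker", ["general"])` before launching returns an empty list when both variables are set in the current environment.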

CLI Overrides

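Pass Hydra overrides as positional `key=value` arguments after the module name (`claas.eval`). Quote any value that contains brackets so the shell does not interpret them: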

# Run only conciseness for 10 steps
uv run python -m claas.eval 'preferences=[concise]' num_steps=10

# Override base model and mode
uv run python -m claas.eval base_model=Qwen/Qwen3-30B-A3B mode=tinker

# Skip OpenClaw gateway, proxy completions through CLaaS directly
uv run python -m claas.eval openclaw_url=null

# Use a custom config directory
uv run python -m claas.eval --config-dir ./my_configs --config-name my_config
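
When launching the eval from a script, building the argument list in Python avoids shell-quoting mistakes. The following is a minimal sketch: `format_value` and `build_command` are hypothetical helpers, and they cover only the simple value types used in the examples above (lists, numbers, booleans, `null`, plain strings).

```python
import shlex
from typing import Any


def format_value(value: Any) -> str:
    """Render a Python value in Hydra's override syntax (simple cases only)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # bool first, so True becomes "true" and not "True"
        return "true" if value else "false"
    if isinstance(value, (list, tuple)):
        return "[" + ",".join(format_value(v) for v in value) + "]"
    return str(value)


def build_command(overrides: dict[str, Any]) -> list[str]:
    """Build an argv list for the eval with one key=value token per override."""
    args = [f"{key}={format_value(value)}" for key, value in overrides.items()]
    return ["uv", "run", "python", "-m", "claas.eval", *args]


cmd = build_command({"preferences": ["concise"], "num_steps": 10, "openclaw_url": None})
print(shlex.join(cmd))  # shell-quoted form, suitable for pasting into a terminal
```

Passing `cmd` to `subprocess.run` as a list skips the shell entirely, so the bracketed value needs no quoting there.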

Programmatic Usage

import asyncio

from claas.eval.config import build_harness_config
from claas.eval.runner import run_harness
from claas.eval.types import EvalConfig

# Override only the fields of interest.
config = build_harness_config(
    EvalConfig(
        preferences=["concise"],
        num_steps=5,
    )
)

# run_harness is a coroutine function, so drive it with asyncio.run.
asyncio.run(run_harness(config))

Running the Eval

Install dependencies

uv sync --extra tinker --extra dev

Start the Tinker inference proxy

CLAAS_TINKER_API_KEY="tml-..." \
CLAAS_TINKER_BASE_MODEL="Qwen/Qwen3-30B-A3B" \
  uv run uvicorn claas.proxy.tinker_inference_proxy:app \
    --host 0.0.0.0 --port 8000

Start the CLaaS API

CLAAS_DISTILL_EXECUTION_MODE=tinker \
CLAAS_TINKER_API_KEY="tml-..." \
CLAAS_TINKER_BASE_MODEL="Qwen/Qwen3-30B-A3B" \
CLAAS_ALLOWED_INIT_BASE_MODELS="Qwen/Qwen3-30B-A3B" \
  uv run uvicorn claas.api:web_app \
    --host 0.0.0.0 --port 8080

Use `claas.api:web_app`, not `claas.api:app`. The `app` object is a Modal App and is not ASGI-compatible.

Run the eval

CLAAS_DISTILL_EXECUTION_MODE=tinker \
CLAAS_TINKER_API_KEY="tml-..." \
CLAAS_TINKER_BASE_MODEL="Qwen/Qwen3-30B-A3B" \
  uv run python -m claas.eval

This runs with the default Hydra config (`claas/eval/configs/base.yaml`). Override any field with `key=value` arguments.
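
Because the eval presumably talks to the proxy and API started in the previous steps, it can help to confirm both are accepting connections first. The sketch below is a hypothetical preflight, not part of `claas`. The ports come from the uvicorn commands above, and `127.0.0.1` is an assumption that everything runs on one machine.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or unreachable
        return False


def unreachable_services(services: dict[str, tuple[str, int]]) -> list[str]:
    """Return the names of services that are not accepting connections."""
    return [
        name
        for name, (host, port) in services.items()
        if not is_listening(host, port)
    ]


# Ports taken from the uvicorn commands in the previous steps.
SERVICES = {
    "Tinker inference proxy": ("127.0.0.1", 8000),
    "CLaaS API": ("127.0.0.1", 8080),
}
```

`unreachable_services(SERVICES)` returns an empty list when both servers are up. A successful TCP connection shows only that something is listening on the port, not that it is the expected service.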

View results

Results are written to `./data/evals/<run-id>/`. View them in the browser via the eval dashboard:

http://localhost:8080/v1/eval?results_dir=./data/evals

Or inspect the raw output:

cat data/evals/<run-id>/summary.json
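
To load the newest summary without looking up the run ID, a short script can pick the most recently modified `summary.json`. This is a sketch under one assumption: each run directory sits directly under the results directory and contains a `summary.json`, as the paths above suggest. It assumes nothing about the file's schema, and `latest_summary` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Any, Optional


def latest_summary(results_dir: str = "./data/evals") -> Optional[Any]:
    """Load the most recently modified <run-id>/summary.json under results_dir.

    Returns None when no run directory contains a summary.json.
    """
    candidates = sorted(
        Path(results_dir).glob("*/summary.json"),
        key=lambda path: path.stat().st_mtime,
    )
    if not candidates:
        return None
    with candidates[-1].open(encoding="utf-8") as handle:
        return json.load(handle)
```

Pretty-print the result with `print(json.dumps(latest_summary(), indent=2))`.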

Known Gotchas

Tinker model naming

Tinker uses its own model identifiers, which differ from Hugging Face names. For example, the Hugging Face model `Qwen/Qwen3-Coder-30B-A3B-Instruct` is `Qwen/Qwen3-30B-A3B` in Tinker. Sampling works with either name, but LoRA training init rejects the Hugging Face name with a 400 error. Always use the Tinker name in `base_model`.
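
One way to keep the two naming schemes apart is a small lookup applied before `base_model` is set. The sketch below is hypothetical and lists only the single pair documented above. Any further entries would need to be confirmed against Tinker's own model list.

```python
# Hugging Face name -> Tinker name. Only the pair documented above is listed.
HF_TO_TINKER = {
    "Qwen/Qwen3-Coder-30B-A3B-Instruct": "Qwen/Qwen3-30B-A3B",
}


def to_tinker_name(model: str) -> str:
    """Translate a known Hugging Face name; pass any other name through unchanged."""
    return HF_TO_TINKER.get(model, model)
```

Unknown names pass through unchanged, so a name that is already a Tinker identifier is unaffected.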

API entry point

When running the CLaaS API with uvicorn directly (no Docker/Modal), use `claas.api:web_app`, not `claas.api:app`. The `app` object is a Modal App and is not ASGI-compatible.

CLAAS_TINKER_BASE_MODEL must match base_model

The proxy reads `CLAAS_TINKER_BASE_MODEL` to initialize its sampling client, while the eval config's `base_model` is passed to the API for LoRA init. If the two name different models, the proxy samples from one model while training targets another, so the scores no longer describe the model being trained.
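
A guard like the following can surface the mismatch before any training happens. It is a hypothetical sketch that assumes the eval is launched from an environment carrying the same `CLAAS_TINKER_BASE_MODEL` value the proxy was started with. It cannot inspect the proxy process itself.

```python
import os
from typing import Mapping


def check_base_model(
    config_base_model: str,
    env: Mapping[str, str] = os.environ,
) -> None:
    """Raise ValueError if CLAAS_TINKER_BASE_MODEL differs from base_model."""
    proxy_model = env.get("CLAAS_TINKER_BASE_MODEL")
    if proxy_model != config_base_model:
        raise ValueError(
            f"CLAAS_TINKER_BASE_MODEL={proxy_model!r} does not match "
            f"base_model={config_base_model!r}"
        )
```

An unset variable also raises, since `None` never equals a model name.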

Collapse metric is slow

The collapse metric generates multiple stochastic samples per step. To limit overhead, it runs only at the steps listed in `collapse_steps` (default `[0, 5, 10, 15, 19]`). Shorten the list to reduce cost further.
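
The default list looks like every fifth step plus the final step of a 20-step run. A helper along these lines can generate a sparser schedule for other settings. It is a sketch: the 0-indexed step numbering is inferred from the default, both function names are hypothetical, and it assumes `collapse_steps` accepts a Hydra list override like the other fields.

```python
def collapse_schedule(num_steps: int, stride: int) -> list[int]:
    """Every stride-th step starting at 0, plus the final step (num_steps - 1)."""
    if num_steps < 1 or stride < 1:
        raise ValueError("num_steps and stride must be positive")
    steps = list(range(0, num_steps, stride))
    last = num_steps - 1
    if steps[-1] != last:
        steps.append(last)
    return steps


def as_override(steps: list[int]) -> str:
    """Format the schedule as a Hydra key=value override string."""
    return "collapse_steps=[" + ",".join(str(step) for step in steps) + "]"
```

With `num_steps=20`, a stride of 5 reproduces the default list, and a stride of 10 yields `collapse_steps=[0,10,19]`, which cuts the number of collapse evaluations from five to three.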
