Configuration & Running
The eval harness uses Hydra for configuration. This page covers the key settings and a step-by-step guide for running evals in Tinker mode (no GPU required). For Local or Modal mode, swap the environment variables and dependencies accordingly.
Key Config Fields
| Field | Default | Description |
|---|---|---|
| preferences | [no_emoji, concise, identity] | Preferences to train and evaluate |
| num_steps | 20 | Feedback steps per preference |
| batch_size | 4 | Samples per feedback batch |
| mode | tinker | Execution backend: local, tinker, or modal |
| base_model | Qwen/Qwen3-30B-A3B | Base model for LoRA init (use the Tinker name in Tinker mode) |
| plots | true | Generate matplotlib plots |
Secrets
| Variable | Required for | Purpose |
|---|---|---|
| CLAAS_TINKER_API_KEY | Tinker mode | Tinker SDK authentication |
| GEMINI_API_KEY | general metric | Gemini-based capability evaluation |
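A preflight check can catch a missing secret before a run starts. The sketch below is a hypothetical helper, not part of the harness. It assumes the two rows in the table are the only requirements, and the `metrics` argument name is an assumption about how the `general` metric is selected.

```python
from __future__ import annotations

import os
from typing import Mapping, Sequence


def missing_secrets(
    mode: str,
    metrics: Sequence[str],
    env: Mapping[str, str] = os.environ,
) -> list[str]:
    """Return required environment variables that are unset or empty."""
    required = []
    if mode == "tinker":
        # Tinker mode needs SDK authentication.
        required.append("CLAAS_TINKER_API_KEY")
    if "general" in metrics:
        # The general metric calls Gemini for capability evaluation.
        required.append("GEMINI_API_KEY")
    return [name for name in required if not env.get(name)]
```

Calling `missing_secrets("tinker", ["general"])` before launching a run returns an empty list when both variables are set.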
CLI Overrides
Hydra overrides are positional arguments after the eval subcommand:
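As an illustration of the override syntax, the sketch below renders a dictionary of settings as Hydra `key=value` strings. The helper name `to_hydra_overrides` is hypothetical, and the values are examples built from the fields in the table above. The exact command that precedes the overrides is not shown here.

```python
def to_hydra_overrides(settings: dict) -> list:
    """Render config values as Hydra key=value override strings."""

    def render(value):
        # Check bool first: Hydra expects lowercase true/false.
        if isinstance(value, bool):
            return "true" if value else "false"
        # Lists use Hydra's bracket syntax: key=[a,b,c].
        if isinstance(value, (list, tuple)):
            return "[" + ",".join(render(v) for v in value) + "]"
        return str(value)

    return [f"{key}={render(value)}" for key, value in settings.items()]


overrides = to_hydra_overrides(
    {"preferences": ["no_emoji", "concise"], "num_steps": 10, "plots": False}
)
print(" ".join(overrides))
# prints: preferences=[no_emoji,concise] num_steps=10 plots=false
```

In shells that treat square brackets as glob patterns (zsh, for example), quote list overrides such as `'preferences=[no_emoji,concise]'`.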
Programmatic Usage
Running the Eval
Run the eval
claas/eval/configs/base.yaml). Override any field via key=value arguments.
Known Gotchas
Tinker model naming
Tinker uses its own model identifiers that differ from HuggingFace names. For example, the HuggingFace model Qwen/Qwen3-Coder-30B-A3B-Instruct is Qwen/Qwen3-30B-A3B in Tinker. Sampling works with either name, but LoRA training init will reject the HuggingFace name with a 400 error. Always use the Tinker name in base_model.
API entry point
When running the CLaaS API with uvicorn directly (no Docker/Modal), use claas.api:web_app, not claas.api:app. The app object is a Modal App and is not ASGI-compatible.
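A minimal sketch of building the launch command from Python, using the `claas.api:web_app` import string. The helper name is hypothetical, and the host and port values are assumptions (they match uvicorn's own defaults), so adjust them for your setup.

```python
import sys


def uvicorn_command(host: str = "127.0.0.1", port: int = 8000) -> list:
    """Build an argv list that serves the ASGI-compatible web_app object."""
    return [
        sys.executable, "-m", "uvicorn",
        "claas.api:web_app",  # not claas.api:app, which is a Modal App
        "--host", host,
        "--port", str(port),
    ]


# To launch the server (blocks until it exits):
#   import subprocess
#   subprocess.run(uvicorn_command(), check=True)
```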
CLAAS_TINKER_BASE_MODEL must match base_model
The proxy reads CLAAS_TINKER_BASE_MODEL to initialize its sampling client, and the eval config’s base_model is passed to the API for LoRA init. If they reference different models, scoring and training will target different models.
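One way to guard against this mismatch is to compare the two values before starting a run. The function below is a hypothetical check, not part of the harness. It treats an unset variable as a mismatch, which is an assumption, since the proxy may apply its own default.

```python
import os
from typing import Mapping


def check_base_model_matches(
    base_model: str,
    env: Mapping[str, str] = os.environ,
) -> None:
    """Raise ValueError if the proxy's model differs from the eval config's."""
    proxy_model = env.get("CLAAS_TINKER_BASE_MODEL")
    if proxy_model != base_model:
        raise ValueError(
            f"CLAAS_TINKER_BASE_MODEL={proxy_model!r} does not match "
            f"base_model={base_model!r}; scoring and training would "
            "target different models."
        )
```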
Collapse metric is slow
The collapse metric generates multiple stochastic samples per step. It only runs at steps listed in collapse_steps (default [0, 5, 10, 15, 19]) to limit overhead. You can further reduce cost by narrowing the list.
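To see how the list limits overhead, the sketch below computes which steps would trigger the collapse metric for a given `num_steps`. It assumes steps are zero-indexed, which is consistent with the default list ending at 19 for 20 steps. The helper name is hypothetical.

```python
DEFAULT_COLLAPSE_STEPS = [0, 5, 10, 15, 19]


def collapse_runs(num_steps: int, collapse_steps: list) -> list:
    """Return the steps in range(num_steps) at which the metric would run."""
    wanted = set(collapse_steps)
    return [step for step in range(num_steps) if step in wanted]


print(collapse_runs(20, DEFAULT_COLLAPSE_STEPS))  # [0, 5, 10, 15, 19]
print(collapse_runs(20, [0, 19]))                 # [0, 19]
```

With the defaults the metric runs at 5 of 20 steps. Narrowing the list to the first and last step, for example with the override `collapse_steps=[0,19]`, cuts that to 2.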
