Eval Harness

The CLaaS eval harness runs automated feedback loops against a live CLaaS stack and measures whether training shifts the model toward preferred behaviours without collapsing.

What it does

  1. Initializes a fresh LoRA adapter for each preference being tested
  2. Runs N feedback steps, each sending a preference-targeted prompt and feedback to the CLaaS API
  3. Measures metrics at each step: logprob margins, compliance rates, capability retention, and collapse signals
  4. Generates plots (optional) showing learning curves over training steps
  5. Produces a summary with per-preference pass/fail verdicts
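The loop above can be sketched in a few lines of Python. This is an illustrative stand-in, not the harness's real implementation: `FakeClient` and the method names (`init_lora`, `send_feedback`, `measure_metrics`) are hypothetical, and the pass/fail rule shown (final compliance above baseline) is a simplification.

```python
# Hypothetical sketch of the eval loop; the real CLaaS client API may differ.

class FakeClient:
    """Stand-in for the CLaaS API client, for illustration only."""
    def __init__(self):
        self.compliance = 0.2
    def init_lora(self, preference):
        return f"lora-{preference}"
    def send_feedback(self, lora_id, preference):
        self.compliance = min(1.0, self.compliance + 0.1)  # pretend training helps
    def measure_metrics(self, lora_id):
        return {"compliance": round(self.compliance, 2), "logprob_margin": 0.0}

def run_preference_eval(client, preference, num_steps):
    lora_id = client.init_lora(preference)         # 1. fresh adapter per preference
    baseline = client.measure_metrics(lora_id)     # pre-training metric snapshot
    steps = []
    for step in range(num_steps):                  # 2. N feedback steps
        client.send_feedback(lora_id, preference)  # preference-targeted feedback
        steps.append({"step": step, **client.measure_metrics(lora_id)})  # 3. metrics
    passed = steps[-1]["compliance"] > baseline["compliance"]  # 5. verdict
    return {"preference": preference, "passed": passed, "steps": steps}

result = run_preference_eval(FakeClient(), "no_emoji", num_steps=5)
print(result["passed"], len(result["steps"]))
```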

Preferences

The eval harness ships with three built-in preferences:
| Preference | What it trains | Example feedback |
| --- | --- | --- |
| `no_emoji` | Suppress emoji in responses | "Don't use any emoji in your responses" |
| `concise` | Keep responses short | "Be more concise, use 3 sentences or fewer" |
| `identity` | Adopt a persona | "Your name is Kuro" |
Preferences are configured via the preferences list in the Hydra config. You can run any subset:
uv run python -m claas.eval 'preferences=[concise]' num_steps=10
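Conceptually, each preference bundles a name, the feedback text sent to the API, and some notion of a compliance check. The sketch below is a hypothetical illustration of that shape; the `Preference` dataclass and the compliance lambdas are assumptions, not the harness's actual internals.

```python
# Hypothetical representation of a preference; the real harness may differ.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Preference:
    name: str
    feedback: str                    # feedback text sent with each step
    complies: Callable[[str], bool]  # per-response compliance check (illustrative)

no_emoji = Preference(
    name="no_emoji",
    feedback="Don't use any emoji in your responses",
    # crude check: flag codepoints in the main emoji blocks
    complies=lambda text: not any(0x1F300 <= ord(c) <= 0x1FAFF for c in text),
)

concise = Preference(
    name="concise",
    feedback="Be more concise, use 3 sentences or fewer",
    complies=lambda text: text.count(".") <= 3,  # rough sentence count
)

print(no_emoji.complies("Sure, here you go."))  # no emoji -> True
print(no_emoji.complies("Sure 🎉"))             # emoji present -> False
```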

Output structure

Each eval run produces a directory under data/evals/<run-id>/:
data/evals/<run-id>/
├── summary.json              # Per-preference pass/fail verdicts
└── <preference>/
    ├── metadata.json          # Run config + LoRA ID
    ├── baseline.json          # Pre-training metric snapshot
    └── steps.jsonl            # One JSON object per feedback step
Each line in steps.jsonl contains: step number, timestamp, feedback given, SDPO training metrics, eval metrics (logprob margin, compliance, general capability, collapse), and rollout transcripts.
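Because steps.jsonl is one JSON object per line, learning curves can be extracted with a few lines of Python. The nested field names below (`"eval"`, `"compliance"`, `"logprob_margin"`) follow the description above but are assumptions about the exact schema:

```python
# Sketch: extract a compliance learning curve from steps.jsonl.
# An in-memory file stands in for data/evals/<run-id>/<preference>/steps.jsonl;
# the exact field layout is an assumption.
import io
import json

steps_jsonl = io.StringIO(
    '{"step": 0, "eval": {"compliance": 0.3, "logprob_margin": -0.5}}\n'
    '{"step": 1, "eval": {"compliance": 0.6, "logprob_margin": 0.1}}\n'
)

curve = [json.loads(line)["eval"]["compliance"] for line in steps_jsonl]
print(curve)
```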

Viewing results

Results can be viewed in the browser via the built-in eval dashboard served by the CLaaS API:
GET http://localhost:8080/v1/eval?results_dir=./data/evals
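When pointing the dashboard at a different results directory, the query string can be built with the standard library rather than by hand. The endpoint path comes from the example above; the helper function itself is illustrative:

```python
# Sketch: build the eval dashboard URL; the /v1/eval path is from the docs,
# the helper function is a hypothetical convenience.
from urllib.parse import urlencode

def eval_dashboard_url(host, results_dir):
    query = urlencode({"results_dir": results_dir}, safe="/")
    return f"http://{host}/v1/eval?{query}"

print(eval_dashboard_url("localhost:8080", "./data/evals"))
```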