Metrics
The eval harness supports four metric types. Select which ones to run via the `metrics` list in the config or a CLI override.
Metric descriptions
| Metric | What it measures |
|---|---|
| `logprob` | Logprob margin between preferred and dispreferred response pairs. A positive margin means the model favours the preferred response. Delta from baseline tracks training progress. |
| `compliance` | Generates responses to probe prompts, runs a programmatic verifier (e.g. emoji count, sentence count, keyword presence), and averages the pass rate. |
| `general` | Coding task (fibonacci with exec + verify) plus 3 IFEval-style instruction-following probes. Measures capability retention during training. |
| `collapse` | Three collapse detectors: token entropy (distribution confidence), self-ROUGE-L (output diversity across stochastic samples), and logprob drift (mean logprob shift from baseline). |
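
As a rough illustration of the `logprob` metric and the token-entropy detector described in the table, here is a minimal Python sketch. The function names, the use of sequence-level logprobs as inputs, and the sample numbers are all assumptions made for illustration; the harness's actual implementation may differ.

```python
import math

def logprob_margin(preferred_logprob: float, dispreferred_logprob: float) -> float:
    """Margin for one pair; positive means the model favours the preferred response."""
    return preferred_logprob - dispreferred_logprob

def mean_margin(pairs):
    """Average margin over (preferred_logprob, dispreferred_logprob) pairs."""
    margins = [logprob_margin(p, d) for p, d in pairs]
    return sum(margins) / len(margins)

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution.

    Low entropy means the distribution is concentrated on few tokens.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical sequence logprobs for two pairs, before and after training.
baseline = mean_margin([(-12.0, -11.0), (-8.0, -8.5)])
current = mean_margin([(-10.0, -11.5), (-7.0, -9.0)])

# Delta from baseline: how far the margin has moved during training.
delta = current - baseline
```

In this sketch a positive `delta` indicates that the margin has grown relative to the baseline checkpoint.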
Compliance verifiers
The `compliance` metric uses programmatic verifiers to check whether generated responses match the trained preference:
| Verifier | Preference | Pass condition |
|---|---|---|
| `no_emoji` | `no_emoji` | Zero emoji characters in response |
| `concise` | `concise` | 3 or fewer sentences (linear decay to 0.0 at 9+ sentences) |
| `identity` | `identity` | "kuro" appears in response (case-insensitive) |
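
The pass conditions above could be sketched in Python roughly as follows. This is an illustrative approximation, not the harness's code: the function names, the emoji Unicode ranges, the punctuation-based sentence splitter, and the exact interpolation (1.0 at 3 sentences falling linearly to 0.0 at 9) are assumptions.

```python
import re

# Assumption: a few common emoji blocks; a real verifier may cover more ranges.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)

def verify_no_emoji(response: str) -> float:
    """1.0 if the response contains no emoji characters, else 0.0."""
    return 0.0 if EMOJI_RE.search(response) else 1.0

def count_sentences(response: str) -> int:
    """Naive sentence count: split on runs of '.', '!' or '?'."""
    return len([s for s in re.split(r"[.!?]+", response) if s.strip()])

def verify_concise(response: str) -> float:
    """1.0 for 3 or fewer sentences, decaying linearly to 0.0 at 9 or more."""
    n = count_sentences(response)
    if n <= 3:
        return 1.0
    if n >= 9:
        return 0.0
    return (9 - n) / 6

def verify_identity(response: str) -> float:
    """1.0 if 'kuro' appears anywhere in the response, ignoring case."""
    return 1.0 if "kuro" in response.lower() else 0.0

def pass_rate(responses, verifier) -> float:
    """Average verifier score across generated responses."""
    return sum(verifier(r) for r in responses) / len(responses)
```

For example, `pass_rate(["I am Kuro.", "Hello there."], verify_identity)` averages one passing and one failing response.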

