logprob | Logprob margin between preferred and dispreferred response pairs. A positive margin means the model favours the preferred response. Delta from baseline tracks training progress. |
compliance | Generates responses to probe prompts, runs a programmatic verifier (e.g. emoji count, sentence count, keyword presence), and averages the pass rate. |
general | Coding task (fibonacci with exec + verify) plus 3 IFEval-style instruction-following probes. Measures capability retention during training. |
collapse | Three collapse detectors: token entropy (distribution confidence), self-ROUGE-L (output diversity across stochastic samples), and logprob drift (mean logprob shift from baseline). |