Plain-English definitions of every abbreviation that shows up in the dashboards, given in the four-field shape verl's train-inference correction module uses: what the metric is, the unit it aggregates over, the cap that triggers OPBC action, and what happens when the cap fires.
ESS: Effective sample size of importance weights — how usable a rollout is for off-policy training.
ESS = (Σw)² / (N · Σw²), where w is the per-token importance weight and N is the number of tokens. ESS = 1 means the rollout matches the trainer's policy exactly; ESS dropping toward 0 means the rollout is increasingly off-policy and the trainer needs stronger correction (or must skip the rollout).
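The formula can be sketched directly. A minimal illustration over Python lists — the production metric would be computed over the trainer's tensors:

```python
import math

def ess(log_ratios):
    """Normalised effective sample size: (Σw)² / (N · Σw²),
    where w = exp(log_ratio) is the per-token importance weight.
    Returns a value in (0, 1]; 1.0 means all weights are equal,
    i.e. the rollout matches the trainer exactly."""
    w = [math.exp(lr) for lr in log_ratios]
    n = len(w)
    return sum(w) ** 2 / (n * sum(x * x for x in w))
```

An on-policy rollout (all log_ratios 0) gives ESS = 1; a single heavily off-policy token drags ESS toward 1/N.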
|Δlogp|: Per-token disagreement size between rollout and trainer logprobs, in nats.
Mean over the response of |trainer_logp(token) − rollout_logp(token)|. Tiny values (~0.01) mean the two engines agree on most tokens; values > 0.1 mean meaningful single-token drift.
`log_ratio`: log(trainer_prob / rollout_prob) per token, in nats — the exponent of the importance weight.
Each token's importance weight is exp(log_ratio); max|log_ratio| is the worst single-token disagreement. A log_ratio of 1 nat means the trainer assigns the token e ≈ 2.7× the probability the rollout engine did.
`sequence_log_ratio`: Sum of per-token log_ratios across the response, in nats — the log of the full-sequence importance weight.
How far the rollout drifted from the trainer-side view across the whole sequence. ±0.5 nats over 128 tokens ≈ negligible; 5+ nats means the engines systematically disagree.
`clipped_fraction`: Fraction of tokens whose |log_ratio| exceeded the clamp threshold.
Clamping importance weights stops one bad token from blowing up the gradient. >0.1 here means ≥10% of tokens are clamped — typically a trigger to mark the group `train_with_correction`.
`veto_fraction`: Fraction of tokens whose |log_ratio| exceeded the hard-veto threshold.
>0 here means at least one token is so off-policy that the OPBC quarantines the whole group — even after clamping it would corrupt the gradient.
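Both fractions fall out of one pass over the log_ratios. The thresholds below (1-nat clamp, 4-nat veto) are illustrative placeholders, not the project's actual caps, which live in the OPBC config:

```python
import math

CLAMP = 1.0  # illustrative clamp threshold, in nats of |log_ratio|
VETO = 4.0   # illustrative hard-veto threshold, in nats

def clip_and_veto_stats(log_ratios):
    """Return (clipped_fraction, veto_fraction, clamped_weights).
    Weights beyond the clamp are pinned to exp(±CLAMP); any token
    past VETO would quarantine the whole group."""
    n = len(log_ratios)
    clipped = sum(abs(lr) > CLAMP for lr in log_ratios) / n
    vetoed = sum(abs(lr) > VETO for lr in log_ratios) / n
    weights = [math.exp(max(-CLAMP, min(CLAMP, lr))) for lr in log_ratios]
    return clipped, vetoed, weights
```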
`top_1pct_gradient_mass`: Fraction of total importance weight carried by the worst 1% of tokens.
Tells you whether the drift is uniform (~0.01) or concentrated in a few outlier tokens (>0.05). Concentrated drift is more dangerous.
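A sketch of the computation, assuming per-token log_ratios as input; the `max(1, …)` floor is an assumption that keeps the metric defined for responses shorter than 100 tokens:

```python
import math

def top_1pct_gradient_mass(log_ratios):
    """Fraction of total importance weight carried by the heaviest
    1% of tokens (at least one token for short responses)."""
    w = sorted((math.exp(lr) for lr in log_ratios), reverse=True)
    k = max(1, len(w) // 100)
    return sum(w[:k]) / sum(w)
```

Uniform drift over 200 tokens gives exactly 0.01; one strongly off-policy token pushes the metric far above that.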
`second_moment`: E[w²] of the importance weights w = exp(log_ratio).
Used to bound the variance of the importance-corrected estimator. Closer to 1 = lower-variance correction.
`router_flip_rate`: Fraction of (token, layer) pairs where the MoE top-1 routed expert differs between rollout and trainer.
Top-1 stability under quantization or precision changes. Low values (<5%) suggest the dominant expert is robust. Same as the top-1 flip rate shown on the router dashboard.
`token_expert_disagreement_rate`: Fraction of tokens with at least one MoE layer where the top-k *set* of routed experts differs.
More sensitive than top-1 flip — quantization noise often shuffles the lower-ranked experts even when the dominant one is stable. Visible in the gap between this and the top-1 flip rate.
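Both router metrics can be derived from the same logged expert IDs. A sketch, assuming a nested per-token, per-layer structure of top-k expert-id tuples with index 0 as the top-1 expert (the real logs may be shaped differently):

```python
def router_drift(rollout_experts, trainer_experts):
    """rollout_experts / trainer_experts: per token, a list of
    per-layer top-k expert-id tuples (index 0 = top-1 expert).
    Returns (router_flip_rate over (token, layer) pairs,
             token_expert_disagreement_rate over tokens)."""
    flips = pairs = disagreeing_tokens = 0
    for r_tok, t_tok in zip(rollout_experts, trainer_experts):
        any_set_diff = False
        for r_layer, t_layer in zip(r_tok, t_tok):
            pairs += 1
            flips += r_layer[0] != t_layer[0]             # top-1 flip
            any_set_diff |= set(r_layer) != set(t_layer)  # top-k set drift
        disagreeing_tokens += any_set_diff
    return flips / pairs, disagreeing_tokens / len(rollout_experts)
```

In the example below, only one of four (token, layer) pairs flips its top-1 expert, yet both tokens show top-k set drift — exactly the gap between the two dashboard metrics.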
`tool_call_jaccard`: Jaccard similarity of (tool_name, arguments_hash) sets between rollout and reference trajectories.
1.0 = identical tool invocations across the run. <1.0 = the rollout engine picked at least one tool the reference engine didn't (or vice versa). Lower = more behavioural drift.
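A sketch of the set construction, assuming each call is a (tool_name, args_dict) pair; hashing a key-sorted JSON dump of the arguments (an assumption about how `arguments_hash` is built) makes the hash insensitive to dict ordering:

```python
import hashlib
import json

def tool_call_jaccard(rollout_calls, reference_calls):
    """Jaccard similarity of (tool_name, arguments_hash) sets.
    Each call is a (tool_name, args_dict) pair."""
    def key(call):
        name, args = call
        digest = hashlib.sha256(
            json.dumps(args, sort_keys=True).encode()).hexdigest()
        return (name, digest)
    a = {key(c) for c in rollout_calls}
    b = {key(c) for c in reference_calls}
    if not a and not b:
        return 1.0  # neither side called a tool: identical behaviour
    return len(a & b) / len(a | b)
```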
`first_divergence_step`: First assistant step where rollout and reference trajectories took a different action.
If non-null, the rollout's behaviour diverged from the reference starting at this step. The lower the number, the earlier the trajectory rolled off-policy.
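A minimal sketch, assuming each trajectory is a list of comparable per-step action records (e.g. (tool_name, args) tuples or normalised text):

```python
def first_divergence_step(rollout_actions, reference_actions):
    """Index of the first assistant step where the two trajectories
    took different actions, or None if no aligned step differs."""
    for step, (a, b) in enumerate(zip(rollout_actions, reference_actions)):
        if a != b:
            return step
    return None
```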
`answer_match`: Whether the rollout engine and reference engine reached the same final assistant answer (normalised whitespace + case).
The headline trajectory metric. Even small per-token drift can compound into a different final answer when the agent is making multi-step tool-use decisions.
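The normalisation mentioned above amounts to collapsing whitespace and folding case before comparing; a minimal sketch:

```python
import re

def answer_match(rollout_answer, reference_answer):
    """True when the two final answers agree after whitespace
    collapsing and case folding."""
    def norm(text):
        return re.sub(r"\s+", " ", text).strip().lower()
    return norm(rollout_answer) == norm(reference_answer)
```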
bf16: 16-bit brain-float — the standard training/inference precision for modern LLMs. Same exponent range as fp32, narrower mantissa.
fp8: 8-bit floating point. Two formats — E4M3 and E5M2. Halves memory vs bf16 and runs ~2× faster on Hopper-class GPUs.
top-1 flip rate: Alias of `router_flip_rate` — see that entry for cap/drop.
top-k set disagreement: Alias of `token_expert_disagreement_rate` — see that entry for cap/drop.
`answer_match_rate`: Rate of `answer_match` across all comparisons in an engine pair — the fraction of tasks where rollout and reference produced the same final answer.
`tool_choice_disagreement_rate`: Fraction of assistant steps where rollout and reference invoked different tools.
Per-step counterpart to the Jaccard metric. >0 means the trajectories diverged at the step level.
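A sketch over aligned steps, assuming a per-step list of invoked tool names (with None standing in for steps that made no tool call — an illustrative encoding, not the project's):

```python
def tool_choice_disagreement_rate(rollout_tools, reference_tools):
    """Fraction of aligned assistant steps where the two engines
    invoked different tools (None = no tool call at that step)."""
    pairs = list(zip(rollout_tools, reference_tools))
    return sum(a != b for a, b in pairs) / len(pairs)
```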
`token_ids_available`: Whether the API exposes the integer token IDs of the response (not just text).
`sampled_logprobs_available`: Whether the API returns the logprob of each sampled token.
`top_logprobs_available`: Whether the API returns the top-k alternatives at each sampled position.
`seed_supported`: Whether the API honours a generation seed and signals it back via `system_fingerprint`.
OPBC: Off-Policy Budget Controller — the project's policy that decides whether a group is `train`, `train_with_correction`, `replay`, `quarantine`, or `reject`.
TP=4: Tensor-parallel size 4 — the model is sharded across 4 GPUs at the tensor level.