What the metrics mean

Plain-English definitions of every abbreviation that shows up in the dashboards, presented in the four-field shape verl uses for its train-inference correction module: what the metric is, the unit it aggregates over, the cap that triggers OPBC action, and what happens when the cap fires.

ESS

Effective sample size of importance weights — how usable a rollout is for off-policy training.

per
per-sequence (one ESS value per group / response).
cap
soft floors at `BudgetPolicy.replay_ess_threshold = 0.60` and `BudgetPolicy.min_ess = 0.30`; no per-token clamp.
on cap
below 0.60 routes the group to `replay`; below 0.30 routes it to `quarantine`. The tokens themselves are not modified.

ESS = (Σw)² / (N · Σw²) where w is the importance weight per token. ESS=1 means the rollout matches the trainer's policy exactly. ESS dropping toward 0 means the rollout is increasingly off-policy and a trainer would need stronger correction (or skip the rollout).
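
As a minimal sketch of the formula above (the function name is hypothetical, not from the codebase), computing the per-sequence ESS fraction from per-token log importance ratios:

```python
import math

def ess(log_ratios):
    """Normalised effective sample size in (0, 1] from per-token log
    importance ratios: (Σw)² / (N · Σw²), where w = exp(log_ratio)."""
    w = [math.exp(lr) for lr in log_ratios]
    n = len(w)
    return sum(w) ** 2 / (n * sum(x * x for x in w))
```

All-zero log ratios (perfectly on-policy) give ESS = 1; a single heavily off-policy token pulls it toward 1/N.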

|Δlogp|

Per-token disagreement size between rollout and trainer logprobs, in nats.

per
per-token; dashboards display the mean of |Δlogp| over the response.
cap
`BudgetPolicy.clamp = 20.0 nats` (defined in `rollout_market.opbc.BudgetPolicy`); shared with `mismatch_metrics.summarize_logprob_mismatch(clamp=20.0)`.
on cap
tokens whose |Δlogp| exceeds the clamp have their Δlogp clipped in place to ±20 nats; the rest of the group survives.

Mean over the response of |trainer_logp(token) − rollout_logp(token)|. Tiny values (~0.01) mean the two engines agree on most tokens; values > 0.1 mean meaningful single-token drift.
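
A sketch of the displayed statistic (function name hypothetical; applying the clamp per token before averaging is an assumption, mirroring the shared `clamp=20.0` mentioned above):

```python
def mean_abs_dlogp(trainer_logps, rollout_logps, clamp=20.0):
    """Mean per-token |Δlogp| over a response, with each token's
    disagreement clipped to the clamp before averaging."""
    diffs = [min(abs(t - r), clamp)
             for t, r in zip(trainer_logps, rollout_logps)]
    return sum(diffs) / len(diffs)
```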

log_ratio

log( trainer_prob / rollout_prob ) per token, in nats — the exponent of the importance weight.

per
per-token.
cap
`BudgetPolicy.clamp = 20.0 nats` (soft) and `BudgetPolicy.veto_abs_log_ratio = 30.0 nats` (hard).
on cap
tokens with |log_ratio| > 20 are clipped in place; tokens with |log_ratio| > 30 fire a veto and quarantine the whole group.

Each token's importance weight is exp(log_ratio). max|log_ratio| is the worst single-token disagreement. A log_ratio of 1 nat ≈ the trainer is e≈2.7× more confident than the rollout was.
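
A sketch of the per-token soft-clamp / hard-veto split described above (function name hypothetical; thresholds taken from the `BudgetPolicy` values quoted in this entry):

```python
def token_log_ratio(trainer_logp, rollout_logp, clamp=20.0, veto=30.0):
    """Per-token log importance ratio with the soft clamp and hard veto.
    Returns (clipped_log_ratio, vetoed_flag)."""
    lr = trainer_logp - rollout_logp
    vetoed = abs(lr) > veto          # fires the whole-group quarantine
    clipped = max(-clamp, min(clamp, lr))
    return clipped, vetoed
```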

sequence_log_ratio

Sum of per-token log_ratios across the response, in nats — the log of the full-sequence importance weight.

per
per-sequence (one number per response).
cap
no per-sequence cap; budget action is decided from ESS, `max_clipped_fraction`, and `veto_abs_log_ratio` on the constituent tokens.
on cap
no direct drop — propagates into ESS, which drives the budget decision.

How far the rollout drifted from the trainer-side view across the whole sequence. ±0.5 nats over 128 tokens ≈ negligible; 5+ nats means the engines systematically disagree.

clipped_fraction

Fraction of tokens whose |log_ratio| exceeded the clamp threshold.

per
per-group (fraction over all valid policy tokens).
cap
`BudgetPolicy.max_clipped_fraction = 0.10` (a.k.a. STEER's `high_clipped_fraction = 0.1`); the per-token clipping threshold is `BudgetPolicy.clamp = 20.0 nats`.
on cap
above 0.10 routes the group to `train_with_correction`; individual offending tokens stay clipped in place.

Clamping importance weights stops one bad token from blowing up the gradient. >0.1 here means ≥10% of tokens are clamped — typically a trigger to mark the group `train_with_correction`.
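
A sketch of the group-level fraction (function name hypothetical):

```python
def clipped_fraction(log_ratios, clamp=20.0):
    """Fraction of valid policy tokens whose |log_ratio| exceeds the
    clamp and is therefore clipped in place."""
    n_clipped = sum(1 for lr in log_ratios if abs(lr) > clamp)
    return n_clipped / len(log_ratios)
```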

veto_fraction

Fraction of tokens whose |log_ratio| exceeded the hard-veto threshold.

per
per-group (fraction over all valid policy tokens).
cap
`BudgetPolicy.veto_abs_log_ratio = 30.0 nats`; the veto fraction itself fires on any non-zero value (veto_fraction > 0.0).
on cap
any non-zero veto fraction quarantines the entire group; OPBC does not attempt correction.

>0 here means at least one token is so off-policy that the OPBC quarantines the whole group — even after clamping it would corrupt the gradient.

top_1pct_gradient_mass

Fraction of total importance weight carried by the worst 1% of tokens.

per
per-group (one number per response/group).
cap
no formal cap; rendered alongside `BudgetPolicy.clamp = 20.0 nats` and `max_clipped_fraction = 0.10` for context.
on cap
no direct drop — used as a concentration diagnostic feeding operator triage.

Tells you whether the drift is uniform (~0.01) or concentrated in a few outlier tokens (>0.05). Concentrated drift is more dangerous.
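
A sketch of the concentration diagnostic (function name hypothetical; interpreting "worst 1%" as the heaviest-weight 1% of tokens, with at least one token, is an assumption):

```python
import math

def top_1pct_gradient_mass(log_ratios):
    """Fraction of total importance weight carried by the heaviest 1%
    of tokens (minimum one token)."""
    w = sorted((math.exp(lr) for lr in log_ratios), reverse=True)
    k = max(1, len(w) // 100)
    return sum(w[:k]) / sum(w)
```

Uniform weights over 100 tokens give exactly 0.01; a single dominant outlier pushes the value toward 1.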

second_moment

E[w²] of importance weights w = exp(log_ratio).

per
per-group.
cap
no formal cap; serves as an upper bound on the variance of the importance weights (Var(w) = E[w²] − (E[w])² ≤ E[w²]) and is consumed alongside ESS (which is derived from it).
on cap
no direct drop — feeds the ESS computation, which drives the group routing decision.

Used to bound the variance of the importance-corrected estimator. Closer to 1 = lower-variance correction.
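
A sketch (function name hypothetical). Note the link to ESS: since the ESS fraction is (E[w])² / E[w²], when the weights average to 1 it reduces to 1 / E[w²]:

```python
import math

def second_moment(log_ratios):
    """E[w²] of importance weights w = exp(log_ratio)."""
    w = [math.exp(lr) for lr in log_ratios]
    return sum(x * x for x in w) / len(w)
```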

router_flip_rate

Fraction of (token, layer) pairs where MoE top-1 routed expert differs between rollout and trainer.

per
per-(token, layer); dashboards display the mean.
cap
no cap — observability only. OPBC does not directly act on router_flip_rate; the metric informs whether to pin the rollout to the trainer's precision class via `PolicyManifest`.
on cap
no drop — observability only. Decisions to reject a worker on precision mismatch happen at validator time, not at metric time.

Top-1 stability under quantization or precision changes. Low values (<5%) suggest the dominant expert is robust. Same as the top-1 flip rate shown on the router dashboard.
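
A sketch over nested per-token, per-layer top-1 expert ids (function name and argument shapes are assumptions, not the codebase's):

```python
def router_flip_rate(rollout_top1, trainer_top1):
    """Fraction of (token, layer) pairs whose top-1 routed expert
    differs. Each argument: list of per-token lists of top-1 expert
    ids, one per MoE layer."""
    flips = total = 0
    for r_tok, t_tok in zip(rollout_top1, trainer_top1):
        for r_e, t_e in zip(r_tok, t_tok):
            total += 1
            flips += r_e != t_e
    return flips / total
```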

token_expert_disagreement_rate

Fraction of tokens with at least one MoE layer where the top-k *set* of routed experts differs.

per
per-token (a token counts if any layer disagrees).
cap
no cap — observability only.
on cap
no drop — observability only.

More sensitive than top-1 flip — quantization noise often shuffles the lower-ranked experts even when the dominant one is stable. Visible in the gap between this and the top-1 flip rate.
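
A sketch of the set-level comparison (function name and argument shapes are assumptions): a token counts as disagreeing if any layer's top-k set differs, regardless of ordering within the set.

```python
def token_expert_disagreement_rate(rollout_topk, trainer_topk):
    """Fraction of tokens with at least one MoE layer where the top-k
    *set* of routed experts differs. Args: per-token lists of
    per-layer expert-id lists."""
    def token_disagrees(r_tok, t_tok):
        return any(set(r) != set(t) for r, t in zip(r_tok, t_tok))
    n = len(rollout_topk)
    return sum(token_disagrees(r, t)
               for r, t in zip(rollout_topk, trainer_topk)) / n
```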

tool_call_jaccard

Jaccard similarity of (tool_name, arguments_hash) sets between rollout and reference trajectories.

per
per-trajectory (one Jaccard value per comparison).
cap
no cap — observability only.
on cap
no drop — observability only.

1.0 = identical tool invocations across the run. <1.0 = the rollout engine made at least one (tool, arguments) invocation the reference engine didn't (or vice versa). Lower = more behavioural drift.
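
A sketch of the set construction (function name and the exact hashing scheme are assumptions; the source only specifies "(tool_name, arguments_hash) sets"):

```python
import hashlib
import json

def tool_call_jaccard(rollout_calls, reference_calls):
    """Jaccard similarity of (tool_name, arguments_hash) sets.
    Each call is a (tool_name, arguments_dict) pair."""
    def key(call):
        name, args = call
        blob = json.dumps(args, sort_keys=True).encode()
        return (name, hashlib.sha256(blob).hexdigest())
    a = {key(c) for c in rollout_calls}
    b = {key(c) for c in reference_calls}
    if not a and not b:
        return 1.0  # two empty trajectories trivially agree
    return len(a & b) / len(a | b)
```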

first_divergence_step

First assistant step where rollout and reference trajectories took a different action.

per
per-trajectory (one integer-or-null per comparison).
cap
no cap — observability only.
on cap
no drop — observability only.

If non-null, the rollout's behaviour diverged from the reference starting at this step. The lower the number, the earlier the trajectory rolled off-policy.
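
A sketch of the integer-or-null comparison (function name hypothetical; treating a length mismatch as divergence at the end of the shared prefix is an assumption):

```python
def first_divergence_step(rollout_actions, reference_actions):
    """First assistant step index where the two trajectories took a
    different action; None if they never diverged."""
    for i, (r, t) in enumerate(zip(rollout_actions, reference_actions)):
        if r != t:
            return i
    if len(rollout_actions) != len(reference_actions):
        # one trajectory stopped (or kept going) where the other didn't
        return min(len(rollout_actions), len(reference_actions))
    return None
```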

answer_match

Whether the rollout engine and reference engine reached the same final assistant answer (normalised whitespace + case).

per
per-trajectory (boolean); rolled up per engine pair as a rate.
cap
no cap — observability only.
on cap
no drop — observability only.

The headline trajectory metric. Even small per-token drift can compound into a different final answer when the agent is making multi-step tool-use decisions.
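
A sketch of the normalised comparison (function name hypothetical; collapsing runs of whitespace plus lowercasing is one reasonable reading of "normalised whitespace + case"):

```python
def answer_match(rollout_answer, reference_answer):
    """Whether two final assistant answers match after whitespace and
    case normalisation."""
    def norm(s):
        return " ".join(s.split()).lower()
    return norm(rollout_answer) == norm(reference_answer)
```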

bf16

16-bit brain-float — the standard training/inference precision for modern LLMs. Same exponent range as fp32, narrower mantissa.

per
per-checkpoint (precision_class is a property of how the engine serves the model, not a per-token measurement).
cap
no cap — precision class label. Mismatch between worker precision and `PolicyManifest.precision_class` is rejected at validator time, not at metric time.
on cap
no drop on its own; downstream `validators.precision_mismatch` rejects whole groups when the served precision diverges from the manifest pin.

fp8

8-bit floating point. Two formats — E4M3 and E5M2. Halves memory vs bf16 and runs ~2× faster on Hopper-class GPUs.

per
per-checkpoint (precision_class label).
cap
no cap — precision class label. The numerical drift FP8 introduces is measured by the dense and router dashboards (see `|Δlogp|`, `router_flip_rate`).
on cap
no drop on its own; downstream `validators.precision_mismatch` rejects whole groups when an FP8 worker submits against a bf16-pinned manifest.

top-1 flip rate

Alias of `router_flip_rate` — see that entry for cap/drop.

per
per-(token, layer).
cap
no cap — observability only.
on cap
no drop — observability only.

top-k set disagreement

Alias of `token_expert_disagreement_rate` — see that entry for cap/drop.

per
per-token.
cap
no cap — observability only.
on cap
no drop — observability only.

answer_match_rate

Rate of `answer_match` across all comparisons in an engine pair — fraction of tasks where rollout and reference produced the same final answer.

per
per-(rollout_engine, trainer_engine) pair.
cap
no cap — observability only.
on cap
no drop — observability only.

tool_choice_disagreement_rate

Fraction of assistant steps where rollout and reference invoked different tools.

per
per-step, rolled up per trajectory.
cap
no cap — observability only.
on cap
no drop — observability only.

Per-step counterpart to Jaccard. >0 means the trajectories were divergent at that step level.

token_ids_available

Whether the API exposes the integer token IDs of the response (not just text).

per
per-endpoint (capability flag).
cap
no cap — endpoint capability probe.
on cap
no drop at metric time; missing token IDs make a rollout unverifiable downstream.

sampled_logprobs_available

Whether the API returns the logprob of each sampled token.

per
per-endpoint (capability flag).
cap
no cap — endpoint capability probe.
on cap
no drop at metric time; missing logprobs force the trainer to redo the forward pass.

top_logprobs_available

Whether the API returns the top-k alternatives at each sampled position.

per
per-endpoint (capability flag).
cap
no cap — endpoint capability probe.
on cap
no drop — capability probe only.

seed_supported

Whether the API honours a generation seed and signals it back via `system_fingerprint`.

per
per-endpoint (capability flag).
cap
no cap — endpoint capability probe.
on cap
no drop at metric time; missing seed support breaks audit replay.

OPBC

Off-Policy Budget Controller — the project's policy that decides whether a group is `train`, `train_with_correction`, `replay`, `quarantine`, or `reject`.

per
per-group (reads per-token / per-group metrics and emits one typed decision).
cap
configured by `BudgetPolicy` — `clamp = 20.0 nats`, `veto_abs_log_ratio = 30.0 nats`, `max_clipped_fraction = 0.10`, `min_ess = 0.30`, `replay_ess_threshold = 0.60`.
on cap
veto → `quarantine`; low ESS → `quarantine`; high `clipped_fraction` → `train_with_correction`; moderate ESS or policy lag → `replay`; otherwise → `train`.
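
A sketch of the routing order above, using the `BudgetPolicy` values quoted in this entry. The priority ordering is inferred from this glossary's descriptions; the real controller's ordering, its `reject` path, and the policy-lag check are not modelled here:

```python
from dataclasses import dataclass

@dataclass
class BudgetPolicy:
    clamp: float = 20.0
    veto_abs_log_ratio: float = 30.0
    max_clipped_fraction: float = 0.10
    min_ess: float = 0.30
    replay_ess_threshold: float = 0.60

def route(ess, clipped_fraction, veto_fraction, policy=BudgetPolicy()):
    """One typed decision per group, highest-severity check first."""
    if veto_fraction > 0.0:
        return "quarantine"          # any hard-veto token poisons the group
    if ess < policy.min_ess:
        return "quarantine"
    if ess < policy.replay_ess_threshold:
        return "replay"
    if clipped_fraction > policy.max_clipped_fraction:
        return "train_with_correction"
    return "train"
```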

TP=4

Tensor-parallel size 4 — the model is sharded across 4 GPUs at the tensor level.

per
per-engine deployment.
cap
no cap — deployment parameter.
on cap
no drop — deployment parameter.