4 headline runs (4 engine pairs); no additional runs in the full-engine appendix. Device observed: L40S (g6e.12xlarge). All numbers measure token-level logprob mismatch on the same checkpoint served by two engines.
Research question
On the same checkpoint and same prompts, how much do different inference engines and precision classes (bf16, FP8) disagree on per-token logprobs?
What we observed
Across 4 headline engine pairs, mean ESS ranges from 0.9997 (best) to 0.9404 (worst). The drift between engines at the same precision is comparable to the drift introduced by FP8 quantization. clipped_fraction = 0 everywhere, so OPBC routes all cells to `train` despite the measurable disagreement. The full-engine data is in the appendix below.
Next: Same matrix at sequence lengths 128 / 512 / 2048 to see whether the drift accumulates linearly or super-linearly.
Green: mean ESS ≥ 0.99 (effectively on-policy). Amber: 0.95–0.99. Red: <0.95 (off-policy enough that OPBC may divert).
Mean effective sample size, averaged across all runs in each engine pair. The closer to 1, the more the rollout engine's logprobs agree with the trainer reference.
Left: mean absolute Δ logprob — the typical per-token disagreement size. Right: worst single-token |log_ratio| in nats — the largest spike observed within the engine pair.
| rollout | trainer | count | mean ESS | mean clipped | mean seq_log_ratio | mean \|Δ logp\| | worst max\|log_ratio\| | tokens |
|---|---|---|---|---|---|---|---|---|
| hermes-qwen3-30b-a3b-bf16 | fsdp-bf16 | 1 | 0.9532 | 0.0000 | -1.6093 | 0.1358 | 2.0267 | 18 |
| hermes-qwen3-30b-a3b-bf16 | megatron-bf16 | 1 | 0.9404 | 0.0000 | -5.2363 | 0.3372 | 5.6530 | 18 |
| hermes-qwen3-32b-bf16 | fsdp-bf16 | 1 | 0.9997 | 0.0000 | -0.1154 | 0.0047 | 0.0841 | 27 |
| hermes-qwen3-32b-bf16 | megatron-bf16 | 1 | 0.9996 | 0.0000 | -0.1504 | 0.0061 | 0.0841 | 27 |
| run_id | model | engines | precision | device | tokens | ess | clipped | veto | max\|log_ratio\| | top1% mass |
|---|---|---|---|---|---|---|---|---|---|---|
| hermes-qwen3-32b-bf16-vs-fsdp-bf16-no-op-trivia-turn0 | Qwen/Qwen3-32B | hermes-qwen3-32b-bf16 -> fsdp-bf16 | bf16 | L40S (g6e.12xlarge) | 27 | 0.9997 | 0.0000 | 0.0000 | 0.0841 | 0.0373 |
| hermes-qwen3-30b-a3b-bf16-vs-fsdp-bf16-no-op-trivia-turn0 | Qwen/Qwen3-30B-A3B | hermes-qwen3-30b-a3b-bf16 -> fsdp-bf16 | bf16 | L40S (g6e.12xlarge) | 18 | 0.9532 | 0.0000 | 0.0000 | 2.0267 | 0.0725 |
| hermes-qwen3-32b-bf16-vs-megatron-bf16-no-op-trivia-turn0 | Qwen/Qwen3-32B | hermes-qwen3-32b-bf16 -> megatron-bf16 | bf16 | L40S (g6e.12xlarge) | 27 | 0.9996 | 0.0000 | 0.0000 | 0.0841 | 0.0374 |
| hermes-qwen3-30b-a3b-bf16-vs-megatron-bf16-no-op-trivia-turn0 | Qwen/Qwen3-30B-A3B | hermes-qwen3-30b-a3b-bf16 -> megatron-bf16 | bf16 | L40S (g6e.12xlarge) | 18 | 0.9404 | 0.0000 | 0.0000 | 5.6530 | 0.0729 |
**ESS**: Effective sample size of importance weights — how usable a rollout is for off-policy training.
ESS = (Σw)² / (N · Σw²) where w is the importance weight per token. ESS=1 means the rollout matches the trainer's policy exactly. ESS dropping toward 0 means the rollout is increasingly off-policy and a trainer would need stronger correction (or skip the rollout).
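The ESS formula above can be computed directly from per-token log ratios. A minimal sketch (not the actual OPBC implementation):

```python
import math

def ess(log_ratios):
    """Effective sample size of per-token importance weights.

    log_ratios: per-token log(trainer_prob / rollout_prob), in nats.
    Returns (Σw)² / (N · Σw²), which is 1.0 when all weights are equal
    (perfectly on-policy) and approaches 1/N as one token dominates.
    """
    w = [math.exp(r) for r in log_ratios]
    n = len(w)
    return sum(w) ** 2 / (n * sum(wi * wi for wi in w))

# Identical logprobs -> every weight is 1 -> ESS = 1
print(ess([0.0, 0.0, 0.0]))   # 1.0
# One badly mismatched token drags ESS well below 1
print(ess([0.0, 0.0, -5.0]))
```

Note that a single outlier token is enough to pull ESS down, which is why the per-token `max|log_ratio|` column matters alongside the mean.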
**|Δlogp|**: Per-token disagreement size between rollout and trainer logprobs, in nats.
Mean over the response of |trainer_logp(token) − rollout_logp(token)|. Tiny values (~0.01) mean the two engines agree on most tokens; values > 0.1 mean meaningful single-token drift.
**log_ratio**: log(trainer_prob / rollout_prob) per token, in nats — the exponent of the importance weight.
Each token's importance weight is exp(log_ratio). max|log_ratio| is the worst single-token disagreement. A log_ratio of 1 nat ≈ the trainer is e≈2.7× more confident than the rollout was.
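A minimal illustration of these two definitions, using hypothetical per-token logprobs from the two engines:

```python
import math

# Hypothetical per-token logprobs for one short response.
trainer_logp = [-0.5, -1.2, -0.3]
rollout_logp = [-0.6, -1.0, -2.3]

# log_ratio per token = trainer_logp - rollout_logp (log of prob ratio)
log_ratios = [t - r for t, r in zip(trainer_logp, rollout_logp)]
weights = [math.exp(lr) for lr in log_ratios]   # importance weight per token
worst = max(abs(lr) for lr in log_ratios)       # max|log_ratio|, in nats

# The third token has log_ratio = 2.0: the trainer is e² ≈ 7.4x more
# confident there than the rollout engine was.
print(worst)
```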
**sequence_log_ratio**: Sum of per-token log_ratios across the response, in nats — the log of the full-sequence importance weight.
How far the rollout drifted from the trainer-side view across the whole sequence. ±0.5 nats over 128 tokens ≈ negligible; 5+ nats means the engines systematically disagree.
**clipped_fraction**: Fraction of tokens whose |log_ratio| exceeded the clamp threshold.
Clamping importance weights stops one bad token from blowing up the gradient. >0.1 here means ≥10% of tokens are clamped — typically a trigger to mark the group `train_with_correction`.
**veto_fraction**: Fraction of tokens whose |log_ratio| exceeded the hard-veto threshold.
>0 here means at least one token is so off-policy the OPBC quarantines the whole group — even after clamping it would corrupt the gradient.
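The clamp/veto routing described in the last two entries can be sketched as follows. The threshold values are illustrative assumptions — the report only states the >0.1 clipped trigger and the `train` / `train_with_correction` / quarantine tiers, not OPBC's actual nat cutoffs:

```python
# Hypothetical thresholds -- the report does not state OPBC's actual values.
CLAMP_NATS = 1.0    # |log_ratio| above this gets clamped
VETO_NATS = 4.0     # |log_ratio| above this quarantines the whole group
CLIP_LIMIT = 0.10   # >10% of tokens clamped -> train_with_correction

def route(log_ratios):
    """Route a rollout group based on per-token log ratios (sketch)."""
    n = len(log_ratios)
    clipped_fraction = sum(abs(r) > CLAMP_NATS for r in log_ratios) / n
    veto_fraction = sum(abs(r) > VETO_NATS for r in log_ratios) / n
    if veto_fraction > 0:
        return "quarantine"            # even clamped, it would corrupt the gradient
    if clipped_fraction > CLIP_LIMIT:
        return "train_with_correction"
    return "train"

print(route([0.05, -0.02, 0.1]))   # train
print(route([0.05, 2.5, 0.1]))     # train_with_correction
print(route([0.05, 5.0, 0.1]))     # quarantine
```

In the headline runs clipped_fraction is 0 everywhere, so every cell falls through to `train` even though the 30B-A3B pairs show measurable drift.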
**top_1pct_gradient_mass**: Fraction of total importance weight carried by the worst 1% of tokens.
Tells you whether the drift is uniform (~0.01) or concentrated in a few outlier tokens (>0.05). Concentrated drift is more dangerous.
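A sketch of how this metric falls out of the per-token weights ("worst" here means largest importance weight; exact tie-breaking in OPBC is not specified in the report):

```python
import math

def top_1pct_mass(log_ratios):
    """Fraction of total importance weight carried by the heaviest 1% of tokens."""
    w = sorted((math.exp(r) for r in log_ratios), reverse=True)
    k = max(1, len(w) // 100)     # at least one token counts as the "worst 1%"
    return sum(w[:k]) / sum(w)

# Uniform weights over 100 tokens: the top token carries exactly 1% of the mass.
print(top_1pct_mass([0.0] * 100))            # 0.01
# One outlier token concentrates the mass well above the uniform baseline.
print(top_1pct_mass([0.0] * 99 + [3.0]))
```

This is why ~0.01 reads as "uniform drift" in the definition above: with equal weights, 1% of the tokens carry exactly 1% of the mass, and anything much larger signals concentration.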