Controlled dense mismatch dashboard

4 headline runs (4 engine pairs); 0 additional runs in the full-engine appendix. Devices observed: L40S (g6e.12xlarge). Numbers are token-level logprob mismatch on the same checkpoint served by two engines.

Research question

On the same checkpoint and same prompts, how much do different inference engines and precision classes (bf16, FP8) disagree on per-token logprobs?

What we observed

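Across the 4 headline engine pairs, mean ESS ranges from 0.9997 (best) to 0.9404 (worst). The two Qwen3-32B pairs stay at or above 0.9996, while the two Qwen3-30B-A3B pairs drop to 0.9532 and 0.9404. The drift between engines at the same precision is comparable to the drift introduced by FP8 quantization, although all four headline runs here are bf16, so that comparison is not visible in the headline tables. clipped_fraction = 0 everywhere, so OPBC routes all cells to `train` despite the measurable disagreement. The full-engine data is in the appendix below.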

Next: run the same matrix at sequence lengths 128 / 512 / 2048 to see whether the drift accumulates linearly or super-linearly.

headline runs: 4
headline engine pairs: 4
best ESS: 0.9997
worst ESS: 0.9404
worst max|log_ratio|: 5.653

ESS traffic-light by engine pair

Green: mean ESS ≥ 0.99 (effectively on-policy). Amber: 0.95–0.99. Red: <0.95 (off-policy enough that OPBC may divert).

hermes-qwen3-30b-a3b-bf16 → fsdp-bf16: 0.9532 (mean ESS over 1 run)
hermes-qwen3-30b-a3b-bf16 → megatron-bf16: 0.9404 (mean ESS over 1 run)
hermes-qwen3-32b-bf16 → fsdp-bf16: 0.9997 (mean ESS over 1 run)
hermes-qwen3-32b-bf16 → megatron-bf16: 0.9996 (mean ESS over 1 run)
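
The colour bands above can be applied mechanically. The following is a minimal sketch, not the dashboard's own code: the function name `ess_traffic_light` is hypothetical, and the boundary handling (0.99 counts as green, 0.95 counts as amber) is read off the band definitions above.

```python
def ess_traffic_light(mean_ess: float) -> str:
    """Map a mean ESS value to the dashboard's colour band (hypothetical helper)."""
    if mean_ess >= 0.99:
        return "green"   # effectively on-policy
    if mean_ess >= 0.95:
        return "amber"
    return "red"         # off-policy enough that OPBC may divert


# Mean ESS per headline engine pair, copied from the cards above.
headline_pairs = {
    "hermes-qwen3-30b-a3b-bf16 -> fsdp-bf16": 0.9532,      # amber
    "hermes-qwen3-30b-a3b-bf16 -> megatron-bf16": 0.9404,  # red
    "hermes-qwen3-32b-bf16 -> fsdp-bf16": 0.9997,          # green
    "hermes-qwen3-32b-bf16 -> megatron-bf16": 0.9996,      # green
}

for pair, ess in headline_pairs.items():
    print(f"{pair}: {ess:.4f} {ess_traffic_light(ess)}")
```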

ESS by engine pair

Mean effective sample size, averaged across all runs in each engine pair. The closer to 1, the more the rollout engine's logprobs agree with the trainer reference.

Per-token drift

Left: mean absolute delta logprob — typical disagreement size. Right: worst single-token |log_ratio| in nats — the worst spike that survived inside the engine pair.

Raw numbers: tables for the headline charts

Per (rollout_engine -> trainer_engine) pair

| rollout | trainer | count | mean ESS | mean clipped | mean seq_log_ratio | mean \|Δ logp\| | worst max\|log_ratio\| | tokens |
|---|---|---|---|---|---|---|---|---|
| hermes-qwen3-30b-a3b-bf16 | fsdp-bf16 | 1 | 0.9532 | 0.0000 | -1.6093 | 0.1358 | 2.0267 | 18 |
| hermes-qwen3-30b-a3b-bf16 | megatron-bf16 | 1 | 0.9404 | 0.0000 | -5.2363 | 0.3372 | 5.6530 | 18 |
| hermes-qwen3-32b-bf16 | fsdp-bf16 | 1 | 0.9997 | 0.0000 | -0.1154 | 0.0047 | 0.0841 | 27 |
| hermes-qwen3-32b-bf16 | megatron-bf16 | 1 | 0.9996 | 0.0000 | -0.1504 | 0.0061 | 0.0841 | 27 |

Per run

| run_id | model | engines | precision | device | tokens | ess | clipped | veto | max\|log_ratio\| | top1% mass |
|---|---|---|---|---|---|---|---|---|---|---|
| hermes-qwen3-32b-bf16-vs-fsdp-bf16-no-op-trivia-turn0 | Qwen/Qwen3-32B | hermes-qwen3-32b-bf16 -> fsdp-bf16 | bf16 | L40S (g6e.12xlarge) | 27 | 0.9997 | 0.0000 | 0.0000 | 0.0841 | 0.0373 |
| hermes-qwen3-30b-a3b-bf16-vs-fsdp-bf16-no-op-trivia-turn0 | Qwen/Qwen3-30B-A3B | hermes-qwen3-30b-a3b-bf16 -> fsdp-bf16 | bf16 | L40S (g6e.12xlarge) | 18 | 0.9532 | 0.0000 | 0.0000 | 2.0267 | 0.0725 |
| hermes-qwen3-32b-bf16-vs-megatron-bf16-no-op-trivia-turn0 | Qwen/Qwen3-32B | hermes-qwen3-32b-bf16 -> megatron-bf16 | bf16 | L40S (g6e.12xlarge) | 27 | 0.9996 | 0.0000 | 0.0000 | 0.0841 | 0.0374 |
| hermes-qwen3-30b-a3b-bf16-vs-megatron-bf16-no-op-trivia-turn0 | Qwen/Qwen3-30B-A3B | hermes-qwen3-30b-a3b-bf16 -> megatron-bf16 | bf16 | L40S (g6e.12xlarge) | 18 | 0.9404 | 0.0000 | 0.0000 | 5.6530 | 0.0729 |

What these metrics mean

ESS

Effective sample size of importance weights — how usable a rollout is for off-policy training.

per: per-sequence (one ESS value per group / response).
cap: soft floors at `BudgetPolicy.replay_ess_threshold = 0.60` and `BudgetPolicy.min_ess = 0.30`; no per-token clamp.
on cap: below 0.60 routes the group to `replay`; below 0.30 routes it to `quarantine`. The tokens themselves are not modified.

ESS = (Σw)² / (N · Σw²), where w is the per-token importance weight and N is the number of tokens. ESS = 1 means all weights are equal, which is the case when the rollout matches the trainer's policy exactly. ESS dropping toward 0 means the rollout is increasingly off-policy and a trainer would need stronger correction (or skip the rollout).
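
The ESS formula above can be sketched directly from per-token log ratios. This is an illustrative implementation, not the dashboard's code: `normalized_ess` is a hypothetical name, and the input values are made up. Because the formula is unchanged when every weight is scaled by the same constant, the sketch subtracts the largest log ratio before exponentiating to avoid overflow.

```python
import math


def normalized_ess(log_ratios):
    """ESS = (sum w)^2 / (N * sum w^2), with w = exp(log_ratio) per token."""
    if not log_ratios:
        raise ValueError("need at least one token")
    # Shifting by the max only rescales every weight by the same factor,
    # which cancels in the ratio, so the result is unchanged.
    shift = max(log_ratios)
    weights = [math.exp(lr - shift) for lr in log_ratios]
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total) / (len(weights) * total_sq)


# Hypothetical inputs: perfect agreement, then one badly off token out of 18.
print(normalized_ess([0.0] * 18))              # all weights equal, so ESS is 1
print(normalized_ess([0.0] * 17 + [-5.653]))   # roughly 0.945
```

With a short response, a single outlier token is enough to pull ESS down by about 1/N, which is why the 18-token runs are more sensitive than the 27-token ones.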

|Δlogp|

Per-token disagreement size between rollout and trainer logprobs, in nats.

per: per-token; dashboards display the mean of |Δlogp| over the response.
cap: `BudgetPolicy.clamp = 20.0 nats` (defined in `rollout_market.opbc.BudgetPolicy`); shared with `mismatch_metrics.summarize_logprob_mismatch(clamp=20.0)`.
on cap: tokens with |Δlogp| > clamp are clipped in place to ±20 nats; the rest of the group survives.

Mean over the response of |trainer_logp(token) − rollout_logp(token)|. Tiny values (~0.01) mean the two engines agree on most tokens; values > 0.1 mean meaningful single-token drift.
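
As a rough sketch of the mean described above, assuming the two engines' logprobs are available as equal-length lists aligned token by token (the function name `mean_abs_delta_logp` and the sample values are hypothetical):

```python
def mean_abs_delta_logp(trainer_logps, rollout_logps):
    """Mean over the response of |trainer_logp(token) - rollout_logp(token)|, in nats."""
    if len(trainer_logps) != len(rollout_logps):
        raise ValueError("logprob lists must be aligned token by token")
    if not trainer_logps:
        raise ValueError("need at least one token")
    deltas = [abs(t - r) for t, r in zip(trainer_logps, rollout_logps)]
    return sum(deltas) / len(deltas)


# Made-up logprobs for a three-token response.
trainer = [-0.10, -2.30, -0.05]
rollout = [-0.11, -2.10, -0.05]
print(mean_abs_delta_logp(trainer, rollout))  # about 0.07 nats
```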

log_ratio

log( trainer_prob / rollout_prob ) per token, in nats — the exponent of the importance weight.

per: per-token.
cap: `BudgetPolicy.clamp = 20.0 nats` (soft) and `BudgetPolicy.veto_abs_log_ratio = 30.0 nats` (hard).
on cap: tokens with |log_ratio| > 20 are clipped in place; tokens with |log_ratio| > 30 fire a veto and quarantine the whole group.

Each token's importance weight is exp(log_ratio), and max|log_ratio| is the worst single-token disagreement. A log_ratio of +1 nat means the trainer assigns the token e ≈ 2.72× the probability the rollout engine did.
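
The relationship between logprobs, log ratios, and importance weights can be sketched as follows. The helper names and the sample logprobs are hypothetical; only the definitions (log_ratio = trainer logprob minus rollout logprob, weight = exp(log_ratio)) come from the text above.

```python
import math


def token_log_ratios(trainer_logps, rollout_logps):
    """log(trainer_prob / rollout_prob) per token, i.e. a difference of logprobs."""
    if len(trainer_logps) != len(rollout_logps):
        raise ValueError("logprob lists must be aligned token by token")
    return [t - r for t, r in zip(trainer_logps, rollout_logps)]


def importance_weights(log_ratios):
    """Per-token importance weight exp(log_ratio)."""
    return [math.exp(lr) for lr in log_ratios]


def max_abs_log_ratio(log_ratios):
    """Worst single-token disagreement, in nats."""
    return max(abs(lr) for lr in log_ratios)


# Made-up example: the trainer is 1 nat more confident on the second token.
ratios = token_log_ratios([-0.5, -1.0, -2.0], [-0.5, -2.0, -1.5])
print(ratios)                      # [0.0, 1.0, -0.5]
print(importance_weights(ratios))  # the middle weight is e
print(max_abs_log_ratio(ratios))   # 1.0
```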

sequence_log_ratio

Sum of per-token log_ratios across the response, in nats — the log of the full-sequence importance weight.

per: per-sequence (one number per response).
cap: no per-sequence cap; budget action is decided from ESS, `max_clipped_fraction`, and `veto_abs_log_ratio` on the constituent tokens.
on cap: no direct drop; it propagates into ESS, which drives the budget decision.

How far the rollout drifted from the trainer-side view across the whole sequence. ±0.5 nats over 128 tokens ≈ negligible; 5+ nats means the engines systematically disagree.
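
A small sketch of the sequence-level quantity, under the definition above (sum of per-token log ratios). The per-token average is an extra, hypothetical helper that makes responses of different lengths easier to compare; it is not a metric the dashboard reports.

```python
import math


def sequence_log_ratio(token_log_ratios):
    """Sum of per-token log ratios: the log of the full-sequence importance weight."""
    return math.fsum(token_log_ratios)


def sequence_importance_weight(token_log_ratios):
    """Full-sequence importance weight, exp(sequence_log_ratio)."""
    return math.exp(sequence_log_ratio(token_log_ratios))


def per_token_average(token_log_ratios):
    """Hypothetical length-normalised view: nats of drift per token."""
    if not token_log_ratios:
        raise ValueError("need at least one token")
    return sequence_log_ratio(token_log_ratios) / len(token_log_ratios)


# Made-up per-token log ratios for a four-token response.
example = [0.02, -0.10, 0.01, -0.08]
print(sequence_log_ratio(example))          # about -0.15 nats
print(sequence_importance_weight(example))  # a little below 1
print(per_token_average(example))           # about -0.0375 nats per token
```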

clipped_fraction

Fraction of tokens whose |log_ratio| exceeded the clamp threshold.

per: per-group (fraction over all valid policy tokens).
cap: `BudgetPolicy.max_clipped_fraction = 0.10` (a.k.a. STEER's `high_clipped_fraction = 0.1`); a token counts as clipped when its |log_ratio| exceeds `BudgetPolicy.clamp = 20.0 nats`.
on cap: above 0.10 routes the group to `train_with_correction`; individual offending tokens stay clipped in place.

Clamping importance weights stops one bad token from blowing up the gradient. A value above 0.1 means more than 10% of tokens were clamped, which is typically the trigger to mark the group `train_with_correction`.
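
A minimal sketch of clipping and the resulting clipped fraction, assuming the 20-nat clamp quoted above. The function `clip_log_ratios` is hypothetical and is not the signature of `summarize_logprob_mismatch`; the input values are made up.

```python
CLAMP_NATS = 20.0  # value quoted above for BudgetPolicy.clamp


def clip_log_ratios(log_ratios, clamp=CLAMP_NATS):
    """Clip each log ratio to [-clamp, +clamp] and report the fraction that was clipped."""
    if not log_ratios:
        return [], 0.0
    clipped = [max(-clamp, min(clamp, lr)) for lr in log_ratios]
    # A token counts as clipped only when it strictly exceeds the clamp.
    n_clipped = sum(1 for lr in log_ratios if abs(lr) > clamp)
    return clipped, n_clipped / len(log_ratios)


# Made-up group of four tokens, two of them beyond the clamp.
values, fraction = clip_log_ratios([0.1, -25.0, 3.0, 21.0])
print(values)    # [0.1, -20.0, 3.0, 20.0]
print(fraction)  # 0.5
```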

veto_fraction

Fraction of tokens whose |log_ratio| exceeded the hard-veto threshold.

per: per-group (fraction over all valid policy tokens).
cap: `BudgetPolicy.veto_abs_log_ratio = 30.0 nats`; the veto fraction itself fires on any non-zero value (veto_fraction > 0.0).
on cap: any non-zero veto fraction quarantines the entire group; OPBC does not attempt correction.

Any value above 0 means at least one token is so off-policy that OPBC quarantines the whole group, because even after clamping it would corrupt the gradient.
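
Putting the thresholds from the sections above together, the routing could be sketched as below. This is an assumption-laden illustration: `BudgetPolicySketch` is a hypothetical stand-in that only mirrors the field names and defaults quoted in this document, not the real `rollout_market.opbc.BudgetPolicy`, and the order in which the checks are applied is a guess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetPolicySketch:
    """Hypothetical stand-in carrying the thresholds quoted in this document."""
    clamp: float = 20.0
    veto_abs_log_ratio: float = 30.0
    max_clipped_fraction: float = 0.10
    replay_ess_threshold: float = 0.60
    min_ess: float = 0.30


def veto_fraction(log_ratios, policy):
    """Fraction of tokens whose |log_ratio| exceeds the hard-veto threshold."""
    if not log_ratios:
        return 0.0
    n_veto = sum(1 for lr in log_ratios if abs(lr) > policy.veto_abs_log_ratio)
    return n_veto / len(log_ratios)


def route(ess, clipped_fraction, veto_frac, policy=BudgetPolicySketch()):
    """Pick a budget action; the precedence of the checks is an assumption."""
    if veto_frac > 0.0 or ess < policy.min_ess:
        return "quarantine"
    if ess < policy.replay_ess_threshold:
        return "replay"
    if clipped_fraction > policy.max_clipped_fraction:
        return "train_with_correction"
    return "train"


# Worst headline run: ESS 0.9404, nothing clipped, nothing vetoed.
print(route(0.9404, 0.0, 0.0))  # train
```

Under these assumptions, every headline run lands on `train`, because the lowest ESS (0.9404) is well above the 0.60 replay floor and no tokens are clipped or vetoed.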

top_1pct_gradient_mass

Fraction of total importance weight carried by the worst 1% of tokens.

per: per-group (one number per response/group).
cap: no formal cap; rendered alongside `BudgetPolicy.clamp = 20.0 nats` and `max_clipped_fraction = 0.10` for context.
on cap: no direct drop; used as a concentration diagnostic feeding operator triage.

Tells you whether the drift is uniform (~0.01) or concentrated in a few outlier tokens (>0.05). Concentrated drift is more dangerous.
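
One way to compute this concentration measure is sketched below. Several details are assumptions rather than the dashboard's definition: "worst" is taken to mean the largest importance weights, and the 1% is rounded up so that at least one token is always counted. The function name `top_1pct_mass` is hypothetical.

```python
import math


def top_1pct_mass(log_ratios, top_fraction=0.01):
    """Share of total importance weight held by the top `top_fraction` of tokens."""
    if not log_ratios:
        raise ValueError("need at least one token")
    # Shift by the max for numerical stability; the ratio below is unaffected.
    shift = max(log_ratios)
    weights = sorted((math.exp(lr - shift) for lr in log_ratios), reverse=True)
    # Assumption: round up, so short responses still count one token.
    k = max(1, math.ceil(top_fraction * len(weights)))
    return sum(weights[:k]) / sum(weights)


# Made-up inputs: uniform weights, then one outlier among 100 tokens.
print(top_1pct_mass([0.0] * 100))          # 0.01
print(top_1pct_mass([0.0] * 99 + [3.0]))   # well above 0.05
```

Under this rounding assumption, a perfectly uniform response of N < 100 tokens yields 1/N, so a 27-token response gives about 0.037 even with no drift at all. The ~0.01 baseline therefore only applies to responses of roughly 100 tokens or more.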