4 headline runs (4 engine pairs); no additional runs in the full-engine appendix. Device observed: L40S (g6e.12xlarge). All numbers measure token-level logprob mismatch on the same checkpoint served by two engines.
Research question
On the same checkpoint and same prompts, how much do different inference engines and precision classes (bf16, FP8) disagree on per-token logprobs?
What we observed
Across 4 headline engine pairs, mean ESS ranges from 0.9997 (best) to 0.9404 (worst). The drift between engines at the same precision is comparable to the drift introduced by FP8 quantization. clipped_fraction = 0 everywhere, so OPBC routes all cells to `train` despite the measurable disagreement. The full-engine data is in the appendix below.
Next: Same matrix at sequence lengths 128 / 512 / 2048 to see whether the drift accumulates linearly or super-linearly.
Green: mean ESS ≥ 0.99 (effectively on-policy). Amber: 0.95–0.99. Red: <0.95 (off-policy enough that OPBC may divert).
Mean effective sample size, averaged across all runs in each engine pair. The closer to 1, the more the rollout engine's logprobs agree with the trainer reference.
Left: mean absolute Δ logprob — the typical per-token disagreement size. Right: worst single-token |log_ratio| in nats — the largest spike observed within the engine pair.
| rollout | trainer | count | mean ESS | mean clipped | mean seq_log_ratio | mean \|Δ logp\| | worst max\|log_ratio\| | tokens |
|---|---|---|---|---|---|---|---|---|
| hermes-qwen3-30b-a3b-bf16 | fsdp-bf16 | 1 | 0.9532 | 0.0000 | -1.6093 | 0.1358 | 2.0267 | 18 |
| hermes-qwen3-30b-a3b-bf16 | megatron-bf16 | 1 | 0.9404 | 0.0000 | -5.2363 | 0.3372 | 5.6530 | 18 |
| hermes-qwen3-32b-bf16 | fsdp-bf16 | 1 | 0.9997 | 0.0000 | -0.1154 | 0.0047 | 0.0841 | 27 |
| hermes-qwen3-32b-bf16 | megatron-bf16 | 1 | 0.9996 | 0.0000 | -0.1504 | 0.0061 | 0.0841 | 27 |
| run_id | model | engines | precision | device | tokens | ess | clipped | veto | max\|log_ratio\| | top1% mass |
|---|---|---|---|---|---|---|---|---|---|---|
| hermes-qwen3-32b-bf16-vs-fsdp-bf16-no-op-trivia-turn0 | Qwen/Qwen3-32B | hermes-qwen3-32b-bf16 -> fsdp-bf16 | bf16 | L40S (g6e.12xlarge) | 27 | 0.9997 | 0.0000 | 0.0000 | 0.0841 | 0.0373 |
| hermes-qwen3-30b-a3b-bf16-vs-fsdp-bf16-no-op-trivia-turn0 | Qwen/Qwen3-30B-A3B | hermes-qwen3-30b-a3b-bf16 -> fsdp-bf16 | bf16 | L40S (g6e.12xlarge) | 18 | 0.9532 | 0.0000 | 0.0000 | 2.0267 | 0.0725 |
| hermes-qwen3-32b-bf16-vs-megatron-bf16-no-op-trivia-turn0 | Qwen/Qwen3-32B | hermes-qwen3-32b-bf16 -> megatron-bf16 | bf16 | L40S (g6e.12xlarge) | 27 | 0.9996 | 0.0000 | 0.0000 | 0.0841 | 0.0374 |
| hermes-qwen3-30b-a3b-bf16-vs-megatron-bf16-no-op-trivia-turn0 | Qwen/Qwen3-30B-A3B | hermes-qwen3-30b-a3b-bf16 -> megatron-bf16 | bf16 | L40S (g6e.12xlarge) | 18 | 0.9404 | 0.0000 | 0.0000 | 5.6530 | 0.0729 |
**ESS**: Effective sample size of importance weights — how usable a rollout is for off-policy training.
ESS = (Σw)² / (N · Σw²) where w is the importance weight per token. ESS=1 means the rollout matches the trainer's policy exactly. ESS dropping toward 0 means the rollout is increasingly off-policy and a trainer would need stronger correction (or skip the rollout).
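The ESS formula above can be computed directly from per-token log ratios. A minimal sketch (not the actual OPBC implementation):

```python
import math

def ess(log_ratios):
    """Effective sample size of per-token importance weights.

    log_ratios: per-token log(trainer_prob / rollout_prob), in nats.
    Returns (Σw)² / (N · Σw²), which is 1.0 when all weights are equal
    (perfectly on-policy) and approaches 1/N as one token dominates.
    """
    w = [math.exp(r) for r in log_ratios]
    n = len(w)
    return sum(w) ** 2 / (n * sum(wi * wi for wi in w))

# Identical logprobs -> every weight is 1 -> ESS = 1
print(ess([0.0, 0.0, 0.0]))   # 1.0
# One badly mismatched token drags ESS well below 1
print(ess([0.0, 0.0, -5.0]))
```

Note that a single outlier token is enough to pull ESS down, which is why the per-token `max|log_ratio|` column matters alongside the mean.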
**|Δlogp|**: Per-token disagreement size between rollout and trainer logprobs, in nats.
Mean over the response of |trainer_logp(token) − rollout_logp(token)|. Tiny values (~0.01) mean the two engines agree on most tokens; values > 0.1 mean meaningful single-token drift.
**log_ratio**: log(trainer_prob / rollout_prob) per token, in nats — the exponent of the importance weight.
Each token's importance weight is exp(log_ratio). max|log_ratio| is the worst single-token disagreement. A log_ratio of 1 nat ≈ the trainer is e≈2.7× more confident than the rollout was.
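A minimal illustration of these two definitions, using hypothetical per-token logprobs from the two engines:

```python
import math

# Hypothetical per-token logprobs for one short response.
trainer_logp = [-0.5, -1.2, -0.3]
rollout_logp = [-0.6, -1.0, -2.3]

# log_ratio per token = trainer_logp - rollout_logp (log of prob ratio)
log_ratios = [t - r for t, r in zip(trainer_logp, rollout_logp)]
weights = [math.exp(lr) for lr in log_ratios]   # importance weight per token
worst = max(abs(lr) for lr in log_ratios)       # max|log_ratio|, in nats

# The third token has log_ratio = 2.0: the trainer is e² ≈ 7.4x more
# confident there than the rollout engine was.
print(worst)
```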
**sequence_log_ratio**: Sum of per-token log_ratios across the response, in nats — the log of the full-sequence importance weight.
How far the rollout drifted from the trainer-side view across the whole sequence. ±0.5 nats over 128 tokens ≈ negligible; 5+ nats means the engines systematically disagree.
**clipped_fraction**: Fraction of tokens whose |log_ratio| exceeded the clamp threshold.
Clamping importance weights stops one bad token from blowing up the gradient. >0.1 here means ≥10% of tokens are clamped — typically a trigger to mark the group `train_with_correction`.
**veto_fraction**: Fraction of tokens whose |log_ratio| exceeded the hard-veto threshold.
>0 here means at least one token is so off-policy the OPBC quarantines the whole group — even after clamping it would corrupt the gradient.
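The clamp/veto routing described in the last two entries can be sketched as follows. The threshold values are illustrative assumptions — the report only states the >0.1 clipped trigger and the `train` / `train_with_correction` / quarantine tiers, not OPBC's actual nat cutoffs:

```python
# Hypothetical thresholds -- the report does not state OPBC's actual values.
CLAMP_NATS = 1.0    # |log_ratio| above this gets clamped
VETO_NATS = 4.0     # |log_ratio| above this quarantines the whole group
CLIP_LIMIT = 0.10   # >10% of tokens clamped -> train_with_correction

def route(log_ratios):
    """Route a rollout group based on per-token log ratios (sketch)."""
    n = len(log_ratios)
    clipped_fraction = sum(abs(r) > CLAMP_NATS for r in log_ratios) / n
    veto_fraction = sum(abs(r) > VETO_NATS for r in log_ratios) / n
    if veto_fraction > 0:
        return "quarantine"            # even clamped, it would corrupt the gradient
    if clipped_fraction > CLIP_LIMIT:
        return "train_with_correction"
    return "train"

print(route([0.05, -0.02, 0.1]))   # train
print(route([0.05, 2.5, 0.1]))     # train_with_correction
print(route([0.05, 5.0, 0.1]))     # quarantine
```

In the headline runs clipped_fraction is 0 everywhere, so every cell falls through to `train` even though the 30B-A3B pairs show measurable drift.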
**top_1pct_gradient_mass**: Fraction of total importance weight carried by the worst 1% of tokens.
Tells you whether the drift is uniform (~0.01) or concentrated in a few outlier tokens (>0.05). Concentrated drift is more dangerous.
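A sketch of how this metric falls out of the per-token weights ("worst" here means largest importance weight; exact tie-breaking in OPBC is not specified in the report):

```python
import math

def top_1pct_mass(log_ratios):
    """Fraction of total importance weight carried by the heaviest 1% of tokens."""
    w = sorted((math.exp(r) for r in log_ratios), reverse=True)
    k = max(1, len(w) // 100)     # at least one token counts as the "worst 1%"
    return sum(w[:k]) / sum(w)

# Uniform weights over 100 tokens: the top token carries exactly 1% of the mass.
print(top_1pct_mass([0.0] * 100))            # 0.01
# One outlier token concentrates the mass well above the uniform baseline.
print(top_1pct_mass([0.0] * 99 + [3.0]))
```

This is why ~0.01 reads as "uniform drift" in the definition above: with equal weights, 1% of the tokens carry exactly 1% of the mass, and anything much larger signals concentration.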