Dense mismatch dashboard →
Per-(rollout_engine, trainer_engine) ESS / max|log_ratio| over Qwen3-32B response tokens, 8 prompts × 4 engine cells.
Real-data observatory for the rollout marketplace. Lenses on how engine and precision choices change what an LLM-driven agent actually does.
📖 What do these metrics mean? · 🔬 Future research & open problems · 📂 Source code
What this measures. A real Hermes-Agent multi-turn trajectory captured through the proxy at scripts/live/logprob_capture_proxy.py, then teacher-forced through FSDP and Megatron in bf16 over the same token IDs the rollout produced. The trainer only ever sees those tokens, so the relevant question is whether the two engines assign each token the same probability.
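The re-scoring step above can be sketched in a few lines. This is a pure-Python stand-in (the function name is illustrative; in the real pipeline `logits` come from an FSDP or Megatron forward pass over the full captured sequence, not from a toy list):

```python
import math

def token_logprobs(logits, token_ids, response_start):
    """Teacher forcing: one forward pass over the full rollout sequence,
    no sampling. The logits at position t-1 score the token at position t,
    so we read off log p(token_ids[t] | token_ids[:t]) for every
    response-token position."""
    out = []
    for t in range(response_start, len(token_ids)):
        row = logits[t - 1]
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))  # log-softmax denominator
        out.append(row[token_ids[t]] - log_z)
    return out
```

Running this once per engine over the same `token_ids` yields the two per-token logprob streams the dashboard diffs.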
Per-token mismatch is the gap |logπ_trainer(r_t) − logπ_rollout(r_t)| at each response position. Sequence-level mismatch is the tile value: ESS = Effective Sample Size, the fraction of captured rollouts the trainer can effectively reuse after importance correction, computed as 1 / E[exp(2·Σ_t Δlogp_t)], where the sum runs over a sequence's response tokens and the expectation over sequences. ESS = 1.0 means the engines agree on the whole sequence; lower ESS means a single bad token can tank reuse of the whole trajectory.
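The ESS rollup can be sketched as follows. This uses the standard normalized estimator (Σw)² / (n·Σw²), which matches the 1 / E[w²] form once weights are normalized to mean 1; the function name is illustrative:

```python
import math

def ess_fraction(per_seq_deltas):
    """per_seq_deltas: for each captured sequence, the list of per-token
    gaps log p_trainer - log p_rollout over its response tokens.
    Each sequence's importance weight is w = exp(sum of its gaps);
    ESS in (0, 1] is (sum w)^2 / (n * sum w^2)."""
    w = [math.exp(sum(d)) for d in per_seq_deltas]
    return sum(w) ** 2 / (len(w) * sum(x * x for x in w))
```

With all gaps at zero every weight is 1 and ESS is exactly 1.0; one sequence with a large summed gap dominates the weights and drags ESS down, which is the "single bad token tanks the trajectory" effect.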
Color bands: green (ESS > 0.99) = rollout and trainer agree token-for-token; amber (0.95–0.99) = trainer needs stronger off-policy correction; red (< 0.95) = rollout would be dropped. The card below the matrices surfaces the worst-token |Δlogprob| spike — a sequence-level robustness check that ESS averaging can hide.
How to read each tile. The big number is mean ESS for that (rollout, trainer) cell. The sub-label decodes as: n=N run(s) is the number of probe-bearing assistant turns the cell averages over (a single trajectory contributes one turn per Hermes-Agent step that called a tool or emitted a final answer); mean |Δlogprob|=X is the average per-token log-probability gap between rollout and trainer over those turns. ESS is a sequence-level rollup of that per-token gap.
Same metric, MoE rollouts. Hermes-Agent trajectory over Qwen3-30B-A3B, teacher-forced through FSDP and Megatron in bf16. The per-token and sequence-level mismatch story is identical to the dense case above — MoE forward passes produce a logprob per token just like dense ones do, so ESS means the same thing here. The MoE-specific signal is the router matrix below.
Why MoE needs a second metric. An MoE FFN doesn't push every token through one big dense FFN; it routes each token through a small subset of experts and combines their outputs. In Qwen3-30B-A3B: 48 transformer layers, each with its own 128 experts (no sharing across layers). Per token per layer, a gate network picks the top 8 experts and assigns each a routing weight. The layer output is the weighted sum of those 8 experts' outputs.
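The routing step can be sketched as below. This is a minimal stand-in, not Qwen3's implementation: `expert_fn` is a hypothetical callable for one expert's FFN, and renormalizing the softmax over the selected top-k (rather than over all 128 logits) is an assumption; models differ on where the normalization happens.

```python
import math

def moe_layer_output(gate_logits, expert_fn, hidden, top_k=8):
    """One token through one MoE layer: pick top_k experts by gate
    logit, softmax the selected logits into routing weights, and
    return the weighted sum of those experts' outputs plus the
    chosen expert ids (what the router trace records)."""
    idx = sorted(range(len(gate_logits)), key=lambda i: -gate_logits[i])[:top_k]
    m = max(gate_logits[i] for i in idx)
    w = [math.exp(gate_logits[i] - m) for i in idx]  # unnormalized weights
    z = sum(w)
    out = [0.0] * len(hidden)
    for i, wi in zip(idx, w):
        for d, v in enumerate(expert_fn(i, hidden)):  # expert i's FFN output
            out[d] += (wi / z) * v
    return out, idx
```

The `idx` list is exactly the per-(token, layer) expert set the flip-rate metric compares across engines.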
What can go wrong even when logprobs agree. Two engines can pick different experts at the same (token, layer) position and still produce nearly identical layer outputs — and therefore identical next-token logprobs. This happens because experts at the same layer learn overlapping subspaces of knowledge (so two paths through the layer can be functionally equivalent). ESS sees the output and says fine, reuse this rollout; but at training time, the gradient lands on the experts the trainer would have activated, not the ones the rollout actually activated — so the wrong weights get updated.
How we measure it. router_flip_rate is the rate at which the rollout's top-1 expert disagrees with the trainer's top-1 expert at the same (response_token, layer) position. Color bands: green ≤ 5%, amber 5–15%, red > 15%. Rollout side: patched vLLM HTTP shim emits choices[0].routed_experts; the proxy trims to the gen-token tail. Trainer side: FSDP runs HF with output_router_logits=True + torch.topk per layer; Megatron hooks mlp.router directly. We currently only compare expert identities, not their routing weights — the weight-disagreement signal is a follow-up.
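The comparison itself reduces to an element-wise disagreement count. A minimal sketch (the function name is illustrative; the real pipeline fills these arrays from the shim's routed_experts on the rollout side and torch.topk output on the trainer side):

```python
def router_flip_rate(rollout_top1, trainer_top1):
    """Top-1 router disagreement. Inputs are expert ids indexed as
    [response_token][moe_layer]. Returns the mean flip rate over all
    (token, layer) positions and the worst single layer's rate
    (the 'worst layer' number shown on each tile)."""
    n_tok, n_layer = len(rollout_top1), len(rollout_top1[0])
    flips = [0] * n_layer
    for tok_r, tok_t in zip(rollout_top1, trainer_top1):
        for layer, (r, t) in enumerate(zip(tok_r, tok_t)):
            flips[layer] += int(r != t)  # flip = top-1 ids differ here
    return sum(flips) / (n_tok * n_layer), max(flips) / n_tok
```

Extending this to routing-weight disagreement would only require carrying the gate weights alongside the expert ids, which is why the weight signal is a natural follow-up.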
How to read each tile. The big percentage is the mean top-1 router_flip_rate across every (response_token, MoE_layer) position in that cell. The sub-label decodes as: n=N run(s) is the number of probe-bearing assistant turns the cell averages over (one tool-calling Hermes-Agent task usually contributes 2–4 turns, so n=19 ≈ 6 tasks × 3 turns avg); 703 tok is the sum of response tokens across those turns (the cell's denominator for the % above); worst layer 33.3% is the single MoE layer with the highest flip rate in this cell — a sanity check that the mean isn't hiding a wildly out-of-band layer.
Qwen3-30B-A3B router trace: per-(token, layer) top-1 flip rate and top-k set disagreement, FP8 vs bf16.
Click through to inspect one captured Hermes-Agent trajectory turn-by-turn — the prompt, the response tokens vLLM generated, and the tool calls embedded in the response. The training-inference mismatch on these tokens is what the matrices above measure.