Modulum cross-stack signal framework · what does the platform actually do?

What Modulum does that vanilla Gemma-4 — and frontier — don't.

A signal-by-signal comparison of Modulum, vanilla Gemma-4-31B-Q4 (same weights, no platform), and 5 current-generation frontier products at 128k context on BABILong qa1 / qa2 / qa3. Goes beyond accuracy: output-length distribution, pure-hallucination rate, refusal behavior, decay slope, decode speed, error rate. The goal is to identify what Modulum is doing differently at the data level.

The single most important finding

Modulum reduces pure-hallucination rate by 4.7× vs vanilla on qa1 128k (12.3 % vs 57.9 % of wrong answers) — without adding a refusal mechanism. The platform structurally constrains output to canonical format, which truncates fabrication.

When wrong, Modulum still commits to a canonical-format wrong location ("Mary is in the kitchen" when target was "bedroom") at the same rate as Opus 4.6. What Modulum eliminates is the narrative-fabrication failure mode that vanilla Gemma-4 exhibits at long context — long made-up biographies of distractor characters from PG19 noise. Vanilla's median wrong-answer length at qa1 128k is 86 chars (max 500); Modulum's is 47 chars (max ~75). This is the production-relevant finding for hyperscalers: same base weights, but Hypernym's platform layer suppresses the fabrication failure mode.

Every measurable signal, all 7 stacks side-by-side.

Every column is computed from the canonical CSV. Modulum highlighted in terracotta; vanilla in muted italic so the apples-to-apples comparison is visible. Frontier API stacks (Opus 4.6/4.7, GPT-5.5, Gemini 3.1 Pro, Grok 4.3) don't expose decode timings, so those cells are dash-marked.

qa1 retrieval @ 128k

Stack | Acc | N | Wrong-output median chars | Pure halluc % of wrong | Refusal % of wrong | Decode tok/s | Errors
Claude Opus 4.6 | 96.0 % | 50 | 8 | 0 % | 100 % | - | 0
GPT-5.5 | 96.0 % | 50 | 45 | 50 % | 50 % | - | 0
Claude Opus 4.7 | 92.0 % | 50 | 22 | 25 % | 75 % | - | 0
Gemini 3.1 Pro | 84.0 % | 50 | 52 | 25 % | 25 % | - | 0
Modulum (Gemma-4-31B-Q4 + platform) | 71.5 % | 200 | 47 | 12.3 % | 10.5 % | 37.1 | 3
Vanilla Gemma-4-31B-Q4 (no platform) | 62.0 % | 50 | 86 | 57.9 % | 10.5 % | 35.9 | 0
Grok 4.3 | 30.0 % | 50 | 52 | 5.7 % | 11.4 % | - | 1

Read: Modulum is mid-pack on accuracy, and the platform's effect shows up against vanilla rather than against frontier: wrong-answer outputs are 47 chars median versus 86 for vanilla, and the pure-hallucination rate is 12.3 % versus 57.9 %. The platform layer delivers that roughly 45 pp drop in pure hallucination without adding refusal capability. Note that Opus 4.6 is the strongest stack in this table but operates at hyperscaler scale; Modulum competes on the long-context fabrication-suppression axis at workstation scale.

qa2 2-fact reasoning @ 128k

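Stack | Acc | N | Wrong-output chars | Pure halluc % | Refusal % | Decode tok/s
GPT-5.5 | 92.0 % | 50 | 27 | 25 % | 25 % | -
Claude Opus 4.6 | 90.0 % | 50 | 5 | 0 % | 100 % | -
Gemini 3.1 Pro | 72.0 % | 50 | 48 | 14 % | 7 % | -
Claude Opus 4.7 | 66.0 % | 50 | 24 | 12 % | 6 % | -
Modulum | 39.5 % | 200 | 32 | 31.4 % | 0 % | 32.7
Vanilla Gemma-4 | 30.0 % | 50 | 64 | 65.7 % | 2.9 % | 34.9
Grok 4.3 | 18.0 % | 50 | 31 | 0 % | 0 % | -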

qa3 3-fact temporal reasoning @ 128k

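Stack | Acc | N | Wrong-output chars | Pure halluc % | Refusal % | Decode tok/s
Claude Opus 4.6 | 80.0 % | 50 | 14 | 0 % | 10 % | -
GPT-5.5 | 64.0 % | 50 | 48 | 5.5 % | 5.5 % | -
Gemini 3.1 Pro | 40.8 % | 500 | 53 | 11 % | 0.7 % | -
Claude Opus 4.7 | 38.0 % | 50 | 19 | 3 % | 3 % | -
Modulum | 27.0 % | 500 | 48 | 0 % | 0 % | 40.2
Vanilla Gemma-4 | 20.0 % | 50 | 90 | 12.5 % | 2.5 % | 34.4
Grok 4.3 | 15.4 % | 26 | 85 | 4.5 % | 9 % | -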
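
Sample sizes in these tables range from N = 26 to N = 500, so the accuracy cells carry very different uncertainty. The Wilson 95 % CI listed among the measured signals can be sketched as below. This is a minimal illustration of the standard Wilson score interval, not the report's own code; the function name and the correct-answer counts (143 of 200 and 31 of 50, back-computed from the qa1 accuracy and N columns) are assumptions.

```python
from math import sqrt


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for k successes out of n trials.

    z = 1.96 corresponds to a two-sided 95 % interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Counts back-computed from the qa1 @ 128k table (assumption: acc * N is exact).
for name, k, n in [("Modulum", 143, 200), ("Vanilla", 31, 50)]:
    lo, hi = wilson_interval(k, n)
    print(f"{name}: {k}/{n} -> [{lo:.3f}, {hi:.3f}]")
```

Under these assumed counts the two qa1 intervals overlap, which is one reason the comparison leans on the hallucination and output-length signals rather than on accuracy alone.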

Six derived findings.

Finding 01 · load-bearing

Modulum suppresses the fabrication failure mode that vanilla Gemma-4 exhibits at long context.

Vanilla's wrong-answer median output is 86 chars at qa1 128k (max 500); Modulum's is 47 chars. The reduction in pure-hallucination rate is 4.7× (12.3 % vs 57.9 %), 2.1× (31.4 % vs 65.7 %), and ∞× (0 % vs 12.5 %) on qa1/qa2/qa3 respectively. Same base weights. Different inference stack. The platform appears to enforce canonical output format, which truncates fabricated narratives.
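
The reduction factors quoted above are plain ratios of the table cells. The sketch below recomputes them, treating a zero Modulum rate as an infinite ratio; the dictionary layout and function names are illustrative only.

```python
import math

# Pure-hallucination rate as % of wrong answers at 128k, copied from the tables.
RATES = {
    "qa1": {"vanilla": 57.9, "modulum": 12.3},
    "qa2": {"vanilla": 65.7, "modulum": 31.4},
    "qa3": {"vanilla": 12.5, "modulum": 0.0},
}


def reduction_factor(vanilla_pct: float, modulum_pct: float) -> float:
    """Ratio vanilla / Modulum; infinite when Modulum's rate is zero."""
    if modulum_pct == 0:
        return math.inf
    return vanilla_pct / modulum_pct


def drop_pp(vanilla_pct: float, modulum_pct: float) -> float:
    """Absolute drop in percentage points."""
    return vanilla_pct - modulum_pct


for task, r in RATES.items():
    factor = reduction_factor(r["vanilla"], r["modulum"])
    drop = drop_pp(r["vanilla"], r["modulum"])
    print(f"{task}: {factor:.1f}x reduction, {drop:.1f} pp drop")
```

The percentage-point drop is worth reading next to the ratio: the qa3 ratio is unbounded only because vanilla's 12.5 % falls to zero, a 12.5 pp change, whereas qa1 moves by about 45.6 pp.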

Finding 02 · win

Modulum decodes 17–22 % faster than vanilla on qa3.

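40.2 vs 34.4 tok/s at 128k (+17 %); same pattern at 32k and 64k. The speedup comes at zero accuracy cost. On qa1 decode is +3 % (37.1 vs 35.9 tok/s, essentially flat) on the clean phase-1 retry data, and on qa2 the 128k table shows Modulum slightly slower (32.7 vs 34.9 tok/s), so the gain is specific to qa3. Modulum's structurally shorter outputs concentrate decode-time compute.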
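
For reference, the relative decode speed can be recomputed from the 128k decode columns. This is a small sketch with an illustrative function name; only the 128k cells are available here, so the upper end of the 17–22 % range (from the 32k and 64k runs) is not reproduced.

```python
def speedup_pct(platform_tok_s: float, baseline_tok_s: float) -> float:
    """Relative decode-speed change of the platform over the baseline, in %."""
    return (platform_tok_s / baseline_tok_s - 1.0) * 100.0


# (Modulum, vanilla) decode tok/s at 128k, copied from the three tables.
DECODE_128K = {"qa1": (37.1, 35.9), "qa2": (32.7, 34.9), "qa3": (40.2, 34.4)}

for task, (modulum, vanilla) in DECODE_128K.items():
    print(f"{task}: {speedup_pct(modulum, vanilla):+.1f} %")
```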

Finding 03 · win

Modulum is tied for the flattest qa3 decay slope of any tested stack (−2.5 pp / doubling of context).

Opus 4.6 is −4.0, GPT-5.5 −9.0, Gemini −7.6, Grok −8.2. Modulum's qa3 slope ties with Opus 4.7 (38.0 % at 128k, well below Opus 4.6's 80.0 %). The platform preserves multi-fact reasoning state across length at least as well as any frontier stack we tested.
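
The slope metric is described in the signal list as OLS on cells, in pp per doubling of context. A minimal sketch of that computation follows, assuming accuracy is regressed on log2 of the context length. The accuracy values used here are placeholders chosen to give a clean slope, not measurements from this report.

```python
from math import log2


def decay_slope(context_tokens: list[int], accuracy_pct: list[float]) -> float:
    """OLS slope of accuracy (pp) against log2(context), i.e. pp per doubling."""
    xs = [log2(c) for c in context_tokens]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracy_pct) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy_pct))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Placeholder accuracies at 32k / 64k / 128k (hypothetical, for illustration).
contexts = [32_000, 64_000, 128_000]
placeholder_acc = [50.0, 47.5, 45.0]
print(round(decay_slope(contexts, placeholder_acc), 2))  # -2.5
```

With only three context lengths per stack the fit has a single residual degree of freedom, so slopes that differ by a point or two should be read with that in mind.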

Finding 04 · neutral

Modulum does NOT add refusal capability.

Refusal rate on wrong answers is 0–10.5 % across tasks at 128k, very similar to vanilla (2.5–10.5 %). Opus 4.6 refuses on 100 % of its qa1 128k wrong answers, which is the gold standard. Modulum suppresses fabrication by shortening output, not by teaching the model to say "I don't know." A production routing layer for human handoff is a future R&D direction.

Finding 05 · loss

Modulum trails the two strongest frontier stacks on absolute accuracy by 38–43 pp at 128k.

Opus 4.6 averages 88.7 % and GPT-5.5 84.0 % across qa1/qa2/qa3, vs Modulum's 46.0 %. The base model is a 31B-Q4 open-weight Gemma-4, vastly smaller than the proprietary FP16 hyperscaler-served frontier. The value of Modulum is not parity; it is workstation-scale deployment of a long-context-stable, fabrication-suppressed inference stack.
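
The averages behind this gap are unweighted means over the three 128k accuracy cells. The sketch below recomputes them; note that it ignores the differing N per cell, which is an assumption about how the averages were formed.

```python
# Accuracy (%) at 128k for qa1, qa2, qa3, copied from the tables above.
ACC_128K = {
    "Claude Opus 4.6": [96.0, 90.0, 80.0],
    "GPT-5.5": [96.0, 92.0, 64.0],
    "Modulum": [71.5, 39.5, 27.0],
}


def mean(values: list[float]) -> float:
    return sum(values) / len(values)


def gap_pp(stack: str, reference: str = "Modulum") -> float:
    """Unweighted mean-accuracy gap between a stack and the reference, in pp."""
    return mean(ACC_128K[stack]) - mean(ACC_128K[reference])


for stack in ("Claude Opus 4.6", "GPT-5.5"):
    print(f"{stack}: {mean(ACC_128K[stack]):.1f} % avg, "
          f"{gap_pp(stack):.1f} pp ahead of Modulum")
```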

Finding 06 · loss

3 errors on qa1 128k (the 503-storm retry survivors).

Modulum has the most errors of any stack at qa1 128k (3, vs 1 for Grok 4.3 and 0 for the rest). The single-slot demo backend can't absorb sustained sequential 128k requests without throttling. Production deployment needs queue-aware retry orchestration as a first-class server feature.
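
On the client side, the simplest form of retry orchestration is capped exponential backoff on throttling responses. The sketch below is one possible shape, not the harness used for these runs: `send_with_retry`, its parameters, and the choice of retryable statuses are all hypothetical, and `send` stands in for whatever function issues the request.

```python
import random
import time

RETRYABLE_STATUSES = {429, 503}  # assumed set; the report only mentions 503s


def send_with_retry(send, max_attempts=5, base_delay_s=1.0, max_delay_s=60.0,
                    sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send: zero-argument callable returning (http_status, body).
    Returns (http_status, body, attempts_used).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status not in RETRYABLE_STATUSES or attempt == max_attempts:
            return status, body, attempt
        # Exponential backoff (1x, 2x, 4x, ...) capped at max_delay_s,
        # plus up to 10 % jitter so queued clients do not retry in lockstep.
        delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
        sleep(delay + random.uniform(0.0, 0.1 * delay))


# Simulated backend: two 503s, then success. Sleeping is stubbed out.
responses = iter([(503, "busy"), (503, "busy"), (200, "ok")])
print(send_with_retry(lambda: next(responses), sleep=lambda s: None))
```

A server-side queue would replace the blind backoff with an explicit wait signal, but a client loop like this is enough to keep throttled requests from being counted as hard errors.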

What we measure across stacks, and what only some stacks expose.

Signal | Source | Modulum | Vanilla | Anthropic | OpenAI | Google | xAI
Accuracy (correct / N) | substring match on target
Wilson 95 % CI | derived from N, k
Decay slope (pp / 2× ctx) | OLS on cells
Output text (full) | API response
Output-length distribution | len(output) by cell
Refusal rate (keyword classifier) | regex on output
Pure hallucination rate | derived: wrong AND no valid loc AND no refusal
Wall-clock latency | client total_ms
Within-run drift (terciles) | sample_idx order
Error rate / failure modes | http_status, error text
Prefill / decode tokens/sec | llama.cpp timings block
Per-token logprob / PPL | logprobs flag | ✓ (phase-4 N=20) | limited | limited | limited | limited | limited
Needle-NOT-in-haystack refusal | custom hallucination probe (in flight) | queued | queued | blocked: credit | running | running | running

9 of 13 signals are universally comparable across all 7 stacks. Only the llama.cpp-specific timings (prefill/decode tok/s) and logprob detail are Modulum/Vanilla-exclusive — frontier APIs don't expose them. This is enough to do apples-to-apples cross-stack analysis on every behavioral axis except internal compute timings.
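
The three behavioral labels in the table (substring-match accuracy, keyword refusal, derived pure hallucination) can be sketched as one classifier. Everything specific here is an assumption: the location vocabulary is the usual bAbI set, the refusal patterns are examples rather than the regex actually used, and the precedence between refusal and location mentions is a choice this sketch makes.

```python
import re

# Assumed bAbI location vocabulary; the report does not list it.
LOCATIONS = ("bathroom", "bedroom", "garden", "hallway", "kitchen", "office")

# Example refusal phrases; the report's actual keyword list is not given.
REFUSAL_RE = re.compile(
    r"\b(i don't know|i do not know|cannot determine|no information|"
    r"not (?:mentioned|stated|specified)|unknown)\b",
    re.IGNORECASE,
)


def classify(output: str, target: str) -> str:
    """Label one model output against the target location."""
    text = output.lower()
    if target.lower() in text:           # accuracy: substring match on target
        return "correct"
    if REFUSAL_RE.search(output):        # refusal: keyword regex on output
        return "refusal"
    if any(loc in text for loc in LOCATIONS):
        return "wrong_location"          # canonical-format wrong answer
    return "pure_hallucination"          # wrong, no valid loc, no refusal


def share_of_wrong(labels: list[str], label: str) -> float:
    """Percentage of wrong answers carrying the given label."""
    wrong = [l for l in labels if l != "correct"]
    return 100.0 * wrong.count(label) / len(wrong) if wrong else 0.0


samples = [
    ("Mary is in the bedroom.", "bedroom"),
    ("Mary is in the kitchen", "bedroom"),
    ("I don't know.", "bedroom"),
    ("Mary was born in 1842 and travelled widely.", "bedroom"),
]
labels = [classify(out, tgt) for out, tgt in samples]
print(labels, share_of_wrong(labels, "pure_hallucination"))
```

Because the classifier only needs the output text and the target, it runs identically on all 7 stacks, which is what makes these signals comparable across local and API-served models.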

The signals we'd need to verify Hypernym's full claim set.

Claim | What it would require | Status
"−14.18 % cleaner than F16" (PPL drops vs F16 baseline) | Run Gemma-4-31B-it FP16 unquantized on same prompts. Needs 80 GB GPU + 1 day setup. | Not measured
"Effective infinite context in fixed memory" | Modulum endpoint provisioned beyond 128k. | Endpoint capped at 128k by Hypernym
"3 corpora · 7 context lengths · 38 measurements" | Add LongBench / RULER beyond BABILong; extend below 32k. | 1 corpus (BABILong) at 3 lengths × 3 tasks
"Zero speed cost" vs F16 | F16 baseline. We have Modulum vs vanilla Q4 = +17–22 % faster on qa3. | Partial: Q4 reference, not F16
"Computes on cleaner data than F16" | Attention entropy / KV state introspection + F16 PPL baseline. | Indirect support only