Modulum cross-stack signal framework · what does the platform actually do?

What Modulum does that vanilla Gemma-4 — and frontier — don't.

A signal-by-signal comparison of Modulum, vanilla Gemma-4-31B-Q4 (same weights, no platform), and 5 current-generation frontier products (Claude Opus 4.6 and 4.7, GPT-5.5, Gemini 3.1 Pro, Grok 4.3) at 128k context on BABILong qa1 / qa2 / qa3. The comparison goes beyond accuracy to cover output-length distribution, pure-hallucination rate, refusal behavior, decay slope, decode speed, and error rate. The goal is to identify what Modulum is doing differently at the data level.

The single most important finding

Modulum reduces pure-hallucination rate by 4.7× vs vanilla on qa1 128k (12.3 % vs 57.9 % of wrong answers) — without adding a refusal mechanism. The platform structurally constrains output to canonical format, which truncates fabrication.

When wrong, Modulum still commits to a canonical-format wrong location ("Mary is in the kitchen" when target was "bedroom") at the same rate as Opus 4.6. What Modulum eliminates is the narrative-fabrication failure mode that vanilla Gemma-4 exhibits at long context — long made-up biographies of distractor characters from PG19 noise. Vanilla's median wrong-answer length at qa1 128k is 86 chars (max 500); Modulum's is 47 chars (max ~75). This is the production-relevant finding for hyperscalers: same base weights, but Hypernym's platform layer suppresses the fabrication failure mode.

01 · Cross-stack signals at 128k context

Every measurable signal, all 7 stacks side-by-side.

Every column is computed from the canonical CSV. Modulum highlighted in terracotta; vanilla in muted italic so the apples-to-apples comparison is visible. Frontier API stacks (Opus 4.6/4.7, GPT-5.5, Gemini 3.1 Pro, Grok 4.3) don't expose decode timings, so those cells are dash-marked.

qa1 retrieval @ 128k

Stack	Acc	N	Wrong-output median chars	Pure halluc % of wrong	Refusal % of wrong	Decode tok/s	Errors
Claude Opus 4.6	96.0 %	50	8	0 %	100 %	—	0
GPT-5.5	96.0 %	50	45	50 %	50 %	—	0
Claude Opus 4.7	92.0 %	50	22	25 %	75 %	—	0
Gemini 3.1 Pro	84.0 %	50	52	25 %	25 %	—	0
Modulum (Gemma-4-31B-Q4 + platform)	71.5 %	200	47	12.3 %	10.5 %	37.1	3
Vanilla Gemma-4-31B-Q4 (no platform)	62.0 %	50	86	57.9 %	10.5 %	35.9	0
Grok 4.3	30.0 %	50	52	5.7 %	11.4 %	—	1

Read: Modulum is mid-pack on accuracy (5th of 7) and on output discipline: its wrong-answer outputs are 47 chars median (4th shortest of 7, about half of vanilla's 86), and its pure-hallucination rate is 12.3 % (3rd lowest of 7). The platform layer delivers a 45.6 pp drop in pure hallucination vs vanilla without adding refusal capability. Note that Opus 4.6 leads or ties on every qa1 column but operates at hyperscaler scale; Modulum competes on the long-context fabrication-suppression axis at workstation scale.
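The framework section lists a Wilson 95 % CI derived from N and k. A minimal sketch of that interval for the two Gemma-4 rows in the qa1 table, assuming the standard Wilson score formula with z = 1.96. The function name `wilson_ci` is illustrative, and the correct counts are back-computed from the reported accuracy and N (143 of 200 for Modulum, 31 of 50 for vanilla) rather than read from the canonical CSV:

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for k successes out of n trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Counts back-computed from the qa1 table: 71.5 % of 200 and 62.0 % of 50.
modulum = wilson_ci(143, 200)
vanilla = wilson_ci(31, 50)
print(f"Modulum: {modulum[0]:.3f} to {modulum[1]:.3f}")
print(f"Vanilla: {vanilla[0]:.3f} to {vanilla[1]:.3f}")
```

With N = 50, the vanilla interval comes out roughly twice as wide as Modulum's at N = 200, which is worth keeping in mind when comparing the two accuracy figures.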

qa2 2-fact reasoning @ 128k

Stack	Acc	N	Wrong-output median chars	Pure halluc % of wrong	Refusal % of wrong	Decode tok/s
GPT-5.5	92.0 %	50	27	25 %	25 %	—
Claude Opus 4.6	90.0 %	50	5	0 %	100 %	—
Gemini 3.1 Pro	72.0 %	50	48	14 %	7 %	—
Claude Opus 4.7	66.0 %	50	24	12 %	6 %	—
Modulum	39.5 %	200	32	31.4 %	0 %	32.7
Vanilla Gemma-4	30.0 %	50	64	65.7 %	2.9 %	34.9
Grok 4.3	18.0 %	50	31	0 %	0 %	—

qa3 3-fact temporal reasoning @ 128k

Stack	Acc	N	Wrong-output median chars	Pure halluc % of wrong	Refusal % of wrong	Decode tok/s
Claude Opus 4.6	80.0 %	50	14	0 %	10 %	—
GPT-5.5	64.0 %	50	48	5.5 %	5.5 %	—
Gemini 3.1 Pro	40.8 %	500	53	11 %	0.7 %	—
Claude Opus 4.7	38.0 %	50	19	3 %	3 %	—
Modulum	27.0 %	500	48	0 %	0 %	40.2
Vanilla Gemma-4	20.0 %	50	90	12.5 %	2.5 %	34.4
Grok 4.3	15.4 %	26	85	4.5 %	9 %	—

02 · What the cross-stack data says Modulum does

Six derived findings.

Finding 01 · load-bearing

Modulum suppresses the fabrication failure mode that vanilla Gemma-4 exhibits at long context.

Vanilla's wrong-answer median output is 86 chars at qa1 128k (max 500); Modulum's is 47 chars. The pure-hallucination rate falls 4.7× on qa1 (57.9 % to 12.3 %), 2.1× on qa2 (65.7 % to 31.4 %), and to zero on qa3 (12.5 % to 0 %). Same base weights, different inference stack. The platform appears to enforce canonical output format, which truncates fabricated narratives.
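The reduction factors quoted above are plain ratios of the vanilla and Modulum pure-hallucination rates. A short sketch that reproduces them from the table values; the `reduction_factor` helper is illustrative, and it returns `None` where Modulum's rate is zero, since the ratio is undefined there:

```python
from typing import Optional

# Pure-hallucination rate as % of wrong answers at 128k, copied from the tables:
# (Modulum, vanilla) per task.
RATES = {
    "qa1": (12.3, 57.9),
    "qa2": (31.4, 65.7),
    "qa3": (0.0, 12.5),
}

def reduction_factor(modulum: float, vanilla: float) -> Optional[float]:
    """Vanilla rate divided by Modulum rate; None when Modulum's rate is zero."""
    if modulum == 0:
        return None
    return vanilla / modulum

for task, (modulum, vanilla) in RATES.items():
    factor = reduction_factor(modulum, vanilla)
    drop_pp = vanilla - modulum
    label = "undefined (Modulum at 0 %)" if factor is None else f"{factor:.1f}x"
    print(f"{task}: {label}, absolute drop {drop_pp:.1f} pp")
```

Reporting the absolute drop in percentage points alongside the ratio keeps the qa3 case readable, where the ratio alone has no finite value.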

Finding 02 · win

Modulum decodes 17–22 % faster than vanilla on qa3.

40.2 vs 34.4 tok/s at 128k (+17 %); same pattern at 32k and 64k. Decode speedup at zero accuracy cost. The qa1 decode is +3 % (37.1 vs 35.9 tok/s, essentially flat) on the clean phase-1 retry data, and qa2 at 128k is 6 % slower (32.7 vs 34.9 tok/s), so the speedup is specific to qa3. Modulum's structurally shorter outputs concentrate decode-time compute.

Finding 03 · win

Modulum ties for the flattest qa3 decay slope of any tested stack (−2.5 pp / doubling of context).

Opus 4.6 is −4.0, GPT-5.5 −9.0, Gemini −7.6, Grok −8.2. Modulum's qa3 slope ties with Opus 4.7, which sits at 38.0 % at 128k against Opus 4.6's 80.0 %. The platform preserves multi-fact reasoning state across length at least as well as any other stack we tested.
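The framework section describes decay slope as an OLS fit over cells, in pp per doubling of context. A minimal sketch of such a fit, assuming accuracy is regressed on log2 of the context length. The `decay_slope` name is hypothetical, and the accuracies below are placeholder values chosen for illustration, not figures from the canonical CSV:

```python
import math

def decay_slope(cells: dict[int, float]) -> float:
    """OLS slope of accuracy (in %) against log2(context length in tokens)."""
    if len(cells) < 2:
        raise ValueError("need at least two context lengths")
    xs = [math.log2(ctx) for ctx in cells]
    ys = list(cells.values())
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx

# Placeholder accuracies (%), NOT measured values: a 4 pp loss per doubling.
example = {32_000: 50.0, 64_000: 46.0, 128_000: 42.0}
print(f"{decay_slope(example):+.1f} pp per doubling")
```

Because the x-axis is log2 of context length, the slope reads directly as percentage points lost each time the context doubles. With only three lengths per task, the fit is sensitive to any single cell.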

Finding 04 · neutral

Modulum does NOT add refusal capability.

Refusal rate on wrong answers is 0–10.5 % across tasks at 128k, very similar to vanilla (2.5–10.5 %). Opus 4.6 refuses on 100 % of its qa1 128k wrong answers, which is the gold standard. Modulum suppresses fabrication by shortening output, not by teaching the model to say "I don't know." A production routing layer for human handoff is a future R&D direction.

Finding 05 · loss

Modulum trails the top frontier stacks on average accuracy by 38–43 pp at 128k.

Averaged over qa1 to qa3, Opus 4.6 scores 88.7 % and GPT-5.5 84.0 %, vs 46.0 % for Modulum. The base model is a 31B-Q4 open-weight Gemma-4, vastly smaller than the proprietary FP16 hyperscaler-served frontier. The value of Modulum is not parity; it is workstation-scale deployment of a long-context-stable, fabrication-suppressed inference stack.
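The averages above follow from the three 128k accuracy tables. A short sketch of that arithmetic, using accuracies copied from those tables. Reading the 38–43 pp range as the gap to GPT-5.5 and to Opus 4.6 respectively is an inference from the numbers, not something the source states:

```python
# Accuracy (%) at 128k on qa1, qa2, qa3, copied from the tables above.
ACC = {
    "Claude Opus 4.6": (96.0, 90.0, 80.0),
    "GPT-5.5": (96.0, 92.0, 64.0),
    "Modulum": (71.5, 39.5, 27.0),
}

def mean(values):
    """Unweighted mean across the three tasks."""
    return sum(values) / len(values)

averages = {stack: mean(vals) for stack, vals in ACC.items()}
for stack, avg in averages.items():
    gap = avg - averages["Modulum"]
    print(f"{stack}: {avg:.1f} % avg, {gap:+.1f} pp vs Modulum")
```

The mean is unweighted across tasks, so the differing sample sizes (N = 50, 200, or 500 per cell) do not enter into it.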

Finding 06 · loss

3 errors on qa1 128k (the 503-storm retry survivors).

Modulum has the most errors at qa1 128k: 3, against 1 for Grok 4.3 and 0 for every other stack. A single-slot demo backend can't absorb sustained sequential 128k requests without throttling. Production deployment needs queue-aware retry orchestration as a first-class server feature.
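One way to picture retry handling on the client side is exponential backoff on HTTP 503. A minimal sketch, assuming a caller-supplied `send` function that returns an HTTP status and a body. The names `send` and `call_with_backoff` and the delay values are all hypothetical; they do not describe Modulum's server or the harness used for these runs:

```python
import time
from typing import Callable, Tuple

def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-503 status or attempts run out.

    Waits base_delay_s * 2**attempt between tries, so a single-slot backend
    gets progressively more time to drain its queue.
    """
    status, body = 0, ""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay_s * (2 ** attempt))
    return status, body  # still 503 after max_attempts: record as an error

# Usage with a fake backend that is busy twice, then answers.
responses = iter([(503, "busy"), (503, "busy"), (200, "bedroom")])
waits = []
result = call_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

Injecting `sleep` keeps the sketch testable without real delays. A server-side queue would remove the need for client backoff entirely, which is the direction the finding points to.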

03 · The framework — every signal extractable from canonical data

What we measure across stacks, and what only some stacks expose.

Signal	Source	Modulum	Vanilla	Anthropic	OpenAI	Google	xAI
Accuracy (correct / N)	substring match on target	✓	✓	✓	✓	✓	✓
Wilson 95 % CI	derived from N, k	✓	✓	✓	✓	✓	✓
Decay slope (pp / 2× ctx)	OLS on cells	✓	✓	✓	✓	✓	✓
Output text (full)	API response	✓	✓	✓	✓	✓	✓
Output-length distribution	len(output) by cell	✓	✓	✓	✓	✓	✓
Refusal rate (keyword classifier)	regex on output	✓	✓	✓	✓	✓	✓
Pure hallucination rate	derived: wrong AND no valid loc AND no refusal	✓	✓	✓	✓	✓	✓
Wall-clock latency	client total_ms	✓	✓	✓	✓	✓	✓
Within-run drift (terciles)	sample_idx order	✓	✓	✓	✓	✓	✓
Error rate / failure modes	http_status, error text	✓	✓	✓	✓	✓	✓
Prefill / decode tokens/sec	llama.cpp timings block	✓	✓	✗	✗	✗	✗
Per-token logprob / PPL	logprobs flag	✓ (phase-4 N=20)	limited	limited	limited	limited	limited
Needle-NOT-in-haystack refusal	custom hallucination probe (in flight)	queued	queued	blocked: credit	running	running	running

10 of 13 signals are universally comparable across all 7 stacks. The llama.cpp-specific timings (prefill/decode tok/s) are exclusive to Modulum and vanilla because frontier APIs don't expose them; per-token logprob detail is limited everywhere except Modulum, and the needle-not-in-haystack probe is still in flight. This is enough to do apples-to-apples cross-stack analysis on every behavioral axis except internal compute timings.
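The accuracy, refusal, and pure-hallucination rows in the table can be read as one small classifier applied to each output. A minimal sketch, assuming a fixed location vocabulary and a handful of refusal keywords. `LOCATIONS`, `REFUSAL_RE`, and `classify` are illustrative stand-ins rather than the harness's actual lists or regex, and checking refusal before location is also an assumption:

```python
import re

# Assumed vocabulary and refusal patterns; the real lists are not given in the source.
LOCATIONS = ("bathroom", "bedroom", "garden", "hallway", "kitchen", "office")
REFUSAL_RE = re.compile(
    r"\b(i don't know|cannot determine|not mentioned|no information)\b",
    re.IGNORECASE,
)

def classify(output: str, target: str) -> str:
    """Label one output as correct, refusal, wrong_location, or pure_hallucination."""
    text = output.lower()
    if target.lower() in text:           # accuracy: substring match on target
        return "correct"
    if REFUSAL_RE.search(text):          # refusal: keyword regex on output
        return "refusal"
    if any(loc in text for loc in LOCATIONS):
        return "wrong_location"          # wrong, but names a valid location
    return "pure_hallucination"          # wrong AND no valid loc AND no refusal

print(classify("Mary is in the kitchen", "bedroom"))
```

Substring matching is permissive: an output that lists several locations including the target counts as correct under this sketch, so the pure-hallucination share depends on how strictly the vocabulary and refusal patterns are drawn.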

04 · What we still cannot measure

The signals we'd need to verify Hypernym's full claim set.

Claim	What it would require	Status
"−14.18 % cleaner than F16" (PPL drops vs F16 baseline)	Run Gemma-4-31B-it FP16 unquantized on same prompts. Needs 80 GB GPU + 1 day setup.	Not measured
"Effective infinite context in fixed memory"	Modulum endpoint provisioned beyond 128k.	Endpoint capped at 128k by Hypernym
"3 corpora · 7 context lengths · 38 measurements"	Add LongBench / RULER beyond BABILong; extend below 32k.	1 corpus (BABILong) at 3 lengths × 3 tasks
"Zero speed cost" vs F16	F16 baseline. We have Modulum vs vanilla Q4 = +17–22 % faster on qa3.	Partial — Q4 reference, not F16
"Computes on cleaner data than F16"	Attention entropy / KV state introspection + F16 PPL baseline.	Indirect support only