A signal-by-signal comparison of Modulum, vanilla Gemma-4-31B-Q4 (same weights, no platform), and 4 current-generation frontier products at 128k context on BABILong qa1 / qa2 / qa3. Goes beyond accuracy: output-length distribution, pure-hallucination rate, refusal behavior, decay slope, decode speed, error rate. The goal is to identify what Modulum is doing differently at the data level.
Modulum reduces pure-hallucination rate by 4.7× vs vanilla on qa1 128k (12.3 % vs 57.9 % of wrong answers) — without adding a refusal mechanism. The platform structurally constrains output to canonical format, which truncates fabrication.
When wrong, Modulum still commits to a canonical-format wrong location ("Mary is in the kitchen" when target was "bedroom") at the same rate as Opus 4.6. What Modulum eliminates is the narrative-fabrication failure mode that vanilla Gemma-4 exhibits at long context — long made-up biographies of distractor characters from PG19 noise. Vanilla's median wrong-answer length at qa1 128k is 86 chars (max 500); Modulum's is 47 chars (max ~75). This is the production-relevant finding for hyperscalers: same base weights, but Hypernym's platform layer suppresses the fabrication failure mode.
Every column is computed from the canonical CSV. Modulum highlighted in terracotta; vanilla in muted italic so the apples-to-apples comparison is visible. Frontier API stacks (Opus 4.6/4.7, GPT-5.5, Gemini 3.1 Pro, Grok 4.3) don't expose decode timings, so those cells are dash-marked.
| Stack | Acc | N | Wrong-output median chars | Pure halluc % of wrong | Refusal % of wrong | Decode tok/s | Errors |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 96.0 % | 50 | 8 | 0 % | 100 % | — | 0 |
| GPT-5.5 | 96.0 % | 50 | 45 | 50 % | 50 % | — | 0 |
| Claude Opus 4.7 | 92.0 % | 50 | 22 | 25 % | 75 % | — | 0 |
| Gemini 3.1 Pro | 84.0 % | 50 | 52 | 25 % | 25 % | — | 0 |
| Modulum (Gemma-4-31B-Q4 + platform) | 71.5 % | 200 | 47 | 12.3 % | 10.5 % | 37.1 | 3 |
| Vanilla Gemma-4-31B-Q4 (no platform) | 62.0 % | 50 | 86 | 57.9 % | 10.5 % | 35.9 | 0 |
| Grok 4.3 | 30.0 % | 50 | 52 | 5.7 % | 11.4 % | — | 1 |
Read: Modulum is mid-pack on accuracy but best in class on output-length adherence — its wrong-answer outputs are 47 chars median (5th of 7), and its pure-hallucination rate is 12 % (5th of 7). The platform layer adds a 45 pp drop in pure hallucination vs vanilla without adding refusal capability. Note Opus 4.6 dominates everything but operates at hyperscaler scale; Modulum competes on the long-context fabrication-suppression axis at workstation scale.
| Stack | Acc | N | Wrong-output chars | Pure halluc % | Refusal % | Decode tok/s |
|---|---|---|---|---|---|---|
| GPT-5.5 | 92.0 % | 50 | 27 | 25 % | 25 % | — |
| Claude Opus 4.6 | 90.0 % | 50 | 5 | 0 % | 100 % | — |
| Gemini 3.1 Pro | 72.0 % | 50 | 48 | 14 % | 7 % | — |
| Claude Opus 4.7 | 66.0 % | 50 | 24 | 12 % | 6 % | — |
| Modulum | 39.5 % | 200 | 32 | 31.4 % | 0 % | 32.7 |
| Vanilla Gemma-4 | 30.0 % | 50 | 64 | 65.7 % | 2.9 % | 34.9 |
| Grok 4.3 | 18.0 % | 50 | 31 | 0 % | 0 % | — |
| Stack | Acc | N | Wrong-output chars | Pure halluc % | Refusal % | Decode tok/s |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 80.0 % | 50 | 14 | 0 % | 10 % | — |
| GPT-5.5 | 64.0 % | 50 | 48 | 5.5 % | 5.5 % | — |
| Gemini 3.1 Pro | 40.8 % | 500 | 53 | 11 % | 0.7 % | — |
| Claude Opus 4.7 | 38.0 % | 50 | 19 | 3 % | 3 % | — |
| Modulum | 27.0 % | 500 | 48 | 0 % | 0 % | 40.2 |
| Vanilla Gemma-4 | 20.0 % | 50 | 90 | 12.5 % | 2.5 % | 34.4 |
| Grok 4.3 | 15.4 % | 26 | 85 | 4.5 % | 9 % | — |
Vanilla's wrong-answer median output is 86 chars at qa1 128k (max 500); Modulum's is 47 chars. The reduction in pure-hallucination rate is 4.7× (12.3 % vs 57.9 %), 2.1× (31.4 % vs 65.7 %), and ∞× (0 % vs 12.5 %) on qa1/qa2/qa3 respectively. Same base weights. Different inference stack. The platform appears to enforce canonical output format, which truncates fabricated narratives.
40.2 vs 34.4 tok/s at 128k; same pattern at 32k and 64k. Decode speedup at zero accuracy cost. The qa1 decode is +3 % (essentially flat) on the clean phase-1 retry data. Modulum's structurally shorter outputs concentrate decode-time compute.
Opus 4.6 is −4.0, GPT-5.5 −9.0, Gemini −7.6, Grok −8.2. Modulum's qa3 slope ties with Opus 4.7 (which sits at much lower absolute accuracy). The platform preserves multi-fact reasoning state across length better than any non-Modulum stack we tested.
Refusal rate on wrong answers is 0–10.5 % across tasks at 128k, very similar to vanilla. Opus 4.6 refuses 100 % on qa1 128k wrong answers — that's the gold standard. Modulum suppresses fabrication by shortening output, not by teaching the model to say "I don't know." Production routing layer for human-handoff is a future R&D direction.
Opus 4.6 88.7 % avg vs Modulum 46.0 %. The base model is a 31B-Q4 open-weight Gemma-4 — vastly smaller than the proprietary FP16 hyperscaler-served frontier. The value of Modulum is not parity; it is workstation-scale deployment of a long-context-stable, fabrication-suppressed inference stack.
Modulum is the only stack with non-zero errors at qa1 128k. Single-slot demo backend can't absorb sustained sequential 128k requests without throttling. Production deployment needs queue-aware retry orchestration as a first-class server feature.
| Signal | Source | Modulum | Vanilla | Anthropic | OpenAI | xAI | |
|---|---|---|---|---|---|---|---|
| Accuracy (correct / N) | substring match on target | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Wilson 95 % CI | derived from N, k | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Decay slope (pp / 2× ctx) | OLS on cells | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Output text (full) | API response | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Output-length distribution | len(output) by cell | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Refusal rate (keyword classifier) | regex on output | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Pure hallucination rate | derived: wrong AND no valid loc AND no refusal | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Wall-clock latency | client total_ms | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Within-run drift (terciles) | sample_idx order | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Error rate / failure modes | http_status, error text | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Prefill / decode tokens/sec | llama.cpp timings block | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| Per-token logprob / PPL | logprobs flag | ✓ (phase-4 N=20) | limited | limited | limited | limited | limited |
| Needle-NOT-in-haystack refusal | custom hallucination probe (in flight) | queued | queued | blocked: credit | running | running | running |
9 of 13 signals are universally comparable across all 7 stacks. Only the llama.cpp-specific timings (prefill/decode tok/s) and logprob detail are Modulum/Vanilla-exclusive — frontier APIs don't expose them. This is enough to do apples-to-apples cross-stack analysis on every behavioral axis except internal compute timings.
| Claim | What it would require | Status |
|---|---|---|
| "−14.18 % cleaner than F16" (PPL drops vs F16 baseline) | Run Gemma-4-31B-it FP16 unquantized on same prompts. Needs 80 GB GPU + 1 day setup. | Not measured |
| "Effective infinite context in fixed memory" | Modulum endpoint provisioned beyond 128k. | Endpoint capped at 128k by Hypernym |
| "3 corpora · 7 context lengths · 38 measurements" | Add LongBench / RULER beyond BABILong; extend below 32k. | 1 corpus (BABILong) at 3 lengths × 3 tasks |
| "Zero speed cost" vs F16 | F16 baseline. We have Modulum vs vanilla Q4 = +17–22 % faster on qa3. | Partial — Q4 reference, not F16 |
| "Computes on cleaner data than F16" | Attention entropy / KV state introspection + F16 PPL baseline. | Indirect support only |