# LLM Inference Bottleneck Registry — Seed v0.1

**Maintainer:** Pablo L. Rainieri (rainieripabloluciano@gmail.com)
**License:** CC-BY-4.0 (content) / MIT (tooling)
**Canonical source:** `e:/ASIC_LLM_Project/paper_02_asic_projection/bottleneck_registry.md`

## Schema (v0.2)

Each entry is a YAML-front-matter Markdown block. Schema fields:

```yaml
id:                    # B<n>, stable, never recycled
name:                  # short canonical name
type:                  # physics | architecture | algorithm | stack | supply | ecosystem | measurement | economic
priority:              # 1-5 (5 = gating frontier economics)
difficulty:            # 1-5 (5 = open research frontier)
status:                # open | partial | resolved
estimated_unlock_value: # short string: e.g. "~40× cost/token", "+10× context"
                       # what the field gains if this single bottleneck flips to resolved
depends_on:            # list of B<n> IDs whose resolution conditions or accelerates this one
known_solutions:       # list of {title, citation, maturity}
                       # maturity ∈ {research, partial, production, shipping, philosophical};
                       # hyphenated blends (e.g. research-to-production) are acceptable
blockers:              # list of strings (adoption reasons — be specific)
contributors:          # list of {name, github_handle?, contribution: "author|edit|review"}
last_reviewed:         # ISO date
review_cadence_days:   # int, default 90 — entries past cadence are flagged stale
```

Body: free-form prose, 200-600 words. Should answer: (a) what is the bottleneck, (b) why it binds now, (c) what changes unlock it, (d) who is working on it.

**Migration status (v0.2 complete, 2026-04-22):** All 18 entries are on v0.2 schema. `estimated_unlock_value`, `depends_on`, `contributors`, `review_cadence_days` populated across the registry. Generated dependency graph: B1 ← {B4, B11}; B2 ← {B11}; B5 ← {B11}; B6 ← {B16}; B8 ← {B15}; B12 ← {B2}; B13 ← {B1}; B15 ← {B8}; B16 ← {B6, B4, B8}; B17 ← {B5, B8}; B18 ← {B4}. Future contributors must populate all v0.2 fields.
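
A minimal consumer of the registry, shown below as a sketch: it extracts the YAML blocks, prints the dependency graph, and flags entries past their review cadence. It assumes PyYAML and the file layout above; the path and helper names are illustrative, not part of any official tooling.

```python
# Illustrative sketch, not official tooling: parse the registry, print the
# dependency graph, and flag stale entries. Assumes PyYAML is installed and
# the registry file sits next to the script.
import re
from datetime import date
import yaml

FENCE = "`" * 3  # built indirectly so this snippet can live inside the registry itself

def load_entries(path="bottleneck_registry.md"):
    text = open(path, encoding="utf-8").read()
    blocks = re.findall(FENCE + r"yaml\n(.*?)" + FENCE, text, flags=re.DOTALL)
    parsed = [yaml.safe_load(b) for b in blocks]
    # The first block is the schema template (empty id); real entries have "B<n>" ids.
    return [e for e in parsed if isinstance(e, dict) and str(e.get("id", "")).startswith("B")]

def is_stale(entry, today):
    reviewed = date.fromisoformat(str(entry["last_reviewed"]))
    return (today - reviewed).days > entry.get("review_cadence_days", 90)

if __name__ == "__main__":
    today = date.today()
    for e in load_entries():
        for dep in e.get("depends_on") or []:
            print(f"{e['id']} <- {dep}")
        if is_stale(e, today):
            print(f"{e['id']} is past its review cadence (last reviewed {e['last_reviewed']})")
```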

---

## B1 — Memory-bandwidth wall in decode

```yaml
id: B1
name: Memory-bandwidth wall in decode
type: architecture
priority: 5
difficulty: 3
status: partial
known_solutions:
  - title: SRAM-dominant dataflow (Groq LPU)
    citation: Groq 2025 public disclosure
    maturity: shipping
  - title: Wafer-scale SRAM (Cerebras WSE-3)
    citation: Cerebras 2024 inference launch
    maturity: shipping
  - title: FP4 weight quantization (halves bandwidth demand)
    citation: Ashkboos et al. 2024 (QuaRot); Liu et al. 2024 (SpinQuant)
    maturity: research-to-production
  - title: Sparse MoE activation (reduces active footprint)
    citation: DeepSeek-V3 2024
    maturity: shipping
blockers:
  - CapEx for SRAM-dominant designs
  - FP4 QAT not yet standard for frontier closed-source models
  - MoE adds routing overhead and serving complexity
last_reviewed: 2026-04-22
estimated_unlock_value: "10-50x decode throughput on memory-bound workloads"
depends_on: [B4, B11]
contributors:
  - {name: "Pablo L. Rainieri", contribution: "author"}
review_cadence_days: 90
```

Autoregressive decode streams the active weight set through compute once per token. For dense 70B BF16 this is 140 GB/token; at H100's 3.35 TB/s HBM3 it caps single-stream throughput at ~24 tok/s regardless of FLOPs. Transformer-dedicated silicon with SRAM-dominant dataflow breaks the cap by raising effective bandwidth 10-50×. The physics is solved; the economics of retrofitting the industry to SRAM-first is the binding constraint.
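
The cap quoted above is simple arithmetic; a back-of-envelope sketch (batch size 1, dense weights streamed once per token, numbers taken from the entry):

```python
# Bandwidth-bound decode ceiling for dense 70B BF16 on H100-class HBM3.
params = 70e9            # dense 70B parameters
bytes_per_param = 2      # BF16
hbm_bw = 3.35e12         # H100 HBM3 bandwidth, bytes/s

bytes_per_token = params * bytes_per_param   # 140 GB moved per decoded token
tokens_per_sec = hbm_bw / bytes_per_token    # ~24 tok/s single-stream ceiling
print(f"{bytes_per_token/1e9:.0f} GB/token -> {tokens_per_sec:.1f} tok/s")
```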

---

## B2 — KV-cache memory scaling

```yaml
id: B2
name: KV-cache O(n) memory and O(n²) context cost
type: architecture
priority: 5
difficulty: 3
status: partial
known_solutions:
  - title: Grouped-Query Attention (GQA) / Multi-head Latent Attention (MLA)
    citation: Ainslie et al. 2023; DeepSeek-V2 2024
    maturity: shipping
  - title: KV cache eviction (H2O, StreamingLLM)
    citation: Zhang et al. 2023; Xiao et al. 2024
    maturity: production-partial
  - title: Linear-recurrence hybrids (Mamba, DeltaNet, GLA)
    citation: Gu & Dao 2023; Yang et al. 2024
    maturity: frontier-research
blockers:
  - Aggressive eviction degrades recall on long-context tasks
  - Hybrid SSM architectures not yet proven at Claude/GPT-4 scale
  - Frontier closed models locked to standard attention
last_reviewed: 2026-04-22
estimated_unlock_value: "5-10x context length at fixed memory; or 4-8x concurrent streams"
depends_on: [B11]
contributors:
  - {name: "Pablo L. Rainieri", contribution: "author"}
review_cadence_days: 90
```

KV-cache memory grows linearly with context length; attention compute grows quadratically. At 200K context the cache alone dominates per-request memory, bottlenecking batch size and agent-density economics. MLA (DeepSeek-V2/V3) cuts the cache substantially and is shipping in production; hybrid SSMs promise constant-memory decode but remain unproven at frontier quality. The open question is whether an SSM/attention hybrid can match a pure-attention frontier model at ≥1T parameters.
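
For intuition, a sketch of KV-cache growth under a hypothetical 70B-class GQA configuration (layer count, KV heads, and head dimension are illustrative assumptions, not taken from any cited model):

```python
# KV-cache size per request: 2 tensors (K and V) per layer, per token.
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_el=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_el

for ctx in (8_000, 64_000, 200_000):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx)/1e9:.1f} GB per request")
```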

---

## B3 — MoE router load imbalance

```yaml
id: B3
name: MoE router load imbalance
type: algorithm
priority: 4
difficulty: 2
status: partial
known_solutions:
  - title: Expert-choice routing
    citation: Zhou et al. 2022
    maturity: research-to-production
  - title: Shared-expert designs with balanced auxiliary loss
    citation: DeepSeek-V3 2024
    maturity: shipping
blockers:
  - Model must be pre-trained with MoE-aware objectives
  - Runtime load-balancing across expert-distributed hardware is ops-complex
last_reviewed: 2026-04-22
estimated_unlock_value: "2-3x effective throughput on MoE deployments"
depends_on: []
contributors:
  - {name: "Pablo L. Rainieri", contribution: "author"}
review_cadence_days: 120
```

Mixture-of-Experts routes each token to a subset of experts. Under token-choice routing, heavy-hit experts become serving bottlenecks and light-hit experts waste capacity. Shared-expert + balanced-load designs (DeepSeek-V3) close most of this gap and are reproducible. The blocker is that frontier closed models need to adopt these designs at pre-training; it cannot be retrofitted cheaply.
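
A toy illustration of the imbalance, assuming random skewed gate scores rather than a learned router: token-choice routing lets popular experts overflow, while expert-choice routing caps each expert's load by construction.

```python
# Token-choice vs expert-choice load under a skewed (synthetic) gate distribution.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_experts, top_k = 4096, 8, 2
# Synthetic gate scores with built-in popularity skew across experts.
scores = rng.gumbel(size=(n_tokens, n_experts)) + np.linspace(2, 0, n_experts)

# Token-choice: each token picks its top-k experts; load is whatever falls out.
token_choice_load = np.bincount(
    np.argsort(scores, axis=1)[:, -top_k:].ravel(), minlength=n_experts)

# Expert-choice: each expert picks a fixed-capacity slice of tokens; load is uniform.
capacity = n_tokens * top_k // n_experts
expert_choice_load = np.full(n_experts, capacity)

print("token-choice  :", token_choice_load)
print("expert-choice :", expert_choice_load)
```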

---

## B4 — Quantization degradation below FP4

```yaml
id: B4
name: Quantization degradation below FP4
type: algorithm
priority: 4
difficulty: 3
status: partial
known_solutions:
  - title: Rotational quantization (QuaRot, SpinQuant)
    citation: Ashkboos et al. 2024; Liu et al. 2024
    maturity: research-to-production
  - title: QAT + outlier-aware scaling
    citation: NVIDIA Nemotron 2025 examples
    maturity: partial
  - title: Log-FP4 / BlockFP formats
    citation: Microsoft MX specification 2024
    maturity: standards-emerging
blockers:
  - QAT adds 10-30% training cost
  - Long-tail-task degradation still measurable at FP4
  - Below FP4 (ternary, 1.58-bit) works only for some architectures
last_reviewed: 2026-04-22
estimated_unlock_value: "~2x throughput per bit-halving below FP8 (FP8->FP4->FP2 sequence)"
depends_on: []
contributors:
  - {name: "Pablo L. Rainieri", contribution: "author"}
review_cadence_days: 90
```

FP4 halves memory-bandwidth cost relative to FP8, doubling theoretical throughput in decode. Rotational methods (QuaRot, SpinQuant) bring FP4 within ~1% of BF16 on most benchmarks. The open question is tail behavior: reasoning chains, rare-token coverage, and adversarial inputs still show systematic regression. Below FP4 (BitNet-style ternary) is attractive but restricts architecture choices.
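
The bandwidth arithmetic behind the bit-halving claim, as a sketch (weights only; ignores activations, KV cache, and any accuracy loss):

```python
# Weight footprint and bandwidth-bound decode ceiling vs bit width (dense 70B, H100-class HBM3).
params = 70e9
hbm_bw = 3.35e12  # bytes/s

for fmt, bits in [("BF16", 16), ("FP8", 8), ("FP4", 4), ("ternary (~1.58b)", 1.58)]:
    weight_bytes = params * bits / 8
    print(f"{fmt:>16}: {weight_bytes/1e9:6.1f} GB weights -> {hbm_bw/weight_bytes:6.1f} tok/s ceiling")
```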

---

## B5 — Inter-chip communication for multi-die models

```yaml
id: B5
name: Inter-chip communication for multi-die models
type: architecture
priority: 4
difficulty: 4
status: open
known_solutions:
  - title: NVLink 5 / NVSwitch
    citation: NVIDIA Blackwell platform
    maturity: shipping, proprietary
  - title: CXL 3.x fabrics
    citation: CXL Consortium 2024
    maturity: partial
  - title: Co-packaged optics (Ayar Labs, Lightmatter)
    citation: 2025 industry disclosures
    maturity: early
blockers:
  - Open-fabric standards fragmented
  - Optical I/O cost not yet competitive below hyperscaler volumes
last_reviewed: 2026-04-22
estimated_unlock_value: "Enables single-namespace serving of >1T-param models without latency penalty"
depends_on: [B11]
contributors:
  - {name: "Pablo L. Rainieri", contribution: "author"}
review_cadence_days: 120
```

At ≥1T active parameters a single die cannot hold the working set; tensor/pipeline parallelism requires high-bandwidth, low-latency interconnect. NVIDIA's NVLink stack is shipping but vendor-locked. CXL 3.x offers an open alternative but lags in bandwidth. Co-packaged optics are the likely endgame; the blocker is manufacturing ramp, not physics.
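
A rough sketch of why interconnect bandwidth matters per token: under Megatron-style tensor parallelism, each layer performs two all-reduces in the forward pass. The model configuration and link bandwidth below are assumptions for illustration only.

```python
# Per-token all-reduce cost for tensor-parallel decode (bandwidth term of a ring
# all-reduce, 2(N-1)/N * message_size / link_bw; latency terms excluded).
def allreduce_time(msg_bytes, n_ranks, link_bw_bytes):
    return 2 * (n_ranks - 1) / n_ranks * msg_bytes / link_bw_bytes

hidden, n_layers, n_ranks = 16_384, 120, 8   # hypothetical ~1T-class dense config
link_bw = 900e9                              # assumed per-direction link bandwidth, bytes/s
msg = hidden * 2                             # one token's activations, BF16
per_token = 2 * n_layers * allreduce_time(msg, n_ranks, link_bw)
print(f"{per_token*1e6:.1f} us/token spent in all-reduce")
```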

---

## B6 — Reasoning-length decode tax

```yaml
id: B6
name: Reasoning / chain-of-thought decode-length tax
type: algorithm
priority: 5
difficulty: 4
status: open
known_solutions:
  - title: Speculative decoding (Medusa, EAGLE, n-gram)
    citation: Cai et al. 2024; Li et al. 2024
    maturity: production
  - title: Latent-space reasoning (Coconut, etc.)
    citation: Hao et al. 2024
    maturity: research
  - title: Early exit (CALM, LayerSkip)
    citation: Schuster et al. 2022; Elhoushi et al. 2024
    maturity: research-to-partial
blockers:
  - Reasoning models spend the majority of compute on CoT tokens the user never reads
  - Generalizing speculative decode beyond common patterns is hard
  - Latent-space reasoning loses interpretability
last_reviewed: 2026-04-22
estimated_unlock_value: "2-10x cost reduction on reasoning queries (subsumed by B16 in reasoning-model regime)"
depends_on: [B16]
contributors:
  - {name: "Pablo L. Rainieri", contribution: "author"}
review_cadence_days: 60
```

Frontier reasoning models (o1, Claude reasoning modes) spend the majority of wall-clock and dollar cost on private chain-of-thought tokens. Speculative decoding (Medusa, EAGLE) gives 2-3× on common-case decode. Latent reasoning (Coconut) would in principle cut CoT cost by 10× but gives up interpretability. Early exit (CALM/LayerSkip) is orthogonal to both. The field is pursuing three largely independent directions; none is clearly winning.
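
The common-case ceiling on speculative decoding follows from the standard expected-acceptance formula E = (1 − α^(k+1)) / (1 − α) for per-token acceptance rate α and draft length k; a sketch of why 2-3× is easy and 10× is not:

```python
# Expected tokens accepted per target-model step under speculative decoding.
def expected_tokens(a, k):
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.6, 0.8, 0.9):
    print(f"accept={a}: " + ", ".join(f"k={k}: {expected_tokens(a, k):.2f}" for k in (3, 5, 8)))
```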

---

## B7 — Batch-latency Pareto frontier

```yaml
id: B7
name: Batch-latency Pareto frontier
type: architecture
priority: 3
difficulty: 2
status: partial
known_solutions:
  - title: Disaggregated prefill/decode (DistServe)
    citation: Zhong et al. 2024
    maturity: production-partial
  - title: Continuous batching (vLLM)
    citation: Kwon et al. 2023
    maturity: shipping
blockers:
  - Disaggregation adds ops complexity and cold-start cost
  - Interactive workloads (chat) cannot batch deeply
last_reviewed: 2026-04-22
estimated_unlock_value: "2-4x goodput improvement at fixed SLA"
depends_on: []
contributors:
  - {name: "Pablo L. Rainieri", contribution: "author"}
review_cadence_days: 120
```

Large batches maximize throughput; small batches minimize latency. Disaggregating prefill (compute-bound, batches well) from decode (memory-bound, batches less) lets each run on appropriate hardware. DistServe shows 2-4× end-to-end gains. Adoption is gated by ops complexity; the primitives are known.
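
A toy model of the frontier, assuming decode step time is the maximum of weight-streaming time and batch-proportional compute time (the 1 PFLOP/s effective-compute figure is an assumption, and the weight-streaming number reuses B1's):

```python
# Throughput vs per-step latency as batch size grows (toy roofline for decode).
weight_time = 140e9 / 3.35e12        # s to stream dense-70B BF16 weights once
compute_per_token = 2 * 70e9 / 1e15  # s of compute per token at assumed 1 PFLOP/s effective

for batch in (1, 8, 32, 128, 512):
    step = max(weight_time, batch * compute_per_token)
    print(f"batch={batch:>4}: {batch/step:8.0f} tok/s total, {step*1e3:6.1f} ms per decode step")
```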

---

## B8 — Software stack fragmentation / CUDA lock-in

```yaml
id: B8
name: Software stack fragmentation and CUDA lock-in
type: stack
priority: 5
difficulty: 5
status: open
known_solutions:
  - title: MLIR / OpenXLA / StableHLO
    citation: LLVM foundation; Google 2023+
    maturity: partial
  - title: Triton
    citation: OpenAI 2021
    maturity: shipping
  - title: Vendor-specific compilers (Groq, Cerebras, Etched)
    citation: various
    maturity: siloed
blockers:
  - CUDA network effects (10+ years, millions of engineer-years)
  - Every new ASIC ships a parallel software stack that doesn't share primitives
  - Kernel library ecosystems (FlashAttention, FasterTransformer) are CUDA-first
last_reviewed: 2026-04-22
estimated_unlock_value: "Unlocks competitive entry of non-NVIDIA accelerators; reduces port-time 10-100x"
depends_on: [B15]
contributors:
  - {name: "Pablo L. Rainieri", contribution: "author"}
review_cadence_days: 180
```

Hardware diversity is meaningless without software portability. Every ASIC vendor ships its own compiler, kernel library, and debugging tools; porting a frontier model costs quarters, not weeks. MLIR/OpenXLA are the plausible open baselines but remain incomplete. This is the most important non-physics bottleneck in the list.

---

## B9 — Fab capacity at advanced nodes

```yaml
id: B9
name: Fab capacity at 3 nm and 2 nm
type: supply
priority: 4
difficulty: 5
status: open
known_solutions:
  - title: TSMC N3P / N2 ramps
    citation: TSMC capacity announcements 2025
    maturity: ramping
  - title: Samsung SF2 / Intel 18A as alternatives
    citation: vendor roadmaps
    maturity: early
blockers:
  - Hyperscaler pre-bookings consume most capacity through 2027
  - Geopolitical risk (TSMC concentration)
  - Capital intensity ($20B+ per leading-edge fab)
last_reviewed: 2026-04-22
estimated_unlock_value: "Removes the hard ceiling on hyperscaler compute growth post-2027"
depends_on: []
contributors:
  - {name: "Pablo L. Rainieri", contribution: "author"}
review_cadence_days: 180
```

Leading-edge fab capacity is a hard constraint. TSMC's N3 and N2 nodes are pre-booked through 2027 by NVIDIA, Apple, AMD, and Broadcom's hyperscaler customers. Startups face multi-year wait times and unfavorable economics. Alternatives (Samsung, Intel 18A) are maturing but trail TSMC by 1-2 years. This is a capital and geopolitical problem, not a research one.

---

## B10 — Thermal density at frontier nodes

```yaml
id: B10
name: Thermal density at frontier nodes
type: physics
priority: 3
difficulty: 3
status: partial
known_solutions:
  - title: Direct liquid cooling
    citation: NVIDIA GB200 platform 2024
    maturity: shipping
  - title: Immersion cooling
    citation: Submer, GRC 2023+
    maturity: production
  - title: 3D stacking with thermal vias
    citation: TSMC SoIC / Intel Foveros
    maturity: partial
blockers:
  - Datacenter retrofit cost
  - PUE targets vs. coolant choice
last_reviewed: 2026-04-22
estimated_unlock_value: "Enables 2 kW+ per node deployments at standard datacenter PUE"
depends_on: []
contributors:
  - {name: "Pablo L. Rainieri", contribution: "author"}
review_cadence_days: 180
```

Above 2 kW per node, air cooling is non-viable. Liquid cooling and immersion are shipping but require datacenter retrofit. For dense AI pods, the constraint shifts from compute to power delivery and heat extraction. This is well-known; the blocker is CapEx amortization.

---

## B11 — Architecture ossification risk

```yaml
id: B11
name: Model architecture ossification risk
type: architecture
priority: 4
difficulty: 3
status: open
known_solutions:
  - title: Reconfigurable dataflow (Tenstorrent, SambaNova)
    citation: vendor architectures
    maturity: shipping
  - title: Keep FFN + attention primitives programmable
    citation: Etched Sohu design philosophy (explicit)
    maturity: philosophical
blockers:
  - Hardwiring transformer primitives wins ~3× area efficiency vs. programmable designs
  - If Mamba/RWKV/DeltaNet displaces transformer, hardwired silicon is stranded
last_reviewed: 2026-04-22
estimated_unlock_value: "Preserves silicon optionality across architecture transitions (dense->MoE->SSM->?)"
depends_on: []
contributors:
  - {name: "Pablo L. Rainieri", contribution: "author"}
review_cadence_days: 180
```

An ASIC wins efficiency by hardwiring the computation graph. But if the dominant architecture changes (SSMs, diffusion-LM, something new), hardwired silicon becomes dead inventory. Reconfigurable dataflow (Tenstorrent, SambaNova) gives flexibility at area cost. The field has no consensus on how many years of transformer dominance it is safe to bet silicon on.

---

## B12 — Long-context amplification cost

```yaml
id: B12
name: Long-context prefill and KV amplification
type: algorithm
priority: 4
difficulty: 3
status: partial
known_solutions:
  - title: Sliding-window attention
    citation: Longformer, Mistral
    maturity: shipping
  - title: Chunked prefill
    citation: vLLM, Sarathi
    maturity: shipping
  - title: Retrieval-augmented generation (RAG) as substitute
    citation: 2020+ RAG lineage
    maturity: shipping
blockers:
  - Quality regression at extreme contexts
  - Needle-in-haystack tasks resist windowing
last_reviewed: 2026-04-22
estimated_unlock_value: "Enables 1M+ token context at production cost"
depends_on: [B2]
contributors:
  - {name: "Pablo L. Rainieri", contribution: "author"}
review_cadence_days: 90
```

At 200K+ context, prefill compute and KV memory dominate. Chunked prefill amortizes latency; windowed attention and RAG substitute for full-context attention. Each trades away quality on some class of task. No single approach wins; production serving stacks combine all three.
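
A sketch of why windowing caps memory but not recall, reusing the hypothetical per-token KV figure from the B2 sketch:

```python
# KV memory under full vs sliding-window attention (illustrative config only).
per_token_kv = 2 * 80 * 8 * 128 * 2            # bytes/token, hypothetical 70B-class GQA config
for ctx in (200_000, 1_000_000):
    full = ctx * per_token_kv
    windowed = min(ctx, 8_192) * per_token_kv  # 8K sliding window
    print(f"ctx={ctx:>9}: full {full/1e9:6.1f} GB, windowed {windowed/1e9:5.2f} GB")
```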

---

## B13 — Fine-tune / LoRA serving at scale

```yaml
id: B13
name: Multi-LoRA / fine-tune serving at scale
type: architecture
priority: 3
difficulty: 3
status: partial
known_solutions:
  - title: S-LoRA, Punica, LoRAX
    citation: Sheng et al. 2023; Chen et al. 2024
    maturity: shipping
  - title: Weight paging with dedicated memory tier
    citation: various
    maturity: partial
blockers:
  - SRAM pressure grows linearly with concurrent adapters
  - Per-tenant isolation adds scheduling complexity
last_reviewed: 2026-04-22
estimated_unlock_value: "Per-tenant model customization at hyperscaler unit economics"
depends_on: [B1]
contributors:
  - {name: "Pablo L. Rainieri", contribution: "author"}
review_cadence_days: 120
```

Per-user or per-tenant fine-tunes become economically feasible only if thousands can share base-model serving with per-request LoRA loading. S-LoRA and Punica demonstrate this at GPU scale. On SRAM-dominant silicon, adapter residency becomes the binding constraint.
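
Rough adapter-residency math behind the SRAM-pressure blocker, with hypothetical rank, target-module count, and dimensions (not taken from the cited papers):

```python
# Memory per LoRA adapter: each targeted projection carries two low-rank factors,
# (hidden x r) and (r x hidden), per layer.
def lora_bytes(rank=16, hidden=8_192, n_layers=80, targets_per_layer=4, bytes_per_el=2):
    return n_layers * targets_per_layer * 2 * hidden * rank * bytes_per_el

one = lora_bytes()
print(f"one adapter: {one/1e6:.0f} MB; 1000 concurrent tenants: {1000*one/1e9:.0f} GB")
```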

---

## B14 — Benchmark integrity for inference

```yaml
id: B14
name: Open benchmark integrity for inference
type: measurement
priority: 4
difficulty: 2
status: partial
known_solutions:
  - title: MLPerf Inference (audited)
    citation: MLCommons
    maturity: shipping
  - title: LMSYS Chatbot Arena
    citation: Chiang et al. 2024
    maturity: shipping
  - title: Independent third-party audits (SemiAnalysis)
    citation: ongoing
    maturity: partial
blockers:
  - Vendor benchmark theater (cherry-picked configs)
  - Submission costs exclude smaller teams
last_reviewed: 2026-04-22
estimated_unlock_value: "Restores comparability of vendor claims; reduces market-friction across procurement decisions"
depends_on: []
contributors:
  - {name: "Pablo L. Rainieri", contribution: "author"}
review_cadence_days: 90
```

Public discourse runs on vendor-reported numbers with inconsistent baselines. MLPerf is the closest to rigorous, but submission costs and config space leave room for theater. Independent audits (SemiAnalysis et al.) improve the situation but are selective. A public, community-auditable benchmark layer would raise the signal-to-noise ratio substantially.

---

## B15 — Cross-vendor interchange formats

```yaml
id: B15
name: Cross-vendor interchange formats
type: ecosystem
priority: 5
difficulty: 4
status: open
known_solutions:
  - title: ONNX / ONNX-MLIR
    citation: ONNX Foundation
    maturity: partial
  - title: StableHLO
    citation: OpenXLA 2023
    maturity: partial
  - title: GGUF / GGML quantization format
    citation: llama.cpp ecosystem
    maturity: shipping for inference
blockers:
  - Vendors treat compiler IR as competitive moat
  - Frontier model weights are closed, reducing interchange pressure
last_reviewed: 2026-04-22
estimated_unlock_value: "Engineer-time saving 10-100x on training-to-inference porting"
depends_on: [B8]
contributors:
  - {name: "Pablo L. Rainieri", contribution: "author"}
review_cadence_days: 180
```

Moving a trained model from training hardware to inference silicon should be automatic. It is not. Each vendor's IR, quantization format, and kernel library is siloed. GGUF showed that community-driven interchange works for open-weight quantized inference; an equivalent for the full training-to-inference path would free up vast amounts of engineering time.

---

## B16 — Reasoning / test-time-compute tax

> **Deep dive available:** [Paper 04 — The Reasoning-Tax Crisis](papers/paper_04_reasoning_tax/Paper_04_Reasoning_Tax.pdf) expands this entry with a full economic analysis, three orthogonal solution families, and an open-benchmark proposal.

```yaml
id: B16
name: Reasoning / test-time-compute tax
type: economic
priority: 5
difficulty: 5
status: open
estimated_unlock_value: "~10-100× cost reduction on reasoning queries; enables Claude-class agents at population scale"
depends_on: [B6, B4, B8]
known_solutions:
  - title: Latent-space reasoning (Coconut)
    citation: Hao et al. 2024
    maturity: research
  - title: Speculative decoding for CoT (EAGLE-2, Medusa-2)
    citation: Cai et al. 2024; Li et al. 2024
    maturity: production-partial
  - title: Reasoning caching / memoization
    citation: emerging 2025 systems work
    maturity: research
  - title: Adaptive depth-of-reasoning policies (early-stop CoT)
    citation: open 2026 work, no canonical citation yet
    maturity: research
blockers:
  - Reasoning quality vs. token-budget Pareto is task-dependent
  - Latent reasoning loses interpretability — tension with safety/eval requirements
  - No public benchmark isolates "useful CoT tokens" from "filler CoT tokens"
contributors:
  - {name: "Pablo L. Rainieri", contribution: "author"}
last_reviewed: 2026-04-22
review_cadence_days: 60
```

Frontier reasoning models (o1, o3, Claude reasoning modes) spend the majority of inference cost on private chain-of-thought tokens that the user never sees. Empirically, reasoning queries on AIME-class problems consume 10-100× more tokens than direct-answer queries, scaling per-query cost from ~$0.002 to $0.50-2.00. **This is not a fixed-overhead bug — it is the new cost basis of frontier inference.** Without 10× compression of useful reasoning per token, "agent-per-task" economics remain infeasible at population scale even on dedicated silicon.

Three orthogonal solution families: (a) speculative decoding (cheap, generic, ~2-3×); (b) latent-space reasoning (powerful, ~10×, sacrifices interpretability); (c) adaptive policies that stop CoT once the model self-assesses sufficient confidence (lossy, hard to calibrate). The field has no consensus on which family wins, and serving stacks combining all three remain unbuilt. This is the **most economically consequential open bottleneck** in the registry.
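
The cost basis is just tokens × price; a sketch with assumed prices and token budgets (illustrative figures, not vendor quotes):

```python
# Per-query cost = output tokens x $/Mtok. Prices and token counts are assumptions.
def query_cost(output_tokens, usd_per_mtok):
    return output_tokens * usd_per_mtok / 1e6

print(f"direct answer : ${query_cost(   200, 10.0):.3f}")  # short answer, cheap serving tier
print(f"long CoT      : ${query_cost(30_000, 60.0):.2f}")  # 100x+ tokens at assumed reasoning-tier pricing
```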

---

## B17 — Training / inference hardware bifurcation

```yaml
id: B17
name: Training / inference hardware bifurcation
type: architecture
priority: 4
difficulty: 3
status: open
estimated_unlock_value: "~2-5× cost reduction on inference-only fleet; ~30% on RL rollouts"
depends_on: [B5, B8]
known_solutions:
  - title: Inference-only ASICs with no training path (Etched Sohu, Groq LPU)
    citation: vendor disclosures 2024-2025
    maturity: shipping/early
  - title: Training-grad-capable accelerators with FP8 forward + BF16 backward (NVIDIA Blackwell)
    citation: NVIDIA 2024
    maturity: shipping
  - title: Hybrid serving (training fleet does RLHF/RL rollouts during off-peak)
    citation: hyperscaler practice, undocumented
    maturity: production-internal
blockers:
  - RLHF and RL post-training need fast inference loops, blurring the line
  - Two fleets cost roughly twice the CapEx of one; smaller orgs can't afford bifurcation
  - Operational complexity of moving checkpoints across fleets
contributors:
  - {name: "Pablo L. Rainieri", contribution: "author"}
last_reviewed: 2026-04-22
review_cadence_days: 90
```

Training and inference have diverged in their hardware preferences. Training requires backward-pass support, high-precision gradient accumulation, optimizer state in HBM, and inter-node all-reduce. Inference requires only forward pass, tolerates aggressive quantization (FP4 weights, FP8 activations), and benefits from SRAM residency. An ASIC optimized for one is materially worse at the other. Etched Sohu and Groq LPU are inference-only; NVIDIA Blackwell tries to do both at non-optimal cost.

The complication is **RLHF/RL post-training**: it requires *fast* inference loops to generate rollouts, then training compute to update on them. A pure-inference fleet can't do the training half; a pure-training fleet wastes silicon on the rollout half. Hyperscalers solve this with mixed fleets and orchestration; smaller orgs pay a consolidation tax by running both workloads on one fleet. The open question is whether the rollout/training boundary stabilizes (favoring bifurcation) or further blurs (favoring unified silicon).

---

## B18 — Numerical determinism / reproducibility

```yaml
id: B18
name: Numerical determinism / reproducibility under low-precision
type: measurement
priority: 3
difficulty: 2
status: open
estimated_unlock_value: "Restores reproducible eval/safety/regression testing under FP8/FP4"
depends_on: [B4]
known_solutions:
  - title: Deterministic reduction ordering (fixed kernel scheduling)
    citation: PyTorch deterministic mode; scattered vendor support
    maturity: partial
  - title: Atomically-summed reductions with fixed batch ordering
    citation: open implementations 2024-2025
    maturity: research-to-production
  - title: Higher-precision accumulators with low-precision multiplies
    citation: standard practice; cost ~5-15% throughput
    maturity: shipping
blockers:
  - Vendors prioritize throughput over determinism
  - Multi-GPU/multi-chip non-determinism amplifies under tensor parallelism
  - No widely-adopted determinism benchmark for inference
contributors:
  - {name: "Pablo L. Rainieri", contribution: "author"}
last_reviewed: 2026-04-22
review_cadence_days: 120
```

Under FP8/FP4 with non-deterministic reduction ordering, the same prompt to the same model can produce different outputs across runs — sometimes identical token-stream, sometimes diverging after hundreds of tokens. This silently breaks: (a) regression testing during model development, (b) safety eval reproducibility, (c) forensic incident analysis ("did the model say X or are we mis-reproducing?"), and (d) RL training stability when rollouts vary across replicas.

Solutions exist (deterministic reductions, fixed batch ordering, higher-precision accumulators) at 5-15% throughput cost. Vendors deprioritize them because benchmarks reward throughput, not reproducibility. As model behavior comes under more regulatory and safety scrutiny, the cost-benefit will flip; the registry tracks this bottleneck because it is **systematically under-prioritized despite known solutions** — a textbook coordination failure.
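
The root cause in miniature: floating-point addition is not associative, so reduction order changes the result. A toy demonstration (real divergence comes from kernel scheduling and parallel reduction trees, not explicit shuffling):

```python
# Summing the same float32 values in two different orders gives two different answers.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32) * 1e3

forward  = np.sum(x)                   # one reduction order
shuffled = np.sum(rng.permutation(x))  # same values, different order
print(forward, shuffled, "equal?", forward == shuffled)
```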

---

## Change log

- **2026-04-22, v0.2**: Schema extended (`estimated_unlock_value`, `depends_on`, `contributors`, `review_cadence_days`). Added B16 (reasoning tax), B17 (train/inference bifurcation), B18 (numerical determinism). B1-B15 backfilled to the v0.2 schema in the same pass.
- **2026-04-22, v0.1**: Initial 15-entry seed. Schema stabilized. Priority/difficulty calibrated against public 2025-2026 disclosures.

## Contribution guidelines (draft)

1. New entries: one Markdown section per bottleneck, matching the schema above. Minimum one citation per `known_solutions` entry.
2. Status transitions (open → partial → resolved) must cite evidence.
3. Blockers should be specific (not "more research needed"). If the blocker is social/incentive, say so.
4. `last_reviewed` must be updated when the entry is edited.
5. No advocacy for a single vendor without symmetric treatment of alternatives.
