Open coordination layer for frontier AI infrastructure

LLM Inference Bottleneck Registry

What's blocking 100× cheaper, faster, more abundant AI? A versioned, community-editable list of the real bottlenecks — with known solutions, adoption blockers, and dependency edges. Built so engineers, researchers, and funders can coordinate against the same map.

v0.2 · 18 bottlenecks · last updated 2026-04-22 · CC-BY-4.0

The thesis

Public discourse on AI compute oscillates between two errors: marketing-grade "200× speedup" claims that collapse under audit, and nihilistic "we need $10 trillion in fabs" framings that paralyze coordination. The truth is in between, and it is more actionable than either.

Of the eighteen frontier-inference bottlenecks catalogued here, a consistent pattern emerges: the binding constraint on AI abundance is a shared map, not money or physics. This registry is one attempt at that map.

The companion paper

Bottleneck-Driven Projection of Frontier-Class LLM Inference on Dedicated ASICs  ·  v0.2 (2026-04-22)

What it shows

  • Calibrated baselines from public Groq, Cerebras, NVIDIA MLPerf data — not vendor marketing scalars.
  • Projection of Claude-class MoE inference on a 2027-feasible ASIC, with explicit sensitivity analysis.
  • "Agent density" as a CEO/policymaker-facing economic metric.
  • Eighteen bottlenecks, taxonomized by type and difficulty.
  • Adversarial critique section anticipating ten objections.

Headline numbers (verifiable ranges)

Decode throughput: 10-70× over H100
Energy per token: 50-200×
Cost per million tokens: 20-100×
Agent density per $1M CapEx: 3,000-12,000 streams

Each range is conditional on model size fit, batch regime, and software maturity. Single-scalar comparisons across these metrics are misleading.
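To make the caveat concrete, here is a minimal sketch of how two of the paper's economic metrics compose from underlying quantities. All numbers and function names below are illustrative assumptions, not figures from the paper; the point is only that each metric depends on different inputs, so no single scalar can rank platforms.

```python
def cost_per_million_tokens(chip_cost_usd: float, lifetime_tokens: float) -> float:
    """Amortized hardware cost per 1M decoded tokens (CapEx only; ignores
    power, networking, and utilization — an intentional simplification)."""
    return chip_cost_usd / (lifetime_tokens / 1_000_000)

def agent_density_per_million_capex(streams_per_chip: int, chip_cost_usd: float) -> float:
    """Concurrent decode streams supported per $1M of hardware spend."""
    return streams_per_chip * (1_000_000 / chip_cost_usd)

# Hypothetical ASIC: 256 concurrent streams on a $40k chip (invented figures).
density = agent_density_per_million_capex(streams_per_chip=256, chip_cost_usd=40_000)
cost = cost_per_million_tokens(chip_cost_usd=40_000, lifetime_tokens=2_000_000_000)
print(density, cost)
```

Note that a platform optimized for batch-1 latency could score poorly on agent density while winning on time-to-first-token, which is why the registry reports ranges conditioned on batch regime.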

Read paper (PDF) · Markdown source · BibTeX (.bib)


Companion papers in this series

[Figure] Radar chart of platform capabilities: six-axis platform comparison (log scale; H100 = unit). Etched Sohu excluded; vendor-claimed numbers unaudited.
[Figure] Cost per million tokens: amortized cost per million decoded tokens. Hatched red = vendor claim, unaudited; dotted = projection.

The registry

Eighteen open bottlenecks. Filter by type, status, or priority. Click any entry for full detail, known solutions, blockers, and dependencies.

Columns: ID · Name · Type · Status · Priority · Difficulty · Unlock

How to contribute

  1. Pull request against bottleneck_registry.md. Schema is documented at the top of that file; one Markdown section per entry.
  2. Status transitions (open → partial → resolved) require a citation.
  3. Blockers must be specific. "More research needed" is not a blocker; "FP4 QAT degrades long-tail tasks by ~3% on MMLU" is.
  4. Failed-attempt reports are welcome and tracked with the same rigor as wins.
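The authoritative schema lives at the top of bottleneck_registry.md; the entry below is only a hypothetical sketch built from the registry's table columns (ID, Name, Type, Status, Priority, Difficulty, Unlock). The ID, name, values, and dependency references are all invented for illustration.

```markdown
## B-07 · FP4 quantization-aware training   <!-- hypothetical ID and name -->

- **Type:** software
- **Status:** open
- **Priority:** high · **Difficulty:** medium
- **Unlock:** cheaper decode at equal quality
- **Blockers:** FP4 QAT degrades long-tail tasks by ~3% on MMLU
- **Dependencies:** B-02, B-11   <!-- invented cross-references -->
```

A status change on such an entry (e.g. open → partial) would need to cite the result that justifies it, per rule 2 above.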

View on GitHub · View the source registry (raw)

Support this work

This project is run by an independent researcher on a $20/month budget. Donations go directly to compute (LoRA training runs, benchmark replication) and registry maintenance. Every dollar is accounted for in a public ledger.

Donation channels (GitHub Sponsors, Ko-fi, etc.) pending; to be filled in by the author.