- Session
- 08:55
- Duration: 29 mins
- Publication date: 13 Nov 2025
- Location: Turing Lecture Theatre, IET London: Savoy Place, London, United Kingdom
- Part of event REACH 2025
About the session
Large Language Models (LLMs) are transforming every corner of the technology landscape—from code generation and content creation to customer support and scientific discovery. But as models grow ever larger, the infrastructure required to serve them is straining under the weight of complexity, cost, and energy consumption.
In this keynote, we examine the true system-level bottlenecks that constrain the deployment and scalability of LLM inference today and into the near future. Drawing from our recent study, we identify four inescapable limits that define performance and efficiency: memory bandwidth, compute capability, synchronization latency, and memory capacity.
Rather than evaluating a specific chip or vendor stack, we present a hardware-agnostic performance model that lets us reason across the full spectrum of architectural possibilities—from GPUs and TPUs to wafer-scale systems and custom accelerators. We model technologies ranging from today’s HBM3 to emerging HBM4, 3D-stacked DRAM, and high-density SRAM designs, capturing both compute and memory scaling trends.
Here are a few insights that challenge common assumptions:
- Model size still matters: Even with clever sharding and compression techniques, inference for large models like GPT-3 or GPT-4 requires hundreds of gigabytes of memory per instance. You can't cheat physics, or capacity.
- Bandwidth is king: High memory bandwidth is non-negotiable for high user throughput. If you can't move weights fast enough, your expensive compute cores sit idle.
- Latency kills scaling: Synchronization across distributed accelerators must happen in under a microsecond. Anything slower erodes the benefit of high memory bandwidth and negates gains from hardware scale-out.
- DRAM wins (for now): While SRAM offers raw speed, DRAM-based designs provide far better energy and cost efficiency at system scale. For most realistic deployments, they’re the right economic tradeoff.
- We’re nearing a plateau: Today's systems can serve thousands of tokens per second per user. Getting to 10,000+ tokens/sec isn’t just a hardware problem: it will require algorithmic advances, smaller models, or shorter context windows.
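To make the interplay of the four limits concrete, here is a minimal back-of-envelope sketch in Python. It is not the performance model from the study; it is a simple roofline-style bound under loudly stated assumptions: a dense model whose weights are all read once per generated token, roughly 2 FLOPs per parameter per token, weights split evenly across devices, and a fixed number of synchronizations per layer. The `Accelerator` figures are illustrative placeholders, not the specifications of any real chip, and the function names are hypothetical. The only model figures used are GPT-3's published 175 billion parameters and 96 layers.

```python
import math
from dataclasses import dataclass


@dataclass
class Accelerator:
    """Hypothetical device; all three numbers are illustrative placeholders."""
    mem_bandwidth: float  # bytes per second
    mem_capacity: float   # bytes
    peak_flops: float     # FLOP per second


def weight_bytes(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory needed for the weights alone (KV cache and activations ignored)."""
    return n_params * bytes_per_param


def min_devices(n_params: float, chip: Accelerator,
                bytes_per_param: float = 2.0) -> int:
    """Capacity limit: fewest devices that can hold the weights."""
    return math.ceil(weight_bytes(n_params, bytes_per_param) / chip.mem_capacity)


def tokens_per_sec_per_user(n_params: float, n_layers: int, chip: Accelerator,
                            n_devices: int, sync_latency: float,
                            bytes_per_param: float = 2.0,
                            syncs_per_layer: int = 2) -> float:
    """Upper bound on single-user decode rate under the assumptions above."""
    # Bandwidth limit: every weight is streamed from memory once per token.
    t_mem = weight_bytes(n_params, bytes_per_param) / (n_devices * chip.mem_bandwidth)
    # Compute limit: about 2 FLOPs per parameter per token (assumption).
    t_compute = 2.0 * n_params / (n_devices * chip.peak_flops)
    # Synchronization limit: assumed fixed cost per layer, only when sharded.
    t_sync = n_layers * syncs_per_layer * sync_latency if n_devices > 1 else 0.0
    return 1.0 / (max(t_mem, t_compute) + t_sync)


# Placeholder device: 3 TB/s, 80 GB, 1 PFLOP/s (assumed, not a real part).
chip = Accelerator(mem_bandwidth=3e12, mem_capacity=80e9, peak_flops=1e15)

print("devices needed for capacity:", min_devices(175e9, chip))
for latency in (1e-6, 10e-6):
    rate = tokens_per_sec_per_user(175e9, 96, chip, n_devices=64,
                                   sync_latency=latency)
    print(f"sync latency {latency * 1e6:.0f} us: {rate:.0f} tokens/sec per user")
```

In this toy model the memory term dominates the compute term by a wide margin, and once the weights are spread over many devices the per-layer synchronization cost becomes comparable to the time spent streaming weights, which is the intuition behind the sub-microsecond requirement above.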
This talk is not just about bottlenecks—it’s about opportunity. By understanding the core friction points in LLM serving, we can guide the next wave of AI system innovation. Whether you're designing next-generation silicon, deploying inference at hyperscale, or optimizing software stacks, this framework helps you prioritize what really matters.
As the AI community races toward ever-larger models and faster deployment, our findings provide a reality check and a call to co-design—between hardware, systems, and ML algorithms.
We’ll close by discussing forward-looking implications: what kinds of architectures might win in a post-10,000 token/sec world? What’s the right balance between specialization and flexibility? And where should we invest our collective energy to make LLMs not just smarter—but more sustainable, affordable, and available?