Serving Intelligence: The Unseen Limits of LLM Inference and the Hardware Race to Overcome Them

Session
Tuesday, 11 November 2025
08:55
Duration: 29 mins
Publication date: 13 Nov 2025
Location: Turing Lecture Theatre, IET London: Savoy Place, London, United Kingdom
Part of event REACH 2025

About the session

Large Language Models (LLMs) are transforming every corner of the technology landscape—from code generation and content creation to customer support and scientific discovery. But as models grow ever larger, the infrastructure required to serve them is straining under the weight of complexity, cost, and energy consumption.

In this keynote, we examine the true system-level bottlenecks that constrain the deployment and scalability of LLM inference today and into the near future. Drawing from our recent study, we identify four inescapable limits that define performance and efficiency: memory bandwidth, compute capability, synchronization latency, and memory capacity.

Rather than evaluating a specific chip or vendor stack, we present a hardware-agnostic performance model that lets us reason across the full spectrum of architectural possibilities—from GPUs and TPUs to wafer-scale systems and custom accelerators. We model technologies ranging from today’s HBM3 to emerging HBM4, 3D-stacked DRAM, and high-density SRAM designs, capturing both compute and memory scaling trends.

Here are a few insights that challenge common assumptions:

Model size still matters: Even with clever sharding and compression techniques, inference for large models like GPT-3 or GPT-4 requires hundreds of gigabytes of memory per instance. You can't cheat physics, or capacity.

Bandwidth is king: High memory bandwidth is non-negotiable for high user throughput. If you can't move weights fast enough, your expensive compute cores sit idle.

Latency kills scaling: Synchronization across distributed accelerators must complete in under a microsecond. Anything slower wastes the available memory bandwidth and negates the gains from hardware scale-out.

DRAM wins (for now): While SRAM offers raw speed, DRAM-based designs provide far better energy and cost efficiency at system scale. For most realistic deployments, they’re the right economic tradeoff.

We’re nearing a plateau: Today's systems can serve thousands of tokens per second per user. Getting to 10,000+ tokens/sec isn’t just a hardware problem—it will require algorithmic advances, smaller models, or shorter context windows.
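To make the interplay between capacity, bandwidth, and synchronization latency concrete, here is a rough back-of-envelope sketch. It is not the performance model from the study; it is a simplified illustration that assumes a dense model, a single user (batch size 1), compute that is never the limiting factor, and synchronization time that does not overlap with weight streaming. The function names, the aggregate bandwidth figure, and the count of two synchronizations per layer are all hypothetical choices made for the example. The GPT-3 figures (175 billion parameters, 96 layers) are the publicly reported ones.

```python
"""Back-of-envelope bounds for single-user LLM decoding (illustrative only)."""


def weight_bytes(n_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights (ignores KV cache and activations)."""
    return n_params * bytes_per_param


def decode_tokens_per_sec(
    n_params: float,
    bytes_per_param: float,
    bandwidth_bytes_per_sec: float,
    n_layers: int = 0,
    syncs_per_layer: int = 0,
    sync_latency_s: float = 0.0,
) -> float:
    """Upper bound on tokens/sec for one user under the stated assumptions.

    Each generated token requires streaming every weight once, plus a fixed
    number of cross-device synchronizations per layer (assumed not to overlap
    with the memory traffic).
    """
    t_mem = weight_bytes(n_params, bytes_per_param) / bandwidth_bytes_per_sec
    t_sync = n_layers * syncs_per_layer * sync_latency_s
    return 1.0 / (t_mem + t_sync)


if __name__ == "__main__":
    params = 175e9          # GPT-3 scale parameter count
    fp16 = 2                # bytes per parameter at 16-bit precision
    agg_bw = 350e12         # hypothetical aggregate bandwidth across all devices

    print(f"weights: {weight_bytes(params, fp16) / 1e9:.0f} GB")
    for latency in (0.0, 1e-6, 10e-6):
        rate = decode_tokens_per_sec(
            params, fp16, agg_bw,
            n_layers=96, syncs_per_layer=2, sync_latency_s=latency,
        )
        print(f"sync latency {latency * 1e6:4.0f} us -> {rate:7.1f} tokens/sec")
```

Under these assumptions the weights alone occupy hundreds of gigabytes, the bandwidth term sets the ceiling on per-user throughput, and every extra microsecond of synchronization latency is multiplied by the number of synchronization points per token, which is why slow synchronization erodes the benefit of adding bandwidth through scale-out.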

This talk is not just about bottlenecks—it’s about opportunity. By understanding the core friction points in LLM serving, we can guide the next wave of AI system innovation. Whether you're designing next-generation silicon, deploying inference at hyperscale, or optimizing software stacks, this framework helps you prioritize what really matters.

As the AI community races toward ever-larger models and faster deployment, our findings provide a reality check and a call to co-design—between hardware, systems, and ML algorithms.

We’ll close by discussing forward-looking implications: what kinds of architectures might win in a post-10,000 token/sec world? What’s the right balance between specialization and flexibility? And where should we invest our collective energy to make LLMs not just smarter—but more sustainable, affordable, and available?

Keywords: IET conference

Large Language Models (LLMs)

REACH 2025

Reach Emerging Architectures in Computing Horizons

Savoy Place London

Synchronization across distributed accelerators

compute capability

memory bandwidth

synchronization

Channels

Communications

IT

Lectures