The Institution of Engineering and Technology iet.tv

Serving Intelligence: The Unseen Limits of LLM Inference and the Hardware Race to Overcome Them

CPD: This content can contribute towards your Continuing Professional Development (CPD) as part of the IET's CPD Monitoring scheme.
Presentation
  • Session
  • Tuesday, 11 November 2025
  • 08:55
  • Duration: 29 mins
  • Publication date: 13 Nov 2025
  • Location: Turing Lecture Theatre, IET London: Savoy Place, London, United Kingdom
  • Part of event REACH 2025

About the session

Large Language Models (LLMs) are transforming every corner of the technology landscape—from code generation and content creation to customer support and scientific discovery. But as models grow ever larger, the infrastructure required to serve them is straining under the weight of complexity, cost, and energy consumption.

In this keynote, we examine the true system-level bottlenecks that constrain the deployment and scalability of LLM inference today and into the near future. Drawing from our recent study, we identify four inescapable limits that define performance and efficiency: memory bandwidth, compute capability, synchronization latency, and memory capacity.

Rather than evaluating a specific chip or vendor stack, we present a hardware-agnostic performance model that lets us reason across the full spectrum of architectural possibilities—from GPUs and TPUs to wafer-scale systems and custom accelerators. We model technologies ranging from today’s HBM3 to emerging HBM4, 3D-stacked DRAM, and high-density SRAM designs, capturing both compute and memory scaling trends.

Here are a few insights that challenge common assumptions:

Model size still matters: Even with clever sharding and compression techniques, inference for large models like GPT-3 or GPT-4 requires hundreds of GB of memory per instance. You can't cheat physics, or capacity.
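
As a rough illustration of the capacity point, the sketch below estimates the memory needed for model weights alone. It is a back-of-the-envelope calculation, not the performance model from the talk: the function name and the choice of precisions are assumptions made here, and the figure excludes the KV cache and activations, so it is a lower bound. GPT-3's 175 billion parameters is the published size; GPT-4's size is not public and is not modeled.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights only, in GB (1 GB = 1e9 bytes).

    Excludes KV cache and activations, so real deployments need more.
    """
    return num_params * bytes_per_param / 1e9


# Published parameter count for GPT-3.
GPT3_PARAMS = 175e9

# Bytes per parameter at a few common precisions (illustrative choices).
for label, width in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(f"{label}: {weight_memory_gb(GPT3_PARAMS, width):.1f} GB")
```

Under these assumptions, 16-bit weights take 350 GB and even 4-bit weights take 87.5 GB before any per-user state is added.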

Bandwidth is king: High memory bandwidth is non-negotiable for high user throughput. If you can't move weights fast enough, your expensive compute cores sit idle.
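
One way to see why bandwidth dominates is a simple upper bound on single-user decode speed. The sketch below assumes a dense model where every weight is read from memory once per generated token and nothing else limits throughput; both the function name and the hardware numbers are hypothetical placeholders, not figures from the talk.

```python
def max_decode_tokens_per_sec(weight_bytes: float,
                              aggregate_bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on tokens/s for one user when decode is bandwidth-bound.

    Assumes each token requires streaming all weights from memory once.
    """
    return aggregate_bandwidth_bytes_per_sec / weight_bytes


# Illustrative inputs: 350 GB of weights spread over 8 devices,
# each with a hypothetical 3 TB/s of memory bandwidth.
weights = 350e9
bandwidth = 8 * 3e12

print(f"{max_decode_tokens_per_sec(weights, bandwidth):.1f} tokens/s upper bound")
```

With these placeholder numbers the bound is roughly 69 tokens per second per user, no matter how much compute is available. Raising it means more bandwidth, fewer bytes per token, or both.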

Latency kills scaling: Synchronization across distributed accelerators must happen in under a microsecond. Anything slower leaves memory bandwidth underused and negates the gains from hardware scale-out.
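
The effect of synchronization latency can be sketched by adding a fixed cost per synchronization to the time spent reading weights. This is a simplified serial model made up for illustration, not the model from the talk: it assumes tensor-parallel execution with two synchronizations per layer, uses GPT-3's 96 layers, and picks a hypothetical 50 microseconds of weight-read time per token.

```python
def tokens_per_sec_with_sync(weight_read_s: float,
                             syncs_per_token: int,
                             sync_latency_s: float) -> float:
    """Tokens/s for one user when sync latency adds serially to weight reads."""
    time_per_token = weight_read_s + syncs_per_token * sync_latency_s
    return 1.0 / time_per_token


LAYERS = 96            # GPT-3 layer count
SYNCS_PER_LAYER = 2    # assumption: two all-reduces per layer
WEIGHT_READ_S = 50e-6  # hypothetical: weights streamed in 50 microseconds

syncs = LAYERS * SYNCS_PER_LAYER
for latency_us in (0.5, 1.0, 5.0):
    rate = tokens_per_sec_with_sync(WEIGHT_READ_S, syncs, latency_us * 1e-6)
    print(f"sync latency {latency_us} us: {rate:.0f} tokens/s")
```

In this toy setting, going from 0.5 to 5 microseconds per synchronization makes synchronization the dominant cost per token, which is the sense in which slow synchronization wastes the available bandwidth.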

DRAM wins (for now): While SRAM offers raw speed, DRAM-based designs provide far better energy and cost efficiency at system scale. For most realistic deployments, they’re the right economic tradeoff.

We’re nearing a plateau: Today's systems can serve thousands of tokens per second per user. Getting to 10,000+ tokens/sec isn’t just a hardware problem—it will require algorithmic advances, smaller models, or shorter context windows.

This talk is not just about bottlenecks—it’s about opportunity. By understanding the core friction points in LLM serving, we can guide the next wave of AI system innovation. Whether you're designing next-generation silicon, deploying inference at hyperscale, or optimizing software stacks, this framework helps you prioritize what really matters.

As the AI community races toward ever-larger models and faster deployment, our findings provide a reality check and a call to co-design—between hardware, systems, and ML algorithms.

We’ll close by discussing forward-looking implications: what kinds of architectures might win in a post-10,000 token/sec world? What’s the right balance between specialization and flexibility? And where should we invest our collective energy to make LLMs not just smarter—but more sustainable, affordable, and available?

Keywords:
  • IET conference
  • Large Language Models (LLMs)
  • REACH 2025
  • Reach Emerging Architectures in Computing Horizons
  • Savoy Place London
  • Synchronization across distributed accelerators
  • compute capability
  • memory bandwidth
  • synchronization

Channels

Communications

IT

Lectures

Speaker

  • Karu Sankaralingam

    Principal Research Scientist, NVIDIA | Professor, University of Wisconsin-Madison, USA

© 2026 The Institution of Engineering and Technology.

