WaferLLM is the first wafer-scale LLM inference system, designed for next-generation wafer-scale AI accelerators with hundreds of thousands of cores, tens of gigabytes of distributed on-chip memory, and tens of PB/s of on-chip bandwidth. It introduces novel parallel strategies and kernel implementations that achieve orders-of-magnitude performance improvements over GPU-based systems.

Key Features

PLMR Performance Model — Captures the unique hardware characteristics of wafer-scale architectures (mesh interconnect, distributed on-chip memory, ultra-high bandwidth) to guide inference optimization.
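
One characteristic such a model must capture is that communication cost on a 2D mesh depends on where cores sit, not just how much data moves. The sketch below is an illustrative hop-count cost estimate; the function name, parameters, and constants are assumptions for exposition, not figures from WaferLLM or its PLMR model.

```python
def mesh_transfer_cost(src, dst, n_bytes,
                       hop_latency_ns=1.0, bw_bytes_per_ns=32.0):
    """Estimate the cost (ns) of moving n_bytes between two cores on a
    2D mesh. Latency grows with the Manhattan (hop) distance between
    the cores, so data placement directly affects performance.

    Illustrative sketch only: the latency and bandwidth values are
    made-up placeholders, not measured hardware numbers."""
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return hops * hop_latency_ns + n_bytes / bw_bytes_per_ns
```

A model like this rewards layouts that keep communicating cores adjacent, which is exactly the kind of decision a wafer-scale planner must make.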

Wafer-Scale LLM Parallelism — Novel parallel strategies that keep hundreds of thousands of on-chip cores utilized, achieving high parallel efficiency at wafer scale.
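
To make the idea concrete, one simple way to carve a core mesh into per-unit-of-work regions is sketched below. This is a hypothetical illustration of mesh partitioning in general; the function, its layout policy (equal column stripes per attention head), and all names are assumptions, not WaferLLM's actual strategy.

```python
def partition_heads(n_heads, mesh_rows, mesh_cols):
    """Assign attention heads to rectangular regions of a core mesh.

    Hypothetical layout: each head gets the full mesh height and an
    equal-width column stripe, so every head's projections run on a
    dedicated block of cores. Assumes n_heads divides mesh_cols."""
    cols_per = mesh_cols // n_heads
    regions = {}
    for h in range(n_heads):
        c0 = h * cols_per
        # Region as (row_start, row_end, col_start, col_end), end-exclusive.
        regions[h] = (0, mesh_rows, c0, c0 + cols_per)
    return regions
```

Real wafer-scale planners must also weigh inter-region traffic and memory footprints, but the core decision is the same: which cores own which slice of the model.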

MeshGEMM & MeshGEMV — The first GEMM and GEMV implementations designed for wafer-scale accelerators, scaling effectively across massive core counts.
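
The basic pattern behind a mesh GEMV can be sketched in plain Python: tile the matrix into blocks, let each simulated core own one block and the matching slice of the vector, then reduce partial products along each mesh row. This is a didactic sketch of block-partitioned GEMV under assumed names, not the MeshGEMV kernel itself.

```python
def mesh_gemv(A, x, p):
    """Simulate y = A @ x partitioned over a p x p core mesh.

    A is an n x n matrix (list of lists) tiled into p x p blocks; the
    simulated core at (r, c) owns block (r, c) and the slice of x for
    its columns, computes a partial product, and partials are summed
    along each mesh row. Assumes p divides n for simplicity."""
    n = len(A)
    b = n // p  # block side length
    y = [0.0] * n
    for r in range(p):          # mesh row
        for c in range(p):      # mesh column: (r, c) is one core
            for i in range(r * b, (r + 1) * b):
                partial = sum(A[i][j] * x[j]
                              for j in range(c * b, (c + 1) * b))
                y[i] += partial  # row-wise reduction of partials
    return y
```

On real hardware the inner sums run concurrently on separate cores and the row-wise reduction is a communication step over the mesh, which is where the placement-aware design pays off.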

Results

  • GEMV performance: 606x faster than NVIDIA A100, 16x better energy efficiency
  • Accelerator utilization: up to 200x over state-of-the-art methods
  • End-to-end LLM inference: 10–20x faster than A100 GPU clusters (SGLang/vLLM)

Collaborators

Microsoft Research