ServerlessLLM is a low-latency serverless inference system for large language models. Its core innovations — multi-tier checkpoint loading, live inference migration, and startup-time-optimized scheduling — have been adopted by nearly every major AI cloud provider, delivering 10–200x latency reductions over state-of-the-art serverless systems.

Key Features

Multi-Tier Checkpoint Loading — A loading-optimized checkpoint format and a multi-tier loading pipeline that fully exploit the bandwidth of the complex storage hierarchy in GPU servers.
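
In spirit, the loader groups checkpoint chunks by the fastest storage tier that already holds them and streams all tiers concurrently, so the slowest tier (rather than the sum of all tiers) bounds load time. The sketch below is illustrative only: the tier names, bandwidth figures, and function names (`plan_load`, `estimated_load_seconds`) are assumptions, not ServerlessLLM's actual API.

```python
# Hypothetical per-tier bandwidths (GB/s); real values depend on hardware.
TIER_BANDWIDTH_GBPS = {"dram": 50.0, "nvme": 7.0, "remote": 1.0}

def plan_load(chunks, tier_of_chunk):
    """Group checkpoint chunks by the fastest tier that caches them,
    so every tier's bandwidth can be exploited in parallel."""
    plan = {tier: [] for tier in TIER_BANDWIDTH_GBPS}
    for chunk in chunks:
        plan[tier_of_chunk(chunk)].append(chunk)
    return plan

def estimated_load_seconds(plan, chunk_size_gb):
    # With tiers streaming concurrently, the slowest tier dominates.
    return max(
        (len(chunks) * chunk_size_gb) / TIER_BANDWIDTH_GBPS[tier]
        for tier, chunks in plan.items()
        if chunks
    )
```

For example, if half of a model's chunks sit in DRAM and half on NVMe, the NVMe half dominates the estimate, which is why the real system also tries to keep hot checkpoints in the faster tiers.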

Live LLM Inference Migration — Running inference can be migrated across servers, so newly started inference can leverage local checkpoint storage with minimal disruption to users.
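
One way to realize such migration is token-level state transfer: ship only the token IDs of the in-flight request, and let the destination server rebuild the KV cache with a prefill pass before resuming decoding. The toy model and all names below (`InferenceSession`, `migrate`, `toy_next_token`) are hypothetical; the sketch just demonstrates that decoding resumes without losing or changing tokens.

```python
from dataclasses import dataclass, field

@dataclass
class InferenceSession:
    prompt: list
    generated: list = field(default_factory=list)

def toy_next_token(tokens):
    # Deterministic stand-in for one decode step of a real model.
    return (sum(tokens) * 31 + len(tokens)) % 1000

def decode(session, model, steps):
    # Autoregressive decoding: each new token depends on all prior ones.
    for _ in range(steps):
        session.generated.append(model(session.prompt + session.generated))
    return session

def migrate(session):
    """Token-level migration: transfer only token IDs, not the large
    KV cache. The destination would re-run a prefill pass over these
    tokens to rebuild its cache (not modeled here), then resume."""
    return InferenceSession(prompt=list(session.prompt),
                            generated=list(session.generated))
```

Because only tokens cross the network, the transferred state is tiny compared with the KV cache, and the resumed session produces exactly the tokens the un-migrated session would have.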

Startup-Time-Optimized Scheduling — Evaluates checkpoint locality across servers and schedules each model onto the server with the shortest estimated cold-start time.
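
The scheduling idea can be sketched as a simple cost model: estimate each server's cold-start time from where (if anywhere) the checkpoint is cached locally, then place the model on the minimum. The tier bandwidths, the `cache` dictionary layout, and the function names here are assumptions for illustration, not the system's real interface.

```python
# Hypothetical bandwidths (GB/s): local tiers vs. remote checkpoint store.
BANDWIDTH_GBPS = {"dram": 50.0, "nvme": 7.0, "remote": 1.0}

def startup_seconds(server, model, size_gb):
    """Cold-start estimate: load from the fastest local tier that
    caches the checkpoint; fall back to a remote download otherwise."""
    for tier in ("dram", "nvme"):  # fastest tier wins
        if model in server["cache"].get(tier, set()):
            return size_gb / BANDWIDTH_GBPS[tier]
    return size_gb / BANDWIDTH_GBPS["remote"]

def schedule(servers, model, size_gb):
    # Pick the server with the shortest estimated cold-start time.
    return min(servers, key=lambda s: startup_seconds(s, model, size_gb))
```

A server holding the checkpoint in DRAM beats one holding it on NVMe, which in turn beats any server that would have to download it, so locality directly drives placement.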

Results

  • Latency reduced by 10–200x over SOTA serverless systems across various LLM workloads
  • Core innovations adopted by nearly every major AI cloud provider
  • Three complementary contributions: fast loading removes I/O bottlenecks, live migration resolves resource fragmentation, and optimized scheduling improves placement decisions