ServerlessLLM \ Saber AI

ServerlessLLM is a low-latency serverless inference system for large language models. Its core innovations (multi-tier checkpoint loading, live inference migration, and startup-time-optimized scheduling) have been adopted by nearly every major AI cloud provider and deliver 10–200x latency reductions over state-of-the-art serverless systems.

Key Features

Multi-Tier Checkpoint Loading: A loading-optimized checkpoint format and a multi-tier loading system that together fully exploit the bandwidth of the complex storage hierarchy on GPU servers.
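To see why loading across storage tiers benefits from overlap, here is a rough timing model. It is only a sketch, not ServerlessLLM's implementation: it assumes the checkpoint moves through the tiers in fixed-size chunks and that each tier can work on a different chunk at the same time. The function names, chunk size, and bandwidth figures are hypothetical.

```python
import math


def chunk_count(size_gb, chunk_gb):
    """Number of chunks needed to cover the checkpoint."""
    return math.ceil(size_gb / chunk_gb)


def sequential_load_time(size_gb, chunk_gb, tier_bandwidths_gbps):
    """Seconds to load when each chunk passes through every tier, one chunk at a time.

    Simplification: every chunk is treated as full-size, so this is an upper bound.
    """
    n = chunk_count(size_gb, chunk_gb)
    per_chunk = sum(chunk_gb / bw for bw in tier_bandwidths_gbps)
    return n * per_chunk


def pipelined_load_time(size_gb, chunk_gb, tier_bandwidths_gbps):
    """Seconds to load when tiers overlap: standard pipeline fill plus steady state.

    The first chunk pays for every stage; each later chunk costs only the slowest stage.
    """
    n = chunk_count(size_gb, chunk_gb)
    stage_times = [chunk_gb / bw for bw in tier_bandwidths_gbps]
    return sum(stage_times) + (n - 1) * max(stage_times)


# Hypothetical numbers: an 8 GB checkpoint, 2 GB chunks, two tiers at 1 and 2 GB/s.
print(sequential_load_time(8, 2, [1, 2]))
print(pipelined_load_time(8, 2, [1, 2]))
```

Under this model the pipelined time approaches the checkpoint size divided by the slowest tier's bandwidth, which is the sense in which a loader can "fully exploit" the hierarchy.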

Live LLM Inference Migration: Migrates in-progress inference between servers so that newly started inference can use checkpoints already stored locally, with minimal disruption to users.

Startup-Time-Optimized Scheduling: Evaluates checkpoint locality across servers and schedules each model onto the server with the shortest estimated cold-start time.
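The scheduling idea can be illustrated with a minimal sketch. It assumes a cold-start estimate of queue wait plus checkpoint size divided by the bandwidth of the tier where the checkpoint currently sits. The `Server` class, tier names, and bandwidth values are hypothetical and are not taken from ServerlessLLM itself.

```python
from dataclasses import dataclass

# Hypothetical per-tier load bandwidths in GB/s; real values depend on the hardware.
TIER_BANDWIDTH_GBPS = {"dram": 20.0, "ssd": 5.0, "remote": 1.0}


@dataclass
class Server:
    name: str
    cached_tier: dict[str, str]  # model name -> local tier holding its checkpoint
    queue_wait_s: float          # time until a GPU on this server is free


def estimate_startup_s(server, model, size_gb):
    """Estimated cold-start time: wait for a GPU, then load from the best-known tier."""
    # A model with no local copy must be fetched from remote storage.
    tier = server.cached_tier.get(model, "remote")
    return server.queue_wait_s + size_gb / TIER_BANDWIDTH_GBPS[tier]


def pick_server(servers, model, size_gb):
    """Choose the server with the shortest estimated cold-start time."""
    return min(servers, key=lambda s: estimate_startup_s(s, model, size_gb))


servers = [
    Server("a", {"llm": "dram"}, queue_wait_s=3.0),  # fastest tier, but busy
    Server("b", {"llm": "ssd"}, queue_wait_s=0.0),   # slower tier, idle
    Server("c", {}, queue_wait_s=0.0),               # no local copy
]
print(pick_server(servers, "llm", size_gb=10.0).name)
```

With these numbers, server "b" wins even though "a" holds the checkpoint in a faster tier, because the estimate weighs locality together with availability.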

Results

Latency reduced by 10–200x over SOTA serverless systems across various LLM workloads
Core innovations adopted by nearly every major AI cloud provider
Three complementary contributions: fast loading solves I/O bottlenecks, live migration solves resource fragmentation, optimized scheduling solves placement decisions
