SwarmX is a scheduler agent framework for large-scale agentic workflow clusters. Submitted to OSDI 2026 and deployed in production at Tencent WeChat, it schedules complex AI agent workflows efficiently across hundreds of heterogeneous GPUs and millions of CPU cores.

Key Features

Scheduler Agent Framework — Purpose-built for agentic workflow clusters, supporting intelligent routing and resource allocation for complex workflows.

Heterogeneous GPU Scheduling — Unified scheduling across heterogeneous GPU clusters, adapting to different hardware configurations.

Drift-Robust Stability — Maintains scheduling stability under severe drift conditions in production environments.

Production-Scale Deployment — Validated from 128-GPU benchmarks up to production clusters at Tencent WeChat with about a million CPU cores and nearly a thousand GPUs.

Results

  • Tail latency reduced by 10–60% relative to state-of-the-art (SOTA) methods (128-GPU benchmark)
  • P99 latency reduced by up to 50% in production
  • Throughput doubled under the same SLO constraints
  • Deployed for Tencent WeChat Hunyuan model serving and WeChat OCR scheduling
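
The latency figures above are percentile metrics. As a minimal sketch of how a P99 comparison of this kind can be computed, the snippet below uses the nearest-rank percentile definition. It is not SwarmX's measurement code: the function names, the choice of nearest-rank, and the latency samples are all hypothetical and serve only as illustration.

```python
def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it (q is an int, 1..100)."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Integer ceiling of q * n / 100, avoiding floating-point rounding.
    rank = max(1, -(-q * len(ordered) // 100))
    return ordered[rank - 1]


def relative_reduction(baseline, candidate):
    """Fractional reduction of `candidate` relative to `baseline`."""
    return (baseline - candidate) / baseline


# Synthetic latencies in milliseconds (made up for this example, not measured).
baseline_ms = [100] * 98 + [400, 800]
candidate_ms = [100] * 98 + [300, 600]

p99_baseline = percentile(baseline_ms, 99)
p99_candidate = percentile(candidate_ms, 99)
print(p99_baseline, p99_candidate, relative_reduction(p99_baseline, p99_candidate))
```

With only 100 samples, the P99 value is determined by the two slowest requests, which is why tail-latency comparisons generally need many more samples than mean-latency comparisons.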

Collaborators

Tencent — WeChat Hunyuan model deployment and OCR scheduling