ContextPilot accelerates long-context LLM inference through context reuse — a new paradigm that identifies overlapping context blocks across users and conversation turns to maximize KV-cache reuse while maintaining or even improving inference quality. Developed in collaboration with Tencent.

Key Features

Context Index — Identifies overlapping context blocks across LLM interactions (cross-user, cross-turn) and builds a reuse index for efficient KV-cache sharing.
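A reuse index of this kind can be sketched as a map from a content hash of each context block to its cached KV entry, so a block already prefetched for one user or turn is found again for another. This is an illustrative sketch, not the actual ContextPilot data structure; the class and method names are invented for the example.

```python
import hashlib


class ContextIndex:
    """Sketch of a reuse index: content hash of a context block -> KV-cache entry id.
    Illustrative only; the real ContextPilot index may differ."""

    def __init__(self):
        self._index = {}  # block hash -> cache entry id

    @staticmethod
    def _block_hash(block: str) -> str:
        # Content-addressed key so identical blocks collide across users/turns.
        return hashlib.sha256(block.encode("utf-8")).hexdigest()

    def lookup_or_register(self, block: str, cache_id: int) -> tuple[int, bool]:
        """Return (cache entry id, hit). On a miss, register the new entry."""
        key = self._block_hash(block)
        if key in self._index:
            return self._index[key], True
        self._index[key] = cache_id
        return cache_id, False


index = ContextIndex()
a, hit_a = index.lookup_or_register("shared system prompt", cache_id=0)
# A second request carrying the same block hits the existing cache entry.
b, hit_b = index.lookup_or_register("shared system prompt", cache_id=1)
assert hit_a is False and hit_b is True and a == b == 0
```

Content addressing (rather than exact prompt-string matching) is what lets the index find overlap across different users and conversation turns.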

Context Ordering & De-duplication — Reorders and de-duplicates context to maximize KV-cache reuse rates across requests.
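Since prefix-based KV caches can only reuse a contiguous leading span, one plausible ordering heuristic is to drop duplicate blocks within a request and move the most widely shared blocks to the front, so concurrent requests share a longer common prefix. The function below is a minimal sketch of that idea, not ContextPilot's actual policy.

```python
from collections import Counter


def reorder_and_dedup(requests: list[list[str]]) -> list[list[str]]:
    """Sketch: de-duplicate blocks within each request, then put blocks shared
    by more requests first so prefix-style KV caches reuse a longer prefix.
    Illustrative heuristic only."""
    # Count each block once per request that contains it.
    freq = Counter(b for req in requests for b in dict.fromkeys(req))
    out = []
    for req in requests:
        unique = list(dict.fromkeys(req))  # drop duplicate blocks, keep order
        # Most widely shared blocks first; tie-break on text for determinism.
        unique.sort(key=lambda b: (-freq[b], b))
        out.append(unique)
    return out


reqs = [["doc_A", "doc_B", "doc_A"], ["doc_B", "doc_C"]]
ordered = reorder_and_dedup(reqs)
# "doc_B" appears in both requests, so it moves to the front of each.
assert ordered == [["doc_B", "doc_A"], ["doc_B", "doc_C"]]
```

Reordering retrieved blocks can change what the model sees, which is exactly the quality risk the annotation mechanism below is described as addressing.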

Succinct Context Annotations — Lightweight annotations that prevent quality degradation during reuse, and can even improve reasoning quality in longer-context scenarios.
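One simple form such an annotation could take is a short marker prepended to each block, so the model can still distinguish and reference blocks after they have been reordered or shared. The marker format here is purely hypothetical; the source does not specify ContextPilot's annotation scheme.

```python
def annotate_blocks(blocks: list[str]) -> str:
    """Sketch: prefix each context block with a brief marker so the model can
    tell reordered/reused blocks apart. Marker format is hypothetical."""
    return "\n".join(f"[Context block {i + 1}]\n{b}" for i, b in enumerate(blocks))


prompt = annotate_blocks(["doc_B", "doc_A"])
assert prompt.startswith("[Context block 1]\ndoc_B")
```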

Modular Architecture — Clean interfaces designed for integration with existing inference engines such as vLLM and SGLang.
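A modular reuse layer like this might expose a small interface that an engine calls before prefill. The protocol below is an assumed shape for illustration; it is not the actual ContextPilot or vLLM/SGLang API, and all names are invented.

```python
from typing import Protocol


class ContextReusePlugin(Protocol):
    """Hypothetical interface for a context-reuse layer sitting in front of an
    inference engine. Names are illustrative, not the real ContextPilot API."""

    def plan(self, context_blocks: list[str]) -> list[str]:
        """Reorder/de-duplicate blocks before prefill."""
        ...

    def cache_key(self, block: str) -> str:
        """Stable key used to look up a block's KV-cache entries."""
        ...


class PassthroughPlugin:
    """Trivial implementation: dedup only, identity-ish keys."""

    def plan(self, context_blocks: list[str]) -> list[str]:
        return list(dict.fromkeys(context_blocks))  # drop duplicates, keep order

    def cache_key(self, block: str) -> str:
        return str(hash(block))


plugin: ContextReusePlugin = PassthroughPlugin()
planned = plugin.plan(["a", "b", "a"])
assert planned == ["a", "b"]
```

Keeping the planning step behind a narrow interface is what would let the same logic plug into different engines without touching their schedulers.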

Results

  • Up to 3x lower prefill latency than state-of-the-art methods
  • Inference quality maintained — and improved in longer-context scenarios
  • Open source

Collaborators

Tencent