A complete GPU KV-cache offload solution that moves KV tensors from Host GPU memory to BlueField DPU-backed storage tiers without Host CPU involvement.
This project provides an end-to-end pipeline for offloading GPU-resident data — primarily LLM KV caches — to storage attached to a local BlueField DPU. It is built from three integrated pieces:
- blue-cache (blue-cache/): The DPU-side agent. It runs on the BlueField DPU ARM cores, imports the remote GPU memory map, executes DOCA DMA operations, and writes incoming data to DPU-side storage backends.
- NIXL Plugin (nixl-plugin/): A host-side NIXL backend named BLUE_CACHE. It registers GPU buffers as VRAM_SEG, exports them over PCIe with DOCA DMA, and forwards transfer requests to the DPU agent.
- LMCache Integration (examples/lmcache/): A patch set and configuration example that enables LMCache v0.4.3 to use the BLUE_CACHE backend for transparent KV-cache tiering.
Together these components let an application such as LMCache express a transfer as VRAM_SEG ↔ OBJ_SEG and have the actual PCIe DMA and storage I/O executed by the DPU.
The DPU agent can land data in multiple backend types, allowing the same offload path to target different cost/performance tiers:
| Target | How it is used | Typical use case |
|---|---|---|
| DPU DRAM | Pre-allocated staging buffer; can also serve as a fast transient tier | Low-latency cache spill |
| DPU-local disk | POSIX files via the agent's posix_storage_backend | Capacity tier on BlueField NVMe |
| Remote / object storage | NIXL OBJ_SEG backend (e.g. xdfs_storage_backend) | Shared object store, distributed cache |
Bulk data always moves over DOCA DMA between Host GPU and DPU. Only small control messages travel over DOCA Comch or TCP.
In LLM serving, the KV cache is large, grows with sequence length, and competes with model weights for limited GPU HBM. Existing offload paths often route data through the Host CPU or across the network, which:
- consumes host CPU cycles that could run the inference engine,
- adds extra memory copies,
- and is hard to integrate cleanly with a tiered cache.
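As a rough illustration of how quickly the KV cache grows, the standard sizing for multi-head attention is 2 (keys and values) × layers × KV heads × head dimension × sequence length × bytes per element. The model shape below is an illustrative 7B-class configuration chosen for the example, not a measurement from this project.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   dtype_bytes=2, batch=1):
    """Bytes of KV cache for one batch of sequences.

    The factor 2 accounts for the separate key and value tensors per layer.
    dtype_bytes=2 assumes fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes * batch


if __name__ == "__main__":
    # Assumed model shape, for illustration only.
    shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)
    for seq_len in (1024, 4096, 32768):
        gib = kv_cache_bytes(seq_len=seq_len, **shape) / 2**30
        print(f"{seq_len:>6} tokens -> {gib:.2f} GiB")
```

With these assumed dimensions each token costs 512 KiB, so a single 4096-token sequence already needs 2 GiB of HBM.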
By using the BlueField DPU's dedicated DOCA DMA engine, this solution:
- moves data directly between GPU and DPU storage across the PCIe complex,
- keeps the host CPU out of the data path,
- and exposes the offload path through the standard NIXL API so applications like LMCache do not need to know DOCA details.
┌───────────────────────────────────────────────────────────────────────────────────────┐
│                                         Host                                          │
│  ┌─────────────────────┐    ┌─────────────────────────────┐                           │
│  │  LMCache / vLLM     │    │  NIXL Agent                 │                           │
│  │  (KV-cache manager) │───►│  + BLUE_CACHE backend       │                           │
│  └─────────────────────┘    │  - registers GPU VRAM       │                           │
│                             │  - exports GPU mmap         │                           │
│                             │  - sends transfer requests  │                           │
│                             └─────────────┬───────────────┘                           │
│                                           │                                           │
│                              Control plane│(DOCA Comch / TCP)                         │
│                                           ▼                                           │
│  ┌─────────────────────┐    ┌─────────────────────────────┐                           │
│  │  GPU HBM (VRAM_SEG) │◄──►│  DOCA DMA over PCIe         │                           │
│  └─────────────────────┘    └─────────────────────────────┘                           │
└───────────────────────────────────────────────────────────────────────────────────────┘
                                            │
                                            ▼
┌───────────────────────────────────────────────────────────────────────────────────────┐
│                                     BlueField DPU                                     │
│  ┌─────────────────────────────────────────────────────────────────────────────────┐  │
│  │                                blue-cache agent                                 │  │
│  │  ┌───────────────┐    ┌───────────────┐    ┌─────────────────────────────────┐  │  │
│  │  │ DOCA DMA      │───►│ staging buffer│───►│ NIXL storage backend            │  │  │
│  │  │ engine        │    │ (DPU DRAM)    │    │ (posix / xdfs / xdfs_kv / ...)  │  │  │
│  │  └───────────────┘    └───────────────┘    └─────────────────────────────────┘  │  │
│  │                                                             │                   │  │
│  │                                                             ▼                   │  │
│  │                                              ┌──────────────┴──────────────┐    │  │
│  │                                              │                             │    │  │
│  │                                              ▼                             ▼    │  │
│  │                                       ┌─────────────┐      ┌─────────────────┐  │  │
│  │                                       │ DPU-local   │      │ Remote Storage  │  │  │
│  │                                       │ (posix)     │      │ xdfs / xdfs_kv  │  │  │
│  │                                       └─────────────┘      └─────────────────┘  │  │
│  └─────────────────────────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────────────────────────┘
The DPU agent is the piece that executes the offload. It runs as a service on the BlueField DPU and is intentionally separate from the NIXL library so it can evolve independently.
Responsibilities:
- Import the host GPU mmap from the PCI export descriptor sent by the plugin.
- Maintain a reusable DPU-side staging buffer.
- Execute chunked, pipelined DOCA DMA with configurable queue depth.
- Forward received data to a NIXL storage backend running on the DPU, which in turn writes to local files or object storage.
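The chunked, pipelined transfer with a bounded queue depth can be sketched in plain Python. This is a host-runnable simulation, not DOCA code: byte buffers stand in for GPU memory and the staging buffer, a deque stands in for the DMA work queue, and the names chunk_ranges and pipelined_copy are hypothetical.

```python
from collections import deque


def chunk_ranges(total_bytes, chunk_bytes):
    """Yield (offset, length) pairs that cover [0, total_bytes)."""
    offset = 0
    while offset < total_bytes:
        length = min(chunk_bytes, total_bytes - offset)
        yield offset, length
        offset += length


def pipelined_copy(src, dst, chunk_bytes, queue_depth):
    """Copy src into dst in chunks with at most queue_depth chunks in flight.

    Returns the peak number of in-flight chunks observed.
    """
    if chunk_bytes < 1 or queue_depth < 1:
        raise ValueError("chunk_bytes and queue_depth must be positive")

    def complete(entry):
        # Simulated DMA completion: the chunk's bytes land in the destination.
        off, length = entry
        dst[off:off + length] = src[off:off + length]

    in_flight = deque()
    peak = 0
    for entry in chunk_ranges(len(src), chunk_bytes):
        if len(in_flight) == queue_depth:
            # Queue is full: retire the oldest chunk before submitting another.
            complete(in_flight.popleft())
        in_flight.append(entry)
        peak = max(peak, len(in_flight))
    while in_flight:
        # Drain the remaining submissions.
        complete(in_flight.popleft())
    return peak


if __name__ == "__main__":
    src = bytes(range(256)) * 10
    dst = bytearray(len(src))
    print(pipelined_copy(src, dst, chunk_bytes=300, queue_depth=4), bytes(dst) == src)
```

A larger queue depth keeps more chunks outstanding at once, while the chunk size bounds how much staging memory each outstanding operation needs.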
Build and run instructions are in blue-cache/README.md.
The host plugin implements the NIXL nixlBackendEngine interface. It exposes two memory types:
- VRAM_SEG: Host GPU memory, exported via doca_mmap_export_pci().
- OBJ_SEG: DPU-resident object/file, identified by a path or key string.
The backend is local-only (supportsRemote() == false): both the GPU and the DPU must be reachable through the same host-side BlueField PCI function.
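The pairing of the two memory types can be modelled with a small sketch. The VramSeg and ObjSeg classes and the plan_transfer helper are hypothetical and do not reflect the real NIXL descriptor types; the push and pull names follow the operation names used by the Python example. The sketch only shows that every transfer has exactly one VRAM_SEG side and one OBJ_SEG side, and that the direction decides which is the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VramSeg:
    """Hypothetical stand-in for a registered GPU buffer region."""
    gpu_id: int
    offset: int
    length: int


@dataclass(frozen=True)
class ObjSeg:
    """Hypothetical stand-in for a DPU-resident object, named by path or key."""
    key: str
    length: int


def plan_transfer(op: str, vram: VramSeg, obj: ObjSeg):
    """Return (source, destination) for a push or pull between the two segments."""
    if vram.length != obj.length:
        raise ValueError("segment lengths must match")
    if op == "push":   # GPU -> DPU storage
        return vram, obj
    if op == "pull":   # DPU storage -> GPU
        return obj, vram
    raise ValueError(f"unknown operation: {op!r}")


if __name__ == "__main__":
    vram = VramSeg(gpu_id=0, offset=0, length=64 << 20)
    obj = ObjSeg(key="/data/test_obj", length=64 << 20)
    print(plan_transfer("push", vram, obj))
```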
Because NIXL loads backends dynamically, the plugin source is injected into a NIXL source tree with scripts/patch_nixl.sh and built together with NIXL.
examples/lmcache/ contains:
- lmcache_integration.patch: modifications to LMCache v0.4.3 to recognize and use the BLUE_CACHE backend.
- lmcache-config.yaml: sample configuration.
- patch_lmcache.sh: helper that applies the patch idempotently.
After patching LMCache, you can configure a storage backend that points to the DPU agent and offload KV tensors transparently.
.
├── common/ # Shared host-DPU control channel + wire protocol (dma_transfer.h)
├── nixl-plugin/ # NIXL backend plugin source (patch into NIXL)
├── blue-cache/ # BlueField DPU proxy service
├── examples/
│ ├── cpp/ # NIXL C++ example
│ ├── python/ # NIXL Python example
│ ├── standalone/ # Standalone host test tool (no NIXL required)
│ └── lmcache/ # LMCache v0.4.3 integration patch
├── scripts/ # patch_nixl.sh and build helpers
├── docs/ # Architecture and integration docs
├── CMakeLists.txt
├── LICENSE
└── CONTRIBUTING.md
On the BlueField DPU:
mkdir -p build && cd build
cmake .. -DBUILD_EXAMPLES=OFF
make -j$(nproc) blue-cache

Run the agent (TCP fallback mode for the easiest first test):
./blue-cache/blue-cache -p 0000:03:00.0 -m 256 -q 4 -b posix -T

Omit -T to use DOCA Comch mode.
On the host where NIXL is built:
./scripts/patch_nixl.sh /path/to/nixl/source
cd /path/to/nixl/source
meson setup build -Denable_plugins=BLUE_CACHE
ninja -C build

The patch script is idempotent; running it multiple times is safe.
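What idempotent means here can be shown with a minimal sketch. It assumes, purely for illustration, that one patching step registers the plugin by adding a line to a build file; the ensure_line helper and the subdir('blue_cache') line are hypothetical and are not taken from scripts/patch_nixl.sh.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Append line to the file unless it is already present.

    Returns True if the file was modified, False if nothing had to change.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already patched: do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        build_file = Path(tmp) / "meson.build"
        build_file.write_text("subdir('telemetry')\n")
        # Hypothetical registration line, applied twice.
        print(ensure_line(build_file, "subdir('blue_cache')"))
        print(ensure_line(build_file, "subdir('blue_cache')"))
```

The second call leaves the file untouched, which is the property that makes rerunning a patch step safe.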
export NIXL_PLUGIN_DIR=/opt/nvidia/nvda_nixl/lib/plugins
python3 examples/python/nixl_blue_cache_example.py \
-o push \
-p 0000:ba:00.0 \
-g 0 \
-f /data/test_obj \
-s 64 \
-d 10.75.70.125 \
-m tcp

See examples/python/README.md for push/pull examples and COMCH-mode usage.
This project has been verified against NIXL v1.1.0. Other NIXL versions may require minor adjustments to scripts/patch_nixl.sh or the plugin source.
- docs/ARCHITECTURE.md: Host plugin, DPU agent, control plane, and data plane design.
- docs/LMCache_INTEGRATION.md: KV-cache offload reference architecture.
- blue-cache/README.md: Build, run, and tune the DPU-side agent.
- examples/python/README.md: Python end-to-end example.
- examples/standalone/: Standalone host test tool that does not require NIXL.
- CONTRIBUTING.md: Build, test, and NIXL upstreaming workflow.
NIXL 1.1.0 uses tomlplusplus as a required dependency. When the telemetry plugin is enabled, its doca backend may miss the tomlplusplus include path because nixl_common_dep is not listed in its dependencies.
Recommended fix: patch src/plugins/telemetry/doca/meson.build to add nixl_common_dep:
# In src/plugins/telemetry/doca/meson.build
- dependencies: [nixl_infra, absl_log_dep, doca_dep],
+ dependencies: [nixl_infra, nixl_common_dep, absl_log_dep, doca_dep],

Then rebuild:
cd /path/to/nixl/source
meson setup build --wipe -Denable_plugins=BLUE_CACHE
ninja -C build

This fix mirrors the upstream NIXL commit b98dd59. It keeps telemetry enabled while correctly propagating the required include path.
Fallback: If you do not need telemetry, disable the telemetry plugins entirely:
cd /path/to/nixl/source
sed -i "s/^subdir('telemetry')/# subdir('telemetry')/" src/plugins/meson.build
meson setup build --wipe -Denable_plugins=BLUE_CACHE
ninja -C build

The C++ examples require the CUDA Toolkit. On a machine without CUDA, disable the examples:
cmake .. -DBUILD_EXAMPLES=OFF
make blue-cache

Or build blue-cache directly from the blue-cache/ directory:
cd blue-cache
./scripts/build_dpu.sh

Set the plugin search path:
export LD_LIBRARY_PATH=/opt/nvidia/nvda_nixl/lib/plugins:$LD_LIBRARY_PATH

Or in Python/C++ code:
agent.add_plugin_directory("/opt/nvidia/nvda_nixl/lib/plugins")

If NIXL was built with -Dstatic_plugins=BLUE_CACHE, the plugin is linked into libnixl.so and no search path is needed.
DOCA SDK is not installed or DOCA_DIR is incorrect:
cmake .. -DDOCA_DIR=/opt/mellanox/doca

Verify that /opt/mellanox/doca/include/doca_dma.h exists.
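The header check can be scripted before running cmake. This is a small sketch; the doca_header_present helper is a hypothetical name, and it only tests for the one header mentioned above rather than validating the whole SDK installation.

```python
from pathlib import Path


def doca_header_present(doca_dir: str = "/opt/mellanox/doca") -> bool:
    """Return True if include/doca_dma.h exists under the given DOCA directory."""
    return (Path(doca_dir) / "include" / "doca_dma.h").is_file()


if __name__ == "__main__":
    doca_dir = "/opt/mellanox/doca"
    if doca_header_present(doca_dir):
        print(f"DOCA DMA header found under {doca_dir}")
    else:
        print(f"doca_dma.h not found under {doca_dir}; check DOCA_DIR")
```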
Apache-2.0. See LICENSE.