Weight Sync Bridge

The weight sync bridge moves a complete model-weight update from a training worker to a rollout worker with an explicit version, tensor metadata, and lifecycle state.

Think of a published update as a sealed package:

- the tensors are the package contents;
- the `WeightUpdateManifest` is the shipping label;
- `import_update(...)` opens the package on the rollout side;
- `acknowledge(...)` says the rollout worker has safely installed it;
- `reject(...)` records a failed install;
- `release(...)` lets both sides drop buffers, file descriptors, or GPU handles.

This protocol matters because rollout workers must never silently read a half-updated model. Every update is complete, versioned, validated, and released when it is no longer active.
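The package analogy can be read as a small state machine. The sketch below is only an illustration under assumed names: `UpdateState`, `ALLOWED`, and `advance` are hypothetical and are not part of the bridge API, and the transition table is a guess at what the lifecycle implies rather than the library's actual rules.

```python
from enum import Enum


class UpdateState(Enum):
    PUBLISHED = "published"
    IMPORTED = "imported"
    ACKNOWLEDGED = "acknowledged"
    REJECTED = "rejected"
    RELEASED = "released"


# Transitions allowed in this toy model (an assumption, not the library's table).
ALLOWED = {
    UpdateState.PUBLISHED: {UpdateState.IMPORTED, UpdateState.REJECTED, UpdateState.RELEASED},
    UpdateState.IMPORTED: {UpdateState.ACKNOWLEDGED, UpdateState.REJECTED, UpdateState.RELEASED},
    UpdateState.ACKNOWLEDGED: {UpdateState.RELEASED},
    UpdateState.REJECTED: {UpdateState.RELEASED},
    UpdateState.RELEASED: set(),
}


def advance(current: UpdateState, target: UpdateState) -> UpdateState:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


# Happy path: import, acknowledge, then release.
state = UpdateState.PUBLISHED
for step in (UpdateState.IMPORTED, UpdateState.ACKNOWLEDGED, UpdateState.RELEASED):
    state = advance(state, step)
```

In this model, acknowledging an update that was never imported raises, which mirrors the `cannot acknowledge update before import_update succeeds` error listed under Troubleshooting.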

At a Glance

flowchart LR
    subgraph Training["Training side"]
        Model["PyTorch / DeepSpeed model"]
        Publisher["WeightPublisher"]
        Model -->|"state_dict + version"| Publisher
    end

    Publisher -->|"WeightUpdateManifest"| Manifest["Manifest\nversion, tensor metadata,\ntransport handles"]

    subgraph Transport["Bridge transport"]
        Local["local-clone"]
        Shm["shared-memory"]
        VMM["cuda-vmm"]
        IPC["cuda-ipc"]
    end

    Manifest --> Transport

    subgraph Rollout["Rollout / inference side"]
        Consumer["WeightConsumer"]
        Executor["RolloutExecutor.update_weights"]
        Runtime["vLLM / rollout runtime"]
        Consumer -->|"import tensors"| Executor
        Executor -->|"install full update"| Runtime
        Runtime -->|"success"| Ack["acknowledge"]
        Runtime -->|"failure"| Reject["reject"]
    end

    Transport --> Consumer
    Ack --> Release["release update buffers"]
    Reject --> Release

In plain words: the training worker does not send "some tensors" and hope the rollout worker guesses what happened. It publishes a labeled, complete update. The rollout worker imports that exact update, installs it, and then records whether the install succeeded.

When to Use It

Use the bridge when a training process publishes new weights and another runtime, usually rollout or vLLM inference, must install those weights without restarting.

Common flows:

- local tests use `local-clone`;
- CPU cross-process smoke tests use `shared-memory`;
- same-node CUDA zero-copy uses `cuda-vmm` when supported;
- legacy PyTorch CUDA IPC uses `cuda-ipc` when the current driver/runtime can rebuild CUDA IPC handles;
- vLLM integration uses a request builder or install adapter on top of the same manifest contract.

Core Objects

Object	Role
`TensorDescriptor`	Shape, dtype, stride, byte count, device, and checksum for one tensor.
`WeightUpdateManifest`	Immutable public record for one complete weight update.
`WeightPublisher`	Training-side protocol: `publish(...)` and `release(...)`.
`WeightConsumer`	Rollout-side protocol: `import_update(...)`, `acknowledge(...)`, `reject(...)`, and `release(...)`.
`WeightInstallAdapter`	Optional adapter that installs imported tensors into a runtime such as vLLM.
`RolloutExecutor.update_weights(...)`	High-level rollout entry point that imports, installs, acknowledges, and activates a manifest.
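To make the descriptor and manifest roles concrete, here is a minimal stand-in built from frozen dataclasses. Everything in it is an assumption for illustration: `ToyTensorDescriptor`, `ToyManifest`, `describe`, and `matches` are hypothetical names, the field set is reduced, and SHA-256 is used only because the actual checksum algorithm is not stated here.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyTensorDescriptor:
    name: str
    shape: tuple
    dtype: str
    nbytes: int
    checksum: str  # SHA-256 hex digest; the bridge's real algorithm is not stated


@dataclass(frozen=True)
class ToyManifest:
    update_id: str
    weight_version: int
    tensors: tuple  # tuple of ToyTensorDescriptor, so the record stays immutable


def describe(name: str, shape: tuple, dtype: str, payload: bytes) -> ToyTensorDescriptor:
    """Build a descriptor from the raw bytes of one tensor."""
    return ToyTensorDescriptor(name, shape, dtype, len(payload), hashlib.sha256(payload).hexdigest())


def matches(descriptor: ToyTensorDescriptor, payload: bytes) -> bool:
    """Check imported bytes against the byte count and checksum on the label."""
    return (
        len(payload) == descriptor.nbytes
        and hashlib.sha256(payload).hexdigest() == descriptor.checksum
    )


payload = bytes(16)  # stands in for 4 float32 zeros
manifest = ToyManifest("update-1", 1, (describe("linear.bias", (4,), "float32", payload),))
ok = matches(manifest.tensors[0], payload)
corrupt = matches(manifest.tensors[0], b"\x01" + payload[1:])
```

Freezing both records means a consumer cannot alter the label after publication, and a flipped byte in the payload shows up as a failed `matches` check.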

Lifecycle

sequenceDiagram
    participant Train as Training worker
    participant Bridge as Weight bridge
    participant Rollout as Rollout worker

    Train->>Bridge: publish(model, weight_version=N)
    Bridge-->>Train: WeightUpdateManifest
    Train-->>Rollout: send manifest
    Rollout->>Bridge: import_update(manifest)
    Bridge-->>Rollout: tensor mapping
    Rollout->>Rollout: install tensors into runtime
    alt install succeeds
        Rollout->>Bridge: acknowledge(update_id)
    else install fails
        Rollout->>Bridge: reject(update_id, reason)
    end
    Rollout->>Bridge: release(update_id)
    Train->>Bridge: release(update_id)

Two rules keep the handoff safe:

1. `weight_version` must increase monotonically on the publisher; reusing an old version is an error.
2. A consumer should acknowledge only after the runtime has installed the full tensor set.
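Both rules can be expressed as small guards. The following is a sketch under assumptions: `ToyVersionGuard` and `install_then_ack` are hypothetical helpers, and "monotonically" is read here as strictly increasing because reusing a version is reported as an error.

```python
class ToyVersionGuard:
    """Rejects any version that does not exceed the last accepted one."""

    def __init__(self):
        self._last_version = None

    def check(self, weight_version: int) -> int:
        if self._last_version is not None and weight_version <= self._last_version:
            raise ValueError("weight_version must increase monotonically")
        self._last_version = weight_version
        return weight_version


def install_then_ack(expected_names, installed_names, acknowledge):
    """Call acknowledge() only when every expected tensor was installed."""
    missing = set(expected_names) - set(installed_names)
    if missing:
        raise RuntimeError(f"partial install, missing: {sorted(missing)}")
    acknowledge()
```

With these guards, publishing version 2 twice raises on the second call, and an install that covers only part of the tensor set never reaches the acknowledgement.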

Choose a Transport

Transport	Best for	Copy behavior	Notes
`local-clone`	Unit tests and single-process contract checks	Copies tensors	Safest baseline. It proves version, manifest, ack, reject, and release semantics.
`shared-memory`	CPU tensors across local processes	Zero-copy CPU storage import	Uses Python `multiprocessing.shared_memory`. Good for realistic CPU lifecycle tests.
`cuda-vmm`	Same-node CUDA handoff	Zero-copy GPU aliasing	Uses CUDA VMM and POSIX file descriptor export. This is the preferred same-node CUDA zero-copy path when available.
`cuda-ipc`	Legacy PyTorch CUDA IPC runtimes	Zero-copy GPU aliasing when supported	Uses PyTorch `reduce_tensor` handles. Some WSL2/driver/runtime combinations reject handle rebuild with `CUDA error: invalid resource handle`; in that case use `cuda-vmm`.

Create bridges through make_weight_bridge(...) when possible:

from rl_engine.executors.bridge import make_weight_bridge

training_bridge = make_weight_bridge(
    "shared-memory",
    source_worker="trainer",
    source_rank=0,
)

rollout_bridge = make_weight_bridge(
    "shared-memory",
    source_worker="rollout",
    source_rank=0,
)
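For intuition about what the `shared-memory` row means by sharing CPU storage, the standard-library mechanism it builds on can be shown on its own. This is not the bridge's implementation, only a sketch of `multiprocessing.shared_memory` with a byte string standing in for one tensor and both "sides" living in one process for brevity.

```python
from multiprocessing import shared_memory

payload = bytes(range(16))  # stands in for one tensor's raw bytes

# "Publisher": allocate a named segment and copy the bytes in once.
segment = shared_memory.SharedMemory(create=True, size=len(payload))
segment.buf[: len(payload)] = payload

# "Consumer": attach by name; the storage itself is shared, not duplicated.
view = shared_memory.SharedMemory(name=segment.name)
received = bytes(view.buf[: len(payload)])  # copied out only for the comparison

# Release order mirrors the bridge lifecycle: consumer first, then publisher.
view.close()
segment.close()
segment.unlink()
```

The segment name plays the role of a transport handle: it is the only thing the consumer needs in order to find the storage. The slices use `len(payload)` because some platforms round the segment size up to a page boundary.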

Publish and Import Manually

This example uses CPU shared memory so it can run without CUDA:

import torch

from rl_engine.executors.bridge import SharedMemoryTensorBridge

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.LayerNorm(4))

publisher = SharedMemoryTensorBridge(source_worker="trainer", source_rank=0)
consumer = SharedMemoryTensorBridge(source_worker="rollout", source_rank=0)

manifest = publisher.publish(
    model,
    weight_version=1,
    metadata={"step": 1, "layout": {"kind": "full-state"}},
)

try:
    tensors = consumer.import_update(manifest)
    # Install tensors into the rollout runtime here.
    consumer.acknowledge(manifest.update_id)
finally:
    consumer.release(manifest.update_id)
    publisher.release(manifest.update_id)

The same lifecycle applies to local-clone, cuda-vmm, and cuda-ipc. The transport changes how tensor storage is shared, not the public protocol.

Use `RolloutExecutor.update_weights`

Most rollout code should not call import_update(...) directly. Use RolloutExecutor.update_weights(...) so import, optional runtime install, acknowledgement, active-version tracking, and old-update release happen in one place.

import torch

from rl_engine.executors.bridge import SharedMemoryTensorBridge
from rl_engine.executors.rollout import RolloutExecutor

model = torch.nn.Linear(4, 4)
publisher = SharedMemoryTensorBridge(source_worker="trainer", source_rank=0)
manifest = publisher.publish(model, weight_version=7)

rollout_bridge = SharedMemoryTensorBridge(source_worker="rollout", source_rank=0)
rollout = RolloutExecutor(weight_bridge=rollout_bridge)

try:
    imported = rollout.update_weights(manifest)
    assert rollout.active_weight_version == 7
    assert set(imported) == set(model.state_dict())
finally:
    rollout.release_weights()
    publisher.release(manifest.update_id)

If installation fails, RolloutExecutor rejects the update and keeps the previous active version.
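The failure behavior can be pictured with a toy executor. `ToyRollout` is a hypothetical stand-in, not `RolloutExecutor`; in particular, whether the real method raises or returns on failure is not stated here, so returning `False` is an assumption of this sketch.

```python
class ToyRollout:
    def __init__(self, install):
        self._install = install            # callable that may raise
        self.active_weight_version = None
        self.rejected = []                 # (version, reason) pairs

    def update_weights(self, version, tensors):
        try:
            self._install(tensors)
        except Exception as exc:
            # Record the rejection and leave the active version untouched.
            self.rejected.append((version, str(exc)))
            return False
        self.active_weight_version = version
        return True


def install(tensors):
    if not tensors:
        raise RuntimeError("empty update")


rollout = ToyRollout(install)
rollout.update_weights(7, {"weight": b"\x00"})
rollout.update_weights(8, {})  # fails; version 7 stays active
```

The active version changes only after a successful install, so a failed update never leaves the rollout side pointing at a version it does not actually hold.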

vLLM Hot-Weight Update

vLLM does not accept a raw manifest object. It expects a runtime-specific request shape. RL-Kernel provides adapters so the bridge lifecycle stays the same while vLLM receives the format it expects.

For vLLM IPC:

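from rl_engine.executors.bridge import VLLMIPCWeightUpdateRequestBuilder

# Assumes `llm` is an already-constructed vLLM engine, `manifest` came from the
# publisher, and `imported_cuda_tensors` is the tensor mapping from import_update(...).
builder = VLLMIPCWeightUpdateRequestBuilder(is_checkpoint_format=False)
request = builder(manifest, imported_cuda_tensors)
llm.init_weight_transfer_engine({"init_info": {}})
llm.update_weights(request)
# Release only after vLLM has finished installing the update.
builder.release(manifest.update_id)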

For vLLM CUDA VMM external storage, use VLLMCUDAVMMExternalStorageAdapter when the rollout worker can install external CUDA storage aliases in-process.

The important idea is the same in both paths: keep the publisher-side tensors alive until vLLM finishes the update, then release the update id.
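One way to enforce "release only after the install finishes" is a context manager that owns the release call. This is a generic sketch, not something the bridge is stated to provide: `hold_update` and its `release` argument are hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def hold_update(update_id, tensors, release):
    """Yield tensors and call release(update_id) only after the block exits."""
    try:
        yield tensors
    finally:
        release(update_id)


events = []
with hold_update("update-1", {"weight": b"\x00"}, lambda uid: events.append(("release", uid))) as held:
    # The runtime install would happen here, while the tensors are still held.
    events.append(("install", sorted(held)))
```

Because the release sits in a `finally` clause, it runs after the install step whether that step succeeds or raises.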

CUDA Notes

cuda-vmm and cuda-ipc are both same-node CUDA transports, but they depend on different CUDA runtime capabilities.

Use cuda-vmm when you need the production-oriented same-node zero-copy path and the platform supports CUDA VMM export/import.

Use cuda-ipc only after validating the runtime. PyTorch CUDA IPC handle rebuild can fail on some WSL2 or driver combinations with:

CUDA error: invalid resource handle

That error means PyTorch could not reopen the CUDA IPC memory handle in the consumer process. It is a runtime capability blocker, not a manifest validation failure. The benchmark reports this as blocked instead of pretending the transport succeeded.
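The "report blocked, do not pretend" behavior can be sketched as a probe wrapper. The names and the result keys below (`probe_transport`, `transport`, `status`, `blocker`) are assumptions for illustration, as is the premise that the failure surfaces as a `RuntimeError`; the failing probe is a fake that only reproduces the message.

```python
def probe_transport(name, probe):
    """Run a capability probe and report pass or blocked, never a silent fallback."""
    try:
        probe()
    except RuntimeError as exc:
        return {"transport": name, "status": "blocked", "blocker": str(exc)}
    return {"transport": name, "status": "pass", "blocker": None}


def failing_ipc_probe():
    # Fake probe that mimics the handle-rebuild failure described above.
    raise RuntimeError("CUDA error: invalid resource handle")


results = [
    probe_transport("cuda-ipc", failing_ipc_probe),
    probe_transport("local-clone", lambda: None),
]
usable = [r["transport"] for r in results if r["status"] == "pass"]
```

Keeping the blocker message in the result lets an operator distinguish a runtime capability gap from a manifest validation failure.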

Benchmark and Smoke Tests

Run the local protocol smoke tests:

python benchmarks/benchmark_weight_sync_bridge.py --smoke --mode local
python benchmarks/benchmark_weight_sync_bridge.py --smoke --mode shared-memory

Run CUDA transport probes when CUDA is available:

python benchmarks/benchmark_weight_sync_bridge.py --smoke --mode cuda-vmm
python benchmarks/benchmark_weight_sync_bridge.py --smoke --mode cuda-ipc

Run the rollout update path:

python benchmarks/benchmark_weight_sync_bridge.py --smoke --mode rollout-update

Install the vLLM extra before running the vLLM benchmark modes:

pip install -e ".[vllm]"
python benchmarks/benchmark_weight_sync_bridge.py --mode vllm-cuda-ipc-hot-update --model /path/to/model
python benchmarks/benchmark_weight_sync_bridge.py --mode vllm-cuda-vmm-external-storage --model /path/to/model

The benchmark prints one JSON row with:

- transport mode;
- status: pass or blocked;
- publish/import/ack/release timing;
- tensor count and byte count;
- active weight version;
- environment notes and precise blocker messages.
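A script that consumes the benchmark output might gate on the status field. The exact JSON key names are not given here, so `mode`, `status`, and `blocker` in this sketch are assumptions, and the sample row is made up for illustration rather than captured from a real run.

```python
import json


def check_row(line: str) -> dict:
    """Parse one benchmark row and insist on a known status value."""
    row = json.loads(line)
    if row.get("status") not in {"pass", "blocked"}:
        raise ValueError(f"unexpected status: {row.get('status')!r}")
    return row


# Hypothetical row; real key names may differ.
sample = '{"mode": "cuda-ipc", "status": "blocked", "blocker": "CUDA error: invalid resource handle"}'
row = check_row(sample)
promotable = row["status"] == "pass"
```

Treating anything other than `pass` as not promotable keeps a blocked transport from being mistaken for a working one.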

Troubleshooting

Symptom	What it usually means	What to do
`weight_version must increase monotonically`	The publisher reused an old version.	Increment the version after every completed training step.
`cannot acknowledge update before import_update succeeds`	The consumer acknowledged without importing.	Call `import_update(...)`, install tensors, then acknowledge.
`tensor checksum mismatch`	The manifest metadata does not match imported tensor contents.	Treat the update as corrupt or stale and reject it.
`CUDA IPC handle reconstruction failed`	Legacy CUDA IPC is not working in this runtime.	Use `cuda-vmm` for same-node CUDA zero-copy, or run CUDA IPC on a validated runtime.
`multi-node/RDMA weight transport is not implemented`	The bridge only supports same-node layouts today.	Publish a gathered full-state update on one node, or add a dedicated RDMA/NCCL transport.

Production Checklist

Before promoting a new weight handoff path:

- run unit tests for manifest validation and lifecycle state transitions;
- run the benchmark mode for the selected transport;
- prove the rollout runtime installs the full tensor set before acknowledgement;
- keep publisher tensors and handles alive until consumers finish;
- record blocked runtime capabilities explicitly instead of falling back silently;
- release both publisher and consumer update ids during shutdown.
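For the shutdown item, one possible pattern is a small registry that remembers every outstanding update id and releases them in reverse registration order. `ReleaseRegistry` is a hypothetical helper, not part of the bridge, and the release callables are placeholders for `publisher.release` and `consumer.release`.

```python
class ReleaseRegistry:
    def __init__(self):
        self._pending = []  # (update_id, side, release) in registration order

    def track(self, update_id, side, release):
        self._pending.append((update_id, side, release))

    def release_all(self):
        """Release everything still tracked, newest first; return what was released."""
        released = []
        while self._pending:
            update_id, side, release = self._pending.pop()
            release(update_id)
            released.append((side, update_id))
        return released


log = []
registry = ReleaseRegistry()
registry.track("update-1", "publisher", log.append)  # registered first, released last
registry.track("update-1", "consumer", log.append)
released = registry.release_all()
```

Registering the publisher before the consumer means the consumer side is released first, which matches the order used in the manual publish and import example.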