Welcome to RL-Kernel

High-performance kernels and runtime infrastructure for RL post-training.

RL-Kernel bridges high-level alignment algorithms and low-level hardware optimizations. It targets GRPO, PPO, DPO, and other reinforcement learning post-training workloads where log probability computation, sampling, and memory pressure dominate the critical path.

Where to start depends on what you want to do:

Run RL-Kernel locally with the Quickstart Guide.
Read the latest writeup: Announcing RL-Kernel for vime.
Understand supported kernels from the Operators section.
Add a new operator by following the Developer Guide.
Read dispatch details in the Runtime Dispatch design document.

RL-Kernel focuses on:

Hardware-aware dispatch for CUDA, ROCm, and PyTorch fallback paths.
Fused GPU operators for post-training bottlenecks.
Operator documentation as part of the merge contract.
A documentation structure that can grow with the project as more operators, runtime features, benchmarks, and APIs are added.
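
To make the dispatch idea concrete, here is a minimal sketch of backend selection with a fallback path. It is not RL-Kernel's actual API: the names `register`, `dispatch`, `_PRIORITY`, and `token_logprob` are hypothetical, and plain Python stands in for the PyTorch fallback so the example has no dependencies.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical registry: operator name -> backend name -> implementation.
_REGISTRY: Dict[str, Dict[str, Callable]] = {}

# Assumed preference order: vendor-specific kernels first, portable fallback last.
_PRIORITY: List[str] = ["cuda", "rocm", "torch"]


def register(op: str, backend: str):
    """Decorator that records an implementation of `op` for `backend`."""
    def wrap(fn: Callable) -> Callable:
        _REGISTRY.setdefault(op, {})[backend] = fn
        return fn
    return wrap


def dispatch(op: str, available: Sequence[str]) -> Callable:
    """Return the highest-priority implementation of `op` among `available`."""
    impls = _REGISTRY.get(op, {})
    for backend in _PRIORITY:
        if backend in available and backend in impls:
            return impls[backend]
    raise LookupError(f"no implementation of {op!r} for backends {list(available)}")


@register("token_logprob", "torch")
def token_logprob_reference(logits: Sequence[float], target: int) -> float:
    """Fallback path: numerically stable log-softmax of `logits` at `target`."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[target] - log_z


# No "rocm" implementation is registered, so this resolves to the fallback.
fn = dispatch("token_logprob", available=["rocm", "torch"])
value = fn([0.0, 0.0], 0)  # log-probability of a uniform two-way choice
```

In this sketch a fused CUDA or ROCm kernel would be registered under the same operator name and picked up automatically when its backend is available, while the reference path keeps the operator usable everywhere.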