Welcome to RL-Kernel
High-performance kernels and runtime infrastructure for RL post-training.
RL-Kernel bridges high-level alignment algorithms and low-level hardware optimizations. It targets GRPO, PPO, DPO, and other reinforcement learning post-training workloads where log probability computation, sampling, and memory pressure dominate the critical path.
Where to start depends on what you want to do:
- Run RL-Kernel locally with the Quickstart Guide.
- Learn which kernels are supported in the Operators section.
- Add a new operator by following the Developer Guide.
- Read dispatch details in the Runtime Dispatch design document.
RL-Kernel focuses on:
- Hardware-aware dispatch for CUDA, ROCm, and PyTorch fallback paths.
- Fused GPU operators for post-training bottlenecks.
- Operator documentation as part of the merge contract: an operator is expected to ship with its documentation.
- A documentation structure that can grow with the project as more operators, runtime features, benchmarks, and APIs are added.
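To make the hardware-aware dispatch idea concrete, here is a minimal sketch of a backend registry with a fallback path. It is not RL-Kernel's actual API: the names `register`, `dispatch`, `BACKEND_PRIORITY`, and `selected_logprob` are hypothetical, and a pure-Python reference implementation stands in for the PyTorch fallback so that the example runs without a GPU or extra dependencies. The Runtime Dispatch design document describes the real mechanism.

```python
import math
from typing import Callable, Dict, Sequence, Set, Tuple

# Hypothetical registry: operator name -> backend name -> implementation.
_REGISTRY: Dict[str, Dict[str, Callable]] = {}

# Assumed preference order: GPU backends first, portable fallback last.
BACKEND_PRIORITY = ("cuda", "rocm", "fallback")


def register(op: str, backend: str) -> Callable[[Callable], Callable]:
    """Decorator that records an implementation of `op` for `backend`."""
    def decorator(fn: Callable) -> Callable:
        _REGISTRY.setdefault(op, {})[backend] = fn
        return fn
    return decorator


def dispatch(op: str, available: Set[str]) -> Tuple[str, Callable]:
    """Return the highest-priority registered implementation of `op`
    whose backend is in `available`."""
    impls = _REGISTRY.get(op, {})
    for backend in BACKEND_PRIORITY:
        if backend in available and backend in impls:
            return backend, impls[backend]
    raise LookupError(f"no implementation of {op!r} for backends {sorted(available)}")


@register("selected_logprob", "fallback")
def selected_logprob_reference(logits: Sequence[float], token: int) -> float:
    """Log probability of `token` under softmax(logits).

    Subtracting the max before exponentiating keeps the log-sum-exp stable.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[token] - log_sum_exp


# CUDA is reported as available, but no CUDA kernel is registered for this
# operator in the sketch, so dispatch selects the fallback implementation.
backend, fn = dispatch("selected_logprob", available={"cuda", "fallback"})
value = fn([0.0, 0.0], 0)  # two equal logits, so the probability is 1/2
```

In this sketch, registering a fused kernel under `"cuda"` or `"rocm"` would make `dispatch` prefer it without any change to the calling code, which is the property a fallback path is meant to provide.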