Welcome to RL-Kernel

High-performance kernels and runtime infrastructure for RL post-training.

RL-Kernel bridges high-level alignment algorithms and low-level hardware optimizations. It targets GRPO, PPO, DPO, and other reinforcement learning post-training workloads in which log-probability computation, sampling, and memory pressure dominate the critical path.
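To make the log-probability step concrete, here is a minimal pure-Python sketch of the per-token computation these workloads repeat for every position in every sampled sequence. It is illustrative only and is not RL-Kernel's implementation or API; the function names `token_log_prob` and `sequence_log_probs` are hypothetical. The point is that each token needs a reduction over the full vocabulary, which is why this step is a natural candidate for fused kernels.

```python
import math


def token_log_prob(logits, token_id):
    """Log-probability of `token_id` under softmax(logits).

    Uses the max-subtraction (log-sum-exp) trick so large logits
    do not overflow in exp().
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[token_id] - log_sum_exp


def sequence_log_probs(logits_per_step, token_ids):
    """Per-token log-probabilities for one sampled sequence.

    `logits_per_step[i]` holds the full-vocabulary logits at step i,
    and `token_ids[i]` is the token actually sampled at that step.
    """
    return [
        token_log_prob(step_logits, tok)
        for step_logits, tok in zip(logits_per_step, token_ids)
    ]
```

In a real training loop the logits are a tensor of shape (batch, sequence length, vocabulary size), so materializing them just to pick out one value per position is a large part of the memory pressure mentioned above.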

RL-Kernel focuses on:

  • Hardware-aware dispatch for CUDA, ROCm, and PyTorch fallback paths.
  • Fused GPU operators for post-training bottlenecks.
  • Operator documentation as part of the merge contract.
  • A documentation structure that can grow with the project as more operators, runtime features, benchmarks, and APIs are added.
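
As a rough illustration of the hardware-aware dispatch idea in the list above, the sketch below registers several implementations of one operator and picks the most specialized one available, falling back to a portable path otherwise. This is an assumption-laden sketch, not RL-Kernel's actual API: `register`, `dispatch`, `BACKEND_PRIORITY`, and the `sum_logits` operator are all hypothetical names, and the list of available backends is passed in by the caller rather than detected from the hardware.

```python
from typing import Callable, Dict, Iterable

# Preference order (assumed): specialized GPU paths first, portable fallback last.
BACKEND_PRIORITY = ("cuda", "rocm", "pytorch")

# op name -> {backend name -> implementation}
_REGISTRY: Dict[str, Dict[str, Callable]] = {}


def register(op_name: str, backend: str):
    """Decorator that records `fn` as the `backend` implementation of `op_name`."""
    def decorator(fn: Callable) -> Callable:
        _REGISTRY.setdefault(op_name, {})[backend] = fn
        return fn
    return decorator


def dispatch(op_name: str, available_backends: Iterable[str]) -> Callable:
    """Return the highest-priority implementation that is both registered
    for `op_name` and present in `available_backends`."""
    impls = _REGISTRY.get(op_name, {})
    available = set(available_backends)
    for backend in BACKEND_PRIORITY:
        if backend in available and backend in impls:
            return impls[backend]
    raise LookupError(f"no usable implementation for operator {op_name!r}")


# Hypothetical operator with two implementations. Both are plain Python
# stand-ins here; the returned tag only shows which path was chosen.
@register("sum_logits", "pytorch")
def _sum_logits_fallback(values):
    return ("pytorch", sum(values))


@register("sum_logits", "cuda")
def _sum_logits_cuda(values):
    return ("cuda", sum(values))
```

With this layout, `dispatch("sum_logits", ["rocm", "pytorch"])` returns the fallback because no ROCm implementation is registered, so adding a new backend-specific kernel does not require changing call sites.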