Inspired by the nano-aha-moment project, I created a minimal, easy-to-understand implementation of Group Relative Policy Optimization (GRPO) for post-training large language models.

This project fine-tunes Qwen2.5 3B for reasoning tasks, focusing on simplicity and clarity rather than maximal performance.
At its core is a self-contained Jupyter notebook that runs on a single 80GB A100 GPU, produces interesting results in under an hour, and offers a clear, from-scratch introduction to GRPO and RL for LLMs.
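
For a quick sense of what GRPO does: instead of training a separate value network as a baseline (as in PPO), it samples a group of completions for each prompt, scores them with a reward function, and normalizes each reward against the mean and standard deviation of its own group. The sketch below illustrates only that advantage computation; it is a simplified, hedged example of the general idea, not the notebook's actual code, and the function name and tensor shapes are my own assumptions.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Compute group-relative advantages (illustrative sketch, not the project's code).

    rewards: tensor of shape (num_prompts, group_size) -- one scalar reward per
    sampled completion, grouped by the prompt that produced it. Each completion's
    advantage is its reward normalized against the other completions for the
    same prompt, so no learned value function (critic) is needed.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each, with binary correctness rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))
```

These advantages are then plugged into a PPO-style clipped policy-gradient objective (typically with a KL penalty toward a reference model); the notebook and blog post walk through those remaining pieces.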

Check out the code here.

I also wrote a brief blog post introducing RL for LLMs and walking through GRPO step by step.