GRPO is mostly a systems problem

Every GRPO explainer leads with the same pitch: it's PPO without the value model. True, and also the least interesting thing about it. The math fits on a napkin. What doesn't fit on a napkin is what happens when you actually run it: your training job grows an inference service inside it, your GPU memory fills up with copies of the same network doing different jobs, and the correctness of the whole thing starts depending on how fast you can move weights between processes.

This post covers the napkin, then spends the rest of its time on the machine: where rewards come from, what lives in GPU memory, why generation dominates your step time, and the specific ways runs die. At the end there's code you can actually run.

The napkin version

Group Relative Policy Optimization, per prompt:

Sample a group of $G$ completions from the current policy.
Score each one with a reward function.
Compute each completion's advantage against its own siblings:

$A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}$

Take a clipped policy-gradient step, with a KL penalty pulling the policy toward a frozen reference:

$\mathcal{L}(\theta) = -\,\mathbb{E}\left[ \min\!\big(\rho_{i,t}\, A_i,\ \operatorname{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\, A_i\big) \right] + \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)$

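The ratio $\rho_{i,t}$ is per token: the probability the current policy assigns to token $t$ of completion $i$, divided by the probability under the policy that sampled it. The advantage $A_i$ is per sequence, broadcast to every token in it.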
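If the formulas read better as code, here is a minimal sketch of the two pieces in plain Python. It is not TRL's implementation: the function names, the small `eps` added to the denominator, the `clip_eps=0.2` default, and the choice of population standard deviation are all my assumptions, and real implementations differ on each of them.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Standardize rewards within one group of sibling completions.

    eps is an assumed guard so a zero-variance group divides by eps, not 0.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_token_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-token surrogate: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    rho = math.exp(logp_new - logp_old)  # importance ratio for this token
    clipped = max(1.0 - clip_eps, min(rho, 1.0 + clip_eps))
    return min(rho * advantage, clipped * advantage)


# One group of G = 4 completions: two correct, two wrong.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

The winners get a positive advantage, the losers a negative one of the same size, and every token in a completion inherits that one number. The loss is the negated average of the surrogate over tokens, plus the KL term.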

PPO needs a learned value network to estimate a baseline, a whole second transformer trained alongside the policy just to answer "how good is this state, typically". GRPO replaces it with statistics: sample more completions per prompt and let the group mean be the baseline. If you beat your siblings, you get pushed up. If you lose to them, you get pushed down.

Notice what this trade actually is. The baseline didn't become free. Its cost moved from training a second network to sampling G completions per prompt, every step, forever. GRPO takes memory pressure and converts it into inference pressure. Keep that conversion in mind; every section below is a consequence of it.

Rewards you can grade with a program

Before GRPO can score a group, something has to produce the scores. Classic RLHF's answer was a reward model: train a transformer on human preference pairs, then trust its scalar output. That works, but you inherit two problems. The reward model is another multi-billion-parameter network to train, serve, and keep in memory. And it's an approximation of human judgment with exploitable holes, and policy optimization is exceptionally good at finding holes. The literature politely calls this reward hacking.

Verifiable rewards sidestep both, for the domains that allow it. If the task has a checkable notion of correctness, the reward can be a program: a math answer compared against ground truth, code run against unit tests, a tool call parsed and matched field by field. The Tulu 3 paper named the recipe RLVR, reinforcement learning with verifiable rewards, and DeepSeek-R1 showed how far it scales. GRPO pairs naturally with it because the trainer doesn't care where scores come from. In TRL, a reward is any Python callable with the right signature:

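def tool_call_reward(completions, answers=None, **kwargs):
    """1.0 if the emitted tool call exactly matches ground truth, else 0.0."""
    # parse_tool_calls and normalize are task-specific helpers defined elsewhere.
    rewards = []
    for completion, truth in zip(completions, answers):
        try:
            pred = normalize(parse_tool_calls(completion))
            rewards.append(1.0 if pred and pred == normalize(truth) else 0.0)
        except Exception:
            # Unparseable output is a wrong answer, not a crash.
            rewards.append(0.0)
    return rewards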

Two details in that signature carry real design weight. First, answers arrives through **kwargs plumbing: any extra column in your dataset is passed to the reward function but never shown to the model, which is how ground truth stays hidden from the policy while remaining available to the grader. Second, you can register several reward functions with different weights, say an exact-match reward at 1.0 plus a lenient formatting reward at 0.5, so the group has an ordering to learn from even before anyone gets the answer fully right. Sparse rewards starve GRPO early; a shaped secondary signal feeds it.
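To make the shaped secondary signal concrete, here is a sketch of a lenient format reward and the weighted sum it feeds into. The `<tool_call>` tag format, the helper names, and the `combine` function are assumptions for illustration. The trainer does its own weighting, so `combine` only shows the arithmetic. The sketch also assumes completions arrive as plain strings.

```python
import json
import re

# Assumed output format: a JSON payload wrapped in <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def _has_valid_tagged_json(text):
    """True if the text contains a tagged block whose body parses as JSON."""
    match = TOOL_CALL_RE.search(text)
    if match is None:
        return False
    try:
        json.loads(match.group(1))
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return False
    return True


def format_reward(completions, **kwargs):
    """1.0 for a well-formed tagged call, regardless of whether it is correct."""
    return [1.0 if _has_valid_tagged_json(c) else 0.0 for c in completions]


def combine(per_function_rewards, weights):
    """Weighted sum across reward functions, one total per completion."""
    return [
        sum(w * r for w, r in zip(weights, row))
        for row in zip(*per_function_rewards)
    ]


completions = [
    '<tool_call>{"name": "f"}</tool_call>',  # well-formed but wrong
    "no call here",                          # malformed
    '<tool_call>{"name": "g"}</tool_call>',  # well-formed and right
]
exact = [0.0, 0.0, 1.0]  # pretend output of the exact-match reward
totals = combine([exact, format_reward(completions)], weights=[1.0, 0.5])
```

With exact match alone, the first two completions tie at zero. With the format reward added, the group has three distinct scores, so there is an ordering to learn from.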

Two rules for writing these, learned the annoying way:

The reward must never raise. It will be fed truncated JSON, half-open tags, repetition loops, and prose. Garbage in, low score out. An uncaught exception in a reward function takes down a multi-GPU job an hour in.
The parser is part of the reward spec. Every bit of leniency you grant (accepting bare JSON without tags, ignoring trailing text) is behavior you're training in. The policy will find and exploit exactly the slack your regex leaves. You're not writing a parser, you're writing the rules of the game.
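One way to enforce the first rule mechanically is to write graders per sample and lift them into batch reward functions through a wrapper that catches everything. This is a sketch, not a TRL utility: `batch_reward` and `strict_json_match` are names I made up, and it assumes the ground truth arrives in an `answers` column as in the earlier example.

```python
import functools
import json


def batch_reward(grade_one, default=0.0):
    """Lift a per-sample grader into a batch reward function that never raises."""

    @functools.wraps(grade_one)  # keep the grader's name on the wrapper
    def reward_fn(completions, answers=None, **kwargs):
        truths = answers if answers is not None else [None] * len(completions)
        rewards = []
        for completion, truth in zip(completions, truths):
            try:
                rewards.append(float(grade_one(completion, truth)))
            except Exception:
                # Garbage in, low score out: one bad sample costs one reward.
                rewards.append(default)
        return rewards

    return reward_fn


@batch_reward
def strict_json_match(completion, truth):
    """Hypothetical grader: the whole completion must be JSON equal to truth."""
    return 1.0 if json.loads(completion) == truth else 0.0


scores = strict_json_match(
    ['{"a": 1}', '{"a": ', "not json", '{"a": 2}'],
    answers=[{"a": 1}] * 4,
)
```

The try block sits inside the loop on purpose. A truncated completion zeroes its own reward and leaves its siblings graded normally, which keeps the within-group ordering intact.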

That last phrase is the right segue: a reward function is really the degenerate case of an environment. Single-turn GRPO is a bandit problem. The model takes one action (the whole completion), receives one grade, and the episode ends; the "environment" collapses into a stateless verifier function. The moment your task goes agentic (the model calls a tool, observes the result, and acts again), you have a real multi-step environment: stateful, with reset and step semantics, and with rewards arriving per step or only at the end of a trajectory. Systems-wise this is a different animal. Verifiers were cheap functions in the trainer process; environments are sandboxes, browsers, and terminals, CPU-heavy and latency-bound, running as fleets of workers next to your GPUs. That's why serious RL stacks split into trainer, rollout workers, and environment services, and why standardization efforts like OpenEnv (Meta and Hugging Face, a Gymnasium-style API for agentic environments) exist at all. GRPO the algorithm doesn't change; everything around it does.
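For a feel of what "stateful, with reset and step semantics" means, here is a toy environment and rollout loop. Everything in it is invented for illustration: the class, the `add` tool, and the simplified `(observation, reward, done)` return value. It is neither the Gymnasium nor the OpenEnv API.

```python
class CounterToolEnv:
    """Toy multi-step environment: reach a target value by calling an 'add' tool."""

    def __init__(self, target=3, max_steps=5):
        self.target = target
        self.max_steps = max_steps
        self.value = 0
        self.steps = 0

    def _observation(self):
        return {"value": self.value, "target": self.target}

    def reset(self):
        """Start a fresh episode and return the first observation."""
        self.value = 0
        self.steps = 0
        return self._observation()

    def step(self, action):
        """Apply one action; malformed actions are ignored but still cost a step."""
        self.steps += 1
        if isinstance(action, dict) and action.get("tool") == "add":
            amount = action.get("amount")
            if isinstance(amount, int):
                self.value += amount
        solved = self.value == self.target
        done = solved or self.steps >= self.max_steps
        return self._observation(), (1.0 if solved else 0.0), done


def rollout(env, policy):
    """Run one episode; policy maps an observation to an action."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total


good = rollout(CounterToolEnv(), lambda obs: {"tool": "add", "amount": 1})
bad = rollout(CounterToolEnv(), lambda obs: "garbage")
```

A verifier sees one string and returns one number. Here the reward depends on state that accumulates across calls, and that state is why environments end up as their own services.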

What lives in GPU memory

Count the transformers in a classic PPO RLHF setup: the policy (trained), the value model (trained), the reward model (frozen), the reference (frozen). Four networks, two of them carrying optimizer state.

GRPO with a verifiable reward cuts the census to two. The value model became a mean. The reward model became json.loads and a string compare. What survives is the policy and the reference, and they're not equals. For a 7B model in bf16 with mixed-precision Adam, the trained policy costs about 16 bytes per parameter: bf16 weights, an fp32 master copy, two fp32 Adam moments, bf16 gradients. Call it 112 GB before a single activation is materialized. The reference is weights only, 14 GB, frozen, never touched by the optimizer. In practice you shard all of it with ZeRO-3 or FSDP, reference included, and a 7B run fits on a node where the PPO equivalent would be gasping.
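The arithmetic behind those numbers, as a sketch you can rerun for your own model size. The byte counts are the ones listed above. The even split across GPUs is an idealization that ignores activations, KV cache, and framework overhead.

```python
# Bytes per parameter for the trained policy under mixed-precision Adam.
TRAINED_BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "fp32_master_weights": 4,
    "fp32_adam_first_moment": 4,
    "fp32_adam_second_moment": 4,
    "bf16_gradients": 2,
}


def trained_model_gb(n_params):
    """Static memory for a trained model: weights, optimizer state, gradients."""
    return n_params * sum(TRAINED_BYTES_PER_PARAM.values()) / 1e9


def frozen_model_gb(n_params, bytes_per_weight=2):
    """Static memory for a frozen bf16 model: weights only."""
    return n_params * bytes_per_weight / 1e9


def per_gpu_gb(total_gb, num_gpus):
    """Idealized even sharding (ZeRO-3 / FSDP style), ignoring activations."""
    return total_gb / num_gpus


n = 7_000_000_000
policy_gb = trained_model_gb(n)      # 112.0
reference_gb = frozen_model_gb(n)    # 14.0
sharded_gb = per_gpu_gb(policy_gb + reference_gb, num_gpus=8)  # 15.75
```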

Why does the reference exist at all? Because verifiable rewards are narrow. An exact-match reward says nothing about the 99.9% of the distribution that isn't the answer format, and a policy optimized hard against a narrow signal will happily cannibalize every capability it isn't being graded on. The KL term anchors the policy to the pre-RL snapshot; $\beta$ is the length of the leash. Set $\beta = 0$ and you can delete the reference and its 14 GB. Sometimes that's fine. Know what you're giving up.

The reference's real cost is paid per step, not just in allocation. To compute the KL you need reference log-probs for every token you just sampled, which is a full forward pass through the frozen model over the whole batch of completions. Add the policy's own forward for the gradient, and old-policy log-probs if you reuse a batch for multiple updates, and you're running up to three forward passes over the same tokens before the backward even starts. When people ask why their RL steps are five times slower than their SFT steps, this is a decent chunk of the answer.
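Once you have both sets of log-probs, the KL term itself is cheap. A common choice, and the one described in the DeepSeekMath paper that introduced GRPO, is a per-token estimator that stays non-negative by construction. The sketch below assumes that estimator and assumes you already hold per-token log-probs for the sampled tokens. The function names are mine.

```python
import math


def kl_token_estimate(logp_policy, logp_ref):
    """Per-token estimate: exp(d) - d - 1, with d = logp_ref - logp_policy.

    Zero when the two models agree on the sampled token, positive otherwise.
    """
    delta = logp_ref - logp_policy
    return math.exp(delta) - delta - 1.0


def mean_kl(policy_logps, ref_logps):
    """Average the estimate over the sampled tokens of one completion."""
    terms = [kl_token_estimate(p, r) for p, r in zip(policy_logps, ref_logps)]
    return sum(terms) / len(terms)


agree = mean_kl([-1.0, -2.0], [-1.0, -2.0])   # identical models
drift = mean_kl([-0.5, -2.0], [-1.0, -2.0])   # policy moved on one token
```

The expensive part is producing `ref_logps`, which is the extra forward pass described above.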

One escape hatch worth knowing: train with LoRA and you don't need the second copy at all. Disable the adapter and the base weights are the reference policy; TRL does this automatically. Full fine-tuning has no equivalent trick.

Generation is the workload

Here is the real mental shift coming from supervised fine-tuning. In SFT the dataset is static. The GPU alternates forward and backward, the dataloader prefetches, utilization is high, life is simple. In GRPO, every optimizer step begins with generation: the current policy produces $G$ completions per prompt, autoregressively, token by token, before any gradient exists. The data you train on this step did not exist last step. That's the on-policy contract, and it puts an inference workload on the critical path of every single training step.

And your training stack is bad at inference. model.generate() on a ZeRO-3-sharded model gathers parameters on demand while decoding one token at a time, with no continuous batching and no paged KV cache. Inference engines solve exactly these problems, which is why the standard move is to run vLLM next to the trainer, either colocated on the same GPUs or as a separate server, and pull samples from it.

Congratulations: you now operate two live copies of the policy in two different runtimes, and they disagree after every optimizer step. The importance ratio $\rho$ in the loss assumes the samples came from the model in the denominator. Let the sampler go stale and that assumption quietly breaks: you're doing off-policy learning with an estimator that wasn't designed for it. The insidious part is that nothing crashes. Clipping absorbs the drift, the loss curves look normal, and the model just learns worse than it should. A correctness bug that presents as a quality problem.

So you synchronize, every step: gather the policy's parameters from their ZeRO-3 shards (a collective operation in itself), push them into the vLLM process (TRL does this over NCCL), swap, resume sampling. For a 7B model that's 14 GB of weights on the move per optimizer step. Within a node, over NVLink, it's background noise. Across nodes it becomes a real line item in your step time, and it's where a large share of the engineering effort in open-source RL stacks lives right now; TRL, verl, and OpenRLHF each have their own answer to this one problem. The framing that keeps you honest: weight sync is not a performance optimization. It's what keeps the algorithm being the algorithm.

One accounting habit makes all of this easier to configure: think in completions, not prompts. A group must land inside a single optimization step, because the baseline is the group mean, so your global batch measured in completions has to be divisible by $G$. A concrete shape: 8 GPUs, 2 sequences per device, and 8 gradient accumulation steps give 128 completions per step, which at $G = 8$ is just 16 unique prompts. Generation-time KV memory scales with concurrent completions too. Most "my effective batch size makes no sense" confusion in GRPO configs is this.
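A small helper makes the accounting explicit and fails early when the shape is wrong. It is a sketch of the arithmetic above, not the check any particular trainer performs. The function name is hypothetical, and the exact constraint your framework enforces may differ by version.

```python
def grpo_batch_shape(num_gpus, per_device_batch, grad_accum, num_generations):
    """Return (completions per optimizer step, unique prompts per step)."""
    completions = num_gpus * per_device_batch * grad_accum
    if completions % num_generations != 0:
        raise ValueError(
            f"{completions} completions per step is not divisible by "
            f"G={num_generations}; some group would straddle two steps"
        )
    return completions, completions // num_generations


completions_per_step, prompts_per_step = grpo_batch_shape(
    num_gpus=8, per_device_batch=2, grad_accum=8, num_generations=8
)
```

With the numbers from the text this gives 128 completions and 16 prompts. Raise $G$ to 16 without touching anything else and you are down to 8 prompts per step.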

Where it dies

Two collapse modes. One is silent, one is loud, and both are visible in metrics if you know what to look for.

Advantage collapse, the silent one. Look at the advantage formula again. If all $G$ completions in a group get the same reward, the standard deviation is zero and every advantage in the group is zero. An all-wrong group produces no gradient. An all-right group also produces no gradient. GRPO learns exclusively from within-group disagreement.
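The metric worth watching here is the fraction of groups that contribute nothing. A sketch of how you might compute it from a flat list of rewards follows. It assumes sibling completions are stored contiguously, and the function name is hypothetical.

```python
def zero_signal_fraction(rewards, group_size):
    """Fraction of groups whose rewards are all identical (zero std, zero gradient)."""
    if len(rewards) % group_size != 0:
        raise ValueError("rewards must contain whole groups")
    groups = [
        rewards[i:i + group_size] for i in range(0, len(rewards), group_size)
    ]
    dead = sum(1 for group in groups if max(group) == min(group))
    return dead / len(groups)


batch = [
    1.0, 1.0, 1.0, 1.0,   # all right: no gradient
    0.0, 0.0, 0.0, 0.0,   # all wrong: no gradient
    1.0, 0.0, 0.0, 1.0,   # mixed: the only group that teaches anything
]
dead_fraction = zero_signal_fraction(batch, group_size=4)
```

Two thirds of this batch is paid-for generation that never reaches the weights. The mean reward is 0.5, which looks perfectly healthy.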

The implication is brutal for task selection: the task has to sit at the edge of the model's ability, where groups come back mixed. Point GRPO at a dataset the base model already aces and you'll pay for a full multi-GPU run that transfers zero bits into the weights, while the reward curve sits flat at 1.0 looking like success. Measuring the base model's pass rate on your task before renting anything is the cheapest experiment in all of RL. This failure mode is also why DAPO introduced dynamic sampling, which filters zero-variance groups out of the batch and refills with informative ones.
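The pass-rate measurement turns into a useful number with one line of probability. Treat each completion as an independent pass or fail with probability $p$, which is a simplifying assumption: real samples from one prompt are correlated, and $p$ varies by prompt. A group of $G$ is then mixed with probability $1 - p^G - (1-p)^G$. A sketch:

```python
def mixed_group_probability(pass_rate, group_size):
    """Chance that a group has at least one pass and one fail (binary reward).

    Assumes independent samples with a single per-prompt pass rate.
    """
    all_pass = pass_rate ** group_size
    all_fail = (1.0 - pass_rate) ** group_size
    return 1.0 - all_pass - all_fail


for p in (0.01, 0.5, 0.99, 1.0):
    print(f"pass rate {p:.2f}: mixed groups {mixed_group_probability(p, 8):.3f}")
```

At a 50% pass rate with $G = 8$, nearly every group is informative. At 99% or 1%, fewer than one group in twelve is. At 100% none are, which is the flat-at-1.0 reward curve described above.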

Generation collapse, the loud one. Push the policy too hard and it can degrade into text that never terminates. Every completion slams into the length cap, the true reward pins to zero, and the run rarely climbs back out, because once nothing parses there's no within-group ordering left to learn from. The metric signature is unmistakable, which means you can automate the response instead of watching dashboards:

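from transformers import TrainerCallback

class StopOnCollapse(TrainerCallback):
    """Stop training once nearly every completion hits the length cap and none score."""
    def on_log(self, args, state, control, logs=None, **kwargs):
        logs = logs or {}  # logs defaults to None; avoid calling .get on it
        collapsed = (
            (logs.get("completions/clipped_ratio") or 0) >= 0.9
            and (logs.get("rewards/tool_call_reward/mean") or 0) <= 0
        )
        if collapsed:
            control.should_training_stop = True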

Everything length-capped, nothing correct: stop the job, stop paying.

The guardrails against both are unglamorous: a small learning rate, tight gradient clipping, a nonzero KL, and rewards that never make degenerate length profitable. Modern recipes add two quiet corrections to the loss itself. Token-level normalization (DAPO-style) fixes vanilla GRPO's habit of giving tokens in short completions more gradient than tokens in long ones. And scaling advantages by the batch-wide reward std, rather than each group's own, stops near-unanimous groups from turning trivial reward differences into huge updates. Both are one-line config choices in TRL.
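The length bias that token-level normalization fixes is easiest to see as weights. The sketch below compares how much weight a single token gets under the two averaging schemes. It illustrates the idea rather than reproducing any library's loss code, and the function name and mode labels are mine.

```python
def per_token_weights(lengths, mode):
    """Weight each token receives in the batch loss, one entry per completion.

    mode="sequence": average within each completion, then across completions.
    mode="token":    average over all tokens in the batch at once.
    """
    if mode == "sequence":
        return [1.0 / (len(lengths) * length) for length in lengths]
    if mode == "token":
        total_tokens = sum(lengths)
        return [1.0 / total_tokens for _ in lengths]
    raise ValueError(f"unknown mode: {mode}")


lengths = [2, 8]  # one short completion, one long one
sequence_level = per_token_weights(lengths, "sequence")  # [0.25, 0.0625]
token_level = per_token_weights(lengths, "token")        # [0.1, 0.1]
```

Under sequence-level averaging, a token in the 2-token completion carries four times the weight of a token in the 8-token one. Under token-level averaging, every token counts the same.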

Now run it

Here's the part that should feel almost anticlimactic after all of the above: the entire algorithm surface you actually touch is this.

from trl import GRPOConfig, GRPOTrainer

config = GRPOConfig(
    num_generations=8,               # G: the group size
    max_completion_length=128,
    beta=0.1,                        # KL leash; 0.0 drops the reference model
    loss_type="dapo",                # token-level normalization
    scale_rewards="batch",           # batch-std advantage scaling
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # completions per step, not prompts
    bf16=True,
)

trainer = GRPOTrainer(
    model="HuggingFaceTB/SmolLM3-3B",
    args=config,
    reward_funcs=[tool_call_reward, format_reward],   # plain callables
    train_dataset=dataset,           # "prompt" column + hidden "answers"
    callbacks=[StopOnCollapse()],
)
trainer.train()

Roughly twenty lines wrapping the two-model residency, the generation loop, the sync machinery, the group bookkeeping. That gap between how little you write and how much runs underneath is exactly why it's worth understanding the layers below before your first multi-GPU bill.

If you'd rather watch these constraints show up in real logs than take my word for them, I wrote a tutorial that runs this exact recipe end to end: SmolLM3-3B trained on tool calling against Salesforce/xlam-function-calling-60k, exact-match plus format rewards, DeepSpeed ZeRO-3 across 8 GPUs, reward curves plotted from the training logs. It happens to run on SageMaker, but the recipe is stack-agnostic; every idea in this post appears in its config, and both collapse modes are one bad hyperparameter away if you feel like meeting them personally.

On paper, GRPO deletes a model. In practice, it adds a distributed system: a trainer and an inference engine sharing one set of weights, synchronized every step, fed by rewards that must never crash, on tasks where the model can still disagree with itself.