GEPA is a prompt optimization method that outperforms GRPO on compound AI systems while using 35× fewer rollouts and requiring no GPU training. Instead of reducing rollout traces to a scalar reward as GRPO does, GEPA feeds full traces to a reflection LLM that rewrites prompts based on observed failure patterns. The method uses Pareto selection to preserve diverse prompt candidates rather than always mutating from the top performer. A concrete HotpotQA example shows a prompt jumping from 38% to 69% accuracy through one reflection cycle. The post also covers when to use GEPA vs GRPO vs MIPROv2 vs TextGrad, and notes that smaller training sets (20–100 examples) often outperform larger ones with GEPA. A secondary section explains why weaker teacher models can produce better fine-tuning data for smaller student models, owing to the capacity mismatch between teacher and student.
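The core loop described above (score candidates per-example, keep the Pareto front rather than a single best prompt, then mutate one survivor via reflection) can be sketched in a few lines of Python. All names here are illustrative, not the actual GEPA API, and the reflection call is a stand-in for the LLM that would rewrite the prompt from full rollout traces:

```python
import random

def pareto_front(scores):
    """scores: dict mapping prompt -> list of per-example scores.
    Keep every prompt that achieves the best score on at least one
    training example, preserving diverse candidates instead of only
    the single top aggregate performer."""
    prompts = list(scores)
    n_examples = len(scores[prompts[0]])
    front = set()
    for i in range(n_examples):
        best = max(scores[p][i] for p in prompts)
        front.update(p for p in prompts if scores[p][i] == best)
    return front

def reflect_and_mutate(prompt, failures):
    """Stand-in for the reflection LLM. In GEPA this step feeds the
    full rollout traces of the failures to an LLM that rewrites the
    prompt; here we just tag the prompt so the loop is runnable."""
    return prompt + f" [revised after {len(failures)} failures]"

def gepa_step(scores, traces):
    """One optimization step: sample a parent from the Pareto front,
    collect its failing traces, and produce a mutated child prompt."""
    parent = random.choice(sorted(pareto_front(scores)))
    failures = [t for t in traces[parent] if not t["correct"]]
    return reflect_and_mutate(parent, failures)
```

For example, with `scores = {"A": [1, 0], "B": [0, 1], "C": [0, 0]}`, the front is `{"A", "B"}`: each is best on one example, so both survive as mutation parents even though neither dominates overall. This is the diversity-preserving behavior the summary contrasts with always mutating the top performer.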

14m read time · From blog.dailydoseofds.com
Table of contents
- A tricky LLM interview question for AI Engineers
- How to beat GRPO without touching model weights
