EcomRLVE-GYM extends the RLVE framework from single-turn reasoning puzzles to multi-turn, tool-augmented e-commerce conversational agents. It provides eight verifiable environments covering product discovery, cart building, returns, order tracking, policy QA, bundle planning, substitution, and multi-intent journeys. Each environment uses procedural problem generation, a 12-axis difficulty curriculum, and algorithmically verifiable rewards (no LLM-as-a-judge). A Qwen 3 8B model trained with DAPO for 300 steps advances to progressively harder problems over the course of training, which supports the adaptive scheduling approach. The full environment, verifiers, and a 2M-product catalog are open-sourced.
Table of contents
Why RL for shopping agents?
What a training episode looks like
The eight environments
Adaptive difficulty curriculum
Deep dive: Cart Building (E_CART)
User simulation
Environment scaling
Early results
Try it yourself
Resources
References