A detailed survey and perspective on Adaptive Parallel Reasoning (APR) for LLMs, covering the limitations of sequential reasoning (context-rot, latency, compute cost) and how parallel reasoning addresses them. The post traces the evolution from simple fork-and-join methods (self-consistency, Best-of-N) through heuristic-based search (Tree-of-Thoughts, MCTS) to fully adaptive approaches where the model itself decides when and how to parallelize. Key systems discussed include ThreadWeaver (engine-agnostic, client-side orchestration) and Multiverse (KV cache stitching with engine modifications). Training challenges are explored including SFT for control-flow syntax, reward design using critical-path efficiency metrics, and the instability of parallelization behavior under RL. Open questions include whether APR benefits are primarily inference-time or training-time, hardware-aware parallelization, and recursive/nested parallelism.
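The simplest fork-and-join pattern mentioned above, self-consistency, can be sketched as follows. This is a hedged illustration of the general idea, not code from ThreadWeaver or Multiverse; `sample_answer` is a hypothetical stand-in for one sampled LLM reasoning trace.

```python
# Fork-and-join self-consistency: fork N independent reasoning samples,
# then join by majority vote over their final answers.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import random

def sample_answer(question: str, seed: int) -> str:
    """Hypothetical stand-in for one sampled reasoning trace.

    A real system would call the model with temperature > 0; here we
    simulate a noisy answerer so the sketch is self-contained."""
    rng = random.Random(seed)
    return "42" if rng.random() < 0.7 else str(rng.randint(0, 9))

def self_consistency(question: str, n: int = 16) -> str:
    # Fork: sample n reasoning paths concurrently.
    with ThreadPoolExecutor(max_workers=n) as pool:
        answers = list(pool.map(lambda s: sample_answer(question, s), range(n)))
    # Join: return the most common final answer.
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))
```

Note that the fork width `n` is fixed ahead of time here; the "adaptive" methods surveyed in the post let the model itself decide when and how wide to fork.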

19 min read · From bair.berkeley.edu
Table of contents
- Motivation
- From Fixed Parallelism to Adaptive Control
- Inference Systems for Adaptive Parallelism
- Training Models to Use Parallelism
- Evaluation and Open Questions
- Acknowledgements
