A deep dive into the RYS (Repeat Your Self) technique, which improves LLM performance by duplicating middle transformer layers without changing any weights or doing any training. Applying it to Qwen3.5-27B, the author confirms that the method generalizes beyond Qwen2-72B. Key findings: cosine-similarity experiments on English, Chinese, and Base64 inputs directly reveal a three-phase transformer anatomy (encoding, reasoning, decoding), with the reasoning phase operating in a language-agnostic "universal" space; duplicating a single contiguous mid-stack block dominates the Pareto frontier, beating both beam-search compositions of blocks and sparse repeats; and a surrogate XGBoost model scored 2 million configurations to surface promising candidates, whose top picks were then fully benchmarked. Four Pareto-optimal RYS model variants are released on Hugging Face, and the scanning codebase is open source. The work suggests that reasoning circuits are a general architectural property of transformers rather than model-specific artifacts.
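The core operation, duplicating a contiguous block of middle layers with no new weights, can be pictured as a change to the order in which existing layers execute. The sketch below is a minimal illustration under stated assumptions: a block is identified by a half-open index range `[start, end)`, layers are modeled as plain callables, and `rys_layer_order` and `run_stack` are hypothetical names, not functions from the author's released scanning code.

```python
from typing import Callable, List, Sequence


def rys_layer_order(num_layers: int, start: int, end: int) -> List[int]:
    """Execution order when the contiguous block [start, end) runs twice.

    Hypothetical helper for illustration. Index k always refers to the same
    original layer, so the repeat reuses existing weights: nothing is copied
    or retrained.
    """
    if not 0 <= start < end <= num_layers:
        raise ValueError("need 0 <= start < end <= num_layers")
    first_pass = list(range(0, end))         # layers 0 .. end-1, run once
    repeat = list(range(start, end))         # mid-stack block, run a second time
    tail = list(range(end, num_layers))      # remaining layers, run once
    return first_pass + repeat + tail


def run_stack(layers: Sequence[Callable], order: Sequence[int], x):
    """Apply the (shared) layers in the given order to a hidden state x."""
    for idx in order:
        x = layers[idx](x)
    return x


if __name__ == "__main__":
    order = rys_layer_order(num_layers=8, start=3, end=5)
    print(order)  # [0, 1, 2, 3, 4, 3, 4, 5, 6, 7]

    # Toy "layers" that add their own index, to show weight sharing.
    toy_layers = [lambda h, k=k: h + k for k in range(8)]
    print(run_stack(toy_layers, order, 0))  # 35
```

Here the stack grows by two layer applications, yet the parameter count is unchanged because indices 3 and 4 point at the same layers both times. In a real framework the analogous step would be building a new layer list that references the same module objects; how the released code handles details such as per-layer caches is not covered by this sketch.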
Table of contents
Why Qwen3.5-27B
Seeing the Anatomy Directly
The Heatmaps: Results First
Single-Layer Repeats: Running One Step of the Recipe Again
Beam Search: Composing Blocks
Surrogate Model: Ranking Millions, Measuring Hundreds
Larger Validation Sets: Graduating from 16 Questions
Pareto Punchline
The Models
The Code
What This Means
Citing this work