Google’s new AI training method helps small models tackle complex reasoning

Researchers at Google Cloud and UCLA have proposed a new reinforcement learning framework that significantly improves the ability of language models to learn very challenging multi-step reasoning tasks. Supervised Reinforcement Learning (SRL) reformulates problem-solving as a sequence of logical “actions,” providing rich learning signals during the training process.

This approach enables smaller models to learn complex problems that were previously out of reach for other common training techniques. Experiments show that SRL not only excels on math reasoning benchmarks but also generalizes effectively to agentic software engineering tasks.

SRL is a versatile training framework that can elevate smaller and less expensive models to higher reasoning abilities.

The limits of current LLM reasoning training

Recent advances in training large language models (LLMs) for reasoning have largely been driven by reinforcement learning with verifiable rewards (RLVR), a method where a model is rewarded based on the correctness of its final answer. By repeatedly trying to solve problems and getting feedback on the final outcome, the model gradually learns effective problem-solving strategies.

However, the success of this outcome-based approach depends on the model's ability to discover a correct solution within a limited number of attempts, or "rollouts." Since each rollout is computationally expensive, models can't try indefinitely. This method hits a wall when problems are so difficult that the model rarely, if ever, finds the right answer within its budget.

This creates a critical learning bottleneck. In many multi-step reasoning problems, a model might correctly solve several steps but get derailed by a single mistake, leading to an incorrect answer. With RLVR, this entire effort receives a negative reward, and the model learns nothing from its partially correct work. It’s an all-or-nothing approach that fails to provide granular feedback and provides sparse rewards.

An alternative method is supervised fine-tuning (SFT), where the model learns from examples containing the full reasoning process laid out by experts. While SFT can instill reasoning abilities, it often leads to overfitting (the model simply learns to imitate the trajectories in the training data instead of learning to generalize to problems beyond the examples it has seen). This issue is made worse by the fact that high-quality, human-created training data is both scarce and expensive to produce.

As the paper notes, these limitations leave "a critical gap for training small open-source models to effectively learn difficult problems."

How supervised reinforcement learning works

SRL introduces a framework that reformulates problem-solving as a "sequential decision-making process," striking a balance between pure outcome-based RL and pure imitation learning. Instead of optimizing only for the final answer or forcing the model to imitate an expert's entire thought process, SRL teaches the model to reproduce a sequence of key actions that form the backbone of expert reasoning. This allows the model to learn to take actions similar to an expert while developing its own internal reasoning style.

In the SRL framework, expert demonstrations are broken down into a series of intermediate, concrete actions, each representing a meaningful step. For a math problem, an action might be an algebraic manipulation. For a software engineering agent, it could be a command executed in a code repository. To generate training data, SRL uses a powerful teacher model to create solution trajectories, which are then used to train a smaller model.

According to I-Hung Hsu, a research scientist at Google and co-author of the paper, this middle-ground approach is key to its effectiveness in real-world scenarios. "SRL sits in the middle: It captures the structured flexibility of real-world problem solving, where there are multiple valid strategies but also clear notions of what ‘good reasoning’ looks like at each step," Hsu told VentureBeat. "This makes SRL suitable for domains like data science automation or probably supply chain optimization — tasks that reward sound intermediate reasoning rather than mere final answers."

During training, the model first generates an "inner monologue" (its internal reasoning process, enclosed in <think> tags) before committing to an action. At each step, SRL provides a reward based on the similarity between the model's predicted action and the expert's action. This step-wise reward system provides dense, fine-grained feedback, allowing the model to learn and improve even if its overall solution isn't perfect. This solves the sparse reward problem RLVR faces.

SRL in action

The researchers' experiments show that SRL significantly outperforms strong baselines in both challenging mathematical reasoning and agentic software engineering benchmarks. They also observed that SRL encourages more flexible and sophisticated reasoning patterns in models, such as interleaved planning and self-verification, which improve solution quality without just making the outputs longer.

For enterprise leaders, performance gains are only valuable if they don't come with runaway costs. Hsu clarifies that SRL-trained models are more efficient in their reasoning. "The gains come from better reasoning quality and structure, not from verbosity," he said. "In terms of efficiency, SRL-trained models are roughly on par with the base model in token usage… while SRL isn’t designed to reduce inference cost, it achieves stronger reasoning performance without increasing it."

For the math tests, the team fine-tuned Qwen2.5-7B-Instruct on a dataset of 1,000 difficult math questions. They compared its performance against models trained with SFT and RLVR (using the GRPO algorithm common in models like DeepSeek-R1) on four competition-level math benchmarks. The SRL-trained model achieved a substantial 3.0% average performance boost over other methods.

The team extended SRL to agentic software engineering, a domain critical for enterprise automation. They trained a coding-specialized model, Qwen2.5-Coder-7B-Instruct, on 5,000 expert trajectories of agents interacting with a coding environment. The SRL-trained model was benchmarked against the original base model and SWE-Gym-7B, a strong baseline fine-tuned with SFT. SRL achieved a 14.8% task resolve rate, representing a 74% relative improvement over the SFT-based model. This shows SRL's ability to train more competent AI agents for complex, real-world programming tasks.

A new standard for high-stakes AI?

The paper's strongest results came from combining methods: First, using SRL to teach foundational reasoning, then using RLVR to refine that skill. In their experiments, when the researchers used SRL as a pre-training and applied RLVR in post-training, they observed a 3.7% average increase, demonstrating a powerful curriculum learning strategy.

This raises the question of whether this could become a new blueprint for building specialized AI.

"We view SRL as a strong foundation," Hsu said. "In a sense, SRL provides a curriculum — teaching models to think and act step by step — before we refine those behaviors with outcome-based reinforcement learning. This SRL-first approach not only stabilizes the later RL stage but also makes reasoning more interpretable and generalizable, which is critical for high-stakes applications."

Looking ahead, Hsu acknowledges that scaling this pipeline still faces challenges, particularly the high cost and complexity of end-to-end RLVR for agentic tasks. However, he is optimistic about the path forward. "While high-quality expert trajectories remain important," he concluded, "we think the next big leap will come from automating their generation and filtering — leveraging strong teacher models or even self-improving student models to bootstrap new data."

Source link