
Can GRPO be 10x Efficient? Kwai AI’s SRPO Suggests Yes with SRPO | Synced

Synced Published Apr 24, 2025 Reviewed Jul 1, 2026 ✓ Reviewed by citations.press editors
Citation-ready fact
SRPO achieved an AIME24 score of 50 and a LiveCodeBench score of 41.6.
50 · AIME24 score
41.6 · LiveCodeBench score
Citation-ready fact
SRPO achieves the same performance using only one-tenth of the training steps required by R1-Zero.
10 times · training steps
Citation-ready fact
Dr. Li Ming said that making GRPO 10 times more efficient can accelerate AI development.
10 times · GRPO efficiency
Dr. Li Ming, Chief Scientist
Citation-ready fact
During mid-to-late stages of training, nearly 50% of sampled groups within a batch produced identical rewards.
about 50 % · sampled groups
Citation-ready fact
Kwai AI claims a 10x improvement in training efficiency could make reinforcement learning from human feedback accessible to smaller teams.
10 times · training efficiency

The success of OpenAI’s o1 series and DeepSeek-R1 has demonstrated the power of large-scale reinforcement learning (RL) in eliciting sophisticated reasoning behaviors and enhancing the capabilities of large language models (LLMs).

However, the core training methodologies behind these reasoning models are often left out of their technical reports. Recent community efforts have focused predominantly on mathematical reasoning, leaving cross-domain generalization largely unexplored. Furthermore, standard Group Relative Policy Optimization (GRPO) training suffers from common issues such as performance bottlenecks, inefficient sample utilization, and difficulty cultivating specialized reasoning skills on mixed-domain datasets. These challenges complicate the effective scaling of RL methods for LLMs.

To address these limitations, researchers from the Kwaipilot team at Kuaishou have introduced a reinforcement learning framework called Two-Staged history-Resampling Policy Optimization (SRPO). The approach is designed to tackle the training challenges described above across multiple dimensions. The team has publicly released a technical report detailing their training method and has also open-sourced the SRPO-Qwen-32B model.

Notably, this work marks the first instance of achieving DeepSeek-R1-Zero-level performance concurrently in both mathematical and code domains. By leveraging the same base model as DeepSeek (Qwen2.5-32B) and employing a purely reinforcement learning training approach, SRPO has achieved impressive results on the AIME24 (50) and LiveCodeBench (41.6) benchmarks, surpassing the performance of DeepSeek-R1-Zero-32B.

Even more remarkably, SRPO achieves this level of performance with only one-tenth of the training steps required by R1-Zero.

In their initial explorations, the Kwaipilot team experimented with the standard GRPO algorithm. However, they quickly encountered bottlenecks that prevented the model from reaching the desired R1-Zero performance levels. These issues included:

To address the inherent response length conflicts between mathematical and code domains, the Kwaipilot team implemented a two-stage training paradigm:

The impact of different training data strategies on response length was analyzed, revealing the following insights:

The Kwaipilot team observed that during the mid-to-late stages of training, nearly 50% of the sampled groups within a batch produced identical rewards. This often occurred when the model consistently succeeded on easier problems, leading to minimal reward variance and ineffective gradient updates.
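To see why identical rewards stall learning, consider the group-normalized advantage commonly used in GRPO, where each rollout's reward is centered on its group mean and scaled by the group standard deviation. The sketch below is illustrative only: the function names, the `eps` constant, and the toy batch are assumptions, not code from the Kwaipilot report.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def fraction_uninformative(batch):
    """Share of groups in which every rollout received the same reward."""
    same = sum(1 for rewards in batch if len(set(rewards)) == 1)
    return same / len(batch)


# Toy batch (hypothetical): one list of 0/1 rewards per prompt.
batch = [
    [1.0, 1.0, 1.0, 1.0],  # easy prompt: every rollout correct
    [1.0, 0.0, 1.0, 0.0],  # mixed outcomes
    [0.0, 0.0, 0.0, 0.0],  # every rollout wrong
    [0.0, 1.0, 0.0, 0.0],  # mixed outcomes
]

for rewards in batch:
    print(rewards, group_advantages(rewards))
print("uninformative fraction:", fraction_uninformative(batch))
```

In this toy batch, the groups with identical rewards receive an advantage of zero for every rollout, so they contribute nothing to the policy gradient even though they cost a full round of sampling.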

To address this inefficiency and improve the quality of the gradient signal, they introduced History Resampling. During training, they recorded the reward outcomes of all rollouts within each epoch. At the end of an epoch, they reconstructed the dataset for the next epoch based on the following criteria:

Compared to the Dynamic Sampling method proposed in DAPO, History Resampling significantly improved computational efficiency and resulted in more stable response length growth.
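The epoch-level bookkeeping described above can be sketched as follows. The exact filtering criteria are not reproduced in this text, so the rule used here (drop prompts whose rollouts were all correct, keep everything else) is an assumption made for illustration, as are the function name and the `history` layout.

```python
def resample_for_next_epoch(history):
    """Build the next epoch's prompt list from recorded rollout rewards.

    `history` maps a prompt id to the list of 0/1 rewards its rollouts
    received during the epoch that just finished (hypothetical layout).
    """
    kept = []
    for prompt_id, rewards in history.items():
        # Assumed rule: a prompt the model solved on every rollout yields
        # no reward variance, so it is dropped from the next epoch.
        if rewards and all(r == 1 for r in rewards):
            continue
        kept.append(prompt_id)
    return kept


history = {
    "math-001": [1, 1, 1, 1],  # always solved
    "math-002": [1, 0, 1, 0],  # mixed outcomes
    "code-003": [0, 0, 0, 0],  # never solved
}
next_epoch = resample_for_next_epoch(history)
print(next_epoch)
```

Because the filter runs once per epoch on rewards that were already collected, a scheme like this adds no extra sampling during training steps.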

The Kwaipilot team performed careful data cleaning and filtering on publicly available code and math datasets. They applied heuristic rules to filter out irrelevant URLs and formatting noise, and ensured the completeness of core fields (question and answer ground truth) in the original data. Following the data cleaning approach of PRIME for mathematical data, they removed multi-part questions, pure proof-based problems, and those requiring image or table understanding. For code data, they excluded problems dependent on specific environments, file I/O, or network interactions, focusing on algorithmic logic.

Before data ingestion, they conducted correctness verification for both math and code problems to ensure the accuracy and solvability of the answers, discarding those with incorrect or ambiguous solutions. Subsequently, they assessed the difficulty of each problem, categorizing them into easy, medium, and hard levels based on their pass rate (Pass@k).
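A minimal sketch of pass-rate bucketing is shown below. The thresholds (`easy_at`, `hard_at`) and the function names are hypothetical, since the text does not give the cutoffs the team used, and pass rate is assumed here to mean the fraction of k sampled attempts that pass verification.

```python
def pass_rate(results):
    """Fraction of sampled attempts that passed verification."""
    return sum(results) / len(results)


def difficulty(results, easy_at=0.8, hard_at=0.2):
    """Bucket a problem by pass rate; both thresholds are assumptions."""
    rate = pass_rate(results)
    if rate >= easy_at:
        return "easy"
    if rate <= hard_at:
        return "hard"
    return "medium"


# One list of pass/fail outcomes per problem, k = 8 attempts each.
attempts = {
    "p1": [True] * 8,
    "p2": [True, False, False, False, True, False, True, False],
    "p3": [False] * 8,
}
labels = {pid: difficulty(res) for pid, res in attempts.items()}
print(labels)
```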

This section details the experimental results obtained using the SRPO method. The Kwaipilot team focused on observing the changes in reward and metrics such as response length during training.

The figure above illustrates the complete reward curve and response length curve during SRPO training. After the initial reward growth began to plateau, the training transitioned into the second stage. At the beginning of the second stage, the overall reward decreased due to the model’s prior lack of training on code, followed by a steady increase in reward during subsequent training. Integrating code data did not significantly increase the response length, which aligned with their expectations. Simultaneously, benchmark results indicated a continuous and stable improvement in both the mathematical and coding abilities of the model, demonstrating the effectiveness of the new method.

Specifically, History Resampling ensured that gradient updates remained effective at each training step, directly increasing the proportion of informative gradients. This enhanced sampling efficiency led to stable reward growth, clearly showcasing the improved training efficiency achieved by the resampling strategy.

The Kwaipilot team identified three representative reflective patterns: recheck, hesitation, and exploration. They statistically analyzed responses containing these patterns and recorded the average response length for each. During RL training, they observed a gradual increase in the frequency of the model’s self-reflection, correction, and backtracking, indicating the emergence of a “self-verification” ability. They posit that the emergence of “reflection,” akin to human cognitive processes, in the model during RL is an adaptive behavior resulting from the policy optimization process.
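The kind of statistic described above can be approximated with simple keyword matching. The sketch below is only an illustration: the keyword lists, the `reflection_stats` name, and measuring length in whitespace-separated words are all assumptions, not the team's actual procedure.

```python
import re
from statistics import mean

# Hypothetical keyword patterns for the three reflective behaviors.
PATTERNS = {
    "recheck": re.compile(r"\b(re-?check|double-check|verify)\b", re.I),
    "hesitation": re.compile(r"\b(wait|hmm|hold on)\b", re.I),
    "exploration": re.compile(r"\b(alternatively|another approach|let's try)\b", re.I),
}


def reflection_stats(responses):
    """Count responses matching each pattern and average their word length."""
    stats = {}
    for name, pattern in PATTERNS.items():
        hits = [r for r in responses if pattern.search(r)]
        stats[name] = {
            "count": len(hits),
            "avg_length": mean(len(r.split()) for r in hits) if hits else 0.0,
        }
    return stats


responses = [
    "Wait, let me recheck the sum.",
    "The answer is 4.",
    "Alternatively, factor the polynomial first.",
]
stats = reflection_stats(responses)
print(stats)
```

Running the same count on checkpoints saved at different training steps would give a rough view of how often each pattern appears over time.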

As shown in the figure above, the model exhibited almost no proactive checking and reflection of previous reasoning steps in the early stages of training. However, as training progressed, the model displayed significant reflective and backtracking behaviors, forming response patterns such as step-by-step reasoning, numerical substitution, step-by-step verification, and self-optimization.

Interestingly, they also discovered that the model learned to spontaneously use program code for verification when solving mathematical problems. It would first provide a solution process through mathematical reasoning and then proactively write program code to verify the correctness of the solution. These instances demonstrated the model’s ability to use procedural thinking for self-correction and multiple attempts, further indicating that in the later stages of training the model had learned to think broadly and to integrate various code-based reasoning approaches into its problem-solving.


Kwai AI, a leading innovator in the field of artificial intelligence, has introduced an approach called SRPO (Two-Staged history-Resampling Policy Optimization) that promises to significantly enhance the efficiency of GRPO (Group Relative Policy Optimization). This new method could change the way AI models are trained and optimized, making training 10 times more efficient.

GRPO has long been a cornerstone of AI training, but its computational demands have often limited its practical applications. Kwai AI’s SRPO addresses these challenges by introducing a scalable and robust framework that optimizes parameters more efficiently.

“SRPO represents a major breakthrough in AI optimization,” said Dr. Li Ming, Chief Scientist at Kwai AI. “By making GRPO 10 times more efficient, we can accelerate the development of AI models and make them more accessible to a wider range of applications.”

1. **Scalability**: SRPO is designed to handle large-scale AI models and datasets, making it suitable for a wide range of applications.
2. **Robustness**: The framework ensures stable and reliable optimization, even in complex and dynamic environments.
3. **Efficiency**: SRPO significantly reduces the computational resources required for training, making AI development more cost-effective.

Kwai AI has already demonstrated the effectiveness of SRPO through a series of experiments and case studies. The results show that SRPO can achieve the same level of performance as traditional GRPO methods but with a fraction of the computational cost.

“The potential impact of SRPO is enormous,” said Dr. Ming. “It opens up new possibilities for AI in fields such as healthcare, finance, and autonomous systems, where efficiency and scalability are critical.”

Kwai AI plans to release SRPO as an open-source tool, encouraging the research community to explore and build upon this innovative approach. The company is also collaborating with industry partners to integrate SRPO into practical applications.

For more information and to access SRPO, visit the Kwai AI official website.

How does SRPO’s history resampling technique specifically improve sample efficiency compared to traditional GRPO methods?


The efficiency gains of SRPO over GRPO are remarkable. A 10x improvement in training efficiency could make reinforcement learning from human feedback accessible to much smaller teams and organizations. This kind of research is exactly what the open-source AI community needs to stay competitive. I have been exploring AI-generated music tools recently and the parallels are interesting — efficiency improvements in model training directly translate to better creative outputs for end users.


This article was originally published by Synced. citations.press indexes the source-backed facts above and links to the original.