Google DeepMind Fine-tunes AI Decision-making via Reinforcement Learning

AIbase · Published in AI News · 5 min read · May 20, 2025

Recently, the Google DeepMind team, in collaboration with the LIT AI Lab at Johannes Kepler University Linz, released a new study on language models. They applied reinforcement learning fine-tuning (RLFT) to strengthen these models' decision-making capabilities, targeting critical weaknesses in how models make decisions by training on their self-generated chains of reasoning.

Trained on vast amounts of data, today's language models process text impressively and can even make knowledge-based decisions in interactive environments. In practice, however, they are often "all talk and no action": they can derive the correct strategy yet fail to execute it. They also tend to favor choices that yield higher short-term rewards, and smaller models frequently exhibit frequency bias, repeatedly selecting common actions regardless of payoff.

Traditional reinforcement learning methods such as the UCB algorithm (sketched below) balance exploration and exploitation to some extent, but they do not resolve the disconnect between the model's reasoning and its actions. To address this, the DeepMind team introduced reinforcement learning fine-tuning, which uses the model's self-generated chains of reasoning as the training signal: the system evaluates the reward associated with each reasoning step, encouraging the model to prefer action plans that are logically consistent and actually effective.
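For context, here is a minimal sketch of the UCB1 strategy referenced above, applied to a toy Bernoulli bandit. The environment, arm probabilities, and function names are illustrative assumptions, not details from the study.

```python
import math
import random

def ucb1_bandit(true_means, steps=1000):
    """Run UCB1 on a toy multi-armed bandit (illustrative only)."""
    k = len(true_means)
    counts = [0] * k      # pulls per arm
    values = [0.0] * k    # running mean reward per arm
    total_reward = 0.0

    for t in range(1, steps + 1):
        if t <= k:
            arm = t - 1   # pull every arm once first
        else:
            # mean estimate plus an exploration bonus that shrinks
            # as an arm is sampled more often
            arm = max(range(k),
                      key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if random.random() < true_means[arm] else 0.0  # Bernoulli reward
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total_reward += reward
    return total_reward, counts

# Example: a 10-arm bandit, like the setting used in the experiments
reward, pulls = ucb1_bandit([0.1 * i for i in range(1, 11)])
```

The square-root bonus term is what drives exploration: arms pulled less often receive a larger bonus, so the agent keeps sampling them until their estimates are trustworthy.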

In practice, the model generates sequences containing both a reasoning process and an action, conditioned on the input instruction and the history of past actions and rewards. Optimization uses a Monte Carlo baseline and generalized advantage estimation, and ineffective or malformed actions trigger a penalty. Reward shaping both standardizes the output format and preserves room for exploration.
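The paper's exact objective is not reproduced here, but the following simplified sketch shows how the pieces described above might fit together: a shaping penalty for malformed actions, and Monte Carlo returns against a mean baseline standing in for full generalized advantage estimation. All names and constants are assumptions for illustration.

```python
from typing import List

INVALID_ACTION_PENALTY = -5.0  # assumed value; penalizes unparseable actions

def shaped_reward(env_reward: float, action_is_valid: bool) -> float:
    """Reward shaping: keep the environment signal, penalize malformed actions."""
    return env_reward if action_is_valid else env_reward + INVALID_ACTION_PENALTY

def mc_advantages(episode_rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Discounted Monte Carlo returns minus their mean, as a simple baseline
    (a stand-in for generalized advantage estimation)."""
    returns, g = [], 0.0
    for r in reversed(episode_rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    baseline = sum(returns) / len(returns)
    return [ret - baseline for ret in returns]

# The policy-gradient update would then weight the log-probability of each
# generated reasoning+action sequence by its advantage.
rewards = [shaped_reward(r, ok) for r, ok in [(1.0, True), (0.0, False), (1.0, True)]]
advantages = mc_advantages(rewards)
```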

In experiments, the research team tested the models on multi-armed bandit tasks. In the 10-arm setting, the 2B-parameter model's action coverage improved by 12 percentage points; in the 20-arm setting the gain was smaller, but the frequency-bias rate dropped from 70% to 35%, demonstrating the effectiveness of the approach. In tic-tac-toe, the model's win rate against random opponents increased fivefold, and its average return against an optimal Monte Carlo tree search agent rose from -0.95 to 0. Notably, the 27B model generated correct reasoning 87% of the time, yet without fine-tuning it executed the optimal action only 21% of the time. Together, these results show that reinforcement learning fine-tuning narrows the gap between reasoning and execution.
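To make the action-coverage metric concrete, one plausible reading (an assumption here, not the paper's stated definition) is the fraction of available arms the agent tries at least once:

```python
def action_coverage(actions_taken, num_arms: int) -> float:
    """Fraction of available arms pulled at least once (assumed definition)."""
    return len(set(actions_taken)) / num_arms

# e.g. an agent that only ever pulled 5 of 10 arms has 50% coverage
print(action_coverage([0, 1, 1, 3, 3, 4, 7], num_arms=10))  # 0.5
```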

Key Takeaways:

📊 The study uses reinforcement learning fine-tuning (RLFT) technology to enhance AI language models' decision-making capabilities.

🧩 Training through self-generated chains of reasoning effectively improves the logical reasoning and action selection of the model.

🏆 Experiments show that the model significantly improved performance in multi-armed bandits and tic-tac-toe, narrowing the gap between reasoning and execution.

Tags: Reinforcement Learning from Human Feedback (RLHF) · AlphaGo · DeepMind · Chain of Thought · Language Model

This article is from AIbase Daily.

