Can Reward Be a Key to General Intelligence?

Reward maximization has been proposed as a sufficient basis for general intelligence, unifying abilities such as knowledge acquisition, perception, social interaction, and generalization under a single objective. In this post, I share my personal reflections on both the strengths and the limitations of this claim. In the end, I believe that making progress toward general intelligence will require developing methods for selecting effective reward functions and advancing techniques for interpreting learned ones.

“Reward is Enough” puts forward the hypothesis that the maximization of reward underlies intelligence and its associated abilities, such as knowledge, learning, perception, social intelligence, language, generalization, and imitation, and argues that this could be a key to understanding and constructing artificial general intelligence. Unlike previous work that demands a specialized problem formulation for each ability, the paper suggests that a single, simple reward-based objective could provide a common basis for all abilities associated with intelligence. However, while some of its statements appear compelling, others remain open to debate.
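For reference, the objective behind this hypothesis is typically formalized as maximizing expected cumulative (discounted) reward; in standard notation (mine, not quoted from the paper):

$$
J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1}\right], \qquad \gamma \in [0, 1),
$$

where $r_{t+1}$ is the scalar reward received after acting at time $t$ and $\gamma$ is a discount factor. Everything the paper claims about intelligence is meant to emerge from agents optimizing objectives of this simple form.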

First of all, the claim in Section 6 that a seemingly innocuous reward signal could be sufficient in a complex environment seems convincing, especially considering that innate knowledge is hard to operationalize and cannot itself be acquired from experience (Section 3.1). Although it may at first seem odd to expect reward to be useful in such an environment, a reward designed around the ultimate goal can offer a dense abstraction that guides the learning of the knowledge required to achieve it. Conversely, this poses a challenge to other approaches such as supervised learning, which relies on labeled data and lacks the flexibility to infer knowledge in the absence of explicit supervision. Similarly, in the context of perception, maximizing reward, unlike supervised learning, can work with context-dependent data and is not restricted to predefined classes. Moreover, since social intelligence inherently requires interaction with others, maximizing reward within the environment could enable ‘effective interaction’ and serve as an indicator of how well the agent ‘understands’ social dynamics. In particular, the fact that other agents can adapt their strategies makes reward even more appealing: a reward-maximizing agent must account for such adaptation, which naturally occurs in real-world interactions, and this in turn increases its robustness. Lastly, the observation that reward does not assume a symmetric teacher, and that other agents can simply be an integral part of the agent’s environment, seems persuasive. For example, in traditional machine learning, knowledge distillation has gained attention as a way to transfer a teacher model’s ability to a smaller model for efficiency, but it requires a teacher that already exhibits precisely the desired behavior. In contrast, reward lets the agent keep learning within the environment, discarding behaviors that prove undesirable and thereby surpassing the teacher, so that others are viewed not only as teachers but also as colleagues.
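To make the contrast with supervised learning concrete, here is a minimal toy sketch (my own illustration, not from the paper): on a two-armed bandit, a supervised learner would need an explicit label saying which arm is ‘correct’, whereas the reward-maximizing learner below gets by on the scalar payoff it experiences.

```python
import numpy as np

rng = np.random.default_rng(0)
ARM_MEANS = [0.2, 0.8]           # arm 1 pays more on average

def pull(arm):
    """Environment feedback: a noisy scalar reward, with no label attached."""
    return ARM_MEANS[arm] + 0.1 * rng.standard_normal()

# A supervised learner would need a label ("arm 1 is correct") to fit anything.
# The reward-maximizing learner below relies on experienced payoff alone.
prefs = np.zeros(2)              # action preferences defining a softmax policy
for step in range(2000):
    probs = np.exp(prefs) / np.exp(prefs).sum()
    arm = rng.choice(2, p=probs)
    reward = pull(arm)
    grad = -probs                # gradient of log pi(arm) w.r.t. preferences...
    grad[arm] += 1.0             # ...is (one-hot of chosen arm) - probs
    prefs += 0.1 * (reward - 0.5) * grad   # REINFORCE update with a crude baseline

print("learned policy:", np.round(np.exp(prefs) / np.exp(prefs).sum(), 3))
```

The same scalar-feedback loop is, in principle, what the paper expects to scale from this toy case to the rich environments it has in mind.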

On the other hand, some arguments remain subject to debate. For instance, the claim that reward-maximizing behavior would give rise to the specific behaviors associated with distinct goals, and could therefore yield general intelligence, requires further justification: while the paper offers explanations for various abilities, every section rests on assumptions without rigorous mathematical proof. Even if we accept the hypothesis, the challenge of defining the reward function remains. With a predefined, fixed reward, the agent may struggle to adapt to drastic environmental changes even while acting optimally with respect to that reward. And even if the reward function can be updated, establishing clear criteria for constructing a reward function that effectively guides the agent toward the ultimate goal remains difficult, given the complexity of the environment; it may simply not be feasible to fold every relevant factor into a single reward function. This challenge becomes even more pronounced in dynamically changing environments, which are characteristic of natural systems. In such cases, the ability to select and adapt the reward function from a set of candidates (whether predefined or not), or to apply strategies other than reinforcement learning, might be more beneficial than relying on a single fixed reward; a toy sketch of what such selection could look like follows below. This, in turn, brings us back to the fundamental question: how do we build such strategies for generalization? Finally, the idea that reward maximization provides a deeper understanding of why a given ability arises does not seem convincing, as it once again leads to another question: how should we interpret the agent’s behavior in terms of the reward function?
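To illustrate what I mean by selecting among candidate reward functions, here is a deliberately toy, hypothetical sketch; the candidate rewards, the sparse success signal, and the greedy rollout are all invented for illustration and are not drawn from the paper or from any existing method.

```python
# Hypothetical setup: the "ultimate goal" is to keep the state near a target
# that occasionally jumps (a drastic environmental change). The agent never
# sees the target directly; it only receives a sparse success signal.
def success(state, target):
    return float(abs(state - target) < 0.5)

# Two candidate reward functions, each shaped around a different assumed goal.
candidates = {
    "toward_0": lambda s: -abs(s - 0.0),
    "toward_5": lambda s: -abs(s - 5.0),
}

def rollout(reward_fn, start, target, steps=20):
    """Act greedily on the candidate reward; return how often the true goal is met."""
    s, total = start, 0.0
    for _ in range(steps):
        s = max([s - 0.5, s, s + 0.5], key=reward_fn)  # greedy step on the candidate
        total += success(s, target)
    return total

target = 0.0
for epoch in range(4):
    if epoch == 2:
        target = 5.0                                   # the environment changes drastically
    # Select the candidate whose induced behavior best matches the ultimate goal.
    scores = {name: rollout(fn, start=2.5, target=target) for name, fn in candidates.items()}
    best = max(scores, key=scores.get)
    print(f"epoch {epoch}: target={target}, selected reward = '{best}', scores={scores}")
```

The open question, of course, is how to perform this kind of selection when the space of candidate reward functions is large and the external success signal is itself hard to define.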

Therefore, to establish a robust foundation, we should focus on building a precise framework for selecting, constructing, and utilizing reward functions, so that they can truly serve as a key component of general intelligence. Recent work is beginning to move in this direction: “To the Max: Reinventing Reward in Reinforcement Learning” addresses the challenge of selecting an effective reward function by proposing max-reward reinforcement learning (sketched below). Furthermore, developing metrics to assess the effectiveness of reward functions is crucial. In this regard, “Understanding Learned Reward Functions” explores techniques for interpreting learned reward functions, underscoring the need for further research into interpretability so that we can construct more robust reward functions capable of generalization.
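Roughly as I understand it (notation mine, not quoted from the paper), max-reward reinforcement learning swaps the usual cumulative objective for the maximum reward attained along a trajectory:

$$
J_{\text{sum}}(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right]
\quad\longrightarrow\quad
J_{\max}(\pi) = \mathbb{E}_{\pi}\!\left[\max_{0 \le t \le T} r_{t}\right],
$$

which, if I read the paper correctly, is better suited to goals defined by the best outcome ever reached (such as arriving at a goal state) than by accumulated quantities, and thus changes what counts as an ‘effective’ reward function in the first place.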