
Aligning Agents like LLMs

Adam Jelley¹², Yuhan Cao¹, David Bignell¹,
Sam Devlin¹, Tabish Rashid¹

¹Microsoft Research Cambridge, ²University of Edinburgh

Website for "Aligning Agents like Large Language Models". arXiv: https://arxiv.org/abs/2406.04208

In this project, we investigate training an agent by following the modern Large Language Model (LLM) alignment pipeline of unsupervised pre-training, supervised fine-tuning and reinforcement learning from human feedback (RLHF) on the Xbox game Bleeding Edge.

Motivation

Training an agent with imitation learning provides a scalable approach to learning how to behave in a complex 3D environment from high-dimensional visual information (pixels). However, imitation agents do not always perform the desired behaviors when deployed!

A 103M parameter GPT-style transformer agent, trained with imitation learning on 1.12 years of Bleeding Edge human gameplay. Not all imitated behaviors are desirable!

Note that videos are not representative of typical gameplay.

In this work, we consider an illustrative example where the agent spawns on an island with three jumppads (the yellow ramps in the above videos). We would like our agent to navigate directly to the left (or right) jumppad. We see that our general imitation learning agent sometimes performs this behavior, but not reliably, and over 40% of the time fails to reach any jumppad at all.

Base Imitation Model Success Rate
General imitation learning agent jumppad success rates.

We draw an analogy between the undesirable behaviors of our imitation learning agent and the unhelpful responses of unaligned LLMs. Unaligned LLMs (trained only with unsupervised pre-training) contain a lot of knowledge, but frequently produce unhelpful responses, and must be aligned with subsequent supervised fine-tuning and reinforcement learning from human feedback (RLHF) stages to make them useful. Analogously, while scaling up our model and data can provide improved gameplay knowledge and generality, it provides no means for the agent to distinguish between expert and novice behaviors (or more generally, desired and undesired behaviors).

Shoggoth with Smiley Face
Artistic illustration of LLM alignment. Source: https://huyenchip.com/2023/05/02/rlhf

By following the modern LLM alignment pipeline, we hope to align our base imitation model to reliably perform the desired behavior, making it useful. More generally, alignment might be used to adjust the agent's skill level, to obtain different gameplay styles or personalities, or simply to achieve more human-like behavior.

InstructGPT Performance Ablation
Alignment improves perceived helpfulness across language model sizes (InstructGPT). Source: https://openai.com/index/instruction-following

Supervised Fine-Tuning

We begin by fine-tuning our base imitation agent on curated trajectories that travel directly to a jumppad.

Demonstration trajectories of an agent going to the left, middle and right jumppads.

We find that our fine-tuned agent has an increased success rate for reaching all three jumppads. However, the agent still does not have a preference for a particular jumppad, and reaches all three in roughly even proportions (as expected from the training data).

Fine-Tuned Imitation Model Success Rate
Fine-tuned imitation learning agent jumppad success rates.
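
For concreteness, here is a minimal sketch of this fine-tuning stage, assuming a causal transformer policy that maps observation sequences to logits over discretized controller actions, and a data loader over the curated demonstrations. The function, shapes, and hyperparameters below are illustrative assumptions, not the exact setup used in the paper.

```python
import torch
import torch.nn.functional as F

def supervised_fine_tune(policy, curated_loader, epochs=1, lr=1e-5):
    """Fine-tune a pre-trained imitation policy on curated jumppad trajectories.

    `policy` is assumed to map observation sequences (B, T, obs_dim) to
    action logits (B, T, n_actions); `curated_loader` yields
    (observations, actions) pairs from the curated demonstrations.
    """
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    policy.train()
    for _ in range(epochs):
        for obs, actions in curated_loader:   # actions: (B, T) int64
            logits = policy(obs)              # (B, T, n_actions)
            # Standard behavior-cloning loss: cross-entropy against the
            # demonstrated action at every timestep.
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                actions.reshape(-1),
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```

The same loss is used as in pre-training; only the data changes, from general gameplay to the curated jumppad demonstrations.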

While these demonstration trajectories can be successfully used for fine-tuning our agent, we find that an agent trained from scratch on these limited trajectories does not perform as well. In particular, pre-training makes the agent more robust to going out of distribution of the fine-tuning trajectories, since the pre-trained agent has additional knowledge of how to return to the distribution of desired trajectories.

The general pre-trained agent (left) is more robust to going out-of-distribution than the fine-tuned only agent (right). This simple example demonstrates the benefits of incorporating larger scale data with unsupervised pre-training.

Preference Modeling

Still following the LLM alignment pipeline, we now train a reward model to capture our preferences about the fine-tuned agent's behavior. In our work we use synthetic preferences to investigate how performance scales with the number of preference labels (a proxy for human labelling time). We find that initializing the reward model with the pre-trained agent allows it to capture our preferences much more accurately, enabling strong performance with comparatively few preference labels.

Reward model performances.
Reward model test performances.
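
As a rough illustration, the sketch below trains a reward model with the standard Bradley-Terry pairwise objective over preferred/rejected trajectory segments, with its backbone initialized from the pre-trained agent as described above. The `backbone`, `hidden_dim`, and segment shapes are assumptions for illustration and may differ from the paper's exact reward model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scalar reward head on top of the pre-trained agent's transformer trunk."""

    def __init__(self, backbone, hidden_dim):
        super().__init__()
        self.backbone = backbone             # initialized from pre-training
        self.reward_head = nn.Linear(hidden_dim, 1)

    def forward(self, obs_seq):
        feats = self.backbone(obs_seq)       # (B, T, hidden_dim) features
        # Sum per-step rewards to score the whole trajectory segment.
        return self.reward_head(feats).squeeze(-1).sum(dim=1)   # (B,)

def preference_loss(reward_model, preferred, rejected):
    """Bradley-Terry loss: the preferred segment should score higher."""
    r_pref = reward_model(preferred)
    r_rej = reward_model(rejected)
    return -F.logsigmoid(r_pref - r_rej).mean()
```

Re-using the pre-trained weights in `backbone` is what lets the reward model generalize from relatively few (here synthetic) preference labels.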

Alignment with Reinforcement Learning (Synthetic RLHF)

We can now align our agent with our preferences by further fine-tuning it with reinforcement learning using our reward models. We find that we are able to significantly improve alignment efficiency by first fine-tuning on the trajectories labelled with the greatest reward. This is similar to Reinforced Self-Training (ReST) (Gulcehre et al. 2023), introduced for LLM alignment. We term this additional alignment step preference fine-tuning.
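
A minimal sketch of this preference fine-tuning step under these assumptions: trajectories are scored by the learned reward model, the top-scoring fraction is kept, and the policy is behavior-cloned on the filtered set (re-using the `supervised_fine_tune` sketch from the fine-tuning section above). The filtering fraction and helper names are illustrative, not the paper's exact recipe.

```python
import torch

def preference_fine_tune(policy, reward_model, trajectories, top_fraction=0.1):
    """ReST-style filtering: behavior-clone on the highest-reward trajectories.

    `trajectories` is assumed to be a list of (obs, actions) tensor pairs,
    with obs of shape (T, obs_dim) and actions of shape (T,).
    """
    with torch.no_grad():
        # Score each trajectory with the learned reward model.
        scores = torch.tensor([reward_model(obs.unsqueeze(0)).item()
                               for obs, _ in trajectories])
    k = max(1, int(top_fraction * len(trajectories)))
    best = scores.topk(k).indices.tolist()
    # Add a batch dimension so the filtered set can be consumed like a
    # loader by the supervised_fine_tune sketch defined earlier.
    filtered = [(trajectories[i][0].unsqueeze(0), trajectories[i][1].unsqueeze(0))
                for i in best]
    return supervised_fine_tune(policy, filtered)
```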

We find that, with this improved alignment procedure, we are able to reliably align our agent within our limited compute budget to reach either the left or the right jumppad.
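
The project page does not spell out the RL algorithm used for this final stage, so the sketch below uses a REINFORCE-style policy-gradient update against the learned reward model, with a KL penalty toward a frozen reference policy as is common in RLHF. The names, shapes, and the absence of a baseline or advantage estimate are illustrative simplifications, not necessarily the algorithm used in the paper.

```python
import torch
import torch.nn.functional as F

def rlhf_update(policy, ref_policy, reward_model, obs, actions,
                optimizer, kl_coef=0.1):
    """One REINFORCE-style policy update using the learned reward model.

    `obs` (B, T, obs_dim) and `actions` (B, T) come from rollouts of the
    current policy; `ref_policy` is a frozen copy of the fine-tuned agent
    used for the KL penalty.
    """
    with torch.no_grad():
        returns = reward_model(obs)                      # (B,) trajectory scores
        ref_logp = F.log_softmax(ref_policy(obs), dim=-1)

    logp = F.log_softmax(policy(obs), dim=-1)            # (B, T, n_actions)
    act_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)       # (B, T)
    ref_act_logp = ref_logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    # Single-sample estimate of the KL from the reference policy,
    # summed over the trajectory.
    kl_penalty = (act_logp - ref_act_logp).sum(dim=1)    # (B,)

    # REINFORCE objective with the KL penalty folded into the return.
    shaped_return = returns - kl_coef * kl_penalty.detach()
    loss = -(shaped_return * act_logp.sum(dim=1)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```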

Left Jumppad Alignment

The agent has been aligned to consistently reach the left jumppad.
Left-Aligned Model Success Rate
Left-aligned agent jumppad success rates.

Right Jumppad Alignment

Alternatively, the agent can be aligned to consistently reach the right jumppad.
Right-Aligned Model Success Rate
Right-aligned agent jumppad success rates.

Alignment Summary

A summary of our alignment procedure and a heatmap of agent trajectories for each stage of alignment are shown below.

Alignment Procedure Overview
Overview of our procedure for aligning an agent.
Agent Trajectories Heatmap
Heatmap of our agent's trajectories at each stage of alignment.

Conclusion

We demonstrated that the modern LLM training procedure can be used to reliably align agents to perform desired behaviors in complex environments. These behaviors would be difficult to achieve with any one stage of training alone. Our analysis shows that many recent developments in the procedure for training LLMs can also be applied to training agents, with similar benefits.

Check out the full paper for more details: https://arxiv.org/abs/2406.04208

And a big thank you to Ninja Theory for enabling this research as part of Project Paidia!

Citation

If you found this work interesting, please consider citing our paper:

@misc{jelley2024aligning,
      title={Aligning Agents like Large Language Models},
      author={Adam Jelley and Yuhan Cao and Dave Bignell and Sam Devlin and Tabish Rashid},
      year={2024},
      eprint={2406.04208},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}