In this project, we investigate training an agent by following the modern Large Language Model (LLM) alignment pipeline of unsupervised pre-training, supervised fine-tuning and reinforcement learning from human feedback (RLHF) on the Xbox game Bleeding Edge.
Motivation
Training an agent with imitation learning provides a scalable approach to learning how to behave in a complex 3D environment from high-dimensional visual information (pixels). However, imitation agents do not always perform the desired behaviors when deployed!
In this work, we consider an illustrative example where the agent spawns on an island with three jumppads (the yellow ramps in the above videos). We would like our agent to navigate directly to the left (or right) jumppad. We see that our general imitation learning agent sometimes performs this behavior, but not reliably, and over 40% of the time fails to reach any jumppad at all.
We draw an analogy between the undesirable behaviors of our imitation learning agent and the unhelpful responses of unaligned LLMs. Unaligned LLMs (trained only with unsupervised pre-training) contain a lot of knowledge, but frequently produce unhelpful responses, and must be aligned with subsequent supervised fine-tuning and reinforcement learning from human feedback (RLHF) stages to make them useful. Analogously, while scaling up our model and data can provide improved gameplay knowledge and generality, it provides no means for the agent to distinguish between expert and novice behaviors (or more generally, desired and undesired behaviors).
By following the modern LLM alignment pipeline, we hope to align our base imitation model to reliably perform the desired behavior and make it useful. More generally, this may include adjusting the agent's abilities to obtain different gameplay styles or personalities, or simply to achieve more human-like behavior.
Supervised Fine-Tuning
We begin by fine-tuning our base imitation agent on curated trajectories that travel directly to a jumppad.
We find that our fine-tuned agent has an increased success rate for reaching all three jumppads. However, the agent still does not have a preference for a particular jumppad, and reaches all three in roughly even proportions (as expected from the training data).
While these demonstration trajectories can be used successfully to fine-tune our agent, we find that an agent trained from scratch on these limited trajectories does not perform as well. In particular, pre-training makes the agent more robust to going out of the distribution of the fine-tuning trajectories, since the pre-trained agent has additional knowledge about how to return to the distribution of desired trajectories.
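To make this fine-tuning step concrete, the sketch below shows one way supervised fine-tuning could look in PyTorch: behavior cloning on curated trajectories, starting from pre-trained weights with a small learning rate. The PixelPolicy class, the toy data, the action space size, and the checkpoint path are illustrative assumptions, not the architecture or data pipeline used in our work.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

class PixelPolicy(nn.Module):
    # Minimal stand-in policy: a small CNN over pixel observations producing
    # discrete action logits. The real agent architecture and action space differ.
    def __init__(self, num_actions=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Linear(64 * 14 * 14, num_actions)  # flattened CNN features for 128x128 inputs

    def forward(self, obs):
        return self.head(self.encoder(obs))

# Toy curated data: (observation, expert action) pairs taken from trajectories
# that travel directly to a jumppad.
observations = torch.rand(256, 3, 128, 128)
expert_actions = torch.randint(0, 16, (256,))
loader = DataLoader(TensorDataset(observations, expert_actions), batch_size=32, shuffle=True)

policy = PixelPolicy()
# policy.load_state_dict(torch.load("pretrained_agent.pt"))  # hypothetical checkpoint: start from the base imitation agent
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-5)    # small learning rate for fine-tuning

for epoch in range(3):
    for obs, act in loader:
        loss = F.cross_entropy(policy(obs), act)  # behavior cloning loss on curated actions
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

The only difference from standard behavior cloning is the starting point: the fine-tuned agent inherits the general gameplay knowledge of the pre-trained weights rather than learning from the small curated set alone.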
Preference Modeling
Still following the LLM alignment pipeline, we now train a reward model to capture our preferences about the fine-tuned agent's behavior. In our work we use synthetic preferences to investigate how performance scales with the number of preference labels (a proxy for human labelling time). We find that initializing the reward model with the pre-trained agent allows it to capture our preferences much more accurately, enabling strong performance with comparatively few preference labels.
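As an illustration of this step, the sketch below (continuing the placeholders from the previous sketch) trains a reward model with a standard Bradley-Terry pairwise preference loss, reusing the pre-trained encoder as the initialization. The RewardModel class, segment shapes, and synthetic preference pairs are illustrative assumptions, not the exact reward model or training configuration used in our work.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    # Scores a segment of pixel observations with a single scalar reward.
    # Initializing the encoder from the pre-trained agent is the trick described above.
    def __init__(self, pretrained_encoder):
        super().__init__()
        self.encoder = pretrained_encoder              # initialized from the pre-trained agent
        self.reward_head = nn.Linear(64 * 14 * 14, 1)  # per-frame reward from flattened CNN features

    def forward(self, segment):                        # segment: (T, 3, 128, 128)
        return self.reward_head(self.encoder(segment)).sum()  # total reward for the segment

# Copy the encoder so reward model training does not modify the policy itself.
reward_model = RewardModel(copy.deepcopy(policy.encoder))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Synthetic preference pairs: in each pair the first segment is preferred
# (e.g. it heads more directly towards the left jumppad).
preference_pairs = [(torch.rand(10, 3, 128, 128), torch.rand(10, 3, 128, 128)) for _ in range(64)]

for preferred, rejected in preference_pairs:
    r_pref, r_rej = reward_model(preferred), reward_model(rejected)
    loss = -F.logsigmoid(r_pref - r_rej)  # Bradley-Terry loss: the preferred segment should score higher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()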
Alignment with Reinforcement Learning (Synthetic RLHF)
We can now align our agent with our preferences by further fine-tuning it with reinforcement learning using our reward models. We find that we can significantly improve alignment efficiency by first fine-tuning on the trajectories that are assigned the greatest reward by the reward model. This is similar to Reinforced Self-Training (ReST) (Gulcehre et al., 2023), introduced for LLM alignment. We term this additional alignment step preference fine-tuning.
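The sketch below illustrates this preference fine-tuning idea, continuing the placeholders from the earlier sketches: score rollouts with the reward model, keep the highest-reward trajectories, and run another round of supervised fine-tuning on them. The rollout format, the top-25% filtering threshold, and the omission of the subsequent RL stage are all simplifications for illustration rather than our exact procedure.

import torch
import torch.nn.functional as F

# Continuing from the earlier sketches: `policy` is the fine-tuned agent and
# `reward_model` is the trained preference reward model.

# Toy rollouts standing in for trajectories collected by running the agent in the game.
rollouts = [
    {"obs": torch.rand(10, 3, 128, 128), "actions": torch.randint(0, 16, (10,))}
    for _ in range(32)
]

# Score each trajectory with the reward model and keep the top fraction (ReST-style filtering).
with torch.no_grad():
    scores = torch.stack([reward_model(traj["obs"]) for traj in rollouts])
top = scores.topk(k=len(rollouts) // 4).indices  # keep the top 25% by predicted reward
best_rollouts = [rollouts[i] for i in top.tolist()]

# Preference fine-tuning: another round of supervised fine-tuning, now on the
# highest-reward trajectories, before the subsequent RL stage (omitted here).
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-5)
for epoch in range(3):
    for traj in best_rollouts:
        loss = F.cross_entropy(policy(traj["obs"]), traj["actions"])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()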
We find that with this improved alignment procedure we are able to reliably align our agent within our limited compute budget to reach both the left and the right jumppads.
Left Jumppad Alignment
Right Jumppad Alignment
Alignment Summary
A summary of our alignment procedure and a heatmap of agent trajectories for each stage of alignment are shown below.
Conclusion
We demonstrated that the modern LLM training procedure can be used to reliably align agents to perform desired behaviors in complex environments, behaviors that would be difficult to achieve with any one stage of training alone. Our analysis shows that many of the recent developments in the procedure for training LLMs can also be applied to training agents, with similar benefits.
Check out the full paper for more details: https://arxiv.org/abs/2406.04208
And a big thank you to Ninja Theory for enabling this research as part of Project Paidia!
Citation
If you found this work interesting, please consider citing our paper:
@misc{jelley2024aligning,
  title={Aligning Agents like Large Language Models},
  author={Adam Jelley and Yuhan Cao and Dave Bignell and Sam Devlin and Tabish Rashid},
  year={2024},
  eprint={2406.04208},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}