With trl you can train transformer language models with Proximal Policy Optimization (PPO). The library is built on top of the transformers library by 🤗 Hugging Face, so pre-trained language models can be loaded directly via the transformers interface. At this point only GPT2 is implemented.


  • GPT2 model with a value head: A transformer model with an additional scalar output for each token, which can be used as a value function in reinforcement learning.
  • PPOTrainer: A PPO trainer for language models that just needs (query, response, reward) triplets to optimise the language model.
  • Example: Train GPT2 to generate positive movie reviews with a BERT sentiment classifier.
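
To make the first two items more concrete, here is a minimal sketch in plain Python of the two ideas involved: a value head that maps each token's hidden vector to one scalar, and the clipped surrogate loss that PPO minimises. This is not trl's code or API. The function names `value_head` and `ppo_clipped_loss`, their parameters, and the default `clip_range=0.2` are assumptions chosen for illustration.

```python
import math


def value_head(hidden_states, weights, bias=0.0):
    """Map each token's hidden vector to one scalar value estimate.

    hidden_states: list of per-token vectors (lists of floats).
    weights: one weight per hidden dimension (hypothetical linear head).
    """
    return [
        sum(h * w for h, w in zip(token, weights)) + bias
        for token in hidden_states
    ]


def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_range=0.2):
    """Mean PPO clipped surrogate loss over tokens (lower is better).

    new_logprobs / old_logprobs: per-token log-probabilities of the sampled
    response tokens under the current and the sampling policy.
    advantages: per-token advantage estimates (assumed precomputed here).
    """
    losses = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        # Probability ratio between the current and the sampling policy.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - clip_range, 1 + clip_range].
        clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
        # PPO maximises the smaller of the two terms, so negate for a loss.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)


# Two tokens with 2-dimensional hidden states give one value per token.
values = value_head([[1.0, 2.0], [0.0, 1.0]], weights=[0.5, -1.0])

# An unchanged policy has ratio 1, so the loss is the negated advantage.
loss = ppo_clipped_loss([0.0], [0.0], [2.0])
```

In a real setup the advantages would be derived from the reward and the value head's estimates, and the loss would be computed on tensors with gradients. The sketch only shows the arithmetic for a handful of tokens.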

