With trl you can train transformer language models with Proximal Policy Optimization (PPO). The library is built on top of the transformers library by 🤗 Hugging Face, so pre-trained language models can be loaded directly via the transformers interface. At this point only GPT2 is implemented.


  • GPT2 model with a value head: A transformer model with an additional scalar output for each token which can be used as a value function in reinforcement learning.
  • PPOTrainer: A PPO trainer for language models that just needs (query, response, reward) triplets to optimise the language model.
  • Example: Train GPT2 to generate positive movie reviews with a BERT sentiment classifier.
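
To make the role of the value estimates and the reward more concrete, here is a minimal sketch of the standard PPO clipped policy objective for a single response. It is not trl's actual code or API: the function names `simple_advantages` and `ppo_clipped_loss`, the `clip_range` default, and the simplification that the advantage is just the scalar reward minus the per-token value (no discounting, no GAE, no KL penalty) are all assumptions made for illustration.

```python
import math


def simple_advantages(reward, values):
    """Hypothetical helper: advantage per token as reward minus value estimate.

    Assumes one scalar reward for the whole response and no discounting,
    so the return seen from every token equals the reward.
    """
    return [reward - v for v in values]


def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_range=0.2):
    """Mean PPO clipped surrogate loss over the tokens of one response.

    new_logprobs / old_logprobs: log-probabilities of the response tokens
    under the current policy and under the policy that generated them.
    """
    losses = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        # Probability ratio between the current and the sampling policy
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - clip_range, 1 + clip_range]
        clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
        # PPO maximises the smaller of the two surrogates; negate for a loss
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Toy numbers for a two-token response, chosen only for illustration
    values = [0.5, 0.0]              # value head output per token
    advantages = simple_advantages(1.0, values)
    old_lp = [-1.0, -2.0]
    new_lp = [-0.9, -1.5]
    print(ppo_clipped_loss(new_lp, old_lp, advantages))
```

In this sketch the value head supplies the per-token `values`, the sentiment classifier would supply the scalar reward, and clipping limits how far a single update can move the policy away from the one that generated the response.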

Transformer Reinforcement Learning
Leandro von Werra tells us about his new library which enables you to fine-tune GPT-2 towards a higher-level objective (sentiment of the generated text).