Since the Transformer architecture facilitates more parallelization during training computations, it has enabled training on much more data than was possible before it was introduced. This led to the development of pretrained systems such as BERT (Bidirectional Encoder Representations from Transformers) and GPT-2, which have been trained with huge amounts of general language data prior to being released, and can then be fine-tune trained to specific language tasks.
Table of Contents
Share a project
Share something you or the community has made with ML.
If you would like daily updates on trending content and new features, follow us on