VirTex: Learning Visual Representations from Textual Annotations
We train a CNN+Transformer from scratch on COCO, transfer the CNN to 6 downstream vision tasks, and exceed ImageNet features despite using 10x fewer ...
convolutional-neural-networks transformers coco visual-representations image-captioning object-detection transfer-learning pytorch pretraining natural-language-processing computer-vision article code paper arxiv:2006.06666 virtex tutorial research

VirTex is a pretraining approach that uses semantically dense captions to learn visual representations. We train a CNN+Transformer model from scratch on COCO Captions, then transfer the CNN to downstream vision tasks including image classification, object detection, and instance segmentation. VirTex matches or outperforms models that use ImageNet for pretraining -- both supervised and unsupervised -- despite using up to 10x fewer images.
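As a rough illustration of the pretraining setup described above, here is a minimal PyTorch sketch of a CNN backbone feeding a Transformer decoder that predicts caption tokens. The class name, layer sizes, and forward-only decoding are placeholder assumptions of mine, not the authors' code; see the official VirTex repository for the real implementation and training details.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50


class CaptioningPretrainModel(nn.Module):
    """Hypothetical sketch: CNN backbone + Transformer decoder trained to caption images."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6, max_len=30):
        super().__init__()
        # Visual backbone: ResNet-50 trained from scratch (no ImageNet initialization),
        # with the pooling and classification head removed to keep a spatial feature grid.
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.visual_proj = nn.Linear(2048, d_model)  # project 2048-d features to d_model

        # Textual head: Transformer decoder that attends to the projected image features.
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Parameter(torch.zeros(max_len, d_model))
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers)
        self.output = nn.Linear(d_model, vocab_size)

    def forward(self, images, caption_tokens):
        # images: (B, 3, H, W), caption_tokens: (B, T) integer token ids
        feats = self.backbone(images)                 # (B, 2048, h, w)
        feats = feats.flatten(2).transpose(1, 2)      # (B, h*w, 2048)
        memory = self.visual_proj(feats)              # (B, h*w, d_model)

        T = caption_tokens.size(1)
        tgt = self.token_embed(caption_tokens) + self.pos_embed[:T]
        # Causal mask so each position only attends to earlier caption tokens.
        causal_mask = torch.triu(
            torch.full((T, T), float("-inf"), device=tgt.device), diagonal=1
        )
        out = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.output(out)                       # (B, T, vocab_size) logits
```

After pretraining with a token-level cross-entropy loss on captions, only the CNN part (here `self.backbone`) would be kept and transferred to the downstream recognition tasks, either frozen or fine-tuned.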


Similar projects
The Transformer … “Explained”?
An intuitive explanation of the Transformer by motivating it through the lens of CNNs, RNNs, etc.
Jukebox: A Generative Model for Music
We’re introducing Jukebox, a neural net that generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles.
Image GPT: Generative Pretraining from Pixels
Transformers trained on pixel sequences can generate coherent image completions and samples.
C++ Implementation of PyTorch Tutorials for Everyone
This repository provides tutorial code in C++ to learn PyTorch by building CNNs, RNNs, etc. Tutorials are divided into three sections based on complexity.