VirTex: Learning Visual Representations from Textual Annotations

VirTex is a pretraining approach that uses semantically dense captions to learn visual representations. We train a CNN + Transformer model from scratch on COCO Captions, then transfer the CNN to downstream vision tasks including image classification, object detection, and instance segmentation. VirTex matches or outperforms models pretrained on ImageNet -- both supervised and unsupervised -- despite using up to 10x fewer images.
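The idea above can be sketched in a few lines of PyTorch. This is a minimal illustrative sketch, not the authors' actual implementation: the class, layer sizes, and tiny convolutional backbone are all placeholders (the paper uses a ResNet-50 backbone and bidirectional captioning Transformers). It shows the core structure: image features from a visual backbone serve as the Transformer decoder's memory for caption prediction, and after pretraining only the backbone is transferred downstream.

```python
# Hypothetical sketch of VirTex-style pretraining (illustrative names and
# sizes, not the authors' API): a visual backbone feeds a Transformer
# decoder that predicts caption tokens.
import torch
import torch.nn as nn

class VirTexSketch(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128):
        super().__init__()
        # Stand-in visual backbone (the paper uses a ResNet-50).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        # Spatial image features become the decoder "memory",
        # one token per spatial cell.
        f = self.backbone(images)              # (B, D, H, W)
        memory = f.flatten(2).transpose(1, 2)  # (B, H*W, D)
        # Causal mask: each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(self.embed(tokens), memory, tgt_mask=mask)
        return self.head(out)                  # (B, T, vocab)

model = VirTexSketch()
images = torch.randn(2, 3, 64, 64)
tokens = torch.randint(0, 1000, (2, 12))
logits = model(images, tokens)
print(logits.shape)    # torch.Size([2, 12, 1000])

# After pretraining, only the visual backbone transfers downstream:
features = model.backbone(images)
print(features.shape)  # torch.Size([2, 128, 16, 16])
```

Training would minimize cross-entropy between `logits` and shifted caption tokens; the captioning head is then discarded and the backbone is fine-tuned or frozen for the downstream task.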
