Nowadays, transfer learning from models pretrained on ImageNet is the de facto standard in computer vision. Self-supervised learning dominates natural language processing, but that doesn't mean there are no significant use-cases in computer vision where it should be considered. Indeed, there are a lot of cool self-supervised tasks that one can devise when dealing with images, such as jigsaw puzzles, image colorization, image inpainting, or even unsupervised image synthesis.
But what happens when the time dimension comes into play? How can you approach the video-based tasks that you would like to solve?
So, let’s start from the beginning, one concept at a time. What is self-supervised learning? And how is it different from transfer learning? What is a pretext task?