# Survey on Self-Supervised Learning: Auxiliary Pretext Tasks and Contrastive Learning Methods in Imaging


## Abstract


## 1. Introduction

## 2. Feature Representation Learning Schemes

#### 2.1. Transfer Learning

#### 2.2. Unsupervised Learning: Generative Modeling

#### 2.3. Self-Supervised Learning

#### 2.4. Contrastive Learning

## 3. Auxiliary Pretext Task Learning Frameworks

#### 3.1. Unlabeled Data

#### 3.2. Pretext Tasks

#### 3.3. Downstream Tasks

## 4. Contrastive Learning Framework

#### 4.1. Data Augmentation

#### 4.2. Encoder $f(.)$

#### 4.3. Projection Head $g(.)$

#### 4.4. Contrastive Loss

## 5. Approaches to Auxiliary Pretext Tasks

#### 5.1. Context Prediction

#### 5.2. Colorization

#### 5.3. Generative Modeling

#### 5.4. Future Prediction

#### 5.5. Clustering as Pretext Task

## 6. Contrastive Learning Methods for Learning Visual Representation

#### 6.1. Instance Discrimination

#### 6.2. SimCLR

#### 6.3. Memory Bank

#### 6.4. Contrastive Learning without Negatives

#### 6.5. Clustering-Based Methods

## 7. The Performance of Image Feature Learning

## 8. Discussion of Self-Supervised Learning

## 9. Conclusions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv **2014**, arXiv:1409.1556.
2. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
3. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In European Conference on Computer Vision; Springer: Berlin, Germany, 2016; pp. 21–37.
4. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. **2017**, 40, 834–848.
5. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
6. Liu, X.; Zhang, F.; Hou, Z.; Mian, L.; Wang, Z.; Zhang, J.; Tang, J. Self-supervised learning: Generative or contrastive. IEEE Trans. Knowl. Data Eng. **2021**.
7. Kolesnikov, A.; Zhai, X.; Beyer, L. Revisiting self-supervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1920–1929.
8. West, J.; Ventura, D.; Warnick, S. Spring Research Presentation: A Theoretical Foundation for Inductive Transfer; Brigham Young University, College of Physical and Mathematical Sciences: Provo, UT, USA, 2007; Volume 1, p. 8.
9. Yang, F.; Zhang, W.; Tao, L.; Ma, J. Transfer learning strategies for deep learning-based PHM algorithms. Appl. Sci. **2020**, 10, 2361.
10. Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, New York, NY, USA, 5–9 July 2008; pp. 1096–1103.
11. Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. arXiv **2013**, arXiv:1312.6114.
12. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. **2014**, 27.
13. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv **2014**, arXiv:1411.1784.
14. Donahue, J.; Krähenbühl, P.; Darrell, T. Adversarial feature learning. arXiv **2016**, arXiv:1605.09782.
15. Grill, J.-B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.H.; Buchatskaya, E.; Doersch, C.; Pires, B.A.; Guo, Z.D.; Azar, M.G. Bootstrap your own latent: A new approach to self-supervised learning. arXiv **2020**, arXiv:2006.07733.
16. Kwasigroch, A.; Grochowski, M.; Mikołajczyk, A. Self-supervised learning to increase the performance of skin lesion classification. Electronics **2020**, 9, 1930.
17. Caron, M.; Bojanowski, P.; Joulin, A.; Douze, M. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 132–149.
18. Chen, X.; Fan, H.; Girshick, R.; He, K. Improved baselines with momentum contrastive learning. arXiv **2020**, arXiv:2003.04297.
19. Tao, L.; Wang, X.; Yamasaki, T. Pretext-contrastive learning: Toward good practices in self-supervised video representation learning. arXiv **2020**, arXiv:2010.15464.
20. Gidaris, S.; Singh, P.; Komodakis, N. Unsupervised representation learning by predicting image rotations. arXiv **2018**, arXiv:1803.07728.
21. Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2536–2544.
22. Larsson, G.; Maire, M.; Shakhnarovich, G. Learning representations for automatic colorization. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 577–593.
23. Zhang, R.; Isola, P.; Efros, A.A. Colorful image colorization. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 649–666.
24. Larsson, G.; Maire, M.; Shakhnarovich, G. Colorization as a proxy task for visual understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6874–6883.
25. Doersch, C.; Gupta, A.; Efros, A.A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1422–1430.
26. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, Virtual Event, 13–18 July 2020; pp. 1597–1607.
27. Tian, Y.; Krishnan, D.; Isola, P. Contrastive multiview coding. In Proceedings of Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 776–794.
28. Wang, X.; Qi, G.-J. Contrastive learning with stronger augmentations. arXiv **2021**, arXiv:2104.07713.
29. Chen, T.; Kornblith, S.; Swersky, K.; Norouzi, M.; Hinton, G. Big self-supervised models are strong semi-supervised learners. arXiv **2020**, arXiv:2006.10029.
30. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738.
31. Thung, K.-H.; Wee, C.-Y. A brief review on multi-task learning. Multimed. Tools Appl. **2018**, 77, 29705–29725.
32. Wu, Z.; Xiong, Y.; Yu, S.X.; Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3733–3742.
33. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. **2012**, 25, 1097–1105.
34. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
35. Albelwi, S.A. An intrusion detection system for identifying simultaneous attacks using multi-task learning and deep learning. In Proceedings of the 2022 2nd International Conference on Computing and Information Technology (ICCIT), Tabuk, Saudi Arabia, 25–27 January 2022; pp. 349–353.
36. Yang, X.; He, X.; Liang, Y.; Yang, Y.; Zhang, S.; Xie, P. Transfer learning or self-supervised learning? A tale of two pretraining paradigms. arXiv **2020**, arXiv:2007.04234.
37. Zhang, R.; Isola, P.; Efros, A.A. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1058–1067.
38. Dumoulin, V.; Belghazi, I.; Poole, B.; Mastropietro, O.; Lamb, A.; Arjovsky, M.; Courville, A. Adversarially learned inference. arXiv **2016**, arXiv:1606.00704.
39. Zhang, L.; Qi, G.-J.; Wang, L.; Luo, J. AET vs. AED: Unsupervised representation learning by auto-encoding transformations rather than data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2547–2555.
40. Chen, L.; Bentley, P.; Mori, K.; Misawa, K.; Fujiwara, M.; Rueckert, D. Self-supervised learning for medical image analysis using image context restoration. Med. Image Anal. **2019**, 58, 101539.
41. Shurrab, S.; Duwairi, R. Self-supervised learning methods and applications in medical imaging analysis: A survey. arXiv **2021**, arXiv:2109.08685.
42. Holmberg, O.G.; Köhler, N.D.; Martins, T.; Siedlecki, J.; Herold, T.; Keidel, L.; Asani, B.; Schiefelbein, J.; Priglinger, S.; Kortuem, K.U. Self-supervised retinal thickness prediction enables deep learning from unlabelled data to boost classification of diabetic retinopathy. Nat. Mach. Intell. **2020**, 2, 719–726.
43. LeCun, Y.; Boser, B.; Denker, J.; Henderson, D.; Howard, R.; Hubbard, W.; Jackel, L. Handwritten digit recognition with a back-propagation network. Adv. Neural Inf. Process. Syst. **1989**, 2.
44. Yang, C.; An, Z.; Cai, L.; Xu, Y. Mutual contrastive learning for visual representation learning. arXiv **2021**, arXiv:2104.12565.
45. Kalantidis, Y.; Sariyildiz, M.B.; Pion, N.; Weinzaepfel, P.; Larlus, D. Hard negative mixing for contrastive learning. arXiv **2020**, arXiv:2010.01028.
46. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. arXiv **2020**, arXiv:2004.11362.
47. Tian, Y.; Chen, X.; Ganguli, S. Understanding self-supervised learning dynamics without contrastive pairs. arXiv **2021**, arXiv:2102.06810.
48. Jaiswal, A.; Babu, A.R.; Zadeh, M.Z.; Banerjee, D.; Makedon, F. A survey on contrastive self-supervised learning. Technologies **2020**, 9, 2.
49. Ohri, K.; Kumar, M. Review on self-supervised image recognition using deep neural networks. Knowl.-Based Syst. **2021**, 224, 107090.
50. Noroozi, M.; Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 69–84.
51. Huynh, T.; Kornblith, S.; Walter, M.R.; Maire, M.; Khademi, M. Boosting contrastive self-supervised learning with false negative cancellation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2022; pp. 2785–2795.
52. Balestriero, R.; Misra, I.; LeCun, Y. A data-augmentation is worth a thousand samples: Exact quantification from analytical augmented sample moments. arXiv **2022**, arXiv:2202.08325.
53. Lee, H.; Hwang, S.J.; Shin, J. Rethinking data augmentation: Self-supervision and self-distillation. arXiv **2019**, arXiv:1910.05872.
54. Tomasev, N.; Bica, I.; McWilliams, B.; Buesing, L.; Pascanu, R.; Blundell, C.; Mitrovic, J. Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? arXiv **2022**, arXiv:2201.05119.
55. Liu, H.; Jia, J.; Qu, W.; Gong, N.Z. EncoderMI: Membership inference against pre-trained encoders in contrastive learning. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, Republic of Korea, 15–19 November 2021; pp. 2081–2095.
56. Appalaraju, S.; Zhu, Y.; Xie, Y.; Fehérvári, I. Towards good practices in self-supervised representation learning. arXiv **2020**, arXiv:2012.00868.
57. Bachman, P.; Hjelm, R.D.; Buchwalter, W. Learning representations by maximizing mutual information across views. Adv. Neural Inf. Process. Syst. **2019**, 32.
58. Gutmann, M.; Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; pp. 297–304.
59. Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv **2018**, arXiv:1807.03748.
60. Sohn, K. Improved deep metric learning with multi-class n-pair loss objective. Adv. Neural Inf. Process. Syst. **2016**, 29.
61. Wang, F.; Liu, H. Understanding the behaviour of contrastive loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2495–2504.
62. Wu, C.; Wu, F.; Huang, Y. Rethinking InfoNCE: How many negative samples do you need? arXiv **2021**, arXiv:2105.13003.
63. Noroozi, M.; Pirsiavash, H.; Favaro, P. Representation learning by learning to count. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5898–5906.
64. Frankle, J.; Schwab, D.J.; Morcos, A.S. Are all negatives created equal in contrastive instance discrimination? arXiv **2020**, arXiv:2010.06682.
65. Zheng, M.; Wang, F.; You, S.; Qian, C.; Zhang, C.; Wang, X.; Xu, C. Weakly supervised contrastive learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10042–10051.
66. Misra, I.; van der Maaten, L. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6707–6717.
67. Chen, X.; He, K. Exploring simple Siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15750–15758.
68. Asano, Y.M.; Rupprecht, C.; Vedaldi, A. Self-labelling via simultaneous clustering and representation learning. arXiv **2019**, arXiv:1911.05371.
69. Li, J.; Zhou, P.; Xiong, C.; Hoi, S.C. Prototypical contrastive learning of unsupervised representations. arXiv **2020**, arXiv:2005.04966.
70. Goyal, P.; Caron, M.; Lefaudeux, B.; Xu, M.; Wang, P.; Pai, V.; Singh, M.; Liptchinsky, V.; Misra, I.; Joulin, A. Self-supervised pretraining of visual features in the wild. arXiv **2021**, arXiv:2103.01988.
71. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis. **2010**, 88, 303–338.
72. Zhou, B.; Lapedriza, A.; Xiao, J.; Torralba, A.; Oliva, A. Learning deep features for scene recognition using places database. Adv. Neural Inf. Process. Syst. **2014**, 27.
73. Everingham, M.; Eslami, S.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The PASCAL visual object classes challenge: A retrospective. Int. J. Comput. Vis. **2015**, 111, 98–136.
74. Donahue, J.; Simonyan, K. Large scale adversarial representation learning. Adv. Neural Inf. Process. Syst. **2019**, 32.
75. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. **2015**, 28.
76. Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. arXiv **2020**, arXiv:2006.09882.
77. Zbontar, J.; Jing, L.; Misra, I.; LeCun, Y.; Deny, S. Barlow twins: Self-supervised learning via redundancy reduction. In Proceedings of the International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 12310–12320.
78. Choi, H.M.; Kang, H.; Oh, D. Unsupervised representation transfer for small networks: I believe I can distill on-the-fly. Adv. Neural Inf. Process. Syst. **2021**, 34.
79. Keshav, V.; Delattre, F. Self-supervised visual feature learning with curriculum. arXiv **2020**, arXiv:2001.05634.
80. Jing, L.; Vincent, P.; LeCun, Y.; Tian, Y. Understanding dimensional collapse in contrastive self-supervised learning. arXiv **2021**, arXiv:2110.09348.
81. Hua, T.; Wang, W.; Xue, Z.; Ren, S.; Wang, Y.; Zhao, H. On feature decorrelation in self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 9598–9608.

**Figure 1.** The workflows of SSL and TL. The two workflows are similar, with only slight differences. The key difference is that TL pre-trains on labeled data, whereas SSL utilizes unlabeled data to learn features, as shown in the first step. In the second step, SSL and TL are the same: both techniques are further trained on the target task, requiring only a small number of labeled examples.

**Figure 2.** Several examples of pretext tasks. Pretext tasks generate pseudo-labels directly from the data (images) themselves. Solving pretext tasks allows the model to extract useful latent representations that later improve performance on downstream tasks.

**Figure 3.** Different methods of data augmentation. It is common practice to combine multiple types of data augmentation (e.g., cropping, resizing, and flipping) for higher-quality learning and better latent features.

**Figure 4.** Different models of context prediction. (**a**) A pair of patches is extracted randomly from each image, and a CNN is trained to identify the position of a neighboring patch relative to the initial patch; the weights of the two CNNs are shared. (**b**) Learning representations by solving jigsaw puzzles with 3 × 3 patches: the original image is cut into tiles, the puzzle is created by shuffling the tiles with a pre-defined permutation, and the shuffled patches are fed into a CNN trained to recognize the permutation. (**c**) An illustration of SSL using the rotation of an input image; the model learns to predict the correct rotation from four possible angles (0, 90, 180, or 270 degrees). (**d**) Object counting as a pretext task for learning feature representations, training a CNN architecture to count.
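For the rotation task in (**c**), the pseudo-labels come for free: each image is rotated by the four angles, and the rotation index serves as the class label for a 4-way classification problem. A minimal NumPy sketch of generating these views (illustrative only; the function name `rotation_pretext_batch` is ours, not from any cited work):

```python
import numpy as np

def rotation_pretext_batch(image: np.ndarray):
    """Create the four rotated views of an image and their pseudo-labels.

    Label k corresponds to a counterclockwise rotation of k * 90 degrees;
    the pretext task is a 4-way classification over these labels.
    """
    views = [np.rot90(image, k) for k in range(4)]  # 0, 90, 180, 270 degrees
    labels = np.arange(4)
    return np.stack(views), labels

# Usage: an 8x8 single-channel "image"
img = np.arange(64).reshape(8, 8)
views, labels = rotation_pretext_batch(img)
```

Training then reduces to minimizing ordinary cross-entropy over `(views, labels)` pairs, with no human annotation involved.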

**Figure 5.** Colorization as a pretext task for representation learning. The CNN is trained to produce a realistically colored image from a grayscale input.

**Figure 6.** The architecture of a BiGAN. In this technique, $z$ and $E(x)$ share the same dimensions, as do $G(z)$ and $x$. The concatenated pairs $\left[G(z), z\right]$ and $\left[x, E(x)\right]$ are the two inputs of the discriminator $D$. Both the generator $G$ and the encoder $E$ are optimized using the loss produced by the discriminator $D$.

**Figure 8.** A split-brain autoencoder architecture. The model comprises two sub-networks, $F_1$ and $F_2$, each trained to predict one portion of the data from the portion seen by the other; combining both predictions reconstructs the full image.

**Figure 10.** The structure of SimCLR. Data augmentation $T(.)$ is applied to the input image $x$ to generate two augmented images ${x}_{1}^{+}$ and ${x}_{2}^{+}$. A base encoder network $f(.)$ and a projection head $g(.)$ are trained to maximize the similarity between the augmented images using a contrastive loss. After training is complete, the representation $h$ is used for downstream tasks.
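The contrastive objective SimCLR optimizes is the NT-Xent (normalized temperature-scaled cross-entropy) loss over the $2N$ augmented views in a batch: each view's positive is the other augmentation of the same image, and all remaining views act as negatives. A compact NumPy sketch of this loss (illustrative, not the reference implementation; the name `nt_xent_loss` is ours):

```python
import numpy as np

def nt_xent_loss(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.5):
    """NT-Xent loss as used in SimCLR.

    z1, z2: (N, d) projection-head outputs for the two augmented views.
    Returns the mean loss over all 2N anchors.
    """
    z = np.concatenate([z1, z2])                         # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # cosine similarity
    sim = z @ z.T / temperature                          # (2N, 2N)
    np.fill_diagonal(sim, -np.inf)                       # exclude self-pairs
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # positive index
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

When the two views of each image map to identical embeddings, the positive similarities are maximal and the loss approaches its minimum; misaligned views are penalized.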

**Figure 12.** A BYOL architecture. BYOL minimizes the similarity loss between $q_{\theta}(z_{\theta})$ and $\mathrm{sg}(z'_{\xi})$, where $\theta$ represents the trained weights, $\xi$ is an exponential moving average of $\theta$, and $\mathrm{sg}$ denotes the stop-gradient operator. After training, everything but $f_{\theta}$ is discarded; $y_{\theta}$ represents the image representation.
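The target weights $\xi$ in BYOL are not trained by gradient descent; they track the online weights $\theta$ through an exponential moving average, $\xi \leftarrow \tau \xi + (1 - \tau)\theta$. A one-line sketch of that update, assuming parameters stored as dicts of values (illustrative only; `ema_update` is our name):

```python
def ema_update(theta: dict, xi: dict, tau: float = 0.996) -> dict:
    """BYOL target-network update: xi <- tau * xi + (1 - tau) * theta.

    theta: online-network parameters; xi: target-network parameters.
    tau close to 1 makes the target a slowly moving average of the
    online network, which stabilizes the bootstrap targets.
    """
    return {k: tau * xi[k] + (1.0 - tau) * theta[k] for k in xi}

# Usage: after each optimizer step on theta
theta = {"w": 1.0}
xi = {"w": 0.0}
xi = ema_update(theta, xi, tau=0.9)
```

In practice $\tau$ is often annealed toward 1 over training; the slow target is what lets BYOL avoid collapse without any negative pairs.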

**Figure 13.** The architecture of SimSiam. Two augmented images are passed through the same encoder, which comprises a backbone (ResNet) and a projection MLP. A prediction MLP $h$ is applied on one side, and a stop-gradient strategy is employed on the other to avoid collapse. The model aims to maximize the similarity between both views. SimSiam uses neither negative pairs nor a momentum encoder.
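The SimSiam objective is the symmetrized negative cosine similarity between the predictor output of one view and the stop-gradient projection of the other. A NumPy sketch of the loss (the stop-gradient is implicit here because plain arrays carry no gradients; `simsiam_loss` is our name, not the paper's API):

```python
import numpy as np

def simsiam_loss(p1: np.ndarray, z2: np.ndarray,
                 p2: np.ndarray, z1: np.ndarray) -> float:
    """Symmetric SimSiam objective.

    p1, p2: predictor outputs for views 1 and 2; z1, z2: projections,
    treated as constants (the stop-gradient in the real training loop).
    Returns a value in [-1, 1]; -1 means the views are perfectly aligned.
    """
    def neg_cos(p, z):
        p = p / np.linalg.norm(p, axis=1, keepdims=True)
        z = z / np.linalg.norm(z, axis=1, keepdims=True)
        return -(p * z).sum(axis=1).mean()

    return 0.5 * neg_cos(p1, z2) + 0.5 * neg_cos(p2, z1)
```

The stop-gradient on $z$ is the crucial ingredient: without it, minimizing this loss drives both branches to a constant (collapsed) output.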

**Figure 14.** The structure of SwAV. It assigns a code (cluster assignment) to one augmented view of an image and then predicts that code from a second augmentation of the same image. Unlike contrastive learning, SwAV enforces consistency between cluster assignments rather than comparing features directly.

**Figure 15.** The accuracy of ImageNet Top-1 linear classifiers trained on feature representations created via self-supervised techniques with different widths of ResNet-50. All were pretrained on ImageNet.

**Table 1.** Image classification with linear models on the ImageNet, VOC07, and Place205 test sets. All models are pre-trained on ImageNet without labels, using different SSL methods.

| Method | Architecture | # of Params (Million) | ImageNet | VOC07 | Place205 |
|---|---|---|---|---|---|
| Supervised | ResNet-50 | 24 | 76.5 | - | - |
| Colorization [23] | ResNet-50 | 24 | 39.6 | - | - |
| Jigsaw [50] | ResNet-50 | 24 | 45.7 | 64.5 | 41.2 |
| Rotation [20] | ResNet-50 | 24 | 48.9 | 63.9 | 41.4 |
| NPID [32] | ResNet-50 | 24 | 54.0 | - | - |
| BigBiGAN [74] | ResNet-50 | 24 | 56.6 | - | - |
| MoCo [30] | ResNet-50 | 24 | 60.6 | 79.2 | 48.9 |
| PCL [69] | ResNet-50 | 24 | 61.5 | 82.3 | 49.2 |
| PIRL [66] | ResNet-50 | 24 | 63.6 | 81.1 | 49.8 |
| CPC v2 [59] | ResNet-50 | 24 | 63.8 | - | - |
| SimCLR [26] | ResNet-50 | 24 | 69.3 | - | - |
| MoCo v2 [18] | ResNet-50 | 24 | 71.1 | - | - |
| SwAV [76] | ResNet-50 | 24 | 75.3 | 88.9 | 56.7 |
| **Different Architectures and Setups** | | | | | |
| Supervised | ResNet-50 | 25.6 | 75.9 | 87.5 | 51.5 |
| Colorization [23] | ResNet-50 | 25.6 | 39.6 | 55.6 | 37.5 |
| Rotation [20] | ResNet-50 | 25.6 | 48.9 | 63.9 | 41.4 |
| DeepCluster [17] | VGG16 | 15 | 48.9 | 63.9 | 41.4 |
| NPID [32] | ResNet-50 | 25.6 | 54.0 | - | 45.5 |
| PCL v2 [69] | ResNet-50-MLP | 28 | 67.6 | 85.4 | 50.3 |
| BYOL [15] | ResNet-50-MLP | 35 | 74.3 | - | - |
| DeepCluster [17] | AlexNet | 61 | 54.0 | - | 37.5 |
| AMDIM [57] | Custom-RN | 670 | 68.1 | - | 55.1 |

**Table 2.** Transfer performance of SSL methods on VOC07 classification, VOC07 object detection, and VOC12 semantic segmentation.

| Method | VOC07 Classification | VOC07 Detection | VOC12 Segmentation |
|---|---|---|---|
| Supervised [33] | 79.9 | 56.8 | 48.0 |
| Context [25] | 55.3 | 46.0 | - |
| Jigsaw [50] | 67.6 | 53.2 | 37.6 |
| ContextEncoder [21] | 56.5 | 44.5 | 29.7 |
| BiGAN [14] | 58.6 | 46.2 | 34.9 |
| Colorization [22] | 65.9 | 46.9 | 35.6 |
| Split-Brain [37] | 67.1 | 46.7 | 36.0 |
| ColorProxy [24] | 65.9 | - | 38.0 |
| Counting [63] | 67.7 | 51.4 | 36.6 |
| PIRL [66] | 81.1 | 80.7 | - |
| Barlow Twins [77] | 86.2 | - | - |
| MoCo [30] | - | 81.4 | - |
| SwAV [76] | 88.9 | 82.6 | - |


© 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Albelwi, S.
Survey on Self-Supervised Learning: Auxiliary Pretext Tasks and Contrastive Learning Methods in Imaging. *Entropy* **2022**, *24*, 551.
https://doi.org/10.3390/e24040551
