Article

Robustness of Contrastive Learning on Multilingual Font Style Classification Using Various Contrastive Loss Functions

by Irfanullah Memon 1,2, Ammar ul Hassan Muhammad 1 and Jaeyoung Choi 1,*

1 School of Computer Science and Engineering, Soongsil University, Seoul 06978, Republic of Korea
2 Department of Software Engineering, Shaheed Zulfiqar Ali Bhutto Campus, Mehran University of Engineering & Technology, Khairpur Mir’s 66020, Sindh, Pakistan
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(6), 3635; https://doi.org/10.3390/app13063635
Submission received: 17 January 2023 / Revised: 26 February 2023 / Accepted: 7 March 2023 / Published: 13 March 2023
(This article belongs to the Special Issue Advances in Intelligent Information Systems and AI Applications)

Abstract

Font is a crucial design aspect; however, classifying fonts is more challenging than classifying other natural objects because font images differ fundamentally from natural images. This paper presents the application of contrastive learning to font style classification. We conducted various experiments to demonstrate the robustness of contrastive image representation learning. First, we built a multilingual synthetic dataset for Chinese, English, and Korean fonts. Next, we trained the model using various contrastive loss functions, i.e., normalized temperature-scaled cross-entropy loss, triplet loss, and supervised contrastive loss. We made explicit changes to the usual approach of applying contrastive learning in the domain of font style classification by not applying any image augmentation. We compared the results with those of a fully supervised approach and achieved comparable results using contrastive learning with fewer annotated images and fewer training epochs. In addition, we evaluated the effect of applying different contrastive loss functions on training.

1. Introduction

The increasing use of computers and digital typography has resulted in a large number of available typefaces, providing users with many options for selecting typefaces for various purposes. With the increasing amount of digital text, there are several situations in which it is helpful to recognize a font or to identify the resemblance between two fonts in a collection. In these cases, it is essential to be able to distinguish between fonts that are similar to one another. For instance, a user who has an image containing text may want to determine the typeface used in that image. A user may also wish to discover a font comparable to the one used in the image, because the original font may be prohibitively expensive or unavailable for a specific application. However, classifying fonts or identifying similar fonts is difficult for several reasons, such as the open-ended nature of font classes and the subtle, characteristic-dependent differences among fonts, such as different weights, strokes, and slopes. The sheer number of online font repositories makes the task even more challenging.
Computer vision models have traditionally been trained through supervised learning. This means that people examined the images and labeled them with various labels so that the model could learn the patterns of these labels [1,2,3,4,5,6]. For instance, a human annotator adds a class label to an image or draws bounding boxes around the objects in the images. However, as everyone who has dealt with labeling jobs knows, creating a suitable training dataset requires considerable effort as the annotation cost per image is very high. Additionally, existing generative adversarial network (GAN)-based methods applied in the font domain focus on font generation rather than classification [7,8,9]. Although these methods use classifiers to learn different styles, they primarily focus on font generation rather than style classification.
Self-supervised methods, in contrast, do not require any labels created by humans. As the name suggests, the model learns to supervise itself. In computer vision, the most typical technique to create this self-supervision is to take various cropped images or apply different augmentations to images before feeding the modified inputs through the model. Chen et al. [10] presented a contrastive learning framework based on metric learning [11,12,13]; the model is trained to recognize that these modified views still contain the same visual information, i.e., the same item, even though they appear different. As a result, the model learns a comparable latent representation (an output vector) for identical items in a label-free environment. Later, Khosla et al. [14] extended Chen's work [10] to a supervised setting and used class labels to draw more positive images instead of using only one positive image. In self-supervised contrastive learning, for every anchor image, there is only one positive image (i.e., an augmented image) in the entire batch. However, as demonstrated in [10,14], this strategy cannot be directly applied to the font domain by simply applying augmentations [15,16] to the training images. As shown in Figure 1, flipping a font image horizontally or vertically changes the image content, i.e., flipping 'R' horizontally results in an undefined character. For this reason, the application of contrastive learning in the font domain in general, and in font style classification in particular, has remained largely underexplored.
On unlabeled data, contrastive learning learns generic representations of images, which may subsequently be fine-tuned with a limited number of labeled images to achieve acceptable performance on a specified classification problem, in this case, font style classification. The generic representations are learned by maximizing the agreement between multiple transformed views of the same image while simultaneously reducing the agreement between transformed views of different images. When the parameters of a neural network are updated using this contrastive objective, representations of corresponding views "attract" each other, while representations of noncorresponding views "repel" each other. The model therefore learns the similarities between two inputs instead of learning how to classify the input images. Accordingly, we used the Normalized Temperature-scaled Cross-Entropy (NT-Xent) loss [17], triplet loss [18], and supervised contrastive loss [14].
Another important problem in font style classification is the ability to learn new styles over time. The traditional approach of fine-tuning [19,20,21] a fully supervised classifier has limited generalization to new classes, resulting in the need for separate classifiers for old and new classes, which can be ineffective when only a few training samples [22,23] of the new classes are available. Contrastive learning, however, allows continued learning of new representations and features without sacrificing the model's ability to classify previously learned classes. By fine-tuning [24,25,26] the model pretrained on the pretext task using few-shot images [27], the model can learn new representations that encode useful features of the new input data, thereby improving its ability to classify new classes in downstream tasks. This approach leads to more efficient and effective learning, as well as more robust and adaptive models that can continue to improve over time.
In this paper, we present a study on the use of contrastive-based learning techniques to learn font style characteristics in feature spaces for Chinese, English, and Korean characters. We experimented with different contrastive loss functions and compared them with a baseline label-based fully supervised classifier. We also proposed a triplet dataset sampling technique to select hard-negative and hard-positive images for font images. We achieved high accuracy in downstream tasks and demonstrated that the proposed approach is effective for font style recognition, regardless of the base encoder used.

2. Related Works

The success of Convolutional Neural Networks (CNNs) stems from their ability to learn end-to-end from labeled data. Deep CNNs have been successfully applied to many problems in Document Image Analysis [28,29], including complete image classification [30,31], image preprocessing [32], character recognition [33], and classifying fonts into predefined font classes using models trained on labeled data [34].

2.1. Convolutional Neural Networks for Font Classification

Chris Tensmeyer et al. [34] presented a CNN-based framework to classify text lines into fonts. They used Optical Character Recognition (OCR) to recognize the text in documents; because handling multiple text fonts is difficult for OCR, they used a specialist OCR to recognize text based on labeled text lines. They trained a CNN to classify small patches into font classes; at inference time, they densely extracted patches and averaged the individual predictions of each patch to obtain the predicted font. Although the method achieved a prediction accuracy of 98.6% on the King Fahd University Arabic Font Database (KAFD), which covers 40 typefaces in four styles and ten sizes, it is limited to predicting and classifying the Arabic fonts on which the model was trained, and it requires labeled data. On the Classification of Latin Medieval Manuscripts (CLaMM) dataset, the performance of this model is significantly lower, with an accuracy of 86.6%. This study also demonstrates that CNNs are capable of correctly classifying fonts: by varying the characteristics of the images fed into the CNNs, the authors showed which features were used for classification.
In contrast, we trained our model in a self-supervised setting. The model was trained in two stages. During the first stage, referred to as pretext training, the model learns the underlying mapping in a self-supervised manner based on contrastive learning. During the second stage, termed the downstream task, the model is fine-tuned using a small number of labels to train only the classifier, while the base network remains frozen.

2.2. Contrastive Learning

Our study is based on the existing literature [10,14,17,18]. Chen et al. [10] proposed a framework for self-supervised representation learning that leverages data augmentation to train the model to learn similarities among images using the N-pair or NT-Xent loss. Building on the work of Chen et al. [10], Khosla et al. [14] proposed a loss function for supervised learning that incorporates label information. The loss function is designed to pull normalized embeddings from the same class close together while keeping embeddings from different classes farther apart. The key difference in this loss function is the use of many positives and many negatives per anchor, rather than just a single positive as in self-supervised contrastive learning; class labels are used to draw positive samples of the same class as the anchor. However, these studies cannot be directly extended to the font domain by simply applying augmentations to obtain positive views of the same image or by using labels to train the classifier via cross-entropy loss. Inspired by triplet loss, another contrastive-based loss that uses metric learning, we used labels to guide the choice of the positive image instead of simply applying augmentation to the font image.

3. Dataset

We used printed fonts to create the training dataset. We downloaded Chinese fonts from freechinesefont.com, which hosts 187 Chinese fonts, and filtered 65 fonts containing all 1000 of the desired commonly used Chinese characters. For English, we used Google Fonts, which offers 1455 font families, and filtered 1000 fonts supporting all 52 upper- and lower-case English characters. For Korean, we downloaded font files from Naver and used 70 Korean fonts together with 500 commonly used Korean characters. We used Python code to generate training images of 224 × 224 pixels, yielding 65 × 1000, 1000 × 52, and 70 × 500 training images for the Chinese, English, and Korean characters, respectively.
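The generation script is not included in the paper, but a minimal sketch of how such glyph images can be rendered with Pillow is shown below; the font paths, character list, output naming, and centering logic are illustrative assumptions, and a reasonably recent Pillow (with ImageDraw.textbbox) is assumed.

```python
# Minimal sketch: render 224 x 224 grayscale glyph images from a .ttf file with Pillow.
# Font path, character list, output naming, and centering are illustrative assumptions.
import os
from PIL import Image, ImageDraw, ImageFont

def render_glyphs(font_path, characters, out_dir, size=224):
    os.makedirs(out_dir, exist_ok=True)
    font = ImageFont.truetype(font_path, size=int(size * 0.8))
    for idx, ch in enumerate(characters):
        img = Image.new("L", (size, size), color=255)           # white canvas
        draw = ImageDraw.Draw(img)
        left, top, right, bottom = draw.textbbox((0, 0), ch, font=font)
        x = (size - (right - left)) / 2 - left                   # center the glyph
        y = (size - (bottom - top)) / 2 - top
        draw.text((x, y), ch, fill=0, font=font)
        img.save(os.path.join(out_dir, f"{idx:04d}.png"))

# Example: the 52 upper- and lower-case English characters for one hypothetical font file
english_chars = [chr(c) for c in range(ord("A"), ord("Z") + 1)] + \
                [chr(c) for c in range(ord("a"), ord("z") + 1)]
# render_glyphs("fonts/SomeFont.ttf", english_chars, "data/english/SomeFont")
```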

3.1. Triplet Dataset Sampling

The triplet loss function was initially presented in [18] and has since grown in popularity as a contrastive-based learning technique. A triplet-based loss function requires three images: an anchor image, a positive image that is a member of the same class as the anchor, and a negative image from a different class. The loss encourages dissimilar pairs to be at least some margin farther apart than similar pairs. Figure 2 visualizes the triplet loss learning objective. The original authors used the Euclidean distance, but we used cosine similarity in our study. Unlike the contrastive loss setting, where we can randomly select an image sample and apply transformations to obtain two augmented views, font images are typically grayscale and provide limited information, such as strokes, thickness, and serifs, which differentiate one font style from another. Applying augmentation in this scenario changes the content of the image, as illustrated in Figure 1. For this reason, we chose the hard positive and hard negative of an anchor such that the distance between the anchor and the negative is smaller than the distance between the anchor and the positive. This process is known as triplet mining.
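For clarity, the following is a minimal sketch of a triplet loss expressed in terms of cosine similarity rather than Euclidean distance, as described above; the margin value and function name are assumptions rather than the paper's exact settings.

```python
# Sketch of a triplet loss using cosine similarity (higher = more similar).
# The margin value is an illustrative assumption.
import torch.nn.functional as F

def cosine_triplet_loss(anchor, positive, negative, margin=0.2):
    """anchor, positive, negative: (batch, dim) embedding tensors."""
    sim_ap = F.cosine_similarity(anchor, positive, dim=1)
    sim_an = F.cosine_similarity(anchor, negative, dim=1)
    # Penalize triplets where the negative is not at least `margin`
    # less similar to the anchor than the positive is.
    return F.relu(sim_an - sim_ap + margin).mean()
```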
In our test case, as shown in Figure 3, suppose we select a random anchor image x_i^a that belongs to style class m and has character class i; we then select a positive image x_j^p from the same style class m but with a different character content j. In the case of the negative image x_i^n, however, the image should belong to another style class but have the same character content i as the anchor, because font images are grayscale and provide very little information within the training images, such as strokes, thickness, and serifs, which separate one font style from another. A crucial aspect of triplet mining is selecting the hard-negative image of the anchor, i.e., the image whose distance to the anchor is the smallest among the candidate negatives. Therefore, we chose images with identical character labels from different style classes as the negative images. We followed the same triplet selection strategy for the Chinese and English data loaders.
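A minimal PyTorch sketch of this sampling scheme is given below; the in-memory dictionary keyed by (style, character) indices and the class name are illustrative assumptions rather than the actual data loader used in the paper.

```python
# Sketch of the triplet sampling described above: the positive shares the anchor's
# style but shows a different character; the hard negative shows the same character
# in a different style. The dataset layout is an illustrative assumption.
import random
from torch.utils.data import Dataset

class FontTripletDataset(Dataset):
    def __init__(self, images, transform=None):
        # images[(style_id, char_id)] -> grayscale glyph image
        self.images = images
        self.keys = list(images.keys())
        self.styles = sorted({s for s, _ in self.keys})
        self.chars = sorted({c for _, c in self.keys})
        self.transform = transform

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, index):
        style, char = self.keys[index]
        pos_char = random.choice([c for c in self.chars if c != char])     # same style, different character
        neg_style = random.choice([s for s in self.styles if s != style])  # same character, different style
        a = self.images[(style, char)]
        p = self.images[(style, pos_char)]
        n = self.images[(neg_style, char)]
        if self.transform:
            a, p, n = self.transform(a), self.transform(p), self.transform(n)
        return a, p, n
```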

3.2. Self-Supervised and Supervised Contrastive Data Sampling

Contrastive image representation learning aims to capture the visual similarities between images by embedding them in a common feature space. Previous works have achieved this by applying image augmentations to the original image. However, font images differ in nature from natural images, and applying augmentations to font images can change their content. As shown in Figure 1, a flipped or cropped squirrel is still a squirrel, but the same is not true for font images; applying augmentations to them leads to different image content. For this reason, as shown in Figure 4, we did not apply any image augmentations when training the model with the self-supervised and supervised contrastive loss functions. Because fonts are readily accessible and have associated labels, we utilized these labels only during data preprocessing at loading time; similar to triplet mining for positive images, we arbitrarily selected different images from the same class, as discussed in the previous section. A sketch of this pair-sampling strategy is shown below.
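The following is a minimal sketch of such an augmentation-free pair sampler, assuming the same hypothetical (style, character)-indexed image dictionary as above; it is not the paper's exact data loader.

```python
# Sketch of positive-pair sampling without augmentation: the second "view" of an
# anchor is simply another character rendered in the same font style.
import random
from torch.utils.data import Dataset

class FontPairDataset(Dataset):
    def __init__(self, images, transform=None):
        # images[(style_id, char_id)] -> grayscale glyph image
        self.images = images
        self.keys = list(images.keys())
        self.chars = sorted({c for _, c in self.keys})
        self.transform = transform

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, index):
        style, char = self.keys[index]
        other_char = random.choice([c for c in self.chars if c != char])
        view1 = self.images[(style, char)]
        view2 = self.images[(style, other_char)]   # same style, different character
        if self.transform:
            view1, view2 = self.transform(view1), self.transform(view2)
        return view1, view2, style   # the style label is used only by the supervised contrastive loss
```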

4. Training Details

This section discusses the several experiments that we conducted. We demonstrate the efficacy of using contrastive learning to learn font style characteristics in the feature space, in contrast to training a model that classifies based on labels. In addition, we demonstrate the impact of several contrastive loss functions, such as NT-Xent, triplet loss, and supervised contrastive loss.

4.1. Contrastive Learning Framework

The basic idea behind contrastive learning is to learn a representation space in which examples of the same class or category are brought closer together, while those belonging to different classes are pushed farther apart. We achieved this by training the model on a pretext task that compares the representations of different views of the same input, i.e., two character images belonging to the same font style but with different character contents. In doing so, the model learns to identify the similarities and differences between the different views, which in turn allows it to learn meaningful representations of the data.
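As an illustration of this pretext objective, the following is a minimal sketch of the NT-Xent loss for a batch of paired views; the temperature value is an assumption rather than the paper's setting, and the supervised contrastive loss [14] generalizes this formulation to multiple positives per anchor.

```python
# Minimal NT-Xent (normalized temperature-scaled cross-entropy) sketch for paired views.
# The temperature value is an illustrative assumption.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two views of the same font styles."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2B, dim)
    sim = z @ z.t() / temperature                             # pairwise similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                # remove self-similarity
    # For sample i, its positive is the other view at index i + n (or i - n).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```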

4.1.1. Pretext Training

As defined in [6], contrastive learning is a two-step training framework consisting of a pretext task and a downstream task. We primarily used ResNet-18 [3] as our base encoder, with an output dimension of 100 for the embedding vector in pretext representation learning. We trained the base encoder with the Adam optimizer [25] using decay, a learning rate of 0.004, and a standard batch size of 256. We also used VGG-19 [2] as a base encoder for pretext training to verify whether the achieved results were due to contrastive learning rather than to a particular backbone; the results were consistent. This indicates that it is the ability of contrastive-based learning to learn better representations that yields such high accuracy, irrespective of the base encoder used. Another reason for using ResNet-18 or VGG-19 as our base encoders is that we used them as the baselines for the fully supervised classifier. It is worth noting, however, that the choice of base encoder depends on the specific task at hand, and a different architecture may perform better on another task. Moreover, ResNet-18 is less likely to overfit the training data and more likely to perform well on new data; it was therefore used as the base encoder in our experiments to exploit its generalization capabilities and to ensure a fair comparison between the different contrastive loss functions and the baseline classifiers.
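A minimal sketch of this pretext setup is shown below, assuming grayscale 224 × 224 inputs, a two-layer projection head, GPU training, and the nt_xent_loss sketch above; the projection-head width and single-channel conv stem are assumptions, not the paper's exact configuration.

```python
# Sketch of the pretext stage: a ResNet-18 encoder with a projection head producing
# 100-dimensional embeddings, optimized with Adam (lr = 0.004, batch size 256).
# Projection-head width, 1-channel input stem, and GPU usage are assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FontEncoder(nn.Module):
    def __init__(self, embed_dim=100):
        super().__init__()
        backbone = resnet18()                       # not pretrained
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Identity()                 # keep the 512-d pooled features
        self.backbone = backbone
        self.projection = nn.Sequential(
            nn.Linear(512, 256), nn.ReLU(inplace=True), nn.Linear(256, embed_dim))

    def forward(self, x):
        return self.projection(self.backbone(x))

model = FontEncoder(embed_dim=100).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=0.004)

# One pretext epoch (the loader yields two views per image, see Section 3.2):
# for view1, view2, _ in loader:
#     z1, z2 = model(view1.cuda()), model(view2.cuda())
#     loss = nt_xent_loss(z1, z2)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```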

4.1.2. Downstream Task

We removed the projection head for the downstream tasks and attached a classification layer with 65, 100, and 70 output dimensions for the Chinese, English, and Korean datasets, respectively. The classifier was then trained in three distinct settings with a batch size of 256 and optimized using SGD with momentum [23], with a momentum parameter of 0.9. In the first setting, we trained the classifiers using the fully labeled dataset for 50 epochs for all languages. In the second and third settings, we trained the classifiers using 50% and 10% of the labeled images in the dataset, respectively. It is important to note that only the classifiers were fine-tuned during the downstream task; the base encoder was not trained again.
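A minimal sketch of this downstream stage for the Chinese setting (65 style classes) follows; the learning rate and loader name are assumptions, and `model` refers to the pretext encoder sketched above.

```python
# Sketch of the downstream stage: the projection head is dropped, the encoder is
# frozen, and only a linear classifier is trained with SGD (momentum 0.9).
# The learning rate and loader name are illustrative assumptions.
import torch
import torch.nn as nn

encoder = model.backbone                  # pretrained backbone from the pretext stage
for p in encoder.parameters():
    p.requires_grad = False               # base encoder is not trained again
encoder.eval()

classifier = nn.Linear(512, 65).cuda()    # 65 Chinese font-style classes
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# for images, labels in labeled_loader:   # 100%, 50%, or 10% of the labeled images
#     with torch.no_grad():
#         feats = encoder(images.cuda())
#     loss = criterion(classifier(feats), labels.cuda())
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```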

4.2. Hardware and Software

We trained the deep neural network models on a single machine comprising an Intel Core i9 processor with 16 cores clocked at 5.2 GHz, an NVIDIA RTX 3080 Ti GPU, and 32 GB of RAM. We used PyTorch 1.11 as our deep learning framework with Python 3.6 and CUDA 10.2 on an Ubuntu 20.04 LTS operating system. To preprocess our data, we applied normalization to ensure optimal performance during training. During pretext training, we used the Adam optimizer with a learning rate of 0.004 for 50 epochs and a batch size of 256. The existing literature on contrastive learning [35,36] reports a positive correlation between batch size and the quality of the embeddings generated by the model: larger batch sizes yield more discriminative and robust embeddings, faster convergence, and more efficient training. However, due to limited computing power, we used a batch size of 256 throughout our experiments.

5. Results and Evaluation

The results are presented in Table 1. The classifier trained in the downstream task of the supervised contrastive model showed the best test accuracy, achieving 99.76%, 99.73%, and 99.31% for Korean, English, and Chinese fonts, respectively, when trained with 100% of the labeled images. Notably, the supervised contrastive classifier reached this accuracy in only 50 epochs, significantly fewer than the 500 epochs used for the fully supervised classifier. When trained with only 10% of the labeled images, the classifier achieved test accuracies of 98.1%, 98%, and 97.8% for Korean, Chinese, and English font styles, respectively, in fewer training iterations than a fully supervised classifier trained on the entire dataset. These results demonstrate the effectiveness of contrastive learning for font style classification in scenarios where labeled data are scarce, showing that high accuracy can be achieved with few labeled images.
Figure 5 shows the t-SNE projections of the Chinese, English, and Korean embedding vectors from models trained using the NT-Xent, triplet, and supervised contrastive losses in the pretext stage, illustrating the effect of the different contrastive loss functions on training. For this, we randomly selected ten fonts from each language, i.e., Chinese, English, and Korean, and plotted the embeddings of 10 characters in each font. The model groups all font images belonging to the same font style irrespective of the character content of the images. Moreover, closely inspecting the embedding visualizations for the NT-Xent, triplet, and supervised contrastive loss functions shows that the supervised contrastive projections are the most clearly separated of the three. Supervised contrastive training produces better clusters for image representations because it is trained using labels, which explains why the supervised contrastive loss outperforms the other loss functions. The triplet loss function also produced well-separated projections, whereas the NT-Xent loss function produced less separated projections. Overall, the results suggest that the supervised contrastive loss function is superior to the other loss functions in this font classification task.
To further establish the effectiveness of training our model with a contrastive loss function for learning individual font styles, rather than relying on labels and cross-entropy loss for classification, we conducted an experiment that recommends similar font styles based on input font images. Figure 6 shows the recommended similar images for query images in Chinese, English, and Korean font styles, for images seen and unseen during training. After training the models for all languages, i.e., Chinese, English, and Korean, we removed the projection layer, passed all the images through the encoder to obtain the embeddings for the entire training dataset, and stored them in separate pickle files. At test time, we passed two images per language, one seen and one unseen, and computed the cosine similarity between their embeddings and the stored embedding vectors of the training images. We then selected the three images closest to each query image based on the similarity value, as shown against all query images in Figure 6. The model recommends exactly the same style for seen query images and the closest styles for unseen query images. This indicates that the contrastive-based approach is effective in learning the semantics of individual font styles and can be used to recommend similar font styles to users.
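A minimal sketch of this retrieval step is shown below; the pickle file name, dictionary layout, and helper name are illustrative assumptions.

```python
# Sketch of the recommendation step: training-set embeddings are precomputed and the
# three most cosine-similar ones are returned for a query image. File name and
# embedding-bank layout are illustrative assumptions.
import pickle
import torch
import torch.nn.functional as F

def recommend(query_image, encoder, embedding_file="korean_embeddings.pkl", top_k=3):
    with open(embedding_file, "rb") as f:
        bank = pickle.load(f)                        # {(style_id, char_id): 512-d tensor}
    keys = list(bank.keys())
    embeddings = F.normalize(torch.stack([bank[k] for k in keys]), dim=1)
    with torch.no_grad():
        q = F.normalize(encoder(query_image.unsqueeze(0)), dim=1)
    scores = (embeddings @ q.t()).squeeze(1)         # cosine similarity to the query
    top = torch.topk(scores, k=top_k)
    return [(keys[i], scores[i].item()) for i in top.indices]
```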
The results presented in Table 2 demonstrate the effectiveness of using contrastive learning for few-shot fine-tuning of a pretrained model on new font styles. Compared with the fully supervised method, the contrastive-based approach achieved higher training and test accuracies for all numbers of shots considered. For instance, when only 50 shots were used for fine-tuning, the contrastive-based method achieved a training accuracy of 93.0% and a test accuracy of 82.0%, whereas the fully supervised method achieved a training accuracy of only 89.0% and a test accuracy of only 35.0%. As the number of shots increased, the contrastive-based method continued to outperform the fully supervised method, achieving a test accuracy of 96.3% with 250 shots, compared with 74.4% for the fully supervised method. These results highlight the potential of contrastive learning for improving the robustness and adaptability of font style classification models, particularly when learning new styles with limited data.

6. Discussion

In this paper, we demonstrated the application of contrastive-based learning to font style classification and the effect of various contrastive-based techniques on learning font style characteristics in feature spaces for Chinese, English, and Korean characters. We achieved results comparable to those of a label-based, fully supervised approach to font style classification. Experiments revealed that the pretext training base architecture had no noticeable impact on training efficacy. By combining our findings with metric-based learning on visual representations, we also demonstrated the effectiveness of contrastive learning in a font style recommender system for identifying related or visually closest font styles.
This study opens up possibilities for future work in the font domain, where unsupervised contrastive learning can be used to learn generic representations of fonts and be fine-tuned with a limited number of labeled images to achieve acceptable performance in font-based problems. Our work can also be extended to other scripts, such as Arabic and Devanagari. Overall, our study demonstrates that self-supervised contrastive learning can be a powerful tool for font style recognition, particularly when labeled datasets are limited or expensive to create.

Author Contributions

Conceptualization, I.M.; Methodology, I.M. and A.u.H.M.; Software, I.M.; Validation, A.u.H.M.; Formal Analysis, not applicable; Investigation, I.M.; Resources, J.C.; Data curation, I.M.; Writing—original draft preparation, I.M.; Writing—review and editing, A.u.H.M., I.M. and J.C.; Visualization, I.M.; Supervision, J.C.; Project administration, J.C.; Funding acquisition, J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Institute of Information & communications Technology Planning and Evaluation (IITP) grant funded by the Korea government (MSIP) (No. 2016-0-00166, Technology Development Project for Information, Communication, and Broadcast).

Institutional Review Board Statement

Not Applicable.

Informed Consent Statement

Not Applicable.

Data Availability Statement

Some or all data, trained models, or codes that support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef] [Green Version]
  2. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  3. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  4. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.E.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. CoRR [Internet]. arXiv 2014, arXiv:1409.4842. Available online: http://arxiv.org/abs/1409.4842 (accessed on 1 September 2022).
  5. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. CoRR [Internet]. arXiv 2016, arXiv:1608.06993. Available online: http://arxiv.org/abs/1608.06993 (accessed on 1 September 2022).
  6. Hinton, G.; Srivastava, N.; Swersky, K. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited 2012, 14, 2. [Google Scholar]
  7. Hassan, A.U.; Memon, I.; Choi, J. Real-time high quality font generation with Conditional Font GAN. Expert Syst. Appl. 2022, 213, 118907. [Google Scholar] [CrossRef]
  8. Hassan, A.U.; Ahmed, H.; Choi, J. Unpaired font family synthesis using conditional generative adversarial networks. Knowl. Based Syst. 2021, 229, 107304. [Google Scholar] [CrossRef]
  9. Choi, Y.; Choi, M.; Kim, M.; Ha, J.W.; Kim, S.; Choo, J. StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  10. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 12–18 July 2020; pp. 1597–1607. [Google Scholar]
  11. Hjelm, D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; Bengio, Y. Learning deep representations by mutual information estimation and maximization. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  12. Chopra, S.; Hadsell, R.; LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 539–546. [Google Scholar]
  13. Weinberger, K.Q.; Saul, L.K. Distance Metric Learning for Large Margin Nearest Neighbor Classification. J. Mach. Learn. Res. 2009, 10, 207–244. Available online: http://jmlr.org/papers/v10/weinberger09a.html (accessed on 15 September 2022).
  14. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 18661–18673. [Google Scholar]
  15. Zhang, H.; Cissé, M.; Dauphin, Y.N.; Lopez-Paz, D. Mixup: Beyond Empirical Risk Minimization. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018; 2018. Available online: https://openreview.net/forum?id=r1Ddp1-Rb (accessed on 1 September 2022).
  16. Yun, S.; Han, D.; Chun, S.; Oh, S.; Yoo, Y.; Choe, J. CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE Computer Society. pp. 6022–6031. Available online: https://doi.ieeecomputersociety.org/10.1109/ICCV.2019.00612 (accessed on 12 November 2022).
  17. Sohn, K. Improved Deep Metric Learning with Multi-class N-pair Loss Objective. In Advances in Neural Information Processing Systems; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; Available online: https://proceedings.neurips.cc/paper/2016/file/6b180037abbebea991d8b1232f8a8ca9-Paper.pdf (accessed on 1 November 2022).
  18. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  19. Zhou, Z.; Shin, J.; Zhang, L.; Gurudu, S.; Gotway, M.; Liang, J. Fine-Tuning Convolutional Neural Networks for Biomedical Image Analysis: Actively and Incrementally. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4761–4772. [Google Scholar]
  20. He, T.; Zhang, Z.; Zhang, H.; Zhang, Z.; Xie, J.; Li, M. Bag of Tricks for Image Classification with Convolutional Neural Networks. CoRR [Internet]. arXiv 2018, arXiv:1812.01187. Available online: http://arxiv.org/abs/1812.01187 (accessed on 1 November 2022).
  21. Triantafillou, E.; Zemel, R.S.; Urtasun, R. Few-Shot Learning Through an Information Retrieval Lens. CoRR [Internet]. arXiv 2017, arXiv:1707.02610. Available online: http://arxiv.org/abs/1707.02610 (accessed on 1 September 2022).
  22. Chen, W.Y.; Liu, Y.C.; Kira, Z.; Wang, Y.C.; Huang, J.B. A Closer Look at Few-shot Classification. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  23. Qian, N. On the momentum term in gradient descent learning algorithms. Neural Netw. 1999, 12, 145–151. Available online: https://www.sciencedirect.com/science/article/pii/S0893608098001166 (accessed on 1 September 2022). [CrossRef] [PubMed]
  24. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.H.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. CoRR [Internet]. arXiv 2020, arXiv:2006.07733. Available online: https://arxiv.org/abs/2006.07733 (accessed on 1 November 2022).
  25. Gunel, B.; Du, J.; Conneau, A.; Stoyanov, V. Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning. CoRR [Internet]. arXiv 2020, arXiv:2011.01403. Available online: https://arxiv.org/abs/2011.01403 (accessed on 1 November 2022).
  26. Chen, Y.; Liu, Z.; Xu, H.; Darrell, T.; Wang, X. Meta-Baseline: Exploring Simple Meta-Learning for Few-Shot Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 9062–9071. [Google Scholar]
  27. Dhillon, G.S.; Chaudhari, P.; Ravichandran, A.; Soatto, S. A Baseline for Few-Shot Image Classification. CoRR [Internet]. arXiv 2019, arXiv:1909.02729. Available online: http://arxiv.org/abs/1909.02729 (accessed on 15 September 2022).
  28. Afzal, M.Z.; Capobianco, S.; Malik, M.I.; Marinai, S.; Breuel, T.M.; Dengel, A.; Liwicki, M. Deepdocclassifier: Document classification with deep convolutional neural network. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; pp. 1111–1115. [Google Scholar]
  29. Harley, A.W.; Ufkes, A.; Derpanis, K.G. Evaluation of deep convolutional nets for document image classification and retrieval. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; pp. 991–995. [Google Scholar]
  30. Cloppet, F.; Eglin, V.; Helias-Baron, M.; Kieu, C.; Vincent, N.; Stutzmann, D. Icdar2017 competition on the classification of medieval handwritings in latin script. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; pp. 1371–1376. [Google Scholar]
  31. Kang, L.; Kumar, J.; Ye, P.; Li, Y.; Doermann, D. Convolutional neural networks for document image classification. In Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; pp. 3168–3172. [Google Scholar]
  32. Shi, B.; Bai, X.; Yao, C. Script identification in the wild via discriminative convolutional neural network. Pattern Recognit. 2016, 52, 448–458. [Google Scholar] [CrossRef]
  33. Narayan, A.; Muthalagu, R. Image Character Recognition using Convolutional Neural Networks. In Proceedings of the 2021 Seventh International Conference on Bio Signals, Images, and Instrumentation (ICBSII), Chennai, India, 25–27 March 2021; pp. 1–5. [Google Scholar]
  34. Tensmeyer, C.; Saunders, D.; Martinez, T. Convolutional Neural Networks for Font Classification. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; pp. 985–990. [Google Scholar]
  35. Wang, T.; Isola, P. Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. In Proceedings of the International Conference on Machine Learning, Miami, FL, USA, 14–17 December 2020; pp. 9929–9939. [Google Scholar]
  36. Izacard, G.; Caron, M.; Hosseini, L.; Riedel, S.; Bojanowski, P.; Joulin, A.; Grave, E. Towards Unsupervised Dense Information Retrieval with Contrastive Learning. CoRR [Internet]. arXiv 2021. Available online: https://arxiv.org/abs/2112.09118 (accessed on 1 December 2022).
Figure 1. Illustration of the difference between augmenting a real image and a font image.
Figure 2. Triplet loss training objective. The anchor and positive images have distinct characters in the same font style, whereas the negative image has a different font style and the same character as the anchor.
Figure 3. Example of the triplet data loader for (a) Chinese, (b) English, and (c) Korean font styles. x_i^a, x_j^p, and x_i^n represent the anchor, positive, and negative images, respectively.
Figure 4. Example of Self-Supervised and Supervised Contrastive data loaders for (a) Chinese, (b) English, and (c) Korean font styles.
Figure 5. t-SNE projections of the Chinese, English, and Korean embedding vectors from models trained using NT-Xent, triplet, and supervised contrastive losses. Images with the same color represent the same font style.
Figure 6. Recommended similar images for query images in Chinese, English, and Korean font styles, for images seen and unseen during training, based on cosine similarity. The caption of each image shows the similarity score to the query image, the style number, and the character number.
Table 1. Comparison of training and test accuracies of classifiers trained on NT-Xent, triplet, and supervised contrastive approaches with a fully supervised classifier used as a baseline.

| Language | Method | Epochs | Labeled Dataset Size | Training Accuracy (%) | Test Accuracy (%) |
|---|---|---|---|---|---|
| Chinese | Supervised | 500 | 65,000 | 99.90 | 99.38 |
| Chinese | NT-Xent | 50/50 | 65,000 (100%) | 99.23 | 98.67 |
| Chinese | NT-Xent | 50/50 | 32,500 (50%) | 99.00 | 98.58 |
| Chinese | NT-Xent | 50/50 | 6500 (10%) | 98.34 | 98.13 |
| Chinese | Triplet | 50/50 | 65,000 (100%) | 99.78 | 99.11 |
| Chinese | Triplet | 50/50 | 32,500 (50%) | 99.63 | 98.98 |
| Chinese | Triplet | 50/50 | 6500 (10%) | 98.93 | 98.27 |
| Chinese | Supervised Contrastive | 50/50 | 65,000 (100%) | 99.82 | 99.31 |
| Chinese | Supervised Contrastive | 50/50 | 32,500 (50%) | 99.53 | 99.14 |
| Chinese | Supervised Contrastive | 50/50 | 6500 (10%) | 98.45 | 98.00 |
| English | Supervised | 500 | 52,000 | 100 | 99.95 |
| English | NT-Xent | 50/50 | 52,000 (100%) | 99.35 | 98.10 |
| English | NT-Xent | 50/50 | 26,000 (50%) | 99.15 | 98.37 |
| English | NT-Xent | 50/50 | 5200 (10%) | 98.90 | 98.45 |
| English | Triplet | 50/50 | 52,000 (100%) | 99.85 | 99.17 |
| English | Triplet | 50/50 | 26,000 (50%) | 98.68 | 98.40 |
| English | Triplet | 50/50 | 5200 (10%) | 99.33 | 99.20 |
| English | Supervised Contrastive | 50/50 | 52,000 (100%) | 99.91 | 99.73 |
| English | Supervised Contrastive | 50/50 | 26,000 (50%) | 99.53 | 99.07 |
| English | Supervised Contrastive | 50/50 | 5200 (10%) | 98.10 | 97.80 |
| Korean | Supervised | 500 | 35,000 | 100 | 100 |
| Korean | NT-Xent | 50/50 | 35,000 (100%) | 99.35 | 99.00 |
| Korean | NT-Xent | 50/50 | 17,500 (50%) | 99.15 | 98.13 |
| Korean | NT-Xent | 50/50 | 3500 (10%) | 98.90 | 98.05 |
| Korean | Triplet | 50/50 | 35,000 (100%) | 99.85 | 99.17 |
| Korean | Triplet | 50/50 | 17,500 (50%) | 99.53 | 99.07 |
| Korean | Triplet | 50/50 | 3500 (10%) | 99.10 | 98.95 |
| Korean | Supervised Contrastive | 50/50 | 35,000 (100%) | 100 | 99.76 |
| Korean | Supervised Contrastive | 50/50 | 17,500 (50%) | 99.68 | 99.40 |
| Korean | Supervised Contrastive | 50/50 | 3500 (10%) | 98.33 | 98.10 |
Table 2. Accuracy comparison of contrastive learning and fully supervised methods for few-shot fine-tuning for new styles.

| Fine-Tuning Method | N-Shots | Training Accuracy (%) | Test Accuracy (%) |
|---|---|---|---|
| Fully Supervised | 50/500 | 89.0 | 35.0 |
| Fully Supervised | 100/500 | 95.6 | 53.0 |
| Fully Supervised | 250/500 | 98.9 | 74.4 |
| Contrastive based | 50/500 | 93.0 | 82.0 |
| Contrastive based | 100/500 | 98.3 | 89.6 |
| Contrastive based | 250/500 | 99.1 | 96.3 |
