S2AC: Self-Supervised Attention Correlation Alignment Based on Mahalanobis Distance for Image Recognition

Wang, Zhi-Yong; Kang, Dae-Ki; Zhang, Cui-Ping

doi:10.3390/electronics12214419

Open AccessArticle

S²AC: Self-Supervised Attention Correlation Alignment Based on Mahalanobis Distance for Image Recognition

by

Zhi-Yong Wang

¹,

Dae-Ki Kang

^2,*

and

Cui-Ping Zhang

¹

Blockchain Laboratory of Agricultural Vegetables, Weifang University of Science and Technology, Weifang 262700, China

²

Department of Computer Engineering, Dongseo University, 47 Jurye-ro, Sasang-gu, Busan 47011, Republic of Korea

^*

Author to whom correspondence should be addressed.

Electronics 2023, 12(21), 4419; https://doi.org/10.3390/electronics12214419

Submission received: 9 September 2023 / Revised: 16 October 2023 / Accepted: 21 October 2023 / Published: 26 October 2023

(This article belongs to the Special Issue Artificial Intelligence for Robotics)

Download

Browse Figures

Versions Notes

Abstract

:

Susceptibility to domain changes for image classification hinders the application and development of deep neural networks. Domain adaptation (DA) makes use of domain-invariant characteristics to improve the performance of a model trained on labeled data from one domain (source domain) on an unlabeled domain (target) with a different data distribution. But existing DA methods simply use pretrained models (e.g., AlexNet, ResNet) for feature extraction, which are convolutional models that are trapped in localized features and fail to acquire long-distance dependencies. Furthermore, many approaches depend too much on pseudo-labels, which can impair adaptation efficiency and lead to unstable and inconsistent results. In this research, we present S²AC, a novel approach for unsupervised deep domain adaptation, that makes use of a stacked attention architecture as a feature map extractor. Our method can fuse domain discrepancy with minimizing a linear transformation of the second statistics (covariances) extended by the p-norm, while simultaneously designing pretext tasks on heuristics to improve the generality of the learning representation. In addition, we have developed a new trainable relative position embedding that not only reduces the model parameters but also enhances model accuracy and expedites the training process. To illustrate our method’s efficacy and controllability, we designed extensive experiments based on the Office31, Office_Caltech_10, and OfficeHome datasets. To the best of our knowledge, the proposed method is the first attempt at incorporating attention-based networks and self-supervised learning for image domain adaptation, and has shown promising results.

Keywords:

domain adaptation; CORAL; self-supervised learning

1. Introduction

Neural networks have transformed the area of machine learning, enabling the development of powerful models capable of learning complicated patterns and making accurate predictions. However, one of the major challenges in employing neural networks is their sensitivity to distribution shift of the data on which they are trained. Figure 1 shows that the data used to train a model may not be indicative of the data encountered by the model in the real-world. Domain adaptation seeks to address this issue by adapting the model to the target domain, allowing it to generalize more effectively and generate more accurate predictions.

Domain adaptation (DA) is a sub-field of transfer learning that aims to bridge the source and target domains by deriving domain-invariant features from labeled data in the source domain and unlabeled data in the target domain. There has been a surge of interest in employing neural networks for DA in recent years. Many novel algorithms, ranging from simple techniques to more complicated ones, have been presented. These approaches have yielded promising results in a variety of applications, such as computer vision, natural language processing, and speech recognition.

Basically, recent advances in DA are divided mainly into two different types of DA methods: shallow methods and deep methods. The shallow method can be applied to homogeneous and heterogeneous visual tasks, where visual features are extracted from the images. According to [1], features generated by deep models can be repurposed to new tasks and outperform the shallow DA methods even without any adaptation. Inspired by this phenomenon, a commonly accepted paradigm considers the deep model to extract vectorial features as representations for the input image, which can then be used within the shallow DA methods to mitigate the domain discrepancy.

In this work, we present an End2End deep domain adaptation model S²AC, which stands for self-supervised attention CORAL based on Mahalanobis distance. There are two main ideas proposed to improve the previous methods.

First, in our model, the attention architecture relies entirely on attention mechanisms. We stack self-attention blocks and linear layers to draw global dependencies, and insert position encoding between them.

Second, the model leverages recent advances in self-supervised learning to learn representations that can be used for solving other downstream tasks of interest, but requires no label data.

The experimental results illustrate that our method can achieve state-of-the-art performance comparable with previously carefully tuned baselines.

The rest of this paper is organized as follows: Section 3 surveys the theoretical background related to our model; Section 4 describes the proposed S²AC model and provides a theoretical exposition; Section 5 shows experimental results; Section 6 presents the conclusions.

2. Related Work

Shallow DA Methods: Intensive efforts have been made in domain adaptation. Borgwardt et al. [2] introduce the maximum mean discrepancy (MMD) for structured biological data to estimate the differences in distribution between two molecular graph datasets. H. Shimodaira [3] discusses that the inference based on the maximum likelihood estimate (MLE) can be improved by weighting the covariate in the log-likelihood function. In order to adapt the existing classifiers across various video domains, Yang et al. [4] propose adaptive support vector machines (A-SVMs), which adapt the original and the adapted classifier through minimizing an SVM-like objective function.

Deep DA Methods: Deep learning’s advancement has heralded a new era for domain adaptation technology. DDC [5], an acronym for deep domain confusion, discovers the domain confusion metric, which can be utilized to estimate the dimensionality and ideal position of the domain adaptation layer; therefore, a domain confusion loss is introduced to reduce discrepancy between the source and target domains. Conditional domain adversarial networks (CDANs) [6] enhance the classic adversarial domain adaptation by including a conditional branch that takes class label information into account during training. The conditional branch directs the domain classifier in learning domain-invariant characteristics unique to each class. They have been found to be useful in a variety of applications, including sentiment analysis and object identification. In [7], they focus on reconciling the centroid-hypothesis conflict between two loss functions within source-free domain adaptation: the cross-entropy loss inclines to the probability implied by pseudo-labels, but nevertheless, the entropy minimization loss is prone to the class implied by hypothesis. Song et al. [8] introduce stable diffusion and score distillation sampling into a Style-GAN2 to guide the generator towards a target domain under the text prompt. Instead of using the text prompt, d-SNE [9] learns domain-agnostic latent space by using the stochastic neighborhood embedding techniques and a modified Hausdorffian distance. To learn adaptive classifiers, the residual transfer network (RTN) [10] plugs a residual block into CNN architectures to bridge tightly the source classifier and target classifier. To enable feature adaptation, they embed the fused features of multiple layers into Hilbert spaces to match distributions.

Attention Mechanisms: Attention mechanisms, especially self-attention (also called intra-attention), have become an integral part in transformer-related model, and present impressive improvements in various tasks [11,12,13,14,15]. Attention mechanisms are usually used to reweight the features of fully connected or convolutional layers. In [15], they show that inserting an attention layer into AlexNet (a convolutional neural network) used as the backbone network of deep CORAL achieves improved performance. Vaswani et al. [11] introduce a transformer based entirely on attention mechanisms to compute representations of its input on the WMT 2014 English-to-German translation tasks. The attention vectors are invariant during the training process in terms of sequence ordering, otherwise addition of position information can be considered [16]. GANs is a great tool in image-to-image transduction [17,18] and text-to-image translation tasks. Zhang et al. [13] employ an intra-attention module to implement long-range interaction across image regions.

Self-Supervised Model: Recent work shows that self-supervised learning [19] methods can significantly improve the accuracy on image classification tasks for most deep learning cases. Models trained on predefined surrogate tasks learn useful image semantics using only unsupervised data that can be used for solving other downstream tasks of interest. S⁴L [20] (self-supervised semi-supervised learning), proposed by Zhai et al., reports two techniques: S⁴L-Rotation rotates input images into four degrees

\{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}

and predicts the rotation degree for each of them, and S⁴L-exemplar learns a visual invariant representation to image transformations, such as HSV-space color randomization, random mirroring, etc. A key issue for GAN training instability is catastrophic forgetting, which can be countered through introducing temporal memory to the training process. Noroozi and Favaro [21] propose that combining adversarial and self-supervised learning relieves the impact of discriminator forgetting.

3. Theoretical Background

3.1. Mahalanobis Distance-Based CORAL Loss ( $L_{M D - C O R A L}$ )

Mahalanobis distance (MD) is a measure that can be seen as a modification of Euclidean distance, with the problem of inconsistent and correlated dimensional scales corrected.

Given two vectors

\vec{X}

and

\vec{Y}

, the squared MD

D_{M}^{2}

is:

\begin{matrix} D_{M}^{2} & = {(\frac{x_{1} - y_{1}}{\sqrt{λ_{1}}})}^{2} + {(\frac{x_{2} - y_{2}}{\sqrt{λ_{2}}})}^{2} + \dots + {(\frac{x_{n} - y_{n}}{\sqrt{λ_{n}}})}^{2} \\ = (x_{1} - y_{1}, x_{2} - y_{2}, \dots, x_{n} - y_{n}) [\begin{matrix} \sqrt{λ_{1}} \\ \sqrt{λ_{2}} \\ \dots \\ \sqrt{λ_{n}} \end{matrix}] (\begin{matrix} x_{1} - y_{1} \\ x_{2} - y_{2} \\ \dots \\ x_{n} - y_{n} \end{matrix}) \\ = {(\vec{X} - \vec{Y})}^{^{'}} Σ^{- 1} (\vec{X} - \vec{Y}) \end{matrix}

(1)

where

Σ^{- 1}

stands for the inverse of the covariance matrix between

\vec{X}

and

\vec{Y}

. Correlation alignment (CORAL) [22] proves that aligning second-order statistics (covariances) across domains facilitates learning domain-invariant features that are less affected by domain shift. P-ADC [15] attempted to extend CORAL loss with the idea of p-norm to balance the offset caused by the classification loss, but this led to dilemmas of accuracy and computation time.

\begin{matrix} CORAL loss & L = min_{ω} L_{C l a s s} (D_{S}; ω) + α L_{C O R A L} \end{matrix}

(2)

\begin{matrix} P - norm loss & L = min_{ω} L_{C l a s s} (D_{S}; ω) + \sum_{p} λ_{p} L_{p} \end{matrix}

(3)

The Mahalanobis distance-based loss function (2-norm) is:

L_{M D - C O R A L}^{2 - n o r m} = \frac{1}{4 d^{2}} {(C_{D_{S}} - C_{D_{T_{i}}})}^{^{'}} Σ^{- 1} (C_{D_{S}} - C_{D_{T_{i}}})

(4)

According to [15], as p increases, p-norm loss consumes a large amount of computational resources, leading to a drastic decrease in computing speed. We designed a linear transform to alleviate the conflict between accuracy and computational complexity, and to compensate for the loss of accuracy due to the neglect of higher-order terms.

\begin{matrix} L_{M D - C O R A L} & = β_{0} + β_{1} L_{M D - C O R A L}^{1 - n o r m} + β_{2} L_{M D - C O R A L}^{2 - n o r m} + β_{3} L_{M D - C O R A L}^{3 - n o r m} \\ where \{\begin{matrix} β_{0} = 1 + C_{D_{S}}^{^{'}} C_{D_{T_{i}}} \\ β_{1} = C_{D_{S}}^{^{'}} C_{D_{T_{i}}} \\ β_{2} = \frac{\sum (C_{D_{S}} + C_{D_{T_{i}}})}{2 n} \\ β_{3} = \frac{1}{3} \end{matrix} \end{matrix}

(5)

3.2. Self-Attention

Attention almost entirely replaces recurrence by allowing for a focus on important regions and modeling long-distance dependencies within input or output images. In practice, the queries, keys, and values, packed together into matrices Q, K, and V, are computed as linearly transformed input images x with parameter matrices

W^{Q} \in R^{\frac{C}{8} \times C}, W^{K} \in R^{C \times \frac{C}{8}}, W^{V} \in R^{\frac{C}{8} \times C}

, respectively. The self-attention function is:

\begin{matrix} S e l f - A t t e n t i o n (Q, K, V) & = s o f t m a x (\frac{Q K^{^{'}}}{\sqrt{d_{k}}}) V \\ = s o f t m a x (\frac{(W^{Q} x) {(W^{K} x)}^{^{'}}}{\sqrt{d_{k}}}) (W^{V} x) \end{matrix}

(6)

where the input images

x = \{x_{1}, x_{2}, \dots, x_{C}\}, x_{i} \in R^{W \times H}

are converted to two matrices to compute the attention after being normalized by a softmax function.

d_{k}

denotes the variance of

Q K^{^{'}}

. Figure 2 shows an example of a 64-channel self-attention block.

3.3. Positional Embedding

Self-attention raises the permutation equivalence without positional information, which limits the expressiveness of the model for visual tasks [23]. Positional embeddings refer to injecting some positional information of the tokens into the features according to different rules. There are two different strategies based on embedded location information. The attention with absolute positional embeddings, which utilizes the absolute position of pixels in an image, is called absolute attention, such as sinusoidal embedding [11], which uses sine and cosine functions to encode each dimension to form a geometric progression whose wavelength is from

2 π

to

10, 000 \cdot 2 π

. The attention based on relative positional embeddings, called relative attention, which calculates the relative distance to each location

(x, y) \in N (i, j)

to receive two offsets for the horizontal and vertical coordinate axes, respectively, i.e.,

(x - i, y - j)

, results in improved performance [23]. The relative attention function is:

S e l f - A t t e n t i o n_{r e l a t i v e} (Q, K, V) = s o f t m a x (\frac{Q^{^{'}} K + Q^{^{'}} (r_{x - i}, r_{y - j})}{\sqrt{d_{k}}}) V

(7)

4. Methods

In this section, we present our S²AC approach for unsupervised domain adaptation (UDA) problem. Let a model

q_{ω} (y | x)

trained on a labeled source dataset

D_{s}

composed of source images

x_{s}

and corresponding labels

y_{s}

,

D_{s} = \{(x_{s}, y_{s}) | x_{s} \in R^{H \times W \times C}, y_{s} \in {\{0, 1\}}^{H \times W \times K}\}

, where

W, H, C

are the width, height, and channel of the source domain images, respectively; K denotes the number of classes, to be tested over multiple unlabeled target datasets composed of target domain images,

D_{T_{1}}, D_{T_{2}}, \dots, D_{T_{n}}

, where

D_{T_{i}} = {\{x_{j}^{T}\}}_{j = 1}^{N_{T_{i}}}

is the

i^{t h}

target dataset and

N_{T_{i}}

denotes the scale of the target domain

D_{T_{i}}

.

Our method assumes that

q_{ω} (y | x)

is trained in an unsupervised manner; meanwhile, it utilizes only the pretrained model

q_{ω} (y | x)

, which works well in the source domain, and an unlabeled target dataset

D_{T_{i}}

.

Our framework mainly comprises the following three components (see Figure 3):

Attention-based backbone network module;
Self-supervised learning module;
Relative positional embedding.

4.1. Attention-Based Backbone Network Module

An attention-based backbone network module maps features to the space where joint training with the classification loss, the Mahalanobis distance-based CORAL loss, and the self-supervised loss are applied. We consider the total loss function in this study to have the following form:

L = min_{ω} L_{C l a s s} (D_{S}; ω) + α L_{M D - C O R A L} (D_{S}, D_{T_{i}}; ω) + β L_{s e l f} (D_{S}; ω)

(8)

where

L_{C l a s s}

is a standard cross-entropy loss of all labeled data in the source domain.

L_{M D - C O R A L}

stands for a linear transformation of the second statistics (covariances) extended by the p-norm to align the distributions of the source and target features.

L_{s e l f}

is the self-supervised loss that encourages the model to learn useful information. The corresponding self-supervised task is to predict the index of the chosen permutation in this paper.

The backbone network is composed of a stack of N = 5 identical layers, followed by layer position encoding and two linear layers. Each layer has two sub-layers. The first is a self-attention mechanism and the second is max-pooling layer. We employ a residual connection inside the self-attention mechanism. That is, the output of each layer is

H_{i} = M a x (R e L U (x_{i - 1} + S e l f - A t t e n t i o n (Q, K, V)))

(see Figure 4), where

M a x (\cdot)

is the function implemented by the max-pooling layer itself. To leverage these residual connections, the dimensions of input and output must be equal for all self-attention modules; thus, we set

k e r n e l_s i z e = 1

,

s t r i d e = 1

,

p a d d i n g = 0

for query, key, and value matrices in convolution-based attention.

4.2. Self-Supervised Learning Module

A self-supervised auxiliary task (called Random Context Prediction) module that transforms both given source and target data samples along with two prominent self-supervised techniques was adopted in our framework: context prediction + random rotation. Following [19,21], context prediction, as a “pretext”, forces the model to understand scenes and objects. The main idea is to learn general-purpose representation by exploiting the relative spatial location as labels that come for “free” within images. Doersch et al. [19]’s sample randomly pairs of patches in a three-by-thre grid. In their approach, one tile is the pivot kept in middle of a three-by-three grid; the other can be any of the eight remaining available locations for each pair. In contrast, ref. [21] observes all nine tiles through associating with the Jigsaw puzzle reassembly problem. A set of jigsaw puzzle permutations is defined, as well as the corresponding index for each item. To learn image representation, the Jigsaw puzzle solver shuffles the input tiles base on the set of permutations to predict the relevant index. However, since the number of possible permutations of nine tiles, which is

9!

, is too numerous to be applied directly, a subset must be established, which is also time-consuming (see Section 5). In this study, we combine two strategies to create a novel, harder, and more efficient pretext task. Following [21], we generate the subset using “random” sampling rather than Hamming distance. To achieve end-to-end processing, it is also feasible to omit subset creation and just perform random selection in the original matrix. We crop a region from an image at random and divide it into nine patches, which are rearranged according to a randomly selected permutation. At the same time, each patch rotates at a “random” degree, which is generally picked from

\{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}

(see Figure 5).

The self-supervised loss is:

\begin{matrix} L_{s e l f} (x, m; ω, μ) & = \frac{1}{N_{M} N_{D_{S} \cup D_{T_{i}}}} \sum_{m \in M} \sum_{x \in D_{S} \cup D_{T_{i}}} L_{C E} (p (f_{ω} (x^{m}); μ), g (m)) \\ = - \frac{1}{N_{M} N_{D_{S} \cup D_{T_{i}}}} \sum_{m \in M} \sum_{x \in D_{S} \cup D_{T_{i}}} g (m) ln p (f_{ω} (x^{m}); μ) \end{matrix}

(9)

where M stands for the predefined permutation set.

x^{r}

is the input x reordered via a randomly chosen permutation from m and rotated via a randomly chosen degree from

\{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}

.

g (\cdot)

is the mapping of the jigsaw puzzle permutations to

\{1, 2, \dots, 9!\}

, e.g.,

(6, 3, 7, 2, 9, 1, 4, 8, 5) \to 53

; N denotes the number of candidates.

D_{T_{i}}

is the target domain.

f_{ω} (\cdot)

illustrates the model with parameters

ω

.

p (\cdot, μ)

denotes the joint probability as

p_{i, j} (x; μ) = \frac{e^{μ_{i, j}^{^{'}} x}}{\sum_{k, l} e^{μ_{k, l}^{^{'}} x}}

.

4.3. Relative Positional Embedding

It has been proven in [11,23] that employing positional embeddings results in more efficient and accurate models. However, the definition of early relative position embeddings only takes into account relative distance, i.e., row offset and column offset, which is irrelevant to the picture content. Unlike earlier literature, our strategy attaches the relative position to the model, converting it into a trainable type that can be tuned during parameter optimization. Using

(x_{i}, y_{j})

as the base point, the relative distance of and other point

(x_{j}, y_{j})

in the image to

(x_{i}, y_{j})

is

(x_{d}, y_{d})

(see Figure 6). The relative distance

(x_{d}, y_{d})

is associated with embeddings

(r_{x_{d}}, r_{y_{d}})

.

x_{i} = σ (p_{1}) (w_{2} - w_{1}) + w_{1}, 0 \leq w_{1} < w_{2} \leq W

(10)

y_{i} = σ (p_{2}) (h_{2} - h_{1}) + h_{1}, 0 \leq h_{1} < h_{2} \leq H

(11)

x_{d} = x_{j} - x_{i}

(12)

y_{d} = y_{j} - y_{i}

(13)

where

x_{i}

,

y_{i}

stand for the horizontal and vertical coordinates of the base point.

x_{j}

,

y_{j}

are the horizontal and vertical coordinates of any point on feature map.

p_{1}

,

p_{2}

represent trainable parameters that are optimized during model training.

σ (x) = \frac{1}{1 + e^{- x}}

, is the sigmoid function. For example, given a

5 \times 5

feature map implying

W = H = 5

and also

x_{i} = 2, y_{i} = 3

after backpropagation computed from Equations (10) and (11), the positional embedding matix is:

[\begin{matrix} (0, 0) & (0, 1) & (0, 2) & (0, 3) & (0, 4) \\ (1, 0) & (1, 1) & (1, 2) & (1, 3) & (1, 4) \\ (2, 0) & (2, 1) & (2, 2) & (2, 3) & (2, 4) \\ (3, 0) & (3, 1) & (3, 2) & (3, 3) & (3, 4) \\ (4, 0) & (4, 1) & (4, 2) & (4, 3) & (4, 4) \end{matrix}] \to_{h_{1} = w 1 = 2, h_{2} = w_{2} = 3}^{x_{i} = 2, y_{i} = 3} [\begin{matrix} (- 2, - 3) & (- 2, - 2) & (- 2, - 1) & (- 2, 0) & (- 2, 1) \\ (- 1, - 3) & (- 1, - 2) & (- 1, - 1) & (- 1, 0) & (- 1, 1) \\ (0, - 3) & (0, - 2) & (0, - 1) & (0, 0) & (0, 1) \\ (1, - 3) & (1, - 2) & (1, - 1) & (1, 0) & (1, 1) \\ (2, - 3) & (2, - 2) & (2, - 1) & (2, 0) & (2, 1) \end{matrix}]

.

5. Experimental Results

5.1. Setup

In this study, we perform experiments on a computer that has an Intel(R) Xeon(R) Silver 4210R CPU, 64-bit 2.4GHz (two processors), and an NVIDIA Tesla T4 GPU. We evaluate our model on three image datasets: Office31 [24], Office_Caltech_10 [1], OfficeHome [25].

Office31 is a classic collection of images of objects from 31 distinct categories commonly encountered in office environments, including headphones, bottles, speakers, etc. It has 4652 images in total, with an average of 150 images per category. Three domains in the dataset are Amazon (A), DSLR (D), and Webcam (W).

Office_Caltech_10 is divided into two parts: the office part and the Caltech part. The office part contains images of office scenes, while the Caltech part contains images of Caltech scenes. It contains ten overlapping categories shared by four domains: Amazon (A), Caltech (C), DSLR (D), and Webcam (W).

OfficeHome contains images from four different workplace and home contexts. It includes images of objects from four different domains: Art (A), Clipart (C), Product (P), and Real-World (R). The dataset includes 15,588 images from 65 different categories, with a total of 4652 images from each domain.

Our model takes images of

224 \times 224

pixels as input, and the weight decay and momentum are set to

5 \times 10^{- 4}

and

0.9

, respectively; the initial learning rate is

1 \times 10^{- 3}

as well. Because of GPU memory limitations, both our train_batch size and test_batch size are set to 8.

5.2. Selection of Criteria for Generating Permutation Subset

The original

9!

-item permutation set in [21] is too large to be directly used in the computation; thus, the jigsaw puzzle solver must employ an external process, which is a non-end-to-end process, to calculate a subset of the permutations by hamming distance. We suggest using uniform random sampling to create the permutation subset in real time, which will increase the model’s efficacy.

We designed this experiment to show the time difference between executing three criteria, i.e., Mean Hamming distance, Max Hamming distance, and Random sampling, to generate the set of permutations. We discover that as the number of permutations increases, computing the Hamming distance has a substantial impact on running time when compared with random sampling, while the improvement in test accuracy is not significant. The runs are shown in Table 1. Based on the experiment results, we conclude that the use of random sampling can significantly reduce the running time without significantly affecting the accuracy.

Figure 7 displays us three graphs created from shift

D \to W

to help us analyze S²AC. In Figure 7a, we show the classification loss, the Mahalanobis distance-based CORAL loss, the self-supervised learning loss, and the total loss for the training phrase. The classification loss is initially significantly bigger than both the Mahalanobis distance-based CORAL loss and the self-supervised learning loss, but as iterations progress, the classification loss becomes minimal at around

e p o c h s = 50

and progressively stabilizes and converges. The Mahalanobis distance-based CORAL loss exhibits a notable growth between iteration cycles 20 and 40, but beyond 50, it steadily declines and converges. In Figure 7b, we visualize the average loss for the test phrase in the source and target domains. The plot shows that the average loss is generally on the decline in both the source and target domains. The fluctuation of the training (source) and test (target) accuracy is shown in Figure 7c. Figure 7d shows the performance of Frobenius distance, Mahalanobis distance, earth mover’s distance (EMD), Chebyshev distance, and cosine distance matrices. The EMD fluctuations are excessively large, and the Frobenius distance-based loss function is weaker than our method in this scenario, while the Chebyshev distance-based loss and the cosine distance-based loss continue to rise in the latter phases, causing the total loss to be poorly converged (see Figure 7e–g).

5.3. Ablation Studies

We designed an ablation analysis of the influence of the attention-based backbone network, the self-supervised learning module, and positional embedding. Figure 8 shows the classification accuracy curves. For convenience, ‘-S’, and ‘-T’ denote training and test accuracy, respectively. And incorporation of only the AlexNet [26], instead of the attention-based backbone network, pretrained on ImageNet [27], is called S²AC-AlexNet-S/T pretrained. The corresponding settings are shown in Table 2. The table shows that if AlexNet was replaced by the proposed attention-based backbone network, the test accuracy would increase by about 3% points, mainly benefiting from learning long-range dependencies, which is a main challenge for convolutional neural networks. Furthermore, if the extractor is pretrained, it will converge quicker in both the training and testing phases. But nevertheless, the final accuracy in the training phase is 0.3% points lower for S²AC-AlexNet-S pretrained than S²AC-AlexNet-S, and the test accuracy is about 1.5% points lower for S²AC-AlexNet-S pretrained than S²AC-AlexNet-S. This demonstrates that pre-training has no significant effects on the model’s identification accuracy, but only on its convergence speed. The pre-training procedure cannot be used for S²AC since the brand new backbone network for the weights pre-trained on ImageNet does not exist.

When we compare test accuracy alone, S²AC-AlexNet-S pretrained fluctuates the most after convergence, while S²AC is more stable than S²AC-AlexNet-S. Nevertheless, S²AC has the highest test accuracy when compared with S²AC-AlexNet-S.

The validity of positional embedding has been demonstrated [23], and the purpose of this experiment is to see if positional embedding is more effective for the domain adaptation problem when employed in the source and target domains individually, or in both domains. For convenience, not performing the positional embedding operation in both source and target domains is called S²AC-∼S&∼T; adding location information in the source domain but not in the target domain is called S²AC-S&∼T; and vice versa is referred to as S²AC-∼S&T.

Figure 9 illustrates that positional embedding brings a significant rise of the accuracy, consequently resulting in an increase in F₁ score from 73.33% to 85.74% (see Table 3). This highlights the effectiveness of location information in domain adaptation tasks heavy with the ability to present a more detailed representation of this information. Nevertheless, adding positional information to each domain does not improve operational outcomes compared with no location information. In fact, superior recognition accuracy requires only positional embedding in the target domain.

Last but not least, we attempt to tweak the self-supervised learning module in order to compare and analyze its influence on the model’s performance. Similarly, we set up the self-supervised learning module in the source and target domains, respectively, and compare the test accuracy. As for the legends of tables and figures, S²AC-S&T represents the execution of self-supervised learning in both the source and target domains; S²AC-S&!T denotes performing only the self-supervised module in the source domain; and S²AC-!S&T is the exact opposite, executed only in the target domain.

As observed in Figure 10, the self-supervised learning module has made a noteworthy contribution to the enhancement of the model’s test accuracy. In terms of convergence speed and stability of the algorithm, the two models, S²AC-S&T and S²AC-!S, perform the best, with S²AC-S&T exhibiting a faster convergence rate than S²AC-!S. Although S²AC-!S initially displayed the highest test accuracy, S²AC-S&T tends to outperform the previous one towards the end of the iteration. The trend is further evidenced by the F₁ scores, where S²AC-S&T’s final score of 83.21% is about 5 points higher than S²AC-!S&T’s score of 78.67%; see Table 4.

5.4. The Network Visualization and Impacts of Training Dataset Size

Although understanding the mechanism of deep neural networks can be arduous, certain visual clues reveal that S²AC is capable of fusing two domains to achieve domain adaptation. In order to guarantee fair comparison, we trained both models for 160 epochs. Figure 11 shows feature maps extracted by the model. One can see that the model learned the contour information of the object compared with AlexNet during the training phase (see Figure 11a), while also exhibiting competitive feature extraction capacity during the test phase (see Figure 11b). The first figure of Figure 11a) shows that two regions are activated corresponding to the top and bottom of the object, respectively, in the left, while only the region corresponding to the bottom of the object in the right (extracted by AlexNet). For the test phase, the second figure of Figure 11b demonstrates that S²AC (left) activates the majority of the light-colored areas, and the dark-colored region activated weakly can basically distinguish the shape of the object, whereas the right (AlexNet) cannot represent the outline of the object due to the lack of activated regions.

We also investigated the influence of dataset size on domain adaptation. Five training datasets of varying sizes were created and used for evaluation. In addition to the whole dataset, 100, 200, 300, and 400 images were randomly chosen from the source domain to create a new dataset. The corresponding F₁ score curves are shown in Figure 12b. From the figure, we conclude that the model’s performance improves as the size of the training set increases. The F₁ curve shows a trend of fast forward and slow backward, indicating that the number of images is less than 350 and the F₁ score grows rapidly as the number increases. As the dataset size, 350, increases, the performance improvement slows down and tends to saturate. Table 5 depicts F₁, recall, precision score, and the micro AP. The corresponding P-R curves are shown in Figure 12a.

5.5. Discussion

In this experiment, we chose one of the domains at random as the source domain for training and another as the target domain. The labels of the source domain are inherently known, while the labels of all target domains remain unknown. To compare the effectiveness of our method, in addition to a variant of S²AC with AlexNet, we evaluated popular algorithms, such as deep domain confusion (DDC), conditional domain adversarial networks (CDANs), etc., on the Office31, Office_Caltech_10, and OfficeHome datasets.

As seen in Table 6, our method (S²AC) obtained higher average performance than the other baseline approaches on the Office31 dataset. S²AC had the greatest running accuracy in four out of the six shifts. CDAN(+E) received the highest scores in the other two shifts.

In Table 7, we evaluated our method on Office_Caltech_10 compared with other baseline algorithms. The Office_Caltech_10 dataset contains four domains: Caltech, DLSR, Amazon, and Webcam. Three of them, i.e., DLSR, Amazon, and Webcam, are from the Office31 Dataset. To avoid duplication of effort, Table 7 simply compares the shifts of Caltech with the other three domains (DLSR, Amazon, and Webcam); hence, Table 7 only includes six shifts (

A \to C

,

D \to C

, etc.). In three out of the six shifts, our algorithm performs best, and the gap between S²AC and the best baseline method in the other three shifts is extremely small (⩽2 approximately), which illustrates its effectiveness on this dataset.

The results on OfficeHome are reported in Table 8, in most transfer tasks, the S²AC-model beats other baseline methods, demonstrating high performance and potential for improvement. One can see that OfficeHome contains more categories and a larger volume than the other two datasets, which are visually more distinct from each other, and hence are more difficult to classify because of their substantially lower classification accuracy within the domain [28]. Since the prior work’s feature extraction is distance-agnostic, the extracted features are unsuitable for classification in the presence of a large number of categories.

In the tables, AttNet-5 represents a five-layer stack, and it is interesting to note that although the number of layers in the stack is less than ResNet-34, AlexNet, S²AC outperforms them in most cases. Since the number of layers of AttNet-5 is significantly smaller than ResNet-34, AlexNet, its convergence is faster and it is easier to train. In addition, considering the hardware environment, we did not include ResNet-50, ResNet-101 in the Table 6, Table 7 and Table 8; researchers who are able to do so might use it as a starting point for their study.

Table 6. Performance comparison for all six domain shifts on Office31. A: Amazon, D: DLSR, W: Webcam.

Method	Backbone	A→W	A→D	W→A	W→D	D→A	D→W
No Adaptation	AlexNet [29]	35.3 ± 0.5	22.3 ± 1.2	30.2 ± 0.5	37.4 ± 1.6	21.2 ± 0.6	39.5 ± 0.3
	ResNet-34 [30]	37.6 ± 0.7	25.5 ± 0.6	32.8 ± 0.4	39.4 ± 0.4	24.3 ± 0.9	39.3 ± 0.5
DDC [5]	Deep CNN [5]	43.3 ± 0.8	30.2 ± 1.4	35.6 ± 0.7	45.6 ± 1.3	25.3 ± 0.5	65.4 ± 0.4
CDAN [6]	AlexNet [29]	41.4 ± 1.9	36.2 ± 1.6	35.1 ± 1.5	75.2 ± 0.6	38.4 ± 1.5	65.6 ± 1.6
	ResNet-34 [30]	44.7 ± 0.4	37.4 ± 0.6	38.1 ± 0.7	76.7 ± 0.9	40.6 ± 1.6	66.3 ± 0.8
CDAN+E [6]	AlexNet [29]	45.5 ± 0.6	37.4 ± 1.6	38.2 ± 0.6	72.4 ± 1.5	35.7 ± 0.43	68.4 ± 1.5
	ResNet-34 [30]	47.3 ± 0.5	40.5 ± 0.6	39.7 ± 1.6	74.7 ± 0.7	40.1 ± 1.1	71.5 ± 0.7
Deep CORAL [31]	AlexNet [29]	50.5 ± 0.7	38.1 ± 1.5	37.9 ± 0.6	71.4 ± 0.6	38.6 ± 0.8	60.2 ± 0.7
P-ADC [15]	AlexNet [29] with an attention layer	51.4 ± 0.9	40.3 ± 0.7	39.5 ± 0.6	73.9 ± 0.5	40.1 ± 0.7	63.8 ± 0.5
S²AC-AlexNet	AlexNet [29]	56.8 ± 0.7	44.0 ± 1.2	39.8 ± 0.9	74.7 ± 1.7	40.9 ± 1.1	63.6 ± 0.7
S²AC	AttNet-5	56.8 ± 1.1	45.6 ± 1.4	40.2 ± 0.5	75.9 ± 0.7	41.5 ± 0.6	61.6 ± 1.6

Table 7. Performance comparison for six domain shifts on Office_Caltech_10. C: Caltech, D: DLSR, A: Amazon, W: Webcam.

Method	Backbone	$A \to C$	$D \to C$	$W \to C$	$C \to A$	$C \to D$	$C \to W$
No Adaptation	AlexNet [29]	42.0 ± 0.4	26.5 ± 1.5	35.0 ± 0.9	45.3 ± 0.9	50.5 ± 0.5	49.2 ± 0.6
	ResNet-34 [30]	45.8 ± 0.7	31.4 ± 0.6	46.2 ± 0.3	47.5 ± 0.4	46.5 ± 0.6	51.3 ± 1.1
DDC [5]	Deep CNN [5]	64.4 ± 0.8	54.4 ± 0.9	68.4 ± 1.2	74.5 ± 1.4	52.4 ± 0.4	64.8 ± 0.7
CDAN [6]	AlexNet [29]	62.5 ± 0.9	60.2 ± 1.9	71.4 ± 1.9	73.6 ± 0.6	53.4 ± 0.5	64.3 ± 1.7
	ResNet-34 [30]	65.2 ± 0.4	63.3 ± 0.4	69.6 ± 0.9	74.3 ± 1.3	58.4 ± 0.6	68.4 ± 0.6
CDAN+E [6]	AlexNet [29]	64.9 ± 1.5	63.4 ± 1.1	71.3 ± 0.2	75.6 ± 0.8	60.4 ± 0.9	70.5 ± 1.2
	ResNet-34 [30]	67.2 ± 0.5	67.8 ± 0.7	73.2 ± 0.7	80.0 ± 1.2	63.6 ± 0.5	69.1 ± 0.8
Deep CORAL [31]	AlexNet [29]	74.8 ± 0.9	66.6 ± 0.6	67.4 ± 0.3	81.9 ± 0.7	60.2 ± 0.9	70.5 ± 1.3
P-ADC [15]	AlexNet [29] with an attention layer	76.5 ± 0.2	68.5 ± 0.1	70.7 ± 0.7	84.6 ± 0.1	63.8 ± 0.3	72.4 ± 0.2
S²AC-AlexNet	AlexNet [29]	72.5 ± 0.1	61.5 ± 0.1	69.6 ± 0.3	85.2 ± 0.2	55.7 ± 0.3	74.4 ± 0.2
S²AC	AttNet-5	79.3 ± 0.4	67.4 ± 0.2	71.1 ± 0.2	83.9 ± 0.3	69.8 ± 0.3	75.7 ± 0.5

Table 8. Performance comparison for 12 domain shifts on OfficeHome. A: Art, C: Clipart, P: Product, R: Real-World.

Method	Backbone	$A \to C$	$A \to P$	$A \to R$	$C \to A$	$C \to P$	$C \to R$	$P \to A$	$P \to C$	$P \to R$	$R \to A$	$R \to C$	$R \to P$
No Adaptation	AlexNet [29]	35.6 ± 0.3	33.6 ± 1.3	40.0 ± 0.4	27.5 ± 0.5	33.1 ± 0.3	29.6 ± 0.2	37.4 ± 0.8	32.7 ± 1.0	28.4 ± 0.6	30.3 ± 0.7	41.5 ± 0.3	37.3 ± 0.4
	ResNet-34 [30]	34.9 ± 0.5	38.4 ± 0.6	45.7 ± 0.4	38.0 ± 1.0	35.8 ± 0.5	32.7 ± 0.6	39.1 ± 1.3	39.1 ± 0.5	34.6 ± 1.1	34.8 ± 0.7	49.6 ± 0.2	41.5 ± 0.9
DDC [5]	Deep CNN [5]	41.0 ± 1.4	54.5 ± 0.6	63.4 ± 0.6	53.4 ± 0.6	66.9 ± 0.4	45.8 ± 1.0	49.5 ± 0.5	55.9 ± 0.7	39.4 ± 0.3	48.1 ± 1.6	71.5 ± 0.8	66.4 ± 0.6
CDAN [6]	AlexNet [29]	51.9 ± 1.7	53.7 ± 0.9	65.1 ± 2.1	71.4 ± 1.1	63.3 ± 0.6	51.4 ± 1.7	51.8 ± 0.7	58.5 ± 1.4	47.8 ± 0.4	49.7 ± 2.1	73.1 ± 0.7	67.1 ± 0.1
	ResNet-34 [30]	52.4 ± 0.5	55.3 ± 0.3	65.7 ± 1.0	69.3 ± 0.6	64.1 ± 0.9	57.2 ± 0.6	54.9 ± 0.7	59.1 ± 0.9	49.2 ± 0.8	50.1 ± 1.4	75.6 ± 0.9	68.5 ± 0.7
CDAN+E [6]	AlexNet [29]	55.3 ± 0.6	60.4 ± 1.5	67.1 ± 1.6	75.3 ± 1.0	65.2 ± 0.8	55.7 ± 1.1	61.7 ± 0.7	61.4 ± 2.0	49.9 ± 0.9	51.5 ± 0.7	75.8 ± 0.8	69.5 ± 0.7
	ResNet-34 [30]	60.2 ± 0.6	58.1 ± 0.9	66.9 ± 0.7	76.1 ± 0.6	66.9 ± 0.6	59.5 ± 0.6	63.1 ± 0.6	62.7 ± 0.9	50.6 ± 0.6	53.5 ± 0.6	74.0 ± 0.6	69.9 ± 0.7
Deep CORAL [31]	AlexNet [29]	54.6 ± 0.8	62.4 ± 1.3	67.1 ± 0.5	76.7 ± 0.5	67.5 ± 0.3	60.6 ± 0.7	60.5 ± 0.8	62.6 ± 0.6	45.7 ± 0.5	54.4 ± 1.4	73.1 ± 0.5	69.6 ± 0.5
P-ADC [15]	AlexNet [29] with an attention layer	55.2 ± 0.5	62.1 ± 0.8	65.6 ± 1.2	75.0 ± 0.8	68.1 ± 0.2	61.1 ± 0.4	62.7 ± 1.4	61.9 ± 0.7	47.1 ± 0.4	56.1 ± 0.9	72.2 ± 1.3	70.2 ± 0.9
S²AC-AlexNet	AlexNet [29]	56.3 ± 1.7	61.3 ± 0.7	68.1 ± 0.5	76.7 ± 0.6	68.0 ± 0.9	65.3 ± 0.5	61.3 ± 1.1	62.6 ± 0.3	50.2 ± 0.6	56.5 ± 0.3	73.7 ± 0.7	70.3 ± 1.3
S²AC	AttNet-5	56.9 ± 0.8	62.6 ± 0.6	68.1 ± 0.6	77.6 ± 1.0	68.5 ± 0.6	64.5 ± 0.7	62.1 ± 0.4	63.6 ± 1.1	53.2 ± 0.5	57.6 ± 0.6	76.4 ± 1.2	71.8 ± 0.9

5.6. Computational Complexity

We calculated the average training time over all the cross-domain scenarios on the Office31 dataset to compare the time complexity of our approach to other baseline algorithms, as shown in Table 9. When comparing to other algorithms, Deep CORAL has the lowest computational complexity. Overall, the adversarial-based approach is more computationally difficult than the distance-based approach. Both S²AC and P-ADC are located in the 12-s average computing time echelon, just behind Deep CORAL.

6. Conclusions and Future Work

Our study suggests a self-supervised attention CORAL based on a Mahalanobis distance approach for domain adaptation, based on the attention mechanism. This technique might eliminate certain influences from data insufficiency, enhance local extraction reliance, and increase the test accuracy of base algorithms, such as deep CORAL. Several ideas are applied to attain these improvements. To begin with, the attention architecture facilitates attribute extraction, which aids in the learning to long-distance dependencies and allows for enhanced attribute reuse. The second is that the incorporation of self-supervised learning enables S²AC to acquire high-quality representations even with limited labeled data, which can be utilized to learn new tasks that lack sufficient data. Integating these modules, S²AC constructs an adaptation model that is completely independent of pretrained models for feature extraction, with a global receptive field and that are pseudo-label-free.

Numerous experimental results show that it is effective and exceeds other compared algorithms in terms of test accuracy and time efficiency.

We hope that our work stimulates researchers in other fields to consider using this approach to broaden their research horizons, as well as domain adaptation researchers to be inspired by the great number of innovative methods that have lately been proposed.

In the future, category information will be considered to improve the domain alignment performance, particularly the utilization of object and background information in the same category in different domains. Moreover, in the absence of knowledge about the target domain, information about multiple source domains will be studied and generalized to unseen domains in diverse environments.

Author Contributions

Conceptualization, Z.-Y.W. and C.-P.Z.; methodology, Z.-Y.W.; software, Z.-Y.W. and C.-P.Z.; validation, D.-K.K.; formal analysis, Z.-Y.W. and D.-K.K.; investigation, Z.-Y.W. and C.-P.Z.; resources, D.-K.K.; data curation, C.-P.Z.; writing—original draft preparation, Z.-Y.W. and C.-P.Z.; writing—review and editing, D.-K.K.; visualization, Z.-Y.W. and C.-P.Z.; supervision, D.-K.K.; project administration, D.-K.K.; funding acquisition, Z.-Y.W. and D.-K.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (NRF-2022R1A2C2012243). First author was supported in part by Weifang University of Science and Technology Doctoral Fund Project under grant (2021KJBS14).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here (accessed on 9 September 2023): [https://faculty.cc.gatech.edu/~judy/domainadapt/] and [https://www.hemanthdv.org/officeHomeDataset.html].

Acknowledgments

The authors wish to thank members of the Dongseo University Machine Learning/Deep Learning Research Lab., and anonymous referees for their helpful comments on earlier drafts of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

Csurka, G. A comprehensive survey on domain adaptation for visual applications. In Domain Adaptation in Computer Vision Applications; Springer: Cham, Switzerland, 2017; pp. 1–35. [Google Scholar]
Borgwardt, K.M.; Gretton, A.; Rasch, M.J.; Kriegel, H.P.; Schölkopf, B.; Smola, A.J. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 2006, 22, e49–e57. [Google Scholar] [CrossRef] [PubMed]
Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Stat. Plan. Inference 2000, 90, 227–244. [Google Scholar] [CrossRef]
Yang, J.; Yan, R.; Hauptmann, A.G. Cross-domain video concept detection using adaptive svms. In Proceedings of the 15th ACM international conference on Multimedia, Augsburg, Germany, 24–29 September 2007; pp. 188–197. [Google Scholar]
Tzeng, E.; Hoffman, J.; Zhang, N.; Saenko, K.; Darrell, T. Deep domain confusion: Maximizing for domain invariance. arXiv 2014, arXiv:1412.3474. [Google Scholar]
Long, M.; Cao, Z.; Wang, J.; Jordan, M.I. Conditional adversarial domain adaptation. Adv. Neural Inf. Process. Syst. 2018, 31, 1640–1650. [Google Scholar]
Diamant, I.; Jennings, R.H.; Dror, O.; Habi, H.V.; Netzer, A. Reconciling a Centroid-Hypothesis Conflict in Source-Free Domain Adaptation. arXiv 2022, arXiv:2212.03795. [Google Scholar]
Song, K.; Han, L.; Liu, B.; Metaxas, D.; Elgammal, A. Diffusion Guided Domain Adaptation of Image Generators. arXiv 2022, arXiv:2212.04473. [Google Scholar]
Xu, X.; Zhou, X.; Venkatesan, R.; Swaminathan, G.; Majumder, O. d-sne: Domain adaptation using stochastic neighborhood embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 2497–2506. [Google Scholar]
Long, M.; Zhu, H.; Wang, J.; Jordan, M.I. Unsupervised domain adaptation with residual transfer networks. Adv. Neural Inf. Process. Syst. 2016, 29, 136–144. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
Pedro, R.; Oliveira, A.L. Assessing the impact of attention and self-attention mechanisms on the classification of skin lesions. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–8. [Google Scholar]
Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-attention generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 7354–7363. [Google Scholar]
Trockman, A.; Kolter, J.Z. Patches are all you need? arXiv 2022, arXiv:2201.09792. [Google Scholar]
Wang, Z.Y.; Kang, D.K. P-norm attention deep coral: Extending correlation alignment using attention and the p-norm loss function. Appl. Sci. 2021, 11, 5267. [Google Scholar] [CrossRef]
Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-attention with relative position representations. arXiv 2018, arXiv:1803.02155. [Google Scholar]
Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
Park, T.; Liu, M.Y.; Wang, T.C.; Zhu, J.Y. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2337–2346. [Google Scholar]
Doersch, C.; Gupta, A.; Efros, A.A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1422–1430. [Google Scholar]
Zhai, X.; Oliver, A.; Kolesnikov, A.; Beyer, L. S4l: Self-supervised semi-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1476–1485. [Google Scholar]
Noroozi, M.; Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 69–84. [Google Scholar]
Sun, B.; Feng, J.; Saenko, K. Return of frustratingly easy domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
Ramachandran, P.; Parmar, N.; Vaswani, A.; Bello, I.; Levskaya, A.; Shlens, J. Stand-Alone Self-Attention in Vision Models. arXiv 2019, arXiv:1906.05909. [Google Scholar]
Saenko, K.; Kulis, B.; Fritz, M.; Darrell, T. Adapting visual category models to new domains. In Proceedings of the Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, 5–11 September 2010; Proceedings, Part IV 11. Springer: Berlin/Heidelberg, Germany, 2010; pp. 213–226. [Google Scholar]
Venkateswara, H.; Eusebio, J.; Chakraborty, S.; Panchanathan, S. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5018–5027. [Google Scholar]
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems; Pereira, F., Burges, C., Bottou, L., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2012; Volume 25. [Google Scholar]
Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255. [Google Scholar]
Tzeng, E.; Hoffman, J.; Darrell, T.; Saenko, K. Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4068–4076. [Google Scholar]
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
Sun, B.; Saenko, K. Deep coral: Correlation alignment for deep domain adaptation. In Proceedings of the Computer Vision–ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10, 15–16 October 2016; Proceedings, Part III 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 443–450. [Google Scholar]
Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; Lempitsky, V. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 2016, 17, 2030–2096. [Google Scholar]

Figure 1. Added Training data differ significantly from the data in the real-world scenario. Source domain (upper) is pre-processed studio images that are very different from the real-world scenario (lower), which is called domain shift.

Figure 2. Illustration of a 64-channel self-attention block. Given the original image (1st column, three channels), we compute the input x, which is 64 feature maps (2nd column) by convolutional layer with parameters

i n_c h a n n e l s = 3

,

o u t_c h a n n e l s = 64

. Out attention is just a dot-product between queries and keys, i.e.,

\frac{Q K^{^{'}}}{\sqrt{d_{k}}}

, scaled by its standard deviation.

Figure 2. Illustration of a 64-channel self-attention block. Given the original image (1st column, three channels), we compute the input x, which is 64 feature maps (2nd column) by convolutional layer with parameters

i n_c h a n n e l s = 3

,

o u t_c h a n n e l s = 64

. Out attention is just a dot-product between queries and keys, i.e.,

\frac{Q K^{^{'}}}{\sqrt{d_{k}}}

, scaled by its standard deviation.

Figure 3. An overview of the proposed model.

Figure 4. An overview of feature extraction module.

H_{i}

is the operation Attention-ReLU-MaxPool2d. ⨁ denotes the summation of the corresponding elements. ⨂ indicates the dot product operation.

Figure 4. An overview of feature extraction module.

H_{i}

is the operation Attention-ReLU-MaxPool2d. ⨁ denotes the summation of the corresponding elements. ⨂ indicates the dot product operation.

Figure 5. The algorithm is given nine rearranged and rotated tiles, and then predicts the index in the predefined permutation set.

Figure 6. Positional embedding. The middle rectangle indicates the range of

(x_{i}, y_{i})

, i.e.,

w_{1} \leq x_{i} \leq w_{2}

,

h_{1} \leq y_{i} \leq h_{2}

. The outer rectangle is the feature map, which is also the range of

(x_{j}, y_{j})

, the width and height are W and H, respectively.

(\cdot)

represents the pixel coordinate, the origin is O, and

[\cdot]

is the relative position with respect to O’.

Figure 6. Positional embedding. The middle rectangle indicates the range of

(x_{i}, y_{i})

, i.e.,

w_{1} \leq x_{i} \leq w_{2}

,

h_{1} \leq y_{i} \leq h_{2}

. The outer rectangle is the feature map, which is also the range of

(x_{j}, y_{j})

, the width and height are W and H, respectively.

(\cdot)

represents the pixel coordinate, the origin is O, and

[\cdot]

is the relative position with respect to O’.

Figure 7. Detailed analysis of shift

D \to W

for training S²AC. (a) The curves of the loss function (our method). (b) Testing the average loss of S²AC in the source and target domains. (c) Training and test accuracies of S²AC. As the number of iterations grows, we can observe that the S²AC’s training and test accuracy improve dramatically. (d) Comparison of difference distance matrices in this scenario. (e) The curves of the loss function under EMD. (f) The curves of the loss function under Chebyshev distance. (g) The curve of the loss function under cosine distance.

Figure 7. Detailed analysis of shift

D \to W

for training S²AC. (a) The curves of the loss function (our method). (b) Testing the average loss of S²AC in the source and target domains. (c) Training and test accuracies of S²AC. As the number of iterations grows, we can observe that the S²AC’s training and test accuracy improve dramatically. (d) Comparison of difference distance matrices in this scenario. (e) The curves of the loss function under EMD. (f) The curves of the loss function under Chebyshev distance. (g) The curve of the loss function under cosine distance.

Figure 8. Prediction accuracy curves of S²AC substituting the attention-based backbone network with AlexNet for ablation study based on Office31 dataset (

D \to W

). Due to the considerable volatility, the data have been smoothed by

w i n d o w = 3

.

Figure 8. Prediction accuracy curves of S²AC substituting the attention-based backbone network with AlexNet for ablation study based on Office31 dataset (

D \to W

). Due to the considerable volatility, the data have been smoothed by

w i n d o w = 3

.

Figure 9. Prediction accuracy curves of positional embedding for S²AC ablation study based on Office31 dataset (

D \to W

). Due of the considerable volatility, the data have been smoothed by

w i n d o w = 3

.

Figure 9. Prediction accuracy curves of positional embedding for S²AC ablation study based on Office31 dataset (

D \to W

). Due of the considerable volatility, the data have been smoothed by

w i n d o w = 3

.

Figure 10. Prediction accuracy curves of the self-supervised learning module for S²AC ablation study based on Office31 dataset (

D \to W

). Due to the considerable volatility, the data have been smoothed by

w i n d o w = 3

.

Figure 10. Prediction accuracy curves of the self-supervised learning module for S²AC ablation study based on Office31 dataset (

D \to W

). Due to the considerable volatility, the data have been smoothed by

w i n d o w = 3

.

Figure 11. Comparison of the source and target domain feature maps (sum of 256 channels) extracted by S²AC (left) and AlexNet (right) based on Office31. Yellow represents high values, whereas dark blue represents low levels.

Figure 12. (a) P-R curves and (b) the F₁ scores of the models trained with various dataset sizes.

Table 1. The running time and test accuracy analysis on three criteria: mean Hamming distance, max Hamming distance, and random sampling, to generate the permutation set. # means “the number of”.

Σ

is “the sum of”.

M e a n

stands of the mean Hamming distance.

M a x

indicates the max Hamming distance.

R a n d o m

is the random sampling.

Table 1. The running time and test accuracy analysis on three criteria: mean Hamming distance, max Hamming distance, and random sampling, to generate the permutation set. # means “the number of”.

Σ

is “the sum of”.

M e a n

stands of the mean Hamming distance.

M a x

indicates the max Hamming distance.

R a n d o m

is the random sampling.

# Permutations	Mean Hamming Distance			Max Hamming Distance			Random Sampling
# Permutations	$Σ$ Mean	Running Time (s)	Test Accuracy (%)	$Σ$ Max	Running Time (s)	Test Accuracy (%)	Running Time (s)	Test Accuracy (%)
6	5.333	0.411	45.6	6.000	0.366	45.4	0.056	45.5
7	6.222	0.583	45.4	7.000	0.482	45.6	0.071	45.6
15	13.333	1.740	45.8	14.400	1.449	45.7	0.165	45.6
25	22.222	3.794	46.1	23.525	3.349	46.1	0.195	45.8
35	32.111	6.895	47.4	32.571	6.034	46.6	0.282	46.5
55	48.889	15.412	48.9	50.527	14.289	47.2	0.409	47.6
73	64.889	26.863	49.32	66.651	24.137	48.8	0.597	48.6
85	75.556	35.069	49.9	77.381	32.536	49.4	0.688	48.9
100	88.889	47.602	50.1	90.791	42.939	49.5	0.718	49.4

Table 2. Ablation study of the attention-based backbone network (

D \to W

).

Table 2. Ablation study of the attention-based backbone network (

D \to W

).

Methods	Attention-Based Backbone Network	AlexNet	Pretrained	Accuracy(%)
S²AC	✓			66.5 ± 0.6
S²AC-AlexNet		✓		63.6 ± 0.7
S²AC-AlexNet pretrained		✓	✓	61.3 ± 0.5

Table 3. Ablation study of positional embeddings. "✓" denotes the execution of self-supervised learning in the corresponding domain. (

D \to W

).

Table 3. Ablation study of positional embeddings. "✓" denotes the execution of self-supervised learning in the corresponding domain. (

D \to W

).

Methods	Source Domain	Target Domain	Accuracy(%)	F1(%)
S²AC-S&T	✓	✓	43.2 ± 0.7	65.00
S²AC-∼S&∼T			45.5 ± 0.9	73.33
S²AC-S&∼T	✓		53.4 ± 1.6	84.23
S²AC-∼S&T		✓	61.6 ± 1.7	85.74

Table 4. Ablation study of the self-supervised learning module. (

D \to W

).

Table 4. Ablation study of the self-supervised learning module. (

D \to W

).

Methods	F1(%)
S²AC-S& !T	68.57
S²AC-S&T	83.21
S²AC-!S&T	78.67
S²AC-!S& !T	54.33

Table 5. The test performance with different size of datasets on Office31 (

D \to W

).

Table 5. The test performance with different size of datasets on Office31 (

D \to W

).

Dataset Size	Recall (%)	Precision (%)	F1 (%)	AP (%)
100	44.31	46.13	43.56	40.64
200	51.85	52.65	51.53	49.89
300	62.42	65.53	63.04	64.61
400	75.38	76.93	76.65	77.92
500	81.74	80.19	81.91	83.72

Table 9. The average training time of each approach on the Office31 dataset with epoch = 1, batch_size = 128 (Seconds).

Method	DDC [5]	Deep CORAL [31]	DANN [32]	CDAN [6]	CDAN + E [6]	P-ADC [15]	S²AC
Computational Time	34.8	11.9	17.6	18.01	18.06	12.3	12.7

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, Z.-Y.; Kang, D.-K.; Zhang, C.-P. S²AC: Self-Supervised Attention Correlation Alignment Based on Mahalanobis Distance for Image Recognition. Electronics 2023, 12, 4419. https://doi.org/10.3390/electronics12214419

AMA Style

Wang Z-Y, Kang D-K, Zhang C-P. S²AC: Self-Supervised Attention Correlation Alignment Based on Mahalanobis Distance for Image Recognition. Electronics. 2023; 12(21):4419. https://doi.org/10.3390/electronics12214419

Chicago/Turabian Style

Wang, Zhi-Yong, Dae-Ki Kang, and Cui-Ping Zhang. 2023. "S²AC: Self-Supervised Attention Correlation Alignment Based on Mahalanobis Distance for Image Recognition" Electronics 12, no. 21: 4419. https://doi.org/10.3390/electronics12214419

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

S²AC: Self-Supervised Attention Correlation Alignment Based on Mahalanobis Distance for Image Recognition

Abstract

1. Introduction

2. Related Work

3. Theoretical Background

3.1. Mahalanobis Distance-Based CORAL Loss ( $L_{M D - C O R A L}$ )

3.2. Self-Attention

3.3. Positional Embedding

4. Methods

4.1. Attention-Based Backbone Network Module

4.2. Self-Supervised Learning Module

4.3. Relative Positional Embedding

5. Experimental Results

5.1. Setup

5.2. Selection of Criteria for Generating Permutation Subset

5.3. Ablation Studies

5.4. The Network Visualization and Impacts of Training Dataset Size

5.5. Discussion

5.6. Computational Complexity

6. Conclusions and Future Work

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

S2AC: Self-Supervised Attention Correlation Alignment Based on Mahalanobis Distance for Image Recognition

Abstract

1. Introduction

2. Related Work

3. Theoretical Background

3.1. Mahalanobis Distance-Based CORAL Loss ( L M D − C O R A L )

3.2. Self-Attention

3.3. Positional Embedding

4. Methods

4.1. Attention-Based Backbone Network Module

4.2. Self-Supervised Learning Module

4.3. Relative Positional Embedding

5. Experimental Results

5.1. Setup

5.2. Selection of Criteria for Generating Permutation Subset

5.3. Ablation Studies

5.4. The Network Visualization and Impacts of Training Dataset Size

5.5. Discussion

5.6. Computational Complexity

6. Conclusions and Future Work

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

S²AC: Self-Supervised Attention Correlation Alignment Based on Mahalanobis Distance for Image Recognition

3.1. Mahalanobis Distance-Based CORAL Loss ( $L_{M D - C O R A L}$ )